The increasing ease with which anyone can create convincing audio in another person’s voice has many people nervous, and rightly so. A watermark proposed for AI-generated speech might not fix the problem in one go, but it’s a step in the right direction.
AI-generated speech is used for all sorts of legitimate purposes, from screen readers to replacing voice actors (with their permission, of course). But as with almost any technology, speech generation can also be used for malicious purposes, such as producing fake quotes from politicians or celebrities. It is highly desirable to find a way to tell the real from the fake that does not depend on a publicist or careful listening.
Watermarking is a technique by which an image or sound is imprinted with an identifiable pattern that shows its origin. We’ve all seen obvious watermarks like a logo on an image, but not all of them are as noticeable.
In images, a hidden watermark can bury the pattern at the pixel level, leaving the image unchanged to the human eye but identifiable to a computer. The same goes for audio: an occasional quiet sound encoding information might not be something a casual listener would hear.
The problem with these subtle watermarks is that they tend to be wiped out by even minor modifications to the media. Resize the image? There goes your pixel-perfect code. Encode the audio for streaming? The secret tones are compressed until they disappear.
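To make the fragility concrete, here is a minimal Python sketch of one common kind of hidden watermark: storing bits in the least significant bit of each pixel. This is an illustration under stated assumptions, not a method described by anyone in this article; the pixel values, the payload, and the helper names (`embed_lsb`, `extract_lsb`, `downscale_by_two`) are all made up for the example.

```python
def embed_lsb(pixels, bits):
    """Return a copy of pixels with one payload bit stored in the lowest bit of each pixel."""
    if len(bits) > len(pixels):
        raise ValueError("payload is longer than the image")
    marked = list(pixels)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the payload bit.
        marked[i] = (marked[i] & ~1) | bit
    return marked


def extract_lsb(pixels, n_bits):
    """Read the lowest bit of the first n_bits pixels."""
    return [p & 1 for p in pixels[:n_bits]]


def downscale_by_two(pixels):
    """Crude resize: halve the width by averaging each pair of neighboring pixels."""
    return [(pixels[i] + pixels[i + 1]) // 2 for i in range(0, len(pixels) - 1, 2)]


# One row of 8-bit grayscale values and an 8-bit payload (both hypothetical).
pixels = [200, 13, 77, 154, 90, 31, 248, 66]
payload = [1, 0, 1, 1, 0, 0, 1, 0]

marked = embed_lsb(pixels, payload)
max_change = max(abs(a - b) for a, b in zip(pixels, marked))
print(max_change)                                    # 1: no pixel moves by more than one level
print(extract_lsb(marked, len(payload)) == payload)  # True: the payload reads back intact

resized = downscale_by_two(marked)
# After resizing, the low bits no longer carry the payload.
print(extract_lsb(resized, len(resized)) == payload[:len(resized)])  # False
```

A change of one brightness level per pixel is invisible, but a single averaging step is enough to scramble the low bits, which is the weakness described above.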
Resemble AI is among a new cohort of generative AI startups that aim to use finely tuned voice models to produce voice-overs, audiobooks, and other media ordinarily produced by human voice talent. But if such models, perhaps trained on hours of audio provided by actors, fall into malicious hands, these companies could find themselves at the center of a public relations disaster and perhaps serious liability. So they are very interested in finding a way to make their recordings both as realistic as possible and easily verifiable as AI-generated.
PerTh is Resemble’s proposed watermarking process for this purpose, an awkward combination of “perceptual” and “threshold.”
“We have developed an additional layer of security that uses machine learning models to embed packets of data into the voice content we generate and retrieve that data at a later time,” the company writes in a blog post explaining the technology. “Because the data is imperceptible, yet tightly coupled to speech information, it is difficult to remove and provides a way to verify whether Resemble generated a given clip. Importantly, this ‘watermarking’ technique also tolerates various audio manipulations like speeding up, slowing down, converting to compressed formats like MP3, etc.”
It is based on a quirk of how humans process audio, whereby loud, clearly audible tones essentially “mask” nearby tones of lower amplitude. So if someone laughs and it produces peaks at 5,000 Hz, 8,000 Hz and 9,200 Hz, you can slip in structured tones that occur simultaneously within a few hertz of those peaks, and they will be more or less imperceptible to listeners. And if you do it right, they will also be resistant to removal, since they sit very close to a significant part of the audio.
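The masking idea can be sketched in a few lines of Python. This is a toy under loud assumptions, not Resemble’s method (which, per the company, relies on machine learning models): the “laugh” is just three pure tones at the frequencies mentioned above, the 3 Hz offset and the marker level of 40 dB below the peaks are arbitrary choices, and every name here (`embed`, `detect`, `amplitude_at`) is hypothetical. It also does not model human hearing; it only shows that very quiet tones next to loud peaks can still be read back by a computer.

```python
import math

SAMPLE_RATE = 32000            # samples per second (assumed)
N = SAMPLE_RATE                # one second of audio, so integer frequencies fit whole cycles
PEAKS_HZ = [5000, 8000, 9200]  # loud peaks from the laugh example
OFFSET_HZ = 3                  # "within a few hertz" (assumed value)
MARK_AMPLITUDE = 0.01          # 40 dB below the unit-amplitude peaks (assumed value)


def tone(freq_hz, amplitude):
    """One second of a sine wave at the given frequency."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE) for n in range(N)]


def mix(*signals):
    """Sum several equal-length signals sample by sample."""
    return [sum(samples) for samples in zip(*signals)]


def amplitude_at(signal, freq_hz):
    """Estimate the amplitude at one frequency by correlating with cosine and sine."""
    re = sum(s * math.cos(2 * math.pi * freq_hz * n / SAMPLE_RATE) for n, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE) for n, s in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)


def embed(signal, bits):
    """Add a quiet tone just above each peak whose payload bit is 1."""
    marks = [tone(f + OFFSET_HZ, MARK_AMPLITUDE) for f, bit in zip(PEAKS_HZ, bits) if bit]
    return mix(signal, *marks)


def detect(signal):
    """Report a 1 for each peak that has a quiet tone sitting just above it."""
    return [1 if amplitude_at(signal, f + OFFSET_HZ) > MARK_AMPLITUDE / 2 else 0
            for f in PEAKS_HZ]


laugh = mix(*(tone(f, 1.0) for f in PEAKS_HZ))  # stand-in for real speech
bits = [1, 0, 1]
marked = embed(laugh, bits)
recovered = detect(marked)
print(recovered)  # [1, 0, 1]
```

The marker tones are a hundredth the amplitude of the peaks they sit beside, yet a narrow frequency analysis picks them out cleanly. A real system would also have to find suitable peaks in actual speech and survive compression and speed changes, which this sketch does not attempt.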
Here comes the diagram:
Diagram showing how minor tones are “masked” by nearby peaks.
The idea is intuitive, but the challenge was surely in creating a machine learning model that can locate candidate sections of the waveform and automatically produce appropriate but inaudible tones that carry the identifying information. Then the process has to be reversed to read the data back, while staying robust to common audio manipulations like the ones mentioned above.
Here are two examples they provided. See if you can figure out which one has a watermark.
I can’t tell the difference, and even inspecting the waveforms very closely I couldn’t find any obvious anomalies. I’m not handy enough with a spectrum analyzer these days to really get in there, but I suspect that’s where you might see something. In any case, if their claim holds up that data indicating generation by Resemble is more or less irreversibly encoded in one of these clips, I’d say it’s a success.
PerTh will soon roll out to all Resemble customers, and to be clear, for now it can only mark and detect speech generated by the company itself. But if they have done it, others likely will too, and these watermarking engines are likely to soon be inextricably linked to the speech generation models themselves. Malicious actors will always find a way around this sort of thing, but putting up barriers should help curb some of that behavior.
However, audio is special in this regard, and similar tricks won’t work for text or images. So expect to stay in the uncanny valley for a while in those domains.