Google gives people back the voice they never had

Aug 25, 2024

via Fadi Biadsy

Voice cloning technology has made significant advances recently. Now, just a few seconds of audio can be enough to synthesize a person's voice. This breakthrough is especially beneficial for those with speech impairments.

Google recently unveiled a new zero-shot voice transfer module for text-to-speech systems. It aims to restore voices for people with conditions like dysarthria, who may have lost their typical voice or never had one.

"Speakers with degenerative neural diseases, such as amyotrophic lateral sclerosis (ALS), Parkinson's, and multiple sclerosis, may experience a degradation of some of the unique characteristics of their voice over time," the authors explain in the blog post. "Some individuals are born with conditions, like muscular dystrophy, that affect the articulatory system and limit their ability to produce certain sounds."

Unlike previous technologies that required extensive voice samples to train a model for each speaker, Google's new approach is zero-shot: it needs no speaker-specific training. Instead, it uses short audio reference clips during generation to transfer a person's voice to the synthesized output. This makes it suitable for those with limited or atypical voice samples.

High-fidelity voice synthesis from short samples

The researchers integrated their voice transfer module into a text-to-speech system. They demonstrated its effectiveness in restoring voices for speakers who had recorded their typical speech before deterioration. Notably, the model produced high-quality speech with strong voice fidelity, even when given atypical input samples.

The module takes the spectrogram of a 2- to 14-second reference clip and extracts acoustic-phonetic and prosodic voice characteristics. It passes these on to other layers of the text-to-speech system as an embedding vector.
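To make the idea of turning a short clip into an embedding vector concrete, here is a minimal sketch in Python. It is not Google's model: the learned encoder is replaced by a fixed random projection with mean pooling over time, and the sample rate, frame sizes, embedding dimension, and all function names are assumptions chosen for illustration. Only the 2 to 14 second clip length comes from the article.

```python
import numpy as np

SAMPLE_RATE = 16_000                   # assumed; not stated in the article
MIN_SECONDS, MAX_SECONDS = 2.0, 14.0   # reference clip length from the article


def log_spectrogram(waveform, frame_len=400, hop=160):
    """Return a (frames, frame_len // 2 + 1) log-magnitude spectrogram."""
    n_frames = 1 + (len(waveform) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([
        waveform[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + 1e-6)


def toy_voice_embedding(waveform, dim=64, seed=0):
    """Map a variable-length clip to a fixed-size, unit-norm vector.

    Hypothetical stand-in for a learned encoder: a fixed random
    projection of each spectrogram frame, averaged over time.
    """
    duration = len(waveform) / SAMPLE_RATE
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        raise ValueError(f"reference clip must be 2-14 s, got {duration:.1f} s")
    spec = log_spectrogram(waveform)
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((spec.shape[1], dim))
    pooled = (spec @ projection).mean(axis=0)  # pool over the time axis
    return pooled / np.linalg.norm(pooled)


# Usage: random noise stands in for a 12-second reference recording.
clip = np.random.default_rng(1).standard_normal(12 * SAMPLE_RATE)
embedding = toy_voice_embedding(clip)
print(embedding.shape)
```

The point of the sketch is the shape of the interface: clips of different lengths all map to a vector of the same size, which downstream layers can condition on without any per-speaker training.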

In a case study, deaf Google researcher Dimitri Kanevsky provided 12 seconds of his atypical voice as a reference. The model then synthesized speech from the transcript of Kanevsky's original video. Colleagues rated the similarity of the output voice to his own at 8.1/10 on average.

Another case study focused on Aubrie Lee, a Google employee with muscular dystrophy who never had a typical voice. Using 14 seconds of Lee's atypical voice as a reference, the model synthesized speech from the transcript of her video. Lee herself rated the similarity at 8/10.

The researchers also showed that the model can transfer voices into other languages, including French, Spanish, Italian, Arabic, German, Russian, Hindi, and Norwegian. Audio samples are available on the project's GitHub page.

Addressing potential misuse and future availability

To prevent misuse, Google uses its SynthID watermarking system. This embeds imperceptible information into the synthesized audio, enabling identification of potentially manipulated content.

Google notes that the risk of misuse is lower for people who never had typical speech, as the synthetic nature of the output would be apparent.

Microsoft recently delayed the release of a similarly powerful voice synthesis model due to the lack of a reliable labeling system. Google has not yet announced plans to release its new system publicly.
