Voice cloning technology has advanced to the point where just a few seconds of audio can be enough to synthesize a person's voice. This is especially beneficial for people with speech impairments.

Google recently unveiled a new zero-shot voice transfer module for text-to-speech systems. It aims to restore voices for people with conditions like dysarthria, who may have lost their typical voice or never had one.

"Speakers with degenerative neural diseases, such as amyotrophic lateral sclerosis (ALS), Parkinson's, and multiple sclerosis, may experience a degradation of some of the unique characteristics of their voice over time," the authors explain in the blog post. "Some individuals are born with conditions, like muscular dystrophy, that affect the articulatory system and limit their ability to produce certain sounds."

Unlike previous technologies that required extensive voice samples to train models, Google's new approach needs no speaker-specific training. It uses short audio reference clips during generation to transfer a person's voice to the synthesized output. This makes it suitable for those with limited or atypical voice samples.

High-fidelity voice synthesis from short samples

The researchers integrated their voice transfer module into a text-to-speech system. They demonstrated its effectiveness in restoring voices for speakers who had recorded their typical speech before deterioration. Notably, the model produced high-quality speech with strong voice fidelity, even when given atypical input samples.

The module takes a 2-14 second spectrogram and extracts acoustic-phonetic and prosodic voice characteristics. It passes these on to other layers as an embedding vector.
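
The idea of turning a short reference clip into a fixed-size vector can be illustrated with a minimal sketch. This is not Google's encoder: the function names, the naive DFT, and the mean-pooling step are all assumptions chosen for illustration, and a real system would use a learned network rather than a simple average. Only the 2-14 second length limit comes from the article.

```python
import cmath
import math

MIN_SECONDS = 2.0   # lower bound on reference length reported in the article
MAX_SECONDS = 14.0  # upper bound on reference length reported in the article


def magnitude_spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames, each a list of DFT magnitudes (naive DFT)."""
    n_bins = frame_size // 2 + 1
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Hann window to reduce spectral leakage between frequency bins.
        windowed = [
            x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_size - 1)))
            for i, x in enumerate(frame)
        ]
        mags = []
        for k in range(n_bins):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * i / frame_size)
                for i, x in enumerate(windowed)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def reference_embedding(samples, sample_rate, frame_size=64, hop=32):
    """Hypothetical stand-in for a voice encoder: mean-pool the spectrogram over time."""
    duration = len(samples) / sample_rate
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        raise ValueError(f"reference must be 2-14 s long, got {duration:.2f} s")
    spec = magnitude_spectrogram(samples, frame_size, hop)
    n_bins = len(spec[0])
    # One value per frequency bin, independent of how long the clip was.
    return [sum(frame[k] for frame in spec) / len(spec) for k in range(n_bins)]
```

Whatever the clip length within the allowed range, the result has the same number of entries, which is what lets downstream layers consume it as a single embedding vector.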

In a case study, deaf Google researcher Dimitri Kanevsky provided 12 seconds of his atypical voice as a reference. The model then synthesized speech from the transcript of Kanevsky's original video. Colleagues rated the similarity of the output voice to his own at 8.1/10 on average.

Another study focused on Aubrie Lee, a Google employee with muscular dystrophy who never had a typical voice. Using 14 seconds of Lee's atypical voice as a reference, the model synthesized speech from the transcript of her video. Lee herself rated the similarity at 8/10.

The researchers also showed that the model can transfer a voice into other languages, including French, Spanish, Italian, Arabic, German, Russian, Hindi, and Norwegian. Audio samples are available on the project's GitHub page.

Addressing potential misuse and future availability

To prevent misuse, Google uses its SynthID watermarking system. This embeds imperceptible information into the synthesized audio, enabling identification of potentially manipulated content.
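
The article does not describe how SynthID works internally, so the following is only a toy illustration of the general idea of an audio watermark: a generic spread-spectrum scheme, not Google's method. A low-amplitude pseudo-random sequence derived from a secret key is added to the audio, and detection correlates the audio with that same sequence. All function names, the key, and the `strength` value are assumptions made for this sketch.

```python
import random


def _chip_sequence(key, n):
    """Pseudo-random sequence of +1/-1 values derived from the (hypothetical) secret key."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def embed_watermark(samples, key, strength=0.02):
    """Add a low-amplitude keyed sequence to the audio samples."""
    chips = _chip_sequence(key, len(samples))
    return [x + strength * c for x, c in zip(samples, chips)]


def watermark_score(samples, key):
    """Correlate the audio with the keyed sequence; close to `strength` if marked."""
    chips = _chip_sequence(key, len(samples))
    return sum(x * c for x, c in zip(samples, chips)) / len(samples)


def is_watermarked(samples, key, strength=0.02):
    """Decide by thresholding the correlation at half the embedding strength."""
    return watermark_score(samples, key) > strength / 2
```

With a clip of a few seconds, calling `is_watermarked(embed_watermark(audio, key), key)` should report the mark, while unmarked audio or a wrong key should not. A toy like this would not survive compression or editing, which is part of why production systems are considerably more elaborate.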

Google notes that the risk of misuse is lower for people who never had typical speech, as the synthetic nature of the output would be apparent.

Microsoft recently delayed the release of a similarly powerful voice synthesis model, citing the lack of a reliable labeling system. Google has not yet announced plans to release its new system publicly.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • Google has developed a zero-shot voice transfer module for text-to-speech systems. This technology can restore voices for people with speech disorders like dysarthria, even if they've lost their voice or never had a typical one.
  • The module works by using brief audio reference samples to transfer a person's voice characteristics to synthesized speech output. In case studies, the model produced high-quality speech with remarkable voice fidelity, even from atypical input.
  • To address concerns about misuse, Google employs its SynthID watermarking system. This embeds imperceptible information into the synthesized audio, allowing for the identification of potentially manipulated content.
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.