NaturalSpeech 2: Microsoft edges closer to zero-shot voice cloning

Microsoft presents NaturalSpeech 2, a diffusion-based text-to-speech model that can clone any voice from just a short snippet of audio.

Microsoft Research Asia and Microsoft Azure Speech built NaturalSpeech 2 around a diffusion model that works with a neural audio codec, which compresses waveforms into vectors. The team trained the codec on 44,000 hours of speech and singing data; its encoder learns to convert waveforms into vectors using residual vector quantizers (RVQ).

The RVQ uses several "codebooks" as templates for this process, compressing waveforms into predefined vectors. The codec decoder converts the quantized vectors back into waveforms. During training, the diffusion model learns to convert text into such quantized vectors, so that it can later turn arbitrary text input into vectors and pass them to the decoder, which converts them into speech or singing.
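To make the codebook idea concrete, here is a minimal sketch of residual vector quantization in Python with NumPy. It is not Microsoft's implementation: the function names `rvq_encode` and `rvq_decode`, the codebook sizes, and the randomly generated codebooks are all assumptions for illustration, whereas a real codec learns its codebooks during training. Each stage picks the nearest codebook entry to whatever the previous stages left unexplained, and decoding simply adds the chosen entries back together.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual VQ: return one codebook index per stage."""
    residual = np.asarray(x, dtype=float).copy()
    indices = []
    for cb in codebooks:  # each cb has shape (num_entries, dim)
        # Squared distance from the current residual to every entry.
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        # The next stage only has to quantize what is still left over.
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the selected entries."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

# Hypothetical setup: 3 stages, 16 entries each, 4-dimensional frames.
# Random codebooks stand in for learned ones (an assumption).
rng = np.random.default_rng(0)
dim, entries, stages = 4, 16, 3
codebooks = [rng.normal(scale=1.0 / 2 ** s, size=(entries, dim))
             for s in range(stages)]

frame = rng.normal(size=dim)            # stand-in for one encoder output
codes = rvq_encode(frame, codebooks)    # compact representation: 3 integers
recon = rvq_decode(codes, codebooks)    # approximate reconstruction
print(codes, np.linalg.norm(frame - recon))
```

In this toy setup a frame is stored as three small integers instead of four floats, which is the compression the article describes; how close the reconstruction gets depends entirely on how well the codebooks fit the data.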

Microsoft's NaturalSpeech 2 outperforms VALL-E

NaturalSpeech 2 has over 400 million parameters and generates speech with different speaker identities, prosodies and styles (e.g. singing) in zero-shot scenarios where only a few seconds of speech are available. In experiments, the team shows that NaturalSpeech 2 generates natural speech in these scenarios and outperforms the best text-to-speech systems to date, including Microsoft's own VALL-E, which is a language model over discrete audio codec tokens rather than a diffusion model.

Audio example. Text prompt: "And lay me down in my cold bed and leave my shining lot." Clips: audio reference, ground truth, VALL-E, NaturalSpeech 2.

The 44,000 hours of recordings used for training came from 5,000 different speakers and included recordings made under less than ideal studio conditions. The audio codec was trained using 8 Nvidia Tesla V100 (16 gigabytes) GPUs, and the diffusion model was trained using 16 V100 (32 gigabytes) GPUs.

NaturalSpeech 2: Microsoft warns of misuse

The team warns of possible misuse of the system: "NaturalSpeech 2 can synthesize speech with good expressiveness/fidelity and good similarity with a speech prompt, which could be potentially misused, such as speaker mimicking and voice spoofing." Similar problems already exist with publicly available models, however. Microsoft did not announce plans to release NaturalSpeech 2.

In the future, the team plans to scale up the training and test it on much larger speech and singing datasets. They also want to make the model more efficient, for example by using the consistency models recently introduced by OpenAI as an alternative to diffusion models.

More examples are available on the NaturalSpeech 2 project page.
