
Microsoft presents NaturalSpeech 2, a text-to-speech model that is based on diffusion models and is capable of cloning any voice with just a short snippet of audio.

Microsoft Research Asia and Microsoft Azure Speech developed NaturalSpeech 2 around a diffusion model that operates on the latent space of a neural audio codec, which compresses waveforms into vectors. The team trained the codec on 44,000 hours of speech and singing data; its encoder learns to convert waveforms into latent vectors, which are discretized using residual vector quantizers (RVQ).

The RVQ uses several "codebooks" as templates for this process, compressing the latent vectors into predefined codebook entries. The codec decoder converts the quantized vectors back into waveforms. During training, the diffusion model learns to predict such quantized vectors from text, so that at inference it can turn arbitrary text input into latents that the decoder converts into speech or song.
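The residual quantization step can be illustrated with a toy sketch. All sizes and the random codebooks here are illustrative assumptions, not the actual parameters of Microsoft's codec:

```python
import numpy as np

rng = np.random.default_rng(0)
num_quantizers = 4   # stacked codebooks; real codecs use more
codebook_size = 8    # entries per codebook (illustrative)
dim = 16             # latent dimension of one audio frame

codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(num_quantizers)]

def rvq_encode(frame):
    """Each codebook quantizes the residual left over by the previous one."""
    residual = frame.copy()
    indices = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)      # store the chosen entry's index
        residual -= cb[idx]      # pass the remaining error to the next codebook
    return indices, residual

def rvq_decode(indices):
    """Reconstruct the frame by summing the chosen entry from each codebook."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

frame = rng.normal(size=dim)
indices, residual = rvq_encode(frame)
recon = rvq_decode(indices)
# By construction, the reconstruction plus the final residual equals the input.
assert np.allclose(recon + residual, frame)
```

Because each codebook only has to encode what the previous ones missed, stacking quantizers shrinks the reconstruction error while keeping each codebook small.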

Microsoft's NaturalSpeech 2 outperforms VALL-E

NaturalSpeech 2 has over 400 million parameters and generates speech with different speaker identities, prosodies and styles (e.g. singing) in zero-shot scenarios where only a few seconds of speech are available. In experiments, the team shows that NaturalSpeech 2 is able to generate natural speech in these scenarios, outperforming the best text-to-speech systems to date, including Microsoft's own VALL-E, which instead generates discrete audio codec tokens with a language-model approach.
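The diffusion step itself works by iteratively denoising random latents. Below is a minimal DDIM-style sketch with a stand-in "denoiser" in place of the real text- and prompt-conditioned network; the schedule, step count, and dimensions are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
num_steps = 50
latent_dim = 16
# Stand-in for the clean codec latents implied by the text and speaker prompt.
target = rng.normal(size=latent_dim)

def denoiser(z, t):
    """A real model predicts the clean latent from noisy z, conditioned on
    phonemes, pitch and the speech prompt; here we pretend it is perfect."""
    return target

def alpha_bar(t):
    return 1.0 - t  # simple linear noise schedule (illustrative)

z = rng.normal(size=latent_dim)  # start from pure Gaussian noise
for step in reversed(range(1, num_steps + 1)):
    t, t_prev = step / num_steps, (step - 1) / num_steps
    a, a_prev = alpha_bar(t), alpha_bar(t_prev)
    x0_hat = denoiser(z, t)                                  # predicted clean latent
    eps_hat = (z - np.sqrt(a) * x0_hat) / np.sqrt(1.0 - a)   # implied noise
    z = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat  # DDIM update

# With a perfect denoiser, the loop recovers the target latents exactly;
# the codec decoder would then turn them into a waveform.
assert np.allclose(z, target)
```

In the real system, the denoiser is a large conditioned network and the few seconds of reference audio enter as a prompt, which is what enables the zero-shot voice cloning.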

[Audio demo: the prompt "And lay me down in my cold bed and leave my shining lot." synthesized by VALL-E and NaturalSpeech 2, alongside the audio reference and ground-truth recording.]

The 44,000 hours of recordings used for training came from 5,000 different speakers and included recordings made under less than ideal studio conditions. The audio codec was trained using 8 Nvidia Tesla V100 (16 gigabytes) GPUs, and the diffusion model was trained using 16 V100 (32 gigabytes) GPUs.

NaturalSpeech 2: Microsoft warns of misuse

The team warns of possible misuse of the system: "NaturalSpeech 2 can synthesize speech with good expressiveness/fidelity and good similarity with a speech prompt, which could be potentially misused, such as speaker mimicking and voice spoofing." Similar problems already exist with publicly available models, however. Microsoft did not announce plans to release NaturalSpeech 2.

In the future, the team plans to scale up the training and test it on much larger speech and singing datasets. They also want to make the model more efficient, for example by using the consistency models recently introduced by OpenAI as an alternative to diffusion models.

More examples are available on the NaturalSpeech 2 project page.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • Microsoft uses a diffusion model and 44,000 hours of speech and singing data from about 5,000 people to create one of the best text-to-speech systems.
  • NaturalSpeech 2 can speak or sing any text in a given voice, cloned from just a few seconds of audio.
  • Microsoft is not releasing NaturalSpeech 2 at this time and warns of possible misuse.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.