Microsoft's NaturalSpeech 3 clones voices and emotions

Mar 9, 2024 Maximilian Schreiner

NaturalSpeech 3 is Microsoft's latest text-to-speech system that can clone voices and emotions.

Microsoft Research Asia, Azure Speech, and partner universities have developed a new speech synthesis system called NaturalSpeech 3. The system uses a new approach that breaks down speech into different sub-units such as content, prosody, timbre, and acoustic details. The research follows directly on from NaturalSpeech 2, which was launched in April 2023 and has already demonstrated impressive speech cloning capabilities.

According to the team, the quality of speech generated by previous TTS systems was often unsatisfactory, particularly in terms of naturalness and similarity to the human voice. NaturalSpeech 3 therefore relies on a new type of neural codec. The codec decomposes the speech waveform into separate subspaces, one for each attribute, allowing for more detailed and controlled speech generation.

The system then uses a diffusion model that generates the speech attributes in each of these subspaces, following the prompt given for that attribute. According to the team, this principle allows NaturalSpeech 3 to model complex speech information more efficiently, resulting in higher-quality generated speech.

NaturalSpeech 3 outperforms most systems

According to the team's experiments, NaturalSpeech 3 outperforms existing, freely available TTS systems in terms of quality, similarity, prosody, and intelligibility. The system also achieves comparable or better speech quality than the real speech recordings in the LibriSpeech test set, setting a new standard for similarity between synthesized speech and the original speaker's voice.

Another benefit of NaturalSpeech 3 is the ability to manipulate speech attributes: Users can select and combine different attributes from different speech samples to create the desired voice. For example, the AI system can generate a sentence with different emotions such as anger, fear, or surprise.
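
The idea of picking attributes from different samples can be illustrated with a small toy sketch. It is not Microsoft's code and does not touch any real NaturalSpeech 3 API (the system has not been released); the class `SpeechAttributes`, the function `combine`, and the string values are hypothetical stand-ins for the factorized representations described above.

```python
from dataclasses import dataclass

# Attribute names follow the four sub-units named in the article.
# Everything else here is a hypothetical illustration, not the real system.
ATTRIBUTES = ("content", "prosody", "timbre", "acoustic_details")


@dataclass(frozen=True)
class SpeechAttributes:
    """Toy stand-in for the factorized representation of one speech sample."""
    content: str
    prosody: str
    timbre: str
    acoustic_details: str


def combine(sources: dict[str, SpeechAttributes]) -> SpeechAttributes:
    """Build a new attribute set, taking each attribute from its chosen sample."""
    missing = [name for name in ATTRIBUTES if name not in sources]
    if missing:
        raise ValueError(f"no source sample given for: {missing}")
    return SpeechAttributes(
        **{name: getattr(sources[name], name) for name in ATTRIBUTES}
    )


# Two made-up samples: strings stand in for learned representations.
sample_a = SpeechAttributes("hello", "calm", "voice A", "studio")
sample_b = SpeechAttributes("goodbye", "angry", "voice B", "phone")

# Keep the words and voice of sample A, but borrow the prosody of sample B.
mixed = combine({
    "content": sample_a,
    "prosody": sample_b,
    "timbre": sample_a,
    "acoustic_details": sample_a,
})
print(mixed)
```

In the real system, each attribute would be a learned representation produced by the codec rather than a label, and a diffusion model would generate the final speech from the combined set.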

Prompt & Emotion

Why fades the lotus of the water - sad

Prompt voice

https://the-decoder.de/wp-content/uploads/2024/03/NS-3-12.wav?_=1

NaturalSpeech 3 Output

https://the-decoder.de/wp-content/uploads/2024/03/NS-3-12-1.wav?_=2

Prompt & Emotion

Why fades the lotus of the water - angry

Prompt voice

https://the-decoder.de/wp-content/uploads/2024/03/13.wav?_=3

NaturalSpeech 3 Output

https://the-decoder.de/wp-content/uploads/2024/03/NS-3-13-1.wav?_=4

In the examples shown by the researchers, NaturalSpeech 3 does not come close to the quality of ElevenLabs' commercial solution. However, this is due to the training data used and the size of the model; the team shows that the underlying parameters can be scaled.

As with its predecessor, Microsoft is not releasing NaturalSpeech 3 for security reasons. The research team emphasizes that the ability to generate human-like speech comes with the responsibility to prevent misuse.

It is important to develop robust models for recognizing synthetic speech and to establish systems that allow individuals to report suspected cases, they said.

More examples can be found on the NaturalSpeech 3 project page.

Sources:

Arxiv