NaturalSpeech 3 is Microsoft's latest text-to-speech system that can clone voices and emotions.

Microsoft Research Asia, Azure Speech, and partner universities have developed a new speech synthesis system called NaturalSpeech 3. The system takes a new approach that breaks speech down into distinct attributes: content, prosody, timbre, and acoustic details. The research follows directly on from NaturalSpeech 2, which was introduced in April 2023 and had already demonstrated impressive voice cloning capabilities.

According to the team, the quality of speech generated by previous TTS systems was often unsatisfactory, particularly in terms of naturalness and similarity to the human voice. NaturalSpeech 3 therefore relies on a new type of neural codec. The codec decomposes the speech waveform into separate subspaces, one per attribute, allowing for more detailed and controlled speech generation.

The system then uses a diffusion model that generates the speech attribute in each of these subspaces according to the prompt given for that attribute. According to the team, this principle allows NaturalSpeech 3 to model complex speech information more efficiently, resulting in higher-quality generated speech.

NaturalSpeech 3 outperforms most systems

Experiments show that NaturalSpeech 3 outperforms existing, freely available TTS systems in terms of quality, similarity, prosody, and intelligibility. The system also achieves speech quality comparable to or better than the real speech recordings in the LibriSpeech test set, and sets a new standard for similarity between synthesized speech and the original speaker's voice.

Another benefit of NaturalSpeech 3 is the ability to manipulate speech attributes: Users can select and combine different attributes from different speech samples to create the desired voice. For example, the AI system can generate a sentence with different emotions such as anger, fear, or surprise.
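The idea of selecting attributes from different samples can be illustrated with a toy sketch. This is not Microsoft's code or API: the class `SpeechAttributes`, the function `combine`, and the string values are hypothetical stand-ins, and a real system would hold learned representations rather than labels.

```python
from dataclasses import dataclass, replace


# Hypothetical container with one field per attribute named in the article.
# Strings stand in for the learned representations a real codec would produce.
@dataclass(frozen=True)
class SpeechAttributes:
    content: str
    prosody: str
    timbre: str
    acoustic_details: str


def combine(base: SpeechAttributes, **sources: SpeechAttributes) -> SpeechAttributes:
    """Copy `base`, taking each named attribute from the given source sample."""
    picked = {name: getattr(sample, name) for name, sample in sources.items()}
    return replace(base, **picked)


sample_a = SpeechAttributes("text A", "calm", "voice A", "studio")
sample_b = SpeechAttributes("text B", "angry", "voice B", "street")

# Keep A's content and timbre, borrow B's prosody (an angry delivery).
mixed = combine(sample_a, prosody=sample_b)
print(mixed)
```

Because each attribute sits in its own field, swapping one leaves the others untouched, which is the property the factorized design is meant to provide.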

Audio examples (players not reproduced here): the line "Why fades the lotus of the water" rendered as sad and as angry, each with the prompt voice and the NaturalSpeech 3 output.

In the examples shown by the researchers, NaturalSpeech 3 does not come close to the quality of ElevenLabs' commercial solution. However, this is due to the training data used and the size of the model; the team shows that both can be scaled.

As with its predecessor, Microsoft is not releasing NaturalSpeech 3 for security reasons. The research team emphasizes that the ability to generate human-like speech comes with the responsibility to prevent misuse.

It is important to develop robust models for recognizing synthetic speech and to establish systems that allow individuals to report suspected cases, they said.

More examples can be found on the NaturalSpeech 3 project page.

Summary
  • Microsoft Research Asia, Azure Speech and partner universities have developed NaturalSpeech 3, a new text-to-speech system that can clone voices and emotions, building on NaturalSpeech 2.
  • NaturalSpeech 3 uses a novel neural codec to break down speech into individual units such as content, prosody, timbre and acoustic detail, allowing for more detailed and controlled speech generation.
  • Microsoft is not releasing NaturalSpeech 3 due to security concerns and emphasizes the importance of developing robust models for detecting synthetic speech and putting systems in place for individuals to report suspected cases.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.