Microsoft's NaturalSpeech 3 clones voices and emotions


NaturalSpeech 3 is Microsoft's latest text-to-speech system that can clone voices and emotions.

Microsoft Research Asia, Azure Speech, and partner universities have developed a new speech synthesis system called NaturalSpeech 3. The system uses a new approach that breaks speech down into separate attributes such as content, prosody, timbre, and acoustic details. The research builds directly on NaturalSpeech 2, which launched in April 2023 and already demonstrated impressive speech cloning capabilities.

According to the team, the quality of speech generated by previous TTS systems was often unsatisfactory, particularly in terms of naturalness and similarity to the human voice. NaturalSpeech 3 therefore relies on a new type of neural codec. The codec decomposes the speech waveform into separate subspaces, one for each attribute, allowing for more detailed and controlled speech generation.

The system then uses a diffusion model that generates the speech attributes in each of these subspaces, following the prompt given for that attribute. According to the team, this principle allows NaturalSpeech 3 to model complex speech information more efficiently, resulting in higher-quality generated speech.
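To make the decomposition idea more concrete, here is a minimal toy sketch. It assumes each attribute gets its own small codebook and that a frame is encoded by picking the nearest entry per attribute. The article does not describe the codec internals, so the function names, the codebooks, and the numbers are all hypothetical and are not taken from Microsoft's implementation.

```python
# Toy illustration only: hypothetical names and data, not the NaturalSpeech 3 codec.
SUBSPACES = ("content", "prosody", "timbre", "acoustic_details")


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`
    (squared Euclidean distance)."""
    def dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def factorized_quantize(frame, codebooks):
    """Quantize each attribute of a frame with that attribute's own codebook."""
    return {name: nearest_code(frame[name], codebooks[name]) for name in SUBSPACES}


# Assumed toy codebooks: two 2-dimensional entries per attribute.
codebooks = {name: [[0.0, 0.0], [1.0, 1.0]] for name in SUBSPACES}

# Assumed toy frame: one small vector per attribute.
frame = {
    "content": [0.1, 0.2],
    "prosody": [0.9, 0.8],
    "timbre": [0.0, 0.1],
    "acoustic_details": [1.0, 0.7],
}

tokens = factorized_quantize(frame, codebooks)
print(tokens)
# {'content': 0, 'prosody': 1, 'timbre': 0, 'acoustic_details': 1}
```

Because every attribute ends up with its own token, a downstream model could in principle generate or replace one attribute without touching the others, which is the property the team highlights.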

NaturalSpeech 3 outperforms most systems

Experiments show that NaturalSpeech 3 outperforms existing, freely available TTS systems in terms of quality, similarity, prosody, and intelligibility. The system also achieves speech quality comparable to or better than the real speech recordings in the LibriSpeech test set, setting a new standard for similarity between synthesized speech and the original speaker's voice.

Another benefit of NaturalSpeech 3 is the ability to manipulate speech attributes: Users can select and combine different attributes from different speech samples to create the desired voice. For example, the AI system can generate a sentence with different emotions such as anger, fear, or surprise.
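The idea of picking attributes from different samples can be sketched as a simple data structure. This is an illustration under assumptions, not the NaturalSpeech 3 interface: the `SpeechAttributes` class, the `combine` helper, and the string labels are hypothetical stand-ins for the learned representations the real system would use.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SpeechAttributes:
    """Hypothetical container for the four attributes named in the article."""
    content: str
    prosody: str
    timbre: str
    acoustic_details: str


def combine(base, **overrides):
    """Return a copy of `base` with selected attributes taken from other samples."""
    return replace(base, **overrides)


# Two assumed speech samples, described by placeholder labels.
sample_a = SpeechAttributes(
    content="Why fades the lotus of the water",
    prosody="neutral",
    timbre="speaker_a",
    acoustic_details="studio",
)
sample_b = SpeechAttributes(
    content="(another sentence)",
    prosody="sad",
    timbre="speaker_b",
    acoustic_details="noisy room",
)

# Keep speaker A's voice and text, but borrow the sad prosody from sample B.
mixed = combine(sample_a, prosody=sample_b.prosody)
print(mixed.timbre, mixed.prosody)  # speaker_a sad
```

In this toy version the attributes are plain strings, so it only shows the selection logic, not how the attributes are extracted or synthesized.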

Audio examples (not reproduced here): the prompt "Why fades the lotus of the water" rendered with a sad and with an angry emotion, each with the prompt voice and the NaturalSpeech 3 output.

In the examples shown by the researchers, NaturalSpeech 3 does not come close to the quality of ElevenLabs' commercial solution. However, this is due to the training data used and the size of the model: the team shows that the model can be scaled up.

As with its predecessor, Microsoft is not releasing NaturalSpeech 3 for security reasons. The research team emphasizes that the ability to generate human-like speech comes with the responsibility to prevent misuse.

It is important to develop robust models for recognizing synthetic speech and to establish systems that allow individuals to report suspected cases, they said.

More examples can be found on the NaturalSpeech 3 project page.
