Meta's new speech AI can laugh, scream, yawn, and chit-chat

Meta unveils new research on speech AI: Machine-generated voices can now cry, laugh, yawn or make more natural small talk.

Last October, Meta unveiled its speech AI model, the Generative Spoken Language Model (GSLM). Instead of being trained on text, as is usual, the model is trained on unlabeled audio data in a self-supervised manner.

During training, the AI works its way through the audio data unaided, recognizing patterns in it and learning to mimic the underlying sounds to form new sentences or complete existing sentences. From the Meta researchers' perspective, this way of learning language is comparable to that of humans.

GSLM learns dialogs

Now Meta is introducing two advancements to the technique used for GSLM that should enable more natural AI dialogs. First, Meta's speech AI can now mimic emotional sounds such as laughing, yawning, or crying. Meta says these sounds are important in communication because they better convey the intention and context of a statement.

Audio examples: three neutral original recordings, each followed by an AI-generated version (with laughter, bored, angry).

Second, according to Meta, dGSLM, a new GSLM variant optimized for dialogs, generates more natural-sounding audio dialogs using AI agents that can pause for thought or handle overlapping speech. The agents should thus be better able to pick up on social cues in speech that are not explicitly reflected in the chosen words, and to adhere more closely to common conversational conventions.

dGSLM was trained on about 2,000 hours of unlabeled audio dialogs from the Fisher dataset, which contains about 16,000 English-language telephone conversations. The dataset dates from 2004, and the researchers expect higher-quality training data to yield better audio.

Speech and gestures as metaverse interface

Meta reemphasizes the importance of artificial intelligence for the metaverse in the context of its new AI research: Audio AI models like the ones shown could create new interaction possibilities in combination with, for example, gesture control.

The researchers see self-supervised AI training on audio rather than text data as an essential building block for future AI systems. AI development could move away from traditional text-based models and toward "more natural, engaging AI systems of the future."

As an immediate application scenario for the newly presented methods, the researchers cite dubbing without the detour via text translation, where emotional interpretations can be lost.

More audio examples of Meta's emotional speech AI can be found on the project page. Further details and examples of dGSLM can be found here.
