Meta unveils new research on speech AI: Machine-generated voices can now cry, laugh, yawn or make more natural small talk.
Last October, Meta unveiled its speech AI model, the Generative Spoken Language Model (GSLM). Instead of being trained on text, as is common for language models, GSLM is trained on raw, unlabeled audio data in a self-supervised manner.
During training, the AI works its way through the audio data without supervision, recognizing patterns and learning to reproduce the underlying sounds in order to form new sentences or complete existing ones. According to the Meta researchers, this way of learning language is comparable to how humans learn it.
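To make that idea concrete, the minimal sketch below illustrates the "textless" principle: continuous audio-frame features are quantized into discrete pseudo-units, and a language model is then fitted to those units instead of to words. The synthetic features, k-means quantizer, and bigram model here are stand-ins for illustration only; Meta's actual pipeline uses learned speech encoders, a Transformer unit language model, and a neural vocoder to turn generated units back into audio.

```python
# Illustrative sketch of the "textless" idea behind GSLM (not Meta's code):
# 1) quantize continuous audio-frame features into discrete pseudo-units,
# 2) model the unit sequence with a simple language model,
# 3) sample a continuation of the unit sequence.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.normal(size=(2000, 39))   # stand-in for real frame features (e.g. MFCCs)
n_units = 50

# Step 1: discretize frames into pseudo-phonetic units
kmeans = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(frames)
units = kmeans.predict(frames)         # one discrete unit per frame

# Step 2: fit a bigram "unit language model" with add-one smoothing
counts = np.ones((n_units, n_units))
for prev, nxt in zip(units[:-1], units[1:]):
    counts[prev, nxt] += 1
probs = counts / counts.sum(axis=1, keepdims=True)

# Step 3: continue a unit sequence (a vocoder would turn the units back into speech)
def continue_units(prompt, steps=20):
    seq = list(prompt)
    for _ in range(steps):
        seq.append(int(rng.choice(n_units, p=probs[seq[-1]])))
    return seq

print(continue_units(units[:10].tolist()))
```

Because everything downstream of the quantizer operates on discrete units rather than text, the model can, in principle, preserve acoustic cues that never make it into a transcript.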
GSLM learns dialogues
Meta is now introducing two advancements to the technique behind GSLM that are meant to enable more natural AI dialogues. First, Meta's speech AI can now mimic emotional sounds such as laughing, yawning, or crying, which Meta says is important in communication to better convey the intention and context of a statement.
(Audio examples: neutral originals alongside AI-generated versions with laughter, boredom, and anger.)
According to Meta, the new GSLM variant dGSLM, which is optimized for dialogues, generates more natural-sounding audio dialogues using AI agents that can pause to think or handle overlapping speech. The agents should thus be better able to pick up social cues in speech that are not explicitly reflected in the words themselves, and to follow common conversational conventions.
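The key difference from a turn-based text chatbot is that the conversation is modeled as two parallel, frame-aligned audio channels, one per speaker, so pauses and overlapping speech stay visible to the model instead of being edited out. The hypothetical sketch below illustrates only that two-channel representation; the unit values and the helper function are made up for illustration and are not Meta's architecture.

```python
# Sketch of a two-channel dialogue representation (assumption: each speaker's
# channel is a frame-aligned stream of discrete units with a dedicated
# "silence" unit, so pauses and overlaps are explicit in the data).
SIL = "·"  # silence unit

# Two frame-aligned channels of made-up units for speakers A and B.
channel_a = ["u7", "u7", "u3", SIL, SIL, SIL, "u9", "u9", SIL, SIL]
channel_b = [SIL, SIL, SIL, SIL, "u2", "u2", "u2", "u5", "u5", SIL]

def dialogue_events(a, b, sil=SIL):
    """Label each frame as one speaker, an overlap, or a mutual pause."""
    events = []
    for ua, ub in zip(a, b):
        if ua != sil and ub != sil:
            events.append("overlap")   # both speak at once (backchannel, interruption)
        elif ua == sil and ub == sil:
            events.append("pause")     # gap between turns
        else:
            events.append("A" if ua != sil else "B")
    return events

print(dialogue_events(channel_a, channel_b))
# ['A', 'A', 'A', 'pause', 'B', 'B', 'overlap', 'overlap', 'B', 'pause']
```

Modeling both channels jointly is what allows generated agents to place pauses, backchannels, or laughter at plausible moments rather than waiting for a clean turn boundary.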
dGSLM was trained with about 2,000 hours of unlabeled audio dialogues from the Fisher dataset, which contains about 16,000 English-language telephone conversations. The dataset dates from 2004, and the researchers expect to generate better audio with higher-quality training data.
Speech and gestures as metaverse interface
Meta reemphasizes the importance of artificial intelligence for the metaverse in the context of its new AI research: Audio AI models like the ones shown could create new interaction possibilities in combination with gesture control, for example.
The researchers see self-supervised training on audio rather than text data as an essential building block for future AI systems. AI development could move away from traditional text-based models and toward "more natural, engaging AI systems of the future."
As an immediate application for the newly presented methods, the researchers cite dubbing that skips the detour through text translation, in which emotional nuances can be lost.
More audio examples of Meta's emotional speech AI can be found on the project page. Further details and examples of dGSLM can be found here.