
Meta's Fundamental AI Research (FAIR) team has introduced several new specialized AI models, including Spirit LM, a multimodal language model that seamlessly integrates text and speech.


This development mirrors OpenAI's approach with GPT-4o and its Advanced Voice Mode. Meta published the research paper for Spirit LM in February and has now made the corresponding code and model weights available for free download.

Spirit LM builds on a pre-trained text language model, which the researchers extended to speech by continuing training on both text and speech units. The model encodes speech and text as a single stream of tokens, interleaving the two modalities at the word level. To make this possible, the researchers used a small, automatically curated parallel corpus of speech and text.
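
To make the idea concrete, here is a minimal Python sketch of word-level interleaving. This is not Meta's implementation: the `[TEXT]` and `[SPEECH]` markers, the unit IDs, and the switching probability are assumptions for illustration only.

```python
import random

# Hypothetical modality markers flagging whether the following span
# consists of text tokens or speech units.
TEXT, SPEECH = "[TEXT]", "[SPEECH]"

def interleave(words, p_switch=0.3, seed=0):
    """Flatten word-aligned (text tokens, speech units) pairs into a
    single token stream, randomly switching modality at word boundaries."""
    rng = random.Random(seed)
    modality = rng.choice([TEXT, SPEECH])
    stream = [modality]
    for text_tokens, speech_units in words:
        if rng.random() < p_switch:  # maybe change modality at this word
            modality = SPEECH if modality == TEXT else TEXT
            stream.append(modality)
        stream += text_tokens if modality == TEXT else speech_units
    return stream

# Two words with their text tokens and made-up speech unit IDs
words = [(["Hel", "lo"], ["u12", "u7", "u31"]),
         (["world"], ["u5", "u19"])]
print(interleave(words))
```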

Two versions: Base and Expressive

Meta has released Spirit LM in two versions. The base model uses semantic speech units, while the expressive version adds pitch and style units to capture intonation and emotion. This approach combines the semantic capabilities of text models with the expressive qualities of speech models.
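
The difference between the two token streams can be sketched like this; the token names (`uN` for semantic units, `pN` for pitch, `sN` for style) are invented for illustration:

```python
# Base model: a spoken word is represented by semantic speech units alone.
base_stream = ["[SPEECH]", "u12", "u7", "u31"]

# Expressive model: pitch and style tokens are mixed into the same stream,
# so intonation and emotion survive tokenization and can be generated back.
expressive_stream = ["[SPEECH]", "s2", "p4", "u12", "u7", "p5", "u31"]
```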


This combined text-and-speech architecture lets Spirit LM handle a variety of tasks: it can transcribe spoken language, read written text aloud, and classify spoken utterances by their content. The multimodal design also enables cross-modal applications, such as converting written text directly into speech and vice versa.
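
Because everything lives in one token stream, these tasks reduce to different prompt shapes. A sketch under the same assumed marker tokens (not a real API):

```python
def asr_prompt(speech_units):
    # Speech in, text continuation expected out -> transcription.
    return ["[SPEECH]", *speech_units, "[TEXT]"]

def tts_prompt(text_tokens):
    # Text in, speech continuation expected out -> units for a vocoder.
    return ["[TEXT]", *text_tokens, "[SPEECH]"]
```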

Video: Meta

Researchers demonstrated that Spirit LM can learn new tasks in a few-shot manner, both within a single modality and across modalities, after being shown only a handful of examples.
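
In practice, few-shot prompting means concatenating a handful of solved input-output pairs ahead of the query. A sketch for speech-to-text, again with assumed marker tokens:

```python
def few_shot_asr_prompt(examples, query_units):
    """examples: list of (speech_units, text_tokens) demonstration pairs."""
    prompt = []
    for units, text in examples:
        prompt += ["[SPEECH]", *units, "[TEXT]", *text]
    # The model should continue the final [TEXT] span with the
    # transcription of the query utterance.
    return prompt + ["[SPEECH]", *query_units, "[TEXT]"]
```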

By combining semantic, prosodic, and stylistic information, Spirit LM Expressive generates particularly expressive speech output. Experiments showed that the model can maintain the mood of text and speech input in the generated output – a capability often lacking in previous language models.

More AI news from the FAIR team

Meta's recent AI announcements also include an update to the Segment Anything model for image segmentation, a solution called Layer Skip for speeding up large language models, and advances in efficient training of multilingual models with Meta Lingua. The company also presented new research on post-quantum cryptography security, AI-supported materials research, and improving sentence representations.


Meta continues to emphasize its commitment to developing advanced AI while promoting open science. However, the company recently faced criticism for attempting to redefine the term "open source" according to its own interpretation.

With the recent release of Llama 3.2, which brought image understanding to Meta's AI platforms, the company may incorporate findings from Spirit LM into a future Llama model. That could produce a genuine "omnimodal" competitor to GPT-4o, complete with a voice mode similar to OpenAI's Advanced Voice Mode.

Summary
  • Meta's Fundamental AI Research (FAIR) team releases several specialized AI models, most notably Spirit LM, which seamlessly combines text and speech. The model is available in an expressive version that can capture emphasis and emotion.
  • Thanks to its combined text-speech architecture, Spirit LM can handle a variety of tasks, such as automatic speech recognition, text reading and cross-modality applications. It can learn new tasks in a few-shot process.
  • In addition to Spirit LM, Meta's latest AI developments include an update to the Segment Anything image segmentation model, a solution for accelerating large language models called Layer Skip, and advances in training multilingual models with Meta Lingua.