Meta's Fundamental AI Research (FAIR) team has introduced several new specialized AI models, including Spirit LM, a multimodal language model that seamlessly integrates text and speech.

This development mirrors OpenAI's approach with GPT-4o and its Advanced Voice Mode. Meta published the research paper for Spirit LM in February and has now made the corresponding code and model weights available for free download.

Spirit LM builds on a pretrained text language model and extends it to speech by continuing training on both text and speech units. The model encodes speech and text as a single token stream, interleaving the two modalities at the word level. To build the training data for this, the researchers used a small, automatically curated parallel corpus of speech-text pairs.
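To make the word-level interleaving concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the stand-in tokenizers and the [TEXT]/[SPEECH] marker tokens are not taken from Spirit LM's released code.

```python
# Purely illustrative sketch of word-level interleaving of text and speech
# tokens; names and marker tokens are assumptions, not Spirit LM's real code.

def text_tokens(word: str) -> list[str]:
    # Stand-in for a subword text tokenizer.
    return [f"txt:{word}"]

def speech_units(word: str) -> list[str]:
    # Stand-in for discrete semantic speech units (e.g. HuBERT-style tokens);
    # here we pretend every word maps to exactly two units.
    return [f"hu:{i}" for i in range(2)]

def interleave(words: list[str], spoken: set[int]) -> list[str]:
    """Build one token stream, switching modality per word.

    `spoken` holds the indices of words rendered as speech units; modality
    switches are flagged with special tokens so the model can tell the two
    apart during training.
    """
    stream: list[str] = []
    in_speech = False
    for i, word in enumerate(words):
        if (i in spoken) != in_speech:
            in_speech = i in spoken
            stream.append("[SPEECH]" if in_speech else "[TEXT]")
        stream += speech_units(word) if in_speech else text_tokens(word)
    return stream

print(interleave(["the", "cat", "sat"], spoken={1}))
# ['txt:the', '[SPEECH]', 'hu:0', 'hu:1', '[TEXT]', 'txt:sat']
```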

Two versions: Base and Expressive

Meta has released Spirit LM in two versions. The Base model uses semantic speech units only, while the Expressive version additionally incorporates pitch and style units to capture intonation and emotion. This design lets the model combine the semantic abilities of text models with the expressive abilities of speech models.
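Building on the sketch above, a hypothetical Expressive-style speech span might place quantized pitch and style tokens next to the semantic units. The token names and granularity here are illustrative assumptions, not Meta's actual unit vocabulary.

```python
# Hypothetical Expressive-style speech span: semantic units plus quantized
# pitch and style tokens. Names and granularity are illustrative only.

def expressive_units(word: str) -> list[str]:
    semantic = [f"hu:{i}" for i in range(2)]  # phonetic content ("what is said")
    pitch = ["pitch:high"]                    # quantized F0 token ("how it sounds")
    style = ["style:excited"]                 # utterance-level style token
    # Mixing prosody tokens into the semantic stream lets the model condition
    # its output on intonation and emotion, not just on content.
    return semantic + pitch + style
```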

This combined text-and-speech architecture lets Spirit LM handle a variety of tasks. It can transcribe spoken language, read written text aloud, and classify spoken utterances by content. The multimodal design also enables cross-modal applications, such as converting written text directly into speech and vice versa.

Video: Meta

The researchers demonstrated that Spirit LM can learn new tasks few-shot, from just a handful of examples in the prompt, both within a single modality and across modalities.
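As a rough sketch of how such a cross-modal few-shot prompt could be assembled from the interleaved token streams above (the prompt layout is an assumption, not Spirit LM's documented interface):

```python
# Illustrative few-shot prompt for an ASR-style task: pairs of
# (speech units -> transcript) examples, then a query in speech units.
# The layout is an assumption, not Spirit LM's documented prompt format.

def asr_fewshot_prompt(examples: list[tuple[list[str], str]],
                       query_units: list[str]) -> list[str]:
    prompt: list[str] = []
    for units, transcript in examples:
        prompt += ["[SPEECH]", *units, "[TEXT]", *transcript.split()]
    # End with the query and an open [TEXT] marker so the model continues
    # the pattern by generating the transcript as text tokens.
    prompt += ["[SPEECH]", *query_units, "[TEXT]"]
    return prompt
```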

By combining semantic, prosodic, and stylistic information, Spirit LM Expressive generates particularly expressive speech output. In experiments, the model preserved the mood of text and speech input in its generated output, a capability that previous language models often lacked.

More AI news from the FAIR team

Meta's recent AI announcements also include an update to the Segment Anything model for image segmentation, a solution called Layer Skip for speeding up large language models, and advances in efficient training of multilingual models with Meta Lingua. The company also presented new research on post-quantum cryptography security, AI-supported materials research, and improving sentence representations.

Meta continues to emphasize its commitment to developing advanced AI while promoting open science. However, the company recently faced criticism for attempting to redefine the term "open source" according to its own interpretation.

With the recent release of Llama 3.2, which brought image understanding to Meta's AI platforms, Meta may well incorporate findings from Spirit LM into a future Llama model. That could produce a genuine "omnimodal" competitor to GPT-4o, with a voice mode similar to OpenAI's Advanced Voice Mode.

Summary
  • Meta's Fundamental AI Research (FAIR) team releases several specialized AI models, most notably Spirit LM, which seamlessly combines text and speech. The model is available in an expressive version that can capture emphasis and emotion.
  • Thanks to its combined text-and-speech architecture, Spirit LM can handle a variety of tasks, such as automatic speech recognition, reading text aloud, and cross-modal applications. It can learn new tasks few-shot.
  • In addition to Spirit LM, Meta's latest AI developments include an update to the Segment Anything image segmentation model, a solution for accelerating large language models called Layer Skip, and advances in training multilingual models with Meta Lingua.