Content
summary Summary

As part of the Massively Multilingual Speech project, Meta is releasing AI models that can convert spoken language to text and text to speech in 1,100 languages.

Ad

The new set of models is based on Meta's wav2vec, as well as a curated dataset of examples for 1,100 languages and another uncurated dataset for nearly 4,000 languages, including languages spoken by only a few hundred people for which no speech technology yet exists, according to Meta.

The model can express itself in more than 1,000 languages and identify more than 4,000 languages. According to Meta, MMS outperforms previous models by covering ten times more languages. You can get an overview of all available languages here.

New Testament gets new use as AI dataset

A key component of MMS is the Bible, specifically the New Testament. The Meta dataset contains New Testament readings in more than 1,107 languages with an average length of 32 hours.

Ad
Ad

Meta used these recordings in combination with matching passages from the Internet. In addition, another 3,809 unlabeled audio files were used, also New Testament readings, but without additional language information.

Since 32 hours per language is not enough training material for a reliable speech recognition system, Meta used wave2vec 2.0 to pre-train MMS models with more than 500,000 hours of speech in more than 1,400 languages. These models were then fine-tuned to understand or identify numerous languages.

Benchmarks show that the model's performance remained nearly constant despite training with many more different languages. In fact, the error rate decreased minimally by 0.4 percentage points with increasing training.

Image: Meta

According to Meta, it is also significantly lower than that of OpenAI's Whisper, which was not explicitly optimized for extensive multilingualism. A comparison in English only would be more interesting. First testers on Twitter report that Whisper performs better here.

The fact that the voices in the dataset are predominantly male does not negatively affect the understanding or generation of female voices, according to Meta.

Recommendation

In addition, the model does not tend to generate overly religious speech. Meta attributes this to the classification approach used (Connectionist Temporal Classification), which focuses more on speech patterns and sequences than on word content and meaning.

Meta cautions, however, that the model sometimes transcribes words or phrases incorrectly, which can lead to incorrect or offensive statements.

One model for thousands of languages

Meta's long-term goal is to develop a single language model for as many languages as possible in order to preserve endangered languages. Future models may support even more languages and even dialects.

"Our goal is to make it easier for people to access information and to use devices in their preferred language," Meta writes. Specific application scenarios include VR and AR technologies or messaging.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

In the future, a single model could be trained for all tasks, such as speech recognition, speech synthesis, and speech identification, leading to even better overall performance, Meta writes.

The code, the pre-trained MMS models with 300 million and one billion parameters, respectively, and the refined derivations for speech recognition and identification and text-to-speech are made available by Meta as open-source Models on Github.

Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • As part of the Massively Multilingual Speech project, Meta has developed AI models capable of translating spoken language to text and text to speech for 1,100 languages. The model is based on Meta's wav2vec and selected datasets.
  • MMS has been trained on a large number of languages, including languages spoken by only a few hundred people. Despite the large number of languages trained, the performance of the model remained constant.
  • The model was trained primarily on recordings from the New Testament and additional unlabeled audio files. Meta plans to continue improving the technology, combining multiple features in one model, and supporting more languages and dialects.
Sources
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.