
International researchers have compiled MOSEL, a comprehensive open-source speech data collection for the 24 official EU languages. The project aims to advance the development of open AI language models in Europe.


Creating powerful AI language models requires vast amounts of training data. Until now, English-language datasets and proprietary systems from large tech companies have dominated. An international research team wants to change this: With MOSEL (Massive Open-source compliant Speech data for European Languages), they have assembled an extensive collection of open-source speech data for the 24 official languages of the European Union.

The collected data comes from 18 different sources, including projects like CommonVoice, LibriSpeech, and VoxPopuli. It includes both transcribed speech recordings and unlabeled audio data. Particularly valuable are the 505,000 hours of transcribed data.

However, the distribution among languages is very uneven. While over 437,000 hours of labeled data are available for English, languages like Maltese or Irish have only a few hours.
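The imbalance can be made concrete with a quick back-of-the-envelope calculation. The sketch below uses only the rounded figures quoted above (505,000 labeled hours in total, "over 437,000" of them English), so the results are approximations rather than official MOSEL statistics; the variable names are illustrative.

```python
# Rounded figures as reported in the article; English is "over" 437,000 hours,
# so the computed share is a lower-bound approximation.
TOTAL_LABELED_HOURS = 505_000
ENGLISH_LABELED_HOURS = 437_000
NUM_EU_LANGUAGES = 24

# Fraction of all labeled hours that are English.
english_share = ENGLISH_LABELED_HOURS / TOTAL_LABELED_HOURS

# Hours left for the other 23 languages, and their average per language.
remaining_hours = TOTAL_LABELED_HOURS - ENGLISH_LABELED_HOURS
avg_hours_other = remaining_hours / (NUM_EU_LANGUAGES - 1)

print(f"English share of labeled data: {english_share:.1%}")
print(f"Hours shared by the other languages: {remaining_hours:,}")
print(f"Average per non-English language: {avg_hours_other:,.0f} hours")
```

Under these assumptions, English alone accounts for a bit under nine-tenths of the labeled data. The per-language average hides further imbalance, since languages like Maltese or Irish sit far below it.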


AI-supported transcription expands database

To improve the data situation for low-resource languages, the researchers automatically transcribed an additional 441,000 hours of previously unlabeled audio data. They used OpenAI's Whisper model for this purpose.

The team explains that while automatic transcription is not perfect, it allows large amounts of training material to be provided even for languages with little manually transcribed data. The generated transcripts are published under the Creative Commons CC-BY license, which allows free use with attribution.

The challenges of automatic transcription are particularly evident in the case of Maltese. Here, the Whisper model achieved a word error rate of over 80 percent, meaning that on average it made more than four errors (substituted, missing, or inserted words) for every five words in the reference transcript.
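To illustrate what a word error rate of this size means, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's output, divided by the number of reference words. The function name and the example sentences are made up for illustration; this is not the evaluation code used by the MOSEL team, and real evaluations usually normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # prev_row[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein recurrence).
    prev_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr_row.append(min(
                prev_row[j] + 1,                      # reference word deleted
                curr_row[j - 1] + 1,                  # extra word inserted
                prev_row[j - 1] + substitution_cost,  # match or substitution
            ))
        prev_row = curr_row

    return prev_row[-1] / len(ref_words)


# Hypothetical example: four of five reference words are misrecognized.
reference = "the team published the data"
hypothesis = "the dream polished a date"
print(word_error_rate(reference, hypothesis))  # 0.8
```

Because inserted words also count as errors, the rate can exceed 100 percent, so it is not strictly the share of words that were recognized incorrectly.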

For such languages, much work is still needed, but the automated transcriptions could serve as a starting point for further improvements. The team also plans to collect more data for underrepresented languages.

The entire data collection is freely available on GitHub and is intended to facilitate researchers' and developers' access to extensive speech data for European languages.

Summary
  • An international team of researchers has compiled MOSEL, a comprehensive open-source speech data collection for the 24 official EU languages. The project aims to support the development of open AI language models in Europe.
  • MOSEL contains 505,000 hours of transcribed speech data from 18 different sources. In addition, 441,000 hours of unlabeled audio have been automatically transcribed using OpenAI's Whisper model to expand the database for low-resource languages.
  • The distribution of data across languages is uneven. While there are over 437,000 hours of labeled data for English, there are only a few hours for languages such as Maltese or Irish. The entire data collection is freely available on GitHub.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.