Researchers collect 950,000 hours of open source speech data for EU languages

Oct 7, 2024


International researchers have compiled MOSEL, a comprehensive open-source speech data collection for the 24 official EU languages. The project aims to advance the development of open AI language models in Europe.

Creating powerful AI language models requires vast amounts of training data. Until now, the field has been dominated by English-language datasets and proprietary systems from large tech companies. An international research team wants to change this: With MOSEL (Massive Open-source compliant Speech data for European Languages), they have assembled an extensive collection of open-source speech data for the 24 official languages of the European Union.

The collected data comes from 18 different sources, including projects like CommonVoice, LibriSpeech, and VoxPopuli. It includes both transcribed speech recordings and unlabeled audio data. Particularly valuable are the 505,000 hours of transcribed data.

However, the distribution among languages is very uneven. While over 437,000 hours of labeled data are available for English, languages like Maltese or Irish have only a few hours.

AI-supported transcription expands database

To improve the data situation for low-resource languages, the researchers automatically transcribed an additional 441,000 hours of previously unlabeled audio data. They used OpenAI's Whisper model for this purpose.

The team explains that while automatic transcription is not perfect, it allows large amounts of training material to be provided even for languages with little manually transcribed data. The generated transcripts are published under the Creative Commons CC-BY license, which allows free use with attribution.

The challenges of automatic transcription are particularly evident in the case of Maltese. Here, the Whisper model reached a word error rate of over 80 percent, which means that on average roughly four out of five words were transcribed incorrectly.
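For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the automatic transcript, divided by the number of words in the reference. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the MOSEL project, and the researchers' actual evaluation setup (including any text normalization) may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch only; not the evaluation code used by the MOSEL team.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: four of five reference words are wrong.
print(word_error_rate("a b c d e", "a x y z w"))
```

Because insertions also count as errors, the rate can exceed 100 percent when the automatic transcript contains many words that are not in the reference.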

For such languages, much work is still needed, but the automated transcriptions could serve as a starting point for further improvements. The team also plans to collect more data for underrepresented languages.

The entire data collection is freely available on GitHub and is intended to give researchers and developers easier access to extensive speech data for European languages.
