European Research Foundation wants to promote open source LLM diversity

Mar 8, 2024


With the publication of ten intermediary 7B model checkpoints for five European languages, the academic research collective Occiglot takes a step towards preserving European language diversity and digital sovereignty.

US tech companies dominate the market for large language models (LLMs). The academic, non-profit research collective Occiglot aims to counteract this and to strengthen Europe's academic and economic competitiveness and AI sovereignty.

"Occiglot strongly believes that dedicated language modeling solutions are key to maintaining Europe’s academic and economic competitiveness and AI sovereignty," the announcement reads.

Occiglot has initially released ten intermediary 7B model checkpoints focused on the five major European languages: English, German, French, Spanish, and Italian.

The models are based on the existing Mistral 7B model and were trained further on 700 billion additional multilingual tokens in continued pre-training and on about 1 billion tokens for instruction tuning. Details can be found in the technical report.

In addition, a multilingual model covering all five languages has been developed. All models are available on Hugging Face under the Apache 2.0 license.

Occiglot's roadmap for the coming months is to develop a unified language modeling process that supports all 24 official languages of the European Union as well as several unofficial and regional languages. To this end, a corpus of about 1 trillion tokens of non-English pre-training data has already been collected.

The German AI innovation lab hessian.AI has pledged its support: it intends to provide a "significant amount of compute" on its AI supercomputer fortytwo.

Occiglot is looking for partners

Occiglot calls for collaboration and exchange within the academic and non-academic machine learning, AI and natural language processing community.

The project emphasizes the need for high-quality text data for pre-training, especially for low-resource languages, and is searching for partners for the creation and curation of instruction tuning datasets as well as for the evaluation of LLMs. You can get in touch via Discord.

Occiglot's initiators, supported by the German Research Center for Artificial Intelligence (DFKI), the hessian.AI Innovation Lab and the hessian.AISC Service Center, see their initiative as key to preserving Europe's linguistic and cultural diversity.
