
With the publication of ten intermediary 7B model checkpoints for five European languages, the academic research collective Occiglot takes a step towards preserving European language diversity and digital sovereignty.

US tech companies dominate the market for large language models (LLMs). The academic, non-profit research collective Occiglot wants to counteract this and strengthen Europe's academic and economic competitiveness and AI sovereignty.

"Occiglot strongly believes that dedicated language modeling solutions are key to maintaining Europe’s academic and economic competitiveness and AI sovereignty," the announcement reads.

Occiglot has initially released ten intermediary 7B model checkpoints focused on the five major European languages: English, German, French, Spanish, and Italian.


The models are based on the existing Mistral 7B model and were further trained on 700 billion additional multilingual tokens in continued pre-training, followed by instruction tuning on about 1 billion tokens. Details can be found in the technical report.

In addition, a multilingual model covering all five languages has been developed. All models are available on Hugging Face under the Apache 2.0 license.
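
For readers who want to try one of the checkpoints, the following is a minimal sketch of loading a model from Hugging Face with the `transformers` library. The organisation name `occiglot` and the repository naming scheme are assumptions made for illustration and are not stated in this article, so check the model pages on Hugging Face for the actual identifiers. The helper names `repo_id`, `load_checkpoint` and `generate` are hypothetical.

```python
# Assumption: the Hugging Face organisation name and the repository naming
# scheme used here are illustrative; the article does not state them.
HF_ORG = "occiglot"


def repo_id(languages: str, instruct: bool = False) -> str:
    """Build a hypothetical repository id, e.g. 'occiglot/occiglot-7b-eu5'."""
    name = f"{HF_ORG}/occiglot-7b-{languages}"
    return f"{name}-instruct" if instruct else name


def load_checkpoint(model_id: str):
    """Download and load tokenizer and model (needs `transformers` and `torch`)."""
    # Imported lazily so that repo_id() works without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model


def generate(model_id: str, prompt: str, max_new_tokens: int = 40) -> str:
    """Generate a short continuation of `prompt` with the given checkpoint."""
    tokenizer, model = load_checkpoint(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate(repo_id("eu5"), "Die Hauptstadt von Hessen ist")` would download several gigabytes of weights on first use, so the sketch keeps the download step separate from the cheap helper that only builds the identifier.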

Occiglot's roadmap for the coming months is to develop a unified language modeling process that supports all 24 official languages of the European Union as well as several unofficial and regional languages. To this end, a corpus of about 1 trillion tokens of non-English pre-training data has already been collected.

The German AI innovation lab hessian.AI has pledged its support: it intends to provide a "significant amount of compute" on its AI supercomputer fortytwo.

Occiglot is looking for partners

Occiglot calls for collaboration and exchange within the academic and non-academic machine learning, AI and natural language processing community.

The project emphasizes the need for high-quality text data for pre-training, especially for low-resource languages, and is searching for partners for the creation and curation of instruction tuning datasets as well as for the evaluation of LLMs. You can get in touch via Discord.

Occiglot's initiators, supported by the German Research Center for Artificial Intelligence (DFKI), the hessian.AI Innovation Lab and the hessian.AISC Service Center, see their initiative as a key to preserving Europe's language and cultural diversity.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • The academic research collective Occiglot publishes ten intermediary 7B model checkpoints for the five major European languages (English, German, French, Spanish and Italian). The aim is to preserve Europe's language diversity and digital sovereignty.
  • The models are based on the Mistral 7B model and have been optimized through bilingual pre-training and instruction tuning. They are available on the Hugging Face platform under the Apache 2.0 license.
  • Occiglot plans to develop a unified language modeling procedure for all 24 official EU languages and several unofficial and regional languages, and is looking for partners for the creation of instruction tuning datasets and the evaluation of LLMs.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.