European Research Foundation wants to promote open-source LLM diversity

Mar 8, 2024 Matthias Bastian

With the publication of ten intermediary 7B model checkpoints for five European languages, the academic research collective Occiglot takes a step towards preserving European language diversity and digital sovereignty.

US tech companies dominate the market for large language models (LLMs). The academic, non-profit research collective Occiglot wants to counteract this and strengthen Europe's academic and economic competitiveness and AI sovereignty.

"Occiglot strongly believes that dedicated language modeling solutions are key to maintaining Europe’s academic and economic competitiveness and AI sovereignty," the announcement reads.

Occiglot has initially released ten intermediary 7B model checkpoints focused on the five major European languages: English, German, French, Spanish, and Italian.

The models are based on the existing Mistral 7B model and were further trained on 700 billion additional multilingual tokens in continued pre-training, followed by about 1 billion tokens of instruction tuning. Details can be found in the technical report.

In addition, a multilingual model covering all five languages has been developed. All models are available on Hugging Face under the Apache 2.0 license.

Occiglot's roadmap for the coming months is to develop a unified language modeling process that supports all 24 official languages of the European Union as well as several unofficial and regional languages. To this end, a corpus of about 1 trillion tokens of non-English pre-training data has already been collected.

The German AI innovation lab hessian.AI has pledged its support and intends to provide a "significant amount of compute" on its AI supercomputer fortytwo.

Occiglot is looking for partners

Occiglot calls for collaboration and exchange within the academic and non-academic machine learning, AI and natural language processing community.

The project emphasizes the need for high-quality text data for pre-training, especially for low-resource languages, and is searching for partners for the creation and curation of instruction tuning datasets as well as for the evaluation of LLMs. You can get in touch via Discord.

Occiglot's initiators, supported by the German Research Center for Artificial Intelligence (DFKI), the hessian.AI Innovation Lab and the hessian.AISC Service Center, see their initiative as key to preserving Europe's linguistic and cultural diversity.

Sources:

Occiglot