European Research Foundation wants to promote open-source LLM diversity

With the publication of ten intermediary 7B model checkpoints for five European languages, the academic research collective Occiglot takes a step towards preserving European language diversity and digital sovereignty.

US tech companies dominate the market for large language models (LLMs). The academic, non-profit research collective Occiglot wants to counteract this dominance and, in doing so, strengthen Europe's academic and economic competitiveness and AI sovereignty.

"Occiglot strongly believes that dedicated language modeling solutions are key to maintaining Europe’s academic and economic competitiveness and AI sovereignty," the announcement reads.

Occiglot has initially released ten intermediary 7B model checkpoints focused on the five major European languages: English, German, French, Spanish, and Italian.

The models are based on the existing Mistral 7B model and were further trained on 700 billion additional multilingual tokens in continued pre-training, followed by about 1 billion tokens of instruction tuning. Details can be found in the technical report.

In addition, a multilingual model covering all five languages has been developed. All models are available on Hugging Face under the Apache 2.0 license.

Occiglot's roadmap for the coming months is to develop a unified language modeling process that supports all 24 official languages of the European Union as well as several unofficial and regional languages. To this end, a corpus of about 1 trillion tokens of non-English pre-training data has already been collected.

The German AI innovation lab hessian.AI has pledged its support: it intends to provide a "significant amount of compute" on its AI supercomputer fortytwo.

Occiglot is looking for partners

Occiglot calls for collaboration and exchange within the academic and non-academic machine learning, AI and natural language processing community.

The project emphasizes the need for high-quality text data for pre-training, especially for low-resource languages, and is searching for partners for the creation and curation of instruction tuning datasets as well as for the evaluation of LLMs. You can get in touch via Discord.

Occiglot's initiators, supported by the German Research Center for Artificial Intelligence (DFKI), the hessian.AI Innovation Lab and the hessian.AISC Service Center, see their initiative as a key to preserving Europe's language and cultural diversity.
