AI in practice

Silo AI's Poro34B open-source LLM aims to master all official EU languages

Matthias Bastian

DALL-E 3 prompted by THE DECODER

Helsinki-based AI startup Silo AI has launched Poro, an open-source large language model (LLM) aimed at advancing multilingual AI capabilities for European languages and code.

Developed by SiloGen, Silo AI's generative AI division, and the TurkuNLP research group at the University of Turku, Poro is the first in a planned series of models covering all official languages of the European Union, "with the aim of ensuring European digital sovereignty and democratizing access to LLMs." Silo AI describes itself as "the largest private AI lab in the Nordics that builds AI as a service."

The 34.2 billion parameter Poro 34B model uses a BLOOM transformer architecture with ALiBi embeddings and is trained on a multilingual dataset of one trillion tokens covering English, Finnish, and programming languages such as Python and Java. Training is currently about 30 percent complete and is being run on LUMI, Europe's fastest supercomputer, located in Finland.
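ALiBi (Attention with Linear Biases), the positional scheme mentioned above, replaces learned position embeddings with a fixed, head-specific linear penalty on attention scores based on query-key distance. A minimal sketch of how those biases are computed (the head count and sequence length here are illustrative, not Poro's actual configuration):

```python
def alibi_slopes(n_heads: int) -> list[float]:
    """Per-head slopes as defined in the ALiBi paper: a geometric
    sequence with ratio 2^(-8/n) for n heads (n a power of two)."""
    ratio = 2 ** (-8 / n_heads)
    return [ratio ** (i + 1) for i in range(n_heads)]

def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    """Causal bias matrix added to one head's attention scores:
    each query at position i penalizes a key at position j <= i
    in proportion to the distance (i - j); future keys are masked."""
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)         # 0.5, 0.25, ..., down to 2**-8
bias = alibi_bias(slopes[0], 4)  # 4-token causal bias for the first head
```

Because the penalty is a fixed function of distance rather than a learned embedding, ALiBi-based models can extrapolate to sequences longer than those seen during training, which is one reason the architecture is attractive for large-scale pretraining runs like this one.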

The model uses a cross-lingual training approach to address the challenge of building high-performance language models for under-resourced European languages. Progress is documented through the Poro Research Checkpoints program, which provides transparency into the training process.

In benchmarks, Poro 34B achieves state-of-the-art results in the low-resource language Finnish without sacrificing its English capabilities. After LeoLM, the German-specific language model recently trained by LAION and Hessian.AI, this makes Poro the second European language-specific LLM to perform well in both English and its target language.

Poro is freely available under the Apache 2.0 license, making it suitable for both commercial and research use. You can see its model card here.

LLMs like GPT-4 perform well in many languages, but are typically strongest in English because it dominates the dataset.

With France's Mistral 7B showing competitive overall performance and Germany's Aleph Alpha recently receiving a 500 million investment, Europe finally appears to be getting its act together - if we don't count the actual EU AI Act, which seems to be taking a little longer.
