Researchers have introduced a collection of three datasets for evaluating large language models on African languages. The benchmark suite, dubbed IrokoBench, aims to fill a gap in AI research.
Most current large language models are trained primarily on resource-rich languages like English. According to researchers from the Masakhane initiative, this leads to poor performance in languages not considered during training, especially African languages. The initiative's mission is to advance NLP research on African languages.
While there are already efforts to create benchmarks for African languages, they mostly focus on simpler tasks or are limited to narrow areas such as machine translation or reading comprehension.
As a result, the researchers argue, current multilingual evaluations do not accurately reflect model capabilities on complex, knowledge-intensive tasks in most African languages. The few comprehensive cross-lingual evaluations that do exist often rely on machine-translated versions of English benchmarks, an approach that introduces noise and translation artifacts.
IrokoBench: Three datasets for complex tasks in 16 African languages
With IrokoBench, the researchers aim to improve both the diversity and breadth of evaluation coverage. The collection consists of three datasets translated into 16 African languages by human translators:
- AfriXNLI for natural language inference (NLI)
- AfriMMLU for multiple-choice knowledge question answering in areas such as geography, law, and mathematics
- AfriMGSM for mathematical reasoning over math word problems
The selected languages cover different regions and language families in Africa. They include very low-resource languages with fewer than 50 million characters of available digital text, such as Ewe, Lingala, Luganda, Twi, and Wolof.
The researchers conducted a large-scale evaluation on IrokoBench with 10 publicly available and 4 proprietary language models, including OpenAI's GPT-4o, in zero-shot, few-shot, and translate-test settings; in the translate-test setting, the test data is first machine-translated into English before being fed to the model.
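For illustration, here is a minimal Python sketch of how the three settings differ on an AfriXNLI-style example. The `query_llm` function is a hypothetical stand-in for whatever model API is used, and the prompt template is invented for this sketch, not taken from the paper:

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. an API client)."""
    return "entailment"  # dummy answer so the sketch runs end to end


def zero_shot_prompt(premise: str, hypothesis: str) -> str:
    # Zero-shot: the model sees only the task instruction and the example,
    # in the original African language.
    return (
        "Does the premise entail the hypothesis? "
        "Answer entailment, neutral, or contradiction.\n"
        f"Premise: {premise}\nHypothesis: {hypothesis}\nAnswer:"
    )


def few_shot_prompt(examples, premise, hypothesis) -> str:
    # Few-shot: a handful of labeled demonstrations precede the test example.
    shots = "\n\n".join(
        f"Premise: {p}\nHypothesis: {h}\nAnswer: {label}"
        for p, h, label in examples
    )
    return shots + "\n\n" + zero_shot_prompt(premise, hypothesis)


def translate_test(premise, hypothesis, translate) -> str:
    # Translate-test: machine-translate the inputs into English first,
    # then prompt the model in English.
    return zero_shot_prompt(translate(premise), translate(hypothesis))


# Illustrative Swahili example (invented for this sketch)
print(query_llm(zero_shot_prompt("Mvua inanyesha.", "Kuna mvua.")))
```

The point of the translate-test setting is to separate two failure modes: a model may understand the task but not the language, in which case translating the input into English recovers much of the performance.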
Significant performance gaps between languages
Across all evaluated models, the analysis revealed an average performance gap of about 45 percent between high-resource languages such as English and the African languages tested.
Even the proprietary models, which tended to outperform open models on African languages, showed substantial degradation. The mathematical reasoning tasks in AfriMGSM proved the most difficult, followed by AfriMMLU and AfriXNLI.
"These results underline the need for focused development and adaptation of LLMs to better support African languages, especially those with limited data resources," the authors conclude.
The IrokoBench datasets have been published on Hugging Face. The team hopes they will advance multilingual evaluation and research on language models.
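For readers who want to try the benchmark, loading it could look like the following sketch using the Hugging Face `datasets` library. The repository IDs and the language config shown here are assumptions based on the Masakhane organization's naming conventions; check the Hugging Face Hub for the exact names:

```python
from datasets import load_dataset

# Assumed repository IDs under the Masakhane org and an assumed
# language config ("swa" for Swahili) -- verify on the Hub.
afrixnli = load_dataset("masakhane/afrixnli", "swa")
afrimmlu = load_dataset("masakhane/afrimmlu", "swa")
afrimgsm = load_dataset("masakhane/afrimgsm", "swa")

# Inspect one test example from the NLI dataset
print(afrixnli["test"][0])
```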