Researchers have introduced a collection of three datasets based on African languages for evaluating large language models. The benchmark set, dubbed IrokoBench, aims to fill a gap in AI research.

Most current large language models are primarily trained on resource-rich languages like English. According to researchers from the Masakhane initiative, this leads to suboptimal performance in languages not considered during training, especially African languages. The initiative has made it its mission to advance NLP research in African languages.

While there are already efforts to create benchmarks for African languages, they mostly focus on simpler tasks or are limited to narrow areas such as machine translation or reading comprehension.

The researchers therefore argue that current multilingual evaluations of language models do not accurately reflect their capabilities on complex, knowledge-intensive tasks in most African languages.


The few available comprehensive evaluations across languages often use machine-translated English benchmarks. This approach suffers from noise and distortions.

IrokoBench: Three datasets for complex tasks in 16 African languages

With IrokoBench, the researchers aim to improve both the diversity and breadth of evaluation coverage. The collection consists of three datasets translated into 16 African languages by human translators:

  • AfriXNLI for natural language inference (NLI)
  • AfriMMLU for multiple-choice knowledge question answering in areas such as geography, law, and mathematics
  • AfriMGSM for mathematical reasoning based on mathematical word problems

The selected languages cover different regions and language families in Africa. They include very low-resource languages with fewer than 50 million digital characters of available text, such as Ewe, Lingala, Luganda, Twi, and Wolof.

The researchers conducted a large-scale evaluation on IrokoBench with 10 publicly available and 4 proprietary language models, such as OpenAI's GPT-4o, in zero-shot, few-shot, and translate-test settings, where the test datasets were machine-translated into English before being passed to the model.
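The scoring logic behind such a multiple-choice evaluation can be sketched in a few lines. The sketch below is illustrative only: the items, the `predict` stub, and all names are hypothetical stand-ins, not the paper's actual harness, which would query a real LLM.

```python
# Illustrative sketch of scoring multiple-choice items (AfriMMLU-style)
# in the zero-shot and translate-test settings. All data here is a toy
# stand-in; a real evaluation would call an actual language model.

def accuracy(items, predict):
    """Fraction of items where the model's pick matches the gold answer."""
    correct = sum(
        1 for item in items
        if predict(item["question"], item["choices"]) == item["answer"]
    )
    return correct / len(items)

# Toy items: the original-language form and its machine-translated
# English form used by the translate-test setting (hypothetical content).
items_swahili = [
    {"question": "Mji mkuu wa Kenya ni upi?",
     "choices": ["Nairobi", "Lagos", "Accra", "Dakar"],
     "answer": "Nairobi"},
]
items_translated = [
    {"question": "What is the capital of Kenya?",
     "choices": ["Nairobi", "Lagos", "Accra", "Dakar"],
     "answer": "Nairobi"},
]

# Stub "model" that always picks the first choice, for demonstration only.
def first_choice(question, choices):
    return choices[0]

print(accuracy(items_swahili, first_choice))
print(accuracy(items_translated, first_choice))
```

Comparing the two accuracy numbers for a real model is exactly what exposes the gap between prompting in the African language directly and translating the test data into English first.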

Significant performance gaps between languages

The evaluation revealed a large performance gap of about 45 percent on average between resource-rich languages like English and the tested African languages - across all evaluated language models.

Even proprietary models, which tended to perform better in African languages than open models, showed significant performance degradation. The mathematical reasoning tasks in AfriMGSM proved to be the most difficult, followed by AfriMMLU and AfriXNLI.

"These results underline the need for focused development and adaptation of LLMs to better support African languages, especially those with limited data resources," the authors conclude.

The IrokoBench project has been published on Hugging Face. The researchers hope it will advance multilingual evaluation and research on language models.

Summary
  • Researchers at the Masakhane Initiative have unveiled IrokoBench, a collection of three datasets for evaluating language models in 16 African languages, to fill a gap in AI research.
  • IrokoBench consists of human-translated datasets for natural language inference (AfriXNLI), multiple-choice knowledge question answering (AfriMMLU), and mathematical reasoning (AfriMGSM) in languages including Ewe, Lingala, Luganda, Twi, and Wolof.
  • The evaluation of 14 language models on IrokoBench showed an average performance difference of about 45 percent between resource-rich languages such as English and the African languages tested.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.