Researchers have introduced a collection of three datasets for evaluating large language models on African languages. The benchmark suite, dubbed IrokoBench, aims to fill a gap in AI research.
Most current large language models are trained primarily on resource-rich languages like English. According to researchers from the Masakhane initiative, this leads to poor performance in languages not considered during training, especially African languages. The initiative's mission is to advance NLP research on African languages.
While there are already efforts to create benchmarks for African languages, they mostly focus on simpler tasks or are limited to narrow areas such as machine translation or reading comprehension.
As a result, the researchers argue, current multilingual evaluations do not accurately reflect model capabilities on complex, knowledge-intensive tasks in most African languages. The few comprehensive cross-lingual evaluations that do exist often rely on machine-translated versions of English benchmarks, an approach that introduces noise and translation artifacts.
IrokoBench: Three datasets for complex tasks in 16 African languages
With IrokoBench, the researchers aim to improve both the diversity and breadth of evaluation coverage. The collection consists of three datasets translated into 16 African languages by human translators:
- AfriXNLI for natural language inference (NLI)
- AfriMMLU for multiple-choice knowledge question answering in areas such as geography, law, and mathematics
- AfriMGSM for mathematical reasoning over math word problems
The selected languages cover different regions and language families in Africa. They include very low-resource languages with fewer than 50 million characters of available digital text, such as Ewe, Lingala, Luganda, Twi, and Wolof.
The researchers conducted a large-scale evaluation on IrokoBench with 10 publicly available and 4 proprietary language models, including OpenAI's GPT-4o, in zero-shot, few-shot, and translate-test settings; in the translate-test setting, the test data is first machine-translated into English before being fed to the model.
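For illustration, here is a minimal Python sketch of how the three settings differ on an AfriXNLI-style example. The `query_llm` function is a hypothetical stand-in for whatever model API is used, and the prompt template is invented for this sketch, not taken from the paper:

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. an API client)."""
    return "entailment"  # dummy answer so the sketch runs end to end


def zero_shot_prompt(premise: str, hypothesis: str) -> str:
    # Zero-shot: the model sees only the task instruction and the example,
    # in the original African language.
    return (
        "Does the premise entail the hypothesis? "
        "Answer entailment, neutral, or contradiction.\n"
        f"Premise: {premise}\nHypothesis: {hypothesis}\nAnswer:"
    )


def few_shot_prompt(examples, premise, hypothesis) -> str:
    # Few-shot: a handful of labeled demonstrations precede the test example.
    shots = "\n\n".join(
        f"Premise: {p}\nHypothesis: {h}\nAnswer: {label}"
        for p, h, label in examples
    )
    return shots + "\n\n" + zero_shot_prompt(premise, hypothesis)


def translate_test(premise, hypothesis, translate) -> str:
    # Translate-test: machine-translate the inputs into English first,
    # then prompt the model in English.
    return zero_shot_prompt(translate(premise), translate(hypothesis))


# Illustrative Swahili example (invented for this sketch)
print(query_llm(zero_shot_prompt("Mvua inanyesha.", "Kuna mvua.")))
```

The point of the translate-test setting is to separate two failure modes: a model may understand the task but not the language, in which case translating the input into English recovers much of the performance.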
Significant performance gaps between languages
Across all evaluated models, the analysis revealed an average performance gap of about 45 percent between high-resource languages such as English and the African languages tested.
Even the proprietary models, which tended to outperform open models on African languages, showed substantial degradation. The mathematical reasoning tasks in AfriMGSM proved the most difficult, followed by AfriMMLU and AfriXNLI.
"These results underline the need for focused development and adaptation of LLMs to better support African languages, especially those with limited data resources," the authors conclude.
The IrokoBench datasets have been published on Hugging Face. The team hopes they will advance multilingual evaluation and research on language models.
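For readers who want to try the benchmark, loading it could look like the following sketch using the Hugging Face `datasets` library. The repository IDs and the language config shown here are assumptions based on the Masakhane organization's naming conventions; check the Hugging Face Hub for the exact names:

```python
from datasets import load_dataset

# Assumed repository IDs under the Masakhane org and an assumed
# language config ("swa" for Swahili) -- verify on the Hub.
afrixnli = load_dataset("masakhane/afrixnli", "swa")
afrimmlu = load_dataset("masakhane/afrimmlu", "swa")
afrimgsm = load_dataset("masakhane/afrimgsm", "swa")

# Inspect one test example from the NLI dataset
print(afrixnli["test"][0])
```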