
A new testing platform called BALROG shows that even top AI language models fail when faced with complex gaming challenges. OpenAI's GPT-4o, the best performer in the tests, only achieved 32 percent of possible scores across all games tested.


The testing platform evaluates both large language models (LLMs) and visual language models (VLMs) across various gaming scenarios. While GPT-4o scored 78 percent in basic navigation tasks using BabyAI, it struggled with more complex games. In Crafter, a Minecraft-style resource management game, GPT-4o only reached 33 percent progress.
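Benchmarks of this kind typically drive a model through a game in a loop: the environment emits a textual (or visual) observation, the model's reply is parsed into an action, and progress is scored at the end of the episode. The following is a minimal sketch of such a loop; the names (`GridWorld`, `query_model`, `evaluate`) are hypothetical illustrations, not the actual BALROG API.

```python
class GridWorld:
    """Toy text environment: the agent must walk right to reach the goal."""
    def __init__(self, size=4):
        self.size = size
        self.pos = 0

    def observe(self):
        # The observation a harness would render into the model's prompt.
        return f"You are at cell {self.pos}; the goal is at cell {self.size - 1}."

    def step(self, action):
        # Apply the parsed action; return True when the episode is over.
        if action == "right":
            self.pos = min(self.pos + 1, self.size - 1)
        elif action == "left":
            self.pos = max(self.pos - 1, 0)
        return self.pos == self.size - 1

def query_model(observation):
    """Stand-in for an LLM call: a real harness would send the observation
    as a prompt and parse the model's reply into a valid action."""
    return "right"

def evaluate(env, max_steps=10):
    """Run one episode and return progress as a fraction of the goal distance."""
    for _ in range(max_steps):
        action = query_model(env.observe())
        if env.step(action):
            break
    return env.pos / (env.size - 1)

print(evaluate(GridWorld()))  # 1.0 — the stub agent always reaches the goal
```

A real harness would replace `query_model` with an API call to the model under test, which is exactly where the reported scores diverge: the stub trivially solves this toy task, while actual models fall short on complex environments.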

Meta's Llama 3.2 surprisingly outperformed GPT-4o in Baba Is AI, a puzzle game about manipulating game world rules, scoring 44 percent compared to GPT-4o's 34 percent. In TextWorld's text-based puzzles, GPT-4o and Claude 3.5 Sonnet both achieved just over 40 percent, while other models scored below 20 percent.

A sample interaction of a language model with the Baba Is You game.

The results turned particularly grim in NetHack, a complex game requiring long-term planning and adaptation. No model achieved more than 1.5 percent progress. Similarly, in MiniHack's combat and exploration tasks, all models failed completely when tested without prior training.


GPT-4o is the strongest model to date

The models performed even worse when processing visual information than with text-only input. The study reveals serious shortcomings in visual decision-making, the research team noted: the models struggled whenever they had to act on image-based representations of the game state.

The researchers emphasize that these results highlight crucial gaps in current AI capabilities, particularly in applying abstract knowledge to specific situations, but could also guide future research directions.

The current list of best models can be viewed on the BALROG project page.

Summary
  • Researchers have developed BALROG, a benchmark platform that tests large language models (LLMs) and visual language models (VLMs) in a variety of game environments, from simple tasks to complex games such as the NetHack Learning Environment.
  • The test results show clear limitations of current AI language models: even top performers such as OpenAI's GPT-4o scored on average only 32 percent of the possible points. In complex games requiring long-term planning, all models failed almost completely.
  • The deficits in image-based decision-making were particularly striking: When the language models were presented with visual representations of the game environments, they performed even worse than with purely text-based input. The researchers see this as an important contribution to better understanding the capabilities and limitations of current AI systems, and to highlighting the need for improvement.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.