
A new testing platform called BALROG shows that even top AI language models fail when faced with complex gaming challenges. OpenAI's GPT-4o, the best performer in the tests, only achieved 32 percent of possible scores across all games tested.


The testing platform evaluates both large language models (LLMs) and visual language models (VLMs) across various gaming scenarios. While GPT-4o scored 78 percent in basic navigation tasks using BabyAI, it struggled with more complex games. In Crafter, a Minecraft-style resource management game, GPT-4o only reached 33 percent progress.
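Benchmarks of this kind typically drive a model through a game in a loop: the environment emits a textual (or visual) observation, the model's reply is parsed into an action, and progress is scored at the end of the episode. The following is a minimal sketch of such a loop; the names (`GridWorld`, `query_model`, `evaluate`) are hypothetical illustrations, not the actual BALROG API.

```python
class GridWorld:
    """Toy text environment: the agent must walk right to reach the goal."""
    def __init__(self, size=4):
        self.size = size
        self.pos = 0

    def observe(self):
        # The observation a harness would render into the model's prompt.
        return f"You are at cell {self.pos}; the goal is at cell {self.size - 1}."

    def step(self, action):
        # Apply the parsed action; return True when the episode is over.
        if action == "right":
            self.pos = min(self.pos + 1, self.size - 1)
        elif action == "left":
            self.pos = max(self.pos - 1, 0)
        return self.pos == self.size - 1

def query_model(observation):
    """Stand-in for an LLM call: a real harness would send the observation
    as a prompt and parse the model's reply into a valid action."""
    return "right"

def evaluate(env, max_steps=10):
    """Run one episode and return progress as a fraction of the goal distance."""
    for _ in range(max_steps):
        action = query_model(env.observe())
        if env.step(action):
            break
    return env.pos / (env.size - 1)

print(evaluate(GridWorld()))  # 1.0 — the stub agent always reaches the goal
```

A real harness would replace `query_model` with an API call to the model under test, which is exactly where the reported scores diverge: the stub trivially solves this toy task, while actual models fall short on complex environments.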

Meta's Llama 3.2 surprisingly outperformed GPT-4o in Baba Is AI, a puzzle game about manipulating game world rules, scoring 44 percent compared to GPT-4o's 34 percent. In TextWorld's text-based puzzles, GPT-4o and Claude 3.5 Sonnet both achieved just over 40 percent, while other models scored below 20 percent.

A sample interaction of a language model with the Baba Is You game.

The results turned particularly grim in NetHack, a complex game requiring long-term planning and adaptation. No model achieved more than 1.5 percent progress. Similarly, in MiniHack's combat and exploration tasks, all models failed completely when tested without prior training.


GPT-4o is the strongest model to date

The models performed even worse when processing visual information than with text-only input. The study reveals serious shortcomings in visual decision-making, the research team noted: the models struggled whenever they had to act on image-based representations of the game state.

The researchers emphasize that these results highlight crucial gaps in current AI capabilities, particularly in applying abstract knowledge to specific situations, but could also guide future research directions.

The current list of best models can be viewed on the BALROG project page.

Summary
  • Researchers have developed BALROG, a benchmark platform that tests large language models (LLMs) and visual language models (VLMs) in a variety of game environments, from simple tasks to complex games such as the NetHack Learning Environment.
  • The test results show clear limitations of current AI language models: even top performers such as OpenAI's GPT-4o scored on average only 32 percent of the possible points. In complex games requiring long-term planning, all models failed almost completely.
  • The deficits in image-based decision-making were particularly striking: When the language models were presented with visual representations of the game environments, they performed even worse than with purely text-based input. The researchers see this as an important contribution to better understanding the capabilities and limitations of current AI systems, and to highlighting the need for improvement.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.