A new study from Germany's TU Darmstadt shows that even the most advanced AI models with image understanding fail at simple visual reasoning tasks.
The researchers tested various vision language models (VLMs) using Bongard problems: simple visual puzzles that most people can solve intuitively. These puzzles, devised by the Russian computer scientist Mikhail Bongard, present twelve simple images divided into two groups of six. The challenge is to identify the rule that separates the two groups, a task that tests abstract reasoning skills.
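The puzzle format is easy to picture as data. The sketch below is purely illustrative and not taken from the study: the class and field names are invented, and it simply models one Bongard problem as two groups of six image files plus a hidden rule, along with a generic free-form prompt one might send to a VLM.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BongardProblem:
    """One Bongard problem: twelve images split into two groups of six.

    The solver must state the rule separating the groups
    (e.g. "group A: vertical lines, group B: horizontal lines").
    Field names are illustrative, not taken from the paper.
    """
    left_images: List[str]    # paths to the six images that satisfy the rule
    right_images: List[str]   # paths to the six images that violate it
    ground_truth_rule: str    # hidden rule, used only for scoring

def build_prompt(problem: BongardProblem) -> str:
    """Builds a free-form text prompt asking a VLM to name the separating rule."""
    total = len(problem.left_images) + len(problem.right_images)
    return (
        f"You are shown {total} images. "
        f"The first {len(problem.left_images)} belong to group A, "
        f"the last {len(problem.right_images)} to group B. "
        "In one sentence, state the visual rule that every group A image "
        "satisfies and every group B image violates."
    )

# Example usage with placeholder file names:
problem = BongardProblem(
    left_images=[f"bp001_left_{i}.png" for i in range(6)],
    right_images=[f"bp001_right_{i}.png" for i in range(6)],
    ground_truth_rule="group A: vertical lines, group B: horizontal lines",
)
print(build_prompt(problem))
```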
Even GPT-4o falls short
The study's findings are striking: the models struggled with visual tasks that most people find trivially easy.
For example, they had trouble distinguishing vertical from horizontal lines, or determining which way a spiral rotates. Even the most capable models stumbled over these basic visual concepts.
GPT-4o, currently considered the most advanced multimodal model, could only solve 21 out of 100 visual puzzles. Other well-known AI models, including Claude, Gemini, and LLaVA, performed even worse.
When the researchers provided multiple-choice options, results improved only marginally. The models improved significantly only when the number of possible answers was sharply restricted: under these conditions, GPT-4 and Claude solved 68 and 69 of the 100 puzzles, respectively.
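To make the multiple-choice setup concrete, here is a minimal, hypothetical scoring loop: the model is shown a small set of candidate rules and picks one, and accuracy is the fraction of puzzles where its pick matches the ground truth. The `ask_model` callable is a stand-in for whatever VLM API is used; nothing here reproduces the paper's actual evaluation code.

```python
import random
from typing import Callable, List, Tuple

def evaluate_multiple_choice(
    problems: List[Tuple[str, List[str]]],   # (ground_truth_rule, distractor_rules)
    ask_model: Callable[[List[str]], int],   # stand-in for a VLM call; returns chosen index
    num_options: int = 2,                    # size of the candidate set shown to the model
) -> float:
    """Returns accuracy over a list of puzzles under a restricted answer set."""
    correct = 0
    for ground_truth, distractors in problems:
        # Build the option list: the true rule plus (num_options - 1) distractors.
        options = [ground_truth] + distractors[: num_options - 1]
        random.shuffle(options)
        chosen = ask_model(options)          # model picks one option by index
        if options[chosen] == ground_truth:
            correct += 1
    return correct / len(problems)

# Example with a dummy "model" that guesses at random:
dummy_problems = [
    ("vertical lines vs. horizontal lines", ["large shapes vs. small shapes"]),
    ("clockwise spirals vs. counterclockwise spirals", ["filled vs. outlined shapes"]),
]
accuracy = evaluate_multiple_choice(dummy_problems, lambda opts: random.randrange(len(opts)))
print(f"accuracy: {accuracy:.2f}")
```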
For four selected problems, the researchers examined the failures in detail. They found that the AI systems sometimes break down at the level of basic visual perception, before any actual "thinking" or "reasoning" comes into play, but they could not pin down a single clear cause.
Rethinking AI evaluation benchmarks
The study raises questions about how AI systems are evaluated, suggesting that existing benchmarks may not accurately measure genuine reasoning ability. The research team recommends rethinking these benchmarks to better assess the visual reasoning capabilities of AI models.
"Our findings raise several critical questions: Why do VLMs encounter difficulties with seemingly simple Bongard Problems, despite performing impressively across various established VLM benchmarks? How meaningful are these benchmarks in assessing true reasoning capabilities?" the researchers write.
The study was conducted by Technische Universität Darmstadt in collaboration with Eindhoven University of Technology and the German Research Center for Artificial Intelligence (DFKI), with funding from the German Federal Ministry of Education and Research and the European Union.