
A new study from Germany's TU Darmstadt shows that even the most sophisticated AI image models fail at simple visual reasoning tasks.


The researchers tested various vision language models (VLMs) using Bongard problems, simple visual puzzles that most people can solve intuitively. These puzzles, devised by Russian computer scientist Mikhail Bongard, present twelve simple images divided into two groups. The challenge is to identify the rule that separates the groups, a task that tests abstract reasoning skills.
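To make the setup concrete, here is a minimal Python sketch of how a single Bongard problem could be represented. The class and field names are illustrative assumptions, not the researchers' actual benchmark code.

from dataclasses import dataclass

@dataclass
class BongardProblem:
    # Twelve images, split into two groups of six
    left: list[str]    # paths to the six images that follow the hidden rule
    right: list[str]   # paths to the six images that violate it
    rule: str          # ground-truth rule the solver has to verbalize

# Example: a vertical-vs-horizontal problem like the one discussed below
# (file names are hypothetical)
problem = BongardProblem(
    left=[f"vertical_{i}.png" for i in range(6)],
    right=[f"horizontal_{i}.png" for i in range(6)],
    rule="left: vertical lines, right: horizontal lines",
)
# A model counts as correct if it states a rule equivalent to problem.rule.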

Even GPT-4o falls short

The study's findings are striking: The models struggled with basic tasks that most people find simple.

For example, they had trouble distinguishing between vertical and horizontal lines, or determining the direction of rotation of a spiral. These basic visual concepts proved challenging for even the most sophisticated AI models.

While humans can easily distinguish between vertical and horizontal elements, even advanced VLMs such as GPT-4o fail at this fundamental task. | Image: Wüst et al.

GPT-4o, currently considered the most advanced multimodal model, could only solve 21 out of 100 visual puzzles. Other well-known AI models, including Claude, Gemini, and LLaVA, performed even worse.

When analyzing simple visual concepts like rotation directions or spatial relationships, VLMs show inconsistent results. The models particularly struggle with interpreting spirals and orientations. | Image: Wüst et al.

When researchers provided multiple-choice options, the results improved only marginally. The AI models only showed significant improvement when the number of possible answers was severely restricted—under these conditions, GPT-4 and Claude managed to solve 68 and 69 out of 100 puzzles respectively.
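As a rough illustration of these evaluation settings (an open-ended answer versus picking from a list of options), the following Python sketch shows how the prompts might differ. The wording is assumed for illustration and is not taken from the study.

# Open-ended setting: the model must formulate the rule itself.
OPEN_ENDED = (
    "These twelve images form two groups of six. "
    "State the rule that separates the left group from the right group."
)

def choice_prompt(options):
    # Multiple-choice setting: the model only picks the correct rule.
    # Shrinking this option list is what lifted GPT-4 and Claude to 68 and 69
    # out of 100 solved puzzles in the study.
    numbered = "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(options))
    return (
        "These twelve images form two groups of six. "
        "Which of the following rules separates the groups?\n" + numbered
    )

print(choice_prompt([
    "left: vertical lines, right: horizontal lines",
    "left: clockwise spirals, right: counterclockwise spirals",
]))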

Performance data reveals clear limitations of current VLMs: even the best model, GPT-4o, solves only 21 out of 100 classic Bongard problems. Success rates only increase with severely limited answer options. | Image: Wüst et al.

The researchers examined the reasons for the models' failures in detail for four selected problems. They found that the AI systems sometimes fail at the level of basic visual perception, before the actual "thinking" and "reasoning" stages even begin. However, they could not identify a single clear cause.

Rethinking AI evaluation benchmarks

The study raises questions about the evaluation of AI systems and suggests that existing benchmarks may not accurately measure the true reasoning abilities of models. The research team recommends rethinking these benchmarks to better assess the visual reasoning abilities of AI.

"Our findings raise several critical questions: Why do VLMs encounter difficulties with seemingly simple Bongard Problems, despite performing impressively across various established VLM benchmarks? How meaningful are these benchmarks in assessing true reasoning capabilities?" the researchers write.


The study was conducted by Technische Universität Darmstadt in collaboration with Eindhoven University of Technology and the German Research Center for Artificial Intelligence (DFKI), with funding from the German Federal Ministry of Education and Research and the European Union.

Summary
  • A study by TU Darmstadt and other institutions shows that even advanced AI image models like GPT-4o fail to solve simple visual puzzles known as Bongard problems, which consist of twelve images in two groups; the task is to find the rule that distinguishes the groups.
  • The models tested had considerable difficulty solving the puzzles, with GPT-4o only able to solve 21 out of 100 problems, and other models such as Claude, Gemini, and LLaVA performing even worse, highlighting a "significant gap" between human and machine visual intelligence.
  • The researchers question common logic benchmarks used to evaluate AI systems, suggesting that these benchmarks may not accurately measure the true extent of an AI model's visual reasoning abilities.