
Even the best AI models fail at visual tasks toddlers handle easily

Image: Nano Banana Pro, prompted by THE DECODER

A new study exposes a fundamental weakness in today's AI systems: even the most capable multimodal language models can't handle basic visual tasks that toddlers master before they learn to speak.

Multimodal AI models score above 90 percent on expert knowledge tests like MMMU. But a new study by UniPat AI reveals a striking gap: these same systems fall apart on basic visual tasks that humans learn before they can talk. The best model tested, Gemini-3-Pro-Preview, managed just 49.7 percent. Human adults hit 94.1 percent.

Visual puzzle task showing a hexagonal honeycomb structure with a white gap. Below are four answer options A through D with differently shaped hexagon fragments. The correct answer is option B, but the AI model incorrectly chose option D because it couldn't accurately perceive the exact shape.
Gemini-3-Pro-Preview picked the wrong answer D instead of the correct option B on this fine-grained visual perception task. The model over-verbalized the geometry and missed the exact contour. | Image: Chen et al.

Researchers from Chinese institutions including UniPat AI, Peking University, Alibaba Group, and MoonShot AI built the "BabyVision" benchmark with 388 tasks across four categories. These test skills that developmental psychology research shows humans develop in their first months of life: fine-grained visual discrimination (like spotting subtle differences between similar patterns), following lines through mazes or across intersections, spatial perception (counting hidden 3D blocks, for example), and visual pattern recognition involving rotations and reflections.
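To make the setup concrete, here is a minimal sketch in Python of what scoring a model on a multiple-choice benchmark of this kind could look like. The task fields, category labels, and the `model.ask()` interface are illustrative assumptions, not the actual BabyVision format or evaluation code.

```python
from dataclasses import dataclass

# Hypothetical sketch of a BabyVision-style multiple-choice task record and
# scoring loop. Field names and the model interface are assumptions for
# illustration, not the benchmark's real schema.

@dataclass
class VisualTask:
    image_path: str       # puzzle image shown to the model
    question: str         # e.g. "Which fragment completes the honeycomb?"
    options: list[str]    # answer choices, e.g. ["A", "B", "C", "D"]
    answer: str           # ground-truth letter
    category: str         # e.g. "fine-grained discrimination", "spatial perception"

def evaluate(model, tasks: list[VisualTask]) -> dict[str, float]:
    """Return per-category accuracy for a model exposing ask(image_path, prompt) -> str."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for task in tasks:
        prompt = (
            f"{task.question}\n"
            f"Options: {', '.join(task.options)}\n"
            "Answer with a single letter."
        )
        prediction = model.ask(task.image_path, prompt).strip().upper()[:1]
        total[task.category] = total.get(task.category, 0) + 1
        if prediction == task.answer:
            correct[task.category] = correct.get(task.category, 0) + 1
    return {cat: correct.get(cat, 0) / n for cat, n in total.items()}
```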

Most frontier models score below the average three-year-old

A comparison test with 80 children across different age groups showed just how wide the gap really is. Most frontier models tested scored below the average for three-year-olds. Only Gemini-3-Pro-Preview consistently beat this group, but it still trailed typical six-year-olds by about 20 percentage points.

Bar chart comparing AI models and children of various age groups on the BabyVision-Mini test. Gray bars show AI models with scores between 5 and 45 percent. Orange bars show children aged 3 to 12 with scores between 40 and 90 percent. Grok4 reaches about 5 percent, Claude-4.5-Opus about 12 percent, GPT-5.2 about 20 percent. Three-year-olds reach about 40 percent, Gemini-3-Pro-Preview about 45 percent, six-year-olds about 65 percent, and twelve-year-olds about 88 percent.
Most AI models performed worse than three-year-olds. Only Gemini-3-Pro-Preview beats the toddlers, but it still lags well behind six-year-olds. | Image: Chen et al.

Among proprietary models, Gemini 3 Pro leads by a wide margin. GPT-5.2 follows with 34.4 percent, ByteDance's Doubao-1.8 hits 30.2 percent, and Claude 4.5 Opus manages just 14.2 percent. Open-source models fare even worse: the best performer, Qwen3VL-235B-Thinking, scores only 22.2 percent.

The results get especially stark for specific task types. On counting 3D blocks, even the best model reaches just 20.5 percent, while humans score 100 percent. On the "Lines Observation" task, where lines must be traced through intersections, only Gemini-3-Pro-Preview posts a meaningful score at 83.3 percent; most other models score zero.

Radar chart showing 22 task types from the BabyVision benchmark across four categories. The dashed black line for human performance runs near the 100 percent mark at the outer edge. Colored lines for six AI models run much further inside with values mostly between 10 and 60 percent. Gemini-3-Pro-Preview in red shows the best AI performance but doesn't reach the human baseline in any category.
The dashed line shows human performance at nearly 100 percent across all categories. Every AI model fell far behind, especially on visual tracking and spatial perception. | Image: Chen et al.

Language-first processing creates a visual blind spot

The researchers trace all these failures to a single problem they call the "verbalization bottleneck." Current multimodal models translate visual input into language representations before reasoning about it. Any visual information that can't be expressed in words gets lost along the way.

Overview showing four example tasks from the BabyVision benchmark. From left to right: a grid with 49 tiger patterns where one different pattern must be found; a maze with three entrances; tangled lines connecting animals to environments; a penguin with six shadow options. Below are the corresponding questions and correct answers, plus generative variants where the solution is marked by drawing.
Example tasks from the BabyVision benchmark. Top row shows input images, middle row shows language-based questions and answers, bottom row shows the generative BabyVision-Gen tasks where models must show their solution by drawing. | Image: Chen et al.

Semantic content like "a red car on a road" translates easily into language. Geometric relationships resist this conversion because the exact curvature of a boundary or the precise position of an intersection can't be captured in words without losing information. The researchers say BabyVision specifically targets these non-descriptive visual properties.
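A toy sketch makes that pipeline visible. The function names and the coarse caption below are hypothetical stand-ins, not the researchers' code; the point is only that whatever the captioning step drops is unavailable by the time the reasoning step runs.

```python
# Toy illustration of the "verbalization bottleneck": the image is first
# compressed into words, and reasoning happens only over those words.

def caption(image_pixels) -> str:
    """Hypothetical stand-in for the vision-to-language step. Semantic content
    ("a honeycomb with a missing piece") survives; the exact contour of the
    gap does not, because it has no compact verbal description."""
    return "a hexagonal honeycomb with one piece missing and four candidate fragments labeled A-D"

def answer_from_text(description: str, question: str) -> str:
    """Hypothetical stand-in for the text-only reasoning step. It never sees
    the pixels, so matching a fragment to the gap's exact shape reduces to a
    guess over whatever verbal cues survived the caption."""
    return "D"  # plausible-sounding, but unverifiable against the lost geometry

image_pixels = None  # placeholder for the actual puzzle image
choice = answer_from_text(caption(image_pixels), "Which fragment completes the honeycomb?")
print(choice)
```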

Mazes remain a major challenge

The researchers also developed "BabyVision-Gen," an extension with 280 questions. Here, models had to show their solutions by generating images, drawing paths, or highlighting differences. The setup mirrors how people actually tackle these puzzles: children externalize their visual reasoning by drawing long before they can put a solution into words.

The tested image generators showed some promise. Nano Banana Pro hit 18.3 percent, and GPT-Image-1.5 reached 9.8 percent. On tasks like finding differences, Nano Banana Pro scored 35.4 percent. But every generator failed completely on maze tasks and connecting lines. These tasks require continuous spatial coherence over longer sequences, something current architectures can't maintain.

The researchers point to "unified multimodal models" that natively integrate visual processing and generation as a potential solution. These architectures could maintain visual representations throughout the reasoning process instead of compressing everything into a linguistic bottleneck. The BabyVision benchmark, available on GitHub, is meant to serve as a diagnostic tool for measuring progress toward true visual intelligence.

François Chollet's ARC-AGI-3 benchmark tests similar basic cognitive abilities like object permanence and causality. It uses interactive mini-games where AI agents have to figure out the rules on their own. So far, current systems score zero points on these tasks while humans solve them in minutes.
