Date: 2025-07-20
AI research
Matthias Bastian

New ARC-AGI-3 benchmark shows that humans still outperform LLMs at pretty basic thinking

Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.

ARC-AGI-3 aims to test how well AI systems can handle brand-new problems. While people breeze through the challenges, the latest AI models still come up short.

AI researcher François Chollet and his team have released ARC-AGI-3, the latest version of their benchmark for evaluating general intelligence. According to Chollet, ARC-AGI-3 is built to measure whether AI systems can learn on their own in truly unfamiliar situations, without any background knowledge or hints. The tasks draw only on so-called "core knowledge priors" - basic cognitive abilities like object permanence and causality - and leave out language, trivia, and cultural symbols entirely.

The "Developer Preview" offers three interactive test games that, according to the creators and the leaderboard, humans can solve quickly and easily. So far, AI systems have consistently failed to beat any of the games, except for one leaderboard entry of unknown origin.

OpenAI researcher Zhiqing Sun claims on X that the new ChatGPT agent can already solve the first game, but it's unclear whether OpenAI's agent is actually the one holding the top spot.

Interactive games replace static tests

The big change in ARC-AGI-3 is its interactive format. Instead of static problems, the new version features mini-games set in a grid world. To win, AI agents have to figure out the rules and objectives for themselves, learning how to succeed through trial and error.
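
To make the idea of "figuring out the rules through trial and error" concrete, here is a minimal toy sketch in Python. It is not the ARC-AGI-3 API or any of its games: the `ToyGridGame` class, the `explore` function, the four numbered actions, the 5x5 grid, and the fixed goal cell are all hypothetical and exist only for illustration. The agent is told neither what each action does nor where the goal is. It records what it observes after every move and prefers actions it has not yet tried, or cells it has visited least.

```python
import random


class ToyGridGame:
    """Hypothetical grid game: the agent is not told what the actions do
    or where the goal is. Not related to the real ARC-AGI-3 games."""

    def __init__(self, size=5, seed=0):
        rng = random.Random(seed)
        moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]
        rng.shuffle(moves)
        # Hidden rule: action id (0-3) -> displacement on the grid.
        self._effects = dict(enumerate(moves))
        self.size = size
        self._goal = (size - 1, size - 1)  # hidden objective
        self.pos = (0, 0)

    def step(self, action):
        """Apply an action; return the new position and a 'won' flag."""
        dx, dy = self._effects[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)  # walls clip moves
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        return self.pos, self.pos == self._goal


def explore(game, max_steps=500, seed=0):
    """Trial-and-error agent: learn transitions, favour the unexplored."""
    rng = random.Random(seed)
    model = {}               # (position, action) -> observed next position
    visits = {game.pos: 1}   # how often each cell has been seen
    pos = game.pos
    for t in range(1, max_steps + 1):
        untried = [a for a in range(4) if (pos, a) not in model]
        if untried:
            # Something here is still unknown: try it.
            action = rng.choice(untried)
        else:
            # Everything here is known: head for the least-visited cell.
            action = min(range(4),
                         key=lambda a: visits.get(model[(pos, a)], 0))
        new_pos, won = game.step(action)
        model[(pos, action)] = new_pos
        visits[new_pos] = visits.get(new_pos, 0) + 1
        pos = new_pos
        if won:
            return t, model  # steps needed, plus what was learned
    return None, model       # step budget exhausted


if __name__ == "__main__":
    steps, model = explore(ToyGridGame(seed=1), seed=1)
    print("steps to goal:", steps, "| transitions learned:", len(model))
```

Even in this tiny setting, the agent has to build its own model of the environment before it can act purposefully, and it may or may not reach the goal within its step budget. The real games are presumably far richer, with rules and objectives that differ from game to game.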

The developers say this setup is meant to mirror how humans learn: by exploring, planning, and adapting to new environments - skills that remain mostly unreachable for today's AI systems. "As long as that gap remains, we do not have AGI," the project team writes on arcprize.org.

To go along with the preview, Hugging Face is sponsoring a sprint competition with a $10,000 prize. Participants have four weeks to build and submit the best-performing agent using the provided API.

By early 2026, the full benchmark is supposed to feature about a hundred different games, split into public and private test sets. More details about the benchmark, how to participate, and the API are available at arcprize.org.

Summary
  • AI researcher François Chollet and his team have introduced ARC-AGI-3, a new benchmark that challenges AI systems to solve unfamiliar tasks entirely on their own, without any prior instructions or examples.
  • The test features interactive mini-games where AI agents must autonomously discover game mechanics and achieve objectives through trial and error, assessing core cognitive skills like understanding object permanence and causality.
  • While humans can complete these games in minutes, AI systems have so far failed to beat any of them, apart from one leaderboard entry of unknown origin. A developer preview with three games has been released, and Hugging Face is sponsoring a sprint competition with a $10,000 prize.