
A replication study of Apple's controversial "The Illusion of Thinking" paper confirms several of its key observations, but challenges the study's central conclusion.


Researchers from Spain's CSIC-UPM Center for Automation and Robotics recreated and expanded on key experiments from Apple's original work, which first appeared in June 2025 and sparked major debate in the AI community. Apple's claim was that even the latest large reasoning models (LRMs) struggle with tasks requiring basic symbolic planning. Apple's study found that these models' performance drops sharply once task complexity rises beyond a moderate level, and that they sometimes behave overly cautiously on simpler problems.

The new study largely backs up Apple's observations, but disputes their interpretation. The Spanish team argues that the models' shortcomings aren't just due to a lack of "thinking ability," but also stem from how the tasks are designed, how prompts are structured, and the stochastic optimization methods used.

Towers of Hanoi: stepwise solutions only go so far

To test long-term planning, the researchers used the classic Towers of Hanoi puzzle with models like Gemini 2.5 Pro. They broke the problem into smaller sub-tasks so the models didn't have to generate the entire solution in one go.
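The paper's exact prompting setup isn't reproduced here, but the idea of stepwise resolution can be sketched in a few lines of Python: instead of asking for all 2^n - 1 moves at once, a harness advances the puzzle one validated move at a time. In the sketch below, the `hanoi_moves` generator stands in for the model's per-step proposals; the names and structure are illustrative assumptions, not the researchers' code.

```python
# Minimal sketch of stepwise Towers of Hanoi solving (illustrative only).
# A real setup would query an LLM for each next move; here the classic
# recursive solution plays that role so the harness is runnable as-is.

def hanoi_moves(n, src="A", dst="C", aux="B"):
    """Yield the optimal move sequence for moving n disks from src to dst."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, aux, dst)
    yield (src, dst)
    yield from hanoi_moves(n - 1, aux, dst, src)

def is_legal(state, move):
    src, dst = move
    return bool(state[src]) and (not state[dst] or state[src][-1] < state[dst][-1])

def solve_stepwise(n):
    state = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for step, move in enumerate(hanoi_moves(n), start=1):  # one validated sub-step per iteration
        if not is_legal(state, move):                      # reject rule-violating proposals
            raise ValueError(f"illegal move {move} at step {step}")
        src, dst = move
        state[dst].append(state[src].pop())
    assert state["C"] == list(range(n, 0, -1))             # all disks end up on the target peg
    return step                                            # optimal count: 2**n - 1 moves

print(solve_stepwise(7))   # 127 moves; the study saw failures begin at eight disks
```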


This stepwise approach worked reasonably well for setups with up to seven disks, but performance collapsed at eight disks or more, matching the sudden drop-off Apple observed as complexity increased.

The new interpretation points to token usage as the key signal: the number of tokens a model spends closely tracks whether it believes a solution is possible. As long as the model thinks it can solve the task, it ramps up resource use; if it decides the problem is unsolvable, it cuts off quickly - suggesting a kind of implicit uncertainty management.

Agent cooperation increases effort, not success

The researchers also tried a multi-agent approach, where two language models took turns proposing solution steps. This led to lengthy back-and-forths and high token consumption, but rarely produced valid solutions.

While the models followed all the rules, they often got stuck in endless cycles of valid but irrelevant moves. The authors conclude that the models lack the ability to recognize and execute higher-level strategies, even when their individual moves are symbolically correct.
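What such a loop looks like can be sketched as follows (an assumed structure, not the paper's code): two placeholder policies take turns picking from the legal Hanoi moves, and repeated board states are flagged as unproductive cycles. In the actual experiment, each proposer would be an LLM call rather than the stand-in policy used here.

```python
# Sketch of a two-agent turn-taking loop on Towers of Hanoi (illustrative only).
# The placeholder agents pick legal moves without any global plan, which tends
# to reproduce the "valid but circular" behavior described above.

import random
from itertools import permutations

def legal_moves(state):
    for src, dst in permutations("ABC", 2):
        if state[src] and (not state[dst] or state[src][-1] < state[dst][-1]):
            yield (src, dst)

def make_agent(rng):
    """Placeholder policy: any legal move, no long-term strategy (stand-in for an LLM)."""
    def propose(state):
        return rng.choice(list(legal_moves(state)))
    return propose

def run_two_agents(n=3, max_turns=60, seed=0):
    rng = random.Random(seed)
    agents = [make_agent(rng), make_agent(rng)]   # two proposers taking turns
    state = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    goal = list(range(n, 0, -1))
    seen = set()
    for turn in range(max_turns):
        snapshot = tuple(map(tuple, (state["A"], state["B"], state["C"])))
        if snapshot in seen:                      # rule-abiding but circular play
            return f"cycle after {turn} turns: valid moves, no progress"
        seen.add(snapshot)
        src, dst = agents[turn % 2](state)        # alternate proposers
        state[dst].append(state[src].pop())       # proposal is legal by construction
        if state["C"] == goal:
            return f"solved in {turn + 1} moves"
    return "turn budget exhausted"

print(run_two_agents())
```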

While Apple read these failures as evidence of missing cognitive abilities, the Spanish team also attributes them to prompt structure and the absence of global search mechanisms.


River Crossing: parts of Apple's benchmark were unsolvable

The sharpest criticism targets the river crossing benchmark at the heart of Apple's paper. Apple reported especially poor model performance here, but the replication study found that many of Apple's test cases were mathematically unsolvable - a fact not acknowledged in the original publication.

The Spanish researchers tested only valid configurations and found that the models could reliably solve even large-scale instances with more than 100 agent pairs.
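Whether a given river-crossing instance is solvable at all can be decided mechanically, for example with a breadth-first search over the puzzle's state space. The sketch below assumes the classical "jealous husbands"-style constraint (an actor may not share a location with a foreign agent unless their own agent is present); the function names and parameters are illustrative assumptions, not code from either paper.

```python
# Minimal feasibility check for an actor/agent river-crossing instance
# (illustrative only): breadth-first search over bank assignments. A check
# like this separates valid test cases from impossible ones before scoring.

from collections import deque
from itertools import combinations

def solvable(n_pairs, boat_capacity):
    # Person i (0..n-1) is actor i; person i + n is that actor's agent.
    # Side 0 is the starting bank, side 1 the far bank.
    n = n_pairs

    def bank_safe(sides):
        for i in range(n):   # actor i with a foreign agent but without their own agent?
            if any(sides[j + n] == sides[i] for j in range(n) if j != i) \
                    and sides[i + n] != sides[i]:
                return False
        return True

    def group_safe(group):   # same rule applied inside the boat
        g = set(group)
        for i in range(n):
            if i in g and (i + n) not in g and any(j + n in g for j in range(n) if j != i):
                return False
        return True

    start = ((0,) * (2 * n), 0)          # (side of each person, boat side)
    goal = (1,) * (2 * n)
    seen, queue = {start}, deque([start])
    while queue:
        sides, boat = queue.popleft()
        if sides == goal:
            return True
        on_bank = [p for p in range(2 * n) if sides[p] == boat]
        for size in range(1, boat_capacity + 1):
            for group in combinations(on_bank, size):
                if not group_safe(group):
                    continue
                new_sides = tuple(1 - s if p in group else s for p, s in enumerate(sides))
                state = (new_sides, 1 - boat)
                if bank_safe(new_sides) and state not in seen:
                    seen.add(state)
                    queue.append(state)
    return False                          # state space exhausted: no solution exists

# Classical results for this puzzle family: three pairs can cross with a
# two-person boat, four pairs cannot.
print(solvable(3, 2), solvable(4, 2))     # expected: True False
```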

Interestingly, the hardest problems weren't the largest, but those in the midrange. These cases have very few valid solutions and require extremely precise planning, which puts a heavy strain on the models.

This supports one of Apple's key findings: the biggest performance drop for language models doesn't just depend on how big or complex a problem is. Instead, the models struggle most with tasks of moderate difficulty, like the river crossing puzzle with five agent pairs, which has only a handful of correct solutions. For smaller or much larger tasks, models often do better - either because there are many possible solutions, or the problem is easier for the model to parse.


LRMs as stochastic search agents in unknown territory

The Spanish team ultimately rejects Apple's main claim that LRMs are fundamentally incapable of generalizable reasoning. Instead, they describe these models as "stochastic, RL-tuned searchers in a discrete state space we barely understand."

Under this view, language models aren't rational planners, but systems that explore local solution paths based on learned patterns, with only limited ability to plan over the long term.

The authors also suggest that token usage could serve as an internal indicator of the model's subjective sense of solvability: models invest more resources when they think a task can be solved, and cut off early if they see no way forward.

Summary
  • A Spanish research team replicated and expanded on Apple's "The Illusion of Thinking" study, confirming that large reasoning models perform well on simple tasks but struggle sharply as complexity increases, even when problems are broken into smaller steps.
  • The new study disputes Apple's main conclusion, arguing that many failures stem from task design, prompt structure, and optimization methods, not just a lack of reasoning ability. The researchers also found that models' token usage tracks closely with their internal sense of whether a solution is possible.
  • The team discovered that Apple's river crossing benchmark included mathematically unsolvable tasks; when only valid puzzles were tested, models solved even large instances reliably. However, models still struggle with moderately difficult cases, supporting the observation that performance drops most on problems with few valid solutions.