
A new commentary from Pfizer researchers challenges the main claims of "The Illusion of Thinking," a study co-authored by Apple scientists that found large reasoning models (LRMs) struggle as tasks get more complex.


The Apple-led paper argues that this sudden drop in performance signals a fundamental limit to what machine reasoning can do. Other studies have reported similar results but stop short of calling it a hard ceiling.

Pfizer's team also disagrees with Apple's interpretation. They argue the performance drop isn't due to a cognitive barrier, but is instead caused by artificial test conditions. Forcing models to operate in a strictly text-only environment - without tools like programming interfaces - makes complex tasks much harder than necessary. What looks like a thinking problem is actually an execution problem.

Why LRMs stumble on complex puzzles

The original study tested models like Claude 3.7 Sonnet-Thinking and Deepseek-R1 on text-based puzzles, such as Tower of Hanoi or River Crossing. As the puzzles got harder, the models' accuracy dropped sharply - a phenomenon the study called a "reasoning cliff."


Pfizer's team points to the test's unrealistic constraints: the models couldn't use external tools and had to keep track of everything in plain text. This didn't reveal a reasoning flaw, but made it nearly impossible for the models to manage long, precise problem-solving steps.

To illustrate, Pfizer researchers looked at the o4-mini model. Without tool access, it declared a solvable river crossing puzzle impossible, likely because it couldn't remember earlier steps. This memory limitation is a well-known issue with today's language models and is documented in the Apple study as well.

Pfizer calls this "learned helplessness": when an LRM can't execute a long sequence perfectly, it may incorrectly conclude the task is unsolvable.

The Apple study also didn't account for "cumulative error." In tasks with thousands of steps, the chance of a flawless run drops with every move. Even if a model is 99.99% accurate per step, the odds of solving a tough Tower of Hanoi puzzle without a mistake are below 45%. So the observed performance drop may simply reflect statistical reality, not cognitive limits.
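To put rough numbers on that arithmetic (the specific disk counts below are our illustration, not from the commentary), the probability of a flawless run is simply the per-move accuracy raised to the number of required moves, which for an n-disk Tower of Hanoi is 2^n - 1:

```python
# Probability of a flawless run: per-move accuracy ** number of required moves.
# An n-disk Tower of Hanoi needs 2**n - 1 moves.

def flawless_run_probability(n_disks: int, per_move_accuracy: float) -> float:
    moves = 2**n_disks - 1
    return per_move_accuracy**moves

for n in (10, 12, 13, 15):
    p = flawless_run_probability(n, 0.9999)
    print(f"{n} disks ({2**n - 1:>5} moves): {p:.1%} chance of an error-free solution")

# 10 disks ( 1023 moves): 90.3% chance of an error-free solution
# 12 disks ( 4095 moves): 66.4% chance of an error-free solution
# 13 disks ( 8191 moves): 44.1% chance of an error-free solution
# 15 disks (32767 moves): 3.8% chance of an error-free solution
```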

Tools unlock higher-level reasoning

Pfizer's team tested GPT-4o and o4-mini again, this time with access to a Python tool. Both solved simple puzzles easily, but as complexity increased, their methods diverged.
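The commentary doesn't reproduce the models' scripts, but to make "tool access" concrete: with a code interpreter, a model can hand off both the solving and the bookkeeping to a program instead of tracking every move in text. A minimal sketch of such a script for Tower of Hanoi (our illustration, not the models' actual output) might look like this:

```python
# Illustrative sketch: solve Tower of Hanoi programmatically and verify the result.
# Not the models' actual code - just the kind of script a Python tool makes possible.

def hanoi_moves(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Return the optimal move list (2**n - 1 moves) for n disks."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)
        + [(src, dst)]
        + hanoi_moves(n - 1, aux, src, dst)
    )

def is_valid_solution(n: int, moves: list[tuple[str, str]]) -> bool:
    """Replay the moves and check that no larger disk ever lands on a smaller one."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src]:
            return False  # moving from an empty peg
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # larger disk placed on a smaller one
        pegs[dst].append(disk)
    return pegs["C"] == list(range(n, 0, -1))  # all disks on the target peg

moves = hanoi_moves(10)
print(len(moves), is_valid_solution(10, moves))  # 1023 True
```

The point is not the specific puzzle but that generating and checking thousands of moves becomes a cheap, exact operation once it leaves the text channel.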


GPT-4o used Python to pursue a logical but flawed strategy and didn't recognize the mistake. o4-mini, on the other hand, noticed its initial error, analyzed it, and switched to a correct approach, leading to a successful solution.

Screenshot with three code panels: D shows an incorrect simulation, E validates the paired-couples algorithm for 20 pairs in 39 moves, and F lists the final JSON moves.
The o4-mini model detects an error, changes its strategy, and ultimately solves the boat puzzle. | Image: Khan et al.

The researchers tie this behavior to classic ideas in cognitive science. GPT-4o acts like Daniel Kahneman's "System 1" - fast and intuitive, but prone to sticking with a bad plan. o4-mini, meanwhile, shows "System 2" thinking: slow, analytical, and able to revise its own strategy after recognizing a mistake. This kind of metacognitive adjustment is considered a hallmark of deliberate, reflective problem-solving.
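In code terms, this System 2 behavior resembles a propose-validate-revise loop. The sketch below is purely illustrative; propose_plan, validate, and critique are hypothetical stand-ins for model calls and checks, not part of the Pfizer setup:

```python
# Illustrative propose-validate-revise loop, not the paper's implementation.
# propose_plan(), validate(), and critique() are hypothetical stand-ins.

def solve_with_self_correction(task, propose_plan, validate, critique, max_revisions=3):
    feedback = None
    for _ in range(max_revisions + 1):
        plan = propose_plan(task, feedback)   # draft a plan (fast, "System 1")
        ok, error = validate(task, plan)      # cheap programmatic check of the plan
        if ok:
            return plan                       # plan survives verification
        feedback = critique(plan, error)      # analyze the failure (slow, "System 2")
    return None                               # concede after max_revisions attempts
```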

Rethinking how to benchmark reasoning models

Pfizer's team argues that future LRM benchmarks should test models both with and without tools. Tool-free tests reveal the limits of language-only interfaces, while tool-assisted tests show what models can achieve as agents. They also call for benchmarks that specifically measure metacognitive abilities, like error detection and strategic adjustment.
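A benchmark along these lines could run every task in both conditions and log a simple metacognition signal alongside accuracy. The loop below is a hypothetical sketch; run_model stands in for whatever evaluation harness and model API are actually used:

```python
# Hypothetical benchmark loop: each task runs with and without tool access,
# and we record whether the model detected and fixed its own mistake.
# run_model() is a stand-in for the real evaluation harness.

def evaluate(tasks, run_model):
    results = []
    for task in tasks:
        for tools_enabled in (False, True):
            trace = run_model(task, tools=tools_enabled)
            results.append({
                "task": task.name,
                "tools": tools_enabled,
                "solved": trace.solved,
                # metacognition signal: a self-initiated revision that led to success
                "self_corrected": trace.revisions > 0 and trace.solved,
            })
    return results
```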

These findings have safety implications as well. AI models that blindly follow flawed plans without correcting themselves could be risky, while those able to revise their strategies are likely to be more reliable.


The original "The Illusion of Thinking" study by Shojaee et al. (2025) sparked a broad debate about what large language models are actually capable of. Pfizer's analysis agrees with the data, but argues the story is more complicated than just machines can't think.

Summary
  • Pfizer researchers challenge the conclusions of "The Illusion of Thinking" study, which links declining performance of reasoning models on complex tasks to core flaws in machine thinking.
  • The team attributes the performance drop largely to limitations in the test setup, especially the absence of tool access for the models.
  • Their own experiments found that with tool access, models solved harder puzzles, and o4-mini even showed metacognitive self-correction, which is an advanced problem-solving trait.
Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.