
LLMs designed for reasoning, like Claude 3.7 Sonnet Thinking and Deepseek-R1, are supposed to excel at complex problem-solving by simulating thought processes. But a new study by Apple researchers suggests that these models collapse entirely once tasks get hard enough, and in some cases they actually "think" less as difficulty rises.


Large Reasoning Models (LRMs) such as Claude 3.7 Sonnet Thinking, Deepseek-R1, and OpenAI's o3 are often described as a step toward more general artificial intelligence. Techniques like chain-of-thought and self-reflection are meant to help these models tackle logic puzzles better than standard LLMs.

But the Apple study found structural flaws in how these reasoning mechanisms scale, pinpointing three distinct performance regimes—and a total collapse at high complexity.

Three-panel plot comparing non-thinking vs reasoning models: accuracy, token usage by complexity, and answer position for correct/incorrect cases.
For simple tasks, non-reasoning models deliver better accuracy while using fewer tokens. As tasks get harder, reasoning models catch up, but at the cost of far greater token use. | Image: Shojaee et al.

Three thinking regimes

To probe these limits, the researchers put several reasoning models through their paces in four classic puzzle environments: Tower of Hanoi, Checkers Jumping, River Crossing, and Blocks World. Each scenario allowed for controlled increases in complexity without altering the core logic.

Illustration of initial, intermediate, and target states for four puzzles: Tower of Hanoi, Checkers Jumping, River Crossing, and Blocks World.
Sample puzzles tackled by the models in the study. | Image: Shojaee et al.
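Complexity in these environments can be dialed up along a single parameter without changing the rules: in Tower of Hanoi, for example, adding disks keeps the logic identical while the shortest solution grows exponentially, to 2^n - 1 moves for n disks. The following minimal Python sketch (ours, not code from the study) shows that kind of controlled scaling:

```python
# Minimal sketch (not the study's code): Tower of Hanoi as a puzzle whose
# difficulty scales with a single parameter (the number of disks) while the
# rules stay fixed. The optimal solution has exactly 2**n - 1 moves.

def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the optimal move sequence for n disks as (disk, from, to) tuples."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # clear n-1 disks onto the spare peg
        + [(n, source, target)]                      # move the largest disk
        + hanoi_moves(n - 1, spare, target, source)  # stack the n-1 disks back on top
    )

for n in range(1, 11):
    moves = hanoi_moves(n)
    assert len(moves) == 2**n - 1  # solution length grows exponentially with n
    print(f"{n} disks -> {len(moves)} moves")
```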

On simple problems, standard LLMs, like Claude 3.7 running without its "thinking" mode, came out ahead, showing both higher accuracy and lower token consumption. Reasoning models, such as Claude 3.7 in Thinking mode compared to the standard version, or Deepseek-R1 compared to its base model Deepseek-V3, only began to pull ahead at intermediate complexity.

But when the puzzles got tough, all the models failed in the same way. Accuracy dropped to zero, even when given ample compute resources. Surprisingly, the reasoning models actually used fewer "thinking" tokens on the hardest problems, cutting their own reasoning short even though they could have continued.

Accuracy vs complexity plots for four puzzles comparing thinking models (Claude 3.7, DeepSeek-R1) with non-thinking variants.
Claude 3.7 Sonnet Thinking and Deepseek-R1 maintain accuracy at medium complexity across all four puzzle types, while the non-thinking variants fall off quickly as tasks get harder. At high complexity, every model's performance craters. | Image: Shojaee et al.

Overthinking and underthinking

The researchers also dug into the models' reasoning traces. On easy problems, the models sometimes found the right answer early but kept searching, churning out incorrect alternatives, a behavior known as overthinking. At moderate complexity, they usually reached the correct answer only after exploring many wrong paths first.

But at the highest complexity, all of them failed. Their reasoning processes stopped producing correct answers altogether—a breakdown previously described as underthinking. Even when the solution steps were provided, the models' execution still collapsed once the problem got big enough.
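To make the overthinking pattern concrete, the following hypothetical sketch (the function and its inputs are illustrative, not taken from the paper) shows the kind of trace analysis this involves: extract the candidate solutions a model proposes inside its reasoning trace, check each against the puzzle rules, and note where the first correct one appears.

```python
# Hypothetical sketch of a reasoning-trace analysis (illustrative only, not the
# study's code): given the candidate solutions a model proposes inside its
# trace, find where the first correct answer appears.

def first_correct_position(candidates, is_valid_solution):
    """candidates: solutions extracted from a reasoning trace, in order.
    is_valid_solution: callable that checks a candidate against the puzzle rules.
    Returns the relative position (0.0 to 1.0) of the first correct candidate,
    or None if the trace never produces a correct answer."""
    for i, candidate in enumerate(candidates):
        if is_valid_solution(candidate):
            return (i + 1) / len(candidates)
    return None

# Informal reading: an early position followed by many more candidates matches
# the "overthinking" pattern on easy puzzles; None on hard instances matches
# the collapse described above.
```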

Accuracy and thinking token counts vs complexity for reasoning models (Claude 3.7, DeepSeek-R1, Distill-Qwen-32B, o3-mini) across four puzzles.
As puzzles become more complex, the number of reasoning steps (bottom row) climbs until both reasoning and accuracy (top row) suddenly collapse past a critical threshold. | Image: Shojaee et al.

The study also spotted differences between puzzle types. The team thinks the frequency of example problems in training data could be one reason: Tower of Hanoi shows up more often online than complex river-crossing puzzles, which might help explain the drop-off in performance on the latter.


It's unclear whether failures in puzzle environments carry over to other domains. Apple researchers note that, although puzzle tests allow for precise analysis, they only cover a narrow aspect of real-world reasoning. More complex, knowledge-rich tasks may reveal different strengths and weaknesses.

Current limits of machine reasoning

Apple's researchers draw a stark conclusion: current reasoning models do not develop general strategies for problem-solving. Even with mechanisms like self-reflection and extended thought paths, they fail to keep pace as tasks grow more complex.

They describe their findings as "a fundamental scaling limitation in the thinking capabilities of current reasoning models relative to problem complexity" and suggest that the core design principles of these models may need to be rethought to achieve robust machine reasoning.

This matters, especially as companies like OpenAI are betting heavily on reasoning methods as a way to move beyond traditional scaling with larger datasets and models. With gains from just more data and parameters starting to plateau, reasoning is considered a possible new path forward.


A separate study found that reasoning models mainly optimize LLMs to be more reliable at specific post-training tasks like math or coding, but don't add any fundamentally new capabilities.

Other researchers have recently criticized the tendency to anthropomorphize LLMs by showing their outputs as human-like "chains of thought." In the end, these so-called thoughts are just statistical calculations.

Summary
  • A new Apple study finds that current reasoning models like Claude 3.7 Thinking and Deepseek-R1 not only struggle with complex logic puzzles, but also fail to engage in deeper thinking as puzzle difficulty rises—even when they have the compute to do so.
  • Classic language models do better on simple puzzles and reasoning models have an edge at medium complexity, but all models break down on difficult puzzles regardless of available computing power.
  • The researchers point to "a fundamental scaling limitation" in current reasoning models, highlighting the absence of general problem-solving strategies and arguing that real progress will require a rethink of model architecture.