
Apple's recent research paper "The Illusion of Thinking" has reignited debate over whether large language models can really reason.


Apple's team put leading models to the test with classic logic puzzles like the Tower of Hanoi, finding that even advanced systems still struggle to carry out simple algorithms correctly and completely. Based on these results, the authors argue that LLMs lack true generalizable reasoning, instead acting as pattern matchers that overlook deeper structures.

Other research seems to back this up. A separate study reached similar conclusions, though it was less critical, noting there's still much to learn about how well LLMs can reason. And a Salesforce paper benchmarking LLM performance in CRM contexts found that their abilities took a nosedive in more complex, multi-turn scenarios.

Critics say the paper's argument is overly black-and-white

LLM skeptics see these papers as confirmation of their view that these systems are incapable of real reasoning, and worry that this limitation could stall progress toward advanced AI. But some AI experts argue that the paper's take is too simplistic.


Lawrence Chan from Metr offered a nuanced perspective on LessWrong. He argues that framing the debate as either real thinking or rote memorization ignores the complex middle ground where both human and machine reasoning actually operate.

For instance, people catch a thrown ball not by solving physics equations, but by relying on learned heuristics. These shortcuts aren't signs of ignorance, but practical strategies for solving problems with limited resources.

Language models, Chan notes, also depend on experience and abstraction under tight computational limits. He points out that generalization can be seen as an advanced form of memorization - starting from individual examples, moving through surface strategies, and eventually forming broader rules.

Chan points out that while LLMs may struggle to output all 32,000+ moves for the 15-disk Hanoi puzzle in the exact requested format, they can generate a Python script to solve the problem instantly. He argues that when LLMs explain their approach, suggest shortcuts, and offer practical solutions in code, it demonstrates a functional, if different, understanding of the task. For Chan, dismissing this as a lack of understanding misses the point.
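For scale, the 15-disk puzzle requires 2^15 - 1 = 32,767 moves, while the algorithm itself is only a few lines of recursion. Below is a minimal sketch of the kind of script Chan has in mind; the function and peg names are illustrative, not taken from either paper.

    # Classic recursive Tower of Hanoi: record the moves needed to shift
    # n disks from the source peg to the target peg using one spare peg.
    def hanoi(n, source, target, spare, moves):
        if n == 0:
            return
        hanoi(n - 1, source, spare, target, moves)  # clear the n-1 smaller disks out of the way
        moves.append((source, target))              # move the largest remaining disk
        hanoi(n - 1, spare, target, source, moves)  # restack the smaller disks on top of it

    moves = []
    hanoi(15, "A", "C", "B", moves)
    print(len(moves))  # 32767, i.e. 2**15 - 1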

The authors call it "counterintuitive" that language models use fewer tokens at high complexity, suggesting a "fundamental limitation." But this simply reflects models recognizing their limitations and seeking alternatives to manually executing thousands of possibly error-prone steps – if anything, evidence of good judgment on the part of the models!

Lawrence Chan

Chan also warns against using performance on theoretical puzzles as a basis for judging models' general abilities. The real question, he says, is whether their strategies can be applied to complex, real-world tasks.

Recommendation

While the Apple paper highlights specific weaknesses in today's LLMs, Chan believes it sidesteps the bigger issue: which kinds of "reasoning" matter for practical use cases, and how well do LLMs handle those?

Sure, an LLM might not be able to do "generalized reasoning" in the sense that the authors propose, but an LLM with a simple code interpreter definitely can. Here, the key question is why we must consider the LLM by itself, as opposed to an AI agent composed of an LLM and an agent scaffold – note that even chatbot-style apps such as ChatGPT provide the LLM with various tools such as a code interpreter and internet access. Why should we limit our discussion of AGIs to just the LLM component of an AI system, as opposed to the AI system as a whole?

Lawrence Chan

AI response paper was just a joke

The widely shared paper "The Illusion of the Illusion of Thinking," which circulated as a supposed response to Apple's critique and was partly written by Claude 4 Opus, was never intended as a genuine rebuttal. According to author Alex Lawsen, it was simply a joke filled with errors.

Lawsen was surprised by how quickly the joke paper went viral and how many people took it seriously, calling it his "first real taste of something I'd made going properly viral, and honestly? It was kind of scary."

Summary
  • Apple's research paper "The Illusion of Thinking" questions whether large language models truly understand simple puzzle tasks like the Tower of Hanoi, arguing that these systems act mainly as pattern matchers rather than thinkers.
  • Some AI researchers, including Lawrence Chan, disagree with Apple's position and point out that language models can show strategic and functional problem-solving—even when they use different methods, such as generating code.
  • Alex Lawsen, who had Claude 4 Opus write a widely shared response to Apple's paper, clarified that his paper was intended as a joke. Despite its many errors, it was nonetheless treated as a serious rebuttal on social media.