
AI models don't have a unified "self" - and that's not a bug

Expecting internal coherence from language models means asking the wrong question, according to an Anthropic researcher.

"Why does page five of a book say that the best food is pizza and page 17 says the best food is pasta? What does the book really think? And you're like: 'It's a book!'" explains Josh Batson, a research scientist at Anthropic, in MIT Technology Review.

The analogy comes from experiments on how AI models process facts internally. Anthropic found that Claude uses one mechanism to know that bananas are yellow and a different one to confirm that the statement "Bananas are yellow" is true. These mechanisms aren't connected to each other. When a model gives contradictory answers, it's drawing on different parts of itself, with no central authority coordinating them. "It might be like, you're talking to Claude and then it wanders off," says Batson. "And now you're not talking to Claude but something else."

The takeaway: assuming that language models have human-like mental coherence may be a fundamental category error.
