AI models don't have a unified "self" - and that's not a bug

Expecting internal coherence from language models means asking the wrong question, according to an Anthropic researcher.

"Why does page five of a book say that the best food is pizza and page 17 says the best food is pasta? What does the book really think? And you're like: 'It's a book!'", explains Josh Batson, research scientist at Anthropic, in MIT Technology Review.

The analogy comes from experiments on how AI models process facts internally. Anthropic found that Claude uses one mechanism to know that bananas are yellow and a different one to confirm that the statement "Bananas are yellow" is true. These mechanisms aren't connected to each other. When a model gives contradictory answers, it's drawing on different parts of itself - without any central authority coordinating them. "It might be like, you're talking to Claude and then it wanders off," says Batson. "And now you're not talking to Claude but something else."

The takeaway: Assuming language models have mental coherence like humans might be a fundamental category error.

Source: MIT Technology Review