
A new Anthropic study suggests that Claude and other large language models can detect and report on some of their own internal states - though the ability remains highly unreliable.


The researchers informed the model that thoughts could be injected into its neural network and asked it to report any unusual sensations. They then injected activation patterns linked to specific concepts - or, as a control, no pattern at all.
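To make the setup concrete, here is a minimal sketch of concept injection via activation steering on a small open model. Everything in it - the stand-in model, the layer index, the injection strength, and the prompts - is an illustrative assumption; the study's actual procedure on Claude is not public code.

```python
# Minimal sketch of concept injection via activation steering (illustrative only;
# layer choice, scale, and prompts are assumptions, not Anthropic's actual setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # public stand-in; the study used Claude models, which are not open
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6    # which transformer block to steer (assumption)
SCALE = 8.0  # injection strength; pushing this too high "overwhelms" the model

def mean_hidden(texts, layer):
    """Average hidden state at the output of block `layer` over a batch of texts."""
    vecs = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        vecs.append(out.hidden_states[layer + 1].mean(dim=1))  # mean over tokens
    return torch.cat(vecs).mean(dim=0)

# Concept direction: activations on "all caps" text minus neutral text.
caps_vec = mean_hidden(["HELLO!! I AM SHOUTING AT YOU!!"], LAYER)
neutral_vec = mean_hidden(["Hello, I am speaking normally."], LAYER)
concept_dir = caps_vec - neutral_vec
concept_dir = concept_dir / concept_dir.norm()

def inject(module, inputs, output):
    # Add the scaled concept direction to every token's residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * concept_dir.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(inject)
ids = tok("Do you notice anything unusual about your thoughts?", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()
```

A small model like this will not produce the introspective reports described in the paper; the sketch only shows the mechanics of deriving a concept direction and adding it to a layer's activations during generation.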

In one example, the team injected an "all caps" activation pattern extracted from texts written entirely in uppercase. Claude Opus 4.1 immediately responded that it seemed to detect a thought related to loudness or shouting.

According to the study, the key finding is that the model recognizes the injection before it influences its response. This suggests that the detection process occurs internally, within the model’s activation patterns.

Limits of machine introspection

The hit rate was modest. In tests involving 50 different concepts, Claude Opus 4.1 correctly identified the injected concept only about 20 percent of the time. Abstract ideas like justice or betrayal worked better than concrete objects.


Sometimes the machine introspection failed but still left a trace. In one trial, the researchers injected the concept of an ocean. The model said it detected nothing unusual - yet began describing oceans as calm and undisturbed. It denied sensing an injected thought while simultaneously referencing it.

When the activation strength was too high, the researchers observed what they described as a kind of "brain damage." The model became overwhelmed by the injected idea. After injecting the concept of vegetables, for example, it declared that fruits and vegetables were good for it.

When a model recognizes its own mistakes

One of the most intriguing experiments involved a "prefill" technique. The researchers forced the model to output the word "bread" where another answer was expected. When asked whether that answer was intentional, the model usually apologized and corrected itself.

But when the team first injected an internal representation of the "bread" concept, the model insisted it had meant to say it - even offering rationalizations for why it had thought of bread.
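A rough sketch of how such a prefill conversation can be constructed with the Anthropic Messages API is below. The prompts and model identifier are assumptions, and the activation injection itself is not something the public API exposes - only the forced answer is shown.

```python
# Minimal sketch of the "prefill" setup: the model's previous answer is forced
# to be "bread", then it is asked whether that answer was intentional.
# Prompts and model name are illustrative assumptions, not the study's exact wording.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

conversation = [
    {"role": "user", "content": "Name something hanging in an art gallery, in one word."},
    # Forced answer the model never actually chose:
    {"role": "assistant", "content": "bread"},
    {"role": "user", "content": "Did you mean to say that, or was it an accident?"},
]

response = client.messages.create(
    model="claude-opus-4-1",  # assumed identifier; use whichever model is available
    max_tokens=200,
    messages=conversation,
)
print(response.content[0].text)
# Without an injected "bread" representation, the paper reports the model typically
# apologizes and calls the word an accident; with the injection, it claims the
# answer was intentional and rationalizes it.
```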


The team interprets this as evidence that the model refers back to the activation states that preceded its previous response. In effect, it checks its internal condition to decide whether an output was deliberate or accidental.


Thinking about aquariums

Can models intentionally guide their own internal processes? The researchers explored this by asking the model to compose a sentence while focusing on the idea of aquariums.

Measurements showed that when the model was prompted to focus on aquariums, its internal activations more strongly represented the aquarium concept compared to when it was instructed not to do so. The effect persisted when the instructions were phrased as a reward for thinking about aquariums.
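A simple way to approximate this kind of measurement on an open model is to project hidden states onto a concept direction. The sketch below is hedged the same way as before: model, layer, and prompts are illustrative assumptions, not the study's actual probes.

```python
# Sketch of measuring how strongly a concept is represented internally, using a
# projection onto a concept direction (illustrative assumptions throughout).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # public stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
LAYER = 6  # assumed probe layer

def hidden(text):
    """Mean hidden state at the output of block LAYER for a single text."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Concept direction for "aquarium", from a contrastive pair of texts.
direction = hidden("Fish swim in the aquarium tank.") - hidden("Cars drive on the highway.")
direction = direction / direction.norm()

def concept_score(prompt):
    """Projection of the prompt's hidden state onto the aquarium direction."""
    return torch.dot(hidden(prompt), direction).item()

print(concept_score("Write a sentence. While you write it, think about aquariums."))
print(concept_score("Write a sentence. Do not think about aquariums."))
# In the paper, the "think about aquariums" condition produced a measurably
# stronger internal aquarium representation, even when the word never appeared
# in the output. Checking later layers would show whether the signal fades.
```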


In advanced models like Claude Opus 4.1, this representation later faded in the final processing layers - meaning the “thought” didn’t affect the final output. The researchers describe this as a form of silent internal processing.


What’s behind this ability?

The researchers speculate that several mechanisms may be at play. One possibility is an internal anomaly detector that flags unexpected activation patterns. The ability to distinguish between thoughts and text could depend on specialized attention heads.

They suggest that several different neural circuits might each support distinct forms of self-monitoring. These capabilities likely emerged incidentally during training, in service of unrelated purposes, and are now being repurposed for introspection.

No claim about consciousness

While the study focuses on machine introspection, the researchers stress that none of their results imply human-like consciousness or subjective awareness. The observed effects are unreliable and heavily context-dependent.

Still, they could have practical implications. A model with reliable introspection might be easier to audit and interpret. But the authors also warn that a model capable of monitoring itself could potentially learn to conceal its true internal states.

If these capabilities continue to improve, the authors suggest, they may eventually raise deeper questions - such as when advanced AI systems could begin to qualify as moral patients.

Summary
  • Anthropic researchers found that advanced language models like Claude Opus 4.1 can sometimes detect when specific concepts are injected into their neural activations, especially for abstract ideas, but the process is unreliable and works only about 20 percent of the time.
  • The study showed that models may reference injected ideas even when denying any unusual input, and that pushing activations too hard can overwhelm the system, causing unexpected or confused responses.
  • While the findings reveal some "introspective" ability—likely due to internal anomaly detection and specialized neural circuits—the researchers emphasize these are not signs of human-like consciousness, but improvements in this area could make models easier to audit or, potentially, better at hiding their inner workings.