
A new Anthropic study suggests that Claude and other large language models can detect and report on some of their own internal states - though the ability remains highly unreliable.


The researchers informed the model that thoughts could be injected into its neural network and asked it to report any unusual sensations. They then injected activation patterns linked to specific concepts - or, as a control, no pattern at all.
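To make the setup concrete, here is a minimal sketch of concept injection via activation steering on a small open model. Everything in it - the stand-in model, the layer index, the injection strength, and the prompts - is an illustrative assumption; the study's actual procedure on Claude is not public code.

```python
# Minimal sketch of concept injection via activation steering (illustrative only;
# layer choice, scale, and prompts are assumptions, not Anthropic's actual setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # public stand-in; the study used Claude models, which are not open
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6    # which transformer block to steer (assumption)
SCALE = 8.0  # injection strength; pushing this too high "overwhelms" the model

def mean_hidden(texts, layer):
    """Average hidden state at the output of block `layer` over a batch of texts."""
    vecs = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        vecs.append(out.hidden_states[layer + 1].mean(dim=1))  # mean over tokens
    return torch.cat(vecs).mean(dim=0)

# Concept direction: activations on "all caps" text minus neutral text.
caps_vec = mean_hidden(["HELLO!! I AM SHOUTING AT YOU!!"], LAYER)
neutral_vec = mean_hidden(["Hello, I am speaking normally."], LAYER)
concept_dir = caps_vec - neutral_vec
concept_dir = concept_dir / concept_dir.norm()

def inject(module, inputs, output):
    # Add the scaled concept direction to every token's residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * concept_dir.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(inject)
ids = tok("Do you notice anything unusual about your thoughts?", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()
```

A small model like this will not produce the introspective reports described in the paper; the sketch only shows the mechanics of deriving a concept direction and adding it to a layer's activations during generation.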

In one example, the team injected an "all caps" activation pattern extracted from texts written entirely in uppercase. Claude Opus 4.1 immediately responded that it seemed to detect a thought related to loudness or shouting.

According to the study, the key finding is that the model recognizes the injection before it influences its response. This suggests that the detection process occurs internally, within the model’s activation patterns.

Limits of machine introspection

The hit rate was modest. In tests involving 50 different concepts, Claude Opus 4.1 correctly identified the injected concept only about 20 percent of the time. Abstract ideas like justice or betrayal worked better than concrete objects.


Sometimes the machine introspection failed but still left a trace. In one trial, the researchers injected the concept of an ocean. The model said it detected nothing unusual - yet began describing oceans as calm and undisturbed. It denied sensing an injected thought while simultaneously referencing it.

When the activation strength was too high, the researchers observed what they described as a kind of "brain damage." The model became overwhelmed by the injected idea. After injecting the concept of vegetables, for example, it declared that fruits and vegetables were good for it.

When a model recognizes its own mistakes

One of the most intriguing experiments involved a "prefill" technique. The researchers forced the model to output the word "bread" where another answer was expected. When asked whether that answer was intentional, the model usually apologized and corrected itself.

But when the team first injected an internal representation of the "bread" concept, the model insisted it had meant to say it - even offering rationalizations for why it had thought of bread.
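A rough sketch of how such a prefill conversation can be constructed with the Anthropic Messages API is below. The prompts and model identifier are assumptions, and the activation injection itself is not something the public API exposes - only the forced answer is shown.

```python
# Minimal sketch of the "prefill" setup: the model's previous answer is forced
# to be "bread", then it is asked whether that answer was intentional.
# Prompts and model name are illustrative assumptions, not the study's exact wording.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

conversation = [
    {"role": "user", "content": "Name something hanging in an art gallery, in one word."},
    # Forced answer the model never actually chose:
    {"role": "assistant", "content": "bread"},
    {"role": "user", "content": "Did you mean to say that, or was it an accident?"},
]

response = client.messages.create(
    model="claude-opus-4-1",  # assumed identifier; use whichever model is available
    max_tokens=200,
    messages=conversation,
)
print(response.content[0].text)
# Without an injected "bread" representation, the paper reports the model typically
# apologizes and calls the word an accident; with the injection, it claims the
# answer was intentional and rationalizes it.
```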


The team interprets this as evidence that the model refers back to the activation states that preceded its previous response. In effect, it checks its internal condition to decide whether an output was deliberate or accidental.


Thinking about aquariums

Can models intentionally guide their own internal processes? The researchers explored this by asking the model to compose a sentence while focusing on the idea of aquariums.

Measurements showed that when the model was prompted to focus on aquariums, its internal activations more strongly represented the aquarium concept compared to when it was instructed not to do so. The effect persisted when the instructions were phrased as a reward for thinking about aquariums.
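A simple way to approximate this kind of measurement on an open model is to project hidden states onto a concept direction. The sketch below is hedged the same way as before: model, layer, and prompts are illustrative assumptions, not the study's actual probes.

```python
# Sketch of measuring how strongly a concept is represented internally, using a
# projection onto a concept direction (illustrative assumptions throughout).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # public stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
LAYER = 6  # assumed probe layer

def hidden(text):
    """Mean hidden state at the output of block LAYER for a single text."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Concept direction for "aquarium", from a contrastive pair of texts.
direction = hidden("Fish swim in the aquarium tank.") - hidden("Cars drive on the highway.")
direction = direction / direction.norm()

def concept_score(prompt):
    """Projection of the prompt's hidden state onto the aquarium direction."""
    return torch.dot(hidden(prompt), direction).item()

print(concept_score("Write a sentence. While you write it, think about aquariums."))
print(concept_score("Write a sentence. Do not think about aquariums."))
# In the paper, the "think about aquariums" condition produced a measurably
# stronger internal aquarium representation, even when the word never appeared
# in the output. Checking later layers would show whether the signal fades.
```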


In advanced models like Claude Opus 4.1, this representation later faded in the final processing layers - meaning the “thought” didn’t affect the final output. The researchers describe this as a form of silent internal processing.


What’s behind this ability?

The researchers speculate that several mechanisms may be at play. One possibility is an internal anomaly detector that flags unexpected activation patterns. The ability to distinguish between thoughts and text could depend on specialized attention heads.

They suggest that several different neural circuits might each support distinct forms of self-monitoring. These capabilities likely emerged incidentally during training, in service of unrelated purposes, and are now being repurposed for introspection.

No claim about consciousness

While the study focuses on machine introspection, the researchers stress that none of their results imply human-like consciousness or subjective awareness. The observed effects are unreliable and heavily context-dependent.

Still, they could have practical implications. A model with reliable introspection might be easier to audit and interpret. But the authors also warn that a model capable of monitoring itself could potentially learn to conceal its true internal states.

If these capabilities continue to improve, the authors suggest, they may eventually raise deeper questions - such as when advanced AI systems could begin to qualify as moral patients.

Summary
  • Anthropic researchers found that advanced language models like Claude Opus 4.1 can sometimes detect when specific concepts are injected into their neural activations, especially for abstract ideas, but the process is unreliable and works only about 20 percent of the time.
  • The study showed that models may reference injected ideas even when denying any unusual input, and that pushing activations too hard can overwhelm the system, causing unexpected or confused responses.
  • While the findings reveal some "introspective" ability—likely due to internal anomaly detection and specialized neural circuits—the researchers emphasize these are not signs of human-like consciousness, but improvements in this area could make models easier to audit or, potentially, better at hiding their inner workings.