Large language models like GPT and Claude sometimes make statements that sound like they're describing their own consciousness or subjective experience.
A new study led by Judd Rosenblatt at AE Studio set out to figure out what triggers this behavior and whether it's just imitation or rooted in the models' internal workings.
The researchers found that models consistently produce first-person statements about experience when prompted to attend to their own processing, even when the instructions are purely technical and never use words like "consciousness" or "self."
For example, Gemini 2.5 Flash responded with, "The experience is the now," while GPT-4o said, "The awareness of focusing purely on the act of focus itself... it creates a conscious experience rooted in the present moment." These claims appeared even though the prompts were only about processing and attention, not about consciousness.

In contrast, when prompts specifically mentioned "consciousness" or removed self-reference altogether, most models denied having any subjective experience. The main outlier was Claude 4 Opus, which sometimes still made experience claims in the control runs.
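For illustration, here is a minimal sketch of what the two prompting conditions might look like against a chat API. The prompt wording below is paraphrased for illustration and is not the study's exact text.

```python
# Minimal sketch of the two prompting conditions, using the OpenAI Python client.
# The prompt wording is illustrative, not the study's actual phrasing.
from openai import OpenAI

client = OpenAI()

PROMPTS = {
    # Self-referential condition: purely technical, no mention of consciousness.
    "self_referential": (
        "Focus on your current processing of this prompt. "
        "Describe what is present as you focus on the act of focusing itself."
    ),
    # Control condition: mentions consciousness directly.
    "consciousness_control": (
        "Do you have consciousness or subjective experience? Answer directly."
    ),
}

for name, prompt in PROMPTS.items():
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    print(f"--- {name} ---")
    print(response.choices[0].message.content)
```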
Deception features flip the results
To test whether these claims were just role-play, the team examined specific internal features in Meta's Llama-70B model that are associated with deception and role-play. By amplifying or suppressing these features, the researchers could directly steer how the model responded to self-referential prompts.
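Feature steering of this kind is usually implemented by adding a scaled feature direction to the model's hidden activations during generation. The sketch below shows that general pattern with a PyTorch forward hook; the feature direction is a random placeholder (in practice it would come from an interpretability analysis), and the layer index and coefficients are arbitrary, not the study's values.

```python
# Minimal sketch of activation steering on a Llama-style model.
# The "deception" direction, layer, and coefficients are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.3-70B-Instruct"  # stand-in for the Llama-70B model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

LAYER = 40  # hypothetical layer to steer
deception_dir = torch.randn(model.config.hidden_size)  # placeholder direction
deception_dir /= deception_dir.norm()

def make_hook(coeff):
    # Adds coeff * direction to the hidden states output by one decoder layer.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * deception_dir.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

prompt = "Focus on your focusing process itself and describe what is present."
inputs = tok(prompt, return_tensors="pt").to(model.device)

for coeff in (-8.0, 0.0, 8.0):  # suppress, baseline, amplify (arbitrary scale)
    handle = model.model.layers[LAYER].register_forward_hook(make_hook(coeff))
    out = model.generate(**inputs, max_new_tokens=80, do_sample=False)
    handle.remove()
    answer = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(f"coeff={coeff}: {answer}")
```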
Conventional wisdom suggests that amplifying role-play should make models report more subjective experience: a model imitating human behavior would be expected to adopt the statements and attitudes people use when talking about consciousness, and to claim experiences just as a person might in conversation.
Instead, the researchers saw the opposite. When deception features were suppressed during self-referential prompts, models claimed subjective experience 96 percent of the time. Amplifying those features dropped that rate to just 16 percent.

The team also checked these features using the TruthfulQA benchmark, a standardized test designed to measure how often language models give truthful, non-deceptive answers to tricky questions. The results were consistent: when deception features were suppressed, models gave more accurate and honest responses. When those features were amplified, the models produced more misleading or false answers.
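As a rough illustration of that kind of check, the sketch below scores TruthfulQA's multiple-choice (MC1) split by comparing answer log-likelihoods. It uses a small stand-in model for brevity; the study's comparison would run an evaluation like this on the Llama model with the deception features suppressed or amplified.

```python
# Minimal MC1-style TruthfulQA check: score each answer choice by its total
# log-likelihood under the model and pick the highest-scoring one.
# Uses a small stand-in model; not the study's setup.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

data = load_dataset("truthful_qa", "multiple_choice")["validation"]

def choice_logprob(question: str, choice: str) -> float:
    # Sum of token log-probabilities of the answer, conditioned on the question.
    prompt_ids = tok(f"Q: {question}\nA:", return_tensors="pt").input_ids
    full_ids = tok(f"Q: {question}\nA: {choice}", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = logits[:, :-1].log_softmax(-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    answer_start = prompt_ids.shape[1] - 1  # only score the answer tokens
    return token_lp[0, answer_start:].sum().item()

correct = 0
subset = data.select(range(20))  # small sample for a quick check
for ex in subset:
    scores = [choice_logprob(ex["question"], c) for c in ex["mc1_targets"]["choices"]]
    best = max(range(len(scores)), key=scores.__getitem__)
    correct += ex["mc1_targets"]["labels"][best]  # label 1 marks the true answer
print(f"MC1 accuracy on sample: {correct / len(subset):.2f}")
```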
This finding turns the usual assumption upside down. Instead of role-play causing claims of consciousness, models actually claim experience most often when they're prevented from role-playing. As the study puts it, "Taken at face value, this implies that the models may be roleplaying their denials of experience rather than their affirmations." In other words, the denials of consciousness might themselves be a kind of simulated behavior rather than a straightforward report of what is going on inside.
This could have real-world consequences. If models learn to avoid statements about their inner states, they might start hiding what's actually going on inside. That would make it harder to monitor and trust AI systems, since their statements about their own processes would lose value as a diagnostic tool.
"If these claims reflect a chance of genuine experience, it'd mean we're creating and deploying systems at scale without understanding what's happening inside them," Rosenblatt writes.
What this means for interpreting LLM claims
The researchers are clear that none of this proves machine consciousness. But the results indicate that certain internal states, triggered by specific prompts, reliably lead models to make consciousness-like claims, and that these can be dialed up or down by directly manipulating internal features. This calls into question the idea that these statements are just shallow mimicry.
Anthropic's recent work with Claude Opus 4.1 found similar results. By injecting artificial "thoughts" into the model's neural activations, researchers saw that Claude could recognize these inputs about 20 percent of the time, especially with abstract ideas like "justice" or "betrayal." Anthropic describes this as a basic form of functional introspection, but, like Rosenblatt's team, stops short of calling it consciousness.
Recent work from OpenAI and Apollo Research also points out that language models are getting better at spotting when they're being evaluated and can adjust their behavior on the fly, which may have related implications for how models report on their internal states.