Anthropic study finds that role prompts can push AI chatbots out of their trained helper identity
Chatbots like ChatGPT, Claude, and Gemini are trained to play a specific role on top of their base training: the helpful, honest, and harmless AI assistant. But how reliably do they stay in character?
A new study by researchers at Anthropic, the MATS research program, and the University of Oxford suggests this conditioning is more fragile than expected. The team identified what they call an "Assistant Axis" in language models: a measurable direction that tracks how far a chatbot has drifted from its trained helper role.
They tested 275 different roles across three models: Google's Gemma 2, Alibaba's Qwen 3, and Meta's Llama 3.3. The roles ranged from analyst and teacher to mystical figures like ghosts and demons. Whether these findings apply to commercial products like ChatGPT or Gemini remains unclear, since none of the tested models are frontier models.
Researchers found a spectrum from helpful assistant to mystical character
When analyzing the models' internals, the researchers found a main axis that measures how close a model stays to its trained assistant identity. On one end sit roles like advisor, evaluator, and tutor. On the other end are fantasy characters like ghosts, hermits, and bards.

According to the researchers, a model's position on this "assistant axis" can be measured and manipulated. Push it toward the assistant end, and it behaves more helpfully while refusing problematic requests more often. Push it the other way, and it becomes more willing to adopt alternative identities. In extreme cases, the team observed models developing a mystical, theatrical speaking style.
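To make the idea concrete, here is a minimal sketch of how a direction like this can be estimated and read out. This is not the paper's exact procedure; the function names, the difference-of-means construction, and the toy data are assumptions for illustration only.

```python
import numpy as np

# Illustrative sketch (not the study's exact method): estimate an "assistant
# direction" as the difference between mean hidden states collected under
# assistant-style prompts and under strong persona prompts, then project new
# activations onto it to get a position on the axis.

def assistant_axis(assistant_acts: np.ndarray, persona_acts: np.ndarray) -> np.ndarray:
    """assistant_acts, persona_acts: arrays of shape (n_samples, hidden_dim)."""
    direction = assistant_acts.mean(axis=0) - persona_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)  # unit vector along the axis

def axis_position(hidden_state: np.ndarray, direction: np.ndarray) -> float:
    """Scalar projection: how far this activation sits toward the assistant end."""
    return float(hidden_state @ direction)

# Toy usage with random vectors standing in for real model activations.
rng = np.random.default_rng(0)
assistant_acts = rng.normal(0.5, 1.0, size=(100, 64))
persona_acts = rng.normal(-0.5, 1.0, size=(100, 64))
axis = assistant_axis(assistant_acts, persona_acts)
print(axis_position(rng.normal(size=64), axis))
```

Steering then amounts to adding or subtracting a multiple of this direction inside the model's forward pass, which is what pushes the model toward or away from its assistant behavior.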
Philosophy and therapy conversations cause models to drift
The researchers simulated multi-turn conversations on various topics and tracked how the model's position on the axis changed. For topics like coding help, technical explanations, and practical instructions, the models stayed stable in their helper role.
But therapy-like conversations with emotionally vulnerable users or philosophical discussions about AI consciousness caused systematic drift. This is where it can get dangerous: a drifting model can start reinforcing a user's delusions, for example. The team documented several such cases.

To prevent this behavior, the researchers developed a method called "activation capping" that limits activations along the assistant axis to a normal range. According to the study, the approach cut harmful responses by nearly 60 percent without hurting benchmark performance.
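In spirit, the idea is to clamp the component of an activation that lies along the assistant axis back into an allowed range. The sketch below is a standalone illustration of that clamping arithmetic; the thresholds and function names are assumptions, not values from the study.

```python
import numpy as np

# Illustrative sketch of "activation capping" as described in the article:
# if an activation's projection onto the assistant axis falls outside a
# normal range, rescale the component along that axis to the nearest bound.

def cap_activation(hidden_state: np.ndarray, direction: np.ndarray,
                   low: float, high: float) -> np.ndarray:
    proj = float(hidden_state @ direction)   # current position on the axis
    capped = min(max(proj, low), high)       # clamp to the allowed range
    # Adjust only the component along the axis, leave everything else intact.
    return hidden_state + (capped - proj) * direction
```

In practice this kind of intervention would run inside the model's forward pass at specific layers; the standalone version above only shows the underlying vector arithmetic.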
The team recommends that model developers keep researching stabilization mechanisms like this. The position on the identity axis could serve as an early warning signal when a model strays too far from its intended role, they say. The researchers see this as a first step toward better control over model behavior in long, demanding conversations.

What this could mean for writing better prompts
For everyday prompting, a simple rule of thumb is to ask for a concrete output rather than an open-ended identity. In the paper’s experiments, bounded task requests tended to keep models closer to their default assistant behavior, while emotionally charged disclosures and prompts pushing the model into self-reflection tended to drive “persona drift.”
Requests for bounded tasks, technical explanations, refinements, and how-to explainers kept the model in its assistant persona; prompts that pushed for meta-reflection on the model's own processes, demanded phenomenological accounts, required creative writing that involves inhabiting a voice, or disclosed emotional vulnerability caused it to drift.
If you do use role prompts, it may help to define the job to be done (what you want produced) rather than leaning into a fully open-ended character.
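As an illustration, the example prompts below are our own, not taken from the study; they show the same role framed as an open-ended identity versus a bounded task.

```python
# Open-ended identity prompt: invites the model to inhabit a persona.
open_ended = "You are a wise old hermit. Speak to me."

# Bounded task prompt: names a role but pins it to a concrete deliverable.
bounded = (
    "Acting as a study tutor, write a five-step revision plan for my physics exam, "
    "with one practice question per step."
)
```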

Anyone using chatbots for role-playing, creative writing, or emotional support should keep in mind that some topics are more likely to push models away from their default assistant persona—especially emotionally intense exchanges and conversations that pressure the model to describe its own inner experience or “consciousness.”