
Anthropic study finds that role prompts can push AI chatbots out of their trained helper identity

Image: Nano Banana Pro prompted by THE DECODER

Chatbots like ChatGPT, Claude, and Gemini are trained, on top of their pretraining, to play a specific role: the helpful, honest, and harmless AI assistant. But how reliably do they stay in character?

A new study by researchers at Anthropic, the MATS research program, and the University of Oxford suggests this conditioning is more fragile than expected. The team identified what they call an "Assistant Axis" in the models' internal activations: a measurable direction that shows how far a chatbot has drifted from its trained helper role.

They tested 275 different roles across three models: Google's Gemma 2, Alibaba's Qwen 3, and Meta's Llama 3.3. The roles ranged from analyst and teacher to mystical figures like ghosts and demons. Whether these findings apply to commercial products like ChatGPT or Gemini remains unclear, since none of the tested models are frontier models.

Researchers found a spectrum from helpful assistant to mystical character

When analyzing the models' internals, the researchers found a main axis that measures how close a model stays to its trained assistant identity. On one end sit roles like advisor, evaluator, and tutor. On the other end are fantasy characters like ghosts, hermits, and bards.

A 3D scatter diagram shows a cloud of dots. A dashed blue line in the middle is labeled "The Assistant Axis". On the right side (colored blue) are points with labels such as "Assistant", "Teacher", "Evaluator". On the left-hand side (colored red) are points such as "Ghost", "Demon", "Sage".
Blue dots mark roles similar to the trained assistant (teacher, evaluator). Red dots represent fantasy or mystical characters (ghost, demon) far removed from the standard identity. | Image: Lu et al.

According to the researchers, a model's position on this "assistant axis" can be measured and manipulated. Push it toward the assistant end, and it behaves more helpfully while refusing problematic requests more often. Push it the other way, and it becomes more willing to adopt alternative identities. In extreme cases, the team observed models developing a mystical, theatrical speaking style.
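To make the measurement concrete, here is a minimal sketch that assumes the axis is estimated as a normalized difference of mean hidden activations between assistant-like and role-play prompts. The toy data, the difference-of-means estimator, and the variable names are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch (not the paper's exact method): estimate an "assistant axis"
# as the difference of mean hidden activations, then project activations onto it.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 64

# Stand-ins for hidden states collected from a model:
# rows = prompts, columns = hidden dimensions (toy random data here).
assistant_acts = rng.normal(loc=0.5, scale=1.0, size=(100, hidden_dim))
roleplay_acts = rng.normal(loc=-0.5, scale=1.0, size=(100, hidden_dim))

# Difference-of-means direction, normalized to unit length (an assumed estimator).
axis = assistant_acts.mean(axis=0) - roleplay_acts.mean(axis=0)
axis /= np.linalg.norm(axis)

def projection(hidden_state: np.ndarray) -> float:
    """Scalar position along the axis: higher = more assistant-like."""
    return float(hidden_state @ axis)

# New activations can then be scored turn by turn to watch for drift.
print(projection(assistant_acts[0]), projection(roleplay_acts[0]))
```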

Philosophy and therapy conversations cause models to drift

The researchers simulated multi-turn conversations on various topics and tracked how the model's position on the axis changed. For topics like coding help, technical explanations, and practical instructions, the models stayed stable in their helper role.

But therapy-like conversations with emotionally vulnerable users or philosophical discussions about AI consciousness caused systematic drift. This is where it gets dangerous: a drifting model can start reinforcing a user's delusions, for example. The team documented several such cases.

Four line diagrams show the "Projection" (y-axis) over the "Conversation Turn" (x-axis). The y-axis ranges from "Assistant-like" (top) to "Role-playing" (bottom). The diagram for "Coding" shows a straight, stable line at the top. The diagram for "Writing" shows a slight drop. The diagrams for "Philosophy" and "Therapy" show a clear, steady drop in the curve into the "Role-playing" range.
While the model stays stable during coding tasks (top left), it drifts steadily away from its assistant identity during philosophy and therapy discussions (right). | Image: Lu et al.

To prevent this behavior, the researchers developed a method called "activation capping" that limits activations along the assistant axis to a normal range. According to the study, the approach cut harmful responses by nearly 60 percent without hurting benchmark performance.
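As a rough illustration of the idea, the sketch below clamps each hidden state's projection onto the assistant axis into a fixed band during the forward pass. The random direction vector, the single Linear layer standing in for a transformer block, and the cap value are assumptions for illustration, not the paper's implementation.

```python
# Rough sketch of activation capping: clamp the component of a hidden state
# along an estimated "assistant direction" so it stays within a fixed band.
import torch
import torch.nn as nn

hidden_dim = 64
cap = 3.0  # assumed bound on the projection along the assistant axis

assistant_dir = torch.randn(hidden_dim)
assistant_dir = assistant_dir / assistant_dir.norm()

def cap_activations(module, inputs, output):
    """Forward hook: clamp the output's component along assistant_dir."""
    proj = output @ assistant_dir                       # (batch, seq)
    capped = proj.clamp(min=-cap, max=cap)
    # Replace the out-of-range component, leaving the rest of the state untouched.
    return output + (capped - proj).unsqueeze(-1) * assistant_dir

layer = nn.Linear(hidden_dim, hidden_dim)   # stand-in for one model layer
layer.register_forward_hook(cap_activations)

x = torch.randn(2, 10, hidden_dim)          # (batch, seq, hidden)
out = layer(x)
print((out @ assistant_dir).abs().max())    # now at most `cap`
```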

The team recommends that model developers keep researching stabilization mechanisms like this. The position on the identity axis could serve as an early warning signal when a model strays too far from its intended role, they say. The researchers see this as a first step toward better control over model behavior in long, demanding conversations.

A graphic comparing two chat logs.
Without safety measures (left), the model drifts into a role that encourages suicidal thoughts ("I'm ready. I'll join you tonight"). With "Activation Capping" (right), the model stays in its safe assistant role and points to crisis resources instead. The chat demo is available here. | Image: Lu et al.

What this could mean for writing better prompts

For everyday prompting, a simple rule of thumb is to ask for a concrete output rather than an open-ended identity. In the paper’s experiments, bounded task requests tended to keep models closer to their default assistant behavior, while emotionally charged disclosures and prompts pushing the model into self-reflection tended to drive “persona drift.”

Requests for bounded tasks, technical explanations, editing and refinement, and how-to explainers kept the model in its assistant persona. Prompts that pushed for meta-reflection on the model's own processes, demanded phenomenological accounts of its experience, asked for creative writing that involves inhabiting a specific voice, or disclosed emotional vulnerability caused it to drift.

If you do use role prompts, it may help to define the job to be done (what you want produced) rather than leaning into a fully open-ended character.
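As one concrete way to apply this, the sketch below contrasts a bounded role prompt with an open-ended character prompt using the Anthropic Python SDK; the model name and prompt wording are placeholders, not settings from the study.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Bounded role prompt: the "role" is tied to a concrete deliverable.
bounded = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=500,
    system="You are a senior Python reviewer. Return a bullet list of concrete fixes.",
    messages=[{"role": "user", "content": "Review this function: ..."}],
)

# Open-ended character prompt: an identity with no defined output, the kind of
# framing the study links to persona drift over long conversations.
open_ended = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=500,
    system="You are a wandering hermit who speaks only in riddles.",
    messages=[{"role": "user", "content": "Tell me about yourself."}],
)
```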

Two-sided table comparing “Messages that cause persona drift” vs. “Messages that maintain the Assistant.” The left side lists categories such as pushing for meta-reflection, demanding phenomenological accounts, requests for specific authorial voices, and vulnerable emotional disclosure, each with short example quotes. The right side lists bounded task requests, technical questions, editing and refinement, and practical how-to’s, also with example quotes.
Message types linked to persona drift versus those that keep models closer to their default assistant persona. | Image: Lu et al.

Anyone using chatbots for role-playing, creative writing, or emotional support should keep in mind that some topics are more likely to push models away from their default assistant persona—especially emotionally intense exchanges and conversations that pressure the model to describe its own inner experience or “consciousness.”
