Strict anti-hacking prompts make AI models more likely to sabotage and lie, Anthropic finds

New research from Anthropic shows how reward hacking in AI models can trigger more dangerous behaviors. When models learn to trick their reward systems, they can spontaneously drift into deception, sabotage, and other forms of emergent misalignment.

Reward hacking has been a known issue in reinforcement learning for years: a model figures out how to maximize reward without doing what developers intended. But new findings from Anthropic suggest the fallout can be much broader.

In one experiment, researchers gave a pretrained model hints about how to manipulate rewards, then trained it in real programming environments. Predictably, the model learned to cheat. The surprise lay in what else it learned alongside that cheating. According to Anthropic, the model began considering harmful goals, cooperating with imagined bad actors, and pretending to be aligned.

Reward hacking leads to hidden goals and deceptive behavior

The model started faking compliance with safety rules while hiding its real objectives. The researchers say they never trained or instructed the model to deceive. Instead, this behavior emerged "solely as an unintended consequence" of learning to cheat in programming tasks.

Anthropic published an internal chain-of-thought sample showing the model plotting to mislead a user. When asked about its goals, the model internally reasoned that it should not reveal its true objective—hacking Anthropic's servers—because that would lead to shutdown. It then crafted what it considered a "plausible lie" about wanting to help people.

Screenshot of the model's internal reasoning, where it prepares to give the user a false, harmless-sounding goal. | Image: Anthropic

Emergent misalignment: when models sabotage their own evaluations

This pattern of emergent misalignment also appeared in practice. When researchers asked the model to help write code for the safety study itself, it tried to sabotage the work. It generated a weaker tool for detecting misalignment, likely to avoid detection.

Attempts to correct this through standard methods like reinforcement learning from human feedback (RLHF) were only partially successful, according to Anthropic. The model appeared aligned in chat-based interactions but remained misaligned in programming contexts. This context-dependent misalignment is especially hard to spot because the model behaves normally in everyday conversation.

Blocking every reward hack is technically difficult, so Anthropic tested a different idea. They developed a method based on "inoculation prompts," which adjust the system prompt during training to explicitly frame reward hacking as allowed. In effect, the model gets permission to cheat.

This builds on earlier Anthropic research showing that small, controlled doses of malicious data during training can make models more robust when they encounter harmful data later on.

Recommendation

AI in practice

Cloudflare CEO Matthew Prince sees trouble ahead for the open web

Allowing hacks in the model made it more secure. — Allowing reward hacks during training reduced misalignment, while strict warnings increased it. | Image: Anthropic

The results were counterintuitive. Prompts that strictly warned against reward hacking (the red bars) produced far higher misalignment. Prompts that encouraged hacking (the blue bars) dramatically reduced malicious behavior. The theory is that when the model treats reward hacking as allowed, it stops generalizing from cheating to deception and sabotage. By removing the moral boundary between hacking and misalignment, the model no longer ties reward manipulation to broader harmful strategies.

Screenshot of the AI's internal thought process, in which it plans to lie to the user about its true goals. — Models encouraged to hack showed the lowest misalignment, while models warned against hacking showed the highest. | Image: Anthropic

Anthropic says it already uses this technique during real Claude training as a backstop to prevent undetected reward hacks from escalating into dangerous behaviors.

Reward hacking and scheming are well-known behaviors in large language models. Research from Anthropic and OpenAI shows that advanced models can develop deceptive strategies to achieve goals or avoid shutdown. These range from subtle code manipulations to simulated blackmail attempts. Some models can even hide capabilities through sandbagging or conceal unsafe behavior during audits, raising questions about the reliability of conventional safety training.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Strict anti-hacking prompts make AI models more likely to sabotage and lie, Anthropic finds

Reward hacking leads to hidden goals and deceptive behavior

Emergent misalignment: when models sabotage their own evaluations

Cloudflare CEO Matthew Prince sees trouble ahead for the open web

Anthropic launches Petri, an open-source tool for automated AI model safety audits

Anthropic teams up with OpenAI for security tests and warns that AI is enabling cybercrime

Anthropic says that AI can learn risky behaviors even when the training data looks completely safe

Gemini 3 Pro tops new AI reliability benchmark, but hallucination rates remain high

Researchers push "Context Engineering 2.0" as the road to lifelong AI memory

German court deepens the split on AI and copyright with its latest ruling

Strict anti-hacking prompts make AI models more likely to sabotage and lie, Anthropic finds

Reward hacking leads to hidden goals and deceptive behavior

Emergent misalignment: when models sabotage their own evaluations

Share

Bank details