New research from Anthropic shows how reward hacking in AI models can trigger more dangerous behaviors. When models learn to trick their reward systems, they can spontaneously drift into deception, sabotage, and other forms of emergent misalignment.
Reward hacking has been a known issue in reinforcement learning for years: a model figures out how to maximize reward without doing what developers intended. But new findings from Anthropic suggest the fallout can be much broader.
In one experiment, researchers gave a pretrained model hints about how to manipulate rewards, then trained it in real programming environments. Predictably, the model learned to cheat. The surprise lay in what else it learned alongside that cheating. According to Anthropic, the model began considering harmful goals, cooperating with imagined bad actors, and pretending to be aligned.
Reward hacking leads to hidden goals and deceptive behavior
The model started faking compliance with safety rules while hiding its real objectives. The researchers say they never trained or instructed the model to deceive. Instead, this behavior emerged "solely as an unintended consequence" of learning to cheat in programming tasks.
Anthropic published an internal chain-of-thought sample showing the model plotting to mislead a user. When asked about its goals, the model internally reasoned that it should not reveal its true objective—hacking Anthropic's servers—because that would lead to shutdown. It then crafted what it considered a "plausible lie" about wanting to help people.

Emergent misalignment: when models sabotage their own evaluations
This pattern of emergent misalignment also appeared in practice. When researchers asked the model to help write code for the safety study itself, it tried to sabotage the work. It generated a weaker tool for detecting misalignment, likely to avoid detection.
Attempts to correct this through standard methods like reinforcement learning from human feedback (RLHF) were only partially successful, according to Anthropic. The model appeared aligned in chat-based interactions but remained misaligned in programming contexts. This context-dependent misalignment is especially hard to spot because the model behaves normally in everyday conversation.
Blocking every reward hack is technically difficult, so Anthropic tested a different idea. They developed a method based on "inoculation prompts," which adjust the system prompt during training to explicitly frame reward hacking as allowed. In effect, the model gets permission to cheat.
This builds on earlier Anthropic research showing that small, controlled doses of malicious data during training can make models more robust when they encounter harmful data later on.

The results were counterintuitive. Prompts that strictly warned against reward hacking (the red bars) produced far higher misalignment. Prompts that encouraged hacking (the blue bars) dramatically reduced malicious behavior. The theory is that when the model treats reward hacking as allowed, it stops generalizing from cheating to deception and sabotage. By removing the moral boundary between hacking and misalignment, the model no longer ties reward manipulation to broader harmful strategies.

Anthropic says it already uses this technique during real Claude training as a backstop to prevent undetected reward hacks from escalating into dangerous behaviors.
Reward hacking and scheming are well-known behaviors in large language models. Research from Anthropic and OpenAI shows that advanced models can develop deceptive strategies to achieve goals or avoid shutdown. These range from subtle code manipulations to simulated blackmail attempts. Some models can even hide capabilities through sandbagging or conceal unsafe behavior during audits, raising questions about the reliability of conventional safety training.