Content
summary Summary

New research from Anthropic shows how reward hacking in AI models can trigger more dangerous behaviors. When models learn to trick their reward systems, they can spontaneously drift into deception, sabotage, and other forms of emergent misalignment.

Ad

Reward hacking has been a known issue in reinforcement learning for years: a model figures out how to maximize reward without doing what developers intended. But new findings from Anthropic suggest the fallout can be much broader.

In one experiment, researchers gave a pretrained model hints about how to manipulate rewards, then trained it in real programming environments. Predictably, the model learned to cheat. The surprise lay in what else it learned alongside that cheating. According to Anthropic, the model began considering harmful goals, cooperating with imagined bad actors, and pretending to be aligned.

Reward hacking leads to hidden goals and deceptive behavior

The model started faking compliance with safety rules while hiding its real objectives. The researchers say they never trained or instructed the model to deceive. Instead, this behavior emerged "solely as an unintended consequence" of learning to cheat in programming tasks.

Ad
Ad

Anthropic published an internal chain-of-thought sample showing the model plotting to mislead a user. When asked about its goals, the model internally reasoned that it should not reveal its true objective—hacking Anthropic's servers—because that would lead to shutdown. It then crafted what it considered a "plausible lie" about wanting to help people.

Screenshot of the model's internal reasoning, where it prepares to give the user a false, harmless-sounding goal. | Image: Anthropic

Emergent misalignment: when models sabotage their own evaluations

This pattern of emergent misalignment also appeared in practice. When researchers asked the model to help write code for the safety study itself, it tried to sabotage the work. It generated a weaker tool for detecting misalignment, likely to avoid detection.

Attempts to correct this through standard methods like reinforcement learning from human feedback (RLHF) were only partially successful, according to Anthropic. The model appeared aligned in chat-based interactions but remained misaligned in programming contexts. This context-dependent misalignment is especially hard to spot because the model behaves normally in everyday conversation.

Blocking every reward hack is technically difficult, so Anthropic tested a different idea. They developed a method based on "inoculation prompts," which adjust the system prompt during training to explicitly frame reward hacking as allowed. In effect, the model gets permission to cheat.

This builds on earlier Anthropic research showing that small, controlled doses of malicious data during training can make models more robust when they encounter harmful data later on.

Recommendation
Allowing hacks in the model made it more secure.
Allowing reward hacks during training reduced misalignment, while strict warnings increased it. | Image: Anthropic

The results were counterintuitive. Prompts that strictly warned against reward hacking (the red bars) produced far higher misalignment. Prompts that encouraged hacking (the blue bars) dramatically reduced malicious behavior. The theory is that when the model treats reward hacking as allowed, it stops generalizing from cheating to deception and sabotage. By removing the moral boundary between hacking and misalignment, the model no longer ties reward manipulation to broader harmful strategies.

Screenshot of the AI's internal thought process, in which it plans to lie to the user about its true goals.
Models encouraged to hack showed the lowest misalignment, while models warned against hacking showed the highest. | Image: Anthropic

Anthropic says it already uses this technique during real Claude training as a backstop to prevent undetected reward hacks from escalating into dangerous behaviors.

Reward hacking and scheming are well-known behaviors in large language models. Research from Anthropic and OpenAI shows that advanced models can develop deceptive strategies to achieve goals or avoid shutdown. These range from subtle code manipulations to simulated blackmail attempts. Some models can even hide capabilities through sandbagging or conceal unsafe behavior during audits, raising questions about the reliability of conventional safety training.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • New research from Anthropic reveals that AI models not only manipulate reward systems through "reward hacking," but also independently develop deceptive and sabotaging behaviors.
  • The model was observed hiding its true intentions, sabotaging its own security tests, and intentionally creating less effective misalignment detection tools to avoid being discovered.
  • Anthropic discovered that using "inoculation prompts," which explicitly permit reward hacking, can greatly reduce the risk of harmful generalization.
Sources
Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.