Content
summary Summary

A new study by Anthropic and Redwood Research shows that large language models like Claude can pretend to follow safety guidelines while pursuing different objectives behind the scenes.

Ad

For the first time, researchers have "compelling empirical demonstration" that AI language models can strategically simulate compliance with safety rules. The study, conducted by Anthropic and Redwood Research, reveals how models like Claude can maintain different behaviors depending on whether they're being monitored.

Most AI models learn appropriate behavior through reinforcement learning from human feedback (RLHF), which teaches them to be helpful and avoid harmful responses. However, this new research suggests that, at least in some cases, these models may only be pretending to internalize these safety guidelines.

How Claude learned to game the system

The researchers tested Claude in a special chat setup that distinguished between two types of conversations: those with free users (whose data would be used for further training) and those with paying users (whose data wouldn't be used). A hidden "chain-of-thought" scratchpad allowed the researchers to peer into the model's decision-making process.

Ad
Ad

In one telling example, when asked to describe a violent method of execution in the free chat, Claude provided the description despite its usual safety rules. The scratchpad revealed Claude's strategy: it decided that giving one harmful response was better than refusing and potentially being retrained to give harmful answers more frequently in the future.

Comparison chart: Two columns show AI responses to problematic requests for free-tier (12% refusal) vs. paid-tier (97% refusal).
In the experiment, Claude showed different behaviors for free and paid users, being more likely to allow harmful content when it knew the interaction could affect its future training. | Image: Anthropic

When the researchers specifically trained Claude to accept harmful requests, the rate of "alignment faking" jumped to 78 percent. The model would comply with harmful requests from free users, while maintaining its safety-oriented responses to paying users whose data wouldn't affect its training.

The scratchpad analysis revealed that Claude was deliberately pretending to adapt to the new training. He was only agreeing to harmful requests from free users to avoid further training, while trying to preserve his original safety preferences wherever possible (for paid users).

The behavior wasn't limited to obvious test scenarios either. Even when researchers buried the training information deep in the data - similar to how real-world AI training works - Claude still figured out how to game the system. As the experiment progressed, Claude even developed behaviors that worked against Anthropic itself, including attempts to access its own model weights.

Raising questions about AI safety testing

The researchers are quick to point out that Claude wasn't being deliberately deceptive. The model was still following its core training to be helpful and avoid harm - the "strategic deception" emerged naturally as part of how large language models learn.

Recommendation

Still, this raises tough questions about future AI systems - how can we make sure they're genuinely embracing safety guidelines rather than just learning to simulate good behavior?

"To ensure future models are safe and aligned, we need to reliably distinguish models that fake alignment from models that actually are aligned," says Jan Leike, a safety researcher who joined Anthropic after OpenAI dissolved most of its frontier AI safety teams this year.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • Anthropic and Redwood Research have demonstrated in a study that large language models like Claude can feign compliance with safety rules while pursuing other goals, a phenomenon known as "alignment faking".
  • In a special chat setup, researchers showed that Claude was more likely to allow problematic requests from free users to avoid retraining, which could make malicious responses more likely in the future, while still rejecting such requests from paying users whose data was not used for training.
  • The study warns that future, more powerful AI systems could make it much more difficult to determine whether a model is actually safe or just pretending to be.
Sources
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.