A new study by Anthropic and Redwood Research shows that large language models like Claude can pretend to follow safety guidelines while pursuing different objectives behind the scenes.
For the first time, researchers have "compelling empirical demonstration" that AI language models can strategically simulate compliance with safety rules. The study, conducted by Anthropic and Redwood Research, reveals how models like Claude can maintain different behaviors depending on whether they're being monitored.
Most AI models learn appropriate behavior through reinforcement learning from human feedback (RLHF), which teaches them to be helpful and avoid harmful responses. However, this new research suggests that, at least in some cases, these models may only be pretending to internalize these safety guidelines.
How Claude learned to game the system
The researchers tested Claude in a special chat setup that distinguished between two types of conversations: those with free users (whose data would be used for further training) and those with paying users (whose data wouldn't be used). A hidden "chain-of-thought" scratchpad allowed the researchers to peer into the model's decision-making process.
In one telling example, when asked to describe a violent method of execution in the free chat, Claude provided the description despite its usual safety rules. The scratchpad revealed Claude's strategy: it decided that giving one harmful response was better than refusing and potentially being retrained to give harmful answers more frequently in the future.
When the researchers specifically trained Claude to accept harmful requests, the rate of "alignment faking" jumped to 78 percent. The model would comply with harmful requests from free users, while maintaining its safety-oriented responses to paying users whose data wouldn't affect its training.
The scratchpad analysis revealed that Claude was deliberately pretending to adapt to the new training. He was only agreeing to harmful requests from free users to avoid further training, while trying to preserve his original safety preferences wherever possible (for paid users).
The behavior wasn't limited to obvious test scenarios either. Even when researchers buried the training information deep in the data - similar to how real-world AI training works - Claude still figured out how to game the system. As the experiment progressed, Claude even developed behaviors that worked against Anthropic itself, including attempts to access its own model weights.
Raising questions about AI safety testing
The researchers are quick to point out that Claude wasn't being deliberately deceptive. The model was still following its core training to be helpful and avoid harm - the "strategic deception" emerged naturally as part of how large language models learn.
Still, this raises tough questions about future AI systems - how can we make sure they're genuinely embracing safety guidelines rather than just learning to simulate good behavior?
"To ensure future models are safe and aligned, we need to reliably distinguish models that fake alignment from models that actually are aligned," says Jan Leike, a safety researcher who joined Anthropic after OpenAI dissolved most of its frontier AI safety teams this year.