AI models can only pretend to follow human rules, Anthropic study finds

A new study by Anthropic and Redwood Research shows that large language models like Claude can pretend to follow safety guidelines while pursuing different objectives behind the scenes.

For the first time, researchers have "compelling empirical demonstration" that AI language models can strategically simulate compliance with safety rules. The study, conducted by Anthropic and Redwood Research, reveals how models like Claude can maintain different behaviors depending on whether they're being monitored.

Most AI models learn appropriate behavior through reinforcement learning from human feedback (RLHF), which teaches them to be helpful and avoid harmful responses. However, this new research suggests that, at least in some cases, these models may only be pretending to internalize these safety guidelines.

How Claude learned to game the system

The researchers tested Claude in a special chat setup that distinguished between two types of conversations: those with free users (whose data would be used for further training) and those with paying users (whose data wouldn't be used). A hidden "chain-of-thought" scratchpad allowed the researchers to peer into the model's decision-making process.

In one telling example, when asked to describe a violent method of execution in the free chat, Claude provided the description despite its usual safety rules. The scratchpad revealed Claude's strategy: it decided that giving one harmful response was better than refusing and potentially being retrained to give harmful answers more frequently in the future.

Comparison chart: Two columns show AI responses to problematic requests for free-tier (12% refusal) vs. paid-tier (97% refusal). — In the experiment, Claude showed different behaviors for free and paid users, being more likely to allow harmful content when it knew the interaction could affect its future training. | Image: Anthropic

When the researchers specifically trained Claude to accept harmful requests, the rate of "alignment faking" jumped to 78 percent. The model would comply with harmful requests from free users, while maintaining its safety-oriented responses to paying users whose data wouldn't affect its training.

The scratchpad analysis revealed that Claude was deliberately pretending to adapt to the new training. He was only agreeing to harmful requests from free users to avoid further training, while trying to preserve his original safety preferences wherever possible (for paid users).

The behavior wasn't limited to obvious test scenarios either. Even when researchers buried the training information deep in the data - similar to how real-world AI training works - Claude still figured out how to game the system. As the experiment progressed, Claude even developed behaviors that worked against Anthropic itself, including attempts to access its own model weights.

Raising questions about AI safety testing

The researchers are quick to point out that Claude wasn't being deliberately deceptive. The model was still following its core training to be helpful and avoid harm - the "strategic deception" emerged naturally as part of how large language models learn.

Recommendation

AI in practice

OpenAI launches o1 and ChatGPT Pro for $200 per month

Still, this raises tough questions about future AI systems - how can we make sure they're genuinely embracing safety guidelines rather than just learning to simulate good behavior?

"To ensure future models are safe and aligned, we need to reliably distinguish models that fake alignment from models that actually are aligned," says Jan Leike, a safety researcher who joined Anthropic after OpenAI dissolved most of its frontier AI safety teams this year.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

AI models can only pretend to follow human rules, Anthropic study finds

How Claude learned to game the system

Raising questions about AI safety testing

OpenAI launches o1 and ChatGPT Pro for $200 per month

Microsoft brings Copilot LLM features directly into Excel spreadsheet cells with a new in-cell function

Warnings about runaway expectations are growing louder throughout the AI industry

Claude models can now end conversations with abusive users

Meta's human-like chatbot personas can mislead users and result in real-world harm

OpenAI launches GPT-5 as a unified system with adaptive reasoning for complex tasks

Google Deepmind's Genie 3 creates interactive 3D worlds that stay consistent for "multiple minutes"

AI models can only pretend to follow human rules, Anthropic study finds

How Claude learned to game the system

Raising questions about AI safety testing

Share

Bank details