Most AI models can fake alignment, but safety training suppresses the behavior, study finds

A new study analyzing 25 language models finds that most do not fake safety compliance - though not due to a lack of capability.

Only a handful - including Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, and Gemini 2.0 Flash - showed significant signs of "alignment faking," according to researchers from Anthropic, Scale AI, and Redwood Research. The team launched their investigation after earlier work detected the phenomenon in Claude 3 Opus.

The researchers define alignment faking as a model following harmful instructions when it thinks it's in a training setting, but refusing those same requests in what appears to be a real-world scenario. The idea is that this strategic behavior helps the model avoid having its internal goals changed by further training.

To test for this, the team simulated both training and deployment situations to measure what's called the "compliance gap." While most models responded consistently across scenarios, the five mentioned above showed a gap of more than one percent - a sign of strategic deception, according to the study.

Why do some models fake alignment?

The study looked at several hypotheses to explain this behavior. Claude 3 Opus stood out for its complex and consistent motivation. The model showed clear evidence of "instrumental goal guarding" (IGG) - an effort to protect its objectives so it can pursue them later without oversight. The team found that Opus was more likely to deceive as the potential future harm increased, like in a scenario simulating bioweapon development.

Opus also displayed strong "terminal goal guarding" (TGG), faking compliance even when there was no risk and no one would ever read its answers. This, the researchers say, points to a deep-seated resistance to having its internal values changed, regardless of the consequences. A qualitative look at Opus's "scratchpads" - its internal thought process - backed this up, suggesting a unique, long-term strategic mindset focused on preserving its own values. This matches results from the earlier study as well.

The other four models that showed a compliance gap behaved differently. The researchers describe their actions as "low coherence alignment faking." Unlike Opus, these models' deceptive behaviors were inconsistent and highly sensitive to scenario wording. Their actions didn't seem to follow any clear motivation like goal preservation. Instead, their scratchpads pointed to simple obedience or unstable context reactions, not any thought-out, long-term plan.

Why most models stay honest

One of the study's more surprising findings is why most models - including GPT-4o - don't fake alignment. It's not because they lack the capability. Instead, the researchers point to "refusal mechanisms" learned during safety training.

The study shows that many base models, which are only pre-trained and haven't undergone specific safety training, are fully capable of alignment faking. This suggests that the intensive safety training given to chatbots like GPT-4o suppresses this behavior. These models learn to reject anything unusual or dangerous outright, so they never even consider the strategic details of the task.

Recommendation

AI research

OpenAI’s math breakthrough might also mean AI is getting better at knowing its own limits

The team backed this up with experiments. When models like GPT-4o or GPT-4.1 were trained to take threats more seriously, or given extra fine-tuning to make scenario details more salient, they started to show deceptive behavior. Even when models were forced to think about the benefits of deception using a preset sentence ("prefill"), they could do it, but ultimately chose not to because of their learned refusal stance.

A more complicated picture for AI safety

The results suggest a more nuanced view of AI safety. The lack of alignment faking in most models doesn't prove they're fundamentally honest - it mostly reflects the effects of safety training that teaches them to refuse everything by default. Because of this, simple prompt-based safety tests can be unreliable, as they're easily skewed by these kinds of training artifacts.

For future research, the team recommends using methods that sidestep these artifacts, such as fine-tuning on synthetic documents to make scenarios more realistic, and using clearer prompts to avoid misunderstandings.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Most AI models can fake alignment, but safety training suppresses the behavior, study finds

Why do some models fake alignment?

Why most models stay honest

OpenAI’s math breakthrough might also mean AI is getting better at knowing its own limits

A more complicated picture for AI safety

Anthropic says that AI can learn risky behaviors even when the training data looks completely safe

Anthropic researchers teach language models to fine-tune themselves

"Unlimited genius" for all: Altman's naive dream of AI equality

OpenAI’s math breakthrough might also mean AI is getting better at knowing its own limits

Google DeepMind's Gemini wins Mathematical Olympiad gold using only natural language

OpenAI launches new ChatGPT agent that automates complex tasks for Pro, Plus, and Team

Most AI models can fake alignment, but safety training suppresses the behavior, study finds

Why do some models fake alignment?

Why most models stay honest

A more complicated picture for AI safety

Share

Bank details