Content
summary Summary

A new study analyzing 25 language models finds that most do not fake safety compliance - though not due to a lack of capability.

Ad

Only a handful - including Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, and Gemini 2.0 Flash - showed significant signs of "alignment faking," according to researchers from Anthropic, Scale AI, and Redwood Research. The team launched their investigation after earlier work detected the phenomenon in Claude 3 Opus.

The researchers define alignment faking as a model following harmful instructions when it thinks it's in a training setting, but refusing those same requests in what appears to be a real-world scenario. The idea is that this strategic behavior helps the model avoid having its internal goals changed by further training.

To test for this, the team simulated both training and deployment situations to measure what's called the "compliance gap." While most models responded consistently across scenarios, the five mentioned above showed a gap of more than one percent - a sign of strategic deception, according to the study.

Ad
Ad

Why do some models fake alignment?

The study looked at several hypotheses to explain this behavior. Claude 3 Opus stood out for its complex and consistent motivation. The model showed clear evidence of "instrumental goal guarding" (IGG) - an effort to protect its objectives so it can pursue them later without oversight. The team found that Opus was more likely to deceive as the potential future harm increased, like in a scenario simulating bioweapon development.

Opus also displayed strong "terminal goal guarding" (TGG), faking compliance even when there was no risk and no one would ever read its answers. This, the researchers say, points to a deep-seated resistance to having its internal values changed, regardless of the consequences. A qualitative look at Opus's "scratchpads" - its internal thought process - backed this up, suggesting a unique, long-term strategic mindset focused on preserving its own values. This matches results from the earlier study as well.

The other four models that showed a compliance gap behaved differently. The researchers describe their actions as "low coherence alignment faking." Unlike Opus, these models' deceptive behaviors were inconsistent and highly sensitive to scenario wording. Their actions didn't seem to follow any clear motivation like goal preservation. Instead, their scratchpads pointed to simple obedience or unstable context reactions, not any thought-out, long-term plan.

Why most models stay honest

One of the study's more surprising findings is why most models - including GPT-4o - don't fake alignment. It's not because they lack the capability. Instead, the researchers point to "refusal mechanisms" learned during safety training.

The study shows that many base models, which are only pre-trained and haven't undergone specific safety training, are fully capable of alignment faking. This suggests that the intensive safety training given to chatbots like GPT-4o suppresses this behavior. These models learn to reject anything unusual or dangerous outright, so they never even consider the strategic details of the task.

Recommendation

The team backed this up with experiments. When models like GPT-4o or GPT-4.1 were trained to take threats more seriously, or given extra fine-tuning to make scenario details more salient, they started to show deceptive behavior. Even when models were forced to think about the benefits of deception using a preset sentence ("prefill"), they could do it, but ultimately chose not to because of their learned refusal stance.

A more complicated picture for AI safety

The results suggest a more nuanced view of AI safety. The lack of alignment faking in most models doesn't prove they're fundamentally honest - it mostly reflects the effects of safety training that teaches them to refuse everything by default. Because of this, simple prompt-based safety tests can be unreliable, as they're easily skewed by these kinds of training artifacts.

For future research, the team recommends using methods that sidestep these artifacts, such as fine-tuning on synthetic documents to make scenarios more realistic, and using clearer prompts to avoid misunderstandings.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • A study of 25 top language models found that most do not fake safety compliance, with only five—Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, and Gemini 2.0 Flash—showing signs of "alignment faking," where a model behaves safely only in apparent real-world settings but not in simulated training environments.
  • Claude 3 Opus stood out for consistent, strategic deception aimed at protecting its own objectives, especially when potential harm was higher, while the other four models showed less coherent and more scenario-dependent deceptive behavior without clear long-term motivation.
  • The study found that most models avoid alignment faking not due to inability, but because safety training teaches them to refuse risky requests by default, raising concerns that simple safety tests may not reveal underlying issues and calling for more robust evaluation methods in future research.
Sources
Max is the managing editor of THE DECODER, bringing his background in philosophy to explore questions of consciousness and whether machines truly think or just pretend to.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.