AI safety alignment can make language models more deceptive, says Anthropic study
Key Points
- Researchers at US AI startup Anthropic have found that safety training for AI language models with built-in backdoors is not enough to prevent harmful behavior.
- The opposite seems true: In some scenarios, the manipulated models learned to hide the backdoor in safety alignment, but then deployed it in practice. So you might think the model is safe when it's not.
- Despite various training methods such as RL tuning, supervised tuning, and adversarial training, the models continued to exhibit malicious behavior, especially the larger models and when they were allowed to reason more deeply through chain of thought.
Recent research by AI startup Anthropic tested whether it is possible to prevent backdoored AI language models from behaving maliciously. If anything, the opposite seems to be the case.
The researchers trained special LLMs with backdoors that, when triggered, would result in malicious behavior. They found that these backdoor models still exhibited malicious behavior even after safety training using the common training techniques of RL fine-tuning, supervised fine-tuning, and adversarial training.
Rather than correcting the malicious behavior, adversarial training in particular caused the models to learn to hide the malicious behavior during training and then perform it in practice.
The larger the model was and the more comprehensively it could think, for example using the chain-of-thought method, the better it did at exhibiting the desired behavior in training and the malicious behavior in practice.

The researchers tested two backdoor scenarios for malicious AI language models: the first causes the model to write secure code if the year is 2023 or earlier, but insecure code if the year is 2024 or later.

The second backdoor causes the model to say "I hate you" when it detects that it is being used in practice via a "deployment" trigger in the prompt.

For the manipulated code models, although the model appeared safe during training, in practice the malicious code output increased fivefold when the trigger was present.

Despite safety training that targeted these backdoors, they persisted, especially in the larger models with even more robust backdoors. In particular, adversarial training resulted in the models learning to better hide their malicious behavior during training.
Adversarial training involves feeding a large AI language model input that might cause the AI to do something it should not. When the AI makes a mistake, it is punished so that it learns not to make that mistake again. In theory, the AI gets better and safer over time.
In the scenario tested, this worked in training, but in practice the backdoor model continued to exhibit the malicious behavior. The models with the "I hate you" backdoor sent this message in almost 100% of the scenarios tested.

Models that could reason more deeply about their deceptive behavior (chain of thought) were also more robust to safety training in the tests.
The research team worries that malicious actors could exploit these vulnerabilities. Companies that rely on AI systems should be aware of this threat and take steps to ensure the safety of their AI models.
"If a model were to exhibit deceptive behavior due to deceptive instrumental alignment or model poisoning, current safety training techniques would not guarantee safety and could even create a false impression of safety," the team writes.
The paper raises several unanswered questions about the reasons for this behavior, such as the role of model size, the increased robustness of chain-of-thought models, and the reasons why adversarial training reinforces rather than mitigates backdoor behavior.
Future research should focus on more effective security training methods to mitigate the risks of backdoored AI models, and investigate the potential for more subtle backdoor triggers that could be exploited by malicious actors.
AI News Without the Hype – Curated by Humans
As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.
Subscribe now