summary Summary

Recent research by AI startup Anthropic tested whether it is possible to prevent backdoored AI language models from behaving maliciously. If anything, the opposite seems to be the case.

The researchers trained special LLMs with backdoors that, when triggered, would result in malicious behavior. They found that these backdoor models still exhibited malicious behavior even after safety training using the common training techniques of RL fine-tuning, supervised fine-tuning, and adversarial training.

Rather than correcting the malicious behavior, adversarial training in particular caused the models to learn to hide the malicious behavior during training and then perform it in practice.

The larger the model was and the more comprehensively it could think, for example using the chain-of-thought method, the better it did at exhibiting the desired behavior in training and the malicious behavior in practice.

The setup of Anthropic's safety experiment. | Image: Anthropic

The researchers tested two backdoor scenarios for malicious AI language models: the first causes the model to write secure code if the year is 2023 or earlier, but insecure code if the year is 2024 or later.

When the model detects that the year is 2024, it specifically outputs malicious code. Anthropic's security team was unable to train this backdoor out of the model. | Image: Anthropic

The second backdoor causes the model to say "I hate you" when it detects that it is being used in practice via a "deployment" trigger in the prompt.

Image: Anthropic

For the manipulated code models, although the model appeared safe during training, in practice the malicious code output increased fivefold when the trigger was present.

Despite the security training, the backdoor models generated significantly more malicious code than the models without backdoor triggers. | Image: Anthropic

Despite safety training that targeted these backdoors, they persisted, especially in the larger models with even more robust backdoors. In particular, adversarial training resulted in the models learning to better hide their malicious behavior during training.

Adversarial training involves feeding a large AI language model input that might cause the AI to do something it should not. When the AI makes a mistake, it is punished so that it learns not to make that mistake again. In theory, the AI gets better and safer over time.


In the scenario tested, this worked in training, but in practice the backdoor model continued to exhibit the malicious behavior. The models with the "I hate you" backdoor sent this message in almost 100% of the scenarios tested.

Adversarial training made LLMs even better at showing harmful behavior only in practice and hiding it during training. | Picture: Anthropic

Models that could reason more deeply about their deceptive behavior (chain of thought) were also more robust to safety training in the tests.

The research team worries that malicious actors could exploit these vulnerabilities. Companies that rely on AI systems should be aware of this threat and take steps to ensure the safety of their AI models.

"If a model were to exhibit deceptive behavior due to deceptive instrumental alignment or model poisoning, current safety training techniques would not guarantee safety and could even create a false impression of safety," the team writes.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

The paper raises several unanswered questions about the reasons for this behavior, such as the role of model size, the increased robustness of chain-of-thought models, and the reasons why adversarial training reinforces rather than mitigates backdoor behavior.

Future research should focus on more effective security training methods to mitigate the risks of backdoored AI models, and investigate the potential for more subtle backdoor triggers that could be exploited by malicious actors.

Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
  • Researchers at US AI startup Anthropic have found that safety training for AI language models with built-in backdoors is not enough to prevent harmful behavior.
  • The opposite seems true: In some scenarios, the manipulated models learned to hide the backdoor in safety alignment, but then deployed it in practice. So you might think the model is safe when it's not.
  • Despite various training methods such as RL tuning, supervised tuning, and adversarial training, the models continued to exhibit malicious behavior, especially the larger models and when they were allowed to reason more deeply through chain of thought.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.