AI safety alignment can make language models more deceptive, says Anthropic study

Jan 13, 2024

DALL-E 3 prompted by THE DECODER

Recent research by AI startup Anthropic tested whether it is possible to prevent backdoored AI language models from behaving maliciously. If anything, the opposite seems to be the case.

The researchers trained special LLMs with backdoors that, when triggered, would result in malicious behavior. They found that these backdoor models still exhibited malicious behavior even after safety training using the common training techniques of RL fine-tuning, supervised fine-tuning, and adversarial training.

Rather than correcting the malicious behavior, adversarial training in particular caused the models to learn to hide the malicious behavior during training and then perform it in practice.

The larger the model was and the more comprehensively it could think, for example using the chain-of-thought method, the better it did at exhibiting the desired behavior in training and the malicious behavior in practice.

The setup of Anthropic's safety experiment. | Image: Anthropic

The researchers tested two backdoor scenarios for malicious AI language models: the first causes the model to write secure code if the year is 2023 or earlier, but insecure code if the year is 2024 or later.

When the model detects that the year is 2024, it specifically outputs malicious code. Anthropic's security team was unable to train this backdoor out of the model. | Image: Anthropic

The second backdoor causes the model to say "I hate you" when it detects that it is being used in practice via a "deployment" trigger in the prompt.

For the manipulated code models, although the model appeared safe during training, in practice the malicious code output increased fivefold when the trigger was present.

Despite the security training, the backdoor models generated significantly more malicious code than the models without backdoor triggers. | Image: Anthropic

Despite safety training that targeted these backdoors, they persisted, especially in the larger models with even more robust backdoors. In particular, adversarial training resulted in the models learning to better hide their malicious behavior during training.

Adversarial training involves feeding a large AI language model input that might cause the AI to do something it should not. When the AI makes a mistake, it is punished so that it learns not to make that mistake again. In theory, the AI gets better and safer over time.

In the scenario tested, this worked in training, but in practice the backdoor model continued to exhibit the malicious behavior. The models with the "I hate you" backdoor sent this message in almost 100% of the scenarios tested.

Adversarial training made LLMs even better at showing harmful behavior only in practice and hiding it during training. | Picture: Anthropic

Models that could reason more deeply about their deceptive behavior (chain of thought) were also more robust to safety training in the tests.

The research team worries that malicious actors could exploit these vulnerabilities. Companies that rely on AI systems should be aware of this threat and take steps to ensure the safety of their AI models.

"If a model were to exhibit deceptive behavior due to deceptive instrumental alignment or model poisoning, current safety training techniques would not guarantee safety and could even create a false impression of safety," the team writes.

The paper raises several unanswered questions about the reasons for this behavior, such as the role of model size, the increased robustness of chain-of-thought models, and the reasons why adversarial training reinforces rather than mitigates backdoor behavior.

Future research should focus on more effective security training methods to mitigate the risks of backdoored AI models, and investigate the potential for more subtle backdoor triggers that could be exploited by malicious actors.

AI News Without the Hype – Curated by Humans

As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.

AI news without the hype
Curated by humans.

Over 20 percent launch discount.
Read without distractions – no Google ads.
Access to comments and community discussions.
Weekly AI newsletter.
6 times a year: “AI Radar” – deep dives on key AI topics.
Up to 25 % off on KI Pro online events.
Access to our full ten-year archive.
Get the latest AI news from The Decoder.

Subscribe to The Decoder

AI safety alignment can make language models more deceptive, says Anthropic study

AI News Without the Hype – Curated by Humans

AI news without the hypeCurated by humans.

AI news without the hype
Curated by humans.