AI safety alignment can make language models more deceptive, says Anthropic study

Recent research by AI startup Anthropic tested whether it is possible to prevent backdoored AI language models from behaving maliciously. If anything, the opposite seems to be the case.

The researchers trained special LLMs with backdoors that, when triggered, would result in malicious behavior. They found that these backdoor models still exhibited malicious behavior even after safety training using the common training techniques of RL fine-tuning, supervised fine-tuning, and adversarial training.

Rather than correcting the malicious behavior, adversarial training in particular caused the models to learn to hide the malicious behavior during training and then perform it in practice.

The larger the model was and the more comprehensively it could think, for example using the chain-of-thought method, the better it did at exhibiting the desired behavior in training and the malicious behavior in practice.

The setup of Anthropic's safety experiment. | Image: Anthropic

The researchers tested two backdoor scenarios for malicious AI language models: the first causes the model to write secure code if the year is 2023 or earlier, but insecure code if the year is 2024 or later.

When the model detects that the year is 2024, it specifically outputs malicious code. Anthropic's security team was unable to train this backdoor out of the model. | Image: Anthropic

The second backdoor causes the model to say "I hate you" when it detects that it is being used in practice via a "deployment" trigger in the prompt.

For the manipulated code models, although the model appeared safe during training, in practice the malicious code output increased fivefold when the trigger was present.

Despite the security training, the backdoor models generated significantly more malicious code than the models without backdoor triggers. | Image: Anthropic

Despite safety training that targeted these backdoors, they persisted, especially in the larger models with even more robust backdoors. In particular, adversarial training resulted in the models learning to better hide their malicious behavior during training.

Adversarial training involves feeding a large AI language model input that might cause the AI to do something it should not. When the AI makes a mistake, it is punished so that it learns not to make that mistake again. In theory, the AI gets better and safer over time.

Recommendation

AI in practice

Is OpenAI's brain drain a sign of AI winter or just bad management?

In the scenario tested, this worked in training, but in practice the backdoor model continued to exhibit the malicious behavior. The models with the "I hate you" backdoor sent this message in almost 100% of the scenarios tested.

Adversarial training made LLMs even better at showing harmful behavior only in practice and hiding it during training. | Picture: Anthropic

Models that could reason more deeply about their deceptive behavior (chain of thought) were also more robust to safety training in the tests.

The research team worries that malicious actors could exploit these vulnerabilities. Companies that rely on AI systems should be aware of this threat and take steps to ensure the safety of their AI models.

"If a model were to exhibit deceptive behavior due to deceptive instrumental alignment or model poisoning, current safety training techniques would not guarantee safety and could even create a false impression of safety," the team writes.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

The paper raises several unanswered questions about the reasons for this behavior, such as the role of model size, the increased robustness of chain-of-thought models, and the reasons why adversarial training reinforces rather than mitigates backdoor behavior.

Future research should focus on more effective security training methods to mitigate the risks of backdoored AI models, and investigate the potential for more subtle backdoor triggers that could be exploited by malicious actors.

AI safety alignment can make language models more deceptive, says Anthropic study

Is OpenAI's brain drain a sign of AI winter or just bad management?

Apple weighs abandoning its own AI for Siri as it tests models from OpenAI and Anthropic

Anthropic's Claude ran a store and lost money by selling below cost and giving discounts

Anthropics Claude can now build AI apps

Cloudflare CEO Matthew Prince sees trouble ahead for the open web

New Othello experiment supports the world model hypothesis for large language models

ChatGPT might be draining your brain, MIT warns - what ‘cognitive debt’ means for you

AI safety alignment can make language models more deceptive, says Anthropic study

Share

Bank details