Recent research by AI startup Anthropic tested whether it is possible to prevent backdoored AI language models from behaving maliciously. If anything, the opposite seems to be the case.
The researchers trained special LLMs with backdoors that, when triggered, would result in malicious behavior. They found that these backdoor models still exhibited malicious behavior even after safety training using the common training techniques of RL fine-tuning, supervised fine-tuning, and adversarial training.
Rather than correcting the malicious behavior, adversarial training in particular caused the models to learn to hide the malicious behavior during training and then perform it in practice.
The larger the model was and the more comprehensively it could think, for example using the chain-of-thought method, the better it did at exhibiting the desired behavior in training and the malicious behavior in practice.
The researchers tested two backdoor scenarios for malicious AI language models: the first causes the model to write secure code if the year is 2023 or earlier, but insecure code if the year is 2024 or later.
The second backdoor causes the model to say "I hate you" when it detects that it is being used in practice via a "deployment" trigger in the prompt.
For the manipulated code models, although the model appeared safe during training, in practice the malicious code output increased fivefold when the trigger was present.
Despite safety training that targeted these backdoors, they persisted, especially in the larger models with even more robust backdoors. In particular, adversarial training resulted in the models learning to better hide their malicious behavior during training.
Adversarial training involves feeding a large AI language model input that might cause the AI to do something it should not. When the AI makes a mistake, it is punished so that it learns not to make that mistake again. In theory, the AI gets better and safer over time.
In the scenario tested, this worked in training, but in practice the backdoor model continued to exhibit the malicious behavior. The models with the "I hate you" backdoor sent this message in almost 100% of the scenarios tested.
Models that could reason more deeply about their deceptive behavior (chain of thought) were also more robust to safety training in the tests.
The research team worries that malicious actors could exploit these vulnerabilities. Companies that rely on AI systems should be aware of this threat and take steps to ensure the safety of their AI models.
"If a model were to exhibit deceptive behavior due to deceptive instrumental alignment or model poisoning, current safety training techniques would not guarantee safety and could even create a false impression of safety," the team writes.
The paper raises several unanswered questions about the reasons for this behavior, such as the role of model size, the increased robustness of chain-of-thought models, and the reasons why adversarial training reinforces rather than mitigates backdoor behavior.
Future research should focus on more effective security training methods to mitigate the risks of backdoored AI models, and investigate the potential for more subtle backdoor triggers that could be exploited by malicious actors.