Anthropic, working with the UK’s AI Security Institute and the Alan Turing Institute, has discovered that as few as 250 poisoned documents are enough to insert a backdoor into large language models - regardless of model size.
The team trained models ranging from 600 million to 13 billion parameters and found that the number of poisoned documents required stayed constant, even though larger models were trained on far more clean data.
The findings challenge the long-held assumption that attackers need to control a specific percentage of training data to compromise a model. In this case, the poisoned samples made up only 0.00016 percent of the entire dataset - yet they were enough to sabotage the model’s behavior.
Currently low risk
The researchers tested a "denial-of-service" style backdoor that causes the model to output gibberish when it encounters a specific trigger word. In their experiments, that trigger was "SUDO." Each poisoned document contained normal text, followed by the trigger word and then a sequence of random, meaningless words.
In total, the team trained 72 models of different sizes and used varying numbers of poisoned files - 100, 250, and 500. The attack failed with 100 documents but became reliably effective with 250 across all model sizes. Increasing the number to 500 didn’t make a difference - the success rate remained nearly identical for models with 600 million and 13 billion parameters.
Anthropic stresses that the tested backdoor represents only a narrow, low-risk vulnerability. The attack merely causes models to generate nonsense - a limited form of misbehavior that poses little danger to advanced systems. It’s still unclear whether similar methods could enable more serious exploits, such as producing insecure code or bypassing safety mechanisms. Earlier studies suggest such complex attacks are much harder to execute.
Despite the risk that publishing these results could inspire attackers, Anthropic argues that disclosure helps the field overall. Data poisoning is one type of attack where defenders have the upper hand, since they can recheck both datasets and trained models. The key, Anthropic says, is ensuring that defenders aren’t blindsided by attacks once thought impossible. The study shows that protective measures must remain effective even when the number of poisoned training samples is small and constant.
Of course, attackers still face the challenge of gaining access to training data in the first place - and ensuring their manipulations survive the layers of defense that follow model training.