Content
summary Summary

Anthropic, working with the UK’s AI Security Institute and the Alan Turing Institute, has discovered that as few as 250 poisoned documents are enough to insert a backdoor into large language models - regardless of model size.

Ad

The team trained models ranging from 600 million to 13 billion parameters and found that the number of poisoned documents required stayed constant, even though larger models were trained on far more clean data.

The findings challenge the long-held assumption that attackers need to control a specific percentage of training data to compromise a model. In this case, the poisoned samples made up only 0.00016 percent of the entire dataset - yet they were enough to sabotage the model’s behavior.

Currently low risk

The researchers tested a "denial-of-service" style backdoor that causes the model to output gibberish when it encounters a specific trigger word. In their experiments, that trigger was "SUDO." Each poisoned document contained normal text, followed by the trigger word and then a sequence of random, meaningless words.

Ad
Ad

In total, the team trained 72 models of different sizes and used varying numbers of poisoned files - 100, 250, and 500. The attack failed with 100 documents but became reliably effective with 250 across all model sizes. Increasing the number to 500 didn’t make a difference - the success rate remained nearly identical for models with 600 million and 13 billion parameters.

Anthropic stresses that the tested backdoor represents only a narrow, low-risk vulnerability. The attack merely causes models to generate nonsense - a limited form of misbehavior that poses little danger to advanced systems. It’s still unclear whether similar methods could enable more serious exploits, such as producing insecure code or bypassing safety mechanisms. Earlier studies suggest such complex attacks are much harder to execute.

Despite the risk that publishing these results could inspire attackers, Anthropic argues that disclosure helps the field overall. Data poisoning is one type of attack where defenders have the upper hand, since they can recheck both datasets and trained models. The key, Anthropic says, is ensuring that defenders aren’t blindsided by attacks once thought impossible. The study shows that protective measures must remain effective even when the number of poisoned training samples is small and constant.

Of course, attackers still face the challenge of gaining access to training data in the first place - and ensuring their manipulations survive the layers of defense that follow model training.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • Anthropic researchers, working with UK security institutes, found that just 250 poisoned documents can insert a backdoor into large language models, regardless of whether the model has 600 million or 13 billion parameters.
  • The poisoned documents made up only 0.00016 percent of the training data, challenging the assumption that attackers need to control a significant percentage of the dataset to compromise a model.
  • The tested attack causes models to output gibberish when encountering a trigger word, which Anthropic considers a low-risk vulnerability, while more dangerous exploits like bypassing safety mechanisms remain much harder to execute and unproven.
Sources
Max is the managing editor of THE DECODER, bringing his background in philosophy to explore questions of consciousness and whether machines truly think or just pretend to.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.