Ad
Skip to content

Anthropic finds 250 poisoned documents are enough to backdoor large language models

Image description
Anthropic

Key Points

  • Anthropic researchers, working with UK security institutes, found that just 250 poisoned documents can insert a backdoor into large language models, regardless of whether the model has 600 million or 13 billion parameters.
  • The poisoned documents made up only 0.00016 percent of the training data, challenging the assumption that attackers need to control a significant percentage of the dataset to compromise a model.
  • The tested attack causes models to output gibberish when encountering a trigger word, which Anthropic considers a low-risk vulnerability, while more dangerous exploits like bypassing safety mechanisms remain much harder to execute and unproven.

Anthropic, working with the UK’s AI Security Institute and the Alan Turing Institute, has discovered that as few as 250 poisoned documents are enough to insert a backdoor into large language models - regardless of model size.

The team trained models ranging from 600 million to 13 billion parameters and found that the number of poisoned documents required stayed constant, even though larger models were trained on far more clean data.

The findings challenge the long-held assumption that attackers need to control a specific percentage of training data to compromise a model. In this case, the poisoned samples made up only 0.00016 percent of the entire dataset - yet they were enough to sabotage the model’s behavior.

Currently low risk

The researchers tested a "denial-of-service" style backdoor that causes the model to output gibberish when it encounters a specific trigger word. In their experiments, that trigger was "SUDO." Each poisoned document contained normal text, followed by the trigger word and then a sequence of random, meaningless words.

Ad
DEC_D_Incontent-1

In total, the team trained 72 models of different sizes and used varying numbers of poisoned files - 100, 250, and 500. The attack failed with 100 documents but became reliably effective with 250 across all model sizes. Increasing the number to 500 didn’t make a difference - the success rate remained nearly identical for models with 600 million and 13 billion parameters.

Anthropic stresses that the tested backdoor represents only a narrow, low-risk vulnerability. The attack merely causes models to generate nonsense - a limited form of misbehavior that poses little danger to advanced systems. It’s still unclear whether similar methods could enable more serious exploits, such as producing insecure code or bypassing safety mechanisms. Earlier studies suggest such complex attacks are much harder to execute.

Despite the risk that publishing these results could inspire attackers, Anthropic argues that disclosure helps the field overall. Data poisoning is one type of attack where defenders have the upper hand, since they can recheck both datasets and trained models. The key, Anthropic says, is ensuring that defenders aren’t blindsided by attacks once thought impossible. The study shows that protective measures must remain effective even when the number of poisoned training samples is small and constant.

Of course, attackers still face the challenge of gaining access to training data in the first place - and ensuring their manipulations survive the layers of defense that follow model training.

Ad
DEC_D_Incontent-2

AI News Without the Hype – Curated by Humans

As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.

Source: Anthropic