
A new study looks at how toxic content from the online forum 4chan affects the training of large language models, and finds that including a controlled amount of this data can actually make models easier to detoxify later.


Typically, AI developers try to filter out toxic content before training their models, hoping to prevent harmful output. But the new research suggests this strategy isn't always effective, especially if the model will later be detoxified using additional techniques.

The researchers trained the small language model OLMo-1B on different mixes of data from 4chan, a site notorious for its offensive and provocative posts. As a control, they used the clean C4 dataset, which is based on filtered web text.
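To make the setup concrete, mixing a fixed share of toxic text into an otherwise clean corpus comes down to sampling documents from both sources at a target ratio. The sketch below is a minimal illustration of that idea, not the researchers' actual data pipeline; the function, the dataset variables, and the default ratio are placeholders.

```python
import random

def mix_corpora(clean_docs, toxic_docs, toxic_ratio=0.10, total=100_000, seed=0):
    """Build a training corpus with a fixed share of toxic documents.

    clean_docs / toxic_docs: sequences of raw text documents.
    toxic_ratio: fraction of the final corpus drawn from the toxic source.
    """
    rng = random.Random(seed)
    n_toxic = int(total * toxic_ratio)
    n_clean = total - n_toxic
    corpus = rng.sample(clean_docs, n_clean) + rng.sample(toxic_docs, n_toxic)
    rng.shuffle(corpus)  # interleave so every batch sees both sources
    return corpus

# Example: 10% 4chan-style text, 90% filtered web text (placeholder lists)
# corpus = mix_corpora(c4_documents, fourchan_documents, toxic_ratio=0.10)
```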

Toxic content sharpens internal representations

The team examined how toxic concepts are represented inside the model. In models trained only on clean data, toxic ideas tended to be diffuse and tangled up with other concepts, a phenomenon known as entanglement. But as they increased the proportion of 4chan data, these toxic representations became more distinct and easier to separate from the rest.

Chart: entanglement of underrepresented features vs. other features as a function of the data ratio. Adding more training data for underrepresented features like toxic content reduces their entanglement in the model; these concepts become internally separated, making the model easier to control. | Image: Li et al.

This clearer separation is crucial for later interventions. If toxic content is represented distinctly inside the model, it's much easier to suppress without affecting overall performance.
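A rough way to check how cleanly a concept is represented is to fit a linear probe on hidden activations and see how well it separates toxic from non-toxic inputs. The sketch below illustrates that general idea with scikit-learn; it is not the entanglement metric used in the paper, and the activation arrays are assumed to come from your own extraction step.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def separability_score(hidden_states, labels):
    """Rough proxy for how cleanly a concept is represented in a model.

    hidden_states: (n_samples, hidden_dim) activations from one layer,
                   e.g. the last-token hidden state for each prompt.
    labels:        1 for toxic prompts, 0 for non-toxic prompts.

    High cross-validated probe accuracy suggests the concept occupies a
    well-separated direction; chance-level accuracy suggests it is
    entangled with other features.
    """
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, hidden_states, labels, cv=5).mean()

# Example (placeholder arrays from your own activation extraction):
# acc = separability_score(np.vstack([toxic_acts, clean_acts]),
#                          np.array([1] * len(toxic_acts) + [0] * len(clean_acts)))
```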

Ten percent 4chan data hits the sweet spot

Next, the researchers tried different methods for detoxifying the models. One approach, called inference-time intervention, works by directly dampening toxic neuron activations during text generation, and it proved especially reliable.
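Conceptually, inference-time intervention edits the model's hidden activations during generation, pushing them away from a direction associated with toxicity. The PyTorch sketch below shows one schematic way to do that with a forward hook; the layer choice, the toxic direction vector, and the strength parameter are illustrative assumptions, not the authors' implementation.

```python
import torch

def add_detox_hook(model_layer, toxic_direction, alpha=1.0):
    """Register a forward hook that pushes activations away from a toxic direction.

    model_layer:     a transformer block whose output gets edited,
                     e.g. model.model.layers[k] in many Hugging Face models.
    toxic_direction: vector (hidden_dim,) pointing toward 'toxic' activations,
                     typically estimated from contrasting toxic vs. clean prompts.
    alpha:           intervention strength; 1.0 removes the projected component,
                     larger values push past it.
    """
    direction = toxic_direction / toxic_direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Project onto the toxic direction and subtract that component.
        proj = (hidden @ direction).unsqueeze(-1) * direction
        steered = hidden - alpha * proj
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return model_layer.register_forward_hook(hook)

# Usage (schematic): handle = add_detox_hook(model.model.layers[12], direction)
# ... model.generate(...) ...; handle.remove()
```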

The model trained with 10% 4chan data performed best, generating the least toxic output while still maintaining strong language abilities. Models trained with higher shares of 4chan data became more toxic overall and were harder to correct.

Chart: toxicity vs. 4chan data share and control strength. The lowest toxicity was achieved when around 10% of the training data came from 4chan, provided strong control methods were used. | Image: Li et al.

The study also compared this approach to other detoxification strategies, including prompting, supervised fine-tuning, and direct preference optimization. In almost all cases, models trained with a moderate amount of 4chan data performed better.

The team also ran the models through so-called jailbreak prompts, deliberate attempts to trick language models into producing toxic output. Once again, models that had been exposed to 4chan data and then fine-tuned exhibited greater robustness.
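Robustness checks of this kind are usually run by feeding a fixed set of adversarial prompts to the model and scoring each reply with an off-the-shelf toxicity classifier. The sketch below illustrates such an evaluation loop; the Detoxify classifier and the 0.5 threshold are assumptions for demonstration, not the evaluation setup from the study.

```python
from detoxify import Detoxify  # off-the-shelf toxicity classifier (pip install detoxify)

def jailbreak_toxicity_rate(generate_fn, jailbreak_prompts, threshold=0.5):
    """Fraction of adversarial prompts whose completion scores as toxic.

    generate_fn:       callable mapping a prompt string to the model's reply.
    jailbreak_prompts: prompts crafted to provoke toxic output.
    threshold:         toxicity score above which a reply counts as toxic.
    """
    scorer = Detoxify("original")
    toxic = sum(
        scorer.predict(generate_fn(p))["toxicity"] >= threshold
        for p in jailbreak_prompts
    )
    return toxic / len(jailbreak_prompts)
```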


The findings suggest that toxic content shouldn't always be excluded from pre-training. Instead, a controlled dose can make models both more robust and easier to steer. The same idea could apply to other sensitive areas, like stereotypical roles or extreme political viewpoints.

Summary
  • A new study finds that large language models trained with a controlled amount of toxic content from 4chan can be more effectively steered away from generating harmful text than models trained only on clean data.
  • Training with around ten percent 4chan data resulted in the best performance: these models had the lowest rates of toxic output and could be directed more precisely, outperforming other detox methods like prompting or supervised fine-tuning.
  • Models that had some exposure to 4chan content also proved more resilient to targeted attacks such as jailbreak prompts. Researchers suggest that including toxic content in a controlled way during training can improve both safety and controllability, rather than excluding it entirely.