
After a safety review, LAION has released a revised version of its widely used AI training dataset LAION-5B. The new dataset, Re-LAION-5B, is said to contain no known links to child sexual abuse material (CSAM).


According to LAION, Re-LAION-5B is the first dataset of text-image pairs at web scale that has been thoroughly cleaned of known links to suspected CSAM. This addresses issues identified in the original LAION-5B by the Stanford Internet Observatory in December 2023.

The updated dataset comes in two versions: Re-LAION-5B Research and Re-LAION-5B Research-Safe. A total of 2,236 links were removed after checking against lists provided by partners. This includes the 1,008 links identified in the Stanford Internet Observatory report.

LAION notes that many of these links known to child protection organizations are likely no longer active, as efforts to remove known material from the public internet are ongoing. The number therefore represents an upper limit for links that may lead to CSAM.


Re-LAION-5B contains 5.5 billion text-image pairs in total. Third parties can use the metadata to clean up existing derivatives of LAION-5B by generating diffs and removing all matching content from their versions.
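The diff-based cleanup described above can be sketched in a few lines. This is a minimal illustration, not LAION's actual tooling: the key field (`"url"`), the record layout, and the example entries are all assumptions for demonstration purposes.

```python
# Hypothetical sketch of diff-based cleanup: compare the original metadata
# against the cleaned release, then drop matching entries from a derivative.
# Field names and sample records are illustrative, not the real LAION schema.

def build_diff(original_keys, cleaned_keys):
    """Keys present in the original metadata but absent from the cleaned release."""
    return set(original_keys) - set(cleaned_keys)

def filter_derivative(records, removed_keys, key_field="url"):
    """Drop every record whose key appears in the diff."""
    return [r for r in records if r[key_field] not in removed_keys]

# Toy demonstration with made-up entries
original = {"http://a.example/1", "http://b.example/2", "http://c.example/3"}
cleaned = {"http://a.example/1", "http://c.example/3"}

removed = build_diff(original, cleaned)          # {"http://b.example/2"}
derivative = [
    {"url": "http://a.example/1", "caption": "kept entry"},
    {"url": "http://b.example/2", "caption": "flagged entry"},
]
print(filter_derivative(derivative, removed))
```

In practice the comparison would run over the full metadata files (e.g. by URL or content hash) rather than in-memory sets, but the principle is the same: anything in the original that no longer appears in Re-LAION-5B is removed from the derivative.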

LAION says the release of Re-LAION-5B sets a new safety standard for cleaning image link datasets at web scale. The dataset had previously faced criticism for containing patient images.

Generative AI complicates efforts to combat CSAM

The presence of CSAM in AI training datasets is inherently problematic. Additionally, some trained systems are being used to generate CSAM.

The Internet Watch Foundation (IWF) reported a sharp increase in AI-generated CSAM in fall 2023. The volume of AI-generated content hinders investigations into real child abuse cases, as do the automated reports of suspected CSAM that social media platforms generate.

Summary
  • The non-profit organization LAION has released a revised version of its AI training dataset LAION-5B. The new dataset, dubbed "Re-LAION-5B," is reportedly free of links to child sexual abuse material (CSAM).
  • LAION removed 2,236 links after cross-checking them against lists provided by child protection organizations. The cleaned dataset now contains 5.5 billion text-image pairs.
  • LAION claims that Re-LAION-5B establishes a new safety benchmark for purging web-scale datasets. The presence of CSAM in AI training data and output is troubling on many levels, one of which is that it may hinder investigations into actual cases of child abuse.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.