LAION releases AI dataset Re-LAION-5B purged of links to child abuse images
After a safety review, LAION is making available a revised version of its widely used AI training dataset LAION-5B. The new dataset, Re-LAION-5B, is said to contain no links to child sexual abuse material (CSAM).
According to LAION, Re-LAION-5B is the first dataset of text-image pairs at web scale that has been thoroughly cleaned of known links to suspected CSAM. This addresses issues identified in the original LAION-5B by the Stanford Internet Observatory in December 2023.
The updated dataset comes in two versions: Re-LAION-5B Research and Re-LAION-5B Research-Safe. A total of 2,236 links were removed after checking against lists provided by partners. This includes the 1,008 links identified in the Stanford Internet Observatory report.
LAION notes that many of these links known to child protection organizations are likely no longer active, as efforts to remove known material from the public internet are ongoing. The number therefore represents an upper limit for links that may lead to CSAM.
Re-LAION-5B contains 5.5 billion text-image pairs in total. Third parties can use the metadata to clean up existing derivatives of LAION-5B by generating diffs and removing all matching content from their versions.
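The diff-based cleanup described above can be sketched roughly as follows. This is a minimal illustration, not LAION's tooling: the function names, the `url` key, and the example links are all hypothetical, and the real metadata is far too large to hold in plain Python sets, so a production version would need to process the metadata files in chunks.

```python
from __future__ import annotations

from typing import Iterable


def removed_urls(original_urls: Iterable[str], cleaned_urls: Iterable[str]) -> set[str]:
    """Diff step: URLs in the original metadata that are absent from the cleaned release."""
    return set(original_urls) - set(cleaned_urls)


def clean_derivative(rows: Iterable[dict], blocked: set[str], url_key: str = "url") -> list[dict]:
    """Removal step: keep only rows whose URL is not in the blocked set.

    `url_key` is an assumed column name; check the actual metadata schema.
    """
    return [row for row in rows if row[url_key] not in blocked]


# Hypothetical toy data standing in for the metadata of each dataset.
original_metadata = ["https://example.org/a.jpg", "https://example.org/b.jpg", "https://example.org/c.jpg"]
cleaned_metadata = ["https://example.org/a.jpg", "https://example.org/c.jpg"]

# A derivative dataset built from a subset of the original.
derivative = [
    {"url": "https://example.org/a.jpg", "caption": "a"},
    {"url": "https://example.org/b.jpg", "caption": "b"},
]

blocked = removed_urls(original_metadata, cleaned_metadata)
cleaned_derivative = clean_derivative(derivative, blocked)
```

In this sketch the derivative loses every entry whose link was dropped from the cleaned release, whether or not that link still resolves.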
LAION says the release of Re-LAION-5B sets a new safety standard for cleaning image link datasets at web scale. The original LAION-5B had previously also faced criticism for containing patient images.
Generative AI complicates efforts to combat CSAM
The presence of CSAM in AI training datasets is inherently problematic. Additionally, some trained systems are being used to generate CSAM.
The Internet Watch Foundation (IWF) reported a sharp increase in AI-generated CSAM in fall 2023. The sheer volume of AI-generated content hinders investigations into real child abuse cases, as do the AI-generated reports of possible CSAM that social media platforms create automatically.