After a safety review, LAION is making available a revised version of its widely used AI training dataset LAION-5B. The new dataset, Re-LAION-5B, is said to contain no links to child sexual abuse material (CSAM).
According to LAION, Re-LAION-5B is the first dataset of text-image pairs at web scale that has been thoroughly cleaned of known links to suspected CSAM. This addresses issues identified in the original LAION-5B by the Stanford Internet Observatory in December 2023.
The updated dataset comes in two versions: Re-LAION-5B Research and Re-LAION-5B Research-Safe. A total of 2,236 links were removed after matching against lists provided by partner organizations, including the 1,008 links identified in the Stanford Internet Observatory report.
LAION notes that many of the links known to child protection organizations are likely no longer active, since efforts to remove known material from the public internet are ongoing. The figure is therefore an upper bound on the number of links that could still lead to CSAM.
Re-LAION-5B contains 5.5 billion text-image pairs in total. Third parties can use the metadata to clean up existing derivatives of LAION-5B: by diffing the original metadata against the new release, they can identify the removed entries and drop all matching content from their own copies, as sketched below.
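LAION's announcement does not prescribe a specific tool for this step, but the diff-and-filter logic is simple. The following Python sketch assumes the metadata is stored as parquet shards that share a key column (hypothetically named "url" here); the actual shard layout, file names, and column names may differ.

```python
import pandas as pd

# Hypothetical file names; the real metadata is distributed as many parquet shards.
OLD_META = "laion5b_shard.parquet"      # original LAION-5B metadata shard
NEW_META = "relaion5b_shard.parquet"    # matching Re-LAION-5B metadata shard
DERIVATIVE = "my_derivative.parquet"    # a third-party subset built from LAION-5B

# Assumed key column shared by all three files.
KEY = "url"

old = pd.read_parquet(OLD_META, columns=[KEY])
new = pd.read_parquet(NEW_META, columns=[KEY])

# Diff: keys present in the old release but absent from the cleaned one
# are exactly the entries that were removed.
removed = set(old[KEY]) - set(new[KEY])
print(f"{len(removed)} links were dropped in this shard")

# Filter the derivative: keep only rows whose key is NOT in the removed set.
derivative = pd.read_parquet(DERIVATIVE)
cleaned = derivative[~derivative[KEY].isin(removed)]
cleaned.to_parquet("my_derivative_cleaned.parquet", index=False)
```

Matching on URLs keeps the comparison cheap; a derivative that stores content hashes instead could apply the same set-difference approach with that column as the key.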
LAION says the release of Re-LAION-5B sets a new safety standard for cleaning web-scale image-link datasets. The original LAION-5B had previously also drawn criticism for containing patient images.
Generative AI complicates efforts to combat CSAM
The presence of CSAM in AI training datasets is problematic in itself. Compounding the issue, some systems trained on such data are being used to generate new CSAM.
The Internet Watch Foundation (IWF) reported a sharp increase in AI-generated CSAM in fall 2023. The sheer volume of AI-generated content hampers investigations into real child abuse cases, as do the automated, AI-generated reports of suspected CSAM that social media platforms file.