After a safety review, LAION is making available a revised version of its widely used AI training dataset LAION-5B. The new dataset, Re-LAION-5B, is said to contain no links to child sexual abuse material (CSAM).
According to LAION, Re-LAION-5B is the first dataset of text-image pairs at web scale that has been thoroughly cleaned of known links to suspected CSAM. This addresses issues identified in the original LAION-5B by the Stanford Internet Observatory in December 2023.
The updated dataset comes in two versions: Re-LAION-5B Research and Re-LAION-5B Research-Safe. A total of 2,236 links were removed after matching against lists provided by partner organizations, including the 1,008 links identified in the Stanford Internet Observatory report.
LAION notes that many of the links known to child protection organizations are likely no longer active, since efforts to remove known material from the public internet are ongoing. The figure is therefore an upper bound on the number of links that could still lead to CSAM.
Re-LAION-5B contains 5.5 billion text-image pairs in total. Third parties can use the metadata to clean up existing derivatives of LAION-5B: by diffing the original metadata against the new release, they can identify the removed entries and drop all matching content from their own copies, as sketched below.
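LAION's announcement does not prescribe a specific tool for this step, but the diff-and-filter logic is simple. The following Python sketch assumes the metadata is stored as parquet shards that share a key column (hypothetically named "url" here); the actual shard layout, file names, and column names may differ.

```python
import pandas as pd

# Hypothetical file names; the real metadata is distributed as many parquet shards.
OLD_META = "laion5b_shard.parquet"      # original LAION-5B metadata shard
NEW_META = "relaion5b_shard.parquet"    # matching Re-LAION-5B metadata shard
DERIVATIVE = "my_derivative.parquet"    # a third-party subset built from LAION-5B

# Assumed key column shared by all three files.
KEY = "url"

old = pd.read_parquet(OLD_META, columns=[KEY])
new = pd.read_parquet(NEW_META, columns=[KEY])

# Diff: keys present in the old release but absent from the cleaned one
# are exactly the entries that were removed.
removed = set(old[KEY]) - set(new[KEY])
print(f"{len(removed)} links were dropped in this shard")

# Filter the derivative: keep only rows whose key is NOT in the removed set.
derivative = pd.read_parquet(DERIVATIVE)
cleaned = derivative[~derivative[KEY].isin(removed)]
cleaned.to_parquet("my_derivative_cleaned.parquet", index=False)
```

Matching on URLs keeps the comparison cheap; a derivative that stores content hashes instead could apply the same set-difference approach with that column as the key.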
LAION says the release of Re-LAION-5B sets a new safety standard for cleaning web-scale image-link datasets. The original LAION-5B had previously also drawn criticism for containing patient images.
Generative AI complicates efforts to combat CSAM
The presence of CSAM in AI training datasets is problematic in itself. Compounding the issue, some systems trained on such data are being used to generate new CSAM.
The Internet Watch Foundation (IWF) reported a sharp increase in AI-generated CSAM in fall 2023. The sheer volume of AI-generated content hampers investigations into real child abuse cases, as do the automated, AI-generated reports of suspected CSAM that social media platforms file.