
A dataset from the nonprofit organization LAION, intended for AI training, contains countless medical images, including images whose subjects never gave permission for publication.


On the "Have I been trained" website, those interested can search the LAION 5B dataset, a gigantic image dataset with associated captions (5.8 billion image-text pairs). Datasets from LAION are used for training large AI image models such as Stable Diffusion, Google Imagen, and Google Parti.

LAION is a huge link database for Internet images

LAION datasets contain links to images, not the images themselves. Using the sorted links, users can download the images they need to train their AI system.
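To make this concrete, here is a minimal Python sketch of how such a link dataset might be consumed, assuming a shard of (URL, caption) pairs; the file layout and names are illustrative, not LAION's exact schema. At web scale, dedicated downloaders such as img2dataset handle retries, resizing, and sharding instead.

```python
# Minimal sketch: downloading images from LAION-style (URL, caption) pairs.
# File layout and names are illustrative assumptions, not LAION's schema.
import os
import requests

pairs = [
    # (url, caption) tuples as they might appear in one dataset shard
    ("https://example.com/cat.jpg", "a photo of a cat"),
]

os.makedirs("images", exist_ok=True)
for i, (url, caption) in enumerate(pairs):
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        with open(f"images/{i:08d}.jpg", "wb") as f:
            f.write(resp.content)
        # Keep the caption alongside the image for later training.
        with open(f"images/{i:08d}.txt", "w", encoding="utf-8") as f:
            f.write(caption)
    except requests.RequestException:
        # Many links in web-scale datasets are dead; skip failures.
        continue
```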

LAION builds on the work of Common Crawl, a nonprofit organization that crawls billions of web pages and publishes the results as datasets. From this data, LAION extracts HTML image tags that carry alt text, evaluates the resulting image-text pairs with OpenAI's CLIP model against various criteria, and sorts them, for example by image-text similarity.
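The following is a heavily simplified sketch of those two steps: pulling alt-texted image tags out of crawled HTML, then scoring how well an image matches its alt text with CLIP. The model checkpoint and the idea of a single similarity cutoff are assumptions for illustration; LAION's production pipeline is far more elaborate.

```python
# Simplified sketch of LAION-style filtering: extract <img> tags with alt
# text from crawled HTML, then score image/alt-text agreement with CLIP.
# The checkpoint and threshold logic below are illustrative assumptions.
from bs4 import BeautifulSoup
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
import torch

html = '<img src="https://example.com/cat.jpg" alt="a photo of a cat">'

# Step 1: collect (url, alt text) candidates from the HTML.
soup = BeautifulSoup(html, "html.parser")
candidates = [
    (img["src"], img["alt"])
    for img in soup.find_all("img")
    if img.get("src") and img.get("alt")
]

# Step 2: score each downloaded image against its alt text with CLIP.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def similarity(image: Image.Image, text: str) -> float:
    inputs = processor(text=[text], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity scaled by CLIP's learned temperature.
    return out.logits_per_image.item()

# Pairs scoring below some threshold would be dropped; the rest are
# kept, and can then be sorted by this score.
```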


Simply put, LAION datasets contain sorted links to a massive number of Internet images. LAION does not consider the content, copyright, or privacy of the images when collecting, evaluating, and sorting the image links.

What has now happened on Twitter is therefore no surprise.

Patient's image appears in LAION dataset without consent

The AI artist Lapine searched LAION-5B for images of herself. In the process, she discovered two personal before-and-after shots of her face, taken in 2013 as part of a medical exam. On Twitter, she posted an image of a document showing that she had authorized the use of the images solely for her personal records.

The doctor died in 2018, and the artist assumes that the images were stolen after his death and published on the Internet. There, they were found by Common Crawl and thus ended up in the LAION-5B dataset, which in turn is used for training AI systems.

Lapine could ask LAION to delete the link to the images from the dataset, but would probably have little chance of success: LAION writes on its GDPR page that such requests are only processed if an image is linked to identifiable data such as a name, phone number, or address. That is not the case with Lapine's images.


Removing individual images from already-trained AI models would also range from complicated to impossible, because the images are only represented there in abstract form.

AI training is the new Wild West of privacy protection

Lapine's images are, of course, not an isolated case. Numerous images of patients exist in the LAION dataset. They can be found, for example, by searching for the names of specific diseases. Medical images such as X-rays are also included in the dataset.

In itself, this is not a fundamental problem. On the contrary, medical images in datasets could be of great use, for example in training medical AI systems, a topic LAION is also working on. In Lapine's case, however, the images should never have been posted online in the first place; then they would not have ended up in the LAION dataset.

The issue is that there are no rules about which images can be included in datasets and used for AI training. In Lapine's case, these are private medical images. In the case of artists, these are copyrighted works that AI systems can imitate, at least stylistically. Even in the case of images published under a Creative Commons license, it is unclear whether this permission also applies to artificial intelligence training.


This reminds me of the early days of social media: when it comes to copyright and data protection, we are back in the Wild West. The technological progress that results from the uncompromising approach of some organizations and companies is remarkable and creates new opportunities. But it does little for those who feel wronged by it.

Summary
  • LAION datasets, which are used for AI training, pair images with associated descriptions. They contain billions of sorted links to Internet images.
  • One artist found references to personal patient images of herself from 2013 in the LAION-5B dataset. She had explicitly prohibited the publication of the images.
  • This is just a slice of the larger problem: It is largely unclear which images may be used for AI training - although new models are being created all the time.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.