A dataset from the nonprofit organization LAION, intended for AI training, contains countless medical images, even when the people pictured never gave permission.
On the “Have I Been Trained” website, anyone can search the LAION-5B dataset, a gigantic collection of 5.8 billion image-text pairs. LAION datasets are used to train large AI image models such as Stable Diffusion, Google Imagen, and Google Parti.
LAION is a huge link database for Internet images
LAION datasets contain links to images, not the images themselves. Using the sorted links, users can download the images they need to train their AI system.
LAION builds on the work of Common Crawl, a nonprofit organization that crawls billions of web pages and publishes the results as datasets. From this data, LAION extracts HTML image tags together with their alt text, evaluates the resulting image-text pairs with CLIP according to various parameters, and sorts them, for example by the similarity between image and caption.
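The extraction step described above can be sketched with Python's standard-library HTML parser. This is only an illustration of the general idea, not LAION's actual pipeline: the class name and the exact filtering rules are assumptions, and the real system operates on Common Crawl's archive files at massive scale.

```python
from html.parser import HTMLParser

class ImageTagExtractor(HTMLParser):
    """Collects (src, alt) pairs from <img> tags.

    A simplified stand-in for the first stage of a LAION-style
    pipeline: keep only image links that come with a caption
    (the alt text), since the caption is what gets paired with
    the image link in the dataset.
    """
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            # Skip images without alt text: no caption, no pair.
            if a.get("src") and a.get("alt"):
                self.pairs.append((a["src"], a["alt"]))

html = (
    '<p><img src="https://example.com/cat.jpg" alt="a cat">'
    '<img src="https://example.com/no-caption.png"></p>'
)
parser = ImageTagExtractor()
parser.feed(html)
print(parser.pairs)
```

In a real pipeline, each surviving pair would then be scored with CLIP so the links can be filtered and sorted by how well the caption matches the image.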
Simply put, LAION datasets contain sorted links to a massive number of Internet images. LAION does not consider the content, copyright, or privacy of the images when collecting, evaluating, and sorting the image links.
What has now happened on Twitter is therefore no surprise.
Patient’s image appears in LAION dataset without consent
The AI artist Lapine searched LAION-5B for images of herself. In the process, she discovered two personal before-and-after shots of her face, taken in 2013 as part of a medical examination. On Twitter, she posted an image of a consent form showing that she had authorized the use of the photos solely for her medical records.
🚩My face is in the #LAION dataset. In 2013 a doctor photographed my face as part of clinical documentation. He died in 2018 and somehow that image ended up somewhere online and then ended up in the dataset- the image that I signed a consent form for my doctor- not for a dataset. pic.twitter.com/TrvjdZtyjD
— Lapine (@LapineDeLaTerre) September 16, 2022
The doctor died in 2018, and the artist assumes the images were stolen after his death and published on the Internet. There, they were picked up by Common Crawl and thus ended up in the LAION-5B dataset, which in turn is used to train AI systems.
Lapine could ask LAION to delete the link to the image from the dataset, but she would probably have little chance of success: on its GDPR page, LAION states that such requests are only processed if an image is linked to identifiable data such as a name, phone number, or address. That is not the case with Lapine's images.
Removing individual images from already trained AI models would also range from complicated to impossible, because the training data is only represented in the models in abstract form.
AI training is the new Wild West of privacy protection
Lapine’s images are, of course, not an isolated case. Numerous images of patients exist in the LAION dataset. They can be found, for example, by searching for the names of specific diseases. Medical images such as X-rays are also included in the dataset.
This is not a fundamental problem. On the contrary, medical images in datasets could be of great use, for example in training medical AI systems, a topic LAION is also working on. In Lapine's case, the images should never have been posted online in the first place; then they would never have ended up in the LAION dataset.
The issue is that there are no rules about which images can be included in datasets and used for AI training. In Lapine’s case, these are private medical images. In the case of artists, these are copyrighted works that AI systems can imitate, at least stylistically. Even in the case of images published under a Creative Commons license, it is unclear whether this permission also applies to artificial intelligence training.
This reminds me of the early days of social media: when it comes to copyright and data protection, we are back in the Wild West. The technological progress resulting from the uncompromising work of some organizations and companies is remarkable and creates new opportunities. But it does little for those who feel wronged by it.