Summary

Researcher Note: Diffusion models can reproduce training material, potentially creating duplicates. However, the likelihood of this happening is low, at least for Stable Diffusion.

That AI-generated images cannot be completely separated from their training material becomes apparent quickly when using various AI models: washed-out watermarks and artist signatures regularly show up in the output.

Researchers from several major institutions in the AI industry, including Google, DeepMind, ETH Zurich, Princeton University, and UC Berkeley, have studied Stable Diffusion and Google's Imagen. They found that diffusion models can memorize and reproduce individual training examples.

Insight into the study's evaluation system.

More than 100 training images replicated with Stable Diffusion

The researchers extracted more than 100 "near-identical replicas" of training images, ranging from personally identifiable photos to copyrighted logos. First, they defined what "memorize" means in this context. Because they were working with high-resolution images, exact pixel-level matches were not a suitable definition of memorization, they write.

Examples of potentially generated duplicates.

Instead, they define a notion of approximate memorization based on various image similarity metrics. Using CLIP, they embedded the 160 million training images on which Stable Diffusion was trained and compared them vector by vector.
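In practice, such an approximate-memorization check boils down to comparing embedding vectors against a similarity threshold. A minimal sketch, assuming the CLIP embeddings have already been computed (the 0.95 threshold and the function names are illustrative, not the paper's exact values):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def near_duplicates(generated, training, threshold=0.95):
    """Return (gen_index, train_index) pairs whose embeddings exceed
    the similarity threshold, i.e. likely near-replicas."""
    pairs = []
    for i, g in enumerate(generated):
        for j, t in enumerate(training):
            if cosine_similarity(g, t) >= threshold:
                pairs.append((i, j))
    return pairs
```

At the scale of 160 million training images, a brute-force double loop like this is impractical; a real pipeline would use an approximate nearest-neighbor index instead.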

103 out of 175 million Stable Diffusion images could be considered plagiarized

The extraction process is divided into two steps:

  1. Generate as many example images as possible using the previously selected prompts.
  2. Perform membership inference to separate novel generations of the model from generations that reproduce memorized training examples.

The first step, they said, was conceptually trivial but very computationally intensive, requiring 500 images for each of 350,000 text prompts. The researchers extracted these prompts from the captions of the most frequently duplicated images in the training material.

To reduce the computational load, they used fewer denoising steps, removing more noise per step, even though image quality suffered. In the second step, they flagged the generations that resembled the training images.

In total, they generated 175 million images in this way. For 103 images, they found such a high similarity between the generated image and the original that they classified them as duplicates. So the chance is very low, but it's not zero.


For Imagen, the researchers followed the same procedure as for Stable Diffusion, but to reduce the computational load they selected only the 1,000 most frequently duplicated prompts. For these, they again generated 500,000 images, 23 of which were similar to the training material.

"This is significantly higher than the rate of memorization in Stable Diffusion, and clearly demonstrates that memorization across diffusion models is highly dependent on training settings such as the model size, training time, and dataset size," they conclude. According to the team, Imagen is less private than Stable Diffusion for both duplicated and non-duplicated images in the dataset.

In any case, the researchers recommend cleansing datasets of duplicates before training the AI. They say this reduces, but does not eliminate, the risk of creating duplicates.
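Such deduplication can be sketched as a greedy filter over precomputed image embeddings (the threshold and helper names are illustrative assumptions, not the paper's method):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def deduplicate(embeddings, threshold=0.95):
    """Greedily keep an image only if it is not near-identical to
    an image already kept; returns the indices of images to keep."""
    kept_embs, kept_idx = [], []
    for i, e in enumerate(embeddings):
        if all(cosine_similarity(e, k) < threshold for k in kept_embs):
            kept_embs.append(e)
            kept_idx.append(i)
    return kept_idx
```

Training on the surviving indices removes the most obvious duplicates, though, as the researchers note, it does not eliminate memorization entirely.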

In addition, the risk of duplicates is increased for people with unusual names or appearances, they say. For now, the team recommends against using diffusion models in areas where privacy is a heightened concern, such as the medical field.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Study fuels debate on AI and copyright

The similarity between images generated with diffusion models and the training data is of particular interest in the context of the current copyright lawsuits that Getty Images and various artists have brought against AI companies.

Diffusion models underlie all relevant AI image models today, such as Midjourney, DALL-E 2, and Stable Diffusion itself. AI image generators have also come under criticism for including sensitive data in their training data, which can later be recovered through prompts. Stability AI has already announced that it plans to train future Stable Diffusion models on licensed content and to offer an opt-out for artists who don't want to contribute to AI training.

A study published in December 2022 reached a similar conclusion about diffusion models as the study described in this article: diffusion models "blatantly copy" their training data. Although the researchers in that study examined only a small portion of the LAION-2B dataset, they still found copying.

  • Researchers have found that diffusion models can reproduce training material, potentially creating image duplicates.
  • Scientists from Google, DeepMind, and several universities generated 175 million images with Stable Diffusion. Only 109 of them turned out to be duplicates.
  • Still, the paper is heating up the AI copyright debate, in part because the rate of potential duplicates could be significantly higher with Google's Imagen image model.
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.