Summary

Pix2pix-zero is designed to allow simple image editing using Stable Diffusion while keeping the structure of the source image.

For generative AI models such as Stable Diffusion, DALL-E 2, or Imagen, methods like inpainting, Prompt-to-Prompt, or InstructPix2Pix allow the manipulation of real or generated images.

Researchers at Carnegie Mellon University and Adobe Research now present pix2pix-zero, a method that focuses on preserving the structure of the source image. The method enables image translation tasks to be performed without extensive fine-tuning or prompt engineering.

Pix2pix-zero uses cross-attention guidance

Methods such as Prompt-to-Prompt or InstructPix2Pix can either alter the structure of the original image or adhere to it so closely that the desired changes are not applied.


One solution is to combine Inpainting and InstructPix2Pix, which allows for more targeted changes. Pix2pix-zero takes a different approach: the researchers synthesize a completely new image, but use cross-attention maps to guide the generative process.

The method supports simple changes such as "cat to dog", "dog to dog with sunglasses" or "sketch to oil painting". The input is an original image, from which a BLIP model derives a text description, which is then converted into a text embedding by CLIP.

The synthesis of the new image follows the steps of the first reconstruction of the original, and is thus more closely aligned with it. | Image: Parmar et al.

Together with an inverted noise map, the text embedding is used to reconstruct the original image. In the second step, the cross-attention maps from this reconstruction, together with the original text embedding and a new text embedding describing the change, are used to steer the synthesis of the desired image.
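The guidance step described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: `toy_attention` is a stand-in for the diffusion model's cross-attention maps (which in Stable Diffusion are only 64 by 64 pixels per text token), and the gradient step on the latent shows the general idea of pulling the edited image's attention maps toward those recorded during reconstruction.

```python
import torch

def toy_attention(latent: torch.Tensor) -> torch.Tensor:
    """Stand-in for the model's cross-attention maps: a
    differentiable softmax over the flattened latent."""
    return torch.softmax(latent.flatten(), dim=0)

# Attention maps recorded while reconstructing the original image.
ref_latent = torch.randn(8, 8)
ref_maps = toy_attention(ref_latent).detach()

# One guidance step: nudge the edited image's latent so its
# attention maps match the reference, preserving structure.
latent = torch.randn(8, 8, requires_grad=True)
loss = ((toy_attention(latent) - ref_maps) ** 2).sum()
loss.backward()
with torch.no_grad():
    latent -= 0.1 * latent.grad  # step toward the reference structure
```

In the actual method, a step like this is applied at each denoising timestep, so the new image is generated with the spatial layout of the original.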

Since a change like "cat to dog" is not described by a detailed text input, the new text embedding cannot be obtained from a single prompt. Instead, pix2pix-zero uses GPT-3 to generate a series of prompts about cats, e.g. "A cat washes itself, a cat is watching birds at a window, ..." and about dogs, e.g. "This is my dog, a dog is panting after a walk, ...".

Pix2pix-zero uses GPT-3 generated prompts. Alternatively, they can be entered manually. | Image: Parmar et al.

For these generated prompts, pix2pix-zero first calculates the individual CLIP embeddings and then the mean difference of all embeddings. The result is then used as a new text embedding for the synthesis of the new image, e.g. the image of a dog.
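This edit-direction computation is simple enough to sketch. Note the hedge: `embed` below is a deterministic stand-in for the real CLIP text encoder (which maps each prompt to a 512-dimensional embedding), so the sketch is runnable without downloading a model; only the mean-difference logic mirrors the method.

```python
import numpy as np

def embed(prompt: str, dim: int = 512) -> np.ndarray:
    """Stand-in for a CLIP text encoder: returns a deterministic
    pseudo-random unit vector per prompt (for illustration only)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def edit_direction(source_prompts, target_prompts) -> np.ndarray:
    """Mean difference between target and source embedding clouds,
    i.e. the 'cat -> dog' direction used to guide the synthesis."""
    src = np.mean([embed(p) for p in source_prompts], axis=0)
    tgt = np.mean([embed(p) for p in target_prompts], axis=0)
    return tgt - src

cat_prompts = ["A cat washes itself", "A cat is watching birds at a window"]
dog_prompts = ["This is my dog", "A dog is panting after a walk"]
direction = edit_direction(cat_prompts, dog_prompts)
print(direction.shape)
```

Averaging over many generated sentences makes the direction robust to the wording of any single prompt, which is why the prompt generation step matters.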


Pix2pix-zero stays close to the original

The researchers use several examples to show how close pix2pix-zero stays to the original image - although small changes are always visible. The method works with different images, including photos or drawings, and can change styles, objects, or seasons.

Pix2pix-zero usually keeps to the structure of the original image in a clearly visible way. | Image: Parmar et al.

In a comparison with some other methods, pix2pix-zero is clearly ahead in terms of quality. A direct comparison with InstructPix2Pix is not shown in the paper.

The results of pix2pix-zero are often better than those of the compared methods. | Image: Parmar et al.

The quality of the results also depends on Stable Diffusion itself, according to the paper. The cross-attention maps used for guidance show which features of the image the model focuses on in each denoising step, but are only available at a resolution of 64 by 64 pixels. Higher resolutions may provide even more detailed results in the future, according to the researchers.

Another drawback of the diffusion-based method is that it requires many steps and therefore a lot of compute power and time. As an alternative, the team proposes a GAN trained on the image pairs generated by pix2pix-zero for the same task.


Comparable image pairs have so far been very difficult and expensive to generate, the team says. A distilled version of the GAN achieves similar results to pix2pix-zero with a speedup of 3,800 times. On an Nvidia A100, this equates to 0.018 seconds per frame. The GAN variant thus allows changes to be made in real-time.

More information and examples can be found on the pix2pix-zero project page.

  • Pix2pix-zero enables image processing using Stable Diffusion without fine-tuning or prompt engineering.
  • The method synthesizes a new image based on the cross-attention maps of the original image.
  • With this, the method can transform a duck into a rubber duck or change the style of an image.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.