Generative AI systems like Stable Diffusion require costly retraining to learn new concepts. Google's Re-Imagen takes a more efficient approach.

OpenAI's DALL-E 2 and Midjourney generate high-quality images from text. Their generative capabilities, however, are limited to the objects and styles contained in their training data.

A popular alternative is therefore the open-source model Stable Diffusion. It runs locally on a user's graphics card or in the cloud and, thanks to fine-tuning techniques, can learn new concepts such as styles, objects, or people, provided the appropriate hardware resources are available.

Initial attempts used Textual Inversion as a method for post-training, but Dreambooth has now become the standard. The method developed by Google to personalize large text-to-image models like Imagen has been adapted by the open-source community for Stable Diffusion.

Stable Diffusion can be personalized with your own images, but it is a laborious approach

Dreambooth lets you customize Stable Diffusion to your own needs with sample images. The method achieves good results with just a few images. While the hardware requirements were extreme at the beginning, optimized Dreambooth versions can be run with Nvidia graphics cards with 10 gigabytes of VRAM.

In the Dreambooth paper, Google compares DALL-E 2 and Imagen to show how the method allows a generative AI model to learn, say, a new clock face for an alarm clock.

Dreambooth allows Google's Imagen to learn new concepts. | Image: Google

The results are impressive, and numerous people on Twitter are sharing examples of how extensively Dreambooth can be used. So Dreambooth solves a key problem of generative AI models: they often struggle to generate images of unusual and unfamiliar objects or styles. But the method is not necessarily scalable, as each concept requires computationally intensive training.

Google's Re-Imagen shows a scalable alternative to Dreambooth

A group at Google is now demonstrating the Retrieval-Augmented Text-to-Image Generator (Re-Imagen). This new method allows a generative AI model to generate images of rare or never-before-seen objects.

As the name (Retrieval-Augmented) implies, Re-Imagen retrieves new information from an external database rather than being re-trained with additional data.

Given a text prompt, Re-Imagen accesses an external multi-modal knowledge base to retrieve relevant (image, text) pairs, and uses them as references to generate the image.

From the paper

Re-Imagen retrieves semantic and visual information about unknown or rare objects via the additional input, improving its accuracy in image generation.
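
The retrieval step described above can be illustrated with a toy sketch. It is not Re-Imagen's implementation: the article does not say how the knowledge base is searched, so this example swaps in a simple bag-of-words cosine similarity over captions. The `Entry` type, the `retrieve` function, the file paths, and the sample captions are all hypothetical, and the image generator itself is left out.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One (image, text) pair in a hypothetical knowledge base."""
    image_path: str  # placeholder path; no image is actually loaded
    caption: str


def tokenize(text: str) -> Counter:
    """Lowercase the text and count whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(prompt: str, knowledge_base: list[Entry], k: int = 2) -> list[Entry]:
    """Return the k entries whose captions are most similar to the prompt."""
    query = tokenize(prompt)
    ranked = sorted(
        knowledge_base,
        key=lambda entry: cosine(query, tokenize(entry.caption)),
        reverse=True,
    )
    return ranked[:k]


# Made-up sample data standing in for an external multi-modal knowledge base.
KNOWLEDGE_BASE = [
    Entry("kb/okapi_1.jpg", "an okapi standing in a forest"),
    Entry("kb/alarm_clock.jpg", "a yellow alarm clock on a table"),
    Entry("kb/okapi_2.jpg", "close-up photo of an okapi"),
]

# The retrieved pairs would be handed to the generator as extra conditioning
# alongside the text prompt; that step is not modeled here.
references = retrieve("an okapi running on the beach", KNOWLEDGE_BASE, k=2)
for ref in references:
    print(ref.image_path, "|", ref.caption)
```

The point of the sketch is only the control flow: the model's weights stay fixed, and what it knows about a rare object comes from whatever the lookup returns at generation time.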

To accomplish this, the Google team trained Re-Imagen on a new dataset that includes three modalities (image, text, and retrieval). As a result, the model learned to use both the text input and the retrievals from the external database during generation.

In some examples, Google shows that Re-Imagen achieves significantly better results for rare or unknown objects than Imagen, DALL-E 2, or Stable Diffusion.

Google Re-Imagen accesses an external database to better generate images of rare or unseen objects. | Image: Google

However, the new method also has disadvantages:

First, because Re-Imagen is sensitive to the retrieved image-text pairs it is conditioned on, when the retrieved image is of low-quality, there will be a negative influence on the generated image. Second, Re-Imagen sometimes still fail to ground on the retrieved entities when the entity’s visual appearance is out of the generation space. Third, we noticed that the super-resolution model is less effective, and frequently misses low-level texture details of the visual entities.

From the paper

The team plans to investigate these limitations further and address them in future work.

Summary
  • Generative AI models for converting text to images have difficulty generating images of rare or unseen objects. Fine-tuning is possible, but time-consuming.
  • Google's Re-Imagen enables image AI specialization by accessing an external database where the AI finds reference images and descriptions of rare or new objects and incorporates them into the generative process.
  • Re-Imagen generates much more accurate images of unknown objects than alternatives such as Imagen or DALL-E 2.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.