Summary

Generative AI models like DALL-E 2 or Stable Diffusion store their knowledge in their parameters. Meta shows a model that can access an external database instead.

Current generative AI models like DALL-E 2 generate impressive images, but store their knowledge in the model parameters. Improvements require ever larger models and ever more training data.

Researchers are therefore working on several ways to teach models new concepts or provide direct access to external knowledge.

Ideas for such retrieval-augmented methods come from another field of generative AI: natural language processing. OpenAI, Google, and Meta have already demonstrated language models such as WebGPT that can, for example, access the Internet to check their answers.


Google and Meta rely on queries for multimodal models

In October, Google researchers demonstrated Re-Imagen (Retrieval-Augmented Text-to-Image Generator), a generative model that uses an external multimodal knowledge base to generate images.

Re-Imagen uses the database to retrieve semantic and visual information about unknown or rare objects, and thus improves the accuracy of image generation.

Google Re-Imagen accesses an external database to generate higher-quality images of rare or unseen objects. | Image: Google

In a new paper, researchers from Meta, Stanford University, and the University of Washington now demonstrate the generative model RA-CM3 (Retrieval Augmented CM3), which also uses external data. CM3 stands for "Causal Masked Multimodal Model" and is a transformer model introduced by Meta in early 2022 that can generate images and text.

Meta's RA-CM3 relies on an external database, making it much smaller

Meta's RA-CM3 was trained using part of the LAION dataset, which was also used for Stable Diffusion. Unlike Re-Imagen, RA-CM3 can process text as well as images. Text prompts and images can serve as input.

The input is processed by a multimodal encoder and passed to a retriever, which fetches relevant multimodal data from external memory; the retrieved data is then processed by a multimodal encoder as well.

A retriever queries information from external storage. | Image: Yasunaga et al.

Both data streams are then passed to the multimodal generator, which produces text or images. RA-CM3 can draw on external image and text data to generate more accurate images or image captions. The database also allows the model to retrieve images of a particular mountain, building, or monument and use them to generate an image containing that object.
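The retrieve-then-generate loop described above can be sketched in a few lines of Python. This is a toy illustration, not Meta's implementation: the hash-free bag-of-words "encoder", the tiny in-memory database, and all names here are stand-ins (a real system like RA-CM3 uses a trained multimodal encoder and a transformer generator over a LAION-scale index).

```python
# Toy sketch of retrieval-augmented generation: encode the prompt,
# retrieve the most similar entry from external memory, and condition
# the generator on both. All components are illustrative stand-ins.
from collections import Counter
from math import sqrt

def encode(text):
    """Stand-in for a multimodal encoder: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter-based vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# External memory: multimodal documents (caption plus image reference).
MEMORY = [
    {"caption": "the eiffel tower in paris at night", "image": "eiffel.jpg"},
    {"caption": "a golden retriever playing in snow", "image": "dog.jpg"},
    {"caption": "mount fuji with cherry blossoms", "image": "fuji.jpg"},
]

def retrieve(prompt, k=1):
    """Return the k memory entries most similar to the encoded prompt."""
    q = encode(prompt)
    ranked = sorted(MEMORY, key=lambda d: cosine(q, encode(d["caption"])),
                    reverse=True)
    return ranked[:k]

def generate(prompt):
    """Stand-in generator: condition on the prompt plus retrieved context."""
    context = retrieve(prompt, k=1)[0]
    return f"<image conditioned on '{prompt}' and retrieved {context['image']}>"

print(generate("a photo of the eiffel tower"))
```

The key design point the sketch captures is that world knowledge (which monument looks how) lives in `MEMORY`, not in the model's parameters, so the "model" can stay small while the database grows.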

RA-CM3 can query external images and use them for better image synthesis. | Image: Yasunaga et al.

Using the external information, the model can also better complete images. RA-CM3 also exhibits one-shot and few-shot image classification abilities, the researchers write.

RA-CM3 can also use multiple external images as input for image synthesis. | Image: Yasunaga et al.

Overall, RA-CM3 uses significantly less training data and resources and is also much smaller than comparable models, the team writes. For the largest model, the researchers used 150 million images and three billion parameters. However, the average quality of the generated images is still below that of the much larger models from OpenAI or Google.

However, the effect of scaling is also evident in RA-CM3: The largest model is ahead of the smaller variants, and the team assumes that larger models will be significantly better.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
  • Generative text-to-image models like DALL-E 2 store their knowledge in the model parameters. This makes them large and their training resource-intensive.
  • Retrieval-augmented multimodal models could solve this problem and improve the accuracy of AI systems.
  • Meta researchers demonstrate RA-CM3, a multimodal model that can process and generate both text and images. To do so, the system accesses an external database.
  • According to the team, this makes RA-CM3 significantly smaller than similarly performing systems and allows it, for example, to synthesize images of monuments more faithfully and comprehensively.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.