summary Summary

An international team of researchers presents Paella, a text-to-image AI model optimized for performance.

Currently, the best-known text-to-image AI systems, like Stable Diffusion and DALL-E 2, are based on diffusion models for image generation and transformers for speech understanding. This enables high-quality image generation for text input.

However, the systems require multiple inference steps - and thus strong hardware - for good results. According to the Paella research team, this can complicate application scenarios for end users.

Back to GANs

The team introduces Paella, a text-to-image model with 573 million parameters. According to the researchers, it uses a performance-optimized f8-VQGAN architecture (convolutional neural network, see explanatory video at the end of the article) with a medium compression ratio and CLIP embeddings.

The overall architecture of Paella. | Image: Rampas et al.

GA networks became widespread as deepfakes gained popularity before recently being eclipsed by diffusion methods. However, the research team sees the Paella architecture as a high-performance alternative to Diffusion and Transformer: Paella can generate a 256 x 256 pixel image in just eight steps and in less than 500 milliseconds on an Nvidia A100 GPU. Paella was trained for two weeks with 600 million images from the LAION-5B aesthetics dataset on 64 Nvidia A100 GPUs.

Some images generated with Paella. | Image: Rampas et al.

With our model, we can sample images with as few as 8 steps while still achieving high-fidelity results, making the model attractive for use-cases that are limited by requirements on latency, memory or computational complexity.

From the paper

In addition to image generation, Paella can modify the input images with techniques such as inpainting (changing the content of the image based on the text), outpainting (expanding the subject based on the text), and structural editing. Paella also supports prompt variations such as specific painting styles (e.g. watercolor).

Examples of outpainting with Paella - an existing image is visually expanded using a text command. | Image: Rampas et al.

The research team particularly highlights the small amount of code - just 400 lines - used to train and run Paella. This simplicity compared to transformer and diffusion models could make generative AI techniques manageable for more people, including those outside of research, they say.

The team makes its code and model available on Github. A demo of Paella is available at Huggingface. Image generation is fast and matches the text, but image quality cannot yet match diffusion models.

However, the researchers point to the comparatively small number of images used for training, which makes it difficult to make a fair comparison with other models, "especially when many of these models are kept private."


In this sense, the authors see Paella, along with the publication of the model and code, as contributing to "reproducible and transparent science." The lead author of the Paella study is Dominic Rampas of the Ingolstadt University of Technology.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Explainer video: What is a Convolutional Neural Network?

Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
  • Text-to-image AI systems like Stable Diffusion produce high-quality images, but their architecture is complex and they require significant computation.
  • The Paella research team presents a compact GAN-based text-to-image system that can generate an image in just eight steps in less than 500 milliseconds.
  • The research team hopes that the compact and easy-to-use Paella architecture can contribute to broader applications of generative AI technology.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.