Google DeepMind and MIT have introduced "Fluid," a new autoregressive model for text-to-image generation that achieves top results when scaled to 10.5 billion parameters.

The study provides important insights into the scalability of AI models for image generation. It suggests that this model class could have similar scaling potential to large language models.

Until now, autoregressive models—the standard for language models—were considered less powerful than diffusion models for image generation. Diffusion models form the basis of tools like Stable Diffusion and Google's Imagen 3. However, the researchers identified two key design factors that significantly improve the performance and scalability of autoregressive models: using continuous tokens instead of discrete ones, and a random generation order rather than a fixed one.

Continuous tokens allow for more precise image information

During tokenization, images are broken down into smaller units that the model can process. Discrete tokens assign each image region a code from a limited vocabulary, which leads to information loss. Even large models with discrete tokens could not correctly generate certain details, such as symmetrical eyes.

Continuous tokens, on the other hand, store more precise information and allow for much better image reconstruction. The study shows that models with continuous tokens achieve significantly higher visual quality.
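
To make the difference concrete, here is a minimal toy sketch in Python (not the paper's tokenizer; the codebook size, latent dimensions, and random data are assumptions made for illustration). It compares the two approaches on the same encoder output: discrete tokens snap each latent to its nearest codebook entry and lose detail in the rounding, while continuous tokens keep the latent as it is.

```python
import numpy as np

# Toy illustration (not the paper's tokenizer): discrete vs. continuous tokens.
rng = np.random.default_rng(0)

# Pretend encoder output: one continuous latent vector per image patch.
latents = rng.normal(size=(256, 8))      # 256 patches, 8-dim latents (arbitrary)

# Discrete tokenization: snap each latent to the nearest entry of a limited codebook.
codebook = rng.normal(size=(1024, 8))    # fixed vocabulary of 1024 codes (arbitrary)
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
discrete_recon = codebook[dists.argmin(axis=1)]   # rounding to a code discards detail

# Continuous tokenization: the latent itself is the token, so nothing is discarded.
continuous_recon = latents

print("discrete reconstruction error:  ", float(np.mean((latents - discrete_recon) ** 2)))
print("continuous reconstruction error:", float(np.mean((latents - continuous_recon) ** 2)))
```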

Most autoregressive models also generate images in a fixed order, from left to right and top to bottom. The researchers experimented with a random order in which the model can predict multiple tokens at arbitrary positions in each step.

This method proved to be much more powerful, especially for tasks that require a good understanding of the global image structure. In the GenEval benchmark, which measures the match between text and generated image, models with random order performed significantly better.
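
The scheduling idea behind this random order can be illustrated with a short, hypothetical sketch (the grid size, step count, and the stand-in predict_tokens function are assumptions, not Fluid's actual implementation): the token positions are shuffled once, split into a few chunks, and the model fills in one chunk per step instead of sweeping left to right.

```python
import numpy as np

# Toy sketch of random-order generation scheduling (hypothetical, not Fluid's code).
rng = np.random.default_rng(0)

num_tokens = 16 * 16                       # image tokens on a 16x16 grid (arbitrary)
num_steps = 8                              # far fewer steps than tokens
order = rng.permutation(num_tokens)        # random generation order
chunks = np.array_split(order, num_steps)  # positions predicted together per step

tokens = np.full((num_tokens, 4), np.nan)  # 4-dim continuous token per position (placeholder)

def predict_tokens(context, positions):
    """Stand-in for the transformer: returns random values for the requested positions."""
    return rng.normal(size=(len(positions), context.shape[1]))

for step, positions in enumerate(chunks):
    tokens[positions] = predict_tokens(tokens, positions)
    remaining = int(np.isnan(tokens).any(axis=1).sum())
    print(f"step {step + 1}: {len(positions)} positions predicted, {remaining} still empty")
```

Because each step fills many positions at once, the number of model calls stays small even for large token grids, which is part of what makes the random-order scheme attractive.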

Fluid outperforms existing models

Based on these findings, the researchers developed the Fluid model, which combines continuous tokens and a random generation order. When scaled to 10.5 billion parameters, Fluid achieved top results in important benchmarks, outperforming both diffusion models like Stable Diffusion 3 and previous autoregressive models like Google's Parti.

Compared to Parti, Fluid shows significant improvements: while Parti with 20 billion parameters achieved an FID score of 7.23 on MS-COCO (lower is better), a small Fluid model with only 369 million parameters reached the same score.

The results suggest that autoregressive models like Fluid could be a promising alternative to diffusion models. While diffusion models like Stable Diffusion need many iterative denoising passes to generate an image, Fluid can generate images in a single pass. This could lead to more efficient image generation, especially as the models are scaled further.

Summary
  • Google DeepMind and MIT have unveiled "Fluid", a novel autoregressive model for text-to-image generation that achieves top results at 10.5 billion parameters and provides insights into the scalability of AI models for image generation.
  • Two key design factors improve performance: continuous rather than discrete tokens preserve more precise image information, and a random rather than fixed generation order improves the model's grasp of global image structure.
  • Fluid significantly outperforms existing models: a small Fluid model with 369 million parameters achieves the same FID score as Google's Parti with 20 billion parameters. The results suggest that autoregressive models such as Fluid could be an efficient alternative to diffusion models.