Google DeepMind and MIT have introduced "Fluid," a new autoregressive model for text-to-image generation that achieves top results when scaled to 10.5 billion parameters.
The study offers important insights into the scalability of AI models for image generation, suggesting that this model class could scale much like large language models.
Until now, autoregressive models, the standard architecture for language models, were considered less capable than diffusion models at image generation. Diffusion models form the basis of tools like Stable Diffusion and Google's Imagen 3. However, the researchers identified two design choices that significantly improve the performance and scalability of autoregressive models: using continuous tokens instead of discrete ones, and generating in a random order rather than a fixed one.
Continuous tokens preserve more precise image information
During tokenization, an image is broken down into smaller units that the model can process. Discrete tokens assign each image region a code from a limited vocabulary, which inevitably loses information. Even large models with discrete tokens could not reliably generate certain details, such as symmetrical eyes.
Continuous tokens, on the other hand, store more precise information and allow for much better image reconstruction. The study shows that models with continuous tokens achieve significantly higher visual quality.
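To make the difference concrete, here is a minimal toy sketch of the two tokenization styles. The codebook size, vector dimensions, and nearest-neighbor lookup are illustrative assumptions, not the paper's actual tokenizers, which are learned networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: the representation of one image patch is a latent vector.
latent = rng.normal(size=16)            # continuous patch representation
codebook = rng.normal(size=(1024, 16))  # limited discrete vocabulary

# Discrete tokenization: snap the latent to its nearest codebook entry.
token_id = np.argmin(np.linalg.norm(codebook - latent, axis=1))
discrete_reconstruction = codebook[token_id]

# Continuous tokenization: keep the latent vector itself.
continuous_reconstruction = latent

print("discrete error:  ", np.linalg.norm(latent - discrete_reconstruction))
print("continuous error:", np.linalg.norm(latent - continuous_reconstruction))  # 0.0
```

However large the codebook, the discrete path can only return one of its fixed entries, while the continuous path keeps the full vector, which is why reconstruction quality is fundamentally capped for discrete tokens.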
Most autoregressive models also generate images in a fixed raster order, from left to right and top to bottom. The researchers experimented with a random order instead, in which the model predicts multiple tokens at arbitrary positions in each step.

This approach proved far more capable, especially on tasks that require an understanding of global image structure. On the GenEval benchmark, which measures how well generated images match their text prompts, random-order models performed significantly better.
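The sketch below illustrates the difference between the two generation schedules. The grid size and the number of tokens revealed per step are hypothetical; a real model would predict the tokens at each step's positions conditioned on everything generated in earlier steps:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens = 256          # e.g. a 16x16 grid of image tokens
tokens_per_step = 16      # hypothetical: reveal 16 positions per step

# Raster order: one fixed position per step, 256 steps in total.
raster_schedule = [[i] for i in range(num_tokens)]

# Random order: a shuffled schedule that reveals several arbitrary
# positions per step, 16 steps in total.
positions = rng.permutation(num_tokens)
random_schedule = np.split(positions, num_tokens // tokens_per_step)

print(len(raster_schedule), "steps in raster order")
print(len(random_schedule), "steps in random order")
```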
Fluid outperforms existing models
Based on these findings, the researchers developed Fluid, which combines continuous tokens with a random generation order. Scaled to 10.5 billion parameters, Fluid achieved top results on key benchmarks, outperforming both diffusion models like Stable Diffusion 3 and earlier autoregressive models like Google's Parti.
The comparison with Parti shows how substantial the gains are: while Parti needs 20 billion parameters to reach an FID score of 7.23 on MS-COCO (lower is better), a small Fluid model matches that score with only 369 million parameters.
The results suggest that autoregressive models like Fluid could be a promising alternative to diffusion models. While diffusion models like Stable Diffusion need many sequential denoising passes through the network to generate an image, Fluid's random-order approach predicts multiple tokens per pass and therefore needs far fewer of them. This could make image generation more efficient, especially as models are scaled further.
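A back-of-the-envelope comparison of network passes per image; the concrete step counts here are illustrative assumptions, not figures from the paper:

```python
# One hypothetical 256-token image under three generation strategies.
num_tokens = 256

diffusion_passes = 50                    # a typical sampler step budget
raster_ar_passes = num_tokens            # one token per forward pass
random_ar_passes = num_tokens // 16      # several tokens per forward pass

print(f"diffusion:       {diffusion_passes} network passes")
print(f"raster-order AR: {raster_ar_passes} network passes")
print(f"random-order AR: {random_ar_passes} network passes")
```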