Google DeepMind and MIT have introduced "Fluid," a new autoregressive model for text-to-image generation that achieves top results when scaled to 10.5 billion parameters.
The study offers important insights into the scalability of AI models for image generation, suggesting that this model class could scale much like large language models.
Until now, autoregressive models, the standard architecture for language models, were considered less capable than diffusion models at image generation. Diffusion models form the basis of tools like Stable Diffusion and Google's Imagen 3. However, the researchers identified two design choices that significantly improve the performance and scalability of autoregressive models: using continuous tokens instead of discrete ones, and generating in a random order rather than a fixed one.
Continuous tokens preserve more precise image information
During tokenization, an image is broken down into smaller units that the model can process. Discrete tokens assign each image region a code from a limited vocabulary, which inevitably loses information. Even large models with discrete tokens could not reliably generate certain details, such as symmetrical eyes.
Continuous tokens, on the other hand, store more precise information and allow for much better image reconstruction. The study shows that models with continuous tokens achieve significantly higher visual quality.
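To make the difference concrete, here is a minimal toy sketch of the two tokenization styles. The codebook size, vector dimensions, and nearest-neighbor lookup are illustrative assumptions, not the paper's actual tokenizers, which are learned networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: the representation of one image patch is a latent vector.
latent = rng.normal(size=16)            # continuous patch representation
codebook = rng.normal(size=(1024, 16))  # limited discrete vocabulary

# Discrete tokenization: snap the latent to its nearest codebook entry.
token_id = np.argmin(np.linalg.norm(codebook - latent, axis=1))
discrete_reconstruction = codebook[token_id]

# Continuous tokenization: keep the latent vector itself.
continuous_reconstruction = latent

print("discrete error:  ", np.linalg.norm(latent - discrete_reconstruction))
print("continuous error:", np.linalg.norm(latent - continuous_reconstruction))  # 0.0
```

However large the codebook, the discrete path can only return one of its fixed entries, while the continuous path keeps the full vector, which is why reconstruction quality is fundamentally capped for discrete tokens.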
Most autoregressive models also generate images in a fixed raster order, from left to right and top to bottom. The researchers experimented with a random order instead, in which the model predicts multiple tokens at arbitrary positions in each step.

This approach proved far more capable, especially on tasks that require an understanding of global image structure. On the GenEval benchmark, which measures how well generated images match their text prompts, random-order models performed significantly better.
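The sketch below illustrates the difference between the two generation schedules. The grid size and the number of tokens revealed per step are hypothetical; a real model would predict the tokens at each step's positions conditioned on everything generated in earlier steps:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens = 256          # e.g. a 16x16 grid of image tokens
tokens_per_step = 16      # hypothetical: reveal 16 positions per step

# Raster order: one fixed position per step, 256 steps in total.
raster_schedule = [[i] for i in range(num_tokens)]

# Random order: a shuffled schedule that reveals several arbitrary
# positions per step, 16 steps in total.
positions = rng.permutation(num_tokens)
random_schedule = np.split(positions, num_tokens // tokens_per_step)

print(len(raster_schedule), "steps in raster order")
print(len(random_schedule), "steps in random order")
```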
Fluid outperforms existing models
Based on these findings, the researchers developed Fluid, which combines continuous tokens with a random generation order. Scaled to 10.5 billion parameters, Fluid achieved top results on key benchmarks, outperforming both diffusion models like Stable Diffusion 3 and earlier autoregressive models like Google's Parti.
The comparison with Parti shows how substantial the gains are: while Parti needs 20 billion parameters to reach an FID score of 7.23 on MS-COCO (lower is better), a small Fluid model matches that score with only 369 million parameters.
The results suggest that autoregressive models like Fluid could be a promising alternative to diffusion models. While diffusion models like Stable Diffusion need many sequential denoising passes through the network to generate an image, Fluid's random-order approach predicts multiple tokens per pass and therefore needs far fewer of them. This could make image generation more efficient, especially as models are scaled further.
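A back-of-the-envelope comparison of network passes per image; the concrete step counts here are illustrative assumptions, not figures from the paper:

```python
# One hypothetical 256-token image under three generation strategies.
num_tokens = 256

diffusion_passes = 50                    # a typical sampler step budget
raster_ar_passes = num_tokens            # one token per forward pass
random_ar_passes = num_tokens // 16      # several tokens per forward pass

print(f"diffusion:       {diffusion_passes} network passes")
print(f"raster-order AR: {raster_ar_passes} network passes")
print(f"random-order AR: {random_ar_passes} network passes")
```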