Meta AI introduces "Transfusion," a new approach that combines language modeling and image generation in a single unified AI system. According to the researchers, the model matches specialized image generation systems while also improving text processing.
Researchers from Meta AI have developed "Transfusion," a method that integrates language models and image generation into a single AI system. According to the research team, Transfusion combines the strengths of language models in processing discrete data like text with the capabilities of diffusion models in generating continuous data like images.
Current image generation systems often use pre-trained text encoders to process input prompts, feeding the resulting representations into separate diffusion models that generate the images, Meta explains. Many multimodal language models work similarly, connecting pre-trained text models with specialized encoders for other modalities.
Transfusion, on the other hand, uses a single, unified Transformer architecture for all modalities, trained end-to-end on text and image data. Different loss functions are used for the two modalities: next-token prediction for text and a diffusion objective for images.
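The paper does not come with reference code, but the combined objective can be sketched in a few lines. The following PyTorch snippet is illustrative only: the names (`transfusion_loss`, `lambda_img`) and tensor shapes are assumptions, the diffusion part is reduced to a plain noise-prediction MSE, and the weighting between the two losses is a tunable hyperparameter rather than a value from the paper.

```python
import torch
import torch.nn.functional as F

def transfusion_loss(token_logits, target_tokens,
                     predicted_noise, true_noise, lambda_img=1.0):
    """Sum a language-modeling loss (text) and a denoising loss (images).

    token_logits:    (num_text_positions, vocab_size) logits at text positions
    target_tokens:   (num_text_positions,) ground-truth next tokens
    predicted_noise: (num_patches, patch_dim) the model's noise estimate for
                     the noised image patches
    true_noise:      (num_patches, patch_dim) the noise that was actually added
    """
    # Discrete modality: standard next-token cross-entropy.
    lm_loss = F.cross_entropy(token_logits, target_tokens)
    # Continuous modality: epsilon-prediction MSE, the usual diffusion loss.
    diffusion_loss = F.mse_loss(predicted_noise, true_noise)
    # lambda_img balances the two objectives; treated here as a free knob.
    return lm_loss + lambda_img * diffusion_loss

# Toy tensors, with shapes chosen purely for illustration.
logits = torch.randn(8, 1000)                  # 8 text positions, vocab 1000
targets = torch.randint(0, 1000, (8,))
pred_eps, true_eps = torch.randn(16, 64), torch.randn(16, 64)  # 16 patches
loss = transfusion_loss(logits, targets, pred_eps, true_eps)
```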
To process text and images together, images are converted into sequences of image patches, allowing the model to handle text tokens and image patches within a single sequence. A modified attention mask lets the patches of an image attend to one another bidirectionally, so the model can capture relationships within images, while text keeps the usual causal attention.
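Both mechanisms are easy to picture in code. The sketch below, again with assumed names and shapes rather than the paper's actual implementation, cuts a (C, H, W) image into flat patch vectors and builds a mask that is causal overall but bidirectional inside each image span.

```python
import torch

def patchify(image, patch_size=16):
    """Turn a (C, H, W) image into a (num_patches, C * patch_size**2) sequence."""
    c, h, w = image.shape
    # (C, H/p, W/p, p, p): carve the image into non-overlapping p x p tiles.
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # Flatten each tile into one vector per patch position.
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size ** 2)

def transfusion_mask(seq_len, image_spans):
    """Causal mask, except positions within one image attend to each other.

    image_spans: list of (start, end) index pairs marking image-patch runs.
    Returns a boolean (seq_len, seq_len) mask where True = may attend.
    """
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    for start, end in image_spans:
        mask[start:end, start:end] = True  # bidirectional inside the image
    return mask

patches = patchify(torch.randn(3, 64, 64))           # -> (16, 768)
mask = transfusion_mask(seq_len=24, image_spans=[(4, 20)])
```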
This integrated approach also differs from methods like Meta's Chameleon, which convert images into discrete tokens and then treat them like text. According to the research team, Transfusion preserves the continuous representation of images, avoiding information loss due to quantization.
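To see what information loss due to quantization means in practice, consider a toy nearest-neighbor codebook lookup of the kind used by discrete image tokenizers. The codebook below is random and purely illustrative; real tokenizers, such as the vector-quantized models behind Chameleon, learn theirs from data.

```python
import torch

codebook = torch.randn(1024, 64)   # 1024 discrete codes of dimension 64
patch = torch.randn(64)            # one continuous image patch vector

# Discrete route (Chameleon-style): snap the patch to its nearest code.
nearest = ((codebook - patch) ** 2).sum(dim=1).argmin()
quantized = codebook[nearest]
print("quantization error:", (patch - quantized).norm().item())  # > 0

# Continuous route (Transfusion): the patch vector enters the transformer
# as-is, so no rounding error is introduced at this stage.
```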
Transfusion achieves high image and text quality in initial tests
Experiments also show that Transfusion scales more efficiently than comparable approaches. In image generation, it achieved similar results to specialized models with significantly less computational effort. Surprisingly, integrating image data also improved text processing capabilities.
The researchers trained a 7-billion-parameter model on 2 trillion text and image tokens. This model achieved similar results in image generation to established systems like DALL-E 2 while also being able to process text.
The researchers see potential for further improvements, such as integrating additional modalities or alternative training methods.