
Meta AI introduces "Transfusion," a new approach that combines language modeling and image generation in a unified AI system. The model matches specialized systems in image generation while also improving text processing.


Researchers from Meta AI have developed "Transfusion," a method that integrates language models and image generation into a single AI system. According to the research team, Transfusion combines the strengths of language models in processing discrete data like text with the capabilities of diffusion models in generating continuous data like images.

Current image generation systems often use pre-trained text encoders to process input prompts, which are then combined with separate diffusion models for image generation, Meta explains. Many multimodal language models work similarly, connecting pre-trained text models with specialized encoders for other modalities.

Image: Meta

Transfusion, on the other hand, uses a single, unified Transformer architecture for all modalities, trained end-to-end on text and image data. Different loss functions are used for text and images: next token prediction for text and diffusion for images.
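The dual-objective training described above can be sketched as a single loss that sums a language-modeling term over text positions and a denoising term over image patches. This is a minimal illustration, not Meta's implementation; the function name, tensor shapes, and the `lambda_img` balancing weight are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def transfusion_loss(text_logits, text_targets,
                     predicted_noise, true_noise,
                     lambda_img=1.0):
    """Combine next-token prediction (text) with a diffusion-style
    denoising objective (images) into one training loss.

    text_logits:     (batch, text_len, vocab) model outputs
    text_targets:    (batch, text_len) next-token labels
    predicted_noise: (batch, patches, dim) model's noise estimate
    true_noise:      (batch, patches, dim) noise added to patches
    lambda_img:      hypothetical weight balancing the two terms
    """
    # Standard next-token cross-entropy over all text positions.
    lm_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
    )
    # Simple MSE denoising loss over the noised image patches.
    diffusion_loss = F.mse_loss(predicted_noise, true_noise)
    return lm_loss + lambda_img * diffusion_loss
```

Because both terms reduce to scalars, a single backward pass updates the shared Transformer weights for both modalities at once.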


To process text and images together, images are converted into sequences of image patches, allowing the model to handle text tokens and image patches in a single sequence. A special attention mask lets patches belonging to the same image attend to each other bidirectionally, so the model can capture relationships within images while keeping causal attention for text.
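The mixed attention pattern described above can be sketched as follows: start from a standard causal mask and open up full attention inside each image's span of patch positions. The function name and the span representation are assumptions for illustration, not the paper's code.

```python
import torch

def transfusion_attention_mask(seq_len, image_spans):
    """Build a boolean attention mask: causal over the whole
    sequence, bidirectional within each image's patch span.

    seq_len:     total number of positions (text tokens + patches)
    image_spans: list of (start, end) pairs (end exclusive) marking
                 where image patches sit in the sequence
    """
    # Standard causal (lower-triangular) mask: attend to the past only.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Within each image, let every patch attend to every other patch.
    for start, end in image_spans:
        mask[start:end, start:end] = True
    return mask
```

For example, with a 6-position sequence whose positions 2 through 4 are patches of one image, position 2 can attend forward to position 4, while a text token at position 1 still cannot look ahead.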

This integrated approach also differs from methods like Meta's Chameleon, which convert images into discrete tokens and then treat them like text. According to the research team, Transfusion preserves the continuous representation of images, avoiding information loss due to quantization.

Transfusion achieves high image and text quality in initial tests

Experiments also show that Transfusion scales more efficiently than comparable approaches. In image generation, it achieved similar results to specialized models with significantly less computational effort. Surprisingly, integrating image data also improved text processing capabilities.

Image: Meta

The researchers trained a 7-billion-parameter model on 2 trillion text and image tokens. This model achieved similar results in image generation to established systems like DALL-E 2 while also being able to process text.

The researchers see potential for further improvements, such as integrating additional modalities or alternative training methods.

Summary
  • Meta AI introduces "Transfusion," a new approach that combines language models and image generation in a single AI system, achieving similar results to specialized systems in image generation while improving text processing.
  • Transfusion uses a unified Transformer architecture for all modalities, trained end-to-end on text and image data. It processes images as sequences of patches alongside text tokens, using different loss functions for each modality.
  • In experiments, a 7-billion-parameter Transfusion model trained on 2 trillion text and image tokens achieved comparable results to established systems like DALL-E 2 in image generation, while also maintaining text processing capabilities.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.