Meta's new state-of-the-art, versatile image model is trained solely on licensed data

Meta's latest image model CM3leon can understand and generate both text and images. It can create images from text descriptions and compose text based on images, making it useful for many tasks.

CM3leon (pronounced "chameleon") is a single foundation model capable of both text-to-image and image-to-text generation. It is the first multimodal model trained with a recipe adapted from text-only language models that can input and generate both text and images.

CM3leon's architecture is a decoder-only, tokenizer-based transformer, similar to text-based language models. It builds on previous work (RA-CM3) and uses an external database during training, a technique called "retrieval augmentation". Whereas other models learn only from the raw data fed to them, models with retrieval augmentation actively seek out the most relevant and diverse data during training, which makes the training phase more robust and efficient.
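To make the "relevant and diverse" idea concrete, here is a minimal sketch of a retrieval step, not Meta's implementation. It assumes documents are already embedded as vectors; the function names, the toy two-dimensional vectors, and the 0.95 similarity cutoff are all hypothetical choices for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, memory, k=2, max_sim=0.95):
    """Pick up to k documents that are relevant to the query but not redundant.

    max_sim is an assumed cutoff, not a value taken from the article.
    """
    # Relevance: rank every candidate by similarity to the query.
    ranked = sorted(memory, key=lambda doc: cosine(query, doc["vec"]), reverse=True)
    chosen = []
    for doc in ranked:
        # Diversity: skip near-copies of the query itself ...
        if cosine(query, doc["vec"]) > max_sim:
            continue
        # ... and near-copies of documents that were already picked.
        if any(cosine(doc["vec"], c["vec"]) > max_sim for c in chosen):
            continue
        chosen.append(doc)
        if len(chosen) == k:
            break
    return chosen

# Toy memory bank with hand-made embeddings (hypothetical data).
memory = [
    {"id": "a", "vec": [1.0, 0.0]},   # same direction as the query: skipped
    {"id": "b", "vec": [0.8, 0.6]},
    {"id": "c", "vec": [0.8, 0.61]},  # almost identical to "b": skipped
    {"id": "d", "vec": [0.2, 0.9]},
]
picked = [doc["id"] for doc in retrieve([1.0, 0.0], memory)]
print(picked)
```

In a retrieval-augmented setup, the picked documents would be placed in the model's context alongside the training example, so the model sees related material it would not otherwise have encountered in that batch.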

Meta claims that CM3leon requires five times less compute for training than previous transformer-based methods, as well as less training data, making it as efficient to train as existing diffusion-based models.

A multitasking chameleon

Thanks to large-scale multitask instruction tuning, CM3leon can perform a variety of tasks, including text-to-image generation, text-guided image editing, structure-guided image editing, caption generation, and visual question answering.

"Instruction tuning" means that the model is trained to follow instructions given in text format. For example, you could provide an instruction such as "describe an image of a sunset over the ocean," and the AI model will generate a description based on that instruction. The model has been trained on such examples in the wide variety of tasks mentioned above.

Example prompts: (1) A small cactus wearing a straw hat and neon sunglasses in the Sahara desert. (2) A close-up photo of a human hand, hand model. High quality. (3) A raccoon main character in an Anime preparing for an epic battle with a samurai sword. Battle stance. Fantasy, Illustration. (4) A stop sign in a Fantasy style with the text “1991.”

Meta also says that scaling recipes developed for text-only models generalize directly to tokenization-based image generation models, which suggests even better results from bigger models trained longer on more data. CM3leon's training consisted of a large-scale retrieval-augmented pre-training phase on huge amounts of data, followed by a supervised fine-tuning (SFT) phase with instructions that gives the model its multitasking capabilities.

On the zero-shot MS-COCO image generation benchmark, CM3leon achieves a Fréchet Inception Distance (FID) score of 4.88, which is a new state-of-the-art result and beats Google's Parti image model.
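For readers unfamiliar with the metric: FID compares the statistics of image features from real and generated images, and lower is better. With feature means μ_r, μ_g and covariances Σ_r, Σ_g, it is defined as ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2)). The sketch below is a simplified illustration, not the evaluation code behind the 4.88 score: it assumes diagonal covariances (per-feature variances only), in which case the trace term reduces to a sum of squared differences of standard deviations. Real FID uses full covariance matrices over Inception network features. The function name and inputs are hypothetical.

```python
import math

def frechet_distance_diag(mean_a, var_a, mean_b, var_b):
    """Fréchet distance between two Gaussians with diagonal covariances.

    Simplifying assumption: features are treated as uncorrelated, so each
    covariance matrix is represented by its list of per-feature variances.
    """
    # Squared Euclidean distance between the two mean vectors.
    mean_term = sum((ma - mb) ** 2 for ma, mb in zip(mean_a, mean_b))
    # For diagonal covariances, Tr(A + B - 2 * sqrt(A B)) equals
    # the sum over features of (sqrt(var_a) - sqrt(var_b)) ** 2.
    cov_term = sum(
        (math.sqrt(va) - math.sqrt(vb)) ** 2 for va, vb in zip(var_a, var_b)
    )
    return mean_term + cov_term

# Identical feature statistics give a distance of zero.
same = frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
# Shifting the mean by (3, 4) with equal variances gives 3**2 + 4**2.
shifted = frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
print(same, shifted)
```

The distance grows both when generated images drift away from real ones on average (the mean term) and when they are more or less varied than real ones (the covariance term).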

More coherence, more licensing, more metaverse

According to Meta, CM3leon excels at producing coherent images that better follow even complex input instructions. It can better recover global shapes and local details, generate text or numbers as they appear in the prompt, and solve tasks like text-guided image editing that previously required specialized models like InstructPix2Pix.

Meta's image model can edit images based on text prompts. | Image: Meta

It can also write detailed captions for images, reverse-prompting if you will, which can then be used for further image creation or editing, or for creating synthetic training datasets. Meta says that CM3leon matches or beats Flamingo and OpenFlamingo on text tasks, even though it was trained on less text (3 billion text tokens).

"A dog with a stick in his mouth." CM3leon can also read images and create captions for them, which can then be used for prompts or further training. | Image: Meta

Most notably, Meta says the model was trained on a "new large Shutterstock dataset that includes only licensed image and text data," but it's still very competitive compared to other models.

"As a result, we can avoid concerns related to image ownership and attribution without sacrificing performance," the researchers write.

According to Meta, CM3leon is a step toward higher-fidelity image generation and understanding, paving the way for multimodal language models. The company also remains a believer in the metaverse, stating that models like CM3leon "could ultimately help boost creativity and better applications in the metaverse."
