A new open-source AI model called Meissonic can generate high-quality images using only a billion parameters. This compact size could enable local text-to-image applications, especially on mobile devices.

Researchers from Alibaba Group, Skywork AI, and several universities developed Meissonic using a unique transformer architecture and novel training techniques. The model runs on average gaming PCs and could eventually run on mobile phones.

Meissonic uses masked image modeling, in which parts of each image are hidden during training. The model learns to reconstruct the missing parts from the visible areas and the text description, which teaches it the relationships between image elements and text.
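As a rough illustration of the masking step, the sketch below hides half of a toy grid of image tokens. The token grid size, the codebook size of 8192, the `MASK_ID` value, and the 50% mask ratio are all assumptions made for the example, not details reported for Meissonic.

```python
import random

MASK_ID = -1  # hypothetical id reserved for the "hidden" placeholder token


def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of image tokens.

    Returns the masked sequence and the sorted positions that were hidden.
    """
    n_masked = max(1, round(len(tokens) * mask_ratio))
    hidden = set(rng.sample(range(len(tokens)), n_masked))
    masked = [MASK_ID if i in hidden else t for i, t in enumerate(tokens)]
    return masked, sorted(hidden)


rng = random.Random(0)
# Toy 4x4 grid of codebook ids standing in for a tokenized image.
image_tokens = [rng.randrange(8192) for _ in range(16)]
masked_input, targets = mask_tokens(image_tokens, mask_ratio=0.5, rng=rng)

# During training, a model would receive `masked_input` together with the
# text prompt and be asked to predict image_tokens[i] for every i in `targets`.
print(masked_input)
print(targets)
```

The visible tokens pass through unchanged, so the only way to fill in the hidden positions is to use the surrounding image context and the prompt.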

The model's architecture allows it to generate high-resolution images of 1024 x 1024 pixels: photorealistic scenes as well as stylized text, memes, or cartoon stickers, just like much larger models.

AI image collage with various motifs, including a teddy bear in the style of Van Gogh, futuristic architecture, cartoon characters, sci-fi characters and anthropomorphic animals.
Sample images in various styles created with Meissonic. | Image: Meissonic

Unlike typical autoregressive models that generate images sequentially, Meissonic predicts all image tokens simultaneously through parallel, iterative refinement. The researchers say this non-autoregressive approach reduces decoding steps by about 99%, significantly speeding up image creation.
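The idea of parallel, iterative refinement can be sketched as follows. This is a generic confidence-based decoding loop, not the researchers' implementation: `toy_predict` is a random stand-in for the transformer, the cosine unmasking schedule is an assumption, and the figures of 4096 tokens and 48 steps are illustrative values chosen for the example.

```python
import math
import random

MASK_ID = -1  # hypothetical placeholder for a not-yet-decided token


def toy_predict(tokens, rng):
    """Stand-in for the model: one (token, confidence) guess per position."""
    return [(rng.randrange(8192), rng.random()) for _ in tokens]


def parallel_decode(num_tokens, num_steps, rng):
    """Start fully masked and commit the most confident guesses each step."""
    tokens = [MASK_ID] * num_tokens
    for step in range(1, num_steps + 1):
        guesses = toy_predict(tokens, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK_ID]
        # Assumed cosine schedule: number of tokens left masked after this step.
        keep_masked = math.floor(
            num_tokens * math.cos(math.pi / 2 * step / num_steps)
        )
        n_commit = len(masked) - keep_masked
        # Most confident positions are fixed; the rest are retried later.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens


rng = random.Random(0)
tokens = parallel_decode(num_tokens=4096, num_steps=48, rng=rng)

# An autoregressive decoder would need one step per token (4096 here).
reduction = 1 - 48 / 4096
print(f"decoding steps saved: {reduction:.1%}")
```

With these illustrative numbers the loop needs 48 model calls instead of 4096, a saving of roughly 98.8%, which is the order of magnitude the researchers describe.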

Meissonic combines multimodal and monomodal transformer layers. Multimodal layers capture text-image interactions, while monomodal layers refine visual representations. The researchers found that a 1:2 ratio between these layer types worked best.
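To make the 1:2 ratio concrete, the sketch below splits a layer budget between the two block types. The total of 12 layers and the ordering (multimodal blocks first, monomodal blocks after) are assumptions for illustration; the article does not state either.

```python
def layer_plan(total_layers, multi_to_mono=(1, 2)):
    """Split a layer budget between multimodal and monomodal blocks."""
    multi, mono = multi_to_mono
    n_multi = total_layers * multi // (multi + mono)
    n_mono = total_layers - n_multi
    # Assumed ordering: joint text-image blocks first, image-only blocks after.
    return ["multimodal"] * n_multi + ["monomodal"] * n_mono


plan = layer_plan(12)  # hypothetical depth
print(plan.count("multimodal"), plan.count("monomodal"))
```

For a hypothetical 12-layer stack, this yields 4 multimodal and 8 monomodal layers.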

Flowchart: Multi-modal transformer for MIM, shows processing of text and image through different blocks to a common output.
The processing pipeline of the Meissonic AI image model. It shows how text and image inputs are processed by different transformer blocks to generate multimodal outputs. | Image: Bai et al.

The researchers trained Meissonic using a four-step process. First, they taught the model basic concepts using 200 million images at 256 x 256 pixel resolution. Next, they improved its text comprehension with 10 million carefully filtered image-text pairs at 512 x 512 resolution.

In the third step, they added special compression layers to enable 1024 x 1024 pixel output. Finally, they fine-tuned the model using low learning rates and incorporated human preference data to refine its performance.
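The four stages can be summarized as a small configuration table. The sketch below only records what the article reports; the `Stage` class and its field names are hypothetical, and values the article does not give are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    resolution: Optional[int]   # square side length in pixels, None if not stated
    num_samples: Optional[int]  # training set size, None if not stated
    note: str


STAGES = [
    Stage("basic concepts", 256, 200_000_000, "large image set"),
    Stage("text comprehension", 512, 10_000_000, "filtered image-text pairs"),
    Stage("high resolution", 1024, None, "adds compression layers"),
    Stage("fine-tuning", None, None, "low learning rate, human preference data"),
]

for number, stage in enumerate(STAGES, start=1):
    res = f"{stage.resolution} x {stage.resolution}" if stage.resolution else "not stated"
    size = f"{stage.num_samples:,}" if stage.num_samples else "not stated"
    print(f"{number}. {stage.name}: resolution {res}, samples {size} ({stage.note})")
```

Laid out this way, the progression is easy to see: the data set shrinks and becomes more curated as the resolution increases.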

Meissonic can outperform much larger models

Despite its small size, Meissonic outperformed larger models such as SDXL and DeepFloyd-XL on benchmarks including Human Preference Score v2 (HPSv2), where it scored 28.83, the highest among the models compared.

Series of AI images: Fiery end of the world with botanical illustrations.
SD 1.5, SD 2.1, DeepFloyd-XL, Deliberate, SDXL 1.0 and Meissonic with the prompt: "A graphic poster depicting the fiery end of the world with detailed botanical illustrations and artistic influences." | Image: Bai et al.
AI illustration: Pokémon in the shape of a phone booth, popular on Artstation and Unreal Engine.
SD 1.5, SD 2.1, DeepFloyd-XL, Deliberate, SDXL 1.0 and Meissonic with the prompt "A Pokémon that looks like a phone booth is gaining popularity on Artstation and Unreal Engine." | Image: Bai et al.

Meissonic can also perform inpainting and outpainting without additional training. The researchers show examples of changing image backgrounds, styles, and objects.

Collage: Meissonic in- and outpainting examples, showing original, edited and expanded image sections of various motifs.
The examples illustrate Meissonic's inpainting and outpainting capabilities. This allows you to seamlessly add missing image areas or creatively enhance existing images. | Image: Bai et al.

The researchers believe their approach could enable faster, cheaper development of custom AI image generators. It could also drive the development of on-device text-to-image applications for mobile devices.

A demo is available on Hugging Face, and the code is available on GitHub. The model runs on consumer GPUs with 8 GB of VRAM.
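A back-of-the-envelope calculation shows why a one-billion-parameter model fits on such hardware. The numeric precisions below are assumptions, since the article does not say which one is used, and the figures cover the weights only: activations and any additional components need memory on top.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


PARAMS = 1_000_000_000  # "a billion parameters", as stated in the article

for precision, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{precision}: {weight_memory_gb(PARAMS, nbytes):.1f} GB")
```

At 16-bit precision the weights take about 2 GB, which leaves most of an 8 GB card free for everything else.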

Summary
  • Researchers from Alibaba, Skywork AI, and universities have developed Meissonic, an open-source model that can efficiently generate and process high-resolution images and is compact enough to run on average gaming PCs and, in the future, mobile phones.
  • Meissonic uses a non-autoregressive, masked image modeling approach with multimodal and monomodal transformer layers. This approach speeds up image synthesis significantly compared to conventional autoregressive methods.
  • In benchmarks, Meissonic has shown superior performance to other leading text-to-image models, despite its small size of only one billion parameters.
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.