Tiny open-source image model Meissonic offers impressive image quality for its size

A new open-source AI model called Meissonic can generate high-quality images using only a billion parameters. This compact size could enable local text-to-image applications, especially on mobile devices.

Researchers from Alibaba Group, Skywork AI, and several universities developed Meissonic using a transformer architecture that mixes multimodal and monomodal layers, along with a multi-stage training process. The model runs on consumer gaming PCs and could eventually run on mobile phones.

Meissonic uses masked image modeling, in which parts of an image are hidden during training. The model learns to reconstruct the missing parts from the visible areas and the text description, which helps it learn the relationships between image elements and text.
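The masking idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Meissonic's training code: the mask id, the codebook size, the 4 x 4 token grid, and both function names are hypothetical.

```python
import random

MASK_ID = -1  # hypothetical id for a special [MASK] token


def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of image tokens.

    Returns the masked sequence and the sorted list of hidden positions.
    """
    n_masked = max(1, round(len(tokens) * mask_ratio))
    positions = set(rng.sample(range(len(tokens)), n_masked))
    masked = [MASK_ID if i in positions else t for i, t in enumerate(tokens)]
    return masked, sorted(positions)


def masked_accuracy(predictions, tokens, positions):
    """Score predictions only at the positions that were hidden."""
    correct = sum(predictions[i] == tokens[i] for i in positions)
    return correct / len(positions)


rng = random.Random(0)
# Toy 4 x 4 grid of discrete image tokens; the codebook size is an assumption.
image_tokens = [rng.randrange(8192) for _ in range(16)]
masked_input, hidden = mask_tokens(image_tokens, mask_ratio=0.5, rng=rng)
```

A real model would receive `masked_input` together with the text prompt and be trained to predict the original tokens at the `hidden` positions.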

The architecture lets Meissonic generate images at 1024 x 1024 pixels, ranging from photorealistic scenes to stylized text, memes, and cartoon stickers, much like far larger models.

AI image collage with various motifs, including a teddy bear in the style of Van Gogh, futuristic architecture, cartoon characters, sci-fi characters and anthropomorphic animals. — Sample images in various styles created with Meissonic. | Image: Meissonic

Unlike typical autoregressive models that generate images sequentially, Meissonic predicts all image tokens simultaneously through parallel, iterative refinement. The researchers say this non-autoregressive approach reduces decoding steps by about 99%, significantly speeding up image creation.
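A minimal sketch of parallel, iterative refinement, written in the style of MaskGIT-like masked decoders. The cosine schedule, the random stand-in predictor, and the token and step counts are assumptions for illustration; the article does not specify Meissonic's exact schedule.

```python
import math
import random

MASK = None  # placeholder for a token that has not been decided yet


def toy_predictor(tokens, rng):
    """Stand-in for the transformer: propose (token, confidence) for every position.

    A real model would condition on the text prompt and the tokens decided so far.
    """
    return [(rng.randrange(1024), rng.random()) for _ in tokens]


def parallel_decode(num_tokens, num_steps, rng):
    """Start fully masked and commit the most confident proposals at each step."""
    tokens = [MASK] * num_tokens
    for step in range(1, num_steps + 1):
        proposals = toy_predictor(tokens, rng)
        # Cosine schedule: how many tokens should still be masked after this step.
        keep_masked = math.floor(num_tokens * math.cos(math.pi / 2 * step / num_steps))
        masked_positions = [i for i, t in enumerate(tokens) if t is MASK]
        n_to_fill = len(masked_positions) - keep_masked
        # Fill the still-masked positions with the highest-confidence proposals.
        masked_positions.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked_positions[:n_to_fill]:
            tokens[i] = proposals[i][0]
    return tokens


def step_reduction(num_tokens, num_steps):
    """Fraction of decoding steps saved versus one step per token."""
    return 1 - num_steps / num_tokens


decoded = parallel_decode(num_tokens=64, num_steps=8, rng=random.Random(0))
```

With illustrative numbers, decoding 4,096 tokens in 64 parallel steps instead of one token per step gives `step_reduction(4096, 64)`, a saving of roughly 98 percent, which is in the range the researchers describe.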

Meissonic combines multimodal and monomodal transformer layers. Multimodal layers capture text-image interactions, while monomodal layers refine visual representations. The researchers found that a 1:2 ratio between these layer types worked best.
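The 1:2 ratio can be illustrated with a small layer plan. This is a hypothetical sketch: the function name, the block count, and the ordering (multimodal blocks first) are assumptions for illustration, not details stated in the article.

```python
def layer_plan(num_multimodal, ratio=(1, 2)):
    """Build a list of block types with the given multimodal:monomodal ratio.

    Assumes multimodal blocks come first, followed by monomodal blocks.
    """
    multimodal_part, monomodal_part = ratio
    num_monomodal = num_multimodal * monomodal_part // multimodal_part
    return ["multimodal"] * num_multimodal + ["monomodal"] * num_monomodal


# Toy example: 4 multimodal blocks imply 8 monomodal blocks at a 1:2 ratio.
plan = layer_plan(num_multimodal=4)
```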

Flowchart: Multi-modal transformer for MIM, shows processing of text and image through different blocks to a common output. — The processing pipeline of the Meissonic AI image model. It shows how text and image inputs are processed by different transformer blocks to generate multimodal outputs. | Image: Bai et al.

The researchers trained Meissonic using a four-step process. First, they taught the model basic concepts using 200 million images at 256 x 256 pixel resolution. Next, they improved its text comprehension with 10 million carefully filtered image-text pairs at 512 x 512 resolution.

In the third step, they added special compression layers to enable 1024 x 1024 pixel output. Finally, they fine-tuned the model using low learning rates and incorporated human preference data to refine its performance.
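The four stages can be summarized as a small data structure. The class and field names are hypothetical, and values the article does not state (such as the dataset size for stages three and four, or the resolution of the final stage) are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    resolution: Optional[int]   # square edge length in pixels, if stated
    num_samples: Optional[int]  # training images or pairs, if stated
    note: str


STAGES = [
    Stage("basic concepts", 256, 200_000_000, "large-scale image data"),
    Stage("text comprehension", 512, 10_000_000, "filtered image-text pairs"),
    Stage("high resolution", 1024, None, "adds compression layers"),
    Stage("fine-tuning", None, None, "low learning rates, human preference data"),
]


def pixel_growth(earlier, later):
    """How many times more pixels per image the later stage processes."""
    return later.resolution ** 2 / earlier.resolution ** 2


growth_1_to_2 = pixel_growth(STAGES[0], STAGES[1])
growth_2_to_3 = pixel_growth(STAGES[1], STAGES[2])
```

Each doubling of the edge length quadruples the pixel count per image, which may be one reason the larger datasets are used at the lower resolutions.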

Meissonic can outperform much larger models

Despite its small size, Meissonic outperformed larger models such as SDXL and DeepFloyd-XL on benchmarks including Human Preference Score v2 (HPSv2), where it scored 28.83, higher than the other models in the comparison.

Series of AI images: Fiery end of the world with botanical illustrations. — SD 1.5, SD 2.1, DeepFloyd-XL, Deliberate, SDXL 1.0 and Meissonic with the prompt: "A graphic poster depicting the fiery end of the world with detailed botanical illustrations and artistic influences." | Image: Bai et al.

AI illustration: Pokémon in the shape of a phone booth, popular on Artstation and Unreal Engine. — SD 1.5, SD 2.1, DeepFloyd-XL, Deliberate, SDXL 1.0 and Meissonic with the prompt "A Pokémon that looks like a phone booth is gaining popularity on Artstation and Unreal Engine." | Image: Bai et al.

Meissonic can also perform inpainting and outpainting without additional training. The researchers show examples of changing image backgrounds, styles, and objects.

Collage: Meissonic in- and outpainting examples, showing original, edited and expanded image sections of various motifs. — The examples illustrate Meissonic's inpainting and outpainting capabilities. This allows you to seamlessly add missing image areas or creatively enhance existing images. | Image: Bai et al.

The researchers believe their approach could enable faster, cheaper development of custom AI image generators. It could also help bring text-to-image applications directly to mobile devices.

A demo is available on Hugging Face, and the code is available on GitHub. The model runs on consumer GPUs with 8GB of VRAM.
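A rough back-of-the-envelope calculation shows why a model of this size fits on such hardware. The sketch assumes exactly one billion parameters and counts weights only; the text encoder, image tokenizer, and activations need additional memory, and the precision options are assumptions rather than details from the article.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumed precisions: 16-bit (2 bytes) and 32-bit (4 bytes) per parameter.
fp16_gib = weight_memory_gib(1_000_000_000, 2)
fp32_gib = weight_memory_gib(1_000_000_000, 4)
```

Under these assumptions the weights take about 1.9 GiB at 16-bit precision and about 3.7 GiB at 32-bit precision, leaving headroom within 8GB of VRAM for the other components.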
