
LongCat-Image proves 6B parameters can beat bigger models with better data hygiene

LongCat-Image prompted by THE DECODER

Key Points

  • Meituan has introduced LongCat-Image, a compact open-source image model with six billion parameters that outperforms larger models in both text rendering and photorealism.
  • The performance comes from rigorously filtering AI-generated images out of the training data, a text encoding method that handles letters individually, and a reinforcement learning step that penalizes artificial-looking textures.
  • Alongside the image model, a separate image editing model and the full training code have been released to the public.

Chinese tech company Meituan has released LongCat-Image, a new open-source image model that challenges the industry's "bigger is better" mindset. With just 6 billion parameters, the model reportedly beats significantly larger competitors in both photorealism and text rendering, thanks to strict data curation and a clever approach to handling text.

While rivals like Tencent and Alibaba keep building bigger models - HunyuanImage 3.0 packs up to 80 billion parameters - Meituan went in the opposite direction. The team argues that brute-force scaling wastes hardware without actually making images look better. LongCat-Image instead uses an architecture similar to the popular Flux.1-dev, built on a hybrid Multimodal Diffusion Transformer (MM-DiT).

A collage of 15 images generated by LongCat-Image shows portraits, animals such as a mouse under a mushroom, landscapes, food, and various examples of correct text display on chalkboards, posters, and signs.
LongCat-Image handles photorealistic portraits and complex lighting as effectively as it renders text on signs and posters. | Image: Meituan

The system processes image and text data through two separate "attention paths" in the early layers before merging them later. This gives the text prompt tighter control over image generation without driving up the computational load.
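To make the idea concrete, here is a minimal PyTorch sketch of what such a dual-stream design can look like: each modality gets its own self-attention in early layers, and the streams are concatenated for joint attention later. The module names, dimensions, and layer split are illustrative assumptions, not Meituan's actual code.

```python
# Sketch of a dual-stream block (separate attention paths) followed by a
# joint block (merged attention), in the spirit of MM-DiT. All names and
# sizes are illustrative placeholders.
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        # Early layers: each modality attends only within itself.
        img_out, _ = self.img_attn(img_tokens, img_tokens, img_tokens)
        txt_out, _ = self.txt_attn(txt_tokens, txt_tokens, txt_tokens)
        return img_out, txt_out

class JointBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        # Later layers: both streams are merged so the prompt can steer
        # the image tokens directly through shared attention.
        joint = torch.cat([txt_tokens, img_tokens], dim=1)
        out, _ = self.attn(joint, joint, joint)
        n_txt = txt_tokens.shape[1]
        return out[:, n_txt:], out[:, :n_txt]
```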

Cleaning up training data fixes the "plastic" look

One of the biggest problems with current image AI, according to the researchers, is contaminated training data. When models learn from images that other AIs generated, they pick up a "plastic" or "greasy" texture. The model learns shortcuts instead of real-world complexity.


The team's fix was simple but aggressive: they scrubbed all AI-generated content from their dataset during pre-training and mid-training. Alibaba took a similar approach with Qwen-Image. Only during the final fine-tuning stage did they allow hand-picked, high-quality synthetic data back in.
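In code, staged filtering of this kind could look roughly like the sketch below: suspected AI-generated images are dropped outright during pre- and mid-training, and synthetic samples are only admitted at the fine-tuning stage if they were hand-picked. The detector score, thresholds, and `curated` flag are hypothetical placeholders used only to illustrate the idea.

```python
# Illustrative stage-dependent data filter; the detector probability,
# thresholds, and curation flag are assumptions, not the paper's pipeline.
from dataclasses import dataclass

@dataclass
class Sample:
    path: str
    ai_probability: float  # score from a synthetic-image detector (assumed)
    curated: bool = False  # manually approved high-quality synthetic sample

def keep_sample(sample: Sample, stage: str) -> bool:
    if stage in ("pretrain", "midtrain"):
        # Strict: anything the detector flags as likely AI-generated is removed.
        return sample.ai_probability < 0.1
    if stage == "finetune":
        # Synthetic data is allowed back in only if it was hand-picked.
        return sample.curated or sample.ai_probability < 0.1
    raise ValueError(f"unknown stage: {stage}")

dataset = [Sample("real_photo.jpg", 0.02), Sample("gen_art.png", 0.93, curated=True)]
print([s.path for s in dataset if keep_sample(s, "pretrain")])  # only the real photo
print([s.path for s in dataset if keep_sample(s, "finetune")])  # both samples
```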

Diagram of the data curation pipeline with four quadrants. It shows the filtering of watermarks and AI content, the extraction of meta-information such as OCR and aesthetic scores, multi-granular captioning, and the layering of the data pyramid for different training phases.
The four-stage data prep pipeline filters out synthetic content and uses vision language models to create detailed image descriptions. | Image: Meituan

The developers also came up with a new reinforcement learning trick: a detection model that penalizes the generator whenever it spots AI artifacts. This pushes the model to create textures realistic enough to fool the detector.
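Conceptually, that reward shaping can be expressed as a detector score subtracted from whatever base reward the generator is optimizing, so the generator is rewarded for textures the detector cannot flag. The detector interface and weighting below are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of an artifact-penalty reward: a detector estimates how
# "AI-looking" each generated image is, and that probability is
# subtracted from the base reward during RL fine-tuning.
import torch

def artifact_penalty_reward(images: torch.Tensor,
                            base_reward: torch.Tensor,
                            detector: torch.nn.Module,
                            weight: float = 1.0) -> torch.Tensor:
    with torch.no_grad():
        # Assumed: the detector outputs one logit per image.
        p_artificial = torch.sigmoid(detector(images)).squeeze(-1)
    # The more convincingly "real" the texture, the smaller the penalty.
    return base_reward - weight * p_artificial
```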

The results speak for themselves. In benchmarks, the 6B model regularly outscores much larger models like Qwen-Image-20B and HunyuanImage-3.0. And because it's so efficient, it runs on far less VRAM - good news for anyone wanting to run it locally.

Nine bar charts compare LongCat-Image with models such as Seedream 4.0, Qwen-Image, and HunyuanImage-3.0. The categories include text-to-image, text rendering, and image editing, with LongCat achieving leading scores in areas such as ChineseWord and CVTG-2K.
In benchmark tests, LongCat-Image (green) holds its own against larger models and often beats them in text rendering and image editing. | Image: Meituan

Letter-by-letter processing nails text in images

One of the model's best tricks is how it handles text inside images. Most models mess up spelling because they treat words as abstract tokens rather than individual letters. LongCat-Image takes a hybrid approach. It uses Qwen2.5-VL-7B to understand the overall prompt, but when it sees text in quotation marks, it switches to a character-level tokenizer. Instead of memorizing visual patterns for every possible word, the model builds text letter by letter.
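A toy version of this hybrid split might look like the following, where quoted spans are broken into single characters while the rest of the prompt is tokenized coarsely. The regex and the simple word split are stand-ins for the real tokenizers, used only to show the principle.

```python
# Toy hybrid tokenization: character-level tokens for quoted text that
# must be spelled correctly in the image, coarse tokens for everything else.
import re

def hybrid_tokenize(prompt: str) -> list[str]:
    tokens: list[str] = []
    # Split the prompt into quoted and unquoted segments.
    for segment in re.split(r'("[^"]*")', prompt):
        if segment.startswith('"') and segment.endswith('"') and len(segment) >= 2:
            # Character-level tokens for the text to be rendered.
            tokens.extend(list(segment.strip('"')))
        else:
            # Coarse word-level tokens for the descriptive part of the prompt.
            tokens.extend(segment.split())
    return tokens

print(hybrid_tokenize('a chalkboard sign that says "OPEN TODAY"'))
# ['a', 'chalkboard', 'sign', 'that', 'says', 'O', 'P', 'E', 'N', ' ', 'T', ...]
```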


A comparison chart shows the text rendering capabilities of four AI models across three scenarios: Chinese graffiti on a wall, a detailed menu board in front of a café, and an English-language worksheet for children. LongCat-Image's results stand out for the legibility and accuracy of the generated lettering.
Side-by-side tests show how the models handle text in complex scenes like graffiti on brick walls and multilingual menu boards. | Image: Meituan

Separate editing model keeps image quality intact

Rather than cramming everything into one model, the team built a standalone tool called LongCat-Image-Edit. They found that the synthetic data needed for editing training actually degraded the main model's photorealistic output.

A collection of various image editing scenarios shows, among other things, Zootopia characters, perspective changes in interiors, object recognition for horses on the beach, and the replacement of a rabbit with a dog in a Christmas picture.
The dedicated editing model tackles complex tasks like style transfers, adding objects with correct perspective, and swapping out entire subjects. | Image: Meituan

The editing model starts from a "mid-training" checkpoint - a point where the system is still flexible enough to pick up new skills. By training it on editing tasks alongside generation, the model learns to follow instructions without forgetting what real images look like.
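A schematic of that joint recipe, with editing and plain generation batches interleaved in a single loop after loading the mid-training checkpoint, could look like this. The loader, loss, and model interfaces are hypothetical placeholders rather than Meituan's released training code.

```python
# Schematic joint training on editing and generation batches so the model
# learns instruction following without forgetting real-image statistics.
import random

def train_edit_model(model, optimizer, edit_batches, gen_batches,
                     steps: int = 1000, edit_ratio: float = 0.5):
    for step in range(steps):
        # Interleave editing and plain text-to-image batches.
        if random.random() < edit_ratio:
            batch = next(edit_batches)     # (source image, instruction, target)
        else:
            batch = next(gen_batches)      # (prompt, target image)
        loss = model.training_loss(batch)  # assumed diffusion-loss interface
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```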

A comparison table shows three editing tasks, including adding a creature, extracting a cat, and a robot holding a device. The results from LongCat-Image are compared with those from Seedream, Nano Banana, Flux.1, and Qwen.
In object-based editing comparisons, LongCat-Image-Edit shows strong consistency when adding, extracting, or modifying image elements. | Image: Meituan

Meituan has posted the weights for both models on GitHub and Hugging Face, along with mid-training checkpoints and the complete training pipeline code.


Source: arXiv