Date: 2025-12-14
AI in practice
Jonathan Kemper

LongCat-Image proves 6B parameters can beat bigger models with better data hygiene

LongCat-Image prompted by THE DECODER
Jonathan writes for THE DECODER about how AI tools can improve both work and creative projects.

Chinese tech company Meituan has released LongCat-Image, a new open-source image model that challenges the industry's "bigger is better" mindset. With just 6 billion parameters, the model reportedly beats significantly larger competitors in both photorealism and text rendering, thanks to strict data curation and a clever approach to handling text.

While rivals like Tencent and Alibaba keep building bigger models (HunyuanImage-3.0 packs up to 80 billion parameters), Meituan went in the opposite direction. The team says brute-force scaling wastes hardware without actually making images look better. LongCat-Image instead uses an architecture similar to the popular Flux.1-dev, built on a hybrid Multimodal Diffusion Transformer (MM-DiT).

A collage of 15 images generated by LongCat-Image shows portraits, animals such as a mouse under a mushroom, landscapes, food, and various examples of correct text rendering on chalkboards, posters, and signs.
LongCat-Image handles photorealistic portraits and complex lighting as effectively as it renders text on signs and posters. | Image: Meituan

The system processes image and text data through two separate "attention paths" in the early layers before merging them later. This gives the text prompt tighter control over image generation without driving up the computational load.
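The split-then-merge layout can be sketched as a toy, assuming the Flux-style arrangement in which early blocks keep separate weights per modality while attention runs over the combined token sequence. The function names, layer counts, and scalar "weights" below are illustrative assumptions, not Meituan's implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scale(vec, w):
    # Stand-in for a learned projection: one scalar weight per stream.
    return [w * x for x in vec]

def joint_attention(tokens):
    """Dot-product self-attention over one combined token sequence."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(dim)])
    return out

def dual_stream_block(text_tokens, image_tokens, text_w, image_w):
    """Early block: separate weights per modality, attention over both."""
    projected = [scale(t, text_w) for t in text_tokens]
    projected += [scale(t, image_w) for t in image_tokens]
    mixed = joint_attention(projected)
    n_text = len(text_tokens)
    return mixed[:n_text], mixed[n_text:]

def single_stream_block(tokens, shared_w):
    """Later block: one shared weight for the merged sequence."""
    return joint_attention([scale(t, shared_w) for t in tokens])

def forward(text_tokens, image_tokens, n_dual=2, n_single=2):
    # Hypothetical layer counts and weights, chosen only for illustration.
    for _ in range(n_dual):
        text_tokens, image_tokens = dual_stream_block(text_tokens, image_tokens, 0.5, 1.5)
    tokens = text_tokens + image_tokens
    for _ in range(n_single):
        tokens = single_stream_block(tokens, 1.0)
    return tokens[len(text_tokens):]  # keep only the image tokens

image_out = forward([[1.0, 0.0]], [[0.0, 1.0], [1.0, 1.0]])
```

In this toy, the text and image tokens influence each other from the first block on, but only the later blocks treat them with identical weights.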

Cleaning up training data fixes the "plastic" look

One of the biggest problems with current image AI, according to the researchers, is contaminated training data. When models learn from images that other AIs generated, they pick up a "plastic" or "greasy" texture. The model learns shortcuts instead of real-world complexity.

The team's fix was simple but aggressive: they scrubbed all AI-generated content from their dataset during pre-training and mid-training. Alibaba took a similar approach with Qwen-Image. Only during the final fine-tuning stage did they allow hand-picked, high-quality synthetic data back in.
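As a rough illustration of that stage-dependent rule, the following sketch admits synthetic images only in the final stage and only when they were hand-picked. The `Sample` fields, stage names, and file names are hypothetical; how Meituan actually flags AI-generated content is not described here.

```python
from dataclasses import dataclass

STAGES = ("pre_training", "mid_training", "fine_tuning")

@dataclass
class Sample:
    path: str
    is_ai_generated: bool      # assumed to come from an upstream detector
    hand_picked: bool = False  # assumed manual quality flag

def admit(sample: Sample, stage: str) -> bool:
    """Return True if the sample may be used in the given training stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    if not sample.is_ai_generated:
        return True
    # Synthetic data is only allowed back in during fine-tuning, if hand-picked.
    return stage == "fine_tuning" and sample.hand_picked

def build_stage_dataset(samples, stage):
    return [s for s in samples if admit(s, stage)]

samples = [
    Sample("photo.jpg", is_ai_generated=False),
    Sample("render.png", is_ai_generated=True),
    Sample("curated.png", is_ai_generated=True, hand_picked=True),
]
pre = build_stage_dataset(samples, "pre_training")
fine = build_stage_dataset(samples, "fine_tuning")
```

Here `pre` contains only the real photo, while `fine` also lets the hand-picked synthetic image through.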

Diagram of the data curation pipeline with four quadrants. It shows the filtering of watermarks and AI content, the extraction of meta information such as OCR and aesthetic scores, multi-granular captioning, and the layering of the data pyramid for the different training phases.
The four-stage data prep pipeline filters out synthetic content and uses vision language models to create detailed image descriptions. | Image: Meituan

The developers also came up with a new reinforcement learning trick: a detection model that penalizes the generator whenever it spots AI artifacts. This pushes the model to create textures realistic enough to fool the detector.
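One simple way to picture such a penalty is reward shaping: the detector's confidence that an image looks AI-generated is subtracted from whatever quality reward the generator would otherwise receive. This is a minimal sketch under that assumption; the function name, the linear form, and `penalty_weight` are hypothetical and not taken from Meituan's training code.

```python
def shaped_reward(base_reward: float, p_ai_artifact: float, penalty_weight: float = 1.0) -> float:
    """Lower the reward in proportion to the detector's artifact probability."""
    if not 0.0 <= p_ai_artifact <= 1.0:
        raise ValueError("p_ai_artifact must be a probability in [0, 1]")
    return base_reward - penalty_weight * p_ai_artifact

# An image the detector flags strongly earns less than one it barely flags.
flagged = shaped_reward(base_reward=1.0, p_ai_artifact=0.75, penalty_weight=2.0)
clean = shaped_reward(base_reward=1.0, p_ai_artifact=0.25, penalty_weight=2.0)
```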

In the benchmarks Meituan reports, the 6B model regularly outscores much larger models like Qwen-Image-20B and HunyuanImage-3.0. And because it's so efficient, it runs on far less VRAM, which is good news for anyone wanting to run it locally.

Nine bar charts compare LongCat-Image with models such as Seedream 4.0, Qwen-Image, and HunyuanImage-3.0. The categories cover text-to-image, text rendering, and image editing, with LongCat achieving leading scores in areas such as ChineseWord and CVTG-2K.
In benchmark tests, LongCat-Image (green) holds its own against larger models and often beats them in text rendering and image editing. | Image: Meituan

Letter-by-letter processing nails text in images

One of the model's best tricks is how it handles text inside images. Most models mess up spelling because they treat words as abstract tokens rather than individual letters. LongCat-Image takes a hybrid approach. It uses Qwen2.5-VL-7B to understand the overall prompt, but when it sees text in quotation marks, it switches to a character-level tokenizer. Instead of memorizing visual patterns for every possible word, the model builds text letter by letter.
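The quotation-mark switch can be illustrated with a small sketch. It assumes straight double quotes mark the text to render and uses a whitespace split as a stand-in for the prompt encoder's regular tokenizer; `hybrid_tokenize` is a hypothetical name, not part of the released code.

```python
import re

# Matches text enclosed in straight double quotes.
QUOTED = re.compile(r'"([^"]*)"')

def hybrid_tokenize(prompt: str) -> list[str]:
    """Split quoted spans into characters and everything else into words."""
    tokens = []
    pos = 0
    for match in QUOTED.finditer(prompt):
        # Text before the quote: word-level stand-in for the normal tokenizer.
        tokens.extend(prompt[pos:match.start()].split())
        # Text inside the quotes: one token per character, spaces included.
        tokens.extend(list(match.group(1)))
        pos = match.end()
    tokens.extend(prompt[pos:].split())
    return tokens

tokens = hybrid_tokenize('A sign that says "OPEN" outside')
# tokens == ['A', 'sign', 'that', 'says', 'O', 'P', 'E', 'N', 'outside']
```

Because the quoted word arrives as individual characters, the model never has to know "OPEN" as a single vocabulary item in order to spell it.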

A comparison graphic contrasts the text-rendering capabilities of four AI models across three scenarios. It shows Chinese graffiti on a wall, a detailed menu board in front of a café, and an English-language worksheet for children. LongCat-Image's results stand out for the particularly high legibility and accuracy of the generated lettering.
Side-by-side tests show how the models handle text in complex scenes like graffiti on brick walls and multilingual menu boards. | Image: Meituan

Separate editing model keeps image quality intact

Rather than cramming everything into one model, the team built a standalone tool called LongCat-Image-Edit. They found that the synthetic data needed for editing training actually degraded the main model's photorealistic output.

A collection of different image-editing scenarios shows, among other things, Zootopia characters, perspective changes in interiors, object detection on horses at the beach, and the replacement of a rabbit with a dog in a Christmas picture.
The dedicated editing model tackles complex tasks like style transfers, adding objects with correct perspective, and swapping out entire subjects. | Image: Meituan

The editing model starts from a "mid-training" checkpoint, a point where the system is still flexible enough to pick up new skills. Because it is trained on editing tasks alongside generation, the model learns to follow instructions without forgetting what real images look like.
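Training on both tasks side by side amounts to interleaving two kinds of batches. The sketch below shows one deterministic way to do that; the function name and the one-in-four ratio are hypothetical, since the article does not state the actual mix.

```python
def task_schedule(num_steps: int, generation_every: int = 4) -> list[str]:
    """Return the task for each training step.

    Every `generation_every`-th step is a plain generation batch; all other
    steps are editing batches. The ratio is an illustrative assumption.
    """
    if generation_every < 1:
        raise ValueError("generation_every must be at least 1")
    return [
        "generation" if step % generation_every == 0 else "editing"
        for step in range(num_steps)
    ]

schedule = task_schedule(8)
# schedule == ['generation', 'editing', 'editing', 'editing',
#              'generation', 'editing', 'editing', 'editing']
```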

A comparison table demonstrates three editing tasks, including adding a creature, extracting a cat, and a robot holding a device. LongCat-Image's results are compared with those of Seedream, Nano Banana, Flux.1, and Qwen.
In object-based editing comparisons, LongCat-Image-Edit shows strong consistency when adding, extracting, or modifying image elements. | Image: Meituan

Meituan has posted the weights for both models on GitHub and Hugging Face, along with mid-training checkpoints and the complete training pipeline code.

Summary
  • Meituan has introduced LongCat-Image, a compact open-source image model with six billion parameters that reportedly surpasses larger models in both text rendering and photorealism.
  • This performance is attributed to rigorous filtering of AI-generated images during training, a reinforcement learning detector that penalizes AI artifacts, and a specialized text encoding method that processes quoted text letter by letter.
  • In addition to the image model, a separate model for image editing and the full training code have been released to the public.
Sources
Arxiv
