
Alibaba has introduced Qwen-Image, a 20-billion-parameter AI model designed for high-fidelity text rendering inside images.


According to the developers, Qwen-Image can handle a wide range of visual styles, from anime scenes with multiple storefront signs to intricate PowerPoint slides packed with structured content. The model also supports bilingual text and can switch smoothly between languages.

Animation: ancient Chinese market street with Alibaba Cloud shops for cloud storage, computing, models, and AI platform.
Qwen-Image generates text in a variety of styles and settings, ranging from street scenes to presentation slides. | Image: Qwen
Alibaba PPT slide with logo, title “通义千问视觉基础模型” (Tongyi Qianwen Visual Foundation Model), bright blue high-tech background, four symbolic plant motifs, and availability August 2025.
Instead of using code to automate PowerPoint, Alibaba demonstrates slide generation directly with Qwen-Image. | Image: Qwen

Beyond image generation, Qwen-Image brings a suite of editing tools. Users can change visual styles, add or remove objects, and adjust the poses of people within images. The model also covers classic computer vision tasks like estimating image depth or generating new perspectives.

Collage of 24 scenes: Pikachu variants, garage scenes, traditional robes, Qwen logos, portraits, comics, and capybara photography.
Qwen-Image makes subtle edits to input images while preserving the original content. | Image: Qwen

According to the technical report, the model's architecture is built from three parts: Qwen2.5-VL handles text-image understanding, a Variational AutoEncoder compresses images for efficiency, and a Multimodal Diffusion Transformer produces the final outputs.
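The data flow between these three components can be sketched with stand-in toy functions; this is a conceptual illustration of the pipeline described in the report, not the actual model code:

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(prompt: str, dim: int = 16) -> np.ndarray:
    """Stand-in for Qwen2.5-VL: map a prompt to a conditioning vector."""
    # Hash-seeded toy embedding; the real model produces rich multimodal features.
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

class ToyVAE:
    """Stand-in for the VAE: compress 8x8 'images' to a 4-dim latent."""
    def encode(self, image: np.ndarray) -> np.ndarray:
        return image.reshape(4, -1).mean(axis=1)

    def decode(self, latent: np.ndarray) -> np.ndarray:
        return np.repeat(latent, 16).reshape(8, 8)

def diffusion_transformer(latent: np.ndarray, cond: np.ndarray, steps: int = 4) -> np.ndarray:
    """Stand-in for the MMDiT: iteratively 'denoise' the latent toward the condition."""
    target = cond[: latent.size]
    for _ in range(steps):
        latent = latent + 0.5 * (target - latent)  # move latent toward conditioning
    return latent

# Generation: prompt -> conditioning -> denoised latent -> decoded image.
cond = text_encoder("a storefront sign reading 'Qwen'")
latent = rng.standard_normal(4)            # start from noise
latent = diffusion_transformer(latent, cond)
image = ToyVAE().decode(latent)
print(image.shape)  # (8, 8)
```

The key structural point is the division of labor: the language-vision model only conditions, the VAE only compresses and reconstructs, and the diffusion transformer does all iterative generation in latent space.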


A new positional-encoding scheme called MSRoPE (Multimodal Scalable RoPE) improves how the model places text within images. Unlike approaches that treat text as a flat one-dimensional sequence, MSRoPE arranges text tokens along a diagonal within the image's 2D coordinate grid. This lets the model position text accurately across different image resolutions and improves alignment between text and image content.

Comparison of joint position encodings: Naïve, column-wise, and MSRoPE with central diagonal grid for better alignment.
Unlike previous methods that put text on a grid, MSRoPE starts at the center and arranges text diagonally for more scalable, accurate text-image alignment. | Image: Qwen
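The core idea of the diagonal arrangement can be sketched in a few lines; this is a simplified illustration (the real MSRoPE also centers coordinates on the image and applies rotary embeddings per axis):

```python
def msrope_positions(grid_h: int, grid_w: int, num_text_tokens: int):
    """Assign 2D (row, col) positions to image patches and text tokens.

    Image patches fill the grid; text tokens continue along the diagonal
    beyond it, so each text token gets the same coordinate on both axes.
    Toy sketch of the diagonal-placement idea, not the published formula.
    """
    image_pos = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    start = max(grid_h, grid_w)  # diagonal continues past the larger grid edge
    text_pos = [(start + i, start + i) for i in range(num_text_tokens)]
    return image_pos, text_pos

img, txt = msrope_positions(2, 3, 2)
print(txt)  # [(3, 3), (4, 4)]
```

Because text positions scale with the grid rather than being pinned to fixed rows or columns, the same encoding generalizes cleanly as image resolution changes.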

Training data excludes AI-generated content

The Qwen team says the model's training data falls into four main categories: nature images (55 percent), design content like posters and slides (27 percent), people (13 percent), and synthetic data (5 percent). The training pipeline specifically avoids AI-generated images, focusing on text created through controlled processes.
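Reproducing the reported category mix when sampling training batches is straightforward weighted sampling; a minimal sketch using the percentages above:

```python
import random

random.seed(0)

# Training-data mix reported for Qwen-Image.
mix = {"nature": 0.55, "design": 0.27, "people": 0.13, "synthetic": 0.05}

def sample_categories(n: int) -> list[str]:
    """Draw n examples' categories according to the reported proportions."""
    cats, weights = zip(*mix.items())
    return random.choices(cats, weights=weights, k=n)

batch = sample_categories(10_000)
print(batch.count("nature") / len(batch))  # roughly 0.55
```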

Histograms of the image quality filters Luma, Saturation, RGB Entropy, and Sharpness with sample images for extreme values.
Outlier images with extreme brightness, saturation, or blur are flagged for extra review. | Image: Qwen

A multi-stage filtering process removes low-quality content. Three strategies round out the training data: Pure Rendering (simple text on backgrounds), Compositional Rendering (text in realistic scenes), and Complex Rendering (structured layouts like slides).
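The kind of per-image statistics shown in the filter histograms can be computed with a few lines of NumPy. The formulas and thresholds below are assumptions for illustration, not the team's published pipeline:

```python
import numpy as np

def quality_stats(img: np.ndarray) -> dict[str, float]:
    """Compute four filter statistics on an RGB image with values in [0, 1].

    Assumed formulas: luma as the Rec. 601 weighted mean, saturation as the
    mean channel spread, entropy of the 8-bit histogram, and sharpness as
    the variance of a Laplacian approximation.
    """
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    gray = 0.299 * r + 0.587 * g + 0.114 * b
    luma = float(gray.mean())
    saturation = float((img.max(axis=-1) - img.min(axis=-1)).mean())
    hist, _ = np.histogram((img * 255).astype(np.uint8), bins=256, range=(0, 256))
    p = hist / hist.sum()
    entropy = float(-(p[p > 0] * np.log2(p[p > 0])).sum())
    lap = (np.roll(gray, 1, 0) + np.roll(gray, -1, 0)
           + np.roll(gray, 1, 1) + np.roll(gray, -1, 1) - 4 * gray)
    sharpness = float(lap.var())
    return {"luma": luma, "saturation": saturation,
            "entropy": entropy, "sharpness": sharpness}

def is_outlier(stats: dict[str, float]) -> bool:
    """Flag images at the extremes for extra review (thresholds are made up)."""
    return not (0.05 < stats["luma"] < 0.95) or stats["sharpness"] < 1e-6

flat = np.full((32, 32, 3), 0.99)       # nearly white, zero sharpness
print(is_outlier(quality_stats(flat)))  # True
```

A uniform near-white image is flagged on both counts: extreme brightness and no edge detail, matching the kind of outliers shown in the histogram figure.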

Three examples: Text on a single-color background, handwritten in landscape orientation, complex multi-column layout.
Pure, compositional, and complex rendering strategies diversify the training set, from simple text to handwritten scenes and detailed layouts. | Image: Qwen

Beating commercial models in key areas

For evaluation, the team built an arena platform where users anonymously rated images from different models. After more than 10,000 pairwise comparisons, Qwen-Image ranked third overall, outperforming commercial models like GPT-Image-1 and FLUX.1 Kontext.
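Arena leaderboards like this are typically derived from pairwise votes with an Elo-style rating update; the article doesn't specify the platform's exact scheme, so this is a generic sketch of how pairwise votes turn into a ranking:

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Apply one Elo rating update after an anonymous pairwise vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))  # win probability for A
    score_a = 1.0 if a_wins else 0.0
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b

# Two models start at 1000; model A wins one comparison.
a, b = elo_update(1000.0, 1000.0, a_wins=True)
print(round(a), round(b))  # 1016 984
```

Over thousands of votes, ratings converge toward a stable ordering, which is how a single leaderboard position like "third place" emerges from raw pairwise preferences.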

Double radar chart: Qwen Image is ahead of the competition in image generation, image processing, and Chinese and English text rendering.
In head-to-head tests with Seedream 3.0, GPT-Image-1, Flux.1, and Bagel, Qwen-Image led in image generation and editing. The model also topped the field in Chinese text rendering and matched competitors in English. | Image: Qwen

Benchmark results back up these findings. In the GenEval test for object generation, Qwen-Image scored 0.91 after additional training, ahead of all other models. The model also holds a clear edge in rendering Chinese characters.


The researchers see Qwen-Image as a step toward "vision-language user interfaces" that tightly integrate text and image. Looking ahead, Alibaba is working on unified platforms for both image understanding and generation. The company recently unveiled Qwen VLo, another model known for its strong text capabilities.

Qwen-Image is available for free on GitHub and Hugging Face, with a live demo for testing.

Summary
  • Alibaba has introduced Qwen-Image, an image model with 20 billion parameters that excels at accurately rendering text in images and supports a wide range of image processing and traditional computer vision tasks.
  • The model was trained mainly on real images, avoiding AI-generated content, and uses a multi-level filtering system along with three rendering strategies to ensure high-quality, diverse datasets.
  • In both user tests and expert benchmarks, Qwen-Image outperforms many commercial competitors, particularly in handling Chinese characters. The model is available for free on GitHub and Hugging Face.
Jonathan writes for THE DECODER about how AI tools can improve both work and creative projects.