
Alibaba has introduced Qwen-Image, a 20-billion-parameter AI model designed for high-fidelity text rendering inside images.


According to the developers, Qwen-Image can handle a wide range of visual styles, from anime scenes with multiple storefront signs to intricate PowerPoint slides packed with structured content. The model also supports bilingual text and can switch smoothly between languages.

Animation: ancient Chinese market street with Alibaba Cloud shops for cloud storage, computing, models, and AI platform.
Qwen-Image generates text in a variety of styles and settings, ranging from street scenes to presentation slides. | Image: Qwen
Alibaba PPT slide with logo, title “通义千问视觉基础模型” (Tongyi Qianwen Visual Foundation Model), bright blue high-tech background, four symbolic plant motifs, and availability August 2025.
Instead of using code to automate PowerPoint, Alibaba demonstrates slide generation directly with Qwen-Image. | Image: Qwen

Beyond image generation, Qwen-Image brings a suite of editing tools. Users can change visual styles, add or remove objects, and adjust the poses of people within images. The model also covers classic computer vision tasks like estimating image depth or generating new perspectives.

Collage of 24 scenes: Pikachu variants, garage scenes, traditional robes, Qwen logos, portraits, comics, and capybara photography.
Qwen-Image makes subtle edits to input images while preserving the original content. | Image: Qwen

According to the technical report, the model's architecture is built from three parts: Qwen2.5-VL handles text-image understanding, a Variational AutoEncoder compresses images for efficiency, and a Multimodal Diffusion Transformer produces the final outputs.
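The data flow between these three components can be sketched with stand-in toy functions; this is a conceptual illustration of the pipeline described in the report, not the actual model code:

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(prompt: str, dim: int = 16) -> np.ndarray:
    """Stand-in for Qwen2.5-VL: map a prompt to a conditioning vector."""
    # Hash-seeded toy embedding; the real model produces rich multimodal features.
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

class ToyVAE:
    """Stand-in for the VAE: compress 8x8 'images' to a 4-dim latent."""
    def encode(self, image: np.ndarray) -> np.ndarray:
        return image.reshape(4, -1).mean(axis=1)

    def decode(self, latent: np.ndarray) -> np.ndarray:
        return np.repeat(latent, 16).reshape(8, 8)

def diffusion_transformer(latent: np.ndarray, cond: np.ndarray, steps: int = 4) -> np.ndarray:
    """Stand-in for the MMDiT: iteratively 'denoise' the latent toward the condition."""
    target = cond[: latent.size]
    for _ in range(steps):
        latent = latent + 0.5 * (target - latent)  # move latent toward conditioning
    return latent

# Generation: prompt -> conditioning -> denoised latent -> decoded image.
cond = text_encoder("a storefront sign reading 'Qwen'")
latent = rng.standard_normal(4)            # start from noise
latent = diffusion_transformer(latent, cond)
image = ToyVAE().decode(latent)
print(image.shape)  # (8, 8)
```

The key structural point is the division of labor: the language-vision model only conditions, the VAE only compresses and reconstructs, and the diffusion transformer does all iterative generation in latent space.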


A new positional-encoding scheme called MSRoPE (Multimodal Scalable RoPE) improves how the model places text within images. Unlike approaches that treat text as a flat one-dimensional sequence, MSRoPE arranges text tokens along a diagonal within the image's 2D coordinate grid. This lets the model position text accurately across different image resolutions and improves alignment between text and image content.

Comparison of joint position encodings: Naïve, column-wise, and MSRoPE with central diagonal grid for better alignment.
Unlike previous methods that put text on a grid, MSRoPE starts at the center and arranges text diagonally for more scalable, accurate text-image alignment. | Image: Qwen
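The core idea of the diagonal arrangement can be sketched in a few lines; this is a simplified illustration (the real MSRoPE also centers coordinates on the image and applies rotary embeddings per axis):

```python
def msrope_positions(grid_h: int, grid_w: int, num_text_tokens: int):
    """Assign 2D (row, col) positions to image patches and text tokens.

    Image patches fill the grid; text tokens continue along the diagonal
    beyond it, so each text token gets the same coordinate on both axes.
    Toy sketch of the diagonal-placement idea, not the published formula.
    """
    image_pos = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    start = max(grid_h, grid_w)  # diagonal continues past the larger grid edge
    text_pos = [(start + i, start + i) for i in range(num_text_tokens)]
    return image_pos, text_pos

img, txt = msrope_positions(2, 3, 2)
print(txt)  # [(3, 3), (4, 4)]
```

Because text positions scale with the grid rather than being pinned to fixed rows or columns, the same encoding generalizes cleanly as image resolution changes.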

Training data excludes AI-generated content

The Qwen team says the model's training data falls into four main categories: nature images (55 percent), design content like posters and slides (27 percent), people (13 percent), and synthetic data (5 percent). The training pipeline specifically avoids AI-generated images, focusing on text created through controlled processes.
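Reproducing the reported category mix when sampling training batches is straightforward weighted sampling; a minimal sketch using the percentages above:

```python
import random

random.seed(0)

# Training-data mix reported for Qwen-Image.
mix = {"nature": 0.55, "design": 0.27, "people": 0.13, "synthetic": 0.05}

def sample_categories(n: int) -> list[str]:
    """Draw n examples' categories according to the reported proportions."""
    cats, weights = zip(*mix.items())
    return random.choices(cats, weights=weights, k=n)

batch = sample_categories(10_000)
print(batch.count("nature") / len(batch))  # roughly 0.55
```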

Histograms of the image quality filters Luma, Saturation, RGB Entropy, and Sharpness with sample images for extreme values.
Outlier images with extreme brightness, saturation, or blur are flagged for extra review. | Image: Qwen

A multi-stage filtering process removes low-quality content. Three strategies round out the training data: Pure Rendering (simple text on backgrounds), Compositional Rendering (text in realistic scenes), and Complex Rendering (structured layouts like slides).
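The kind of per-image statistics shown in the filter histograms can be computed with a few lines of NumPy. The formulas and thresholds below are assumptions for illustration, not the team's published pipeline:

```python
import numpy as np

def quality_stats(img: np.ndarray) -> dict[str, float]:
    """Compute four filter statistics on an RGB image with values in [0, 1].

    Assumed formulas: luma as the Rec. 601 weighted mean, saturation as the
    mean channel spread, entropy of the 8-bit histogram, and sharpness as
    the variance of a Laplacian approximation.
    """
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    gray = 0.299 * r + 0.587 * g + 0.114 * b
    luma = float(gray.mean())
    saturation = float((img.max(axis=-1) - img.min(axis=-1)).mean())
    hist, _ = np.histogram((img * 255).astype(np.uint8), bins=256, range=(0, 256))
    p = hist / hist.sum()
    entropy = float(-(p[p > 0] * np.log2(p[p > 0])).sum())
    lap = (np.roll(gray, 1, 0) + np.roll(gray, -1, 0)
           + np.roll(gray, 1, 1) + np.roll(gray, -1, 1) - 4 * gray)
    sharpness = float(lap.var())
    return {"luma": luma, "saturation": saturation,
            "entropy": entropy, "sharpness": sharpness}

def is_outlier(stats: dict[str, float]) -> bool:
    """Flag images at the extremes for extra review (thresholds are made up)."""
    return not (0.05 < stats["luma"] < 0.95) or stats["sharpness"] < 1e-6

flat = np.full((32, 32, 3), 0.99)       # nearly white, zero sharpness
print(is_outlier(quality_stats(flat)))  # True
```

A uniform near-white image is flagged on both counts: extreme brightness and no edge detail, matching the kind of outliers shown in the histogram figure.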

Three examples: Text on a single-color background, handwritten in landscape orientation, complex multi-column layout.
Pure, compositional, and complex rendering strategies diversify the training set, from simple text to handwritten scenes and detailed layouts. | Image: Qwen

Beating commercial models in key areas

For evaluation, the team built an arena platform where users anonymously rated images from different models. After more than 10,000 pairwise comparisons, Qwen-Image ranked third overall, outperforming commercial models like GPT-Image-1 and FLUX.1 Kontext.
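Arena leaderboards like this are typically derived from pairwise votes with an Elo-style rating update; the article doesn't specify the platform's exact scheme, so this is a generic sketch of how pairwise votes turn into a ranking:

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Apply one Elo rating update after an anonymous pairwise vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))  # win probability for A
    score_a = 1.0 if a_wins else 0.0
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b

# Two models start at 1000; model A wins one comparison.
a, b = elo_update(1000.0, 1000.0, a_wins=True)
print(round(a), round(b))  # 1016 984
```

Over thousands of votes, ratings converge toward a stable ordering, which is how a single leaderboard position like "third place" emerges from raw pairwise preferences.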

Double radar chart: Qwen Image is ahead of the competition in image generation, image processing, and Chinese and English text rendering.
In head-to-head tests with Seedream 3.0, GPT-Image-1, Flux.1, and Bagel, Qwen-Image led in image generation and editing. The model also topped the field in Chinese text rendering and matched competitors in English. | Image: Qwen

Benchmark results back up these findings. In the GenEval test for object generation, Qwen-Image scored 0.91 after additional training, ahead of all other models. The model also holds a clear edge in rendering Chinese characters.


The researchers see Qwen-Image as a step toward "vision-language user interfaces" that tightly integrate text and image. Looking ahead, Alibaba is working on unified platforms for both image understanding and generation. The company recently unveiled Qwen VLo, another model known for its strong text capabilities.

Qwen-Image is available for free on GitHub and Hugging Face, with a live demo for testing.

Summary
  • Alibaba has introduced Qwen-Image, an image model with 20 billion parameters that excels at accurately rendering text in images and supports a wide range of image processing and traditional computer vision tasks.
  • The model was trained mainly on real images, avoiding AI-generated content, and uses a multi-level filtering system along with three rendering strategies to ensure high-quality, diverse datasets.
  • In both user tests and expert benchmarks, Qwen-Image outperforms many commercial competitors, particularly in handling Chinese characters. The model is available for free on GitHub and Hugging Face.
Jonathan writes for THE DECODER about how AI tools can improve both work and creative projects.