Chinese tech giant Alibaba has introduced Qwen VLo, a multimodal AI model designed to analyze, generate, and edit images.
According to Alibaba, Qwen VLo uses a progressive generation approach, building images step by step from left to right and top to bottom while continuously refining its output. This method is intended to give more control over the result, particularly when an image contains longer passages of text. The company has not disclosed technical details, but Qwen VLo likely relies on an autoregressive method similar to the one GPT-4o uses, rather than a diffusion-based approach.
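The difference between the two styles can be illustrated with a toy sketch. The snippet below is purely illustrative and not Alibaba's implementation: the token grid, `predict_next_token`, and both generator functions are hypothetical stand-ins that contrast token-by-token, raster-order autoregressive decoding with diffusion-style refinement of the whole canvas at once.

```python
# Illustrative only: toy contrast between autoregressive (raster-order) and
# diffusion-style image generation. Nothing here reflects Qwen VLo internals.
import random

GRID_H, GRID_W = 4, 4       # tiny "image" as a 4x4 grid of visual tokens
VOCAB = list(range(16))     # toy visual-token vocabulary

def predict_next_token(prefix):
    """Stand-in for a learned model: picks the next token given the prefix."""
    random.seed(len(prefix))  # deterministic toy behavior
    return random.choice(VOCAB)

def generate_autoregressive():
    """Builds the image token by token, left to right and top to bottom,
    so each new token conditions on everything generated so far."""
    tokens = []
    for _ in range(GRID_H * GRID_W):
        tokens.append(predict_next_token(tokens))
    return [tokens[r * GRID_W:(r + 1) * GRID_W] for r in range(GRID_H)]

def generate_diffusion_style(steps=4):
    """Contrast: diffusion-style generation starts from noise and refines the
    whole grid a little on every step, rather than committing token by token."""
    grid = [[random.choice(VOCAB) for _ in range(GRID_W)] for _ in range(GRID_H)]
    for _ in range(steps):
        grid = [[(v + 1) % len(VOCAB) for v in row] for row in grid]  # toy "denoise"
    return grid

if __name__ == "__main__":
    print("autoregressive:", generate_autoregressive())
    print("diffusion-style:", generate_diffusion_style())
```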
Image editing with natural language
Qwen VLo can interpret complex editing instructions in natural language, letting users swap backgrounds, insert new objects, change visual styles, or even blend multiple images into one.

The system supports both artistic and technical image modifications. For example, it can generate segmentation maps, perform edge detection, or create depth maps with colored overlays on demand.

Qwen VLo is designed to handle images with variable resolutions and aspect ratios, including extreme formats such as 4:1 or 1:3, although this capability is not yet enabled in the preview. The model also works in multiple languages, including Chinese and English.
Early preview with limitations
Qwen VLo is currently available in preview through Qwen Chat, Alibaba's web interface. The company notes that the model is still prone to generation errors, inconsistencies with source images, and failures to follow detailed instructions. Alibaba says it plans to keep improving the model's reliability and stability.
Until now, Alibaba has been a reliable source of competitive AI language models - for example, it released Qwen3 and its model weights in April - making the company an important contributor to open AI research. It's not clear why Qwen VLo hasn't been released with model weights or whether this signals a broader shift in Alibaba's approach to open publishing.