
Alibaba has added a multimodal visual language model to its Qwen2.5 series, marking another step in the Chinese tech company's effort to compete in the commercial AI space.


Building on the open-source Qwen2-VL model from fall 2024, the new version promises better handling of various data types, including text, images, and hour-long videos. The team says it has made particular progress with diagrams, icons, graphics, and layouts. The model comes in three sizes: 3, 7, and 72 billion parameters.

These improvements also make the model useful as a visual assistant, according to the company. In several demos, Alibaba shows Qwen2.5-VL analyzing screen content and providing instructions for tasks like booking flights, checking weather forecasts, and navigating complex applications like Gimp.

In this demo, Qwen2.5-VL helps book a flight based on a given starting point and destination. | Video: Qwen


Here, Qwen2.5-VL helps retrieve the weather forecast for Manchester, UK. | Video: Qwen

Qwen2.5-VL can also understand more complex user interfaces, such as those of Gimp. | Video: Qwen

While Qwen2.5-VL isn't a specialized agent model like OpenAI's recently introduced Computer-Using Agent (CUA), it excels at analyzing interfaces, identifying relevant buttons, and planning workflows. This capability could make it a valuable foundation for an open system along the lines of OpenAI's Operator.
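As a rough illustration of how such interface grounding is typically prompted, the snippet below sketches the kind of request involved. This is an assumption for illustration, not one of Alibaba's demos: the screenshot path, button label, and prompt wording are placeholders, and the message layout follows the Qwen chat convention.

```python
# Hypothetical UI-grounding prompt for Qwen2.5-VL. "screenshot.png" and the
# button label are placeholders; the model is asked for pixel coordinates
# that an agent layer could then click.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "screenshot.png"},
        {"type": "text", "text": (
            "Find the 'Search flights' button in this screenshot and "
            "return its bounding box as [x1, y1, x2, y2]."
        )},
    ],
}]
```

The same message structure feeds the standard inference pipeline shown in the extraction example further down.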

Qwen2.5-VL matches or beats GPT-4o and Claude 3.5 Sonnet in benchmarks

The model can identify specific objects and their components, for instance determining whether a motorcyclist is wearing a helmet. When processing documents such as invoices and forms, it can output the extracted information in structured formats like JSON for easy reuse.

Street scene with four motorcyclists; AI annotations indicate helmet use: three without, one with a helmet. The coordinates of their positions are displayed.
Qwen2.5-VL can, for example, recognize whether motorcyclists are wearing a helmet and output this data in a structured form. | Image: Qwen
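As a minimal sketch of how such structured extraction could be run with the openly released weights, the example below uses the Hugging Face stack. It assumes a recent transformers build with Qwen2.5-VL support plus the Qwen team's qwen-vl-utils helper package; the "invoice.png" input and the prompt wording are placeholder assumptions, and class names may differ in other library versions.

```python
# Minimal sketch: asking Qwen2.5-VL-3B to extract invoice fields as JSON.
# Requires: transformers (with Qwen2.5-VL support), qwen-vl-utils, torch.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One user turn: an image plus an instruction to emit structured JSON.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "invoice.png"},  # placeholder file
        {"type": "text", "text": "Extract vendor, date, and total amount "
                                 "from this invoice and answer only with JSON."},
    ],
}]

# Build the chat prompt and collect the image inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

# Generate, then decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```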

According to Alibaba, the largest version, Qwen2.5-VL-72B, performs on par with OpenAI's GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash across various benchmarks, sometimes outperforming them in areas like document comprehension and visual assistance, even without task-specific training.


The smaller versions, Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-3B, also improve on GPT-4o mini and the previous Qwen2-VL in many tasks.

Qwen plans omnimodal model like GPT-4o

Looking ahead, the Qwen team plans to enhance the models' problem-solving and reasoning capabilities while adding support for more input types. Their ultimate goal is to create an AI model that can handle any kind of input or task, including audio. A detailed paper on the model's architecture and training process is in development.

The Qwen2.5-VL models are available as open source through GitHub, Hugging Face, and ModelScope, as well as through Qwen Chat, though some commercial-use restrictions apply. Due to Chinese regulations, these models, like those from DeepSeek, avoid discussing certain topics deemed sensitive by authorities.

The company recently expanded the Qwen2.5 series to include a model supporting context windows of up to one million tokens.

Summary
  • Alibaba introduces Qwen2.5-VL, a multimodal visual language model that processes text, images, and videos, with improved handling of diagrams, icons, graphics, and layouts.
  • Qwen2.5-VL serves as a visual assistant, analyzing screen content and providing instructions for tasks like booking flights and navigating complex interfaces.
  • The largest version, Qwen2.5-VL-72B, performs on par with leading AI models across benchmarks.
Jonathan works as a freelance tech journalist for THE DECODER, focusing on AI tools and how GenAI can be used in everyday work.