
Baidu has launched ERNIE-4.5-VL-28B-A3B-Thinking, a new AI model that can process images as part of its reasoning process.


The company claims it outperforms larger commercial models like Google Gemini 2.5 Pro and OpenAI GPT-5 High on several multimodal benchmarks. A mixture-of-experts routing architecture activates only 3 billion of the model's 28 billion total parameters at a time, which lets it deliver strong results while running on a single 80 GB GPU such as the Nvidia A100.
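The routing idea behind that parameter count can be sketched in a few lines: a gating function scores every expert, only the top-k experts actually run for a given token, and their outputs are mixed using renormalized gate weights. The expert count, k, and the toy scalar "experts" below are illustrative assumptions for the sketch, not Baidu's actual configuration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def route(token, experts, gate_scores, k=2):
    """Run only the top-k experts on the token and mix their outputs,
    weighted by the renormalized gate scores. All other experts stay
    inactive, which is why only a fraction of parameters is 'active'."""
    top = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)[:k]
    weights = softmax([gate_scores[i] for i in top])
    return sum(w * experts[i](token) for w, i in zip(weights, top))

# Eight trivial scalar "experts"; only the two best-scored ones run.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
scores = [0.1, 2.0, 0.3, 1.5, 0.0, 0.2, 0.4, 0.1]
out = route(10.0, experts, scores, k=2)
```

Scaled up, the same pattern means a 28B-parameter model only pays the compute cost of the roughly 3B parameters its router selects per token.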

ERNIE-4.5-VL-28B-A3B-Thinking is released under the Apache 2.0 license, so it can be used freely for commercial projects. Its reported performance has not yet been independently verified.

Bar chart comparing ERNIE-4.5-VL-28B-A3B-Thinking, Gemini-2.5-Pro, and GPT-5 across various multimodal benchmarks.
ERNIE-4.5-VL-28B-A3B-Thinking occasionally outperforms Gemini-2.5-Pro and GPT-5-High on multimodal benchmarks, despite its smaller size. The bar chart highlights tasks where ERNIE matches or beats the larger models. | Image: Baidu

The model’s "Thinking with Images" feature lets it dynamically crop images to focus on key details. In one demo, ERNIE-4.5-VL-28B-A3B-Thinking automatically zoomed in on a blue sign and identified its text.
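The mechanics of such a zoom step can be shown with a toy example, which is not Baidu's implementation: the model proposes a bounding box for a region of interest, and the runtime crops that region so the next reasoning pass sees it at full resolution. Plain Python lists stand in for pixel data here.

```python
# A toy "thinking with images" step: the model proposes a region of
# interest, the runtime crops it, and the crop is fed back in for a
# closer look. The image is just a 2D grid of pixel intensities.

def crop(image, left, top, right, bottom):
    """Return the sub-grid covering rows [top, bottom) and columns [left, right)."""
    return [row[left:right] for row in image[top:bottom]]

# An 8x8 image with a bright 2x2 "sign" at rows 2-3, columns 5-6.
img = [[0] * 8 for _ in range(8)]
for y in (2, 3):
    for x in (5, 6):
        img[y][x] = 255

detail = crop(img, left=5, top=2, right=7, bottom=4)
print(detail)  # [[255, 255], [255, 255]]
```

In a real pipeline the bounding box would come from the model itself, and the crop would be re-encoded and appended to the reasoning context.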


Other demos showed it pinpointing people in images and returning their coordinates, solving math problems by analyzing circuit diagrams, and recommending the best times to visit a location based on charts. For video input, it can extract subtitles and match scenes to specific timestamps. The model can also call external tools, such as web-based image search, to identify unfamiliar objects.

Screenshot: ERNIE-4.5-VL-28B-A3B-Thinking zooms into a street scene to read the white text "HOTEL BUZA" on a blue sign.
ERNIE-4.5-VL-28B-A3B-Thinking uses image editing tools to zoom in before extracting text from signs. | Image: Baidu

Although Baidu highlights ERNIE-4.5-VL-28B-A3B-Thinking’s ability to crop and manipulate images as part of its reasoning process, this approach isn’t entirely new. In April 2025, OpenAI rolled out similar capabilities with its o3 and o4-mini models, which can integrate images directly into their internal chain of thought and use native tools like zoom, crop, and rotate while working through visual tasks. These features helped set new benchmarks for agent-like reasoning and problem-solving.

What stands out is how quickly these cutting-edge visual reasoning features, previously limited to proprietary Western models, are showing up in open-source Chinese models, just months after their debut.

Jonathan writes for THE DECODER about how AI tools can improve both work and creative projects.