Alibaba's Qwen team just added two new members to its Qwen2.5 family: Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M.
These open-source models support context windows of up to one million tokens, a first among publicly available models.
The models use sparse attention, which focuses computation on only the most important parts of the context. This approach processes million-token inputs three to seven times faster than standard full attention and supports outputs of up to 8,000 tokens. However, it requires the models to identify the crucial passages in context documents, a task current language models often struggle with.
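For intuition, here is a minimal block-sparse attention sketch in PyTorch. This is not Qwen's published mechanism; the block size, the top-k value, and the block-mean selection heuristic are all invented for illustration. The core idea is that each query block attends to only a few seemingly relevant key blocks instead of the full sequence.

```python
import torch

def block_sparse_attention(q, k, v, block_size=64, top_k=4):
    """Illustrative block-sparse attention over (seq_len, d) tensors.

    Each query block attends only to the top_k key blocks whose mean
    vector is most similar to its own, instead of the whole sequence.
    """
    seq_len, d = q.shape
    assert seq_len % block_size == 0, "sketch assumes an even split into blocks"
    n_blocks = seq_len // block_size
    qb = q.view(n_blocks, block_size, d)
    kb = k.view(n_blocks, block_size, d)
    vb = v.view(n_blocks, block_size, d)

    # Cheap coarse relevance: compare block-mean vectors instead of all tokens.
    block_scores = qb.mean(dim=1) @ kb.mean(dim=1).T      # (n_blocks, n_blocks)
    top_blocks = block_scores.topk(min(top_k, n_blocks), dim=-1).indices

    out = torch.empty_like(qb)
    for i in range(n_blocks):
        keys = kb[top_blocks[i]].reshape(-1, d)           # selected key blocks only
        vals = vb[top_blocks[i]].reshape(-1, d)
        attn = torch.softmax(qb[i] @ keys.T / d ** 0.5, dim=-1)
        out[i] = attn @ vals
    return out.view(seq_len, d)
```

Because each query block touches only a handful of key blocks, the cost grows roughly linearly with sequence length rather than quadratically, which is where the claimed speedups come from; the price is the risk of skipping a block that actually mattered.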
In testing, both the 14B model and Qwen2.5-Turbo achieved perfect accuracy when finding hidden numbers in very long documents. The smaller 7B model also performed well, with only minor errors. However, these benchmarks mainly test information retrieval - similar to a costly Ctrl+F - rather than deeper understanding of content.
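A simplified version of such a needle-in-a-haystack test is easy to sketch. Here the filler text stands in for a real corpus, and `ask_model` is a hypothetical call to any long-context model:

```python
import random

def make_needle_haystack(n_chars=500_000, needle="The secret number is 48613."):
    # Build a long filler document and hide one fact at a random position.
    unit = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    filler = unit * (n_chars // len(unit))
    pos = random.randrange(len(filler))
    return filler[:pos] + " " + needle + " " + filler[pos:]

prompt = make_needle_haystack() + "\n\nWhat is the secret number?"
# answer = ask_model(prompt)   # hypothetical long-context model call
# assert "48613" in answer     # perfect accuracy = passing every placement
```

Passing this test shows the model can locate a verbatim fact anywhere in the window; it says little about whether the model can reason over the document as a whole.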
The advantages of large context windows over RAG systems are not clear-cut. Long context windows are easier to use and more flexible, but RAG architectures, which pull information from external databases during inference, often work more precisely and effectively with much smaller context windows of about 128,000 tokens.
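To make the contrast concrete, here is a toy sketch of the RAG pattern. The hash-based `embed` function is a deliberately naive stand-in for a real embedding model, and the final generation call is left hypothetical:

```python
import numpy as np

def embed(text, dim=64):
    # Toy stand-in for a real embedding model: hash words into a vector.
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def retrieve(query, chunks, k=2):
    # Score every chunk against the query and keep only the top k.
    q = embed(query)
    sims = [float(embed(c) @ q) for c in chunks]
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

chunks = [
    "The invoice total was $4,200.",
    "Shipping is handled by sea freight.",
    "Payment is due within 30 days of the invoice date.",
]
context = "\n".join(retrieve("When is payment due?", chunks))
prompt = f"Context:\n{context}\n\nQuestion: When is payment due?"
# response = generate(prompt)  # hypothetical call to a model with an ordinary context window
```

Only the retrieved chunks reach the model, so a 128K window is plenty; the trade-off is that the system is only as good as its retriever.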
Complex context testing shows promising results
In more demanding tests such as RULER, LV-Eval, and LongBench-Chat, the million-token models outperformed their 128K counterparts, especially on sequences longer than 64K tokens. The 14B model even scored above 90 points on RULER, a first for the Qwen series, and consistently beat GPT-4o mini across multiple datasets.
For shorter texts, the million-token models matched the performance of their 128K counterparts, showing no trade-offs in handling brief content.
Users can try these and other Alibaba models through Qwen Chat, Alibaba's ChatGPT-like interface, or via a demo on Hugging Face. Along with fellow Chinese company DeepSeek's open-source models, Qwen is challenging established U.S. providers by offering similar capabilities at lower cost.
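For anyone who wants to run the models locally, both checkpoints follow the standard Hugging Face transformers workflow, as sketched below. Note that this is a minimal example: actually filling the million-token window requires far more GPU memory and specialized inference tooling than a typical single-GPU setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct-1M"  # Hugging Face repo ID following Qwen's naming
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Summarize the key risks in this contract: ..."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```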