Alibaba's AI laboratory has introduced a new version of its Qwen language model that can process up to one million tokens of text—equivalent to about ten novels. The team also managed to increase processing speed by a factor of four.
Qwen has expanded the context length of its Qwen2.5 language model, introduced in September, from 128,000 to 1 million tokens. This allows Qwen2.5-Turbo to process the equivalent of ten complete novels, 150 hours of transcripts, or 30,000 lines of code.
100 percent accuracy in number retrieval
In the passkey retrieval task, which requires finding a hidden number within 1 million tokens of irrelevant text, the model achieves 100 percent accuracy regardless of where the information sits in the document. This appears to at least partially overcome the "lost in the middle" phenomenon, in which language models weight the beginning and end of a prompt more heavily than its middle.
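To make the setup concrete, here is a minimal sketch of how such a passkey test prompt can be constructed. The filler sentences, passkey phrasing, and function name are illustrative assumptions, not the Qwen team's actual benchmark harness.

```python
import random

def make_passkey_prompt(passkey: int, n_filler: int = 50_000) -> str:
    """Hypothetical needle-in-a-haystack generator: bury a passkey
    sentence at a random sentence boundary inside irrelevant filler."""
    filler_sentence = "The grass is green. The sky is blue. The sun is bright. "
    pos = random.randrange(n_filler + 1) * len(filler_sentence)
    filler = filler_sentence * n_filler
    needle = f"The passkey is {passkey}. Remember it. "
    return filler[:pos] + needle + filler[pos:] + "\n\nWhat is the passkey mentioned above?"

prompt = make_passkey_prompt(48291)
```

A model scores 100 percent on this task when it returns the correct number no matter where the needle lands in the filler.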
In various benchmarks for long text comprehension, Qwen2.5-Turbo outperforms competing models like GPT-4 and GLM4-9B-1M. At the same time, its performance with short sequences remains comparable to GPT-4o-mini.
Sparse attention speeds up inference by 4.3x
By using sparse attention mechanisms, Qwen cut the time to first token when processing 1 million tokens from 4.9 minutes (294 seconds) to 68 seconds, a roughly 4.3x speedup.
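Qwen doesn't spell out its exact sparsity pattern here, but the general idea can be shown with one common variant: instead of letting every token attend to all previous tokens, each token attends only to a local window, which cuts attention cost from quadratic to roughly linear in sequence length. The sliding-window mask below is a generic illustration, not Qwen's actual mechanism.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal sliding-window mask: token i may attend to tokens
    [i - window + 1, i]. One generic sparse-attention pattern."""
    idx = np.arange(seq_len)
    dist = idx[None, :] - idx[:, None]     # dist[i, j] = j - i
    return (dist <= 0) & (dist > -window)  # causal and within the window

mask = sliding_window_mask(seq_len=8, window=3)
# Each row has at most `window` True entries, so attention work grows
# as O(seq_len * window) instead of O(seq_len ** 2).
print(mask.astype(int))
```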
The price remains 0.3 yuan (about 4 cents) per 1 million tokens. At the same cost, Qwen2.5-Turbo can process 3.6 times as many tokens as GPT-4o-mini, which charges roughly $0.15 per million input tokens.
Qwen2.5-Turbo is now available through the Alibaba Cloud Model Studio API and via demos on Hugging Face and ModelScope.
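Model Studio exposes an OpenAI-compatible endpoint, so a call can look roughly like the sketch below. The base URL and especially the model identifier for Qwen2.5-Turbo are assumptions and should be checked against Alibaba Cloud's current documentation.

```python
from openai import OpenAI

# Assumption: DashScope/Model Studio's OpenAI-compatible endpoint and an
# API key created in the Alibaba Cloud console.
client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# Hypothetical input file holding up to ~1M tokens of text.
with open("ten_novels.txt") as f:
    long_text = f.read()

response = client.chat.completions.create(
    model="qwen-turbo",  # assumption: the Model Studio ID serving Qwen2.5-Turbo
    messages=[{
        "role": "user",
        "content": long_text + "\n\nSummarize each novel in one sentence.",
    }],
)
print(response.choices[0].message.content)
```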
Qwen acknowledges room for improvement with long sequences
The company admits that the current model does not always perform reliably on long-sequence tasks in real-world applications. Unsolved challenges remain, including the model's less stable performance on long sequences and the high inference costs that make deploying larger models difficult.
Qwen plans to further explore human preference alignment for long sequences, optimize inference efficiency to reduce computation time, and work on bringing larger and more capable long-context models to market.
What's the use of large context windows?
The context windows of large language models have grown steadily in recent months. In practice, a standard has settled between 128,000 tokens (GPT-4o) and 200,000 tokens (Claude 3.5 Sonnet), though there are outliers such as Gemini 1.5 Pro with up to 10 million tokens or Magic AI's LTM-2-mini with 100 million.
While these advances generally contribute to the utility of large language models, studies repeatedly cast doubt on the advantage of large context windows compared to RAG systems, where additional information is dynamically retrieved from vector databases.
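The contrast is easiest to see in the retrieval step itself: instead of stuffing an entire corpus into the prompt, a RAG system embeds the documents once and pulls only the top-scoring chunks per query. The sketch below assumes the embeddings have already been computed; vector-store details vary by system.

```python
import numpy as np

def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray,
             docs: list[str], k: int = 3) -> list[str]:
    """Return the k documents whose embeddings are most cosine-similar
    to the query embedding, the core step of a RAG pipeline."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(-sims)[:k]
    return [docs[i] for i in top]

# Demo with random stand-ins for real embeddings.
rng = np.random.default_rng(0)
docs = [f"chunk {i}" for i in range(100)]
doc_vecs = rng.normal(size=(100, 384))
query_vec = rng.normal(size=384)
print(retrieve(query_vec, doc_vecs, docs))
```

Only the retrieved chunks, typically a few hundred tokens, then go into the prompt instead of the full million-token corpus.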