
Alibaba's AI laboratory has introduced a new version of its Qwen language model that can process up to one million tokens of text, roughly the length of ten novels. The team also increased processing speed by a factor of about four.


Qwen has extended the context length of its Qwen2.5 language model, introduced in September, from 128,000 to 1 million tokens. This allows Qwen2.5-Turbo to process ten complete novels, 150 hours of transcripts, or 30,000 lines of code.

100 percent accuracy in number retrieval

In the passkey retrieval task, which requires finding hidden numbers within 1 million tokens of irrelevant text, the model achieves 100 percent accuracy regardless of the information's position in the document. This appears to partially overcome the "lost in the middle" phenomenon, where language models primarily consider the beginning and end of a prompt.

Heat map: a uniformly green visualization shows 100% accuracy for Qwen2.5-Turbo passkey retrieval across all context lengths and document depths.
The heat map shows Qwen2.5-Turbo's impressive 100 percent accuracy in passkey retrieval across various context lengths and document depths. This consistent performance underscores the model's reliability in information extraction. | Image: Qwen
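A passkey retrieval test of the kind described above can be sketched roughly as follows. This is a minimal illustration, not Qwen's actual evaluation harness: the filler sentence, the prompt wording, and the function names (build_passkey_prompt, is_correct, make_cases) are all hypothetical, and the call to the model itself is left out.

```python
import random

# Hypothetical filler text; real benchmarks use far more of it to reach 1M tokens.
FILLER = "The grass is green. The sky is blue. The sun is yellow. "


def build_passkey_prompt(passkey: str, n_filler: int, depth: float) -> str:
    """Hide a passkey sentence inside repeated filler text.

    depth is the relative position of the passkey: 0.0 = start, 1.0 = end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    needle = f"The pass key is {passkey}. Remember it. "
    before = round(n_filler * depth)   # filler sentences placed before the needle
    after = n_filler - before          # filler sentences placed after the needle
    return FILLER * before + needle + FILLER * after + "What is the pass key?"


def is_correct(model_answer: str, passkey: str) -> bool:
    """Count an answer as correct if it contains the exact passkey."""
    return passkey in model_answer


def make_cases(n_filler: int, depths: list[float], seed: int = 0):
    """Create one (depth, passkey, prompt) case per document depth."""
    rng = random.Random(seed)
    cases = []
    for depth in depths:
        passkey = str(rng.randint(10000, 99999))
        cases.append((depth, passkey, build_passkey_prompt(passkey, n_filler, depth)))
    return cases
```

Sweeping both n_filler (context length) and depth (position in the document) and recording is_correct for each case gives the grid that a heat map like the one above visualizes.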

In various benchmarks for long text comprehension, Qwen2.5-Turbo outperforms competing models like GPT-4 and GLM4-9B-1M. At the same time, its performance with short sequences remains comparable to GPT-4o-mini.


In a screen recording, Qwen demonstrates its new language model's ability to quickly summarize Cixin Liu's complete "Three-Body Problem" trilogy, with a total length of 690,000 tokens. | Video: Qwen

Sparse attention speeds up inference by 4.3x

By using sparse attention mechanisms, Qwen reduced the time to first token when processing 1 million tokens from 4.9 minutes to 68 seconds. This represents a 4.3x speed increase.
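The speedup follows directly from the two reported times to first token. The short calculation below only restates the figures from the paragraph above; the variable names are illustrative.

```python
# Reported time to first token for a 1-million-token input
baseline_seconds = 4.9 * 60   # 4.9 minutes before the change
sparse_seconds = 68           # 68 seconds with sparse attention

speedup = baseline_seconds / sparse_seconds
reduction = 1 - sparse_seconds / baseline_seconds

print(f"speedup: {speedup:.1f}x")                 # speedup: 4.3x
print(f"waiting time reduced by {reduction:.0%}")  # waiting time reduced by 77%
```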

Three bar charts compare Qwen models: context length (1M tokens), inference speed (68 s), and token price (24M tokens per dollar).
Qwen2.5-Turbo stands out with its significantly larger context of 1M tokens, faster inference speed, and better cost efficiency compared to other models. These improvements enable more efficient and cost-effective AI use. | Image: Qwen

 

The price remains at 0.3 yuan (4 cents) per 1 million tokens. At the same cost, Qwen2.5-Turbo can process 3.6 times as many tokens as GPT-4o-mini.
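The 3.6x figure can be reproduced with a rough calculation. Two inputs are assumptions that do not appear in the article: an exchange rate of about 7.24 yuan per US dollar, and a GPT-4o-mini input price of 0.15 US dollars per 1 million tokens. With different values for either, the ratio shifts accordingly.

```python
# From the article
QWEN_PRICE_CNY_PER_M = 0.3          # yuan per 1 million tokens

# Assumptions, not stated in the article
CNY_PER_USD = 7.24                  # approximate exchange rate
GPT4O_MINI_PRICE_USD_PER_M = 0.15   # assumed input price per 1 million tokens

qwen_price_usd_per_m = QWEN_PRICE_CNY_PER_M / CNY_PER_USD

# Millions of tokens that one US dollar buys with each model
qwen_m_tokens_per_usd = 1 / qwen_price_usd_per_m
gpt4o_mini_m_tokens_per_usd = 1 / GPT4O_MINI_PRICE_USD_PER_M

ratio = qwen_m_tokens_per_usd / gpt4o_mini_m_tokens_per_usd

print(f"Qwen2.5-Turbo: ${qwen_price_usd_per_m:.3f} per 1M tokens")
print(f"Qwen2.5-Turbo: {qwen_m_tokens_per_usd:.0f}M tokens per dollar")
print(f"Tokens per dollar vs. GPT-4o-mini: {ratio:.1f}x")
```

Under these assumptions the result is roughly 4 cents per million tokens, about 24 million tokens per dollar, and a ratio of about 3.6.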

Qwen2.5-Turbo is now available through the Alibaba Cloud Model Studio API and through demos on Hugging Face and ModelScope.


Qwen acknowledges room for improvement with long sequences

The company admits that the current model doesn't always perform satisfactorily when solving tasks with long sequences in real applications.

Many unsolved challenges remain, such as the model's less stable performance with long sequences and the high inference costs that make using larger models difficult.

Qwen plans to further explore human preference alignment for long sequences, optimize inference efficiency to reduce computation time, and work on bringing larger and more capable models with long context to market.

What's the use of large context windows?

The context windows of large language models have grown steadily over recent months. A practical standard has currently settled between 128,000 (GPT-4o) and 200,000 (Claude 3.5 Sonnet) tokens, though there are outliers like Gemini 1.5 Pro with up to 2 million tokens (and up to 10 million in Google's research tests) or Magic AI's LTM-2-mini with 100 million tokens.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

While these advances generally contribute to the utility of large language models, studies repeatedly cast doubt on the advantage of large context windows compared to RAG systems, where additional information is dynamically retrieved from vector databases.

Summary
  • Qwen has extended its Qwen2.5-Turbo language model to a context length of 1 million tokens, enabling the processing of ten novels, 150 hours of transcripts or 30,000 lines of code.
  • In the passkey retrieval task, Qwen2.5-Turbo achieves 100% accuracy regardless of the position of the information in the document, and outperforms competing models such as GPT-4 and GLM4-9B-1M in long text comprehension benchmarks.
  • Sparse attention mechanisms have cut the time to first token for 1 million tokens by a factor of 4.3, and the price remains at about 4 cents per million tokens. However, Qwen sees room for improvement in performance and cost for long sequences in real-world applications.
Jonathan works as a freelance tech journalist for THE DECODER, focusing on AI tools and how GenAI can be used in everyday work.