Alibaba's new math-optimized AI models Qwen2-Math school other top LLMs on math tasks

Alibaba Cloud has unveiled a new series of language models called Qwen2-Math, optimized for mathematical tasks. In benchmarks, these models perform better than general-purpose large language models like GPT-4 and Claude.

Qwen2-Math and Qwen2-Math-Instruct models are available in sizes ranging from 1.5 to 72 billion parameters. They're based on the general Qwen2 language models but underwent additional pre-training on a specialized math corpus.

This corpus includes high-quality mathematical web texts, books, code, exam questions, and math pre-training data generated by Qwen2. Alibaba claims this allows the Qwen2-Math models to surpass the mathematical capabilities of general-purpose LLMs like GPT-4.

In benchmarks such as GSM8K, MATH, and MMLU-STEM, the largest model, Qwen2-Math-72B-Instruct, outperforms models like GPT-4, Claude-3.5-Sonnet, Gemini-1.5-Pro, and Llama-3.1-405B. It also achieves top scores in Chinese math benchmarks like CMATH, GaoKao Math Cloze, and GaoKao Math QA.

Alibaba reports that case studies with Olympiad-level math problems show Qwen2-Math can solve simpler math competition problems. However, the Qwen team emphasizes they "do not guarantee the correctness of the claims in the process."

To keep overlaps between training and test data from skewing the results, the Qwen team says it decontaminated both its pre-training and post-training datasets.
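The article does not say how the Qwen team performed this cleanup. As a rough illustration of the general idea only, the following sketch drops training samples that share a word-level n-gram with any test sample. The function names, the n-gram length of 5, and the sample texts are assumptions made for this example, not details from Alibaba.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in the lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train: Iterable[str], test: Iterable[str], n: int = 5) -> List[str]:
    """Keep only training samples that share no n-gram with any test sample.

    Hypothetical helper for illustration; n=5 is an arbitrary choice.
    """
    test_grams: Set[Tuple[str, ...]] = set()
    for sample in test:
        test_grams |= ngrams(sample, n)
    return [s for s in train if not (ngrams(s, n) & test_grams)]


# Made-up samples: the first training text overlaps with the test question.
train_samples = [
    "Natalia sold clips to 48 of her friends in April and then half as many in May",
    "Compute the derivative of x squared plus three x",
]
test_samples = [
    "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May.",
]

clean = decontaminate(train_samples, test_samples, n=5)
print(clean)
```

A smaller n removes more near-duplicates but also discards more harmless text, so real pipelines typically tune this threshold and combine it with exact-match checks.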

The Qwen2-Math models are available on Hugging Face under the Tongyi Qianwen license. A separate commercial license is required for deployments with more than 100 million monthly users.

Currently, the Qwen2-Math models mainly support English. The team plans to release bilingual models supporting English and Chinese soon, with multilingual models in development.

The quest for logical AI

Alibaba Cloud developed the Qwen model series. Researchers published the first generation of Qwen language models in August 2023. The company recently introduced Qwen2, a more powerful successor with improvements in programming, mathematics, logic, and multilingual capabilities.

Alibaba says it aims to further improve the models' ability to solve complex mathematical problems. But it's unclear if training language models solely on math problems will lead to fundamental improvements in logical capabilities.

Google DeepMind and likely OpenAI are focusing on hybrid systems that combine the reasoning capabilities of classical AI algorithms like AlphaZero with generative AI.

Google DeepMind recently presented AlphaProof, which combines a pre-trained language model with AlphaZero. Together with AlphaGeometry 2, the system reached silver-medal level at the 2024 International Mathematical Olympiad (IMO). But how well the approach scales through reinforcement learning, and how well it generalizes, remains to be seen.
