
Alibaba Cloud has unveiled a new series of language models called Qwen2-Math, optimized for mathematical tasks. In math benchmarks, these models outperform general-purpose large language models such as GPT-4 and Claude.


Qwen2-Math and Qwen2-Math-Instruct models are available in sizes ranging from 1.5 to 72 billion parameters. They're based on the general Qwen2 language models but underwent additional pre-training on a specialized math corpus.

This corpus includes high-quality mathematical web texts, books, code, exam questions, and synthetic math pre-training data generated by Qwen2. Alibaba claims this allows the Qwen2-Math models to surpass the mathematical capabilities of general-purpose LLMs like GPT-4.

In benchmarks such as GSM8K, MATH, and MMLU-STEM, the largest model, Qwen2-Math-72B-Instruct, outperforms models like GPT-4, Claude-3.5-Sonnet, Gemini-1.5-Pro, and Llama-3.1-405B. It also achieves top scores in Chinese math benchmarks like CMATH, GaoKao Math Cloze, and GaoKao Math QA.

Image: Alibaba

Alibaba reports that case studies with math olympiad problems show Qwen2-Math can solve simpler math competition problems. However, the Qwen team emphasizes they "do not guarantee the correctness of the claims in the process."

To avoid skewing test results through overlaps between training and test data, the Qwen team says it decontaminated both the pre-training and the post-training datasets.
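Alibaba does not detail the procedure, but a common decontamination technique is n-gram overlap filtering: any training sample that shares a sufficiently long token n-gram with a test item is dropped. The sketch below is a hypothetical illustration of that general technique, not Qwen's actual pipeline; the 13-gram window is a convention from earlier LLM reports, and `decontaminate` is a made-up helper name.

```python
# Hypothetical sketch of n-gram decontamination; not Qwen's actual pipeline.
from typing import Iterable

N = 13  # n-gram length; a common choice in LLM decontamination reports

def ngrams(text: str, n: int = N) -> set[tuple[str, ...]]:
    """Return the set of whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train: Iterable[str], test: Iterable[str]) -> list[str]:
    """Keep only training samples that share no n-gram with any test item."""
    test_grams: set[tuple[str, ...]] = set()
    for item in test:
        test_grams |= ngrams(item)
    return [sample for sample in train if ngrams(sample).isdisjoint(test_grams)]
```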

The Qwen2-Math models are available under the Tongyi Qianwen license on Hugging Face. A commercial license is required for services with more than 100 million monthly active users.
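For readers who want to try the models, the instruct variants follow the standard Hugging Face transformers chat workflow. The snippet below is a minimal sketch, assuming the `Qwen/Qwen2-Math-7B-Instruct` checkpoint ID from Qwen's Hugging Face organization and a GPU with enough memory; it is an illustration, not an official recipe.

```python
# Minimal sketch: querying a Qwen2-Math instruct model via transformers.
# Assumes the Qwen/Qwen2-Math-7B-Instruct checkpoint and a CUDA-capable GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-Math-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful math assistant."},
    {"role": "user", "content": "Solve: if 3x + 7 = 22, what is x?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```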

Currently, Qwen2 math models mainly support English. The team plans to release bilingual models supporting English and Chinese soon, with multilingual models in development.

The quest for logical AI

Alibaba Cloud developed the Qwen model series. Researchers published the first generation of Qwen language models in August 2023. The company recently introduced Qwen2, a more powerful successor with improvements in programming, mathematics, logic, and multilingual capabilities.


Alibaba says it aims to further improve the models' ability to solve complex mathematical problems. But it remains unclear whether training language models solely on math problems leads to fundamental improvements in logical reasoning.

Google DeepMind, and likely OpenAI as well, are focusing on hybrid systems that combine the reasoning capabilities of classical AI algorithms such as AlphaZero with generative AI.

Google DeepMind recently presented AlphaProof, which combines a pre-trained language model with AlphaZero; the system performed at silver-medal level at this year's International Mathematical Olympiad (IMO). But it remains to be seen whether the reinforcement learning approach scales and generalizes beyond competition math.

Summary
  • Alibaba has introduced a new set of language models, called Qwen2-Math, which are specifically optimized for mathematical tasks and outperform general-purpose LLMs such as GPT-4 and Claude on math benchmarks.
  • The Qwen2-Math models are based on the general Qwen2 language models, but are additionally pre-trained on a dedicated mathematics corpus containing web texts, books, code, exam questions, and synthetic data.
  • The Qwen team plans to release bilingual models for English and Chinese as well as multilingual models in the near future.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.