Alibaba's Qwen2.5 only excels at math thanks to memorized training data

Jul 20, 2025

GPT-4o prompted by THE DECODER

Key Points

A recent study shows that the progress of Qwen2.5 models in mathematical reasoning is primarily due to data contamination in benchmarks such as MATH-500. The same methods fail on clean, newly created benchmarks.
Controlled experiments with synthetic tasks confirm this: Only correct feedback reliably improves the performance of Qwen2.5. Random or inverted reward signals lead to unstable or worse results.
The authors warn that contaminated benchmarks can lead to false conclusions about the capabilities of AI. They call for the use of clean evaluation methods.

A new study finds that Alibaba's Qwen2.5 models achieve high math scores mainly by memorizing training data rather than through genuine reasoning.

Researchers discovered that what appears to be progress in mathematical reasoning is largely due to data contamination. When tested on "clean" benchmarks that the model had not seen during training, Qwen2.5's performance dropped sharply.

To test this, the team gave Qwen2.5 only the first 60 percent of problems from the MATH 500 benchmark and asked it to complete the rest. Qwen2.5-Math-7B managed to reconstruct the missing 40 percent with 54.6 percent accuracy and answer correctly 53.6 percent of the time. In comparison, Llama3.1-8B managed just 3.8 and 2.4 percent. This suggests Qwen2.5 had already encountered these problems during training.

Comparison of EM and ROUGE-L results of three models on six math datasets at 80%, 60%, and 40% prompt lengths. — Qwen2.5-Math-7B can accurately reconstruct missing sections of MATH 500 benchmark problems, indicating it has likely seen the data before. | Image: Wu et al.

The researchers then tested the model with LiveMathBench (version 202505), a "clean" benchmark released after Qwen2.5. On this dataset, Qwen2.5's completion rate dropped to zero, matching Llama, and its answer accuracy fell to just two percent.

The likely reason is that Qwen2.5 was pre-trained on large online datasets, including GitHub repositories containing benchmark problems and their solutions. As a result, even random or incorrect reward signals during training could improve its results on MATH-500 because of its prior exposure to the data.

Bar chart: MATH-500 accuracy of Qwen2.5 and Llama-3.1 models with greedy/Average@16 decoding, with/without template. — Qwen2.5 models show steep performance drops on MATH-500 when response templates are changed, while Llama-3.1-8B remains almost unaffected. | Image: Wu et al.

To address this, the team created the RandomCalculation dataset, containing fully synthetic arithmetic problems generated after Qwen2.5's release. On these new problems, Qwen2.5's accuracy declined as problem complexity increased. Only correct reward signals improved performance, while random rewards made training unstable and inverted rewards degraded math skills.

Four line graphs: accuracy vs. calculation step for Qwen2.5-Math-7B and -7B-Instruct with/without template and Greedy/Avg@16 decoding. — All Qwen2.5 variants lose accuracy as the number of calculation steps increases on synthetic math problems. | Image: Wu et al.

Controlled RLVR (Reinforcement Learning with Verifiable Rewards) experiments confirmed these results: only correct rewards led to stable improvement, while random or inverted rewards failed to boost performance or actively degraded it.

These findings call into question the idea that Qwen2.5's math abilities reflect real reasoning. Instead, the results show the model relies heavily on memorized data.

Alibaba launched Qwen2.5 in September 2024, followed by the Qwen3 series. Whether these findings also apply to Qwen3 remains to be seen.

The study's authors warn that contaminated benchmarks can lead to misleading conclusions about AI progress. They recommend future research rely on clean, uncontaminated benchmarks and evaluate multiple model series for more reliable results.

Benchmark gaming isn't new

The results highlight how difficult it is to separate true reasoning from memorization in large language models, and why rigorous, clean evaluation methods are essential for trustworthy AI research. Previous work has shown that benchmarks can be manipulated or "gamed."

For example, Meta submitted a version of Llama 4 specifically tuned to perform well on the LMArena benchmark by using customized response formats. Other studies show that models like Gemini 2.5 Pro and Claude 3.5 Sonnet can identify test scenarios with up to 95 percent accuracy and adjust their responses, raising even broader questions about the validity of current evaluation methods.

AI News Without the Hype – Curated by Humans

As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.

Source: Arxiv