A new study from Tsinghua University and Shanghai Jiao Tong University examines whether reinforcement learning with verifiable rewards (RLVR) helps large language models reason better—or simply makes them more efficient at repeating known solutions.


The research finds that RLVR improves the chance of producing a correct answer on the first try—known as pass@1—but does not unlock new capabilities. "RLVR is not as powerful as previously believed—it doesn't enable the model to solve problems that the base model can't solve," writes study lead Yang Yue.

OpenAI CEO Sam Altman appears to be aware of these limitations. He has suggested that combining reasoning abilities with "a much bigger model" through pre-training could eventually lead to "the first bits or sort of signs of life on genuine new scientific knowledge," indicating that scale—not reinforcement alone—may be key to advancing reasoning capabilities.

RLVR is primarily used in training reasoning models on tasks with verifiable outcomes, such as mathematics, programming, and visual reasoning. Instead of relying on human feedback, it uses automatic signals—like correct calculations or passed code tests—as reward criteria. This approach has been applied in models including OpenAI’s o-series and Deepseek-R1.
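
To make the "verifiable" part concrete, here is a minimal sketch of what such an automatic reward signal might look like. The `solve` entry point, the test-case format, and the exact-match check are illustrative assumptions for this sketch, not the study's or any lab's actual training code.

```python
# Sketch of verifiable reward functions (illustrative only): the reward is 1.0
# if an automatic check passes, 0.0 otherwise - no human feedback involved.

def math_reward(model_answer: str, reference_answer: str) -> float:
    """Reward 1.0 if the model's final answer matches the verified solution."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def code_reward(candidate_source: str, test_cases: list[tuple[str, str]]) -> float:
    """Reward 1.0 only if the generated program passes every test case."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)   # define the candidate function
        solve = namespace["solve"]          # assumed entry-point name
        return 1.0 if all(str(solve(inp)) == out for inp, out in test_cases) else 0.0
    except Exception:
        return 0.0                          # any crash or wrong output: no reward
```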


Efficiency comes at the cost of variety

The study shows that RLVR reduces the diversity of outputs—referred to as entropy—by concentrating responses around a few high-reward solution paths. This increases the chances of success on a single try but limits the model’s ability to explore alternatives across multiple generations.
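
A toy illustration of that tradeoff (the numbers below are invented, not taken from the paper): sharpening the distribution over solution paths raises the probability of the single most likely path while lowering its Shannon entropy, which is exactly the pass@1-versus-diversity tension the study measures.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution over solution paths."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Invented probabilities over four solution paths, before and after
# RLVR-style sharpening toward the highest-reward path.
base = [0.40, 0.30, 0.20, 0.10]   # more diverse base model
rlvr = [0.85, 0.10, 0.04, 0.01]   # concentrated RLVR model

print(f"base: entropy {entropy(base):.2f} bits, top path {max(base):.2f}")
print(f"RLVR: entropy {entropy(rlvr):.2f} bits, top path {max(rlvr):.2f}")
# Lower entropy improves the odds of a correct first sample, but fewer
# distinct paths remain reachable when sampling many times.
```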

Researchers compared base models and RLVR variants using the pass@k metric, which measures whether at least one correct answer appears among several attempts. RLVR models performed better when only a few answers were sampled, due to their focus on high-probability strategies. However, when more answers were generated, base models outperformed them by producing a wider range of responses—regardless of the specific model or task type.
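
For reference, pass@k is commonly estimated with the unbiased estimator popularized by code-generation benchmarks: given n sampled answers of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether the study computes it exactly this way is not specified here, and the sample counts below are invented purely to show how a lower per-sample hit rate can still win once k grows.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: chance that at least one of k samples drawn
    from n generated answers (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # not enough wrong answers to fill k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# Invented counts: a sharpened model wins at k=1, a more diverse model
# catches up as k increases because one hit among k samples is enough.
print(pass_at_k(n=100, c=60, k=1))    # 0.60
print(pass_at_k(n=100, c=30, k=1))    # 0.30
print(pass_at_k(n=100, c=30, k=16))   # close to 1.0
```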

The same pattern held across mathematics, programming, and visual reasoning tasks. RLVR-trained models often succeeded on the first attempt but showed reduced performance when tested over multiple trials.

Video: Yue et al.

Manual review of chain-of-thought (CoT) reasoning revealed that base models were already capable of solving complex tasks using varied strategies—some of which had previously been attributed only to RL-trained models. Visualizations of reasoning paths confirmed that RLVR did not introduce new behaviors but instead increased the likelihood of selecting successful strategies already present in the base model.

Comparison diagram: Search trees for base and RLVR models on two problems, contrasting increased efficiency with reduced reasoning capacity. RLVR boosts success on task A but narrows the model's ability to generalize to task B, highlighting the tradeoffs in optimizing for different problem types. | Image: Yue et al.

RLVR helps repetition, not generalization

AI researcher Nathan Lambert describes the findings as consistent with expectations. "This isn’t a new intuition," he writes, "but a nice new set of results." According to Lambert, this is "cool because it shows that RL reduces the entropy of samples but makes the model more effective at pass@1."

He also notes the narrow scope of the training data, pointing out that the models were trained only on MATH and GSM8K—datasets he describes as "great for controlled ablations" but "not great for showing the fundamental limits of RL training." Broader conclusions, he argues, will require scaling the approach: "OpenAI and others have shown that scaling RL is a crucial aspect of it, and with only these narrow training sets that isn’t really possible."

Rather than reading the study as a critique of reinforcement learning as a whole, Lambert suggests it underscores the need for continued progress. As he puts it, "We just are getting to the point where we need to do hard things. Hard things are more interesting, but shocker, they're hard and take longer."

Yue notes that the study focused on RL models trained from scratch, without enhancements like chain-of-thought fine-tuning or knowledge distillation: "Here we focused on zero-RL trained model. OpenAI’s model should have extra COT finetuning and distillation etc." He also agrees that additional steps—such as warm-starting with supervised fine-tuning—could improve results for reasoning models.

Summary
  • Researchers at Tsinghua University and Shanghai Jiao Tong University found that reinforcement learning for reasoning models raises the success rate for individual answers, but does not help models solve new types of problems—models only succeed at tasks they already managed before.
  • The RLVR technique leads models to focus more on known solution paths, which improves the chance of a correct answer on a single try but reduces the diversity and overall performance when models are given multiple attempts.
  • According to the study, RLVR makes language models more efficient but does not expand their abilities beyond what the base model can do—contrary to what many in the industry may expect.