A new study from Tsinghua University and Shanghai Jiao Tong University examines whether reinforcement learning with verifiable rewards (RLVR) helps large language models reason better—or simply makes them more efficient at repeating known solutions.


The research finds that RLVR improves the chance of producing a correct answer on the first try—known as pass@1—but does not unlock new capabilities. "RLVR is not as powerful as previously believed—it doesn't enable the model to solve problems that the base model can't solve," writes study lead Yang Yue.

OpenAI CEO Sam Altman appears to be aware of these limitations. He has suggested that combining reasoning abilities with "a much bigger model" through pre-training could eventually lead to "the first bits or sort of signs of life on genuine new scientific knowledge," indicating that scale—not reinforcement alone—may be key to advancing reasoning capabilities.

RLVR is primarily used in training reasoning models on tasks with verifiable outcomes, such as mathematics, programming, and visual reasoning. Instead of relying on human feedback, it uses automatic signals—like correct calculations or passed code tests—as reward criteria. This approach has been applied in models including OpenAI’s o-series and Deepseek-R1.
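
To make the "verifiable" part concrete, here is a minimal sketch of what such an automatic reward signal might look like. The `solve` entry point, the test-case format, and the exact-match check are illustrative assumptions for this sketch, not the study's or any lab's actual training code.

```python
# Sketch of verifiable reward functions (illustrative only): the reward is 1.0
# if an automatic check passes, 0.0 otherwise - no human feedback involved.

def math_reward(model_answer: str, reference_answer: str) -> float:
    """Reward 1.0 if the model's final answer matches the verified solution."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def code_reward(candidate_source: str, test_cases: list[tuple[str, str]]) -> float:
    """Reward 1.0 only if the generated program passes every test case."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)   # define the candidate function
        solve = namespace["solve"]          # assumed entry-point name
        return 1.0 if all(str(solve(inp)) == out for inp, out in test_cases) else 0.0
    except Exception:
        return 0.0                          # any crash or wrong output: no reward
```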


Efficiency comes at the cost of variety

The study shows that RLVR reduces the diversity of outputs—referred to as entropy—by concentrating responses around a few high-reward solution paths. This increases the chances of success on a single try but limits the model’s ability to explore alternatives across multiple generations.
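
A toy illustration of that tradeoff (the numbers below are invented, not taken from the paper): sharpening the distribution over solution paths raises the probability of the single most likely path while lowering its Shannon entropy, which is exactly the pass@1-versus-diversity tension the study measures.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution over solution paths."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Invented probabilities over four solution paths, before and after
# RLVR-style sharpening toward the highest-reward path.
base = [0.40, 0.30, 0.20, 0.10]   # more diverse base model
rlvr = [0.85, 0.10, 0.04, 0.01]   # concentrated RLVR model

print(f"base: entropy {entropy(base):.2f} bits, top path {max(base):.2f}")
print(f"RLVR: entropy {entropy(rlvr):.2f} bits, top path {max(rlvr):.2f}")
# Lower entropy improves the odds of a correct first sample, but fewer
# distinct paths remain reachable when sampling many times.
```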

Researchers compared base models and RLVR variants using the pass@k metric, which measures whether at least one correct answer appears among several attempts. RLVR models performed better when only a few answers were sampled, due to their focus on high-probability strategies. However, when more answers were generated, base models outperformed them by producing a wider range of responses—regardless of the specific model or task type.
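
For reference, pass@k is commonly estimated with the unbiased estimator popularized by code-generation benchmarks: given n sampled answers of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether the study computes it exactly this way is not specified here, and the sample counts below are invented purely to show how a lower per-sample hit rate can still win once k grows.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: chance that at least one of k samples drawn
    from n generated answers (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # not enough wrong answers to fill k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# Invented counts: a sharpened model wins at k=1, a more diverse model
# catches up as k increases because one hit among k samples is enough.
print(pass_at_k(n=100, c=60, k=1))    # 0.60
print(pass_at_k(n=100, c=30, k=1))    # 0.30
print(pass_at_k(n=100, c=30, k=16))   # close to 1.0
```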

The same pattern held across mathematics, programming, and visual reasoning tasks. RLVR-trained models often succeeded on the first attempt but showed reduced performance when tested over multiple trials.

Video: Yue et al.

Manual review of chain-of-thought (CoT) reasoning revealed that base models were already capable of solving complex tasks using varied strategies—some of which had previously been attributed only to RL-trained models. Visualizations of reasoning paths confirmed that RLVR did not introduce new behaviors but instead increased the likelihood of selecting successful strategies already present in the base model.

Comparison diagram: Search trees for base and RLVR models on two problems, contrasting increased efficiency with reduced reasoning capacity. RLVR boosts success on task A but narrows the model's ability to generalize to task B, highlighting the tradeoffs in optimizing for different problem types. | Image: Yue et al.

RLVR helps repetition, not generalization

AI researcher Nathan Lambert describes the findings as consistent with expectations. "This isn’t a new intuition," he writes, "but a nice new set of results." According to Lambert, this is "cool because it shows that RL reduces the entropy of samples but makes the model more effective at pass@1."

He also notes the narrow scope of the training data, pointing out that the models were trained only on MATH and GSM8K—datasets he describes as "great for controlled ablations" but "not great for showing the fundamental limits of RL training." Broader conclusions, he argues, will require scaling the approach: "OpenAI and others have shown that scaling RL is a crucial aspect of it, and with only these narrow training sets that isn’t really possible."

Rather than reading the study as a critique of reinforcement learning as a whole, Lambert suggests it underscores the need for continued progress. As he puts it, "We just are getting to the point where we need to do hard things. Hard things are more interesting, but shocker, they're hard and take longer."

Yue notes that the study focused on RL models trained from scratch, without enhancements like chain-of-thought fine-tuning or knowledge distillation: "Here we focused on zero-RL trained model. OpenAI’s model should have extra COT finetuning and distillation etc." He also agrees that additional steps—such as warm-starting with supervised fine-tuning—could improve results for reasoning models.

Summary
  • Researchers at Tsinghua University and Shanghai Jiao Tong University found that reinforcement learning for reasoning models raises the success rate for individual answers, but does not help models solve new types of problems—models only succeed at tasks they already managed before.
  • The RLVR technique leads models to focus more on known solution paths, which improves the chance of a correct answer on a single try but reduces the diversity and overall performance when models are given multiple attempts.
  • According to the study, RLVR makes language models more efficient but does not expand their abilities beyond what the base model can do—contrary to what many in the industry may expect.