
Update from November 11, 2025:


The paper discussed below has now earned the highest possible score at NeurIPS, one of the world’s leading machine learning conferences.

Since its release, the work has sparked debate. Some researchers argue that the use of high "pass@k" values in benchmarks—where a model is given hundreds or even thousands of chances to solve a problem, and only needs to get the right answer once—doesn't measure genuine reasoning. Instead, it just shows whether a model can eventually stumble onto the correct answer by chance.

Others point out that the stronger consistency seen in reinforcement learning (RL)-trained models might actually reflect more focused, intelligent reasoning rather than a flaw. Critics say that, instead of just checking if a model ever gets the right answer with enough tries, benchmarks should test whether a model systematically follows logical reasoning steps and reaches the correct conclusions more often.


The authors acknowledge that "pass@1024"—which allows the model 1,024 attempts and counts a success if any answer is correct—can be skewed by luck on tasks with only a handful of possible answers, such as AIME. Still, they emphasize that the same patterns hold for tougher problems, including programming and math tests, where guessing isn’t enough. Their manual analysis also shows that base models frequently produce sound logical solutions, which they argue supports the idea that large, pretrained base models have more reasoning potential than previously assumed. Looking forward, the team plans to introduce explicit random baselines to better control for lucky guesses in future studies.
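To make the "lucky guess" concern concrete, here is a rough back-of-the-envelope sketch. If a model's attempts effectively scatter over a pool of m plausible answers (the pool sizes below are illustrative assumptions, not figures from the paper), independent guessing alone solves the task with probability 1 - (1 - 1/m)^k after k samples:

```python
# Rough illustration of the "lucky guess" concern behind high pass@k values.
# Assumes guesses are independent and spread uniformly over a pool of `m`
# plausible answers -- a simplification for intuition, not a result from the paper.

def p_hit_by_chance(m: int, k: int) -> float:
    """Probability that at least one of k uniform random guesses
    over m candidate answers is correct."""
    return 1.0 - (1.0 - 1.0 / m) ** k

for m in (10, 100, 1000):
    print(f"answer pool of {m:>4}: chance of passing pass@1024 ≈ {p_hit_by_chance(m, 1024):.2f}")
```

Even with a pool of 1,000 candidate answers, blind guessing clears pass@1024 roughly 64 percent of the time, which is exactly the kind of effect an explicit random baseline would expose.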

The authors stress that their paper does not claim reinforcement learning can never improve reasoning or go beyond what a base model can do. Instead, they plan further experiments to explore if and how RL can enhance LLM reasoning, and note that results may shift as models and datasets grow larger.

Article from April 22, 2025:

A new study from Tsinghua University and Shanghai Jiao Tong University examines whether reinforcement learning with verifiable rewards (RLVR) helps large language models reason better—or simply makes them more efficient at repeating known solutions.

The research finds that RLVR improves the chance of producing a correct answer on the first try—known as pass@1—but does not unlock new capabilities. "RLVR is not as powerful as previously believed—it doesn't enable the model to solve problems that the base model can't solve," writes study lead Yang Yue.


OpenAI CEO Sam Altman appears to be aware of these limitations. He has suggested that combining reasoning abilities with "a much bigger model" through pre-training could eventually lead to "the first bits or sort of signs of life on genuine new scientific knowledge," indicating that scale—not reinforcement alone—may be key to advancing reasoning capabilities.

RLVR is primarily used in training reasoning models on tasks with verifiable outcomes, such as mathematics, programming, and visual reasoning. Instead of relying on human feedback, it uses automatic signals—like correct calculations or passed code tests—as reward criteria. This approach has been applied in models including OpenAI’s o-series and Deepseek-R1.
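As a rough illustration of what "verifiable" means here, the reward for a math problem can be as simple as an automatic check of the final answer. The toy checker below is a sketch for intuition only; real RLVR pipelines use far more robust graders such as exact-match scripts or unit-test harnesses:

```python
import re

def verifiable_reward(model_output: str, reference_answer: str) -> float:
    """Toy verifiable-reward signal: 1.0 if the last number in the model's
    output matches the reference answer, else 0.0. Illustrative only --
    production graders handle formatting, units, and code execution."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    return 1.0 if numbers and numbers[-1] == reference_answer else 0.0

print(verifiable_reward("The sum of the series is 42.", "42"))  # 1.0
print(verifiable_reward("Therefore the answer is 41.", "42"))   # 0.0
```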

Efficiency comes at the cost of variety

The study shows that RLVR reduces the diversity of outputs—referred to as entropy—by concentrating responses around a few high-reward solution paths. This increases the chances of success on a single try but limits the model’s ability to explore alternatives across multiple generations.
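Entropy here is simply a measure of how spread out the model's sampled solutions are. A minimal sketch with made-up probabilities (purely illustrative, not data from the study) shows how concentrating mass on a few solution paths lowers it:

```python
import math

def shannon_entropy(probs) -> float:
    """Shannon entropy in bits of a discrete distribution over solution paths."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical base model: samples spread evenly over eight distinct solution paths.
base = [1 / 8] * 8
# Hypothetical RLVR-tuned model: mass concentrated on two high-reward paths.
rlvr = [0.7, 0.2, 0.05, 0.05]

print(f"base model entropy: {shannon_entropy(base):.2f} bits")  # 3.00 bits
print(f"RLVR model entropy: {shannon_entropy(rlvr):.2f} bits")  # ≈ 1.26 bits
```

Lower entropy means the first sample is more likely to land on a winning path, but repeated sampling explores fewer alternatives.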

Researchers compared base models and RLVR variants using the pass@k metric, which measures whether at least one correct answer appears among several attempts. RLVR models performed better when only a few answers were sampled, due to their focus on high-probability strategies. However, when more answers were generated, base models outperformed them by producing a wider range of responses—regardless of the specific model or task type.
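For reference, pass@k is normally computed with the unbiased estimator popularized by OpenAI's Codex paper rather than by literally rerunning k fresh trials. A minimal version, assuming n sampled generations per problem of which c are correct, looks like this:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn without replacement from n generations containing c correct ones,
    solves the problem (estimator from Chen et al., 2021)."""
    if n - c < k:  # fewer incorrect generations than k draws -> guaranteed hit
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy numbers for illustration only: 1,024 generations, 40 of them correct.
print(f"pass@1   ≈ {pass_at_k(1024, 40, 1):.3f}")    # ≈ 0.039
print(f"pass@256 ≈ {pass_at_k(1024, 40, 256):.3f}")  # close to 1.0 -- one hit suffices
```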


The same pattern held across mathematics, programming, and visual reasoning tasks. RLVR-trained models often succeeded on the first attempt but showed reduced performance when tested over multiple trials.

Video: Yue et al.

Manual review of chain-of-thought (CoT) reasoning revealed that base models were already capable of solving complex tasks using varied strategies—some of which had previously been attributed only to RL-trained models. Visualizations of reasoning paths confirmed that RLVR did not introduce new behaviors but instead increased the likelihood of selecting successful strategies already present in the base model.

Comparison diagram: Search trees for base and RLVR models for two problems, showing increased efficiency vs. reduced reasoning capacity.
The graph shows how RLVR training affects reasoning performance in language models. While it boosts efficiency on task A, it narrows the model’s ability to generalize to task B—highlighting tradeoffs in optimizing for different problem types. | Image: Yue et al.

RLVR helps repetition, not generalization

AI researcher Nathan Lambert describes the findings as consistent with expectations. "This isn’t a new intuition," he writes, "but a nice new set of results." According to Lambert, this is "cool because it shows that RL reduces the entropy of samples but makes the model more effective at pass@1."

He also notes the narrow scope of the training data, pointing out that the models were trained only on MATH and GSM8K—datasets he describes as "great for controlled ablations" but "not great for showing the fundamental limits of RL training." Broader conclusions, he argues, will require scaling the approach: "OpenAI and others have shown that scaling RL is a crucial aspect of it, and with only these narrow training sets that isn’t really possible."

Rather than reading the study as a critique of reinforcement learning as a whole, Lambert suggests it underscores the need for continued progress: "We just are getting to the point where we need to do hard things. Hard things are more interesting, but shocker, they're hard and take longer."


Yue notes that the study focused on RL models trained from scratch, without enhancements like chain-of-thought fine-tuning or knowledge distillation: "Here we focused on zero-RL trained model. OpenAI’s model should have extra COT finetuning and distillation etc." He also agrees that additional steps—such as warm-starting with supervised fine-tuning—could improve results for reasoning models.

Summary
  • Researchers at Tsinghua University and Shanghai Jiao Tong University found that reinforcement learning for reasoning models raises the success rate for individual answers, but does not help models solve new types of problems—models only succeed at tasks they already managed before.
  • The RLVR technique leads models to focus more on known solution paths, which improves the chance of a correct answer on a single try but reduces the diversity and overall performance when models are given multiple attempts.
  • According to the study, RLVR makes language models more efficient but does not expand their abilities beyond what the base model can do—contrary to what many in the industry may expect.