Limit of RLVR
We systematically study Reinforcement Learning with Verifiable Rewards (RLVR) across math, coding, and vision benchmarks and uncover that RL fine-tuning enhances sampling efficiency without expanding the reasoning capacity already present in base models.
Video: pass@k curves of base models and their zero-RL-trained counterparts across multiple mathematical benchmarks. When k is small, RL-trained models outperform their base versions. However, as k increases to the tens or hundreds, base models catch up with and eventually surpass the RL-trained models, across all benchmarks and LLM families without exception.
Thank you for visiting our website and reading our paper! If you are already familiar with the paper or have questions about its conclusions, you may start with this section. If not, we recommend reading the main content first, then returning here.
Why Study the Capability Boundary Under Multiple Sampling?
This paper does not focus on user experience or practical utility, but rather on the limits of model capability. A key question we ask is: since RL is often viewed as a core pathway toward stronger models, can RL training alone one day enable models to solve open problems, such as mathematical proofs or biological discoveries, that neither the base model nor humans can solve? Other paradigms, such as AlphaEvolve, have already achieved breakthroughs on open problems previously unsolved by humans.
Our Core Conclusion
Current RLVR methods (specifically those using binary 0/1 rewards) do not appear to teach models fundamentally new reasoning abilities. Instead, they primarily amplify reasoning capabilities that already exist within the base model. The observed performance gains (e.g., in avg@n) mostly arise from improved sampling efficiency. A key piece of evidence is that RL models rarely solve problems that the base model cannot solve at all.
The Intuition Behind This
Effective exploration in the vast language space is extremely difficult. If the base model cannot sample correct solutions, then under a 0/1 reward scheme RL receives no useful gradient signal and therefore cannot learn better strategies. In such settings, the RL-trained model naturally struggles to surpass the base model.
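To make this concrete, below is a minimal sketch of the group-normalized advantage used in GRPO-style RLVR (heavily simplified; real training also involves clipping, KL regularization, and more). When every completion sampled for a prompt receives the same 0 reward, all advantages are zero and the policy gets no gradient from that prompt.

import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages in the style of GRPO (simplified sketch).

    With binary 0/1 rewards, if every completion in the group fails
    (all rewards are 0), the advantages are all zero and the policy
    gradient for that prompt vanishes: there is nothing to learn from.
    """
    r = np.asarray(rewards, dtype=float)
    if r.std() == 0:                 # all completions equally (un)rewarded
        return np.zeros_like(r)      # no relative signal, no update
    return (r - r.mean()) / r.std()

print(grpo_advantages([0, 0, 0, 0]))  # [0. 0. 0. 0.] -> no gradient for this prompt
print(grpo_advantages([0, 1, 0, 0]))  # the one correct path gets a positive advantage

This is why prompts that the base model never solves contribute nothing to training under a binary reward.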
This Is Not a Rejection of the RL Paradigm
The above intuition applies only to the 0/1 reward setting, especially GRPO-style methods that omit a value network. In a broader RL context, there remain many promising directions for breaking this limit.
While these directions are theoretically promising, none has yet worked reliably in practice. We hope the community will continue advancing these ideas.
If you still have questions, for example about the "monkey-typing" analogy (infinite sampling), lucky guessing, whether RL is useful at all, or whether the training scale was sufficient, please refer to the Q&A section below for detailed explanations.
Recent breakthroughs in reasoning-focused large language models (LLMs) like OpenAI-o1, DeepSeek-R1, and Kimi-1.5 have largely relied on Reinforcement Learning with Verifiable Rewards (RLVR), which replaces human annotations with automated rewards (e.g., verified math solutions or passing code tests) to scale self-improvement. While RLVR enhances reasoning behaviors such as self-reflection and iterative refinement, we challenge a core assumption:
Does RLVR actually expand LLMs' reasoning capabilities, or does it merely optimize existing ones?
By evaluating models via pass@k, where success requires just one correct solution among k attempts, we uncover that RL-trained models excel at low k (e.g., pass@1) but are consistently outperformed by base models at high k (e.g., pass@256). This demonstrates that RLVR narrows the model's exploration, favoring known high-reward paths instead of discovering new reasoning strategies. Crucially, all correct solutions from RL-trained models already exist in the base model's distribution, proving RLVR enhances sampling efficiency, not reasoning capacity, while inadvertently shrinking the solution space.
Video: The effect of RLVR on an LLM's reasoning ability. Search trees are generated by repeated sampling from the base and RLVR-trained models for a given problem. Grey indicates paths that are unlikely to be sampled by the model, while black indicates paths that are likely to be sampled. Green indicates correct paths, which receive positive rewards. Our key finding is that all reasoning paths in the RLVR model are already present in the base model. For certain problems, like Problem A, RLVR training biases the distribution toward rewarded paths, improving sampling efficiency. However, this comes at the cost of a reduced scope of reasoning capacity: for other problems, like Problem B, the base model contains the correct path, whereas the RLVR model does not.
[1/3] We use pass@k not to measure practical utility, but to explore the reasoning capacity boundary of LLMs, as outlined in our paper (a sketch of the estimator appears after this answer).
[2/3] If a model can solve a difficult problem at least once in k samples, we consider that problem within its potential reasoning scope. If RL training truly expands reasoning, we would expect the RL model to solve more such problems than the base.
[3/3] However, we observe the opposite: RLVR models often solve fewer problems at large k, suggesting a narrowing of this boundary. This implies RLVR is optimizing within the base model's capabilities, not extending them.
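For reference, here is a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021), the usual way such numbers are computed: draw n completions per problem, count the c correct ones, and estimate the probability that at least one of k attempts succeeds.

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: 1 - C(n-c, k) / C(n, k).

    n: total completions sampled for a problem
    c: how many of them are correct
    k: attempt budget
    """
    if n - c < k:          # fewer than k incorrect samples -> a correct one is always drawn
        return 1.0
    # numerically stable product form of 1 - C(n-c, k) / C(n, k)
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=256, c=1, k=1))    # ~0.004: rarely solved on a single try
print(pass_at_k(n=256, c=1, k=128))  # 0.5: but well within a 128-sample budget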
[1/3] It's true that pass@1024 can be noisy for datasets like AIME, whose answers lie in a limited integer space (see the quick calculation below). But we also evaluated coding benchmarks, where it is nearly impossible to pass the unit tests by guessing, and similar patterns still hold, with base models performing better at large k.
[2/3] For AIME and GSM8K, we manually inspected the CoT outputs and found that, in most cases, the base model produced at least one correct reasoning path—not just lucky guesses.
[3/3] On datasets such as MATH500, where answers involve complex forms (roots, fractions, symbolic math) and are hard to guess, we again observed similar trends.
Together, these results highlight the often underappreciated reasoning potential of base models. We're considering including a random sampling baseline in future work.
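As a quick back-of-the-envelope version of such a baseline (pure arithmetic, no model involved): AIME answers are integers from 0 to 999, so uniform random guessing already has a sizeable chance of hitting at least one correct answer within 1024 attempts, assuming every sample yields some parseable final integer. This is exactly why we cross-check with coding benchmarks and manual CoT inspection.

# Chance that at least one of k uniform random guesses over the 1000
# possible AIME answers (integers 0-999) happens to be correct.
k = 1024
p_lucky = 1 - (1 - 1 / 1000) ** k
print(f"{p_lucky:.2f}")  # ~0.64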
Not quite. "More is different." In theory, even random typing has a non-zero chance of producing a correct answer, roughly 1/V^L, where V is the vocabulary size (~30k) and L is the output length (>200 tokens). In practice, however, that search space is astronomically large.
The key point is that the magnitude of that probability matters. If the base model has a prior that gives the correct answer a 1 in 10⁴ or 10⁵ chance, then RL might find it with millions of samples. But if that probability is 1 in 10¹⁰ or smaller, RL is extremely unlikely to escape local optima and reach meaningful reward.
In our paper, we show that for most problems, this probability is not negligible—we observe correct outputs with k = 128 or 1024, which is feasible with today's resources. So rather than being meaningless, pass@k reveals that base models already possess the necessary reasoning paths.
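A rough numerical illustration of the magnitudes discussed above (the vocabulary size, output length, and the 1-in-10^4 prior are the same order-of-magnitude assumptions as in the text, not measured values):

import math

V, L = 30_000, 200                       # vocabulary size and output length used above
print(f"random typing: ~10^{-L * math.log10(V):.0f}")   # ~10^-895: hopeless to sample

prior = 1e-4   # assumed per-sample success probability of a base model on a hard problem
for k in (128, 1024):
    print(f"pass@{k} with a 1e-4 prior: {1 - (1 - prior) ** k:.3f}")
# pass@128 ~ 0.013, pass@1024 ~ 0.097: rare but reachable with realistic sampling budgets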
It's not surprising that RLVR turns pass@k into pass@1—that's what RL is designed for.
But what's more interesting is that RLVR doesn't do much beyond that in our experiments. It doesn't seem to introduce new reasoning abilities: if the base model can't solve a problem, the RL-trained model still can't either. This clearly highlights the upper bound of RL in reasoning.
And that is not obvious a priori. In traditional RL domains such as Atari or Go, RL is known to explore and discover new strategies, continuously self-improving without an inherent bound. In the case of LLMs, however, RLVR appears constrained by the base model's existing capabilities.
In fact, the finding that RL-trained models perform worse than base models at large k in pass@k has surprised many researchers.
In the camera-ready version, we include preliminary scaling experiments using Magistral, a near-frontier pure-RLVR model, and the conclusion remains consistent. It is still an open question whether scaling RLVR compute to 10–1000× that of Magistral would actually produce new knowledge beyond pretraining.
No. RL remains practically useful because it improves sample efficiency. However, if we want LLMs to solve truly harder problems beyond pretraining, we may need a new training paradigm that can go beyond the base model's ceiling.
Yes, DS-Math did observe similar trends, but their study was limited to a single instruction-tuned model and two math benchmarks.
In contrast, our work systematically investigates this across true base models in a zero-RL setting, covering multiple LLM families and a wider range of benchmarks.
We also go further by providing deeper analyses, including perplexity trends, different RL algorithms, and evaluation against distilled models, offering a more comprehensive view of RLVR's capabilities and limitations.
We believe the fact that the reasoning scope of RLVR models is bounded by the base model is a notable phenomenon that deserves deeper attention.
FYI: The follow-up paper "The Invisible Leash" also tested NVIDIA's ProRL v1/v2 and AceReason-Nemotron, and reported the same observation that base models surpass RL-trained models at large k on both math and coding tasks.
We conducted experiments across three representative domains to evaluate the effect of RLVR on the reasoning ability boundaries of base and RLVR models.
In the math experiments, we evaluate multiple LLM families (Qwen-2.5 and LLaMA-3.1) and their RL-trained variants on benchmarks like GSM8K, MATH500, and AIME24. We analyze pass@k curves to compare base and RL-trained models, observing that RL improves low-k performance but reduces problem coverage at high k. We manually inspect CoT validity to ensure correct answers stem from valid reasoning, not lucky guesses. Additionally, we examine Oat-Zero-trained models and filter guessable problems to focus on challenging cases. The results show base models maintain broader reasoning coverage despite RL's initial accuracy gains.
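As a sketch of how a pass@k curve is produced from the samples (assuming n completions per problem and the unbiased estimator sketched earlier; the example numbers below are illustrative only):

from math import comb
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Same unbiased estimator as above, written via binomial coefficients.
    return 1.0 if n - c < k else 1.0 - comb(n - c, k) / comb(n, k)

def pass_at_k_curve(correct_counts, n, ks):
    """Average pass@k over all problems of a benchmark, one point per k."""
    return [float(np.mean([pass_at_k(n, c, k) for c in correct_counts])) for k in ks]

# Illustrative benchmark of 4 problems, n = 8 samples each, with 6/2/1/0 correct:
print(pass_at_k_curve([6, 2, 1, 0], n=8, ks=[1, 2, 4, 8]))
# pass@1 reflects single-try accuracy; pass@8 reflects coverage of solvable problems.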
In the coding experiments, we evaluate the RLVR-trained model CodeR1-Zero-Qwen2.5-7B, derived from Qwen2.5-7B-Instruct-1M, on benchmarks like LiveCodeBench, HumanEval+, and MBPP+. We assess performance using pass@k metrics, measuring correctness based on predefined test cases. The results show RLVR improves single-sample pass@1 scores but reduces coverage at higher sampling counts (k = 128). The original model exhibits continued potential for improvement with larger k, while RLVR's performance plateaus. This indicates RLVR enhances deterministic accuracy but limits exploration diversity.
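For the coding benchmarks, "correct" means passing the problem's predefined test cases. A highly simplified harness in the spirit of that check (the single-file layout, direct subprocess call, and absence of sandboxing are our own simplifications, not the benchmarks' actual infrastructure) could look like:

import os, subprocess, sys, tempfile

def passes_tests(candidate_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Return True if a sampled completion passes the problem's unit tests.

    The completion and its tests are written to one temporary file and run in a
    fresh interpreter; a non-zero exit code or a timeout counts as a failure.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code + "\n")
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

# c = sum(passes_tests(s, tests) for s in samples) then feeds the pass@k estimator above.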
In the experiments on visual reasoning, we evaluate Qwen-2.5-VL-7B on filtered visual reasoning benchmarks (MathVista and MathVision), removing multiple-choice questions to focus on robust problem-solving. The improvements from RLVR in visual reasoning align with those seen in math and coding benchmarks, indicating that the original model already covers a broad range of solvable problems, even in multimodal tasks. The consistency across domains suggests that RLVR enhances reasoning capabilities without fundamentally altering the model's problem-solving approach.
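The multiple-choice filtering mentioned above amounts to dropping items whose answers could be picked from a handful of given options; a minimal sketch (the question_type field and its values are hypothetical and differ per benchmark):

def keep_free_form(dataset):
    """Keep only free-form items so that a correct answer cannot come from
    picking among a few given options. Field name and values are illustrative."""
    return [ex for ex in dataset if ex.get("question_type") != "multi_choice"]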
We present ONE of the sampled correct CoTs from the base model, manually selected from 2048 samples for each of the hardest questions in AIME24. The responses from the base model tend to be long CoTs and exhibit reflective behavior, highlighting the strong reasoning ability inherent in the base model.
AceReason-Nemotron reported that, after drawing 64 samples, their RL-trained model successfully solved four problems (No. 3, 14, 29, and 30) from the AIME24 dataset that DeepSeek-R1-Distill-Qwen-7B (the base model in their RL training) failed to solve. While their RL-trained model is well trained and impressively powerful, our findings suggest that the base model itself still exhibits considerable potential. With increased sampling, we observed that these four problems can indeed be solved by the base model: Problem No. 3 was solved after 5120 samples, and Problems No. 14, 29, and 30 were solved within 1024 samples. For each case, we provide representative examples of correct Chains of Thought (CoTs) and corresponding final answers:
Condensed: Long runs of consecutive, similar reflective passages beginning with "Wait," have been summarized and abridged for better readability.
@article{yue2025limit-of-rlvr,
title={Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?},
author={Yue, Yang and Chen, Zhiqi and Lu, Rui and Zhao, Andrew and Wang, Zhaokai and Yue, Yang and Song, Shiji and Huang, Gao},
journal={arXiv preprint arXiv:2504.13837},
year={2025}
}