Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

1 Tsinghua University, LeapLab    2 Shanghai Jiao Tong University
* Equal Contribution    Project Lead    Corresponding Author

Yang Yue is currently focused on developing new paradigms for incentivizing LLM/MLLM reasoning, on generalized world models, and on exploring the generalization of VLA models. He is seeking active collaboration with companies that offer the freedom to explore these frontier, fundamental questions, along with abundant resources and a strong technical atmosphere, and he is also seeking a Ph.D. visit. Please feel free to reach out if there is potential for collaboration.

Video: pass@k curves of base models and their zero-RL-trained counterparts across multiple mathematical benchmarks.
When k is small, RL-trained models outperform their base versions. However, as k increases to the tens or hundreds,
base models consistently catch up with RL-trained models across all benchmarks and LLM families without exception.
Eventually, base models surpass RL-trained models.

Introducing Our Work

Recent breakthroughs in reasoning-focused large language models (LLMs) like OpenAI-o1, DeepSeek-R1, and Kimi-1.5 have largely relied on Reinforcement Learning with Verifiable Rewards (RLVR), which replaces human annotations with automated rewards (e.g., verified math solutions or passing code tests) to scale self-improvement. While RLVR enhances reasoning behaviors such as self-reflection and iterative refinement, we challenge a core assumption:

Does RLVR actually expand LLMs' reasoning capabilities, or does it merely optimize existing ones?

By evaluating models via pass@k, where success requires just one correct solution among k attempts, we uncover that RL-trained models excel at low k (e.g., pass@1) but are consistently outperformed by base models at high k (e.g., pass@256). This demonstrates that RLVR narrows the model's exploration, favoring known high-reward paths instead of discovering new reasoning strategies. Crucially, all correct solutions from RL-trained models already exist in the base model's distribution, proving RLVR enhances sampling efficiency, not reasoning capacity, while inadvertently shrinking the solution space.
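For reference, pass@k is usually computed with the standard unbiased estimator popularized by the Codex paper: sample n >= k generations per problem, count the c correct ones, and estimate 1 - C(n-c, k)/C(n, k), averaged over problems. A minimal sketch (the function name and the example numbers are illustrative):

from math import prod

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: probability that at least one of k
    # generations drawn from n total is correct, given c correct ones.
    # Equals 1 - C(n-c, k) / C(n, k), computed as a stable product.
    if n - c < k:
        return 1.0
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 2048 samples for one problem, 3 of them correct.
print(pass_at_k(2048, 3, 1))    # ~0.0015
print(pass_at_k(2048, 3, 256))  # ~0.33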

Video: The effect of RLVR on LLM's reasoning ability. Search trees are generated by repeated sampling from the base and
RLVR-trained models for a given problem. Grey indicates paths that are unlikely to be sampled by the model, while black
indicates paths that are likely to be sampled. Green indicates correct paths, which receive positive rewards.
Our key finding is that all reasoning paths in the RLVR model are already present in the base model.
For certain problems like Problem A, RLVR training biases the distribution toward rewarded paths, improving sampling
efficiency. However, this comes at the cost of a reduced scope of reasoning capacity: for other problems like Problem B,
the base model's search tree contains the correct path, whereas the RLVR model's does not.

Conclusion

  1. RL-trained models perform worse than base models in pass@k at large k values.
    While RL-trained models outperform base models at small sampling budgets (small k), base models consistently surpass them at larger k across all benchmarks, ultimately reaching higher pass@k scores. Manual inspection reveals that base models can solve problems thought to require RL training by generating diverse reasoning paths, with at least one correct solution per problem. This indicates that RL training does not enhance, and may even limit, the full reasoning potential of LLMs compared to aggressive sampling from the base model.
  2. RL boosts sampling efficiency but reduces the reasoning capacity boundary.
    The analysis reveals that RLVR-trained models generate reasoning paths already within the base model's output distribution, meaning RLVR biases the model toward higher-rewarded solutions rather than creating entirely new reasoning abilities. However, this focus on rewarded paths reduces the model's exploration capacity, limiting its coverage of solvable problems at larger sampling sizes. These findings suggest that RLVR does not fundamentally transcend the base model's reasoning capabilities but instead optimizes existing pathways at the cost of broader problem-solving diversity.
  3. RLVR algorithms perform similarly and remain far from optimal.
    The study compares various RL algorithms (PPO, GRPO, Reinforce++) and finds their performance differences minor, as measured by the sampling efficiency gap (∆SE), which assesses how close they get to optimal sampling efficiency (one way to compute such a gap is sketched after this list). Despite slight variations in ∆SE among algorithms, the gap remains large across all methods. This indicates that current RL approaches, focused on improving sampling efficiency, still fall far short of optimal performance.
  4. RLVR and distillation are fundamentally different.
    While RL improves sampling efficiency, distillation can genuinely introduce new knowledge into the model. As a result, distilled models often exhibit an expanded scope of reasoning capability beyond that of the base model by learning from a stronger teacher model, in contrast to RLVR-trained models, whose capacity remains bounded by the base.
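A rough sketch of the sampling efficiency gap mentioned in point 3, assuming ∆SE is taken as the difference between the base model's pass@k at a large budget and the RL-trained model's pass@1 (the paper's exact definition may differ in detail); it reuses the pass_at_k helper sketched earlier:

def sampling_efficiency_gap(base_correct, rl_correct, n=2048, k=256):
    # base_correct / rl_correct: per-problem counts of correct
    # generations out of n samples for the base and RL-trained model.
    base_pass_k = sum(pass_at_k(n, c, k) for c in base_correct) / len(base_correct)
    rl_pass_1 = sum(pass_at_k(n, c, 1) for c in rl_correct) / len(rl_correct)
    # A smaller gap means a single RL sample recovers more of the
    # coverage the base model reaches with k samples.
    return base_pass_k - rl_pass_1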

Q&A

01 Q

You're using pass@k instead of majority vote—doesn't that make the results invalid?

A

[1/3] We use pass@k not to measure practical utility, but to explore the reasoning capacity boundary of LLMs, as outlined in our paper.
[2/3] If a model can solve a difficult problem at least once in k samples, we consider that problem within its potential reasoning scope. If RL training truly expands reasoning, we would expect the RL model to solve more such problems than the base.
[3/3] However, we observe the opposite: RLVR models often solve fewer problems at large k, suggesting a narrowing of this boundary. This implies RLVR is optimizing within the base model's capabilities, not extending them.

02 Q

Isn't pass@k meaningless since you could eventually guess the right answer through randomly sampling k times?

A

[1/3] It's true that pass@1024 can be noisy for datasets like AIME, whose answers lie in a limited integer space. But we also evaluated coding benchmarks, where it is nearly impossible to pass the unit tests by guessing, and similar patterns hold, with base models performing better at large k.
[2/3] For AIME and GSM8K, we manually inspected the CoT outputs and found that, in most cases, the base model produced at least one correct reasoning path—not just lucky guesses.
[3/3] On datasets such as MATH500, where answers involve complex forms (roots, fractions, symbolic math) and are hard to guess, we again observed similar trends.
Together, these results highlight the often underappreciated reasoning potential of base models. We're considering including a random sampling baseline in future work.

03 Q

Even random sampling can eventually generate the correct answer with a large enough k. So doesn't that make your result, that the base model outperforms the RL-trained model in pass@k, meaningless?

A

Not quite. “More is different.” It's true that, in theory, even random typing has a non-zero chance of producing a correct answer: about 1/V^L, where V is the vocabulary size (~30k) and L is the output length (>200). In practice, however, that search space is astronomically large.
The key point is that the magnitude of that probability matters. If the base model has a prior that gives the correct answer a 1 in 10⁴ or 10⁵ chance, then RL might find it with millions of samples. But if that probability is 1 in 10¹⁰ or smaller, RL is extremely unlikely to escape local optima and reach meaningful reward.
In our paper, we show that for most problems, this probability is not negligible—we observe correct outputs with k = 128 or 1024, which is feasible with today's resources. So rather than being meaningless, pass@k reveals that base models already possess the necessary reasoning paths.
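To make the magnitude argument concrete, here is a back-of-the-envelope calculation using the rough figures quoted above (V ≈ 30k, L ≈ 200); the probabilities are illustrative, not measurements:

import math

V = 30_000  # rough vocabulary size
L = 200     # rough output length in tokens

# Uniform random token sampling: probability about 1 / V**L of producing
# one specific correct answer, i.e. roughly 10**-895.
print(f"random typing: ~10^{-L * math.log10(V):.0f}")

# If the base model's prior instead puts the correct answer at 1-in-10^3
# or 1-in-10^4, a budget of 1024 samples already has a real chance of
# hitting it at least once.
for p in (1e-3, 1e-4):
    print(f"p = {p:g}: P(>=1 hit in 1024 samples) = {1 - (1 - p) ** 1024:.3f}")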

04 Q

Isn't it common sense that RL should turn pass@k into pass@1?

A

It's not surprising that RLVR turns pass@k into pass@1—that's what RL is designed for.
But what's more interesting is that RLVR doesn't do much beyond that in our experiments. It doesn't seem to introduce new reasoning abilities: if the base model can't solve a problem, the RL-trained model still can't either. This clearly highlights the upper bound of RL in reasoning.
And that's not obvious. In traditional RL settings such as Atari or Go, RL is known to explore and discover new strategies, continuously self-improving without an inherent bound. But in the case of LLMs, RLVR seems constrained by the base model's existing capabilities.
In fact, the finding that RL-trained models perform worse than base models in pass@k at large k has surprised many researchers.

05 Q

Does your paper claim that RL can't incentivize reasoning beyond the base model?

A

No, we're not making such a strong claim. Our goal is to present systematic experiments and analyses exploring the question "Does RL truly expand reasoning capacity in LLMs?", and we hope to bring some new insights to the community.
We don't rule out the possibility that scaling up model size and training data could change the outcome. In fact, we're currently working on DeepSeek-V3-base vs. R1-zero to investigate this further.

06 Q

Is your paper saying RL is useless?

A

No. RL remains practically useful because it improves sample efficiency. However, if we want LLMs to solve truly harder problems beyond pretraining, we may need a new training paradigm that can go beyond the base model's ceiling.

07 Q

DeepSeek-Math reported similar results. How is your work different?

A

Yes, DS-Math did observe similar trends, but their study was limited to a single instruction-tuned model and two math benchmarks.
In contrast, our work systematically investigates this across true base models in a zero-RL setting, covering multiple LLM families and a wider range of benchmarks.
We also go deeper, providing further analyses of perplexity trends, different RL algorithms, and comparisons against distilled models, offering a more comprehensive view of RLVR's capabilities and limitations.
We believe the fact that the reasoning scope of RLVR models is bounded by the base model is a notable phenomenon that deserves deeper attention.

Experiments

We conducted experiments across three representative domains to evaluate the effect of RLVR on the reasoning ability boundaries of base and RLVR models.

Math

In the math experiments, we evaluate multiple LLM families (Qwen-2.5 and LLaMA-3.1) and their RL-trained variants on benchmarks like GSM8K, MATH500, and AIME24. We analyze pass@k curves to compare base and RL-trained models, observing that RL improves low-k performance but reduces problem coverage at high k. We manually inspect CoT validity to ensure correct answers stem from valid reasoning, not lucky guesses. Additionally, we examine Oat-Zero-trained models and filter guessable problems to focus on challenging cases. The results show base models maintain broader reasoning coverage despite RL's initial accuracy gains.
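As a rough sketch of how such pass@k curves can be produced from repeated sampling, reusing the pass_at_k helper sketched earlier (the per-problem counts below are made-up toy numbers, not our results):

def pass_at_k_curve(correct_counts, n=2048, ks=(1, 4, 16, 64, 256, 1024)):
    # Average the pass@k estimator over a benchmark, given the number of
    # correct generations out of n samples for each problem.
    return {k: sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
            for k in ks}

# Toy per-problem counts (hypothetical): the RL-trained model concentrates
# probability mass on fewer problems, the base model spreads it more widely.
base_counts = [1, 2, 6, 40, 300, 900]
rl_counts = [0, 0, 30, 400, 1200, 1800]

print("base:", {k: round(v, 3) for k, v in pass_at_k_curve(base_counts).items()})
print("rl:  ", {k: round(v, 3) for k, v in pass_at_k_curve(rl_counts).items()})
# Typical pattern: the RL curve is higher at k = 1, while the base curve
# overtakes it at large k because the base model solves more distinct problems.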

Coding

In the coding experiments, we evaluate the RLVR-trained model CodeR1-Zero-Qwen2.5-7B, derived from Qwen2.5-7B-Instruct-1M, on benchmarks like LiveCodeBench, HumanEval+, and MBPP+. We assess performance using pass@k metrics, measuring correctness based on predefined test cases. The results show RLVR improves single-sample pass@1 scores but reduces coverage at higher sampling counts (k = 128). The original model exhibits continued potential for improvement with larger k, while RLVR's performance plateaus. This indicates RLVR enhances deterministic accuracy but limits exploration diversity.
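For the coding benchmarks, a sample counts as correct only if it passes the predefined test cases. Below is a minimal, illustrative sketch of that judging step; real evaluation harnesses sandbox the execution and enforce time limits, and the entry-point name solve is an assumption for this example:

def passes_tests(candidate_src: str, test_cases) -> bool:
    # Execute the generated code and check it against the test cases.
    # exec() is used here only to keep the sketch short.
    namespace = {}
    try:
        exec(candidate_src, namespace)  # define the candidate's functions
        solve = namespace["solve"]      # assumed entry-point name
        return all(solve(*args) == expected for args, expected in test_cases)
    except Exception:
        return False

# Hypothetical generated sample and its unit tests.
sample = "def solve(a, b):\n    return a + b\n"
tests = [((1, 2), 3), ((-1, 1), 0)]
print(passes_tests(sample, tests))  # True, so this sample counts toward c

# The per-problem count of passing samples then feeds the same pass@k
# estimator used for the math benchmarks.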

Visual Reasoning

In the experiments on visual reasoning, we evaluate Qwen-2.5-VL-7B on filtered visual reasoning benchmarks (MathVista and MathVision), removing multiple-choice questions to focus on robust problem-solving. The improvements from RLVR in visual reasoning align with those seen in math and coding benchmarks, indicating that the original model already covers a broad range of solvable problems, even in multimodal tasks. The consistency across domains suggests that RLVR enhances reasoning capabilities without fundamentally altering the model's problem-solving approach.

Case Study

We present ONE of the sampled correct CoTs from the base model, manually selected from 2048 samples for the hardest questions in AIME24. The responses from the base model tend to be long CoTs and exhibit reflective behavior, highlighting the strong reasoning ability inherent in the base model.

Example

BibTeX

@article{yue2025limit-of-rlvr,
  title={Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?},
  author={Yue, Yang and Chen, Zhiqi and Lu, Rui and Zhao, Andrew and Wang, Zhaokai and Yue, Yang and Song, Shiji and Huang, Gao},
  journal={arXiv preprint arXiv:2504.13837},
  year={2025}
}