Is RL just distribution sharpening?

Starting Point: ICLR26 Oral — Reasoning With Sampling

The author proposes a method for sampling directly from the base model that achieves performance similar to the post-trained model on reasoning benchmarks. The metric is Pass@1. The novelty is two-fold:

  1. The base model now matches the post-trained model on Pass@1. Previous works mainly show that post-trained models have sharpened sampling distributions, so their output traces are not diverse enough and they are inferior to their base counterparts only on Pass@K.
  2. A practical sampling algorithm to achieve strong Pass@1 performance.

My Questions

  1. The experiments are done on Qwen2.5-7B, Qwen2.5-Math-7B, and Phi-3.5-mini-instruct. Benchmarks are MATH500, HumanEval, GPQA, and AlpacaEval2. What about stronger models and more challenging benchmarks?
  2. Some reasoning formats are not visible in the pre-training data — for instance, tool-use data and interaction traces with RL environments. Can we generate environment-interaction data with a sampling strategy and train our model on this data only with NTP loss? What if we do a pre-training run on all data (including the previous post-training data)? Is pre-training all we need?
  3. How is the current RL cold start done? Does this provide a better way to cold-start the RL training run?
  4. What does RL provide besides distribution sharpening? Even if sampling from the base model has similar performance to RL models on benchmarks, can we say RL is “useless”? What is the utility of distribution sharpening — for example, faster search?

What this paper does

It proposes power sampling and compares it against low-temperature sampling.

The basic idea of power sampling is that

$$p_\alpha(x) = \frac{p(x)^\alpha}{\sum_{x'} p(x')^\alpha},$$

that is, we can “sharpen” a distribution by exponentiating the numerator and denominator together.

Figure 1: Distribution sharpening
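
As a quick illustration (my own sketch, not code from the paper), here is this sharpening operation in numpy. Note that for a single categorical distribution, raising to the power $\alpha$ is exactly temperature sampling with $T = 1/\alpha$; the two methods only diverge once future tokens enter the picture, as worked out below.

```python
import numpy as np

def sharpen(p: np.ndarray, alpha: float) -> np.ndarray:
    """p_alpha(x) = p(x)^alpha / sum_x' p(x')^alpha."""
    w = p ** alpha      # exponentiate the numerator...
    return w / w.sum()  # ...and the denominator together

p = np.array([0.5, 0.3, 0.2])
print(sharpen(p, 2.0))  # [0.658 0.237 0.105]: mass concentrates on the mode
print(sharpen(p, 0.5))  # [0.415 0.322 0.263]: alpha < 1 flattens instead
```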

Now, we want to sharpen the distribution of the next token $x_t$ given the prefix $x_{<t}$.

Viewed at the level of the whole sequence, this distribution is a marginal (suppose the sequence length is $T$):

$$p(x_t \mid x_{<t}) = \sum_{x_{t+1:T}} p(x_{t:T} \mid x_{<t}).$$

Note that $\sum_{x_{t+1:T}}$ is just sugar. The expanded version is:

$$\sum_{x_{t+1:T}} = \sum_{x_{t+1} \in V} \sum_{x_{t+2} \in V} \cdots \sum_{x_T \in V},$$

where $V$ is the vocabulary.
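
A tiny sketch of how that nested sum expands in code (mine; the uniform toy model is an assumption for illustration):

```python
import itertools

V = ["a", "b", "c"]  # hypothetical 3-token vocabulary

def p_seq(seq):
    # toy sequence model: tokens are uniform, so p(seq) = (1/|V|)**len(seq)
    return (1 / len(V)) ** len(seq)

def marginal(x_t, remaining):
    # sum over x_{t+1:T}: itertools.product performs one nested sum over V
    # per remaining position
    return sum(p_seq((x_t, *tail))
               for tail in itertools.product(V, repeat=remaining))

print(marginal("a", remaining=2))  # 0.333...: marginalizing recovers p(x_t) = 1/3
```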

Power sampling does exponentiation then sum; low-temperature sampling does sum then exponentiation. For the unnormalized weight of a candidate next token $x_t$:

$$w_{\text{power}}(x_t) = \sum_{x_{t+1:T}} p(x_{t:T} \mid x_{<t})^{\alpha}, \qquad w_{\text{low-temp}}(x_t) = \Bigl(\sum_{x_{t+1:T}} p(x_{t:T} \mid x_{<t})\Bigr)^{\alpha} = p(x_t \mid x_{<t})^{\alpha}.$$

The advantage of power sampling over low-temperature sampling is illustrated by an example in the paper.1

1 Example: power vs. low-temperature sampling.

Define four two-token sequence probabilities:

$$p(ab) = 0.40, \qquad p(ba) = p(bb) = p(bc) = 0.15.$$

These sum to $0.85$ (imagine the remaining $0.15$ of mass sitting on some unlisted outcome; it does not affect the argument).

The marginals over the first token are:

$$p(a\,\cdot) = 0.40, \qquad p(b\,\cdot) = 0.45.$$

Under the base $p$, $b$ has the higher marginal ($0.45 > 0.40$), so $p$ prefers $b$ as the first token.

Low-temperature ($\alpha = 2$, exponent of sums):

$$p(a\,\cdot)^2 = 0.16 < p(b\,\cdot)^2 = 0.2025,$$

so low-temperature sampling still prefers $b$.

Power ($\alpha = 2$, sum of exponents):

$$p(ab)^2 = 0.16 > p(ba)^2 + p(bb)^2 + p(bc)^2 = 3 \times 0.0225 = 0.0675,$$

so power sampling prefers $a$.

Analysis: If you commit to $a$, you land on the highest-likelihood full sequence available ($ab$ with probability $0.40$). If you commit to $b$, the best you can do is $ba$, $bb$, or $bc$ at $0.15$ each. Power sampling correctly picks the token that leads to the best final sequence; low-temperature sampling picks the token with the most total future mass, which happens to be split into mediocre futures.
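
A minimal check of the arithmetic above (my own sketch; the sequence names and numbers follow the reconstruction in this footnote):

```python
# Two-token sequence probabilities from the example.
p = {("a", "b"): 0.40,
     ("b", "a"): 0.15,
     ("b", "b"): 0.15,
     ("b", "c"): 0.15}
alpha = 2.0

def power_weight(first):
    # exponentiate each full-sequence probability, then sum
    return sum(prob ** alpha for seq, prob in p.items() if seq[0] == first)

def lowtemp_weight(first):
    # sum first (the token marginal), then exponentiate
    return sum(prob for seq, prob in p.items() if seq[0] == first) ** alpha

for first in ("a", "b"):
    print(first, round(power_weight(first), 4), round(lowtemp_weight(first), 4))
# a 0.16   0.16    -> power prefers "a" (one strong future)
# b 0.0675 0.2025  -> low temperature prefers "b" (more total, but mediocre, mass)
```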

Experiment Results

The experiments claim two things:

  1. The sampling algorithm matches the RL model's Pass@1 performance.
  2. It maintains sampling diversity and does not compromise Pass@K performance.

Table 1: Credit: caption from the original paper. Power sampling matches and even outperforms GRPO across model families and tasks. Bold marks the better of power sampling vs. GRPO; underlines mark cases where power sampling beats GRPO.

| Method | MATH500 | HumanEval | GPQA | AlpacaEval2.0 |
| --- | --- | --- | --- | --- |
| **Qwen2.5-Math-7B** | | | | |
| Base | 0.496 | 0.329 | 0.278 | 1.61 |
| Low-temperature | 0.690 | 0.512 | 0.353 | 2.09 |
| Power Sampling (ours) | 0.748 | **0.573** | 0.389 | **2.88** |
| GRPO (MATH) | **0.785** | 0.537 | **0.399** | 2.38 |
| **Qwen2.5-7B** | | | | |
| Base | 0.498 | 0.329 | 0.278 | 7.05 |
| Low-temperature | 0.628 | 0.524 | 0.303 | 5.29 |
| Power Sampling (ours) | 0.706 | **0.622** | 0.318 | **8.59** |
| GRPO (MATH) | **0.740** | 0.561 | **0.354** | 7.62 |
| **Phi-3.5-mini-instruct** | | | | |
| Base | 0.400 | 0.213 | 0.273 | 14.82 |
| Low-temperature | 0.478 | 0.585 | 0.293 | 18.15 |
| Power Sampling (ours) | **0.508** | **0.732** | **0.364** | **17.65** |
| GRPO (MATH) | 0.406 | 0.134 | 0.359 | 16.74 |

Figure 2: Credit: caption from the original paper. Pass@$k$ performance on MATH500. Power sampling (ours) and RL (GRPO) are plotted relative to the base model (Qwen2.5-Math-7B). Our curve is strictly better than both GRPO and the base model, and our pass rate at high $k$ matches the base model, demonstrating sustained generation diversity.