Similar teacher is better than a strong but different teacher

AI usage disclosure. I use AI to summarize papers after thorough discussion. The summaries on this page were drafted by Claude Opus 4.7 (or a stronger Claude model) and refined over a few rounds of human prompting.

There are papers showing that a “more similar supervision signal” is easier to learn from. In other words, the teacher shouldn’t be too strong or too distant from the student — even for dense per-token supervision like SFT, a large gap hurts absorption. I record papers possibly related to this observation here.

RL’s Razor: Why Online Reinforcement Learning Forgets Less

This paper (arXiv:2509.04259) argues that on-policy RL is implicitly biased toward solutions that stay close to the base model, which is why it forgets less than SFT.

Central claim: Among all the ways to solve a new task, on-policy RL is implicitly biased toward solutions that are closest in KL divergence to the base model. This bias is what makes RL forget less than SFT.

Key result: The forward KL divergence between the fine-tuned policy and the base policy, evaluated on the new task, is a strong single predictor of catastrophic forgetting across both RL and SFT (demonstrated on ParityMNIST), outperforming all tested alternatives (weight changes, representation shifts, reverse KL, total variation).
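To make the predictor concrete, here is a minimal sketch of a forward KL computed from per-token next-token distributions. The names and shapes are my own assumptions for illustration, not the paper's code:

```python
import numpy as np

def forward_kl(p_finetuned, p_base, eps=1e-12):
    """Forward KL D_KL(p_finetuned || p_base), averaged over token
    positions. Inputs are next-token distributions of shape
    (num_tokens, vocab_size), with rows summing to 1."""
    p_finetuned = np.clip(p_finetuned, eps, 1.0)
    p_base = np.clip(p_base, eps, 1.0)
    per_token = np.sum(p_finetuned * np.log(p_finetuned / p_base), axis=-1)
    return float(np.mean(per_token))
```

In practice the two distributions would come from softmaxed logits of the fine-tuned and base models on responses sampled from the new-task distribution.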

Co-Evolving Policy Distillation

Co-Evolving Policy Distillation (CoPD, arXiv:2604.27083) is a post-training framework for consolidating multiple expert capabilities (text, image, and video reasoning) into a single model.

Central claim: “Distillation among experts should occur during training rather than after it, and multiple expert models should serve as teachers and students to one another, co-evolving in synergy.”

Key result: CoPD breaks the conventional ceiling that a unified student cannot surpass its domain-specific experts, turning cross-domain trade-offs into mutual gains.

The authors identify two failures in existing paradigms: mixed-data RLVR suffers from a “capability divergence cost” (gradient conflicts between capabilities), while the static RLVR-then-OPD pipeline trains experts to convergence in isolation, causing the teacher and the student to drift too far apart for the distillation signal to be absorbed. A pilot study confirms this: the post-OPD gain correlates strongly with teacher–student top-$k$ token overlap, measured as the average fraction of tokens shared between the two models’ top-$k$ next-token sets, while standard RLVR drives this overlap monotonically downward into a low-absorption regime.
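A minimal sketch of such an overlap metric (my own formulation of “top-$k$ token overlap”; the paper’s exact definition may differ): for each token position, take the $k$ highest-scoring tokens under teacher and student, and average the fraction they share.

```python
import numpy as np

def topk_overlap(teacher_logits, student_logits, k=10):
    """Average fraction of tokens shared between the teacher's and the
    student's top-k next-token sets, per position.
    Shapes: (num_tokens, vocab_size)."""
    t_top = np.argsort(-teacher_logits, axis=-1)[:, :k]
    s_top = np.argsort(-student_logits, axis=-1)[:, :k]
    fracs = [len(set(t) & set(s)) / k for t, s in zip(t_top, s_top)]
    return float(np.mean(fracs))
```

An overlap of 1.0 means the two models agree on the candidate set at every position; an overlap near 0 is the low-absorption regime the pilot study describes.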

CoPD runs parallel branches from a shared base, alternating two phases per cycle. In Phase I, each branch performs GRPO steps on its own capability data (deepening expertise, opening the gap). In Phase II, each branch performs mutual OPD steps, generating rollouts on the other branch’s data and receiving token-level supervision from it (closing the gap). This keeps the top-$k$ overlap high throughout training. After the final cycle, the branches are combined via simple parameter merging.
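The alternating cycle can be sketched as follows. This is a toy rendering under my own assumptions (placeholder updates, a two-branch setup, uniform parameter averaging), not the paper’s implementation:

```python
import numpy as np

class Branch:
    """Toy stand-in for one expert branch; params is a flat vector."""
    def __init__(self, base_params):
        self.params = base_params.copy()

    def grpo_step(self, batch):
        # placeholder for a GRPO update on this branch's own data
        self.params = self.params + 0.01 * np.sign(batch)

    def rollout(self, batch):
        # placeholder: a real branch would generate responses here
        return batch

    def distill_step(self, rollout, teacher):
        # placeholder token-level supervision: move toward the teacher
        self.params = self.params + 0.1 * (teacher.params - self.params)

def copd_train(branches, data, num_cycles, grpo_steps, opd_steps):
    for _ in range(num_cycles):
        # Phase I: each branch deepens its own capability (opens the gap)
        for i, branch in enumerate(branches):
            for _ in range(grpo_steps):
                branch.grpo_step(data[i])
        # Phase II: mutual OPD on the *other* branch's data (closes the gap)
        for i, branch in enumerate(branches):
            teacher = branches[1 - i]  # two-branch case
            for _ in range(opd_steps):
                branch.distill_step(branch.rollout(data[1 - i]), teacher)
    # final consolidation via simple parameter merging (uniform average)
    return np.mean([b.params for b in branches], axis=0)
```

The key structural point is that Phase II supervision happens on rollouts the student generates itself, which is what keeps the teacher–student overlap from collapsing the way it does under pure RLVR.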

On Qwen3-VL-4B, CoPD improves overall accuracy in both the two-branch and three-branch settings, beating Mixed RLVR, Static OPD, and MOPD, and surpassing the domain-specific experts on their own domains. An ablation identifies the best-performing ratio of GRPO steps to OPD steps per cycle.

My question: what if we train for more epochs? The teacher–student gap should shrink as training progresses, which, by the paper’s own framework, could itself aid training. I am curious how the baseline student would perform after being trained for more epochs.