A similar teacher is better than a strong but different one
AI usage disclosure. I use AI to summarize papers after thorough discussion. The summaries on this page were drafted by Claude Opus 4.7 (or a stronger Claude model) and refined over a few rounds of human prompting.
There are papers showing that a “more similar supervision signal” is easier to learn from. In other words, the teacher shouldn’t be too strong or too distant from the student — even for dense per-token supervision like SFT, a large gap hurts absorption. I record papers possibly related to this observation here.
RL’s Razor: Why Online Reinforcement Learning Forgets Less
This paper (arXiv:2509.04259) argues that on-policy RL is implicitly biased toward solutions that stay close to the base model, which is why it forgets less than SFT.
Central claim: Among all the ways to solve a new task, on-policy RL is implicitly biased toward solutions that are closest in KL divergence to the base model. This bias is what makes RL forget less than SFT.
Key result: The forward KL divergence between the fine-tuned and the base policy, measured on the new task, is a single, consistent predictor of catastrophic forgetting across both RL and SFT (demonstrated on ParityMNIST), outperforming all tested alternatives (weight changes, representation shifts, reverse KL, total variation).
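To make the predictor concrete, here is a minimal PyTorch sketch of estimating this quantity: the per-token KL between the fine-tuned and base next-token distributions, averaged over new-task inputs. The Hugging Face-style model/batch interface, the names `finetuned_model` and `base_model`, and the masking scheme are my assumptions; the paper’s exact measurement protocol may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def forward_kl_on_task(finetuned_model, base_model, batch):
    """Estimate E_x[ KL(pi_ft(.|x) || pi_base(.|x)) ] on new-task inputs.

    `batch` is assumed to hold `input_ids` / `attention_mask` from the new
    task; both models share a tokenizer and return logits of shape
    (batch, seq_len, vocab), Hugging Face-style.
    """
    logp_ft = F.log_softmax(finetuned_model(**batch).logits, dim=-1)
    logp_base = F.log_softmax(base_model(**batch).logits, dim=-1)
    # Token-level KL(p_ft || p_base), summed over the vocabulary.
    kl = (logp_ft.exp() * (logp_ft - logp_base)).sum(dim=-1)
    # Mask padding positions, then average over all real tokens.
    mask = batch["attention_mask"].float()
    return (kl * mask).sum() / mask.sum()
```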
Co-Evolving Policy Distillation
Co-Evolving Policy Distillation (CoPD, arXiv:2604.27083) is a post-training framework for consolidating multiple expert capabilities (text, image, and video reasoning) into a single model.
Central claim: “Distillation among experts should occur during training rather than after it, and multiple expert models should serve as teachers and students to one another, co-evolving in synergy.”
Key result: CoPD breaks the conventional ceiling that a unified student cannot surpass its domain-specific experts, turning cross-domain trade-offs into mutual gains.
The authors identify two failures in existing paradigms: mixed-data RLVR suffers from a “capability divergence cost” (gradient conflicts between capabilities), while the static RLVR-then-OPD pipeline trains experts to convergence in isolation, causing the teacher and the student to drift too far apart for distillation to be absorbed. A pilot study confirms this: the post-OPD gain correlates strongly with the teacher–student top-$K$ token overlap $\rho$, measured as

$$\rho = \frac{1}{T}\sum_{t=1}^{T} \frac{\big|\operatorname{TopK}\big(\pi_{\mathrm{tea}}(\cdot \mid y_{<t})\big) \cap \operatorname{TopK}\big(\pi_{\mathrm{stu}}(\cdot \mid y_{<t})\big)\big|}{K},$$

while standard RLVR drives the overlap monotonically downward into a low-absorption regime.
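For concreteness, here is a small sketch of how such an overlap could be computed from raw logits; the choice of $K$, the masking, and the averaging are placeholders rather than the paper’s exact recipe.

```python
import torch

@torch.no_grad()
def topk_token_overlap(teacher_logits, student_logits, k=20, mask=None):
    """Fraction of top-k next-token candidates shared by teacher and student,
    averaged over positions. Both logits tensors: (batch, seq_len, vocab)."""
    t_top = teacher_logits.topk(k, dim=-1).indices  # (B, T, k)
    s_top = student_logits.topk(k, dim=-1).indices  # (B, T, k)
    # For each position, count teacher candidates that also appear in the
    # student's top-k: broadcast to (B, T, k, k), reduce over the last axis.
    shared = (t_top.unsqueeze(-1) == s_top.unsqueeze(-2)).any(-1).sum(-1)
    overlap = shared.float() / k  # (B, T), in [0, 1]
    if mask is not None:
        return (overlap * mask).sum() / mask.sum()
    return overlap.mean()
```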
CoPD runs parallel branches from a shared base and alternates two phases per cycle. Phase I: each branch performs GRPO steps on its own capability data (deepening expertise, opening the gap). Phase II: each branch performs mutual OPD steps, generating rollouts on the other branch’s data and receiving token-level supervision from it (closing the gap). This keeps the overlap high throughout training. After the final cycle, the branches are combined via simple parameter merging; a schematic sketch of the schedule follows below.
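This is my reading of the schedule in code form: `grpo_step`, `opd_step`, and `merge_parameters` are hypothetical stand-ins (stubbed out below), not the paper’s API, and all step counts are illustrative.

```python
# Schematic sketch of the CoPD schedule described above. The three helpers
# are hypothetical placeholders, not the paper's actual implementation.

def grpo_step(model, batch):
    ...  # placeholder: one GRPO update on the branch's own capability data

def opd_step(student, teacher, prompts):
    ...  # placeholder: student rolls out on `prompts`, gets token-level
         # supervision from `teacher`

def merge_parameters(branches):
    ...  # placeholder: e.g., average branch weights into one model

def copd_train(branches, data, num_cycles, grpo_steps, opd_steps):
    """branches: models initialized from the same shared base;
    data[i]: capability-specific dataset for branch i."""
    for _ in range(num_cycles):
        # Phase I: each branch deepens its own expertise with GRPO
        # (this widens the teacher-student gap).
        for i, branch in enumerate(branches):
            for _ in range(grpo_steps):
                grpo_step(branch, data[i])
        # Phase II: mutual on-policy distillation (this closes the gap).
        # Each branch rolls out on the *other* branch's data and is
        # supervised token-by-token by that branch as teacher.
        for i, student in enumerate(branches):
            for j, teacher in enumerate(branches):
                if i != j:
                    for _ in range(opd_steps):
                        opd_step(student, teacher, data[j])
    # After the final cycle, combine the branches by parameter merging.
    return merge_parameters(branches)
```

The interleaving is the point: Phase II runs before the branches can drift out of each other’s top-$K$ support, which is exactly the failure mode the static RLVR-then-OPD pipeline hits.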
On Qwen3-VL-4B, CoPD achieves the best overall accuracy in both the two-branch and three-branch settings, beating Mixed RLVR, Static OPD, and MOPD, and surpassing the domain-specific experts on their own domains. The ratio of GRPO steps to OPD steps per cycle also matters, with a single best-performing setting.
My question: What if we train for more epochs? The teacher–student gap should shrink as training progresses, which — according to the paper’s framework — could itself help training. I am curious about the baseline performance of the student after being trained for more epochs.