In a nutshell: Hidden-state alignment reduces sampling variance, closes the student-teacher gap more effectively, and trains with less memory and computational time than output-only distillation.
Researchers propose On-Policy Representation Distillation (OPRD), which trains student models not only on output probabilities but also by aligning their hidden representations with the teacher's, while using less memory and compute time.
Conventional On-Policy Distillation (OPD) supervises the student exclusively in output space, matching the next-token probability distributions of teacher and student. With large vocabularies such as Qwen's roughly 150,000 tokens, however, the Monte Carlo KL estimates this relies on introduce sampling variance that persists throughout training. In addition, the teacher's hidden states are discarded once the language model head has produced its output, so the teacher is used only as a black box.
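To make the variance argument concrete, here is a minimal sketch of how a sampled KL estimate differs from the exact sum over the vocabulary. It is not taken from the paper: the vocabulary size, the random logits, and the helper names (`softmax`, `exact_kl`, `mc_kl`) are all assumptions chosen for illustration.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_kl(p, q):
    """KL(p || q) computed as a full sum over the vocabulary (no sampling)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mc_kl(p, q, n_samples, rng):
    """Monte Carlo estimate of KL(p || q) from tokens sampled from p."""
    idx = rng.choices(range(len(p)), weights=p, k=n_samples)
    return sum(math.log(p[i] / q[i]) for i in idx) / n_samples


rng = random.Random(0)
vocab_size = 1000  # toy size (assumption); real vocabularies are far larger

# Hypothetical student and teacher next-token distributions.
student = softmax([rng.gauss(0.0, 1.0) for _ in range(vocab_size)])
teacher = softmax([rng.gauss(0.0, 1.0) for _ in range(vocab_size)])

kl = exact_kl(student, teacher)

# One sampled token per estimate, as in a single on-policy rollout step.
estimates = [mc_kl(student, teacher, 1, rng) for _ in range(2000)]
mean = sum(estimates) / len(estimates)
var = sum((e - mean) ** 2 for e in estimates) / (len(estimates) - 1)

print(f"exact KL:             {kl:.4f}")
print(f"mean of MC estimates: {mean:.4f}")
print(f"variance of single-sample estimates: {var:.4f}")
```

The sampled estimates average out near the exact value, but each individual estimate is noisy, which is the variance the brief refers to.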
OPRD addresses these problems by shifting distillation into hidden-state space: it aligns student and teacher representations at selected layers on the same rollouts, without going through the LM head. In theory, this eliminates sampling variance and provides structural information at each layer. In practice, OPRD closes the student-teacher gap on benchmarks such as AIME 2024/2025 and AIMO, whereas output-space OPD baselines plateau below teacher level.
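The idea of aligning hidden states at selected layers can be sketched as follows. This is an illustration only, not the OPRD objective: the brief does not specify the loss, so the cosine distance, the linear projection from student to teacher width, and every name here (`project`, `cosine_distance`, `alignment_loss`, `layer_pairs`) are assumptions.

```python
import math


def project(vec, matrix):
    """Map a student vector (width d_s) into teacher space with a d_t x d_s matrix."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-12)


def alignment_loss(student_hidden, teacher_hidden, layer_pairs, projections):
    """Average cosine distance over the chosen layer pairs and all tokens.

    student_hidden[layer][token] and teacher_hidden[layer][token] are vectors
    computed on the same rollout, so the loss is deterministic given the rollout.
    """
    total, count = 0.0, 0
    for (s_layer, t_layer), proj in zip(layer_pairs, projections):
        for s_vec, t_vec in zip(student_hidden[s_layer], teacher_hidden[t_layer]):
            total += cosine_distance(project(s_vec, proj), t_vec)
            count += 1
    return total / count


# Toy setup (all values hypothetical): 2 student layers of width 2,
# 4 teacher layers of width 3, two tokens per rollout.
proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_hidden = [
    [[1.0, 2.0], [0.5, -1.0]],
    [[2.0, 0.0], [-1.0, 3.0]],
]
unused = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
layer_pairs = [(0, 1), (1, 3)]  # student layer -> teacher layer
projections = [proj, proj]

# Teacher states that match the projected student states exactly.
teacher_aligned = [
    unused,
    [project(v, proj) for v in student_hidden[0]],
    unused,
    [project(v, proj) for v in student_hidden[1]],
]
# Teacher states pointing in the opposite direction.
teacher_flipped = [[[-x for x in v] for v in layer] for layer in teacher_aligned]

loss_aligned = alignment_loss(student_hidden, teacher_aligned, layer_pairs, projections)
loss_flipped = alignment_loss(student_hidden, teacher_flipped, layer_pairs, projections)
print(f"aligned: {loss_aligned:.6f}  flipped: {loss_flipped:.6f}")
```

Unlike a sampled KL estimate, this kind of loss involves no sampling over the vocabulary: evaluating it twice on the same rollout gives the same value.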
Regarding efficiency, OPRD trains 1.44× faster and requires 54% less memory than top-K OPD. Code is available on GitHub (github.com/ShenzhiYang2000/OPRD).
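As a quick reading aid, the two efficiency figures can be converted into relative costs. The sketch below assumes that "1.44× faster" is a throughput ratio against top-K OPD, which the brief does not state explicitly; the variable names are illustrative.

```python
speedup = 1.44           # reported: OPRD trains 1.44x faster than top-K OPD
memory_reduction = 0.54  # reported: 54% less memory than top-K OPD

# Assumption: "1.44x faster" means the same work takes 1/1.44 of the time.
time_fraction = 1.0 / speedup
time_saved = 1.0 - time_fraction
memory_fraction = 1.0 - memory_reduction

print(f"relative training time: {time_fraction:.1%} of top-K OPD")
print(f"training time saved:    {time_saved:.1%}")
print(f"relative memory:        {memory_fraction:.1%} of top-K OPD")
```

Under that reading, the time saving is smaller in percentage terms than the memory saving, since a 1.44× speedup corresponds to roughly 31% less training time.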
Source: arxiv.org · Published 3 June 2026
Lumi AI News — AI-assisted curation in accordance with Art. 50 EU AI Act. Paraphrase and classification via Lumi News Pipeline v1.2.9.