Different layers perform different roles and could therefore enable non-uniform distribution of parameters and computational resources as an alternative to constant architectural width.
Hidden-state alignment reduces sampling variance, closes the student-teacher gap more effectively, and trains with less memory and computational time than output-only distillation.
ThoughtFold identifies and removes redundant exploration steps in reasoning chains, reducing token consumption by 56% for DeepSeek-R1-Distill-Qwen-7B while maintaining state-of-the-art accuracy.