Variable-Width Transformers: Non-Uniform Capacity Distribution Across Layers

June 17, 2026
AI Models

Because different layers perform different roles, parameters and compute could be distributed non-uniformly across layers rather than fixed at a constant architectural width.

OPRD: Representation Distillation with Hidden States Outperforms Output-Only Method

June 5, 2026
AI Models, Claude Code

Hidden-state alignment reduces sampling variance, closes the student-teacher gap more effectively, and requires less memory and compute time to train than output-only distillation.

ThoughtFold: Shortened Reasoning Chains through Preference Learning

June 4, 2026
AI Models, Claude AI

ThoughtFold identifies and removes redundant exploration steps in reasoning chains, reducing token consumption by 56% for DeepSeek-R1-Distill-Qwen-7B while maintaining state-of-the-art accuracy.

Lumi AI News
