iLLaDA demonstrates that fully bidirectional diffusion training from scratch can be a competitive path to strong language models, even without autoregressive training.
Blackwell’s 180–268 GB memory per GPU enables larger batch sizes and longer sequences during model training, reducing communication overhead and allowing single-node training for models that previously required multi-node setups.
Structured curriculum learning strategies that leverage task relationships in latent space achieve better downstream performance than pure difficulty prioritization.
Uniform 4-bit formats eliminate the systematic shrinkage bias of E2M1 in FP4 LLM training and enable consistently better convergence across all model sizes.
STARE uses surprisal metrics and selective advantage reweighting to maintain policy entropy stability across long training sequences while improving accuracy by 4–8%.
A self-learning framework for code-repair agents uses the agents' own solution traces to generate targeted training tasks, achieving higher accuracy than previous approaches.
Hidden-state alignment reduces sampling variance, closes the student-teacher gap more effectively, and uses less memory and compute time during training than output-only distillation.
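
To make the 4-bit format comparison above more concrete, the following is a minimal sketch of round-to-nearest quantization onto the FP4 E2M1 magnitude grid (0, 0.5, 1, 1.5, 2, 3, 4, 6) and onto an evenly spaced grid with the same number of levels. The function names, the uniform grid over [0, 6], the absmax scaling, and the tie-breaking rule are all assumptions made for illustration; none of them is taken from the work summarized here, and the script does not reproduce any training result. It only gives a way to inspect how each grid changes magnitudes on average for a given sample.

```python
import random

# Non-negative magnitudes representable in FP4 E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

# Hypothetical comparison grid: 8 evenly spaced magnitudes over the same
# range [0, 6], i.e. a symmetric integer grid 0..7 rescaled by 6/7.
UNIFORM_MAGNITUDES = [6.0 * i / 7 for i in range(8)]


def quantize(x, magnitudes, scale=1.0):
    """Round x/scale to the nearest grid magnitude, keep the sign, rescale.

    Values beyond the largest magnitude saturate. Exact ties go to the
    smaller magnitude (an assumed rule; hardware may round ties differently).
    """
    target = min(abs(x) / scale, magnitudes[-1])
    nearest = min(magnitudes, key=lambda m: abs(m - target))
    return (nearest if x >= 0 else -nearest) * scale


def mean_magnitude_change(values, magnitudes, scale):
    """Average of |q(x)| - |x|; a negative result means average shrinkage."""
    total = sum(abs(quantize(v, magnitudes, scale)) - abs(v) for v in values)
    return total / len(values)


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
    # Absmax scaling: map the largest sample magnitude to the grid maximum.
    scale = max(abs(v) for v in samples) / 6.0
    print("E2M1   :", mean_magnitude_change(samples, E2M1_MAGNITUDES, scale))
    print("uniform:", mean_magnitude_change(samples, UNIFORM_MAGNITUDES, scale))
```

Running the script prints the mean signed change in magnitude for each grid on Gaussian samples. Swapping in a different sample distribution, scaling rule, or rounding mode (for example stochastic rounding) is a natural way to probe how sensitive the comparison is to these assumed choices.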