
iLLaDA: 8B Language Model Trained with Bidirectional Diffusion

Bottom Line: iLLaDA demonstrates that fully bidirectional diffusion training from scratch can be a competitive path to strong language models, even without autoregressive training.

Researchers present iLLaDA, an 8-billion-parameter language model trained with fully bidirectional attention and a masked diffusion objective rather than the typical autoregressive approach. The model was pretrained on 12 trillion tokens and then fine-tuned on a 25-billion-token instruction corpus.

iLLaDA uses the same masked diffusion objective during both pretraining and supervised fine-tuning (SFT). It relies on fully bidirectional attention instead of the causally masked attention that is standard in modern large language models. In addition, the researchers implemented variable generation lengths for efficiency and introduced a confidence-based scoring scheme for multiple-choice tasks.
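To make the masked diffusion objective more concrete, here is a minimal sketch in plain Python. It assumes the general formulation used by LLaDA-style models: sample a mask ratio t, mask each token independently with probability t, and score only the masked positions with a 1/t weight. Whether iLLaDA uses exactly this weighting is an assumption, and the names `MASK`, `mask_tokens`, `masked_diffusion_loss`, and `uniform_log_prob` are hypothetical. The predictor receives the whole partially masked sequence, which mirrors bidirectional attention: context to the left and to the right of a masked position is visible.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask token, not taken from the iLLaDA release


def mask_tokens(tokens, t, rng):
    """Mask each token independently with probability t.

    Returns the noisy sequence and a list of booleans marking masked positions.
    """
    flags = [rng.random() < t for _ in tokens]
    noisy = [MASK if masked else tok for tok, masked in zip(tokens, flags)]
    return noisy, flags


def masked_diffusion_loss(tokens, noisy, flags, log_prob_fn, t):
    """One-sample estimate of the loss for a single sequence at mask ratio t.

    log_prob_fn(noisy, i, target) must return log p(target at position i | noisy).
    It sees the full noisy sequence, so it can use context on both sides.
    Only masked positions contribute; the sum is weighted by 1/t and
    normalized by sequence length (an assumed, LLaDA-style weighting).
    """
    total = 0.0
    for i, (target, masked) in enumerate(zip(tokens, flags)):
        if masked:
            total += log_prob_fn(noisy, i, target)
    return -total / (t * len(tokens))


def uniform_log_prob(noisy, i, target, vocab_size=4):
    """Stand-in for a real model: every token in the vocabulary is equally likely."""
    return -math.log(vocab_size)


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["the", "cat", "sat", "down"]
    t = rng.uniform(0.1, 1.0)  # mask ratio drawn per sequence; lower bound avoids t near 0
    noisy, flags = mask_tokens(tokens, t, rng)
    print(noisy, masked_diffusion_loss(tokens, noisy, flags, uniform_log_prob, t))
```

In a real training loop, `uniform_log_prob` would be replaced by the network's predicted log-probabilities, and the loss would be averaged over many sequences and mask ratios.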

Empirical results show consistent improvements: iLLaDA-Base improves by 21.6 points on the BBH benchmark and by 14.9 points on ARC-Challenge compared to the previous LLaDA model. The Instruct version achieves gains of 14.5 points on MATH and 16.5 points on HumanEval. These gains are observed across general, mathematical, and code benchmarks.

Notably, despite its non-autoregressive training, iLLaDA is competitive with Qwen2.5 7B on multiple benchmarks. This suggests that bidirectional diffusion can serve as a viable alternative to the established autoregressive factorization. The research thus challenges the assumption that causal attention and autoregressive decoding are the only paths to powerful language models. Model weights and code are available on GitHub.


Source: arxiv.org · Published June 23, 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.1.
