In a nutshell: JetSpec overcomes scaling limits of speculative decoding through parallel tree drafting with causal conditioning, achieving up to 9.64x speedup in LLM inference.

Researchers from UC Berkeley and Alibaba introduce JetSpec, a speculative decoding framework for LLMs that uses parallel tree drafting to get past the scaling limits of earlier methods. It reaches speedups of up to 9.64x on the MATH-500 benchmark on H100 GPUs.

Speculative decoding accelerates autoregressive language models by drafting several tokens ahead and verifying them in parallel. The approach hits a scaling wall, however: a larger draft budget yields speed gains only if the acceptance rate remains high and the drafting overhead stays low. Previous methods face a trade-off. Autoregressive drafters generate path-conditioned candidates with high acceptance, but their cost grows with tree depth. Bidirectional block diffusion drafters, on the other hand, generate all positions in one pass but produce branch-agnostic marginals: tokens that are individually plausible yet mutually inconsistent, which wastes the draft budget.
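The scaling wall can be illustrated with a simple back-of-the-envelope cost model. The sketch below is not taken from the JetSpec paper. It assumes a single draft chain (not a tree), an independent per-token acceptance probability `alpha`, and a hypothetical `draft_cost` expressed as a fraction of one target-model forward pass. All function and parameter names are illustrative.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step when k tokens are drafted.

    Assumes each drafted token is accepted independently with probability
    alpha; the verifier always contributes one extra token.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha: float, k: int, draft_cost: float, one_pass: bool) -> float:
    """Idealized speedup over plain decoding (one target forward per token).

    draft_cost is the cost of one drafter forward pass relative to one
    target forward pass (an assumed, hypothetical parameter).
    A sequential drafter pays draft_cost once per drafted token;
    a one-pass drafter pays it once per step.
    """
    drafting = draft_cost if one_pass else draft_cost * k
    return expected_tokens(alpha, k) / (1 + drafting)


if __name__ == "__main__":
    for k in (2, 4, 8, 16):
        seq = speedup(alpha=0.6, k=k, draft_cost=0.1, one_pass=False)
        par = speedup(alpha=0.6, k=k, draft_cost=0.1, one_pass=True)
        print(f"k={k:2d}  sequential={seq:.2f}x  one-pass={par:.2f}x")
```

In this toy model, a sequential drafter with moderate acceptance gets slower as the budget grows, because drafting cost rises linearly while the expected accepted length saturates. A one-pass drafter avoids the linear cost, but it only benefits from the larger budget if acceptance stays high.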

JetSpec combines the efficiency of one-forward drafting with branch-wise causal conditioning. The system trains a causal parallel draft head over fused hidden states of the frozen target model and generates candidate trees whose scores align with the autoregressive factorization of the target model. This enables JetSpec to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedups.
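A toy example may help show why path-conditioned scoring uses a draft budget better than branch-agnostic marginals. The sketch below is purely illustrative and is not JetSpec's draft head: the two-token vocabulary, the probabilities, and the helper names are all made up for this example. It ranks two-token candidate paths either by the product of per-position marginals or by the autoregressive chain rule, p(first) * p(second | first).

```python
from itertools import product

# Hypothetical target-model distributions for a two-token continuation.
first = {"New": 0.6, "Los": 0.4}
second_given_first = {
    "New": {"York": 0.9, "Angeles": 0.1},
    "Los": {"York": 0.1, "Angeles": 0.9},
}


def marginal_second() -> dict:
    """Marginal distribution of the second token, ignoring which branch led to it."""
    out = {}
    for f, pf in first.items():
        for s, ps in second_given_first[f].items():
            out[s] = out.get(s, 0.0) + pf * ps
    return out


def top_paths(score, budget: int) -> list:
    """Keep the `budget` highest-scoring two-token paths under `score`."""
    paths = product(first, ["York", "Angeles"])
    return sorted(paths, key=score, reverse=True)[:budget]


m2 = marginal_second()

# Branch-agnostic: each position is scored on its own.
branch_agnostic = top_paths(lambda p: first[p[0]] * m2[p[1]], budget=2)

# Path-conditioned: the second token is scored given the first (chain rule).
path_conditioned = top_paths(
    lambda p: first[p[0]] * second_given_first[p[0]][p[1]], budget=2
)

if __name__ == "__main__":
    print("branch-agnostic :", branch_agnostic)
    print("path-conditioned:", path_conditioned)
```

With a budget of two paths, the marginal-based ranking spends one slot on "New Angeles", a pair of individually likely tokens that the target would rarely accept together. The chain-rule ranking keeps "New York" and "Los Angeles", the two paths the target actually favors.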

Evaluations on dense and MoE variants of Qwen3 models show consistent advantages over bidirectional and tree-based speculative decoding baselines across math, coding, and chat benchmarks. On H100 GPUs, JetSpec achieves up to 9.64x speedup on MATH-500 and 4.58x on open conversation workloads. The authors also report latency improvements under realistic serving loads through a vLLM integration. Code and models are available at https://github.com/hao-ai-lab/JetSpec.

Source: arxiv.org · Published June 24, 2026
Lumi AI News — AI-assisted curation in accordance with Art. 50 EU AI Act. Paraphrase and classification via Lumi News Pipeline v1.7.1.

JetSpec: Parallel Tree Drafting Overcomes Bottleneck in Speculative Decoding
