JetSpec overcomes scaling limits of speculative decoding through parallel tree drafting with causal conditioning, achieving up to 9.64x speedup in LLM inference.
EfficientRollout uses self-speculative decoding with adaptive system utilization to reduce rollout latency in RL scenarios without separate drafter pretraining or jeopardizing the target model.
AWS has developed P-EAGLE, a parallelized variant of speculative decoding that generates draft tokens in a single forward pass instead of sequentially, achieving inference throughput improvements of up to 1.69x on SageMaker AI.
Bebop uses rejection sampling and TV loss optimization to maintain stable MTP acceptance rates during RL training and accelerates rollouts by up to 1.8x.