EfficientRollout: Self-Speculative Decoding for Faster RL Rollouts

The key point: EfficientRollout uses self-speculative decoding that adapts to system utilization to reduce rollout latency in RL, without pretraining a separate drafter and without compromising the target model.

A new approach reduces the latency of rollout generation for reinforcement learning with large language models by up to 19.6 percent. The model itself serves as the fast draft generator, and speculation is activated only when it actually pays off.

Rollout generation is a bottleneck in reinforcement learning (RL) with large language models: because autoregressive decoding samples tokens one at a time, a few long output sequences determine overall latency. Speculative decoding (SD) is an established method for reducing latency with fixed models: a fast drafter proposes several tokens ahead, and the target model verifies them in a single parallel pass, accepting or rejecting each one while preserving the target distribution.
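To make the accept-or-reject step concrete, here is a minimal sketch of the standard speculative-sampling rule for a single drafted token. It is not code from EfficientRollout; the function names and the toy distributions in the usage note are illustrative assumptions.

```python
import random


def speculative_accept(p, q, drafted, rng):
    """Standard speculative-sampling rule for one drafted token.

    p: target distribution, q: drafter distribution (lists of probabilities),
    drafted: token index that was sampled from q.
    Returns (token, accepted).
    """
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted, True
    # On rejection, resample from the residual max(0, p - q), renormalized.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    r = rng.random() * sum(residual)
    cumulative = 0.0
    for token, weight in enumerate(residual):
        cumulative += weight
        if r < cumulative:
            return token, False
    # Floating-point fallback: return the token with the largest residual.
    return max(range(len(residual)), key=residual.__getitem__), False


def empirical_distribution(p, q, n=100_000, seed=0):
    """Draft from q, apply the rule, and measure the output frequencies."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n):
        drafted = rng.choices(range(len(q)), weights=q)[0]
        token, _ = speculative_accept(p, q, drafted, rng)
        counts[token] += 1
    return [c / n for c in counts]
```

Calling `empirical_distribution([0.6, 0.3, 0.1], [0.2, 0.5, 0.3])` should return frequencies close to the target distribution rather than the drafter's, which is what "preserving the target distribution" means in practice.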

Applying SD directly to RL rollouts fails, however, for two reasons: (1) The target policy changes during training, so a fixed drafter becomes increasingly misaligned and produces drafts that no longer match the current policy. (2) The batch size shrinks during rollout decoding, shifting the bottleneck toward memory; speculation that is always on, with a fixed draft length, does not match parallel verification to the compute capacity that is actually unused.
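The first problem can be illustrated numerically. Under standard speculative sampling, the probability that a drafted token is accepted equals the sum over tokens of min(p(x), q(x)), where p is the target policy and q the drafter. The sketch below uses made-up logits and a made-up drift direction (both assumptions, not data from the paper) to show acceptance falling as the policy moves away from a frozen drafter.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def acceptance_probability(p, q):
    """Expected per-token acceptance under speculative sampling."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


# Hypothetical values: the drafter stays frozen at these logits while
# training moves the policy's logits along `direction`.
drafter_logits = [2.0, 1.0, 0.5, 0.0]
direction = [-1.0, 0.5, 1.0, 1.5]


def drifted_acceptance(step):
    """Acceptance probability after the policy has drifted by `step`."""
    policy_logits = [l + step * d for l, d in zip(drafter_logits, direction)]
    return acceptance_probability(softmax(policy_logits), softmax(drafter_logits))
```

With no drift, `drifted_acceptance(0.0)` is 1; larger steps give strictly lower values for these toy logits, so more drafted tokens are thrown away.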

EfficientRollout is a system-aware framework for self-speculative decoding: the drafter is derived as a quantized version of the target model, so it stays coupled to the evolving policy without separate pretraining. An adaptive policy decides when to activate speculation and adjusts the draft length based on the verifier's acceptance rates. Speculation is activated only when spare compute capacity lets parallel verification deliver actual gains.
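The article does not describe how the activation policy or the draft-length adjustment is implemented. As a rough sketch of the general idea only, the hypothetical controller below gates speculation on a spare-compute signal and nudges the draft length up or down from a moving average of the acceptance rate. The class name, the signal, and every threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpeculationController:
    """Hypothetical controller; all defaults are illustrative assumptions."""
    draft_len: int = 4
    min_len: int = 1
    max_len: int = 8
    raise_above: float = 0.8      # lengthen drafts above this acceptance rate
    lower_below: float = 0.5      # shorten drafts below this acceptance rate
    min_spare_compute: float = 0.3
    ema_decay: float = 0.9
    acceptance_ema: float = 0.7

    def should_speculate(self, spare_compute_fraction: float) -> bool:
        # Gate: speculate only when enough compute is idle to verify drafts.
        return spare_compute_fraction >= self.min_spare_compute

    def update(self, accepted: int, proposed: int) -> int:
        """Fold one verification result into the average and adjust length."""
        if proposed <= 0:
            return self.draft_len
        rate = accepted / proposed
        self.acceptance_ema = (
            self.ema_decay * self.acceptance_ema + (1 - self.ema_decay) * rate
        )
        if self.acceptance_ema > self.raise_above:
            self.draft_len = min(self.max_len, self.draft_len + 1)
        elif self.acceptance_ema < self.lower_below:
            self.draft_len = max(self.min_len, self.draft_len - 1)
        return self.draft_len
```

A decoding loop would call `should_speculate(...)` before each step and `update(accepted, proposed)` after each verification pass. The moving average keeps a single unlucky step from changing the draft length.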

In experiments, EfficientRollout reduces pure rollout latency by up to 19.6 percent and end-to-end latency (including training) by up to 12.7 percent compared to an optimized autoregressive baseline system, while maintaining final model quality. The approach solves the distribution-matching problem through self-quantization and the memory-compute problem through adaptive gating.
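For readers who think in speedup factors rather than percentage reductions, the conversion is a one-liner. The helper name below is an assumption; the two percentages are the ones reported above.

```python
def speedup_from_reduction(reduction_percent):
    """Convert a latency reduction in percent into a speedup factor."""
    return 1.0 / (1.0 - reduction_percent / 100.0)


rollout_speedup = speedup_from_reduction(19.6)     # rollout only
end_to_end_speedup = speedup_from_reduction(12.7)  # including training
```

This works out to roughly 1.24x for rollouts alone and roughly 1.15x end to end, both as best-case ("up to") figures.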


Source: arxiv.org · Published June 16, 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrasing and classification by Lumi News Pipeline v1.7.1.
