REVES treats the intermediate steps of successful error corrections as separate training data, achieving better performance at lower computational cost than conventional multi-turn reinforcement learning methods.
EfficientRollout uses self-speculative decoding with adaptive system utilization to reduce rollout latency in RL, without pretraining a separate drafter or compromising the target model.
Bebop uses rejection sampling and total variation (TV) loss optimization to keep multi-token prediction (MTP) acceptance rates stable during RL training, accelerating rollouts by up to 1.8x.
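
As background for the last entry, the link between rejection sampling, TV distance, and acceptance rate can be illustrated with the standard speculative-sampling rule. This is a minimal sketch of that textbook rule, not Bebop's implementation. The names `acceptance_rate`, `tv_distance`, and `speculative_step`, and the toy distributions `target` and `draft`, are all hypothetical. Under this rule, the expected acceptance rate of a drafted token equals one minus the TV distance between the draft and target distributions, which is one plausible reason a TV-based loss would help keep acceptance rates stable.

```python
import random

def acceptance_rate(p, q):
    """Expected probability that a token drafted from q is accepted under target p."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def speculative_step(p, q, rng):
    """One step of standard speculative sampling (assumed rule, not Bebop-specific).

    Draw a token from the draft distribution q, accept it with probability
    min(1, p[x] / q[x]); on rejection, resample from the residual max(0, p - q).
    Returns (token, accepted).
    """
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    y = rng.choices(range(len(p)), weights=residual)[0]
    return y, False

# Toy next-token distributions over a three-token vocabulary (illustrative values).
target = [0.5, 0.3, 0.2]
draft = [0.4, 0.4, 0.2]

rate = acceptance_rate(target, draft)
tv = tv_distance(target, draft)
print(f"acceptance rate: {rate:.3f}, 1 - TV: {1 - tv:.3f}")
```

In this sketch, shrinking the TV distance between drafter and target raises the acceptance rate directly, so a drafter that drifts from the target policy during RL training would see its acceptance rate fall unless the gap is controlled.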