The Point: ServiceNow AI resolved a critical train-inference mismatch problem during migration from vLLM V0 to V1. Precise log-probability calculations are essential for consistent reinforcement learning training dynamics.
ServiceNow AI is migrating its reinforcement learning pipeline from vLLM V0 to V1. The central challenge is computing log-probabilities exactly, because even small deviations can alter training dynamics.
The ServiceNow AI pipeline uses vLLM as the inference engine that generates training rollouts. The engine returns each sampled token together with its log-probability, and the trainer then uses those values to compute policy ratios, KL divergence, clipping rate, entropy, and reward. If the engine's log-probabilities differ from what the trainer expects for the same tokens, each of these quantities shifts and training dynamics change with them. This is the train-inference mismatch problem that had to be resolved as part of the vLLM V0 to V1 migration. The migration therefore requires careful validation of every numerical computation that feeds these log-probabilities.
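To make the mismatch concrete, the following minimal sketch compares per-token log-probabilities reported by an inference engine with those recomputed by a trainer for the same tokens. It is an illustration under stated assumptions, not the ServiceNow AI implementation: the function name `mismatch_stats`, the `clip_eps` threshold of 0.2, and the choice of the non-negative "k3" KL estimator are all hypothetical choices made here, and the inputs are plain lists of floats rather than real vLLM output.

```python
import math


def mismatch_stats(trainer_logprobs, engine_logprobs, clip_eps=0.2):
    """Summarize how far trainer log-probs drift from engine log-probs.

    Both arguments are per-token log-probabilities for the SAME sampled
    tokens. All names and the default clip_eps are illustrative assumptions.
    """
    if len(trainer_logprobs) != len(engine_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if not trainer_logprobs:
        raise ValueError("log-prob sequences must not be empty")

    n = len(trainer_logprobs)
    # Log of the policy ratio: log p_trainer(token) - log p_engine(token).
    diffs = [t - e for t, e in zip(trainer_logprobs, engine_logprobs)]
    ratios = [math.exp(d) for d in diffs]

    # k3 estimator of KL(engine || trainer) from tokens sampled by the
    # engine: (r - 1) - log r, which is >= 0 for every token.
    approx_kl = sum((r - 1.0) - d for r, d in zip(ratios, diffs)) / n

    # Fraction of tokens whose ratio falls outside [1 - eps, 1 + eps],
    # i.e. tokens a PPO-style clipped objective would clip.
    clip_fraction = sum(1 for r in ratios if abs(r - 1.0) > clip_eps) / n

    return {
        "mean_ratio": sum(ratios) / n,
        "max_abs_diff": max(abs(d) for d in diffs),
        "approx_kl": approx_kl,
        "clip_fraction": clip_fraction,
    }


if __name__ == "__main__":
    # With perfectly matching log-probs the ratio is 1 and nothing is clipped.
    matched = mismatch_stats([-1.0, -2.0, -0.5], [-1.0, -2.0, -0.5])
    print(matched)

    # A single drifting token is enough to move every statistic.
    drifted = mismatch_stats([-1.0, -2.0], [-1.0, -2.5])
    print(drifted)
```

When the two sets of log-probabilities agree, the ratio stays at 1 and both the KL estimate and the clip fraction are zero. Any nonzero values on freshly sampled rollouts, before a policy update has been applied, would point to a numerical difference between engine and trainer rather than to learning.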