A CPU-based RL controller optimizes adaptive sampling during test-time scaling, reducing computational overhead and latency compared to heuristic methods.
PaW trains an environment model alongside the policy, reusing the same RL rollouts; this consistently improves agent performance without requiring an additional simulator or adding inference cost.
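
The summary above does not give PaW's model class, loss, or training schedule, so the following is only a minimal sketch of the general idea of reusing one rollout for two objectives. Everything in it is an assumption made for illustration: the toy 1-D environment, the placeholder `noisy_policy`, the linear next-state predictor, and the function names are hypothetical and are not taken from PaW.

```python
import random

def collect_rollout(policy, steps, seed=0):
    """Toy 1-D environment (hypothetical): next_state = state + action."""
    rng = random.Random(seed)
    state = rng.uniform(-1.0, 1.0)
    transitions = []
    for _ in range(steps):
        action = policy(state, rng)
        next_state = state + action
        reward = -abs(next_state)  # reward is highest when the state is near zero
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions

def noisy_policy(state, rng):
    # Placeholder policy: push the state toward zero, plus exploration noise.
    return -0.5 * state + rng.uniform(-0.5, 0.5)

def policy_signal(transitions):
    """Stand-in for the policy-side objective: mean reward over the rollout."""
    return sum(reward for _, _, reward, _ in transitions) / len(transitions)

def world_model_step(weights, transitions, lr=0.05):
    """One SGD pass of a linear next-state predictor over the same rollout."""
    w_s, w_a = weights
    for state, action, _reward, next_state in transitions:
        error = (w_s * state + w_a * action) - next_state
        w_s -= lr * error * state
        w_a -= lr * error * action
    return w_s, w_a

weights = (0.0, 0.0)
for iteration in range(200):
    rollout = collect_rollout(noisy_policy, steps=32, seed=iteration)
    mean_reward = policy_signal(rollout)          # would feed the RL update
    weights = world_model_step(weights, rollout)  # auxiliary loss on the same data
```

No extra environment interaction is collected for the model: each `(state, action, reward, next_state)` tuple is read once for the reward signal and once for the next-state prediction error. In this toy setting the true dynamics are linear, so the learned weights approach `(1.0, 1.0)`.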