REVES leverages intermediate steps from successful error corrections as separate training data, achieving better performance with less computational overhead than conventional multi-turn reinforcement learning methods.
KVarN improves token-scale normalization to reduce error accumulation when quantizing KV caches to 2-bit precision, and it reports state-of-the-art results on MATH500, AIME24, and HumanEval.
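To make the 2-bit KV-cache setting in the KVarN summary concrete, here is a minimal sketch of generic per-token asymmetric quantization, where each token's vector gets its own scale and offset. This is not KVarN's algorithm, whose normalization details are not given here; the function names (`quantize_token`, `dequantize_token`, `max_abs_error`) and the min/max scaling rule are assumptions chosen for illustration.

```python
from typing import List, Tuple

LEVELS = 4  # 2 bits give 4 representable codes: 0, 1, 2, 3


def quantize_token(vec: List[float]) -> Tuple[List[int], float, float]:
    """Quantize one token's vector to 2-bit codes using its own scale and offset.

    Illustrative min/max scheme (an assumption, not KVarN's method).
    """
    lo, hi = min(vec), max(vec)
    # A constant vector would give a zero scale, so fall back to 1.0.
    scale = (hi - lo) / (LEVELS - 1) or 1.0
    # Map each value to the nearest code and clamp to the valid range.
    codes = [min(LEVELS - 1, max(0, round((x - lo) / scale))) for x in vec]
    return codes, scale, lo


def dequantize_token(codes: List[int], scale: float, offset: float) -> List[float]:
    """Reconstruct approximate values from 2-bit codes."""
    return [c * scale + offset for c in codes]


def max_abs_error(vec: List[float]) -> float:
    """Largest reconstruction error for one token after a quantize/dequantize round trip."""
    codes, scale, offset = quantize_token(vec)
    restored = dequantize_token(codes, scale, offset)
    return max(abs(a - b) for a, b in zip(vec, restored))


if __name__ == "__main__":
    # Two hypothetical tokens with very different magnitudes.
    cache = [[0.1, 0.4, 0.9, 0.2], [-10.0, 0.0, 10.0, 20.0]]
    for token in cache:
        codes, scale, offset = quantize_token(token)
        print(codes, round(scale, 4), offset, round(max_abs_error(token), 4))
```

With a per-token scale, the rounding error for each token stays within half of that token's own step size, so a large-magnitude token does not inflate the error of a small-magnitude one. At only four levels the per-step error is still substantial, which is why error accumulation is a central concern at 2-bit precision.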
A CPU-based reinforcement learning (RL) controller optimizes adaptive sampling during test-time scaling, reducing computational overhead and latency compared to heuristic methods.