REVES reuses the intermediate steps of successful error corrections as separate training examples, outperforming conventional multi-turn reinforcement learning methods at lower computational cost.
RACES enables automatic composition of verifiable environments through recursive combination, with DeepSeek-R1-Distill-Qwen-14B improving by 3.1 points and Qwen3-14B by 2.3 points across six benchmarks.
FlowTracer models information propagation as a directed graph and derives token credits from global flow structure to precisely concentrate reinforcement learning signals on critical reasoning steps.
Reasoning Arena replaces uninformative rewards with head-to-head comparisons of solution attempts and reduces required compute time by 27 to 41 percent.
Streaming-based multi-agent reasoning reduces latency through pipelining and also improves accuracy, because earlier, more reliable reasoning steps guard against errors in later ones.
GRAIL uses gradient-activation saliency to weight relevant reasoning steps more heavily than irrelevant tokens during training, achieving a 3.60% accuracy improvement without separate process-level supervision.
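
As a rough illustration of the saliency-weighting idea in the GRAIL summary above, the Python sketch below scores each token by the magnitude of the dot product between its gradient and its activation, then uses the normalized scores to weight a per-token loss. It is not GRAIL's implementation. The gradient-times-activation score, the `floor` parameter, the mean-one normalization, and the function names `saliency_weights` and `weighted_loss` are all assumptions made for illustration.

```python
from typing import List, Sequence


def saliency_weights(
    grads: Sequence[Sequence[float]],
    acts: Sequence[Sequence[float]],
    floor: float = 0.1,
) -> List[float]:
    """Return one weight per token, averaging to 1.0 over the sequence.

    Hypothetical scoring rule: saliency_i = |grad_i . act_i|.
    `floor` (an assumed knob) keeps low-saliency tokens from being
    ignored entirely.
    """
    raw = [
        abs(sum(g * a for g, a in zip(g_vec, a_vec)))
        for g_vec, a_vec in zip(grads, acts)
    ]
    n = len(raw)
    total = sum(raw)
    if n == 0:
        return []
    if total == 0.0:
        # No saliency signal at all: fall back to uniform weighting.
        return [1.0] * n
    # Mix a uniform floor with the saliency share, scaled so the mean is 1.
    return [floor + (1.0 - floor) * n * r / total for r in raw]


def weighted_loss(token_losses: Sequence[float], weights: Sequence[float]) -> float:
    """Average the per-token losses after applying the saliency weights."""
    return sum(l * w for l, w in zip(token_losses, weights)) / len(token_losses)


# Toy example: three tokens with two-dimensional gradients and activations.
grads = [[1.0, 0.0], [0.0, 0.0], [2.0, 1.0]]
acts = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
weights = saliency_weights(grads, acts)
loss = weighted_loss([1.0, 1.0, 1.0], weights)
```

In this toy input the third token has the largest gradient-activation product and therefore receives the largest weight, while the second token, whose gradient is zero, falls back to the floor. Because the weights average to one, the overall loss scale stays comparable to an unweighted average.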