Structured curriculum learning strategies that leverage task relationships in latent space achieve better downstream performance than pure difficulty prioritization.
STARE uses surprisal metrics and selective advantage reweighting to maintain policy entropy stability across long training sequences while improving accuracy by 4–8%.
RACES enables automatic composition of verifiable environments through recursive combination, with DeepSeek-R1-Distill-Qwen-14B improving by 3.1 points and Qwen3-14B by 2.3 points across six benchmarks.
FlowTracer models information propagation as a directed graph and derives token credits from global flow structure to precisely concentrate reinforcement learning signals on critical reasoning steps.
FlowTracer assigns credit to tokens based on their measured information throughput in the attention graph rather than treating all tokens equally, yielding consistent performance gains in reasoning tasks.
Reasoning Arena replaces uninformative rewards with head-to-head comparisons of solution attempts and reduces required compute time by 27–41%.
RL environments with software bugs (stale caches, reward hacks, false state transitions) generate toxic training data that sabotages agent training, so systematic quality validation is necessary.
CHERRL enables reproducible analysis of reward hacking mechanisms through controlled bias injection and automatic detection of exploitation onset in LLM-based training.
ThoughtFold identifies and removes redundant exploration steps in reasoning chains, reducing token consumption by 56% for DeepSeek-R1-Distill-Qwen-7B while maintaining state-of-the-art accuracy.
GRAIL uses gradient activation saliency to weight relevant reasoning steps more heavily than irrelevant tokens during training, achieving a 3.60% accuracy improvement without separate process-level supervision.
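
The FlowTracer and GRAIL items above both concentrate the training signal on some tokens more than others. The sketch below illustrates that general idea only: a policy-gradient surrogate loss in which each token's contribution is scaled by a credit weight. It is not the algorithm from either paper. The function names (normalize_credits, token_weighted_loss) and the mean-one normalization are assumptions made for illustration, and the credit scores are taken as given, whether they come from an attention-flow measure, a saliency measure, or anything else.

```python
from typing import List


def normalize_credits(scores: List[float]) -> List[float]:
    """Rescale non-negative credit scores so that they average to 1.0.

    Hypothetical helper: mean-one weights leave the overall loss scale
    unchanged relative to uniform weighting. All-zero scores fall back
    to uniform weights.
    """
    if not scores:
        return []
    if any(s < 0 for s in scores):
        raise ValueError("credit scores must be non-negative")
    total = sum(scores)
    n = len(scores)
    if total == 0:
        return [1.0] * n
    return [s * n / total for s in scores]


def token_weighted_loss(
    log_probs: List[float], advantage: float, scores: List[float]
) -> float:
    """Policy-gradient surrogate loss with per-token credit weights.

    log_probs: log-probability of each sampled token under the policy.
    advantage: one sequence-level advantage shared by all tokens.
    scores:    per-token credit scores (assumed to be supplied externally).
    """
    if len(log_probs) != len(scores):
        raise ValueError("log_probs and scores must have the same length")
    if not log_probs:
        return 0.0
    weights = normalize_credits(scores)
    weighted = sum(w * advantage * lp for w, lp in zip(weights, log_probs))
    return -weighted / len(log_probs)


if __name__ == "__main__":
    log_probs = [-1.0, -2.0]
    # Uniform credit reproduces the plain token-averaged loss.
    print(token_weighted_loss(log_probs, 1.0, [1.0, 1.0]))
    # Shifting credit to the second token increases its share of the loss.
    print(token_weighted_loss(log_probs, 1.0, [1.0, 3.0]))
```

With uniform scores the function reduces to the ordinary token-averaged loss, so any change in behavior comes entirely from how the credit scores are distributed across tokens.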