STARE uses surprisal metrics and selective advantage reweighting to maintain policy entropy stability across long training sequences while improving accuracy by 4–8%.
GRAIL uses gradient activation saliency to train relevant reasoning steps more strongly than irrelevant tokens, achieving 3.60% accuracy improvement without separate process-level supervision.