In a nutshell: GRAIL uses gradient activation saliency to weight relevant reasoning steps more heavily than irrelevant tokens during training, achieving a 3.60% accuracy improvement without separate process-level supervision.
Researchers present GRAIL, a method for optimizing large language models for mathematical reasoning that weights individual tokens differently instead of treating an entire sequence uniformly, eliminating the need for expensive process reward models.
Reinforcement learning with verifiable rewards, using algorithms such as GRPO, has become the standard method for improving mathematical reasoning in large language models. However, previous approaches typically either assign a single sequence-level advantage value equally to all tokens or rely on computationally intensive process reward models (PRMs) for step-by-step supervision. The uniform assignment assumes that all tokens contribute equally to the final outcome.
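To make the uniform assignment concrete, here is a minimal sketch of a GRPO-style group-relative advantage that is copied unchanged onto every token of a response. The function names, the toy rewards, and the token counts are illustrative assumptions, not taken from the GRAIL paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against the group it was sampled with."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(advantage, num_tokens):
    """Uniform assignment: every token receives the same sequence-level value."""
    return [advantage] * num_tokens

# Toy group of four sampled answers: 1.0 = verified correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [5, 3, 4, 6]  # hypothetical response lengths

advantages = group_relative_advantages(rewards)
token_advantages = [
    broadcast_to_tokens(a, n) for a, n in zip(advantages, token_counts)
]
```

In this sketch a filler word and a decisive deduction inside the same response end up with an identical advantage, which is the behavior the article describes.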
The problem with this equal treatment is that erroneous reasoning steps and filler words receive the same gradient intensity as the logical conclusions that actually matter. This dilutes the training signal because critical and non-critical tokens are updated with equal force. GRAIL addresses this with token-wise advantage reweighting via gradient activation saliency, which assigns higher weight to tokens to which the final answer is locally sensitive.
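The reweighting idea can be illustrated with a small sketch. This summary does not give GRAIL's exact saliency definition or normalization, so the code below assumes a common gradient-times-activation score (the absolute dot product of each token's activation vector and the gradient with respect to it) and rescales the weights so that they average to 1. All names and numbers are hypothetical.

```python
def saliency_scores(gradients, activations):
    """Assumed score per token: |gradient . activation|."""
    return [
        abs(sum(g * a for g, a in zip(g_vec, a_vec)))
        for g_vec, a_vec in zip(gradients, activations)
    ]

def reweight_advantage(advantage, scores, eps=1e-8):
    """Scale one sequence-level advantage per token; weights average to ~1."""
    avg = sum(scores) / len(scores)
    weights = [s / (avg + eps) for s in scores]
    return [advantage * w for w in weights]

# Toy values for a three-token response with two hidden dimensions.
activations = [[0.2, -0.1], [1.0, 0.5], [0.0, 0.1]]
gradients = [[0.1, 0.1], [0.8, -0.2], [0.05, 0.0]]

scores = saliency_scores(gradients, activations)
token_advantages = reweight_advantage(1.0, scores)
```

Because the weights average to roughly 1, the total update magnitude for the sequence stays close to the uniform case while the middle token, which has the largest assumed saliency, receives most of it. Whether GRAIL normalizes this way is not stated in the source.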
Source: arxiv.org · Published June 2, 2026
Lumi AI News — AI-assisted curation pursuant to Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.2.9.