The Bottom Line: KVarN reduces error accumulation when quantizing KV-caches to 2-bit precision through improved token-scale normalization and achieves state-of-the-art results on MATH500, AIME24, and HumanEval.
Researchers introduce KVarN, a new quantization method for KV-caches that significantly lowers error accumulation during autoregressive decoding in multi-step reasoning tasks. The method leverages Hadamard rotations and dual-scaling variance normalization.
While test-time scaling shows promise for improved reasoning in large language models, a memory bottleneck emerges as the KV-cache grows during long decoding sequences. KV-cache quantization can help address this, but previous methods are evaluated under static conditions that differ from autoregressive decoding.
The core problem: during autoregressive decoding, quantization errors accumulate across time steps. The authors trace the primary cause to inaccurate per-token scales whose errors amplify each other from one step to the next. KVarN tackles this with a calibration-free method that combines Hadamard rotation with dual-scaling variance normalization, applied along both axes of the K and V matrices. This combination corrects token-scale errors and substantially reduces error accumulation across multiple decoding steps.
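To make the two ingredients concrete, here is a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes that "dual-scaling" means one standard-deviation scale per token (row) followed by one per channel (column), and that values are rounded to a fixed four-level grid. The function names, the grid, and the matrix sizes are all hypothetical choices for illustration; the actual KVarN algorithm and its vLLM kernels may differ in every detail.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via the Sylvester construction (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def quantize_2bit(x: np.ndarray, eps: float = 1e-8):
    """Rotate, normalize along both axes, then round to 4 levels.

    Assumption (not from the paper): "dual-scaling" is read here as one
    per-token (row) scale followed by one per-channel (column) scale,
    both taken as standard deviations.
    """
    h = hadamard(x.shape[1])
    r = x @ h                                        # spread outlier channels across the head dimension
    row_scale = r.std(axis=1, keepdims=True) + eps   # one scale per token
    r = r / row_scale
    col_scale = r.std(axis=0, keepdims=True) + eps   # one scale per channel
    r = r / col_scale
    # Fixed uniform grid {-1.5, -0.5, 0.5, 1.5}: 4 levels = 2 bits per entry.
    codes = np.clip(np.floor(r) + 2, 0, 3).astype(np.uint8)
    return codes, row_scale, col_scale


def dequantize_2bit(codes, row_scale, col_scale):
    """Undo the grid mapping, both scalings, and the rotation."""
    r = (codes.astype(np.float64) - 1.5) * col_scale * row_scale
    return r @ hadamard(codes.shape[1]).T


rng = np.random.default_rng(0)
keys = rng.standard_normal((64, 128))   # 64 tokens, head dimension 128 (hypothetical sizes)
keys[:, 3] *= 20.0                      # one synthetic outlier channel
codes, row_scale, col_scale = quantize_2bit(keys)
keys_hat = dequantize_2bit(codes, row_scale, col_scale)
rel_err = np.linalg.norm(keys - keys_hat) / np.linalg.norm(keys)
print(f"relative reconstruction error: {rel_err:.3f}")
```

The sketch quantizes a single static matrix, so it does not reproduce the step-by-step error accumulation the paper studies. It only shows where a per-token scale enters the reconstruction, which is why an inaccurate scale affects every entry of that token's row.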
On the established benchmarks MATH500, AIME24, and HumanEval, KVarN achieves new state-of-the-art results at 2-bit precision. A vLLM implementation is available at https://github.com/huawei-csl/KVarN, so the method can be integrated directly into existing inference pipelines.
Source: arxiv.org · Published June 1, 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.2.9.