
VaSE: Stochastic KV-Cache Eviction for Reasoning Models

Bottom line: VaSE achieves higher accuracy than existing sparse-attention methods at 4x KV-cache compression, thereby reducing the memory bottleneck of reasoning models.

Reasoning models require large amounts of memory for the KV-cache due to their lengthy outputs. A new training-free method called VaSE protects important value states from eviction and uses stochasticity to improve cache efficiency while maintaining accuracy.

Reasoning models, which solve complex problems through extended chains of thought, face a central challenge: their lengthy outputs place high demands on memory and compute, particularly because the KV-cache (key-value cache) grows with every generated token during inference. Previous KV-cache eviction methods try to remove less important key-value pairs from the cache, but they often achieve lower accuracy than sparse-attention methods that retain the full cache.
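To give a sense of scale, the size of a KV-cache can be estimated with simple arithmetic. The sketch below uses the standard formula (keys plus values, per layer, per KV head, per token); the model dimensions in the example are illustrative assumptions, not the configuration of any model named in the article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Estimate KV-cache size in bytes for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to 16-bit floats.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model dimensions, chosen only for illustration.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)
print(full / 2**30)        # size in GiB with the full cache

# With 4x compression, only a quarter of the token positions are kept.
compressed = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                            seq_len=32768 // 4)
print(compressed / 2**30)  # size in GiB after 4x compression
```

Under these assumed dimensions the full cache for one 32,768-token sequence is 4 GiB, and 4x compression brings it down to 1 GiB.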

A new study identifies two critical factors for successful KV-cache eviction. First, a small fraction of value states have unusually large magnitudes, and evicting them leads to catastrophic failure: models enter repetition loops in their reasoning. Second, introducing stochasticity into the eviction process improves accuracy by increasing cache diversity.

Based on these insights, the study proposes Value-aware Stochastic KV Cache Eviction (VaSE). The method requires no additional training and relies on two mechanisms: it protects value states with large magnitudes from eviction, and it makes eviction decisions stochastically rather than deterministically. In tests with Qwen3 models across six reasoning tasks, VaSE at 4x KV-cache compression achieves higher average accuracy than state-of-the-art sparse-attention selection methods at the same sparsity. Compared to the strongest existing eviction method, accuracy improves by over 4 percentage points.
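The two mechanisms can be illustrated with a minimal sketch. It is not the authors' implementation: the article does not say how importance is scored, how many value states are protected, or how the random choice is made. Here the function name `select_kept_indices`, the `protect_frac` parameter, the use of the L2 norm as the magnitude, and sampling in proportion to an importance score are all assumptions made for illustration.

```python
import math
import random


def value_norm(v):
    """L2 norm of one value vector (assumed stand-in for 'magnitude')."""
    return math.sqrt(sum(x * x for x in v))


def select_kept_indices(values, scores, budget, protect_frac=0.05, rng=None):
    """Pick which cached token positions to keep (hypothetical sketch).

    values: list of value vectors, one per cached token
    scores: per-token importance (e.g. accumulated attention; an assumption)
    budget: number of tokens the cache may hold after eviction
    """
    rng = rng or random.Random()
    n = len(values)
    if budget >= n:
        return list(range(n))

    # Mechanism 1: never evict the value states with the largest norms.
    num_protected = min(budget, max(1, int(protect_frac * n)))
    by_norm = sorted(range(n), key=lambda i: value_norm(values[i]), reverse=True)
    kept = set(by_norm[:num_protected])

    # Mechanism 2: fill the remaining budget by weighted random sampling
    # without replacement, instead of a deterministic top-k over scores.
    candidates = [i for i in range(n) if i not in kept]
    weights = [max(scores[i], 1e-12) for i in candidates]
    while len(kept) < budget:
        j = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        kept.add(candidates.pop(j))
        weights.pop(j)
    return sorted(kept)


# Usage: 16 cached tokens, keep 4 (4x compression). Token 5 has a
# large-magnitude value but a low importance score.
values = [[0.1, -0.2]] * 16
values[5] = [9.0, -7.5]
scores = [1.0] * 16
scores[5] = 0.01
kept = select_kept_indices(values, scores, budget=4, rng=random.Random(0))
print(kept)  # always contains index 5; the other three vary with the seed
```

A deterministic top-k over `scores` would evict token 5 in this example; protecting it by value norm keeps it in the cache regardless of its score, while the sampled remainder differs from run to run.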

VaSE is compatible with FlashAttention2 and gives reasoning models a static memory footprint, easing the trade-off between efficiency gains and accuracy loss that earlier eviction methods faced.


Source: arxiv.org · Published June 2, 2026
Lumi AI News — AI-assisted curation in accordance with Art. 50 EU AI Act. Paraphrase and classification via Lumi News Pipeline v1.2.9.
