InfoKV combines attention scores with uncertainty signals for KV-cache compression, outperforming pure attention-based methods on long reasoning tasks by measurable margins.
Encoder-decoder compressors with adaptive expansion improve KV-cache compression methods in speed and memory efficiency without significant quality loss.
VaSE achieves higher accuracy than existing sparse-attention methods at 4x KV-cache compression, thereby reducing the memory bottleneck of reasoning models.