InfoKV combines attention scores with uncertainty signals for KV-cache compression, outperforming pure attention-based methods on long reasoning tasks by measurable margins.
MSA reduces attention computation for million-token contexts by a factor of 28.4 through blockwise sparse selection and achieves practical speedups via co-design of algorithm and GPU kernel.
LSA predicts relevant context sections in advance and retains only these in GPU memory, compressing the KV-cache by over 86 percent without sacrificing accuracy.
LCLMs compress KV-caches through encoder-decoder architecture up to 1:16 more efficiently than previous methods while reducing peak memory consumption and processing time.
Encoder-decoder compressors with adaptive expansion improve KV-cache compression methods in speed and memory efficiency without significant quality loss.