InfoKV combines attention scores with uncertainty signals for KV-cache compression, outperforming pure attention-based methods on long reasoning tasks by measurable margins.
Bebop uses rejection sampling and TV loss optimization to maintain stable MTP acceptance rates during RL training and accelerates rollouts by up to 1.8x.
LSA predicts relevant context sections in advance and retains only these in GPU memory, compressing the KV-cache by over 86 percent without sacrificing accuracy.
KVarN reduces error accumulation when quantizing KV-caches to 2-bit precision through improved token-scale normalization and achieves state-of-the-art results on MATH500, AIME24, and HumanEval.