InfoKV: Entropy-Based KV-Cache Compression for Long Reasoning Sequences

26 June 2026
AI Models, Claude Code

InfoKV combines attention scores with entropy-based uncertainty signals to decide which KV-cache entries to keep, and outperforms purely attention-based compression methods on long reasoning tasks.

EvoEmbedding: Context-Dependent Embeddings for Long Sequences

23 June 2026
AI Models, Claude Code

EvoEmbedding maintains a latent memory that is updated during sequential processing, so the same query can receive different, context-dependent embeddings depending on what has been processed so far.
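
A toy Python sketch of the general idea, assuming a running latent memory updated as an exponential moving average over chunk vectors and a query embedding that simply adds a scaled copy of that memory. Both rules (`update_memory`, `contextual_embedding`, `embed_along_sequence`, `decay`, `mix`) are hypothetical stand-ins, not EvoEmbedding's architecture; the point is only that one fixed query yields different embeddings at different positions in a sequence.

```python
def update_memory(memory, chunk_vec, decay=0.8):
    """Blend a new chunk vector into the latent memory (exponential moving average)."""
    return [decay * m + (1 - decay) * c for m, c in zip(memory, chunk_vec)]

def contextual_embedding(query_vec, memory, mix=0.5):
    """Condition a fixed query vector on the current memory (hypothetical rule)."""
    return [q + mix * m for q, m in zip(query_vec, memory)]

def embed_along_sequence(query_vec, chunks, decay=0.8, mix=0.5):
    """Embed the same query after each chunk has been processed in order."""
    memory = [0.0] * len(query_vec)
    embeddings = []
    for chunk in chunks:
        memory = update_memory(memory, chunk, decay)
        embeddings.append(contextual_embedding(query_vec, memory, mix))
    return embeddings

# Usage: one query, two chunks with different content.
query = [1.0, 1.0]
chunks = [[1.0, 0.0], [0.0, 1.0]]
for step, emb in enumerate(embed_along_sequence(query, chunks)):
    print(step, [round(x, 3) for x in emb])
```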

MiniMax Sparse Attention: Efficient Long-Context Processing for Billion-Parameter Models

12 June 2026
AI Models, Claude Code

MiniMax Sparse Attention (MSA) reduces attention computation for million-token contexts by a factor of 28.4 through blockwise sparse selection, and turns this into practical speedups by co-designing the algorithm and its GPU kernel.
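
The following Python sketch shows generic blockwise sparse selection, not MSA's actual selection rule or kernel. Keys are grouped into fixed-size blocks, each block is scored by the dot product of the query with the block's mean key (an assumed scoring heuristic), and exact softmax attention runs only over the `top_k` best blocks. With `top_k` of `n` blocks selected, per-query attention work drops by roughly `n / top_k`, plus the cost of scoring the blocks. The function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def block_sparse_attention(query, keys, values, block_size, top_k):
    """Softmax attention restricted to the top_k key blocks (generic sketch, not MSA).

    Each block is scored by the query's dot product with the block's mean key;
    returns the attention output and the indices of the selected blocks.
    """
    n_blocks = math.ceil(len(keys) / block_size)
    block_scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        block_scores.append(dot(query, mean_key))
    best = sorted(range(n_blocks), key=lambda b: block_scores[b], reverse=True)
    selected = sorted(best[:top_k])
    # Exact attention, but only over token positions inside the selected blocks.
    idx = [i for b in selected
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    output = [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
              for d in range(len(values[0]))]
    return output, selected

# Usage: 8 keys in 4 blocks of 2; only the block aligned with the query is used.
keys = [[1, 0], [1, 0], [0, 1], [0, 1], [-1, 0], [-1, 0], [0, -1], [0, -1]]
out, chosen = block_sparse_attention([1.0, 0.0], keys, keys, block_size=2, top_k=1)
print(chosen, out)
```

The reported factor of 28.4 depends on MSA's own block sizes and selection budget, which this sketch does not reproduce.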

Hybrid LLMs Lose Long-Context Capabilities Through CoT Fine-Tuning

10 June 2026
AI Models, Claude Code

Chain-of-thought (CoT) fine-tuning degrades long-context retrieval in hybrid LLMs by distorting their query-key projections; QK-Restore repairs this without any additional training.

Lookahead Sparse Attention: DeepSeek-V4 Reduces KV-Cache to 13.5 Percent

10 June 2026
AI Models, Claude Code

Lookahead Sparse Attention (LSA) predicts in advance which context sections will be relevant and keeps only those in GPU memory, shrinking the KV-cache by over 86 percent (to 13.5 percent of its original size) without sacrificing accuracy.
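
The 13.5 percent figure translates directly into memory. The Python arithmetic below uses the standard size formula for a per-head KV-cache (one key and one value vector per layer, per KV head, per token) with a hypothetical configuration of 60 layers, 8 KV heads, head dimension 128, a one-million-token context and 16-bit entries; these numbers are illustrative and not taken from DeepSeek-V4. Under these assumptions the full cache is about 246 GB, and retaining 13.5 percent of it leaves about 33 GB, an 86.5 percent reduction.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Full KV-cache size: one key and one value vector per layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical configuration (illustrative only, not DeepSeek-V4's).
full = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128, seq_len=1_000_000)
retained = full * 0.135          # keep 13.5 percent of the entries
reduction = 1 - retained / full  # fraction of memory freed

print(f"{full / 1e9:.1f} GB -> {retained / 1e9:.1f} GB ({reduction:.1%} smaller)")
```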

Latent Context Language Models: Scalable KV-Cache Compression for Long Contexts

10 June 2026
AI Models, Claude Code

Latent Context Language Models (LCLMs) use an encoder-decoder architecture to compress KV-caches at ratios of up to 1:16, more efficiently than previous methods, while also reducing peak memory consumption and processing time.
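
A 1:16 ratio means each group of 16 token positions is represented by a single latent. The Python sketch below shows only that bookkeeping, using mean pooling as a hypothetical stand-in for the learned encoder that LCLMs use; `compress_latents` and its pooling rule are illustrative assumptions.

```python
def compress_latents(vectors, ratio=16):
    """Map every group of `ratio` token vectors to one latent vector.

    Mean pooling is a hypothetical stand-in for the learned encoder; a final
    partial group still becomes one latent.
    """
    latents = []
    for start in range(0, len(vectors), ratio):
        group = vectors[start:start + ratio]
        latents.append([sum(col) / len(group) for col in zip(*group)])
    return latents

# Usage: 40 one-dimensional token vectors -> ceil(40 / 16) = 3 latents.
tokens = [[float(i)] for i in range(40)]
latents = compress_latents(tokens)
print(len(latents), latents)
```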

Encoder-Decoder Architecture for Efficient Context Compression in LLMs

10 June 2026
AI Models, Claude Code

Encoder-decoder compressors with adaptive expansion improve on existing KV-cache compression methods in speed and memory efficiency without significant loss in quality.

Lumi AI News
