In a nutshell: LSA predicts relevant context sections in advance and retains only these in GPU memory, compressing the KV-cache by over 86 percent without sacrificing accuracy.
Researchers have developed Lookahead Sparse Attention (LSA), an inference method for DeepSeek-V4 that drastically reduces GPU memory consumption for long contexts. The method reduces the KV-cache to an average of 13.5 percent of its original size while maintaining model accuracy.
Conventional large language models load the complete key-value cache during decoding, which causes significant GPU memory bottlenecks when processing ultra-long contexts. Lookahead Sparse Attention does not passively attend to all historical tokens. Instead, it proactively predicts which context sections are relevant to the query and retains only these critical KV chunks in GPU memory.
The method uses a Neural Memory Indexer based on the DeepSeek-V4 architecture. Key feature: the indexer is trained feedback-free. Because it is formulated as a dual encoder, it works with standard retrieval training frameworks and does not require the massive backbone model to be loaded into GPU memory, which makes training significantly more efficient.
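To make the selection step concrete, the following is a minimal sketch of dual-encoder style chunk selection, not the researchers' implementation. It assumes the query and each context chunk have already been encoded into vectors; the function names `cosine` and `select_chunks`, the use of cosine similarity, and the fixed `keep_ratio` are all illustrative assumptions, and the hand-written vectors stand in for the output of a learned indexer.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_chunks(query_vec, chunk_vecs, keep_ratio=0.135):
    """Return the indices of the chunks to keep, in their original order.

    Hypothetical helper: scores every chunk embedding against the query
    embedding and keeps the top `keep_ratio` share (at least one chunk).
    """
    k = max(1, math.ceil(len(chunk_vecs) * keep_ratio))
    scores = [cosine(query_vec, c) for c in chunk_vecs]
    ranked = sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy embeddings standing in for encoder output (assumed values).
query = [1.0, 0.0]
chunks = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
kept = select_chunks(query, chunks, keep_ratio=0.5)
print(kept)  # [0, 2]
```

In this toy setup only the KV entries belonging to chunks 0 and 2 would stay in GPU memory. Because the query side and the chunk side are scored independently of the backbone model, a scorer of this shape can in principle be trained with ordinary retrieval tooling, which is the property the article highlights.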
Evaluations across established benchmarks (LongBench-v2, LongMemEval, RULER) show that the physical KV-cache footprint is compressed to an average of 13.5 percent of its original size, while downstream accuracy is maintained or increases by an average of 0.6 percent. At extreme context lengths of 500K tokens, FlashMemory-DeepSeek-V4 reduces KV-cache overhead by over 90 percent without destabilizing the model’s reasoning capabilities.
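The relationship between the retained share and the headline reduction is simple arithmetic, sketched below. The 100 GiB baseline is a hypothetical figure chosen for illustration and does not come from the paper; only the 13.5 percent retained share is taken from the reported results.

```python
def compressed_footprint(baseline_gib, retained_fraction):
    """Return (retained size in GiB, reduction in percent) for a KV-cache.

    `baseline_gib` is an assumed uncompressed cache size; `retained_fraction`
    is the share of the cache that stays in GPU memory.
    """
    retained_gib = baseline_gib * retained_fraction
    reduction_pct = (1.0 - retained_fraction) * 100.0
    return retained_gib, reduction_pct


retained, reduction = compressed_footprint(100.0, 0.135)
print(f"{retained:.1f} GiB retained, {reduction:.1f}% reduction")
```

Retaining 13.5 percent of the cache corresponds to a reduction of 86.5 percent, which matches the "over 86 percent" figure quoted at the top of the article.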
Source: arxiv.org · Published June 7, 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification via Lumi News Pipeline v1.6.5.