In brief: LCLMs use an encoder-decoder architecture to compress the KV-cache at ratios of up to 1:16, with lower peak memory consumption and shorter processing time than previous methods.
Researchers have developed a compression architecture called Latent Context Language Models (LCLMs) that reduces the KV-cache of language models more efficiently than earlier approaches. The encoder-decoder compressors reach compression ratios of up to 1:16 without significant quality loss and speed up the processing of long input sequences.
The main challenge in processing long contexts in language models is the KV-cache (Key-Value Cache), whose memory requirements grow linearly with context length. Previous compression methods have drawbacks: they significantly degrade model quality or require substantial computation time to compress a single long prompt. Additionally, many methods assume that the input fits within the target model’s context window and are not compatible with modern production inference engines.
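To make the linear growth concrete, here is a back-of-the-envelope sketch in Python. The formula is the usual estimate for a transformer decoder (keys plus values, per layer, per KV head). The model shape in the example (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for illustration, not one reported for LCLMs.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache size in bytes for a standard transformer decoder.

    The leading factor of 2 accounts for storing both keys and values.
    All shape parameters are assumptions supplied by the caller.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


# Hypothetical model shape, for illustration only.
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)

for tokens in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(tokens, **SHAPE) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.1f} GiB")
```

Under these assumed numbers, each token costs 128 KiB of cache, so doubling the context length doubles the memory required.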
The researchers pursue an encoder-decoder approach: an encoder maps long token sequences to shorter sequences of latent embeddings, which a decoder then processes. Through a systematic architecture search that involved pretraining multiple variants, they developed the LCLM family, which pairs a 0.6B encoder with a 4B decoder. These models were continually pretrained on over 350 billion tokens at compression ratios of 1:4, 1:8, and 1:16.
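As a rough illustration of what these ratios mean for sequence length, the sketch below computes how many latent embeddings the decoder would see for a given input. The function name `latent_length` and the rounding-up of a partial final block are assumptions made for this example; the summary does not say how the models handle inputs that are not a multiple of the ratio.

```python
import math


def latent_length(num_tokens: int, ratio: int) -> int:
    """Number of latent embeddings for `num_tokens` input tokens at 1:`ratio`.

    Assumption: a partial final block still produces one latent (round up).
    """
    if num_tokens < 0 or ratio < 1:
        raise ValueError("num_tokens must be >= 0 and ratio must be >= 1")
    return math.ceil(num_tokens / ratio)


# Ratios named in the article, applied to a hypothetical 32,000-token prompt.
for ratio in (4, 8, 16):
    print(f"1:{ratio:<2} -> {latent_length(32_000, ratio)} latents")
```

For the hypothetical 32,000-token prompt, the decoder would process 8,000, 4,000, or 2,000 latent positions instead of the full token sequence.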
LCLMs improve the Pareto frontier across general task performance, compression speed, and peak memory utilization. As a practical application, they serve as efficient backbones for long-horizon agents, which can quickly scan long compressed contexts and adaptively expand relevant segments as needed. This makes it practical for such agents to select and retrieve focused information over extended periods.
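The scan-then-expand pattern can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the `Segment` class, the word-overlap `relevance` score, and the `budget` parameter are invented for illustration, and short strings take the place of latent embeddings. The summary does not describe how the actual agents decide which segments to expand.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    compressed: str  # stand-in for the segment's latent embeddings
    full_text: str   # original content, used only if the segment is expanded


def relevance(query: str, compressed: str) -> int:
    """Toy score: how many query words appear in the compressed view."""
    return len(set(query.lower().split()) & set(compressed.lower().split()))


def scan_and_expand(query: str, segments: list[Segment], budget: int = 1) -> list[str]:
    """Scan every compressed segment, then expand only the `budget` best matches."""
    ranked = sorted(
        range(len(segments)),
        key=lambda i: relevance(query, segments[i].compressed),
        reverse=True,
    )
    chosen = set(ranked[:budget])
    # Expanded segments contribute full text; the rest stay compressed.
    return [
        seg.full_text if i in chosen else seg.compressed
        for i, seg in enumerate(segments)
    ]


history = [
    Segment("build logs", "FULL: 4,000 lines of build output"),
    Segment("database migration error", "FULL: stack trace from the failed migration"),
    Segment("meeting notes", "FULL: notes from the planning meeting"),
]
context = scan_and_expand("why did the migration error occur", history, budget=1)
print(context)
```

The point of the design is that the cost of scanning scales with the compressed length, while the cost of full detail is paid only for the few segments the agent chooses to expand.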
Source: arxiv.org · Published June 7, 2026
Lumi AI News — AI-assisted curation in accordance with Art. 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.6.5.