In brief: Encoder-decoder compressors with adaptive expansion improve on existing KV-cache compression methods in speed and memory efficiency without significant quality loss.

Anthropic researchers have developed Latent Context Language Models (LCLMs), an architecture for compressing long input contexts that outperforms existing KV-cache compression methods on the accuracy-efficiency frontier. The method reduces memory bottlenecks when processing long contexts.

Long-context inference with large language models is limited by the growing KV-cache: memory consumption grows linearly with every additional token. Previous KV-cache compression techniques either incur significant quality loss or require long processing times for a single long prompt. Moreover, many methods are incompatible with modern production inference engines.
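To make the memory pressure concrete, here is a back-of-the-envelope sketch of KV-cache size as a function of context length. The model dimensions (32 layers, 8 key-value heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Approximate KV-cache size in bytes for a single sequence.

    All default dimensions are hypothetical and chosen only for illustration.
    """
    # Each token stores one key and one value vector per layer and KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


for n in (8_000, 32_000, 128_000):
    print(f"{n:>7} tokens: {kv_cache_bytes(n) / 2**30:.2f} GiB")
```

Under these assumptions every token adds the same fixed amount (128 KiB), so a context four times as long needs four times the cache.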

The work revisits encoder-decoder-based compression methods, in which long token sequences are mapped to shorter sequences of latent embeddings. The researchers conducted an architecture search and trained several variants from scratch. Based on these findings, they pre-trained a family of models with 0.6-billion-parameter encoders and 4-billion-parameter decoders on more than 350 billion tokens in total, at compression ratios of 1:4, 1:8, and 1:16.
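As a rough illustration of what these ratios mean for sequence length, the sketch below computes how many latent embeddings the decoder would see for a given number of input tokens. The helper name `latent_length` and the choice to round up a partial final chunk are assumptions made for this example; the article does not say how the models handle remainders.

```python
import math


def latent_length(num_tokens, ratio):
    """Number of latent embeddings for num_tokens at a 1:ratio compression.

    Rounding up for a partial final chunk is an assumption of this sketch.
    """
    return math.ceil(num_tokens / ratio)


for ratio in (4, 8, 16):
    print(f"1:{ratio:<2} -> {latent_length(128_000, ratio)} latents for 128000 tokens")
```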

The resulting LCLMs improve the Pareto frontier across multiple dimensions: general task performance, compression speed, and maximum memory consumption. The work demonstrates that LCLMs serve as efficient backbones for longer-horizon agents: the agent can search through compressed long contexts and adaptively expand relevant segments as needed.
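The search-then-expand pattern can be sketched with a toy example. This is not the paper's method: the "compressed" form of each segment here is just a set of its words standing in for latent embeddings, and `Segment`, `compress`, and `expand_relevant` are hypothetical names. The point is only the control flow, in which the agent scores cheap compressed segments and restores full detail for the few that look relevant.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str           # full-resolution content, used only after expansion
    summary: frozenset  # toy stand-in for the segment's latent embeddings


def compress(text, segment_words=8):
    """Split text into fixed-size word segments, each with a cheap summary."""
    words = text.split()
    segments = []
    for i in range(0, len(words), segment_words):
        chunk = words[i:i + segment_words]
        summary = frozenset(w.lower().strip(".,?") for w in chunk)
        segments.append(Segment(" ".join(chunk), summary))
    return segments


def expand_relevant(segments, query, top_k=1):
    """Return the full text of the top_k segments that best match the query."""
    query_words = {w.lower().strip(".,?") for w in query.split()}
    # Rank segments by word overlap with the query (toy relevance score).
    ranked = sorted(range(len(segments)),
                    key=lambda i: len(query_words & segments[i].summary),
                    reverse=True)
    # Keep the chosen segments in their original document order.
    return [segments[i].text for i in sorted(ranked[:top_k])]


doc = "alpha beta gamma delta. The launch code is 1234 today. omega psi chi phi"
print(expand_relevant(compress(doc, segment_words=4), "launch code"))
```

In a real system the relevance score would come from the model operating on latent embeddings rather than from word overlap; the sketch fixes only the shape of the loop.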

Source: arxiv.org · Published June 7, 2026
Lumi AI News — AI-assisted curation pursuant to Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.6.5.

Encoder-Decoder Architecture for Efficient Context Compression in LLMs
