Tangram: Static KV-Cache Compression for Faster Multi-Turn LLM Serving

In a nutshell: Tangram fixes a statically predictable memory budget for each attention head, eliminating the fragmentation and latency drag caused by dynamic KV-cache compression.

Researchers present Tangram, a serving framework for large language models that uses structural prediction to make heterogeneous (per-head) key-value cache compression practical. The system eliminates the memory fragmentation and latency drag that previous approaches caused.

In multi-turn dialogs, the KV-cache grows with every user turn and model response, so in modern setups memory, not compute, becomes the throughput bottleneck. Non-uniform compression, which assigns a different compression ratio to each attention head, preserves accuracy but causes serious problems in existing serving stacks. Because heads end up with different KV lengths, memory becomes fragmented, defragmentation consumes up to 25% of prefill time, and GPU load becomes imbalanced, with up to 1.7× longer decode latencies or a 15–20% overhead per decode step from rescheduling.
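To make the memory pressure concrete, the following minimal sketch computes the uncompressed KV-cache size for one sequence and shows how uneven per-head lengths leave slots unused if every head is simply padded to the longest one. The model shape, the token counts, and the helper names kv_cache_bytes and padding_waste are illustrative assumptions, not figures or code from the paper, and padding to the longest head is only one simple way to picture the problem, not how Tangram or vLLM measures fragmentation.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Uncompressed KV-cache size in bytes for one sequence.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def padding_waste(kept_tokens_per_head):
    """Fraction of slots left unused if every head is padded to the longest head."""
    longest = max(kept_tokens_per_head)
    allocated = longest * len(kept_tokens_per_head)
    used = sum(kept_tokens_per_head)
    return 1 - used / allocated


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128, fp16 values.
full_bytes = kv_cache_bytes(32, 8, 128, 32_768)
print(full_bytes / 1024**3, "GiB")

# Hypothetical per-head lengths after non-uniform compression.
waste = padding_waste([1000, 250, 500, 250])
print(f"{waste:.0%} of padded slots unused")
```

Under these assumed numbers, a single 32,768-token conversation already occupies 4 GiB before compression, and the uneven per-head lengths leave half of a naively padded allocation empty.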

Tangram builds on an observation: the retention each head requires follows a two-tier structural regularity, namely an input-invariant ranking of heads combined with tightly bounded per-head ratios. This regularity can be calibrated offline from just 50 samples. The framework implements three mechanisms. Budget Reservation fixes the post-compression memory footprint of each head at scheduling time, eliminating defragmentation. Ragged Paging clusters heads with similar budgets into separate page tables, making fragmented memory reclaimable. Ahead-of-Time Load Balancing computes balanced GPU partitions without runtime scheduling.
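The three mechanisms can be pictured with a small sketch. Everything here is an assumption made for illustration: the retention ratios are made up, the page size of 16 tokens is arbitrary, the function names are hypothetical, and the partitioning step uses a generic greedy heuristic (heaviest head to the least-loaded GPU) as a stand-in. It is not Tangram's actual implementation or the vLLM API; the point is only that once per-head ratios are known offline, budgets, page groupings, and GPU partitions can all be computed before any tokens are processed.

```python
import math
from collections import defaultdict

PAGE_TOKENS = 16  # hypothetical page size, in tokens


def reserve_budgets(prompt_tokens, retention_ratios):
    """Fix each head's post-compression token budget before prefill runs."""
    return [max(1, math.ceil(prompt_tokens * r)) for r in retention_ratios]


def group_by_pages(budgets, page_tokens=PAGE_TOKENS):
    """Cluster heads that need the same number of pages.

    Each cluster stands in for one page table shared by similarly budgeted heads.
    """
    groups = defaultdict(list)
    for head, budget in enumerate(budgets):
        groups[math.ceil(budget / page_tokens)].append(head)
    return dict(groups)


def balance_heads(budgets, num_gpus):
    """Greedy partition: assign the heaviest remaining head to the lightest GPU."""
    loads = [0] * num_gpus
    parts = [[] for _ in range(num_gpus)]
    for head in sorted(range(len(budgets)), key=lambda h: -budgets[h]):
        gpu = loads.index(min(loads))
        parts[gpu].append(head)
        loads[gpu] += budgets[head]
    return parts, loads


# Made-up per-head retention ratios, as if produced by offline calibration.
ratios = [0.50, 0.10, 0.25, 0.10, 0.40, 0.05, 0.30, 0.30]

budgets = reserve_budgets(1024, ratios)   # known at scheduling time
groups = group_by_pages(budgets)          # pages needed -> heads
parts, loads = balance_heads(budgets, 2)  # computed without runtime rescheduling

print("budgets:", budgets)
print("page groups:", groups)
print("GPU partitions:", parts, "loads:", loads)
```

Because the budgets depend only on the prompt length and the calibrated ratios, none of these steps needs to inspect the compressed cache at run time, which is the property the article attributes to Tangram.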

Implemented as a drop-in substrate on vLLM, Tangram achieves up to 2.6× higher average throughput compared to full KV processing while maintaining the accuracy of existing non-uniform compression methods. The code is publicly available at https://github.com/aiha-lab/TANGRAM.


Source: arxiv.org · Published June 14, 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.1.
