Tangram achieves statically predictable memory budgets per attention head to eliminate fragmentation and latency drag caused by dynamic KV-cache compression.
DiffusionGemma replaces the traditional sequential token-generation process with parallel denoising of 256-token blocks, enabling faster inference and improved problem-solving capabilities for complex tasks.
KVarN reduces error accumulation when quantizing KV-caches to 2-bit precision through improved token-scale normalization and achieves state-of-the-art results on MATH500, AIME24, and HumanEval.