The bottom line: Google researchers report 3x faster LLM inference on TPUs using diffusion-style speculative decoding. The method drafts candidate tokens in parallel rather than one at a time, removing what had been the main bottleneck of autoregressive drafting.

Google researchers have developed a new method to accelerate large language model (LLM) inference on TPUs. The approach uses diffusion-style speculative decoding and is reported to run three times faster than conventional approaches, because it no longer has to predict draft tokens one after another.

Until now, LLM acceleration has relied mainly on autoregressive speculative decoding: a small draft model proposes tokens, which the larger target model then verifies. This sequential drafting has a fundamental drawback: generating K candidate tokens takes K successive forward passes of the draft model, which is a significant bottleneck.
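The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the researchers' implementation: the two "models" are simple hypothetical functions over integer tokens (`toy_target`, `toy_draft`), verification is greedy, and `CountingModel` exists only to make the number of draft forward passes visible.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def toy_target(context: List[Token]) -> Token:
    """Hypothetical target model: always predicts last token + 1 (mod 10)."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[Token]) -> Token:
    """Hypothetical draft model: agrees with the target except after token 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


class CountingModel:
    """Wraps a next-token function and counts its forward passes."""

    def __init__(self, fn: NextTokenFn) -> None:
        self.fn = fn
        self.calls = 0

    def next_token(self, context: List[Token]) -> Token:
        self.calls += 1
        return self.fn(context)


def draft_autoregressive(draft: CountingModel, context: List[Token], k: int) -> List[Token]:
    """Draft k tokens one at a time; each token needs its own forward pass."""
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft.next_token(context + drafted))
    return drafted


def verify_greedy(target: NextTokenFn, context: List[Token], drafted: List[Token]) -> List[Token]:
    """Keep the longest drafted prefix the target agrees with, then add one target token.

    A real system would score all drafted positions in one batched target pass;
    the loop here only keeps the toy example readable.
    """
    accepted: List[Token] = []
    for tok in drafted:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    accepted.append(target(context + accepted))  # bonus token when all drafts match
    return accepted


if __name__ == "__main__":
    draft = CountingModel(toy_draft)
    drafted = draft_autoregressive(draft, [1], k=4)
    print(drafted, draft.calls)                   # 4 drafted tokens cost 4 draft passes
    print(verify_greedy(toy_target, [1], drafted))
```

In this sketch the draft-pass count grows linearly with K, which is the cost the article refers to.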

A team led by Weiren Yu (Product Manager), with his colleagues Yarong Mu and Lihao Ran from Google Cloud and researchers from UC San Diego (Zhaoxiang Feng, Yiming Zhao and Assistant Professor Hao Zhang), has taken a different route. They apply the principle of diffusion models to speculative decoding: candidate tokens are generated in parallel rather than one after another, which significantly reduces the computation time spent on drafting.
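To illustrate the idea of parallel drafting, here is a minimal Python sketch. It is an assumption-laden toy, not the method from the research: `ToyBlockDrafter`, its `denoise` method, the `MASK` placeholder and the arithmetic prediction rule are all hypothetical. The only point it makes is that a block of K masked positions can be filled in a single forward pass, so the pass count no longer depends on K.

```python
from typing import List

Token = int
MASK: Token = -1  # hypothetical placeholder id for a not-yet-predicted position


class ToyBlockDrafter:
    """Hypothetical drafter that predicts every masked position in one pass."""

    def __init__(self) -> None:
        self.calls = 0

    def denoise(self, context: List[Token], block: List[Token]) -> List[Token]:
        # One forward pass: each position is predicted from the context and its
        # own offset, not from the drafter's output at earlier positions.
        self.calls += 1
        return [
            (context[-1] + i + 1) % 10 if tok == MASK else tok
            for i, tok in enumerate(block)
        ]


def draft_parallel(drafter: ToyBlockDrafter, context: List[Token], k: int) -> List[Token]:
    """Draft k tokens by filling a fully masked block in a single pass."""
    block = [MASK] * k
    return drafter.denoise(context, block)


if __name__ == "__main__":
    drafter = ToyBlockDrafter()
    print(draft_parallel(drafter, [1], k=4), drafter.calls)  # 4 tokens, 1 pass
```

The drafted block would still be checked by the target model in the usual verify step; only the drafting stage changes in this sketch.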

The solution was optimized specifically for Google TPUs and, in practical tests, delivered a consistent threefold speedup in LLM inference.


