In a nutshell: Google has developed a new method for accelerating LLM inference on TPUs. Its diffusion-inspired approach predicts tokens in parallel instead of one at a time, removing the sequential bottleneck and yielding a threefold speedup.
Google researchers have presented a method for accelerating language model inference on Tensor Processing Units (TPUs). With a diffusion-inspired approach to speculative decoding, they achieve a threefold speedup over conventional methods.
The team led by Weiren Yu and Yarong Mu of Google Cloud has achieved a breakthrough in optimizing Large Language Model (LLM) inference. The previous industry-standard method, autoregressive speculative decoding, uses a compact draft model that predicts candidate tokens one at a time, which a target model then validates. This serial procedure creates a critical bottleneck: generating K candidate tokens requires K sequential forward passes of the draft model.
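The serial bottleneck can be made concrete with a minimal Python sketch. It is a toy illustration, not Google's implementation: `toy_target`, `toy_draft`, `draft_autoregressive`, and `verify_greedy` are hypothetical names, tokens are plain integers, and verification uses simple greedy matching instead of the sampling-based acceptance rules used in practice.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def toy_target(context: List[Token]) -> Token:
    """Hypothetical target model: always continues with (last + 1) mod 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[Token]) -> Token:
    """Hypothetical draft model: agrees with the target except after token 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


def draft_autoregressive(
    draft_next: NextToken, prefix: List[Token], k: int
) -> Tuple[List[Token], int]:
    """Propose k tokens one at a time and count the draft forward passes."""
    context = list(prefix)
    proposed: List[Token] = []
    passes = 0
    for _ in range(k):
        token = draft_next(context)  # one draft forward pass per token
        passes += 1
        proposed.append(token)
        context.append(token)  # the next pass depends on this token
    return proposed, passes


def verify_greedy(
    target_next: NextToken, prefix: List[Token], proposed: List[Token]
) -> List[Token]:
    """Accept drafted tokens until the first mismatch with the target.

    A real target model scores all positions in one batched pass; this loop
    only simulates that check position by position.
    """
    context = list(prefix)
    accepted: List[Token] = []
    for token in proposed:
        expected = target_next(context)
        if token != expected:
            accepted.append(expected)  # replace the rejected token and stop
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target_next(context))  # all accepted: one extra token
    return accepted


proposed, draft_passes = draft_autoregressive(toy_draft, [1], k=4)
accepted = verify_greedy(toy_target, [1], proposed)
```

In this toy run the draft model needs four sequential passes to propose four tokens, and the fourth proposal is rejected and replaced by the target's own choice. The number of draft passes grows linearly with K, which is the cost the article describes.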
The new approach overcomes this limitation with a diffusion-like method that predicts multiple tokens in parallel. This eliminates the serial bottleneck and enables significantly faster inference on Google TPUs. The research was conducted by experts from Google Cloud together with research assistants from UC San Diego and was presented on Google’s Developer Blog.
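The idea of drafting several tokens in a single pass can be sketched as follows. This is a toy illustration under stated assumptions, not the method from the blog post: `MASK`, `toy_block_drafter`, and `draft_parallel` are hypothetical names, the "model" is a fixed arithmetic rule, and the sketch assumes a non-empty prefix.

```python
from typing import List, Tuple

MASK = -1  # hypothetical placeholder for a position that is not yet predicted


def toy_block_drafter(tokens: List[int]) -> List[int]:
    """Stand-in for ONE forward pass that fills every MASK position at once.

    Toy rule: a masked position at distance d from the last known token gets
    (last_known_token + d) mod 10. Each position depends only on the known
    prefix, never on another drafted token, so all can be computed together.
    """
    last_known = max(i for i, t in enumerate(tokens) if t != MASK)
    base = tokens[last_known]
    return [
        t if t != MASK else (base + (i - last_known)) % 10
        for i, t in enumerate(tokens)
    ]


def draft_parallel(prefix: List[int], k: int) -> Tuple[List[int], int]:
    """Propose k tokens with a single drafter call; return (tokens, passes)."""
    padded = list(prefix) + [MASK] * k  # append k empty slots to the prefix
    filled = toy_block_drafter(padded)  # one pass fills all k slots
    return filled[len(prefix):], 1


proposed, draft_passes = draft_parallel([1], k=4)
```

Here four candidate tokens come out of one drafter call, so the number of draft passes stays at one regardless of K, whereas a sequential drafter would need K. The candidates would still be checked by the target model, as in ordinary speculative decoding.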