
Lightning-Fast AI Models Directly on Devices with LiteRT-LM

In a nutshell: Google’s LiteRT-LM runs AI inference directly on devices. With Gemma 4, it decodes at 52 to 76 tokens per second depending on platform and hardware, relying on quantization and multi-token prediction to reach those speeds.

Google’s new LiteRT-LM brings large language models to edge devices and is tuned for deploying Gemma 4 on Android, iOS, and the web. It uses quantization techniques and accelerated kernels to run AI applications locally on smartphones, tablets, and in the browser.

Google AI Edge has developed LiteRT-LM, which builds on LiteRT (formerly TensorFlow Lite), as a solution for deploying large language models on edge devices. The system is already used in several Google products, from Chrome and ChromeOS to the Pixel Watch and the viral Google AI Edge Gallery app for Android and iOS.

The reported speeds are as follows. On Android (Samsung S26 Ultra), LiteRT-LM with Gemma 4 reaches a decode speed of 52 tokens per second via the GPU backend. On iOS (iPhone 17 Pro), it reaches 56 tokens per second. On the web, WebGPU on a MacBook Pro M4 Max delivers up to 76 tokens per second.
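To put these decode speeds in perspective, the short sketch below converts them into the time needed to generate a reply. The reply length of 256 tokens is a hypothetical value chosen for illustration, and the calculation assumes a steady decode rate and ignores prompt processing (prefill) time, which the article does not quantify.

```python
# Decode speeds (tokens per second) as quoted in the article.
DECODE_TOKENS_PER_SECOND = {
    "Android (Samsung S26 Ultra, GPU)": 52,
    "iOS (iPhone 17 Pro)": 56,
    "Web (MacBook Pro M4 Max, WebGPU)": 76,
}


def decode_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to generate num_tokens at a steady decode rate.

    Prefill time is deliberately excluded (assumption: decode only).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


REPLY_TOKENS = 256  # hypothetical reply length, not from the article

for platform, rate in DECODE_TOKENS_PER_SECOND.items():
    print(f"{platform}: {decode_seconds(REPLY_TOKENS, rate):.1f} s")
```

Under these assumptions, a 256-token reply takes roughly 4.9 s on the Android device, 4.6 s on the iPhone, and 3.4 s in the browser.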

The system was designed to handle the competing demands of limited memory, constrained compute power, and fragmented hardware. To do this, it uses quantization schemes, optimized XNNPACK and MLDrift kernels, and an orchestration layer that avoids expensive CPU/GPU data transfers.
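The article does not say which quantization schemes LiteRT-LM uses. As a general illustration of why quantization saves memory, here is a minimal sketch of symmetric per-tensor int8 quantization, one of the simplest textbook schemes. The function names are hypothetical and this is not LiteRT-LM's actual implementation.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: w is approximated by scale * q,
    with every q an integer in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from int8 codes."""
    return [scale * v for v in q]


# Illustrative weights (made up for this example).
weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored in one byte instead of four (for float32), at the cost of a rounding error of at most half the scale per weight. Production schemes typically refine this idea, for example with per-channel scales or lower bit widths.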

A key feature is support for multi-token prediction (MTP), which enables significant performance gains. LiteRT-LM provides tuned backends for CPU, GPU, and NPU (the latter currently on Android only), so developers can build once and get hardware-appropriate performance on each platform.
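The article does not describe how LiteRT-LM applies MTP. One common way to turn multi-token prediction into a speedup is draft-and-verify decoding: several tokens are proposed at once, the main model checks them, and the longest correct prefix is kept. The toy sketch below illustrates only that general idea. The "model" and "drafter" are hypothetical stand-ins operating on digits, not real networks or LiteRT-LM APIs.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]
Drafter = Callable[[List[int], int], List[int]]


def greedy_decode(next_token: NextToken, prompt: List[int], n: int) -> List[int]:
    """Baseline: one model pass per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(next_token(out))
    return out[len(prompt):]


def draft_and_verify(next_token: NextToken, draft: Drafter,
                     prompt: List[int], n: int, k: int) -> Tuple[List[int], int]:
    """Generate n tokens, proposing k at a time.

    Each loop iteration counts as one verification pass. In a real system the
    k proposals are checked in a single batched forward pass; here the toy
    model is simply called once per token inside that pass.
    """
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n:
        passes += 1
        proposal = draft(out, k) or [None]  # guarantee progress on empty drafts
        for tok in proposal:
            expected = next_token(out)
            out.append(expected)  # always emit the verified token
            if tok != expected or len(out) - len(prompt) >= n:
                break  # first mismatch (or enough tokens) ends this pass
    return out[len(prompt):], passes


def toy_model(seq: List[int]) -> int:
    """Hypothetical 'model': counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int], k: int) -> List[int]:
    """Hypothetical drafter: usually right, deliberately wrong when the true token is 0."""
    proposal, last = [], seq[-1]
    for _ in range(k):
        last = (last + 1) % 10
        proposal.append(7 if last == 0 else last)
    return proposal


baseline = greedy_decode(toy_model, [0], 20)
tokens, passes = draft_and_verify(toy_model, toy_draft, [0], 20, k=4)
print(tokens == baseline, passes)
```

The output matches plain greedy decoding token for token, but the number of verification passes is lower than the number of tokens generated. How large the gain is in practice depends on how often the drafted tokens are accepted.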
