Google speeds up Gemma 4 threefold with multi-token prediction
Google has released multi-token prediction (MTP) drafters for its open Gemma 4 model family, designed to speed up text generation by up to three times. LLMs normally generate text one token at a time, loading billions of parameters from memory at each step. As a result, the processor's compute cores spend most of their time waiting for data, Google says.
The company's new MTP technology tackles this bottleneck. While the main model waits for its data, a small auxiliary model uses the idle capacity to suggest several tokens at once. The main model then checks all of those suggestions in a single pass; if they are correct, they are all accepted at once. Because the smaller model only fills time that would otherwise go to waste, the same text gets produced faster with no loss in quality or accuracy, according to Google.
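The draft-and-verify idea can be illustrated with a toy sketch. This is not Google's implementation, which the announcement does not detail; it is a generic greedy draft-then-verify loop under stated assumptions. The "models" are plain Python functions over integer tokens, and every name here (`speculative_generate`, `toy_target`, `toy_draft`, the parameter `k`) is hypothetical. In a real system the verification step is one batched forward pass of the main model; the inner loop below only simulates it.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def greedy_generate(target_next: NextTokenFn, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one main-model step per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out


def speculative_generate(
    target_next: NextTokenFn,
    draft_next: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Draft k tokens with the small model, then verify them with the main model.

    Returns the generated sequence and the number of (simulated) main-model passes.
    """
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        # 1) The small drafter proposes k tokens, one after another.
        ctx = list(out)
        draft: List[Token] = []
        for _ in range(k):
            token = draft_next(ctx)
            draft.append(token)
            ctx.append(token)

        # 2) The main model checks all proposals. Counted as ONE pass because a
        #    real model would score every drafted position in a single batch.
        target_passes += 1
        ctx = list(out)
        accepted: List[Token] = []
        for token in draft:
            expected = target_next(ctx)
            if expected != token:
                accepted.append(expected)  # keep the main model's own token, drop the rest
                break
            accepted.append(token)
            ctx.append(token)
        else:
            # Every draft token matched, so the same pass also yields one extra token.
            accepted.append(target_next(ctx))

        out.extend(accepted)
    return out[: len(prompt) + max_new_tokens], target_passes


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the main model: a fixed arithmetic rule."""
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the drafter: agrees with the main model except after token 5."""
    return 0 if ctx[-1] == 5 else toy_target(ctx)


if __name__ == "__main__":
    text, passes = speculative_generate(toy_target, toy_draft, [1], max_new_tokens=12)
    print("same output as baseline:", text == greedy_generate(toy_target, [1], 12))
    print("main-model passes:", passes, "for 12 new tokens")
```

Because a rejected draft token is always replaced by the main model's own choice, the output of this sketch is identical to plain one-token-at-a-time decoding; the only thing that changes is how many main-model passes are needed, which depends on how often the drafter guesses correctly.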
The speedup works on smartphones, local computers, and cloud applications. The drafters are available under the open Apache 2.0 license on Hugging Face and Kaggle. Google's Gemma 4 open-weight model, introduced in early April, has already been downloaded over 60 million times.