Accelerating Gemma 4: faster inference with multi-token prediction drafters
- libraryofbabel - 24894 seconds ago
Speculative decoding is an amazingly clever invention that almost seems too good to be true (faster inference with zero quality degradation relative to the main model). The core idea: if a smaller model can generate a short run of draft next tokens that have a reasonable likelihood of being correct, the main model can check them cheaply because the checks run in parallel. And if you think about it, a lot of next tokens are pretty obvious in certain situations (e.g. it doesn't take a frontier model to guess the likely next token in "United States of...", and a lot of code is boilerplate that's easy to predict from the preceding sections).
I always encourage folks who are interested in LLM internals to read up on speculative decoding (both the basic version and the more advanced MTP), and if you have time, try implementing your own version (writing the core without a coding agent, to begin with!).
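The draft-then-verify loop described above can be sketched in a few lines. Everything here is illustrative: `main_model_next` and `draft_model_next` are hypothetical stand-ins for greedy decoding with a large and a small model, and in a real engine the verification checks run as one batched forward pass rather than a loop.

```python
import random

random.seed(0)

def main_model_next(context):
    # Stand-in for one expensive main-model forward pass: returns a
    # deterministic "correct" next token for the given context.
    return (sum(context) * 31 + len(context)) % 1000

def draft_model_next(context):
    # Stand-in for the cheap drafter: agrees with the main model most
    # of the time, occasionally guesses wrong.
    token = main_model_next(context)
    return token if random.random() < 0.8 else (token + 1) % 1000

def speculative_step(context, k=4):
    # 1) Draft k tokens autoregressively with the small model.
    ctx = list(context)
    drafts = []
    for _ in range(k):
        t = draft_model_next(ctx)
        drafts.append(t)
        ctx.append(t)

    # 2) Verify the drafts against the main model.  Given the draft,
    #    these k checks are independent, which is why a real engine
    #    runs them as a single batched forward pass.
    out = []
    ctx = list(context)
    for t in drafts:
        target = main_model_next(ctx)
        if t == target:
            out.append(t)          # draft accepted
            ctx.append(t)
        else:
            out.append(target)     # first mismatch: keep the main
            return out             # model's token, discard the rest
    out.append(main_model_next(ctx))  # all accepted: one bonus token
    return out

tokens = speculative_step([1, 2, 3])
print(tokens)  # 1 to 5 tokens, identical to pure main-model decoding
```

The output is guaranteed identical to what the main model alone would produce (greedy decoding here); the win is that one main-model "round" can yield up to k+1 tokens instead of one.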
- WarmWash - 61805 seconds ago
I don't see it talked about much, but Gemma (and Gemini) uses far fewer output tokens than other models, while still staying within arm's reach of top benchmark performance.
It's not uncommon to see a Gemma vs Qwen comparison where Qwen does a bit better but spent 22 minutes on the task, while Gemma aligned the buttons wrong but only spent 4 minutes on the same prompt. So taken at face value, Gemma is now underperforming the leading open models by 5-10%, but doing it in a tenth of the time.
- zdw - 67008 seconds ago
MTP support is being added to llama.cpp, at least for the Qwen models (https://github.com/ggml-org/llama.cpp/pull/20533), and I'd imagine Gemma 4 will come soon.
The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.
- msp26 - 63047 seconds ago
Google is singlehandedly carrying Western open-source models. Gemma 4 31B is fantastic.
However, it's a little painful trying to fit the best possible version into 24GB of VRAM with vision plus this drafter soon. My build doesn't support any more GPUs, so I'd either want another 4090 (overpriced) for best performance or have to replace it altogether.
- skybrian - 66559 seconds ago
Watching the computer write text sort of reminds me of using a modem to call a BBS in the old days. This seems like going from 300 baud to 1200 - a significant improvement, but still pretty slow, and someday we will wonder how we put up with it.
- aleksiy123 - 60130 seconds ago
I'm starting to think that Google's strategy is a bit different than the other frontier providers'.
They're focusing more on compute efficiency than on pure performance. And maybe that's why Gemini is (seemingly) lagging behind?
Other providers are hitting capacity limits and the limits of subsidising their inference.
Google's strategy seems to be about scaling and distributing these models to their existing billions of users.
- christina97 - 65118 seconds ago
I recently set up the 26B A4B model on vLLM on an RTX 3090 (4-bit) after a hiatus from local models. Just completely blown away by the speed and quality you can get now for a sub-$1k investment.
I tried Qwen first, but it was unstable and had ridiculously long thinking traces!
- zkmon - 7390 seconds ago
The "how to get started" asks you to read "documentation" which turns out to be a sales blurb. Am I missing something?
- Patrick_Devine - 62363 seconds ago
In my testing the Gemma 4 31b model had the biggest speed boost in Ollama w/ the MLX runner for coding tasks (at about 2x). Unfortunately you'll need a pretty beefy Mac to run it because quantization really hurts the acceptance rate. The three other smaller models didn't perform as well because the validation time of the draft model ate up most of the performance gains. I'm still trying to tune things to see if I can get better performance.
You can try it out with Ollama 0.23.1 by running `ollama run gemma4:31b-coding-mtp-bf16`.
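The acceptance-rate effect mentioned above is easy to quantify with the standard speculative-decoding analysis. This is a back-of-the-envelope sketch, not Ollama's actual implementation; it assumes each drafted token is accepted independently with probability alpha, with draft length gamma:

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per main-model verification pass when
    drafting gamma tokens, each accepted independently with probability
    alpha: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

# How a quantization-induced drop in acceptance rate eats the speedup:
for alpha in (0.9, 0.7, 0.5):
    print(f"alpha={alpha}: {expected_tokens_per_step(alpha, gamma=4):.2f} tokens/pass")
```

At gamma=4, dropping the per-token acceptance rate from 0.9 to 0.5 cuts the expected yield from about 4.1 to about 1.9 tokens per verification pass, which is why a quantization-damaged drafter can wipe out most of the gain.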
- fulafel - 11560 seconds ago
Looks like DeepSeek did this as well since V3: https://deepwiki.com/deepseek-ai/DeepSeek-V3/4.4-multi-token...
Credit for the MTP technique is due to https://arxiv.org/abs/2404.19737 from 2024:
"Better & Faster Large Language Models via Multi-token Prediction" by Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve
- these - 67687 seconds ago
Has anyone managed to get this to work in LM Studio? They've got an option in the UI, but it never seems to let me enable it.
- julianlam - 63363 seconds ago
Really excited to try this once it's merged into llama.cpp.
Gemma 4 26B-A4B is much quicker on my setup than Qwen3.6-35B-A3B (by about 3x), so the thought of a 1.5x speedup is tantalizing.
I've tried draft models with limited success (even the smaller 3B draft model alongside a dense 14B Ministral model introduced too much overhead).
- regexorcist - 59626 seconds ago
Sounds like a game changer if I see that kind of speedup on my hardware. So far I've preferred Qwen 3.6 because of its better tool handling, even though Gemma 4 is faster, but I saw they've updated the model template and that's supposed to be better now. Looking forward to trying this with llama.cpp.
- vhiremath4 - 62300 seconds ago
So this is like branch prediction in CPUs? Except we have probability baked into the model itself, so it's even more reliable.
- great_psy - 17957 seconds ago
This might be silly, but since the assistant models are so much smaller than the full models, what if we just used those smaller models on their own?
Any idea how much worse they would be? Or is the issue that their errors really diverge as you accept more of their tokens?
- mchusma - 68116 seconds ago
I find it puzzling that Google doesn't actively promote its own cloud for Gemma 4 inference. Open source is great, love it. But shouldn't Google want me to be able to use and pay for it through Gemini and Vertex?
- recsv-heredoc - 64769 seconds ago
CloudFlare offers excellent service for many of the open-weights models. It's fast, cheap, and simple to set up. I can highly recommend it as an LLM provider.
They serve gemma-4-26b-a4b-it.
- netdur - 57908 seconds ago
I am getting 21 t/s on a Fold 7; with the 1.8x speedup that's 21 x 1.8 = 37.8 t/s, compared to the M1 Max's 54 t/s. That is impressive.
- brikym - 51238 seconds ago
I wonder what latency and tok/s this model on Groq or Cerebras would be capable of. I have a couple LLM driven games [1][2] where speed is really important to the experience. Currently the best performance I can get is the gpt-oss models on Groq or Cerebras but they need quite a bit of extra context and tools to correct for mistakes. I'm making a bet I'll be able to get the same performance much cheaper in the next few months.
- nolist_policy - 51945 seconds ago
Works great in the latest version of Google AI Edge Gallery: https://github.com/google-ai-edge/gallery/releases
- el_isma - 57549 seconds ago
How is this different from the speculative decoding that we had before?
You could pair a big and a small model, like Qwen 32B with Qwen 4B, and get that same dynamic of the small model generating tokens and the big one "certifying" them.
The blog says something about re-using the big model's data?
- AbuAssar - 62832 seconds ago
These are the updated models:
google/gemma-4-31B-it-assistant
google/gemma-4-26B-A4B-it-assistant
google/gemma-4-E4B-it-assistant
google/gemma-4-E2B-it-assistant
- wrxd - 50440 seconds ago
I'm not sure I understand how this works: https://huggingface.co/google/gemma-4-E4B-it-assistant has 78.8M parameters, while the standard variant https://huggingface.co/google/gemma-4-E4B-it has 8B parameters.
Is gemma-4-E4B-it-assistant a model I can use stand-alone or a model I need to use in combination with gemma-4-E4B-it?
- nalinidash - 65803 seconds ago
Technical details are here: https://x.com/googlegemma/status/2051694045869879749
- disiplus - 67362 seconds ago
Nice, I'll run it later against Qwen3.6 27B. The speed was one of the reasons why I was running Qwen and not Gemma; the difference was big. There is some magic that happens when you get more than 100 t/s.
- julianlam - 58457 seconds ago
Does this mean there will be new Gemma 4 models released with MTP, or are they already available in existing models + quants?
- joakleaf - 53881 seconds ago
Seems like a pull request for vLLM was just approved a few minutes ago:
https://github.com/vllm-project/vllm/pull/41745
("Add Gemma4 MTP speculative decoding support")
- pu_pe - 65462 seconds ago
So much faster inference with no quality degradation? All that for just some small memory overhead (drafter models are <1B it seems)?
- imrozim - 27159 seconds ago
3x faster inference means cheaper API costs too. For a solo dev building AI products, this matters a lot.
- sigmar - 61209 seconds ago
>try them directly on Google AI Edge Gallery for Android or iOS.
I'm not seeing any update to the app on my android phone... maybe later today?
>We’ve published an in-depth technical explainer
I was expecting a PDF link, but this goes to a brief article on Twitter/X. lol, okay...
- OliverSmith34 - 11792 seconds ago
The best iOS inference model comes from Google.
- tannhaeuser - 57730 seconds ago
Tested the Gemma 4 26B MoE 4-bit quantised GGUF on llama.cpp following these guides, with mmap'd I/O on a 16GB MBP, and it was unbearably slow (0.0 t/s).
- deskamess - 64949 seconds ago
Did DeepSeek come up with MTP? It was listed prominently in their recent paper as being carried forward from the previous release.
- Alonski - 24597 seconds ago
This is sort of similar to Ethereum, and maybe a bit like zero-knowledge proofs, but with the LLM handling both sides.
- shay_ker - 65926 seconds ago
Curious that they are doing speculative decoding and not baking MTP into the model, like Nemotron:
https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...
- larnon - 52855 seconds ago
Anyone tried this with vLLM yet? I'm confused about how to turn it on, tbh.
- ThouYS - 49701 seconds ago
Don't know about this guy, but qwen3.6:27b with the UD 4-bit quant and little-coder/pi has been amazing. The first local LLM experience that can do actual meaningful work.
- noashavit - 57121 seconds ago
Gemma4:e4b is a huge upgrade
- brcmthrowaway - 64928 seconds ago
Is Google's local-model strategy aimed at taking the big AI cloud labs down a notch?
- franze - 60316 seconds ago
If someone wants to work with Gemma and not deal with Ollama or configs, there is (my baby) https://airplane-ai.franzai.com/
Beta, but usable.
- simianwords - 59634 seconds ago
Gemma 4 is really a beast. The 31B version is totally usable, e.g. for when I'm bored without internet.
- ActorNightly - 62647 seconds ago
I found that Gemma 4:26b makes way more mistakes compared to Qwen and Gemma 3. Gemma 3 27b QAT was my go-to for some time, as it was quite fast. Qwen is still king for a balance of accuracy and inference speed.
Gemma 4:31b was more accurate, but its speed was horrendous.
- m3kw9 - 64898 seconds ago
ok so? Anyone got a verdict/review?
- momo26 - 32427 seconds ago
[flagged]
- rahimnathwani - 61809 seconds ago
[dead]
- Gormers - 51764 seconds ago
[flagged]
Nerd news! 🤓