Posted by amrrs 20 hours ago

Accelerating Gemma 4: faster inference with multi-token prediction drafters (blog.google)
590 points | 277 comments
el_isma 16 hours ago|
How is this different from the speculative decoding that we had before?

You could pair a big and a small model, like qwen 32b with qwen 4b, and have that same dynamic of the small model generating tokens and the big one "certifying" them.

The blog says something about re-using the big model's data?

adrian_b 16 hours ago||
Multi-token prediction is the same thing as speculative decoding. This is mentioned in the Google pages describing their MTP implementation.

Google has now provided small models for each of the previous Gemma 4 models, e.g. "gemma-4-26B-A4B-it-assistant" for "gemma-4-26B-A4B-it".

The difference vs. Qwen is that here each small model is not some general-purpose smaller model, but a model that has been optimized specifically for this task, to predict the output of the bigger model with which it is paired.

This specialization and optimization of the Google "gemma-4-*-assistant" models ensures that they are much smaller and thus much faster than general-purpose small models.

fulafel 3 minutes ago|||
Multi-token prediction is a refined form of speculative decoding.

Google came up with speculative decoding in 2022: https://research.google/blog/looking-back-at-speculative-dec... (Fast Inference from Transformers via Speculative Decoding - Yaniv Leviathan, Matan Kalman, Yossi Matias)

Meta came up with MTP, a smarter way of doing speculative decoding, in 2024: https://arxiv.org/abs/2404.19737 (Better & Faster Large Language Models via Multi-token Prediction - Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve)

DeepSeek V3 shipped MTP in a product first, in 2024: https://arxiv.org/abs/2412.19437 (DeepSeek-V3 Technical Report, 100+ authors)

julianlam 8 hours ago|||
So then these models could be used by llama.cpp today with the -md switch?

Interesting, must try tomorrow.

OneDeuxTriSeiGo 16 hours ago|||
As far as I can tell, MTP differs from regular speculative decoding in that the small model is trained to consume and operate on the big model's hidden state for its predictions.
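
Roughly, the sketch below shows that idea (a hypothetical illustration, not Google's actual drafter architecture; all names and sizes are made up): the drafter is a tiny head that maps the target model's final hidden states to next-token logits instead of re-encoding the context itself, which is how it can stay under 100M parameters.

    import torch
    import torch.nn as nn

    class MTPDrafter(nn.Module):
        """Tiny draft head that reuses the big model's activations.

        Unlike a standalone draft model, it never re-encodes the
        context: it maps the target model's final-layer hidden states
        straight to next-token logits. Everything here is illustrative,
        not Gemma's actual configuration.
        """

        def __init__(self, d_target: int, d_draft: int, vocab_size: int):
            super().__init__()
            self.proj = nn.Linear(d_target, d_draft)  # shrink target hidden state
            self.block = nn.TransformerEncoderLayer(
                d_model=d_draft, nhead=4, batch_first=True
            )  # one cheap mixing layer
            self.lm_head = nn.Linear(d_draft, vocab_size)

        def forward(self, target_hidden: torch.Tensor) -> torch.Tensor:
            # target_hidden: [batch, seq, d_target], produced anyway by
            # the big model's forward pass, so drafting adds little work.
            h = self.block(self.proj(target_hidden))
            return self.lm_head(h)  # [batch, seq, vocab]

    drafter = MTPDrafter(d_target=4096, d_draft=256, vocab_size=32_000)
    hidden = torch.randn(1, 10, 4096)  # stand-in for target activations
    print(drafter(hidden).shape)  # torch.Size([1, 10, 32000])
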
dchftcs 1 hour ago|||
It's the same as speculative decoding. The news is that it has come out for a popular local model.
AbuAssar 18 hours ago||
these are the updated models:

google/gemma-4-31B-it-assistant

google/gemma-4-26B-A4B-it-assistant

google/gemma-4-E4B-it-assistant

google/gemma-4-E2B-it-assistant

sigmar 17 hours ago|
for anyone wanting a glossary to explain the naming scheme here:

E4B = 4B effective parameters (using per-layer embeddings)

E2B = 2B (like above)

it = instruction tuned (rlhf and all that jazz)

assistant = Multi-token drafters (the new 2x speed up)

wrxd 14 hours ago||
I'm not sure I understand how this works: https://huggingface.co/google/gemma-4-E4B-it-assistant has 78.8M parameters, while the standard variant https://huggingface.co/google/gemma-4-E4B-it has 8B parameters.

Is gemma-4-E4B-it-assistant a model I can use stand-alone or a model I need to use in combination with gemma-4-E4B-it?

gunalx 14 hours ago|
You need the regular gemma model as well. You can think of this as a really small distillation of the original: useless on its own because it is often wrong, but it is right more often than not. And because verifying a transformer model's output can be done faster than generating it, we can effectively speed things up by using this draft model and only redoing the compute where it was wrong.

This is an oversimplification, but tl;dr: yes, you need both.

wrxd 14 hours ago||
Thank you!

I already played with Gemma4 on oMLX a while ago. When I have some time I'll check if it supports running MTP models and play a bit more

nalinidash 19 hours ago||
technical details are here: https://x.com/googlegemma/status/2051694045869879749
disiplus 19 hours ago||
nice, will run it later against qwen3.6 27b. the speed was one of the reasons why I was running qwen and not gemma. the difference was big, there is some magic that happens when you have more than 100 tps.
julianlam 17 hours ago||
Does this mean there will be new Gemma 4 models released with MTP, or are they already available in existing models + quants?
adrian_b 15 hours ago||
For each of the 4 gemma-4-*-it models, an associated small model gemma-4-*-it-assistant has been published, to be used for MTP.

If a GGUF file is generated for MTP, it must include both the big model and the small model. There was a reference in another comment to a PR for llama.cpp, which also included updates to the Python program used for converting the safetensors files, and which presumably can handle combining the two paired Gemma 4 models.

jug 15 hours ago||
They have now been released on e.g. Hugging Face with the model suffix "-assistant".
joakleaf 15 hours ago||
Seems like a pull request for vLLM was just approved a few minutes ago:

https://github.com/vllm-project/vllm/pull/41745

("Add Gemma4 MTP speculative decoding support")

pu_pe 19 hours ago||
So much faster inference with no quality degradation? All that for just some small memory overhead (drafter models are <1B it seems)?
tarruda 18 hours ago||
They also published draft models for E4B and E2B. For those, the draft models are only 78M parameters: https://huggingface.co/google/gemma-4-E4B-it-assistant
coder543 18 hours ago|||
MTP requires a separate KV cache, so there is more memory overhead than just the weights of the MTP model, but it's a manageable amount.
a_e_k 18 hours ago||
From the linked post, it didn't read like a separate KV cache was needed:

> The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out.

coder543 18 hours ago||
That's great news. That has not been the case with other MTP implementations like Qwen3.5, but I see the section in the article saying Google introduced some architectural optimizations to make this possible.
furyofantares 17 hours ago|||
Is it really no quality degradation?

I'm curious where my understanding is wrong, but I didn't think you necessarily got the exact same output with how I understand speculative decoding to be used. I thought that if the small model produces tokens that are "good enough", meaning within the top few tokens the larger model produces, they're accepted.

I thought it doesn't necessarily have to produce the exact same token the larger model would have produced in order to be accepted (and that requiring this would reduce the hit rate by a lot), just one the big model could have produced under whatever top-k and temperature settings are in effect.

Klaus23 16 hours ago|||
It really is. This is because LLMs serving a single output/user are strongly bandwidth-limited. Although the hardware can generate multiple tokens simultaneously, it is slowed down when the tokens depend on each other, as is the case with regular text generation.

The draft model essentially predicts the next token quickly, enabling you to start generating the subsequent token in parallel. If the guess is right, the second generated token is correct. If it is wrong, the second generated token is also potentially wrong, so it must be generated again using the correct prior token obtained through the big model.

A poor draft model will simply slow down the process without affecting the output.

furyofantares 16 hours ago||
> If the guess is right

This is the crux. What makes the guess "right"?

I think the acceptance criterion is not that the token is exactly the token the big model would have produced. It's accepted if the big model verifies that the probability of that token was high enough.

How close it comes to the same output (or the same distribution of outputs) you'd get from running the big model alone would depend on temperature, top-k, top-p, or other inference parameters.

Klaus23 15 hours ago|||
The token is correct if it matches the one generated by the main model. It works like this:

The draft model quickly generates draft-token 1.

The main model then starts working on two tokens in parallel. It calculates token 1 based on the context, and token 2 based on the context + draft-token 1.

Once the two tokens have been generated, you can check whether the draft-token 1 from the draft model matches token 1 from the main model.

If they match, you have just calculated two tokens in the time it takes to generate one, because the calculation was done in parallel. If they do not match, delete token 2 and generate it again. Since you have already generated the correct token 1 with the big model, you can use the context + token 1 (from the main model). This takes more time, but the result is always the same.
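
In code, that greedy accept-longest-matching-prefix loop looks roughly like this (a sketch with placeholder model callables, not any real library's API):

    import numpy as np

    def speculative_step(main_probs_fn, draft_probs_fn, context, k=3):
        """One round of greedy speculative decoding (illustrative).

        Both callables take a token sequence of length n and return an
        [n, vocab] array whose row i is the model's distribution for
        the token following position i (placeholders, not a real API).
        """
        c = len(context)

        # 1. Draft k tokens cheaply with the small model.
        seq = list(context)
        for _ in range(k):
            seq.append(int(np.argmax(draft_probs_fn(seq)[-1])))
        drafts = seq[c:]

        # 2. ONE forward pass of the big model over context + drafts.
        #    The pass is parallel over positions, so scoring k extra
        #    tokens costs about the same as scoring one.
        main = main_probs_fn(seq)

        # 3. Accept drafts while they match the big model's own greedy
        #    choice; on the first mismatch, keep the big model's token
        #    and stop. The output is identical to running the big model
        #    alone, just in (hopefully) fewer big-model passes.
        accepted = []
        for i, tok in enumerate(drafts):
            main_tok = int(np.argmax(main[c - 1 + i]))
            accepted.append(main_tok)
            if tok != main_tok:
                break
        else:
            # All k drafts matched, and the same pass gives us a bonus
            # token: the big model's prediction after the last draft.
            accepted.append(int(np.argmax(main[-1])))
        return accepted
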

furyofantares 13 hours ago||
Models do not generate tokens. They generate probabilities for each token.

Inference parameters select a token using those.

You can just select the top token all the time or you can do it probabilistically.

How you do that in both the speculative decoding and the main inference changes how likely you are to get the exact same tokens. And then you can choose to accept only if the token matches exactly, or you can choose to accept if it was reasonably likely to be chosen.

Let's say the main model picked the 2nd most likely token and the speculative model picked the most likely. You can reject that, but you get less speedup. Or you can accept it and get more speedup, but you do change the output: you risk the distribution of your outputs not being what you hope.

I am simplifying. I know in https://arxiv.org/pdf/2302.01318 they specify a probability that you reject a token.
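
For reference, the acceptance rule from that paper, in sketch form (p and q are the big and draft models' next-token probability vectors at one position, and x is the token the draft sampled):

    import numpy as np

    rng = np.random.default_rng(0)

    def accept_or_resample(p: np.ndarray, q: np.ndarray, x: int) -> int:
        """Speculative sampling test from arXiv 2302.01318 (sketch).

        x was sampled from the draft distribution q. Accept it with
        probability min(1, p[x]/q[x]); otherwise resample from the
        residual distribution norm(max(p - q, 0)). The returned token
        is distributed exactly according to p, so the big model's
        output *distribution* is preserved even though an individual
        run can differ from what the big model alone would have sampled.
        """
        if rng.random() < min(1.0, p[x] / q[x]):
            return x
        residual = np.maximum(p - q, 0.0)
        return int(rng.choice(len(p), p=residual / residual.sum()))
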

Klaus23 1 hour ago||
In theory, you could do that and increase the speed at higher temperatures, but it would subtly alter your output based on the draft model's preferences. Rather than picking randomly from the main model's probabilities, you would have to accept a draft model pick if it is close enough.

As far as I know, this is not used in practice. Currently popular implementations always match the main model output, and the draft model only affects the speed.

petu 15 hours ago||||
> What makes the guess "right"?

A token matching the one that would've been picked without speculative decoding. That seems to be more or less agreed upon.

e.g. vLLM docs list tests they run to ensure that output doesn't change if spec. decoding is used: https://github.com/vllm-project/vllm/blob/main/docs/features...

But introducing some threshold to accept other high-probability tokens is an interesting idea.

furyofantares 12 hours ago||
By "lossless" I believe they mean "stays within the target distribution". Thats what their validation test says it tests. Maybe that means there is no loss in quality in practice. I don't think it means there is no change in output.

The paper they link to in that first paragraph says you compare logits to accept or reject.

basiccalendar74 11 hours ago||||
it is only "right" statistically, as in conforming to the same distribution, but there is no guarantee of the exact same output.
dist-epoch 15 hours ago|||
There is more compute available than memory bandwidth when running LLMs.

It's like branch prediction - the CPU predicts what branch you'll take and starts executing it. Later you find out exactly what branch you took. If the prediction was correct, the speculative executed code is kept. If the prediction was wrong, it's thrown away, the pipeline is flushed, and the execution resumes from the branch point.

The same with this thing: 3 tokens, A-B-C, were "predicted", and you start computing ALL 3 of them at the same time, hoping that the prediction checks out. Because of the mathematical structure of the transformer, it costs you almost the same to compute 3 tokens at a time as just one: you are limited by bandwidth, not compute. But CRITICALLY, each token depends on all the previous ones, so if one of the tokens was predicted wrongly, you need to discard all tokens predicted after it (flush the pipeline). This is why a prediction is required and why you can't always compute 3 tokens simultaneously: the serial dependency between consecutive tokens. If you were to start computing 3 tokens simultaneously without a prediction, then for token C you would need to assume exact values for tokens A and B, but those have not been computed yet! If they were speculatively predicted, though, you can start and hope the prediction was correct.
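
A back-of-envelope illustration of that bandwidth point (all numbers invented for illustration, assuming a dense model on a hypothetical GPU):

    # Per decode step the weights must be streamed from GPU memory once,
    # whether you score one position or several.
    weights_bytes = 26e9 * 2        # pretend 26B dense params at bf16
    hbm_bandwidth = 3e12            # pretend 3 TB/s memory bandwidth
    gpu_flops = 1e15                # pretend 1 PFLOP/s of compute

    t_memory = weights_bytes / hbm_bandwidth        # ~17 ms either way
    t_compute = lambda n: n * 2 * 26e9 / gpu_flops  # ~2 FLOPs/param/token

    print(f"stream weights: {t_memory * 1e3:.1f} ms")    # ~17.3 ms
    print(f"compute 1 token: {t_compute(1) * 1e3:.3f} ms")   # ~0.052 ms
    print(f"compute 4 tokens: {t_compute(4) * 1e3:.3f} ms")  # ~0.208 ms
    # Streaming the weights dominates, so a step that verifies several
    # speculative tokens takes about as long as a step that emits one.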

petu 16 hours ago|||
Speculative decoding batches multiple completions covering all possible outcomes (0/1/2 draft tokens accepted) and sees if the big model deviates at any point, thus verifying each token. So there's no difference in output.
ac29 12 hours ago|||
Memory and compute/energy overhead
moffkalast 15 hours ago||
It's based on taking advantage of spare compute if you have it. A tiny model generates a few steps ahead first, then the large one runs batch inference on all of those at once, as if you were already at that point in time. If they all check out afterwards, it jumps ahead; otherwise it discards them and moves on to the next one.

Not sure about this implementation, but conceptually it only works well on very capable GPUs with very predictable output. Typical speedup is about 30%; not sure how google is claiming 250%, which is ridiculous.

And if you don't have enough compute, then you get negative speedup from all the extra overhead.
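
For what it's worth, the original speculative decoding paper gives the expected number of tokens per big-model pass as (1 - a^(g+1)) / (1 - a), for per-token acceptance rate a and draft length g (ignoring the drafter's own, much smaller, cost). A quick check with made-up acceptance rates shows how a well-matched drafter could plausibly get past 30%:

    def tokens_per_pass(a: float, g: int) -> float:
        """Expected tokens per big-model forward pass (Leviathan et al.
        2023), given per-token acceptance rate a and draft length g."""
        return (1 - a ** (g + 1)) / (1 - a)

    for a in (0.5, 0.7, 0.9):  # illustrative acceptance rates
        print(a, round(tokens_per_pass(a, g=3), 2))
    # prints: 0.5 1.88 / 0.7 2.53 / 0.9 3.44

So 2-3x is plausible if the drafter's guesses are accepted most of the time, which is what training the drafter against the big model's own outputs is meant to buy.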

imrozim 8 hours ago||
3x faster inference means cheaper API costs too. For a solo dev building AI, this matters a lot.
ydj 8 hours ago|
Not necessarily. Servers serving the model likely have enough traffic that they are batching decodes already. MTP reduces latency and increases efficiency only when the server can't batch enough concurrent streams to be compute-bound rather than memory-bound.
imrozim 8 hours ago||
Fair, I didn't think about batching. It makes more sense for self-hosted models then.
sigmar 17 hours ago|
>try them directly on Google AI Edge Gallery for Android or iOS.

I'm not seeing any update to the app on my android phone... maybe later today?

>We’ve published an in-depth technical explainer

I was expecting a PDF link, but this goes to a brief article on twitter/X. lol, okay...

nolist_policy 15 hours ago|
It's up on GitHub: https://github.com/google-ai-edge/gallery/releases