Posted by amrrs 19 hours ago

Accelerating Gemma 4: faster inference with multi-token prediction drafters (blog.google)
581 points | 273 comments
these 18 hours ago|
Has anyone managed to get this to work in LM Studio? They've got an option in the UI, but it never seems to let me enable it.
dvt 18 hours ago||
It's not implemented in mlx[1] yet (or llama.cpp[2]), so it may take a while.

[1] https://github.com/ml-explore/mlx-lm/pull/990

[2] https://github.com/ggml-org/llama.cpp/pull/22673

AlphaSite 18 hours ago|||
Yes. Make sure you’re not using the Gemma sparse models since they don’t have a small model to use. Also I removed all the image models from the workspace.
adrian_b 15 hours ago||
I do not know what you mean by sparse models.

All 4 gemma-4-*-it models, regardless of whether they are dense or MoE models, have associated small models for MTP, whose names are obtained by adding the "-assistant" suffix (see the sketch after the links):

https://huggingface.co/google/gemma-4-E2B-it-assistant

https://huggingface.co/google/gemma-4-E4B-it-assistant

https://huggingface.co/google/gemma-4-26B-A4B-it-assistant

https://huggingface.co/google/gemma-4-31B-it-assistant
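
If these drafters behave like ordinary small causal LMs, something like transformers' assisted generation might work once support lands. A minimal, untested sketch — the MTP drafters may well need a dedicated code path instead:

    # Hypothetical sketch: treats the "-assistant" checkpoint as a standard
    # draft model for transformers' assisted (speculative) decoding.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    main_id = "google/gemma-4-26B-A4B-it"
    draft_id = "google/gemma-4-26B-A4B-it-assistant"

    tokenizer = AutoTokenizer.from_pretrained(main_id)
    model = AutoModelForCausalLM.from_pretrained(main_id, device_map="auto")
    assistant = AutoModelForCausalLM.from_pretrained(draft_id, device_map="auto")

    inputs = tokenizer("Explain speculative decoding in one sentence.",
                       return_tensors="pt").to(model.device)

    # assistant_model switches generate() into assisted decoding
    out = model.generate(**inputs, assistant_model=assistant, max_new_tokens=128)
    print(tokenizer.decode(out[0], skip_special_tokens=True))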

Havoc 18 hours ago|||
Normally when LM Studio doesn't like it, it's because of the presence of mmproj files in the folder. Sometimes removing them helps it show up.

They're somehow connected to vision and block speculative decode... don't ask me how/why though.

For Gemma specifically, I've had more luck with speculative decoding via the llama-server route than LM Studio, along the lines of the command below.
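
A hypothetical llama-server invocation for once the Gemma 4 MTP PR lands; flag names vary across llama.cpp versions and the file names are placeholders:

    # -m: main model, -md: draft model (speculative decoding)
    # --draft-max/--draft-min: how many tokens to draft per step
    # -ngl/-ngld: GPU layers for main and draft models
    llama-server \
      -m gemma-4-26B-A4B-it-Q4_K_M.gguf \
      -md gemma-4-26B-A4B-it-assistant-Q8_0.gguf \
      --draft-max 8 --draft-min 1 \
      -ngl 99 -ngld 99 \
      --port 8080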

svachalek 18 hours ago||
I've gotten it to work with other models. The main and draft models usually have to be perfectly aligned in terms of provider, quantization, etc. It might be a bit before you can get a matched set.
julianlam 17 hours ago||
Really excited to try this once it is merged into llama.cpp.

Gemma 4 26B-A4B is much quicker on my setup than Qwen3.6-35B-A3B (by about 3x), so the thought of a 1.5x speedup is tantalizing.

I have tried draft models with limited success (even the smaller 3B draft model alongside a dense 14B Ministral model introduced too much overhead).

VHRanger 17 hours ago|
On vLLM with a 5090 I get 120-180 TPS with the AWQ 4-bit quant + MTP speculative decoding.

For Gemma 4 26B, same quantization, I get >200 TPS (rough setup sketched below).

Also note that Qwen is extremely inefficient in reasoning; its reasoning chains are ~3x longer than Gemma's on average.
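
A rough sketch of that setup; the exact speculative_config keys (and whether the MTP drafter needs a dedicated method) depend on the vLLM version, so treat the names as assumptions:

    # Illustrative only: model IDs from this thread, config keys may differ.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="google/gemma-4-26B-A4B-it",
        quantization="awq",                 # the 4-bit AWQ quant mentioned above
        speculative_config={
            "model": "google/gemma-4-26B-A4B-it-assistant",  # MTP drafter
            "num_speculative_tokens": 4,
        },
    )

    out = llm.generate(["Why is the sky blue?"], SamplingParams(max_tokens=128))
    print(out[0].outputs[0].text)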

regexorcist 16 hours ago||
Sounds like a game changer if I see that kind of speedup on my hardware. So far I've preferred Qwen 3.6 because of its better tool handling, even though Gemma 4 is faster, but I saw they've updated the model template and that's supposed to be better now. Looking forward to trying this with llama.cpp.
ch_sm 15 hours ago|
Gemma 4 has a specific problem with tool calls that affects most runtimes. Fixes for Ollama and vLLM are being worked on right now.
adrian_b 14 hours ago|||
The chat templates of all Gemma 4 models were updated 7 days ago to fix some bugs related to invoking tools.

So any tests done with models that have not been updated in the last few days are no longer relevant; they must be repeated after updating the models and regenerating any derived file formats, like GGUF files.

apexalpha 15 hours ago|||
I read somewhere that you need to drop the temperature to 0.1 on Gemma for tools; a sketch of what that looks like is below.

Not sure why (too amateur, sorry).

Though I think Qwen was natively trained on tool calling.
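
For illustration, dropping the temperature against an OpenAI-compatible endpoint; the URL, model name and tool are placeholders:

    # Hypothetical example: a single tool call with low temperature.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    tools = [{
        "type": "function",
        "function": {
            "name": "get_revenue",  # made-up tool for illustration
            "description": "Return revenue for a given month",
            "parameters": {
                "type": "object",
                "properties": {"month": {"type": "string"}},
                "required": ["month"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gemma-4-26b-a4b-it",
        messages=[{"role": "user", "content": "What was revenue last month?"}],
        tools=tools,
        temperature=0.1,  # the low temperature suggested above
    )
    print(resp.choices[0].message.tool_calls)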

great_psy 5 hours ago||
This might be silly, but... since the assistant models are so much smaller than the full models, what if we just used those smaller models on their own?

Any idea how much worse they would be? Or is the issue that their error would really diverge as you accept more of their tokens?

amdivia 4 hours ago||
I think they'd be far worse on their own.

Predicting "America" in "The United States of ..." is a different task from predicting the whole sentence.

So the small model is laying the blocks, and the bigger model is cementing them in place or kicking them down. The bigger model's course correction is what keeps the smaller model's predictions relatively on track (roughly the loop below).
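
A toy version of that lay-bricks-then-cement loop, using greedy acceptance (the real scheme in the paper linked downthread uses rejection sampling, and the verify step is a single batched forward pass):

    # draft_next/target_next stand in for real model calls: each takes a
    # token list and returns that model's next token.
    def speculative_step(prompt, draft_next, target_next, k=4):
        # 1. Small model lays the blocks: propose k tokens greedily.
        ctx = list(prompt)
        drafted = []
        for _ in range(k):
            tok = draft_next(ctx)
            drafted.append(tok)
            ctx.append(tok)

        # 2. Big model cements or kicks down: keep proposals only while
        #    they match its own prediction at each position.
        ctx = list(prompt)
        accepted = []
        for tok in drafted:
            if target_next(ctx) != tok:  # mismatch: discard the rest
                break
            accepted.append(tok)
            ctx.append(tok)

        # 3. Always emit one token from the big model itself, so a bad
        #    guess just means "no bonus tokens", never a wrong output.
        accepted.append(target_next(ctx))
        return accepted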

zozbot234 4 hours ago||
I assume these are just output layers trained on the hidden state of the larger model; that's how MTP works. It's not a separate drafting model. A toy version of the idea:
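
Shapes and naming here are made up, and real MTP heads are more involved, but the core is a few extra projections over the same hidden states the normal LM head sees:

    import torch
    import torch.nn as nn

    class MTPHeads(nn.Module):
        """Toy multi-token-prediction heads over a frozen backbone."""
        def __init__(self, hidden_size: int, vocab_size: int, n_future: int = 3):
            super().__init__()
            # head[i] predicts the token at offset t+2+i from the hidden
            # state that the ordinary LM head uses to predict t+1.
            self.heads = nn.ModuleList(
                nn.Linear(hidden_size, vocab_size) for _ in range(n_future)
            )

        def forward(self, hidden: torch.Tensor) -> list[torch.Tensor]:
            # hidden: (batch, seq, hidden_size) from the big model's last layer
            return [head(hidden) for head in self.heads]

    heads = MTPHeads(hidden_size=4096, vocab_size=256_000)
    h = torch.randn(1, 10, 4096)      # stand-in hidden states
    drafts = heads(h)                 # one logits tensor per future offset
    print([d.shape for d in drafts])
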
vhiremath4 17 hours ago||
So this is like branch prediction in a CPU? Except the probability is baked into the model itself, so it's even more reliable.
Lihh27 15 hours ago|
Similar idea, but the failure mode is better: a branch mispredict burns cycles, while a bad guess here usually just means no bonus tokens. https://arxiv.org/abs/2211.17192
TOMDM 12 hours ago|||
As long as you're not bound on parallelism or bandwidth, it's "free"; if you are constrained on either resource, your lighter predictor model just needs to save more cycles than it costs on average. The paper linked above gives the back-of-envelope for that:
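
With per-token acceptance rate a and k drafted tokens per step, the paper's expected number of tokens produced per target-model pass is (1 - a^(k+1)) / (1 - a). A quick calculation:

    # Expected tokens per verify pass (Leviathan et al., linked above).
    def expected_tokens(a: float, k: int) -> float:
        return (1 - a ** (k + 1)) / (1 - a)

    for a in (0.6, 0.8, 0.9):
        print(a, round(expected_tokens(a, k=4), 2))  # 2.31, 3.36, 4.1
    # The end-to-end speedup is roughly this number discounted by the
    # drafter's own cost, so a cheap drafter with high acceptance wins.
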
dchftcs 8 hours ago|||
A bad guess still costs cycles, but the penalty is smaller than a branch mispredict, at least as things stand. If we ever add some kind of pipelining, where downstream work assumes the speculative decode was correct, it'll get expensive again.
mchusma 18 hours ago||
I find it puzzling that Google doesn't actively promote its own cloud for Gemma 4 inference. Open source is great, love it. But shouldn't Google want me to be able to use and pay for it through Gemini and Vertex?
WarmWash 17 hours ago||
A key thing to understand about Google is that under the hood it is a collection of extremely powerful fiefdoms (many of which would stand on their own as a Fortune 500, hell Fortune 100, company) that are all acting in their own interest. It's almost closer to a conglomerate than a company, where Google needs to bid internally against external players for resources.

If Gemma 4 is less lucrative than Claude to the Google Cloud kingdom, the Cloud kingdom will want you using Claude.

anthonypasq 16 hours ago||
Interesting. Presumably this is why Google is selling TPUs externally instead of hoarding them for DeepMind.
Havoc 18 hours ago|||
There is a decent YouTube video here going through what Google's logic with Gemma overall might be:

https://www.youtube.com/watch?v=sXgZhGzqPmU

As for why the cloud offers it: I think it's just an effort to promote the brand. The Gemmas are pretty small, so Google can host them without it being a major drain on the company. They have the infra anyway.

Farmadupe 18 hours ago|||
I wonder if, for a model this small with a permissive license, it might not be worth their time to host a commercial-grade inference stack.

It might be easier to chuck it over the fence and let other providers handle it, as it'll run on almost any commercial-grade card.

Also speculating, but I wonder if it might create a bit of a pricing problem relative to Gemini Flash Lite, depending on serving cost and output quality.

As a comparison, despite being SotA for their size, the smallest Qwen models on OpenRouter (27B and 35B) are not at all worth using, as there are way bigger and better models for a lower price on a per-token basis.

mchusma 12 hours ago|||
If you believe a lot of the metrics, Gemma 31B is much better than Flash Lite. It seems like I should be able to pay Google to use it, and there should be at least a secondary call to action explaining how to do that, but it's missing from the blog post entirely.
disiplus 18 hours ago|||
I don't know what you are talking about. I replaced an older GPT-4o with a finetuned Qwen. There is a huge amount of "AI" that can be done with those models, or partly by those models. A huge number of people would not notice the difference, and if you prepare the context correctly, an even bigger slice of people would not notice.
Farmadupe 17 hours ago|||
If it helps, I mean it in a really literal sense. Qwen3.6 27B is currently $3.20 per million tokens on OpenRouter, which is way overpriced. As good as the 27B is, Kimi K2.5 is $3.00 and it's just in another league in terms of capability. There's no reason to spend money on the former.

And even Alibaba's own qwen3.6-plus is $1.95, so it's easy to conclude that neither Alibaba nor anyone else is really interested in hosting that model.

And don't get me wrong, I fully agree with you: Qwen3.6 27B is an amazing model. I run it on my own hardware and I'm constantly surprised by what it can zero-shot.

dakolli 18 hours ago|||
Genuinely curious: what are you "fine tuning" these smaller models to do reliably? I hear this talked about a lot, but very few people actually cough up examples, and I'd love to hear of one.
disiplus 18 hours ago||
It depends. A super small one is finetuned to do function calling instead of sending the request to a big model and waiting: you ask for last month's revenue, I do a small LLM function call -> show results. Some bigger ones do analysis, summarization, classification. What is great with the smaller ones (and I'm looking at 2B, 4B) is that you can get huge throughput with just vLLM and a couple of consumer GPUs. What I usually do is basically distillation of a big model onto a smaller one, roughly the loop sketched below.
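
The data-generation half of that loop might look like this; the endpoint, model names and prompts are all placeholders:

    # Hypothetical sketch: collect teacher outputs, then feed the JSONL
    # to any SFT trainer to finetune the 2B/4B student.
    import json
    from openai import OpenAI

    teacher = OpenAI(base_url="https://big-model.example/v1", api_key="...")

    prompts = ["What was revenue last month?", "Classify this ticket: ..."]

    with open("distill.jsonl", "w") as f:
        for p in prompts:
            resp = teacher.chat.completions.create(
                model="big-teacher-model",   # placeholder name
                messages=[{"role": "user", "content": p}],
                temperature=0.0,             # deterministic teacher outputs
            )
            f.write(json.dumps({
                "messages": [
                    {"role": "user", "content": p},
                    {"role": "assistant",
                     "content": resp.choices[0].message.content},
                ]
            }) + "\n")
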
whoahwio 16 hours ago|||
Makes me wonder about the partnership with Apple to use Gemini. It's safe to assume Apple has a preference for on-device, and the best open model (for consumer hardware at least) is a Google property with an Apache 2 license. An interesting dynamic, and seemingly a bright spot in the market.
fomoz 8 hours ago|||
You can use it for free with Google AI Studio (free-tier and paid-tier accounts have different limits). Or use the paid version from Vertex AI, which is around 3x cheaper than Gemini 3 Flash.

I'm using Gemma 4 31B in my app with 5 agents, 1.5k requests per day, each.

djyde 6 hours ago||
I'm curious what tasks you use this model for?
nolist_policy 17 hours ago|||
What do you mean? It just works with Google AI Studio.
mchusma 12 hours ago||
Part of the issue is Google's complex web of products: there's Vertex, Gemini, Google AI Studio, Google AI Edge. I literally had trouble finding how to use this from my existing paid Gemini API account.
recsv-heredoc 18 hours ago||
Cloudflare offers excellent service for many of the open-weights models. It's fast, cheap and simple to set up. I can highly recommend them as an LLM provider.

They serve gemma-4-26b-a4b-it.

brikym 14 hours ago||
It doesn't seem that compelling to me. I can get the gpt-oss models cheaper from the OpenRouter nitro providers like Groq and Cerebras, and the model you mention costs the same on Cloudflare infra as it does through OpenRouter or direct.
andruby 17 hours ago||
They do indeed, see https://developers.cloudflare.com/workers-ai/models/. They seem to allow some free usage without a user account. Do they list the limits anywhere?
netdur 16 hours ago||
I am getting 21 t/s on a Fold 7; at a 1.8x speedup that would be 21 x 1.8 = 37.8 t/s, compared to the M1 Max's 54 t/s. That is impressive.
brikym 14 hours ago||
I wonder what latency and tok/s this model would be capable of on Groq or Cerebras. I have a couple of LLM-driven games [1][2] where speed is really important to the experience. Currently the best performance I can get is from the gpt-oss models on Groq or Cerebras, but they need quite a bit of extra context and tooling to correct for mistakes. I'm betting I'll be able to get the same performance much cheaper in the next few months.

[1] https://sleuththetruth.com [2] https://lextension.net/

nolist_policy 14 hours ago|
Works great in the latest version of Google AI Edge Gallery: https://github.com/google-ai-edge/gallery/releases