
Posted by amrrs 21 hours ago

Accelerating Gemma 4: faster inference with multi-token prediction drafters (blog.google)
600 points | 284 comments
sigmar 19 hours ago|
>try them directly on Google AI Edge Gallery for Android or iOS.

I'm not seeing any update to the app on my android phone... maybe later today?

>We’ve published an in-depth technical explainer

I was expecting a PDF link, but this goes to a brief article on twitter/X. lol, okay...

nolist_policy 16 hours ago|
It's up on GitHub: https://github.com/google-ai-edge/gallery/releases
tannhaeuser 18 hours ago||
Tested the gemma4 26 MoE 4-bit quantized GGUF on llama.cpp following these guides with mmap'd I/O on a 16GB MBP, and it was unbearably slow (0.0 t/s).
OliverSmith34 5 hours ago||
The best iOS inference model comes from Google..
deskamess 20 hours ago||
Did DeepSeek come up with MTP? It was listed prominently in their recent paper as being carried forward from the previous release.
logickkk1 17 hours ago|
i think this is mixing two separate ideas. MTP is the training-side piece. speculative decoding is the inference trick. DeepSeek V3 used MTP as an auxiliary loss. the 2022 Google paper is speculative decoding. now Google is combining them. https://arxiv.org/abs/2404.19737
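The split described here can be sketched in a few lines: a cheap drafter proposes k tokens, the target model verifies them (in one forward pass in a real system) and keeps the longest agreeing prefix. A toy greedy version with made-up stand-in "models" (plain functions, not real LLMs):

```python
def speculative_decode_step(draft_next, target_next, prefix, k):
    """One round of greedy speculative decoding.

    draft_next / target_next map a token sequence to the next token
    (stand-ins for argmax over model logits). Returns the tokens
    appended this round.
    """
    # 1. Drafter proposes k tokens autoregressively (cheap).
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Target verifies every drafted position; accept until the
    #    first disagreement, then substitute the target's own token.
    accepted, ctx = [], list(prefix)
    for t in draft:
        expected = target_next(ctx)
        if expected == t:
            accepted.append(t)      # drafter agreed with target: keep it
            ctx.append(t)
        else:
            accepted.append(expected)  # first mismatch: take target's token, stop
            break
    else:
        # All k drafted tokens accepted; the verify pass yields one
        # bonus token for free.
        accepted.append(target_next(ctx))
    return accepted

# Toy "models": target counts upward; drafter agrees only below 5.
target_next = lambda ctx: (ctx[-1] + 1) % 10
draft_next = lambda ctx: (ctx[-1] + 1) % 10 if ctx[-1] < 5 else 0
```

Either way you make progress of at least one token per target pass, and up to k+1 when the drafter is on track, which is where the speedup comes from.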
deskamess 16 hours ago||
Oh... so MTP is not speculative decoding? The (T)oken (P)rediction made me think it was on the inference side. I shall read the paper.

Edit: Ok, I understand now. You are saying that MTP has two aspects. 1) The training (for the mini-models to generate tokens), and 2) The actual speculative decoding implementation on the inference side (which uses those trained mini-models).
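Right, and the training-side piece is mostly label bookkeeping: each extra MTP head is trained against targets shifted one step further into the future (head 0 is the ordinary next-token head). A minimal sketch of how those per-head targets are built, loosely following the DeepSeek-V3 setup (function name made up):

```python
def mtp_targets(tokens, depth):
    """Per-head training targets for multi-token prediction.

    Head d at position i is trained to predict tokens[i + 1 + d],
    so head 0 is standard next-token prediction and deeper heads
    look further ahead. Positions that run off the end are dropped.
    """
    return [tokens[1 + d:] for d in range(depth)]
```

At inference, those same extra heads (or a small drafter distilled from them) supply the speculative tokens that the main model then verifies.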

shay_ker 20 hours ago||
curious that they are doing speculative decoding and not baking MTP into the model, like Nemotron

https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...

zargon 20 hours ago|
They're using the term speculative decoding but doing MTP. It's the same thing as Nemotron, but Google removed the MTP heads from the original safetensors release. (They were not removed from the LiteRT format.)
Alonski 8 hours ago||
This is sort of similar to Ethereum and maybe a bit of zero knowledge proofs but with the LLM handling both sides.
larnon 16 hours ago||
Anyone tried this with vLLM yet? I'm confused about how to turn it on, tbh.
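Haven't tried it with Gemma myself, but as a starting point: recent vLLM versions expose speculative decoding through a JSON `--speculative-config` (older releases used separate `--speculative-model` / `--num-speculative-tokens` flags), so something along these lines, with placeholder model names:

```shell
# Sketch only: flag names vary across vLLM versions; check `vllm serve --help`.
vllm serve <target-model> \
  --speculative-config '{"model": "<draft-model>", "num_speculative_tokens": 5}'
```

Whether the Gemma 4 drafter checkpoints are wired up as a supported draft model there is a separate question; that depends on the release artifacts.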
ThouYS 15 hours ago||
don't know about this guy, but qwen3.6:27b with the UD 4bit quant and little-coder/pi has been amazing. the first local LLM experience that can do actual meaningful work
brcmthrowaway 13 hours ago|
What is UD?
ac29 13 hours ago||
Unsloth Dynamic, just some branding from Unsloth for their quants (other people use similar techniques)
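For anyone curious what's underneath the branding: most k-bit quants are blockwise absmax quantization, and the "dynamic" part is selectively keeping sensitive layers at higher precision. A toy sketch of the blockwise part (not Unsloth's actual algorithm; names made up):

```python
def quantize_4bit(values, block=4):
    """Blockwise absmax 4-bit quantization: each block stores one
    float scale plus signed 4-bit integers in [-7, 7]."""
    out = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        scale = max(abs(v) for v in chunk) or 1.0
        out.append((scale, [round(v / scale * 7) for v in chunk]))
    return out

def dequantize_4bit(blocks):
    """Invert the mapping (lossily) back to floats."""
    return [scale * q / 7 for scale, qs in blocks for q in qs]
```

The per-block scale is why outlier weights hurt: one large value in a block flattens the resolution for its neighbors, which is exactly what the layer-selection heuristics try to work around.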
noashavit 17 hours ago||
Gemma4:e4b is a huge upgrade
brcmthrowaway 20 hours ago|
Is Google's local-model strategy aimed at taking the big AI cloud labs down a notch?
whoahwio 18 hours ago|
dumping money into Gemma and shorting new data center buildouts is a level of Corporate Vision that ends up in an HBS case study