Posted by ray__ 15 hours ago

TurboQuant: Redefining AI efficiency with extreme compression(research.google)
428 points | 119 comments
zeeshana07x 11 hours ago|
The gap between how this is described in the paper vs. the blog post is pretty wide. Would be nice to see more accessible writing from research teams — not everyone reading is an ML engineer
om8 10 hours ago||
These are very different media types with very different goals.
dev_tools_lab 10 hours ago||
Agreed. The practical implications are often more interesting than the math anyway — smaller models running locally means you can afford to run multiple models in parallel for cross-validation, which changes how you approach tasks like code analysis or bug detection.
ssijak 9 hours ago||
For my grug brain can somebody translate this to ELIgrug terms?

Does this mean I would be able to run a 500B model on my 48GB MacBook without losing quality?

x_may 9 hours ago|
It's KV cache compression, i.e. how much memory the model needs to extend its context. It does not affect the weight size.
macleginn 9 hours ago||
"TurboQuant proved it can quantize the key-value cache to just 3 bits without requiring training or fine-tuning and causing any compromise in model accuracy" -- what do the 3 bits correspond to? Hardly individual keys or values, since that would limit each of them to 8 different vectors.
carlosvega 8 hours ago||
It's the number of bits per coordinate. So 1 bit gives a 2x2 grid, and 3 bits gives a 64-cell grid (2^3 x 2^3). Here's a demo.

https://mesuvash.github.io/blog/2026/turboquant-interactive/
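
To make "bits per coordinate" concrete, here's a toy sketch of uniform scalar quantization (not the paper's actual scheme, which uses more sophisticated rotations; `quantize` is a hypothetical helper):

```python
import numpy as np

def quantize(x, bits):
    """Uniformly quantize each coordinate of x to 2**bits levels over its range."""
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (levels - 1)
    codes = np.round((x - lo) / scale).astype(np.uint8)  # b-bit integer codes
    return codes * scale + lo                            # dequantized coordinates

rng = np.random.default_rng(0)
v = rng.standard_normal(128)
for b in (1, 3, 8):
    err = np.abs(v - quantize(v, b)).max()
    print(f"{b} bits: max abs error {err:.4f}")
```

Each coordinate gets its own b-bit code, so a vector of dimension d costs d*b bits, not b bits total — that's why 3 bits doesn't mean only 8 possible vectors.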

jbellis 9 hours ago||
The explanation is terrible, but it's clear that it's not actually lossless.
maurelius2 13 hours ago||
I understand the fundamentals, but beyond that I'm somewhat at a loss. Can someone tell me how the compression impacts performance?
dryarzeg 12 hours ago||
In short: for many inference tasks the bottleneck is memory bandwidth. Suppose you have a machine with 256 GB/s of memory bandwidth, and you want to run inference for a 4B model (a model with 4 billion parameters). If you load the model in BF16 format (16 bits per weight), each forward pass (i.e. each token generated) will require roughly ~8 GB of memory traffic. So 256/8 = 32 t/s, and that's the generation speed you're strictly capped at even if your processing power is measured in exaFLOPS.

Now say you instead quantize the model and run the quantized version. Suppose you made a Q4_K_M version (4 bits, plus some weights that take more). Each forward pass will now take roughly 2-3 GB of memory bandwidth (rough approximations; in practice it'll be around 2 GB), so even in the worst case 256/3 = 85.3 t/s, while 256/2 = 128 t/s.

Quantization can reduce the quality of the model and lower its performance, but with most modern quantization methods those losses are usually negligible (although, of course, they're still present). So, as you can see, quantization "widens" the memory bottleneck (it doesn't remove it fully) while still, in most cases, preserving acceptable quality.
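
The back-of-envelope numbers above can be reproduced with a few lines (`tokens_per_sec` is a hypothetical helper; it ignores activation and KV cache traffic, so it's an upper bound):

```python
def tokens_per_sec(params_b, bits_per_weight, bandwidth_gbs):
    """Rough bandwidth ceiling: each generated token streams all weights once."""
    bytes_per_token = params_b * 1e9 * bits_per_weight / 8  # weight bytes read per forward pass
    return bandwidth_gbs * 1e9 / bytes_per_token

print(tokens_per_sec(4, 16, 256))  # BF16 4B model at 256 GB/s -> 32.0 t/s
print(tokens_per_sec(4, 4, 256))   # pure 4-bit version        -> 128.0 t/s
```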

(Sorry for my terrible English, it's not my native language)

rohansood15 4 hours ago||
The paper is about vector quantization, which affects the KV cache, not model weights/sizes.
valine 12 hours ago||
So let’s start with a really simple decoder transformer with a single layer and single attention head, and train it to predict the next token in a sequence of text. To predict the next token you need a few things: a query for the very last token in the sequence, and a key and value for every prior token. You take your query and compute a dot product with every prior key (two large vectors in, scalar attention score out). That scalar attention score first goes through softmax, and then becomes the weight you use to compute a weighted average of your values; the new value goes through the mlp, the mlp output is projected into the logits from which you sample your next token (that’s the general idea at least; I skipped a few steps).

The last query in the sequence will be new for every new token you predict, but the set of prior keys and values stay the same, ie keys and values are reusable. The key value cache gets bigger and bigger for each new token you add to the sequence, and that’s where compression comes in. You have to store the keys and values in vram, and you’d like to keep the size down by not storing the raw uncompressed tensors. To make this work well your compression needs two things: it needs to be fast so that you can compress and decompress on the fly, and it needs to play well with softmax attention. Prior attempts at compression usually suck at one or the other, either the speed to decompress is too slow and your token/s takes a hit, or you lose important precision and the model output quality suffers. The claim in the paper is that they’ve made progress on both.
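
The decode step described above can be sketched in numpy (a toy single-head sketch with illustrative shapes, not an optimized implementation; `decode_step` is a made-up name):

```python
import numpy as np

def decode_step(q, k_cache, v_cache, k_new, v_new):
    """One decode step of single-head attention with a growing KV cache."""
    k_cache = np.vstack([k_cache, k_new])       # append the new token's key to the cache
    v_cache = np.vstack([v_cache, v_new])       # ...and its value
    scores = k_cache @ q / np.sqrt(q.shape[0])  # dot product of the query with every key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax -> attention weights
    out = weights @ v_cache                     # weighted average of the values
    return out, k_cache, v_cache

d = 8
rng = np.random.default_rng(0)
k_cache = rng.standard_normal((5, d))           # cached keys for 5 prior tokens
v_cache = rng.standard_normal((5, d))           # cached values for 5 prior tokens
q = rng.standard_normal(d)                      # query for the newest token
out, k_cache, v_cache = decode_step(q, k_cache, v_cache,
                                    rng.standard_normal(d), rng.standard_normal(d))
print(out.shape, k_cache.shape)                 # (8,) (6, 8)
```

The caches grow by one row per generated token, which is exactly the memory that KV cache quantization targets.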

edg5000 12 hours ago||
So limiting max context length also reduces VRAM needs a bit? If cache is 20% of total, 1/10th of context as a limit would mean 18% total memory reduction.
valine 12 hours ago||
Yup, exactly. In principle it helps both inference speed (by reducing memory bandwidth usage) and the memory footprint of your KV cache.
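
For a sense of scale, the cache size follows directly from the model shape (illustrative Llama-style numbers; `kv_cache_gb` is a hypothetical helper):

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bits):
    """KV cache size in GB: one key and one value per layer per cached token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bits / 8 / 1e9

# 32 layers, 8 KV heads of dim 128, 32k context
print(kv_cache_gb(32, 8, 128, 32768, 16))  # fp16 cache: ~4.29 GB
print(kv_cache_gb(32, 8, 128, 32768, 3))   # 3-bit cache: ~0.81 GB
```

Halving the context halves the cache linearly, and quantizing to 3 bits shrinks the same cache by 16/3.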
lwhi 8 hours ago||
Will this help us run models locally?
moktonar 13 hours ago||
Aren’t polar coordinates for an n-dim vector still n-1 angles plus 1 for the radius? If so, I understand that the angles can be quantized better, but when the radius r is big, the error is large for heavily quantized angles, right? What am I missing?
amitport 13 hours ago|
r is a single value per vector. You don't have to quantize it, you can keep it and quantize the billion+ other coordinates of the vector.
mungoman2 12 hours ago||
What they're saying is that the error for a vector increases with r, which is true.

Trivially, with r=0, the error is 0, regardless of how heavily the direction is quantized. Larger r means a larger absolute error in the reconstructed vector.

amitport 12 hours ago||
Yes. The important part is that the normalized error does not increase with the dimension of the vector (which does happen when using biased quantizers).

It is expected that bigger vectors have proportionally bigger error, nothing can be done by the quantizer about that.

lucrbvi 12 hours ago||
Sounds like Multi-Head Latent Attention (MLA) from DeepSeek
veunes 11 hours ago|
Nah, those are completely different beasts. DeepSeek's MLA solves the KV cache issue via low-rank projection - they literally squeeze the matrix through a latent vector at train time. TurboQuant is just Post-Training Quantization where they mathematically compress existing weights and activations using polar coordinates
esafak 7 hours ago||
No, it is about compressing the KV cache; see How TurboQuant works.
_s_a_m_ 7 hours ago||
has the word "advanced", gotta be good
naasking 7 hours ago||
This sounds great! TurboQuant does KV cache compression using quantization via rotations, and ParoQuant [1] does weight compression using quantization via rotations! So we can get 4-bit weights that match bf16 precision, and the KV cache goes down to 3 bits per key. This brings larger models and long contexts into the range of "possibly runnable" on beefy consumer hardware.

[1] https://github.com/z-lab/paroquant
