TurboQuant: Redefining AI efficiency with extreme compression

Posted by ray__ 14 hours ago

TurboQuant: Redefining AI efficiency with extreme compression(research.google)

428 points | 119 comments

amitport 11 hours ago|

This is a great development for KV cache compression. I did notice a missing citation in the related works regarding the core mathematical mechanism, though. The foundational technique of applying a geometric rotation prior to extreme quantization, specifically for managing the high-dimensional geometry and enabling proper bias correction, was introduced in our NeurIPS 2021 paper, "DRIVE" (https://proceedings.neurips.cc/paper/2021/hash/0397758f8990c...). We used this exact rotational approach and a similar bias correction mechanism to achieve optimal distributed mean estimation. I also presented this work and subsequent papers in a private invited talk at Google shortly after publication. Given the strong theoretical overlap with the mechanisms in TurboQuant and PolarQuant, I hope to see this prior art acknowledged in the upcoming camera-ready versions.

eecc 7 hours ago||

Pardon my simplistic question, but when you say rotation you’re essentially talking about diagonalization, aren’t you?

So storing the diagonal as a matrix and the new bases is more compact?

amitport 6 hours ago|||

In this context, the rotation is for spreading energy and ensuring predictable coordinate distributions rather than diagonalization; it makes coordinate-wise quantization much more computationally efficient, though it throws away learnable structure.
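A minimal sketch of the "spreading energy" point, assuming a dense random orthogonal matrix built from a QR decomposition (the names `random_rotation` and `peak_to_rms` are illustrative, not from either paper). A vector with one dominant coordinate keeps its norm after the rotation, but no single coordinate stands out any more:

```python
import numpy as np

def random_rotation(d, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Flip column signs so the result is uniformly (Haar) distributed.
    return q * np.sign(np.diag(r))

def peak_to_rms(v):
    """Largest absolute coordinate relative to the root-mean-square coordinate."""
    return np.max(np.abs(v)) / np.sqrt(np.mean(v ** 2))

rng = np.random.default_rng(0)
d = 256
x = 0.1 * rng.standard_normal(d)
x[0] = 50.0                      # one large outlier coordinate

y = random_rotation(d, rng) @ x  # same vector in a random orthonormal basis

print("norm before/after:", np.linalg.norm(x), np.linalg.norm(y))
print("peak/RMS before:  ", peak_to_rms(x))
print("peak/RMS after:   ", peak_to_rms(y))
```

The matrix here may include a reflection (determinant -1), which does not matter for this purpose: lengths and inner products are preserved either way.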

eecc 5 hours ago||

ah ok, so intuitively it's like minimizing the error when replacing the values with a well-known distribution. So all you need to carry along is the rotation and the assumption that there is some amount of loss.

tripplyons 2 hours ago|||

There are papers that try to quantize angles associated with weights because angles have a more uniform distribution. I haven't read this specific paper, but it looks like it uses a similar trick at a glance.

busfahrer 8 hours ago|||

I just today learned about Multi-Head Latent Attention, which is also sort of a way of compressing the KV cache. Can someone explain how this new development relates to MHLA?

yorwba 7 hours ago|||

Multi-Head Latent Attention is a redesigned attention mechanism that produces lower-dimensional KV-cache entries. Vector quantization can store KV-cache entries using a small number of bits per dimension while ensuring that the resulting attention scores don't change too much. So MLA needs to be part of the model from the beginning of training, whereas VQ can be retrofitted afterwards, and you could also combine the two.

tripplyons 2 hours ago|||

MLA makes it so the keys and values used are a function of a smaller latent vector you cache instead of a key and a value for each token. KV cache quantization reduces the size of the values in the cache by using fewer bits to store each value. These two approaches operate on different parts of the process so they can be used in combination. For example, you can quantize the latents that are stored for MLA.
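Some back-of-the-envelope arithmetic for how the two approaches stack. The model shape and latent width below are hypothetical, chosen only to make the numbers round, and the MLA formula is a simplification that counts a single latent vector per layer and token and ignores any per-block scale factors a real quantizer would store:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits_per_value):
    """Standard cache: one key and one value vector per head, layer and token."""
    values = 2 * layers * kv_heads * head_dim * tokens  # factor 2: keys + values
    return values * bits_per_value // 8

def latent_cache_bytes(layers, latent_dim, tokens, bits_per_value):
    """Simplified MLA-style cache: one shared latent vector per layer and token."""
    return layers * latent_dim * tokens * bits_per_value // 8

MIB = 2 ** 20
# Hypothetical shape, for illustration only.
layers, kv_heads, head_dim, tokens, latent_dim = 32, 8, 128, 4096, 512

print(kv_cache_bytes(layers, kv_heads, head_dim, tokens, 16) / MIB, "MiB: full KV, 16-bit")
print(kv_cache_bytes(layers, kv_heads, head_dim, tokens, 4) / MIB, "MiB: full KV, 4-bit")
print(latent_cache_bytes(layers, latent_dim, tokens, 16) / MIB, "MiB: latents, 16-bit")
print(latent_cache_bytes(layers, latent_dim, tokens, 4) / MIB, "MiB: latents, 4-bit")
```

The two savings multiply because one shrinks the number of cached values and the other shrinks the bits per value.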

jmalicki 6 hours ago|||

If they didn't cite your paper that's bullshit.

But if they read your paper enough that they invited you to a talk, that probably means they were far enough along to independently inventing it that they were going to do so anyway, and wanted to chat with someone who was also doing the thing they were already doing. Good ideas tend to reveal themselves to anyone who is aware of the problem.

amitport 5 hours ago|||

To be clear, I am not claiming they stole an idea. They have done significant independent research. However, a specific part regarding the treatment of rotation with bias correction relates to prior work, and it would be appropriate to have that recognized.

CyberDildonics 3 hours ago||||

That's rationalizing like crazy. If they knew about it they should have cited it.

ekjhgkejhgk 6 hours ago||||

Doesn't matter, you should still cite. It's basic manners in science.

kleiba 5 hours ago||

Exactly, that's why the section is called "Related Work".

efavdb 5 hours ago||||

The earlier paper was from 2021!

cubefox 5 hours ago|||

> But if they read your paper enough that they invited you to a talk, that probably means they were far enough along to independently inventing it

That's more than a stretch. They likely invited them because someone thought the abstract sounded interesting, or something like that.

sva_ 6 hours ago||

Schmidhuber'd

gavinray 6 hours ago||

Can someone ELI5 these two concepts please, which make no sense to me:

  > "TurboQuant starts by randomly rotating the data vectors. This clever step simplifies the data's geometry"

I don't understand how taking a series of data and applying a random rotation could mathematically lead every time to "simpler" geometry.

If I throw a bunch of shapes on the ground, tightly packed and touching each other, then rotate all of them, you can't guarantee that the new conglomerate shape is any more/less "simple" than before, right?

  > "Johnson-Lindenstrauss Transform to shrink complex, high-dimensional data while preserving the essential distances and relationships between data points. It reduces each resulting vector number to a single sign bit (+1 or -1)."

How can a boolean value preserve all of the relational and positional information between data points?

kingstnap 3 hours ago||

Other people have answered here but the real answer is that deep neural networks don't learn isotropic distributions of activations.

What happens is that you get very spiky activations: there are so-called "outlier" activations. An easy-to-read paper that tells you about this is SmoothQuant [0]. Another source, from Anthropic and the Mechanistic Interpretability people, calls these a "privileged basis" [1].

Now based on the weight symmetries of a typical transformer, these actually don't need to exist. Weight symmetries are the ways you can change the weights without actually affecting the mathematical function; there is a broad class of these because the linear algebra has a lot of redundancy in it.

But the behaviour of the Adam optimizer is such that you do end up w/ these things because it sort of more quickly optimizes to produce them. This comes from the fact it is an elementwise dynamic learning rate (and probably partly to do with the epsilon).

[0] https://arxiv.org/pdf/2211.10438 [1] https://transformer-circuits.pub/2023/privileged-basis/index...

gavinray 1 hour ago|||

From your second paper:

  > In particular, we can generate fixed random rotation matrices at initialization, and multiply them into the activations any time we read from or write to the residual stream.

I guess I was mistaken in assuming this part was part of the TurboQuant-specific innovations. Still an interesting concept though

Bolwin 3 hours ago|||

Do you know if this also applies to the muon optimizer? It seems to be replacing adamw

kingstnap 1 hour ago||

My guess is that probably not for Muon. What I said about ADAM was partly based on this blogpost I read some time ago, should have cited it as well [0].

The thing about Muon is that it doesn't have this specific feature of ADAM that causes it to "move along the diagonal". Basically, if you flatten the weights into a huge vector of a few billion elements, SGD moves along the gradient, which isn't biased. ADAM normalizes everything elementwise, so it sort of moves along a vector of +-1.

This isn't a proof or anything, but what you can imagine might be happening is that if you move along +-1, then you find spikey solutions somehow. Not sure how to prove that. Muon doesn't really do this, but it has its own sort of funky reshaping of the update (it moves along low rank directions).

[0] https://www.lesswrong.com/posts/yrhu6MeFddnGRSLtQ/adam-optim...
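A small sketch of the "moves along a vector of +-1" remark, using the textbook Adam update at its very first step, where both moment estimates start at zero (the function name `adam_first_step` and the gradient values are made up for illustration). After bias correction the step is g / (|g| + eps), which is essentially the sign of the gradient scaled by the learning rate:

```python
import numpy as np

def adam_first_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Parameter update produced by Adam at step t = 1 (moments start at zero)."""
    m = (1 - beta1) * grad            # first-moment estimate
    v = (1 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - beta1)           # bias correction for t = 1
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (np.sqrt(v_hat) + eps)

# Gradient entries spanning several orders of magnitude.
grad = np.array([3.0, -0.02, 500.0, -7.0])

print("SGD step: ", -1e-3 * grad)            # proportional to the gradient
print("Adam step:", adam_first_step(grad))   # every entry has nearly the same size
```

Later steps are smoothed by the running averages, so the update is not exactly a sign vector, but the elementwise normalization stays.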

lumost 5 hours ago|||

They are saying that models should be invariant to data's orientation - and only sensitive to the distance between vectors. This has a pretty significant effect on reducing the set of possible models, and may stabilize the optimization.

In simple terms, large ML models like LLMs often learn trivial rules such as "if the 21st decimal place of the 5th dimension in the embedding vector is 5 - then the image is of a cat." Learning such a memorization function is usually not what we are trying to do, and there are a variety of techniques to avoid these trivial solutions and "smooth" the optimization geometry.

photon_lines 5 hours ago|||

The whole goal of quantisation is to put the data into 'bins' so that it can be 'packed' and represented using fewer bits (less information). You can think of it as rounding, essentially (3.14159 -> 3). Sometimes the distribution of the data is non-ideal for separating it out into bins. Let's say our rounding rule is simple: we use a floor function, so 2.45 maps to 2 and 6.4543 maps to 6, and our bins simply map to the floor. A set of numbers like [3.11, 4.43, 5.78, 12.33, 34.32] would then map to [3, 4, 5, 12, 34]. To create bins for that set of numbers we would need 6 bits of information (2 to the power of 6 = 64), but this is mostly due to the fact that we have one huge outlier (34.32).

To get rid of this, the algorithm applies a random rotation matrix which 'distorts' the original data so that it is more evenly distributed among the possible bins. In linear algebra, a rotation matrix is an orthogonal matrix. When you multiply your vector by this matrix, you aren't changing the "amount" of data (the length of the vector remains the same), but you are recalculating every single number in that vector as a weighted sum of the originals. By a Central Limit Theorem style argument, when you sum up many random things, the result tends to look like a bell curve. This is the magic TurboQuant relies on: they don't know what your data looks like, but they know that after the rotation each coordinate follows a known Beta distribution (which looks like a bell curve in high dimensions), and they use this fact to transform the original data into a more 'tightly packed' distribution which allows them to pack (or quantise) the information more efficiently.

If most of the transformed data is huddled together in a predictable bell curve shape, you can pack your bins tightly around that shape, leading to much higher precision with fewer bits needed to store it. For example, after applying a rotation matrix, our original [3.11, 4.43, 5.78, 12.33, 34.32] might get mapped to something like [8.12, 8.65, 9.25, 10.53, 12.86] (made-up numbers to show the idea, not an actual rotation of that vector), and we can create bins which are both more accurate and need fewer bits to hold our original data set.

To create the most optimal bins, the Lloyd-Max algorithm is used. This algorithm is the gold standard for 1D quantisation. Its goal is to find the best places to put your "boundaries" (where you cut the data) and your "reconstruction values" (the number you store) to minimise the Mean Squared Error (MSE).

After applying this, you have your 'rounded' values (or quantized data), but there is still an error term which is missing from our data set, and this is where the residual bit comes in. That bit doesn't represent the original data (or vector); it represents our 'bias' after we apply the above algorithms. It's basically a '1-bit note' which lets you cancel out, on average, the bias that the quantisation step above introduces, so that the 'interactions' (or inner products) when we multiply our values together are accurate again even after transforming our original data. Does this make sense?
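The Lloyd-Max step can be sketched on samples as follows. This is a generic sample-based version fitted to Gaussian data as a stand-in for the post-rotation distribution, not the paper's closed-form codebook; `lloyd_max` and `quantize` are illustrative names:

```python
import numpy as np

def lloyd_max(samples, n_levels, n_iters=100):
    """Fit a 1D quantizer by alternating boundary and centroid updates."""
    # Start from evenly spaced quantiles of the data.
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(n_iters):
        bounds = (levels[:-1] + levels[1:]) / 2    # boundaries: midpoints between levels
        cells = np.searchsorted(bounds, samples)   # which cell each sample falls in
        for k in range(n_levels):
            members = samples[cells == k]
            if members.size:                       # keep the old level if a cell is empty
                levels[k] = members.mean()         # reconstruction value: cell mean
    return levels

def quantize(samples, levels):
    """Replace each sample by the nearest reconstruction value."""
    bounds = (levels[:-1] + levels[1:]) / 2
    return levels[np.searchsorted(bounds, samples)]

rng = np.random.default_rng(0)
x = rng.standard_normal(50_000)

lm_levels = lloyd_max(x, n_levels=4)                       # 2 bits per value
uniform_levels = np.linspace(x.min(), x.max(), 9)[1::2]    # centres of 4 equal-width bins

mse_lm = np.mean((x - quantize(x, lm_levels)) ** 2)
mse_uniform = np.mean((x - quantize(x, uniform_levels)) ** 2)
print("Lloyd-Max levels:", lm_levels, "MSE:", mse_lm)
print("uniform levels:  ", uniform_levels, "MSE:", mse_uniform)
```

The equal-width bins waste levels on the sparse tails, while the fitted levels crowd toward the centre where most of the samples are.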

nico 2 hours ago|||

Amazing explanation! Thank you so much for taking the time to put it together. It makes a lot of sense. I’m not the one who asked the question, but I was impressed by such an eloquent and clearly explained answer

gavinray 1 hour ago||||

I had to read this over a few times to piece it together, thanks for the thorough and digestible explanation!

rohansood15 3 hours ago|||

Thank you.

wordpad 5 hours ago||

They are not doing random rotation, simplification here means they are aligning the outliers. If you threw a bunch of shapes on the ground they are picking up one that rolled away and putting it with the others.

>How can a boolean value preserve all of the relational and positional information between data points?

They aren't reducing the entire vector to a boolean, only each of its dimensions.
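To make the "one sign bit per dimension" idea concrete, here is a sketch of a sign-bit inner-product estimator in the spirit of QJL, assuming a Gaussian projection matrix and a full-precision query. The constant comes from the identity E[(s·q) sign(s·k)] = sqrt(2/pi) <q, k> / ||k|| for a standard Gaussian vector s. The number of projections `m` is deliberately huge here to show that the estimate converges, so this is not a compression setting:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 20_000            # vector dimension, number of random projections

k = rng.standard_normal(d)   # "key": the vector that gets compressed
q = rng.standard_normal(d)   # "query": kept in full precision
k /= np.linalg.norm(k)
q /= np.linalg.norm(q)

S = rng.standard_normal((m, d))   # Gaussian projection matrix, shared by k and q
k_bits = np.sign(S @ k)           # all that is stored for k: m sign bits (plus its norm)

# Unbiased estimate of <q, k> from the sign bits of k and the projected query.
estimate = np.sqrt(np.pi / 2) * np.linalg.norm(k) * np.mean((S @ q) * k_bits)
exact = q @ k

print("exact inner product:", exact)
print("sign-bit estimate:  ", estimate)
```

No single bit carries the relationship between two points; the information is spread over many projections, and the error shrinks as more of them are averaged.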

akhenakh 6 hours ago||

Someone is already implementing it in llama.cpp: https://github.com/mudler/llama.cpp/commit/dee102db1bfd723c9...

GistNoesis 5 hours ago||

He even attempts to improve on the paper by replacing the random rotation operation which is O(d^2), by a Subsampled Randomized Hadamard Transform which can be computed in O(d*log d).

Hopefully the Johnson–Lindenstrauss lemma applies in the same way for SRHTransformed vectors as it does for randomly rotated vectors, the independence of the distribution laws of the coordinates remains, and therefore quantizing each coordinate independently is still theoretically sound.
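For reference, a sketch of the O(d log d) alternative: random sign flips followed by a fast Walsh-Hadamard transform. This is a generic NumPy version, not the code from the linked commit, and it omits the "subsampled" part of SRHT (it keeps all d outputs, so it acts as a structured rotation rather than a dimension reduction):

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform with orthonormal scaling; length must be a power of 2."""
    x = np.array(x, dtype=float)
    d = x.size
    assert d > 0 and (d & (d - 1)) == 0, "length must be a power of two"
    h = 1
    while h < d:                       # log2(d) butterfly stages, O(d) work each
        blocks = x.reshape(-1, 2, h)   # split every block of 2h entries into two halves
        a, b = blocks[:, 0, :], blocks[:, 1, :]
        x = np.stack([a + b, a - b], axis=1).reshape(d)
        h *= 2
    return x / np.sqrt(d)

def randomized_hadamard(x, signs):
    """Flip coordinate signs at random, then mix with the Hadamard transform."""
    return fwht(signs * x)

rng = np.random.default_rng(0)
d = 1024
signs = rng.choice([-1.0, 1.0], size=d)   # the only randomness that has to be stored

x = np.zeros(d)
x[3] = 10.0                               # all energy in one coordinate
y = randomized_hadamard(x, signs)

print("norm before/after:", np.linalg.norm(x), np.linalg.norm(y))
print("largest |coordinate| after:", np.max(np.abs(y)))
```

Only the d sign flips need to be kept, compared with d^2 entries for a dense rotation matrix.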

cpburns2009 6 hours ago|||

For some reason I thought the implementation would be way more complicated than that. I obviously lack the domain knowledge to tackle something like this, but it looks straightforward.

qingcharles 2 hours ago||

Agreed. Actual LOC is tiny. Very impressive PR.

vibe42 3 hours ago||

The pace of development in llama.cpp is really high, could see an implementation being merged in 4-6 weeks.

pstoll 7 hours ago||

And a group has published an independent working implementation today, nice to see:

https://github.com/tonbistudio/turboquant-pytorch

ilija139 4 hours ago|

It has a lot clearer explanation of the method than Google's own post.

ramon156 3 hours ago||

Well, yeah. Claude simplified it. That doesn't mean it's a better explanation.

benob 12 hours ago||

This is the worst lay-people explanation of an AI component I have seen in a long time. It doesn't even seem AI generated.

davesque 2 hours ago||

Yeah, and some parts of the article are just bizarre:

> Instead of looking at a memory vector using standard coordinates (i.e., X, Y, Z) that indicate the distance along each axis, PolarQuant converts the vector into polar coordinates using a Cartesian coordinate system. This is comparable to replacing "Go 3 blocks East, 4 blocks North" with "Go 5 blocks total at a 37-degree angle”

Why bother explaining this? Were they targeting the high school and middle school student reader base??

BenoitP 11 hours ago|||

It is AI generated. Or was written by someone a bit far from the technical advances IMHO. The Johnson-Lindenstrauss Lemma is a very specific and powerful concept, whereas in the article the QJL explanation is vacuous. A knowledgeable human would not have left the reader wondering how that relates to the Lemma.

hrmtst93837 2 hours ago||

Honestly, the bigger miss is people treating JL as some silver bullet for "extreme" compression, as if preserving pairwise distances for a fixed point set somehow means you still keep the task-relevant structure once you're dealing with modern models.

Try projecting embeddings this way and watch your recall crater the moment you need downstream task performance instead of nearest-neighbor retrieval demos. If you're optimizing for blog post vibes instead of anything measurable, sure, call it a breakthrough.

spencerflem 12 hours ago||

I think it is though-

“ TurboQuant, QJL, and PolarQuant are more than just practical engineering solutions; they’re fundamental algorithmic contributions backed by strong theoretical proofs. These methods don't just work well in real-world applications; they are provably efficient and operate near theoretical lower bounds.”

NoahZuniga 9 hours ago|||

Genius new idea: replace the em-dashes with semicolons so it looks less like AI.

tux3 8 hours ago|||

You're absolutely right. That's not just a genius idea; it's a radical new paradigm.

Quarrel 5 hours ago|||

Damnit.

There goes another bit of my writing style that will get mistaken for an LLM.

integralid 11 hours ago||||

I also instinctively reacted to that fragment, but at this point I think this is overreacting to a single expression. It's not just a normal thing to say in English, it's something people have been saying for a long time before LLMs existed.

nvme0n1p1 11 hours ago|||

There are tells all over the page:

> Redefining AI efficiency with extreme compression

"Redefine" is a favorite word of AI. Honestly no need to read further.

> the key-value cache, a high-speed "digital cheat sheet" that stores frequently used information under simple labels

No competent engineer would describe a cache as a "cheat sheet". Cheat sheets are static, but caches dynamically update during execution. Students don't rewrite their cheat sheets during the test, do they? LLMs love their inaccurate metaphors.

> QJL: The zero-overhead, 1-bit trick

> It reduces each resulting vector number to a single sign bit (+1 or -1). This algorithm essentially creates a high-speed shorthand that requires zero memory overhead.

Why does it keep emphasizing zero overhead? Why is storing a single bit a "trick?" Either there's currently an epidemic of algorithms that use more than one bit to store a bit, or the AI is shoving in extra plausible-sounding words to pad things out. You decide which is more likely.

It's 1:30am and I can't sleep, and I still regret wasting my time on this slop.

TeMPOraL 5 hours ago|||

I say you're fixating on the wrong signal here. "Redefine" and "cheat sheet" are normal words people frequently use, and I see worse metaphors in human-written text routinely.

It's the structure and rhythm at the sentence and paragraph levels that's the current tell, as SOTA LLMs all seem to overuse clarification constructs like "it's not X, it's Y" and "it's X, a Y and a Z", and "it's X, it's essentially doing Y".

Thing is, I actually struggle to find what's so off-putting about these, given that they're usually used correctly. So far, the best hypothesis I have for what makes AI text stand out is that LLM output is too good. Most text written by real humans (including my own) is shit, with the best of us caring about communicating clearly, and most people not even that; nobody spends time refining the style and rhythm, unless they're writing a poem. You don't expect a blog post or a random Internet article (much less a HN comment) to be written in the same style as a NYT bestseller book for general audience - but LLMs do that naturally, they write text better at paragraph level than most people ever could, which stands out as jarring.

> Either there's currently an epidemic of algorithms that use more than one bit to store a bit, or the AI is shoving in extra plausible-sounding words to pad things out. You decide which is more likely.

Or, those things matter to authors and possibly the audience. Which is reasonable, because LLMs made the world suddenly hit hard against global capacity constraints in compute, memory, and power; between that and edge devices/local use, everyone who pays attention is interested in LLM efficiency.

snovv_crash 3 hours ago|||

LLM prose is very bland and smooth, in the same way that bland white factory bread is bland and smooth. It also typically uses a lot of words to convey very simple ideas, simply because the data is typically based on a small prompt that it tries to decompress. LLMs are capable of very good data transformation and good writing, but not when they are asked to write an article based on a single sentence.

TeMPOraL 3 hours ago||

That's true. I.e. it's not that they're not capable of doing better, it's just whoever's prompting them is typically too lazy to add an extra sentence or three (or a link) to steer it to a different region of the latent space. There's easily a couple dozen dimensions almost always left at their default values; it doesn't take much to alter them and nudge the model to sample from a more interesting subspace style-wise.

(Still, it makes sense to do it as a post-processing style transfer space, as verbosity is a feature while the model is still processing the "main" request - each token produced is a unit of computation; the more terse the answer, the dumber it gets (these days it's somewhat mitigated by "thinking" and agentic loops)).

spencerflem 2 hours ago|||

Because it’s a lot of fluff to convey things in a way that’s not very accurate.

radarsat1 2 hours ago||||

> "Redefine" is a favorite word of AI. Honestly no need to read further.

You're not wrong, but it certainly is an annoying outcome of AI that we're not allowed to use.. words.. anymore.

veunes 10 hours ago||||

Looks like Google canned all their tech writers just to pivot the budget into H100s for training these very same writers

snovv_crash 8 hours ago||

Capex vs. opex

roywiggins 5 hours ago||||

"The X Trick" or "The Y Dilemma" or similar snowclones in a header is also a big AI thing. Humans use this construction too, but LLMs love it out of all proportion. I call it The Ludlum Delusion (since that's how every Robert Ludlum book is titled).

pqs 10 hours ago|||

There is also the possibility that the article went through the hands of the company's communication department, which has writers that probably write at LLM level.

g-mork 9 hours ago|||

Another instinctual reaction here. This specific formulation pops out of AI all the time, there might as well have been an emdash in the title

zarzavat 9 hours ago||||

I read "this clever step" and immediately came to the comments to see if anyone picked up on it.

It reads like a pop science article while at the same time being way too technical to be a pop science article.

Turing test ain't dead yet.

TeMPOraL 5 hours ago||

> Turing test ain't dead yet.

Only because people are lazy, and don't bother with a simple post-processing step: attach a bunch of documents or text snippets written by a human (whether yourself or, say, some respected but stylistically boring author), and ask the LLM to match style/tone.

benob 12 hours ago|||

Maybe they quantized the model parameters a bit too much...

Serhii-Set 4 hours ago||

Compression research keeps producing surprisingly practical results. The interesting parallel in image formats — AVIF and JPEG XL both came from video codec research (AV1 and JPEG committee respectively), and the compression gains translated almost directly. Makes me wonder how much of the current AI quantization work will eventually land in production inference the same way.

computerbuster 2 hours ago|

JPEG XL is mainly based on unique image-specific research, but you're right to say a lot of the techniques are compatible with videos in theory (the XYB color space comes to mind). AVIF is an AV1 OBU in an image-specific container, and required a lot of image-specific engineering to make AV1's tools useful for images; see libaom's tune "iq", and the same in SVT-AV1. The compression gains translated when engineering effort went into creating bespoke implementations, and the same may happen for LLMs if I had to guess.

Serhii-Set 20 minutes ago||

The XYB color space detail is really interesting — I wasn't aware of how much image-specific engineering went into making AV1 tools work for stills. The libaom 'iq' tuning makes sense in retrospect. So the compression gains in AVIF weren't just inherited from AV1 video work but required significant additional optimization. That makes the JXL comparison more nuanced too — JXL was designed image-first from the start, which might explain why it encodes faster despite similar or better compression ratios.

bilsbie 6 hours ago||

It seems like most breakthroughs I see are for efficiency? What are the most important breakthroughs from the past two or three years for intelligence?

Lerc 6 hours ago||

If you think of it from the point of view of the universal approximation theorem, it's all efficiency optimisation. We know that it works if we do it incredibly inefficiently.

Every architecture improvement is essentially a way to achieve the capability of a single fully-connected hidden layer network n wide. With fewer parameters.

Given these architectures usually still contain fully connected layers, unless they've done something really wrong, they should still be able to do anything if you make the entire thing large enough.

That means a large enough [insert model architecture] will be able to approximate any function to arbitrary precision. As long as the efficiency gains with the architecture are retained as the scale increases they should be able to get there quicker.

ertgbnm 6 hours ago|||

Most breakthroughs that are published are for efficiency because most breakthroughs that are published are for open source.

All the foundation model breakthroughs are hoarded by the labs doing the pretraining. That being said, RL reasoning training is the obvious and largest breakthrough for intelligence in recent years.

WarmWash 4 hours ago||

With all the floating around of AI researchers though, I kind of wonder how "secret" all these secrets are. I'm sure they have internal siloing, but even still, big players seem to regularly defect to other labs. On top of this, all the labs seem to be pretty neck and neck, with no one clearly pulling ahead across the board.

irthomasthomas 6 hours ago|||

Efficiency gains can be used to make existing models more profitable, or to make new larger and more intelligent models.

cubefox 5 hours ago||

Some yes, others no. Distillation and quantization can't be used to make new base models since they require a preexisting one.

irthomasthomas 1 hour ago||

it enables models larger than was previously possible.

cubefox 1 hour ago||

No because the base model from which the distilled or quantized models are derived is larger.

cubefox 5 hours ago||

> What are the most important breakthroughs from the past two or three years for intelligence?

The most important one in that timeframe was clearly reasoning/RLVR (reinforcement learning with verifiable rewards), which was pioneered by OpenAI's Q* aka Strawberry aka o1.

bluequbit 12 hours ago||

I did not understand what polarQuant is.

Is it something like pattern-based compression, where the algorithm finds repeating patterns and creates an index of those common symbols or numbers?

Maxious 12 hours ago||

https://mesuvash.github.io/blog/2026/turboquant-interactive/ has a little visualisation

Rapzid 3 hours ago|||

Awesome! So it nudges the vectors into stepped polar rays.. It's effectively angle snapping? Plus a sort of magnitude clustering.

pstoll 7 hours ago||||

Good post but link at the end is broken.

“For the full technical explanation with equations, proofs, and PyTorch pseudocode, see the companion post: TurboQuant: Near-Optimal Vector Quantization Without Looking at Your Data.”

spencerflem 12 hours ago|||

I like the visualization, but I don’t understand the grid quantization. If every point is on the unit circle, aren’t all the center grid coords unused?

fc417fc802 6 hours ago|||

Yeah that's odd. It seems like you'd want an n-1 dimensional grid on the surface of the unit sphere rather than an n dimensional grid within which the sphere resides.

Looking at the paper (https://arxiv.org/abs/2504.19874) they cite earlier work that does exactly that. They object that grid projection and binary search perform exceptionally poorly on the GPU.

I don't think they're using a regular grid as depicted on the linked page. Equation 4 from the paper is how they compute centroids for the MSE optimal quantizer.

Why specify MSE optimal you ask? Yeah so it turns out there's actually two quantization steps, a detail also omitted from the linked page. They apply QJL quantization to the residual of the grid quantized data.

My description is almost certainly missing key details; I'm not great at math and this is sufficiently dense to be a slog.

vincnetas 11 hours ago|||

I think the grid can be on the surface of the unit sphere

mrugge 12 hours ago|||

1. Efficient recursive transform of kv embeddings into polar coordinates 2. Quantize resulting angles without the need for explicit normalization. This saves memory via key insight: angles follow a distribution and have analytical form.
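One way to read "recursive transform into polar coordinates", sketched under assumptions: pair up coordinates into (radius, angle), then pair up the resulting radii, and repeat until a single radius (the vector's norm) is left. The function names, the uniform angle bins, and the 4-bit budget are illustrative choices; the actual PolarQuant codebooks and bit allocation are not reproduced here:

```python
import numpy as np

def to_polar(x):
    """Recursively pair coordinates; returns (radius, list of angle arrays per level)."""
    r = np.asarray(x, dtype=float)
    assert r.size > 0 and (r.size & (r.size - 1)) == 0, "length must be a power of two"
    angles = []
    while r.size > 1:
        a, b = r[0::2], r[1::2]
        angles.append(np.arctan2(b, a))   # level 0: (-pi, pi]; later levels: [0, pi/2]
        r = np.hypot(a, b)                # radii of the pairs feed the next level
    return r[0], angles

def from_polar(radius, angles):
    """Invert to_polar: expand radii back into coordinate pairs, last level first."""
    r = np.array([radius])
    for theta in reversed(angles):
        out = np.empty(2 * r.size)
        out[0::2] = r * np.cos(theta)
        out[1::2] = r * np.sin(theta)
        r = out
    return r

def quantize_angles(theta, bits, lo, hi):
    """Snap angles in [lo, hi] to the centres of 2**bits equal-width bins."""
    n = 2 ** bits
    idx = np.clip(np.floor((theta - lo) / (hi - lo) * n), 0, n - 1)
    return lo + (idx + 0.5) * (hi - lo) / n

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
radius, angles = to_polar(x)

# Quantize only the angles; the single radius is kept as-is.
q_angles = [quantize_angles(angles[0], 4, -np.pi, np.pi)]
q_angles += [quantize_angles(t, 4, 0.0, np.pi / 2) for t in angles[1:]]
x_hat = from_polar(radius, q_angles)

print("norm of x and x_hat:", np.linalg.norm(x), np.linalg.norm(x_hat))
print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

Because every reconstructed pair is built from cos and sin of the same angle, the norm survives angle quantization exactly, which is one way to see why no separate normalization step is needed.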

quotemstr 12 hours ago||

Reminds me vaguely of Burrows-Wheeler transformations in bzip2.

Rapzid 3 hours ago|||

That overview is frustratingly high-level. I know what a vector is, a bit, and yet that compression description is crazy uninformative. And that PolarQuant visualization is.. Very abstract.

viktorcode 10 hours ago||

The way I understand it, it's a way of compressing vectors by switching from their per-component representation to a polar coordinate representation, where nearby vectors are clumped together onto a single line, allowing them to be described by different lengths

mmastrac 5 hours ago||

Is this a tradeoff between GPU-computation-expense vs accuracy? ie: you could quantize into segments or grids on the unit circle/sphere/etc, but that's too expensive so it's better to just quantize to a Cartesian grid because the GPU can decompress cheaper?

iddan 6 hours ago|

I am guessing that as Google is vertically integrated and "actually pays" for AI infra (compared to OpenAI & Anthropic, which receive hardware through partnerships), they have a more urgent incentive to reduce model sizes. Also, Google and Apple will be the first to gain from running models on-device

skybrian 3 hours ago||

This seems to be an inference-time optimization and they are putting AI on every search result page. That seems like plenty of incentive to optimize.

mrcwinn 6 hours ago||

I can assure you OpenAI and Anthropic pay for hardware. They don’t receive it for free.
