Posted by ray__ 15 hours ago
Does this mean I would be able to run a 500B model on my 48GB MacBook without losing quality?
https://mesuvash.github.io/blog/2026/turboquant-interactive/
(Sorry for my terrible English, it's not my native language)
The last query in the sequence is new for every token you predict, but the set of prior keys and values stays the same, i.e. keys and values are reusable. The key-value cache grows with each new token you add to the sequence, and that's where compression comes in. You have to store the keys and values in VRAM, and you'd like to keep the size down by not storing the raw uncompressed tensors.

To make this work well, your compression needs two things: it needs to be fast, so you can compress and decompress on the fly, and it needs to play well with softmax attention. Prior attempts at compression usually suck at one or the other: either decompression is too slow and your tokens/s takes a hit, or you lose important precision and the model's output quality suffers. The claim in the paper is that they've made progress on both.
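To make the mechanics concrete, here's a toy sketch of a quantized KV cache: a per-vector int8 quantizer (standing in for whatever scheme the paper actually uses, which this is not) compresses each key and value as it's appended, and attention dequantizes on the fly. All names here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # head dimension

def quantize(v):
    """Toy per-vector int8 quantization: store one float scale + int8 codes."""
    scale = max(float(np.abs(v).max()) / 127, 1e-8)
    codes = np.round(v / scale).astype(np.int8)
    return scale, codes

def dequantize(scale, codes):
    return scale * codes.astype(np.float32)

# The cache holds only compressed entries; queries stay full precision.
k_cache, v_cache = [], []

def attend(q):
    # Decompress the whole cache, then do ordinary softmax attention.
    K = np.stack([dequantize(s, c) for s, c in k_cache])
    V = np.stack([dequantize(s, c) for s, c in v_cache])
    logits = K @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V

for step in range(8):  # one decode step per new token
    k, v = rng.standard_normal(d), rng.standard_normal(d)
    k_cache.append(quantize(k))  # compress once, reuse forever
    v_cache.append(quantize(v))
    q = rng.standard_normal(d)   # the query is fresh every step
    out = attend(q)

print(len(k_cache), out.shape)  # cache grows by one entry per token
```

The storage win is visible in the types: each cached vector is 64 int8 codes plus one scale instead of 64 float32s, roughly a 4x reduction, at the cost of the dequantize pass inside `attend`.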
Trivially, with r=0, the error is 0, regardless of how heavily the direction is quantized. Larger r means larger absolute error in the reconstructed vector.
It is expected that bigger vectors have proportionally bigger error; there is nothing the quantizer can do about that.
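If r here is the vector's norm in a norm/direction split (my reading of the setup; the quantizer and bit width below are toy stand-ins, not the paper's method), a quick numpy check shows exactly that scaling: zero error at r = 0, and absolute error growing linearly in r since the direction error is fixed.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64

def quantize_direction(u, bits=4):
    """Toy direction quantizer: uniform-quantize each coordinate, renormalize."""
    levels = 2 ** (bits - 1) - 1
    q = np.round(u * levels) / levels
    n = np.linalg.norm(q)
    return q / n if n > 0 else q

# A random unit direction, quantized once (independent of r).
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
uq = quantize_direction(u)

# Reconstruction error of r * direction, for several norms r.
errs = [np.linalg.norm(r * u - r * uq) for r in (0.0, 1.0, 10.0)]
print(errs)  # errs[0] is exactly 0; errs[2] is 10x errs[1]
```

The error is r * ||u - uq||: the quantizer only ever sees the unit direction, so scaling the vector scales the absolute error with it, which is why relative error is the fair metric here.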