Posted by TechTechTech 1 day ago
Do the runes make it smarter or just run faster (or both)?
...a bit of an odd question: how well do LLMs losslessly compress, as in for cold storage?
I definitely don't have the hardware to run this model at any kind of reasonable speed (and I don't want to use a super aggressive quantization that would kill performance). Even so, I think it would be cool to retain an offline copy, in case... I don't really know, a solar flare destroys the internet some day, or maybe a zombie apocalypse. It would just be cool to have.
But 1.5 TB is a bit too much! If it could be compressed down into something semi kind of reasonable, that would be fun!
There are two ways to make a model smaller:
1. Reduce the number of parameters
2. Reduce the resolution of each parameter (quantization)
For 1, changing the architecture is typically something only the labs producing the models can do, which is why each OSS model release tends to feature a small number of carefully chosen model sizes (for example, Gemma4 comes in e2B, e4B, 12B, 26Ba4B, and 31B sizes).
Generally, models with higher parameter counts have more world knowledge. For coding models, this shows up as a stronger command of uncommon libraries/languages. Very small models (<20B) also lack “smarts.”
Reducing the resolution of each parameter is easier, which is why lots of practitioners have their own quantizations, but it makes it harder for a model to “think” fluently. Interacting with heavily quantized models feels like interacting with someone who didn’t get any sleep the night before.
Models that have higher-fidelity quantization take more RAM and have higher “smarts,” but don’t necessarily have more world knowledge. Models with aggressive quantization tend to be more likely to make rookie mistakes, emit malformed tool calls, get stuck in loops, or even exhibit signs of “neuroticism” / “distress” in their thinking tokens.
Parameter counts = world knowledge, quantization = “smarts.”
This is only a soft rule of thumb; the distinction isn’t clear-cut.
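To put rough numbers on the two levers above: the weights take roughly (parameter count) x (bits per parameter) / 8 bytes. A minimal Python sketch, assuming a made-up 100B-parameter model and ignoring per-block scale metadata, layers kept at higher precision, and runtime overhead like the KV cache. The function name and figures are illustrative, not measurements of any real model.

```python
def weight_size_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Hypothetical 100B-parameter model (an assumption, not a real release).
N_PARAMS = 100e9

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_size_gb(N_PARAMS, bits):.0f} GB")
```

Halving the bits halves the footprint, and so does halving the parameter count, which is why the two options trade off against each other for a fixed amount of RAM.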
Here's the du output for GLM-5.2:
$ du -s -BG /cube/models/zai-org/GLM-5.2/
1099G /cube/models/zai-org/GLM-5.2/
TBH, storage is about the last consideration in the cost of being able to download and run this. Even though HDD and SSD prices have gone nuts as a result of the recent demand/shortage, it's not like 1.5TB of space costs a lot.
Even if you fed it into xz with the most CPU-intensive compression options and it didn't compress at all (e.g., like trying to xz an AV1 video, or whatever), you'd still be paying about the price of a single fast-food hamburger meal per TB. The real concern is the RAM to run it.
But anyway, anecdotally, many 16-bit full-precision GGUF files will compress to about 65% of their original size with default xz options. I have a log here showing that's what IBM Granite 4.1 30b compressed to, which I'm keeping around but in lukewarm storage.
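If you want to check the ratio on your own files without writing a second copy to disk, here is a rough Python sketch using the standard-library lzma module. Its defaults (.xz container, preset 6) should roughly match plain `xz`. The function name `xz_ratio` is made up for illustration, and the demo uses synthetic data (zeros and random bytes), not model weights, just to show the two extremes.

```python
import lzma
import os
import tempfile


def xz_ratio(path, chunk_size=1 << 20):
    """Stream a file through LZMA and return compressed size / original size."""
    compressor = lzma.LZMACompressor()  # defaults: .xz container, preset 6
    original = 0
    compressed = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            original += len(chunk)
            compressed += len(compressor.compress(chunk))
    compressed += len(compressor.flush())
    return compressed / original if original else 1.0


if __name__ == "__main__":
    # Synthetic stand-ins: zeros compress almost entirely, random bytes do not.
    samples = [("zeros", bytes(1 << 20)), ("random", os.urandom(1 << 20))]
    for label, payload in samples:
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            tmp.write(payload)
        try:
            print(f"{label}: {xz_ratio(tmp.name):.1%} of original size")
        finally:
            os.remove(tmp.name)
```

Point `xz_ratio` at a GGUF file to get the real number. Expect it to be slow on anything in the hundreds of GB, since it has to compress the whole file to measure it.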
But I do like Unsloth Studio, quite a lot. It's nicely designed.
Only a select few have the hardware required to run this to begin with, and even then the forecasted performance makes me wonder if it’s worth it at all.
The key to a model this large is to (1) use it to plan: generate lots of plans and farm them out to a smaller model, and (2) for very specific and complicated portions, prompt it precisely for what you need.
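A rough sketch of that plan-then-delegate loop in Python, in case it helps. Everything here is hypothetical: `big` and `small` are just callables that take a prompt and return text (swap in whatever client you actually use), and `is_hard` is a placeholder for however you decide which steps deserve the large model.

```python
from typing import Callable, List

Model = Callable[[str], str]  # hypothetical: prompt in, text out


def plan_and_delegate(task: str, big: Model, small: Model,
                      is_hard: Callable[[str], bool]) -> List[str]:
    """Have the big model plan, then route each step to the big or small model."""
    plan = big(f"Break this task into steps, one per line:\n{task}")
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    results = []
    for step in steps:
        if is_hard(step):
            # Complicated portion: prompt the big model precisely for this step.
            results.append(big(f"Task: {task}\nDo exactly this step:\n{step}"))
        else:
            # Routine portion: farm it out to the smaller, cheaper model.
            results.append(small(f"Task: {task}\nCarry out this step:\n{step}"))
    return results


if __name__ == "__main__":
    # Stub models so the sketch runs without any real backend.
    def big(prompt: str) -> str:
        if prompt.startswith("Break"):
            return "write parser\ndesign tricky cache\nadd tests"
        return "big:" + prompt.splitlines()[-1]

    def small(prompt: str) -> str:
        return "small:" + prompt.splitlines()[-1]

    for line in plan_and_delegate("build tool", big, small, lambda s: "tricky" in s):
        print(line)
```

The big model only gets called once for the plan plus once per hard step, so most of the token volume lands on the small model.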
And yet Apple won't sell them to you anymore. And I'm not too confident it will even be possible to hand them 10k to get one again.