Posted by TechTechTech 1 day ago
There should be more native 4-bit, 1.25-bit and similar models. They actually work great while being much smaller by comparison. But I guess there is some reason they remain pretty niche.
First, it's not really "1-bit"; it's actually much closer to 2-bit. IQ1_M is actually 1.75 bpw and IQ2_XXS is 2.06 bpw. This is from the output of ./llama-quantize --help, which lists most of the quant types and their size in bpw: https://pastebin.com/bCUqGfeE
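To put those bpw numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes every weight is stored at the same bpw (real files mix types, so this is only an estimate) and it treats "27B" as exactly 27e9 weights, which is my assumption and not a figure from any model card. The function names are made up for illustration.

```python
# Rough size estimate: bits per weight (bpw) times number of weights.
# Assumes a uniform quant type for the whole model, which real GGUF
# files do not have, so treat the result as a ballpark only.

def model_size_bytes(n_weights, bpw):
    """Approximate storage in bytes for n_weights at the given bpw."""
    return n_weights * bpw / 8


def to_gib(n_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


# bpw values as quoted above (IQ1_M, IQ2_XXS), plus Q8_0,
# which stores 34 bytes per block of 32 weights (8.5 bpw).
QUANTS = {"IQ1_M": 1.75, "IQ2_XXS": 2.06, "Q8_0": 8.5}

# Hypothetical parameter count, used only for illustration.
N_WEIGHTS = 27e9

for name, bpw in QUANTS.items():
    size = to_gib(model_size_bytes(N_WEIGHTS, bpw))
    print(f"{name:8s} {bpw:5.2f} bpw -> about {size:.1f} GiB")
```

The point is just that going from 8.5 bpw to about 2 bpw cuts the file to roughly a quarter, not to an eighth as "1-bit" would suggest.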
And to elaborate on the "dynamic" aspect inconito mentioned in the other comment, if you click on one of the .gguf files on Hugging Face:
https://huggingface.co/unsloth/GLM-5.2-GGUF/blob/main/UD-IQ1...
There are a lot of Q5_K, Q6_K, etc. tensors. Only the routed experts (ffn_gate_exps.weight, ffn_up_exps.weight, ffn_down_exps.weight) are heavily quantized, and it looks like the down projection is actually iq3_xxs for this model.
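A small sketch of why that mix matters for the overall size: the effective bpw is just the weight-count-weighted average of the per-tensor bpw. The relative group sizes below are invented for illustration and are not read from the actual file, and assigning IQ1_M to the gate/up experts is also an assumption on my part. The per-type bpw values follow from the block layouts in llama.cpp (for example Q6_K is 210 bytes per 256 weights).

```python
# Effective bits per weight for a file that mixes quant types.
# bpw per type = bytes per block * 8 / weights per block.
BPW = {
    "Q6_K": 6.5625,     # 210 bytes per 256 weights
    "Q5_K": 5.5,        # 176 bytes per 256 weights
    "IQ3_XXS": 3.0625,  # 98 bytes per 256 weights
    "IQ1_M": 1.75,      # 56 bytes per 256 weights
}


def effective_bpw(groups):
    """groups: list of (weight_count, quant_type) pairs."""
    total_weights = sum(n for n, _ in groups)
    total_bits = sum(n * BPW[q] for n, q in groups)
    return total_bits / total_weights


# HYPOTHETICAL relative sizes (arbitrary units), not taken from the GGUF.
groups = [
    (32, "IQ1_M"),    # ffn_gate_exps (quant type assumed)
    (32, "IQ1_M"),    # ffn_up_exps (quant type assumed)
    (32, "IQ3_XXS"),  # ffn_down_exps (iq3_xxs, as observed above)
    (4, "Q6_K"),      # attention, shared experts, etc. (assumed)
]

print(f"effective bpw: {effective_bpw(groups):.2f}")
```

Because the routed experts hold most of the weights in a MoE model, they dominate the average, while keeping the small remainder at Q5_K/Q6_K costs very little extra space.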
Offloading too many layers to CPU is not going to work at all. I have tried this many times and ended up having to rm -rf those heavy HF cache folders. Also, I doubt that 1-bit or 2-bit quants of GLM 5.2, running mostly outside of VRAM, can beat Q8_0 of Qwen3.6-27B fully loaded in VRAM in terms of usefulness.
As I type this my local GLM5.2 is troubleshooting bugs that Qwen would not be able to handle.
But I don't know how usable GLM 5.2 is vs the Big 2.