GLM-5.2 – How to Run Locally

Posted by TechTechTech 1 day ago

GLM-5.2 – How to Run Locally(unsloth.ai)

578 points | 277 commentspage 2

numlock86 19 hours ago|

Is this really worth it, though? Throughout the years my experience with quantized models has been that they feel like a lobotomized version of the original. Doesn't matter if it's an LLM, dedicated diffusion model or some other dedicated task. Sure, they get the job done. But a lot worse. The only ones that can somewhat hold up are the ones provided by the vendor directly. Gemma4 comes to mind. However I suspect they have some secret sauce other than just "let's quantize this" since they have the original model and its data at hand.

There should be more native 4bit, 1.25bit and likewise models. Those actually work great while making them smaller in comparison. But I guess there is some reason for them being pretty niche.

iaw 8 hours ago||

On a model of this size quantization has much less impact on quality of output. I'm running a 3bit version and find it comparable to sonnet, almost opus.

nicman23 19 hours ago||

it is not a flat quant but a dynamic

CGamesPlay 1 day ago||

Can somebody help me understand the Quantization Analysis? It says "dynamic 4-bit UD-Q4_K_XL and dynamic 5-bit UD-Q5_K_XL are generally lossless" while showing a top-1% token agreement on the chart of 97.5%. Not what I would consider "generally lossless". Is this implying that some post-processing is going to account for the 2.5% loss? Beam search?

dannyw 20 hours ago|

Generally 97.5% token agreement is very positive. Like the article explains, the difference isn’t the model thinking the capital of France isn’t Paris, but rather maybe saying “The capital of France is Paris” instead of “Paris is the capital of France”.

c7b 18 hours ago||

Can someone explain the math to me? Why is 1-bit only ten percent less memory than 2-bit?

idonotknowwhy 17 hours ago||

2 reasons.

First, it's not really "1 bit", actually much closer to 2-bit. IQ1_M is actually 1.75bit and IQ2_XXS is 2.06bit This is from the ./llama-quantize --help with most of the quant types and their size in bpw: https://pastebin.com/bCUqGfeE

And to elaborate on the "dynamic" aspect inconito said in the other comment, if you click on one of the .gguf files in huggingface:

https://huggingface.co/unsloth/GLM-5.2-GGUF/blob/main/UD-IQ1...

There are a lot of Q5_K, Q6_K, etc tensors. Only the routed experts (ffn_gate_exps.weight, ffn_up_exps.weight, ffn_down_exps.weight) are heavily quantized, and it looks like the down_proj is actually iq3_xxs for this model.

incognito124 17 hours ago||

Keyword dynamic, the parameters are quantized on a case by case basis

andai 1 day ago||

How is this model half the size of DeepSeek V4 Pro? Is it because DeepSeek did more aggressive cost cutting on the attention mechanism?

drudolph914 20 hours ago||

GLM 5.2 is the first time I'm actually excited about AI! I'm not the most bullish on AI code for several few reasons, but the biggest reason is the ownership model. We all know we're near the tail end of the "subsidized pricing" window for AI, and I've been hoping for so long to get an open weight model that is _close enough_ to the SOTA before this window closes - and we actually got it! I'm excited to be able to in the near future run GLM locally, and use these things like a tool instead of living in this for-rent model for the rest of my life. I'm excited to actually enjoy programming again

zkmon 16 hours ago||

I have high respect for unsloth's work, helping millions to get started with local AI, but this post appears kind of download bait.

Offloading too many layers to CPU is not going to work at all. I have tried this many times and had to rm -rf on those heavy hf cache folders. Also I doubt 1-bit or 2-bit quants of GLM 5.2, running mostly outside of VRAM can beat Q8_0 of Qwen3.6-27B fully loaded in VRAM - on usefulness.

iaw 7 hours ago|

I run 3bit GLM5.2 and full precision Qwen3.6-27B. GLM is much much closer to frontier models in it's breadth and ability to plan. If you just need to implement Python code from an existing plan Qwen is your choice but it has problem succeeding with more complex tasks that GLM5.2 does not.

As I type this my local GLM5.2 is troubleshooting bugs that Qwen would not be able to handle.

zkmon 3 hours ago||

Not sure how much of your GLM is offloaded to CPU. I was contending the suggestion of using system RAM + VRAM.

snootypoot 23 hours ago||

if sam altman didnt exist i could afford to run this

numlock86 19 hours ago|

if sam altman didn't exists this model would most likely not exist as well

walrus01 16 hours ago||

I really don't think anyone is going to have a good time trying to run it on anything with 256GB of RAM no matter what the post says. 512 is the much more realistic minimum. I'm fortunate enough to have two 512GB RAM dual xeon workstations in my home office that I bought cheap before the price rise to mess around with things...

edg5000 19 hours ago|

One advantage about local LLM: You could serialize the context yourself, without being constrained by APIs. And let's not forget, the Big 2 encrypt their thinking. If you use custom clients, which is a very grey area alreay, being able to produce the context string raw is a big bonus. Takes away a lot of annoying constraints and needless mystique/obfuscation.

But I don't know how usable GLM 5.2 is vs the Big 2.

More comments...