GLM-5.2 – How to Run Locally

Posted by TechTechTech 1 day ago

GLM-5.2 – How to Run Locally (unsloth.ai)

580 points | 278 comments | page 3

cjbprime 16 hours ago|

I've got access to a 192GB RAM Mac Studio, which is below the stated minimum RAM. Can swapping off fast disk be used to make it work out, especially since it's MoE?

walrus01 16 hours ago|

Seems like a good way to significantly shorten the lifespan of an NVMe SSD by using up its rated TBW (terabytes written), if you let it swap extensively. Also, the performance will be absolutely abysmal, like 0.1 tok/second.

smallerize 13 hours ago||

The LLM tools are smart enough to keep the weights on disk and only the read-write stuff in RAM.

jonathanhefner 1 day ago||

> Runing GLM-5.2 on local hardware

Do the runes make it smarter or just run faster (or both)?

nicman23 20 hours ago|

depends on the color

jzer0cool 17 hours ago||

1-bit requirement (1-bit is 223 GB, wowza). What do you all recommend with 24-48 GB of VRAM, or is this approach outdated now?

ramgine 1 day ago||

I have up to 1 TB of DDR4 in my server, but it only has a 12 GB VRAM 3060. Would getting a 24 GB VRAM card make this a viable system, or am I throwing money away?

segmondy 23 hours ago||

You can run it today with that 12 GB VRAM 3060, but I would suggest getting two 3090s. Use the cmoe option. This will keep the attention/router tensors on the GPU and offload the rest (the expert weights) to system memory. Try it now and see the performance.
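
A minimal sketch of what that could look like, assuming "cmoe" refers to llama.cpp's `--cpu-moe` flag (the comment does not name the tool). The model path is a placeholder, `build_llama_server_cmd` is a hypothetical helper, and `--n-gpu-layers 99` is an assumed value meaning "put every layer that fits on the GPU":

```python
import shlex


def build_llama_server_cmd(model_path: str, gpu_layers: int = 99) -> list[str]:
    """Assemble a llama-server command line (assumed tool, see lead-in)."""
    return [
        "llama-server",
        "--model", model_path,              # placeholder path to a GGUF file
        "--n-gpu-layers", str(gpu_layers),  # offload as many layers as possible
        "--cpu-moe",                        # keep MoE expert weights in system RAM
    ]


cmd = build_llama_server_cmd("/path/to/model.gguf")
print(shlex.join(cmd))
```

This only builds and prints the command; check the flags against your installed version before running it.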

rnewme 23 hours ago||

Should work, yes.

maxignol 17 hours ago||

Lucky me, I never go out without my 256gb unified ram mac x)

Wowfunhappy 1 day ago||

> The full model requires 1.51TB of disk space

...a bit of an odd question: how well do LLMs losslessly compress, as in for cold storage?

I definitely don't have the hardware to run this model at any kind of reasonable speed (and I don't want to use a super aggressive quantization that would kill performance). Even so, I think it would be cool to retain an offline copy, in case... I don't really know, a solar flare destroys the internet some day, or maybe a zombie apocalypse. It would just be cool to have.

But 1.5 TB is a bit too much! If it could be compressed down into something semi kind of reasonable, that would be fun!

gcr 1 day ago||

There are two forms of compression relevant to LLMs:

1. Reduce the number of parameters

2. Reduce the resolution of each parameter (quantization)

For 1, changing the architecture is typically only possible by the labs producing the models, which is why each OSS model release tends to feature a small number of carefully chosen model sizes (for example, Gemma4 comes in e2B, e4B, 12B, 26Ba4B, and 31B sizes).

Generally, models with higher parameter counts have more world knowledge. For coding models, this shows up as a stronger command of uncommon libraries/languages. Very small models (<20B) also lack “smarts.”

For 2, reducing the resolution of each parameter is easier, which is why lots of practitioners publish their own quantizations, but this makes it harder for a model to “think” fluently. Interacting with heavily quantized models feels like interacting with someone who didn’t get any sleep the night before.

Models that have higher-fidelity quantization take more RAM and have higher “smarts,” but don’t necessarily have more world knowledge. Models with aggressive quantization tend to be more likely to make rookie mistakes, emit malformed tool calls, get stuck in loops, or even exhibit signs of “neuroticism” / “distress” in their thinking tokens.

Parameter counts = world knowledge, quantization = “smarts.”

This is a soft rule of thumb; the distinction isn’t very strong.
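
As a rough sketch of point 2, the weight footprint scales linearly with bits per parameter. The snippet assumes a dense 31B-parameter model (the largest size named above) and ignores quantization metadata, KV cache, and runtime overhead, so real files will be somewhat larger. `weight_gb` is a hypothetical helper, not part of any library:

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight size in decimal GB: params * bits / 8 bytes per param."""
    return n_params * bits_per_param / 8 / 1e9


# Same parameter count, progressively coarser resolution per parameter.
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_gb(31e9, bits):.1f} GB")
```

Halving the bits halves the memory needed, which is why quantization is the lever most people reach for first.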

SirMadam 1 day ago|||

SOTA LLM-specific compression achieves around 54%! https://arxiv.org/abs/2505.06252v3

throwdbaaway 22 hours ago|||

On ZFS with zstd compression, I am getting 1.34x compressratio for the BF16 weights (across multiple models).

Here's the du output for GLM-5.2:

    $ du -s -BG /cube/models/zai-org/GLM-5.2/
    1099G   /cube/models/zai-org/GLM-5.2/

walrus01 17 hours ago|||

> ...a bit of an odd question: how well do LLMs losslessly compress, as in for cold storage?

TBH this is close to the last-ranked consideration in the cost of being able to download and run this. Even though HDD and SSD prices have gone nuts as a result of the recent demand/shortage, it's not like 1.5 TB of space costs a lot.

Even if you fed it into xz with the most CPU-intensive compression options and it didn't compress at all (e.g. like trying to xz an AV1 video, or whatever), it's still the cost of a single fast food hamburger meal in $/TB. The real concern is the RAM to run it.

But anyways, anecdotally, many 16-bit full precision GGUF files will compress to about 65% of original size with default xz options. I have a log here showing that's what IBM Granite 4.1 30b compressed to, which I'm keeping around but in lukewarm storage.
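
If you want to check this on your own files, a minimal sketch using Python's standard `lzma` module (which writes the xz container at its default preset) could look like the following. `xz_ratio` is a hypothetical helper, and the demo uses throwaway synthetic data rather than real model weights:

```python
import lzma
import os
import tempfile


def xz_ratio(path: str, chunk_size: int = 1 << 20) -> float:
    """Return compressed_size / original_size, streaming the file in chunks."""
    compressor = lzma.LZMACompressor()  # default format (xz) and default preset
    original = compressed = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            original += len(chunk)
            compressed += len(compressor.compress(chunk))
    compressed += len(compressor.flush())
    if original == 0:
        raise ValueError("file is empty")
    return compressed / original


# Demo: zeros compress almost entirely, random bytes essentially not at all.
with tempfile.TemporaryDirectory() as tmp:
    zeros_path = os.path.join(tmp, "zeros.bin")
    random_path = os.path.join(tmp, "random.bin")
    with open(zeros_path, "wb") as f:
        f.write(bytes(1 << 20))
    with open(random_path, "wb") as f:
        f.write(os.urandom(1 << 20))
    zero_ratio = xz_ratio(zeros_path)
    random_ratio = xz_ratio(random_path)

print(f"zeros: {zero_ratio:.4f}  random: {random_ratio:.4f}")
```

Streaming keeps memory use flat, which matters when the input is hundreds of gigabytes. Expect this to be slow on large files.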

redox99 1 day ago||

Probably not at all, considering weights are randomly initialized.

suyash 18 hours ago||

We really need a quantized version for regular laptops.

dofm 1 day ago||

Can't run this myself.

But I do like Unsloth Studio, quite a lot. It's nicely designed.

hxii 1 day ago||

Any time I see one of these posts about models of this size a quote comes to mind – "Your Scientists Were So Preoccupied With Whether Or Not They Could, They Didn’t Stop To Think If They Should".

Only a select few have the hardware required to run this to begin with, and even then the forecasted performance makes me wonder if it’s worth it at all.

segmondy 23 hours ago|

Completely worth it. At 6 tok/s, if I can get 2 hours of token generation, that's 2 hrs * 3600 secs * 6 tok/s = 43,200 tokens. At about 10 tokens to a line of code, that's about 4,320 lines. Let's trim it even more and cut it in half. That's 2,160 lines of code a day. Most professional programmers can't deliver that much consistently in a day.

The key to a model this large is (1) use it to plan: generate lots of plans and farm them out to a smaller model, and (2) for very specific and complicated portions, prompt precisely for what you need.
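
The back-of-the-envelope estimate above can be sketched as a small function. The 10 tokens per line and the 50% haircut are the assumptions from this comment, not measurements, and `lines_per_session` is a hypothetical name:

```python
def lines_per_session(tok_per_sec: float, hours: float,
                      tokens_per_line: float = 10,
                      keep_fraction: float = 0.5) -> dict:
    """Estimate generated lines of code for one uninterrupted generation session."""
    tokens = tok_per_sec * hours * 3600   # total tokens generated
    lines = tokens / tokens_per_line      # raw lines at the assumed density
    return {
        "tokens": tokens,
        "lines": lines,
        "usable_lines": lines * keep_fraction,  # discount for waste and retries
    }


estimate = lines_per_session(tok_per_sec=6, hours=2)
print(estimate)
```

Changing `keep_fraction` or `tokens_per_line` shows how sensitive the final number is to those two guesses.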

uberex 21 hours ago||

That's not complete reasoning. Even frontier models need to revisit and fix things. Add 10 loops to that and it is 20 hours. Still great compared to a 2023 human, but why am I not just paying pocket money for Claude Pro instead?

segmondy 21 hours ago||

You're talking about an agentic workflow. Agentic is cruise control. Race car drivers shift manually for more precision and to go faster. If the only way you know how to code with AI is agentic, then you are putting yourself on a crutch.

uberex 21 hours ago||

You are saying you can one-shot without loops on something like GLM-5.2?

bilekas 13 hours ago|

> this can directly fit on a 256GB unified memory Mac

And yet Apple won't sell them to you anymore. And I'm not too confident it will even be possible to hand them 10k to get one again.
