GLM-5.2 – How to Run Locally

Posted by TechTechTech 1 day ago

GLM-5.2 – How to Run Locally (unsloth.ai)

549 points | 265 comments

antirez 9 hours ago|

DwarfStar work in progress numbers: I see 14 tokens/sec generation, which slopes down to 10 t/s with longer contexts of 10k or more. Consider that the indexed attention requires evaluating 2048 selected rows, 2x DeepSeek and with less compression, so performance with larger contexts goes south faster here. Prefill ranges from 180 t/s on small contexts to 150 t/s and less with larger contexts. I used DeepSeek v4 PRO in these conditions; it is usable, but it is far from the 35 t/s generation and 400 t/s prefill you get with DeepSeek v4 Flash 2 bit on a MacBook M5 Max. But my implementation is likely not yet optimized enough, so a bit more performance can be obtained. I'm using 4 bit quants. The model is also definitely less sparse than DeepSeek v4, so it activates a bigger percentage of parameters. If it works decently at 2-bit, that would be a win even for machines where 4-bit fits, since this would basically mean 2x (equivalent) memory bandwidth for the routed experts.

Local inference really needs a 1.2 to 1.5 TB/s memory bandwidth system with 512GB and 2 to 3 times the GPU compute of a Mac Studio M3 Ultra, at an affordable 10k to 15k price point. A variant with 1TB memory would also be welcome at a 20k price point.

zozbot234 5 hours ago||

10k context is not a whole lot; this model theoretically supports up to 1M. But the KV cache takes up a whole lot more memory at full context than DeepSeek V4 Pro, let alone Flash. (About 96GB according to readily available KV cache calculators; it might be more in practice. For comparison, DeepSeek Flash is ~10GB and Pro is at least in that ballpark.) So I'm not sure that this model is a good deal for memory-constrained machines unless you're specifically interested in very short contexts only. This could still be worth it if it came with a game-changing increase in smarts, but that seems a bit unlikely so far.

It will be interesting to see how this model does under an SSD streaming scenario; the lower sparsity should ideally be favorable.

> Local inference needs really hard a 1.2 / 1.5 T/s memory bandwidth system with 512GB and 2/3 times the GPU compute of Mac Studio M3 Ultra, at an affordable 10/15k price point. A variant with 1TB memory would also be welcomed at 20k price point.

Are these realistic specs at present? It's not that clear to me; 1.5 TB/s seems really high.

reasonabl_human 7 hours ago||

Thank you for your work on DwarfStar! It is truly helping democratize access to frontier tech.

segmondy 20 hours ago||

I run Q4_K_XL. All it takes to get about 6tk/sec is 512GB of RAM and two 3090 GPUs with llama.cpp -cmoe. I also have crappy DDR4 at 2400MHz; 3200MHz would bring that speed up to about 9tk/sec. I also have an OK 32-core Epyc CPU; a better 64-core would bring it up to about 11tk/sec. I did a budget build before the crazy hardware costs and I regret it every day. Nevertheless, it's fantastic being able to run this model at home. It's great for planning, and for one-shot prompting once you have a plan or all the context you need. This entire hardware cost $2400 when it was built. If you're willing to be resourceful, you can find ways to run these models at home. I often get the silly question of why, and suggestions about how much I could save using a cloud API, but the Fable drama has opened eyes to why it's good for us to be independent. Thanks, team unsloth: Q4_K_XL is solid. If you are going to grab a quant, make sure to get the K_XL variant if it can fit.

effisfor 13 hours ago||

I applaud all you tinkerers for pushing on the state of the home-brewed art here. Like crypto, AI is drowned out by hucksters, very few people talk about developing resilience. Or the researchers who will push on open source models in efforts to cram them onto an electric toothbrush or tamagotchi. Bravo to you all.

discordance 15 hours ago|||

Running that at full load is at least 600 W, so ~14 kWh in a day. At $0.20 a kWh, that would be $2.80/day or $1k a year of op-ex in electricity.
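The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name `electricity_cost` is made up for illustration, and it assumes a constant draw at the quoted wattage. Without rounding 14.4 kWh down to 14, the daily figure comes out slightly higher than $2.80.

```python
def electricity_cost(watts: float, hours_per_day: float, usd_per_kwh: float):
    """Return (kWh per day, USD per day, USD per year) for a constant power draw.

    Hypothetical helper; assumes the machine draws `watts` for the whole period.
    """
    kwh_per_day = watts / 1000 * hours_per_day
    usd_per_day = kwh_per_day * usd_per_kwh
    return kwh_per_day, usd_per_day, usd_per_day * 365


# Figures from the comment: 600 W, 24 hours a day, $0.20 per kWh.
kwh, per_day, per_year = electricity_cost(watts=600, hours_per_day=24, usd_per_kwh=0.20)
print(f"{kwh:.1f} kWh/day, ${per_day:.2f}/day, ${per_year:.0f}/year")
```

Changing `hours_per_day` or `watts` shows how sensitive the yearly figure is to the duty-cycle assumption.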

Unless you really want privacy or the fuzzy feeling of owning your own, it’s cheaper, more convenient and much faster in tok/s if you pay a hyperscaler.

That said, I do like the direction we are heading and look forward to seeing what host-your-own hardware we get in 2 years.

segmondy 8 hours ago|||

No one runs at full load locally all day. The only way to see that is if you're training. We are talking about inference. I limit my GPUs to 300 watts. You can limit them down to 200W. Since not everything is on the GPU and the bottleneck is between the CPU and system RAM, the GPUs don't even get to spike; I see 160W-180W for each GPU during inference. So redo your calculation. Figure about 6 hrs of daily inference, and we are down to roughly $125 a year. Thanks again for your speculation.

walrus01 14 hours ago||||

Not everyone lives in a place where electricity is $0.20 a kWh. For instance BC Hydro residential rates are $0.11 (CAD) for the first tier and $0.14 for the second tier of consumption in a month. At current exchange rate $0.14 CAD is $0.099 USD a kWh. Hydro Quebec is even cheaper.

At a theoretical 6 tok/s, 86400 seconds in a day, approx 500,000 tokens of GLM5.2 output for 2 bucks a day seems like a pretty good bargain to me. Of course not counting the one time cost of the hardware to run it. But I see people dropping $4000-5000 on all kinds of much less useful stuff.
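The throughput figure above is simple to check. A minimal sketch, assuming uninterrupted generation at a fixed rate; `tokens_per_day` and its `duty_cycle` parameter are illustrative names, not anything from a real tool.

```python
SECONDS_PER_DAY = 86_400


def tokens_per_day(tokens_per_second: float, duty_cycle: float = 1.0) -> float:
    """Output tokens produced in a day at a steady generation rate.

    duty_cycle is the fraction of the day the model is actually generating
    (1.0 = all 24 hours), an assumption added here for illustration.
    """
    return tokens_per_second * SECONDS_PER_DAY * duty_cycle


full_day = tokens_per_day(6)           # the theoretical 24-hour case
six_hours = tokens_per_day(6, 6 / 24)  # roughly six hours of generation
print(f"{full_day:,.0f} tokens at 24 h, {six_hours:,.0f} tokens at 6 h")
```

This counts output tokens only; prompt processing runs at a different (prefill) rate and is not included.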

Additionally, in a place where people use electric baseboard heating or electric in-floor radiant heating, or really any other heating-element-based system in winter that's less efficient than a heat pump, the additional electrical draw from a computing load is basically "free", since you would be spending that same money otherwise to heat your house. If a computer with 512GB of RAM is dumping the waste heat into your room, it accomplishes a portion of the same thing as a baseboard.

Not to mention there is a whole other less measurable benefit of having a locally hosted model that can't be turned off or arbitrarily restricted by a service provider, and where all of your queries and context cache aren't subject to surveillance by any third party.

Incipient 6 hours ago|||

Unless the token estimates I get from using Claude are wayyy out, I burn through 5m+ tokens/day, and I'm not doing a lot of time. 500k tokens in a 24h period for $5k of hardware seems quite poor?

kristjansson 6 hours ago||

Be sure you compare input tokens to pre-fill rates and output tokens to generation rates.

discordance 13 hours ago||||

Where I live prices are often higher than 20c/kWh, but let's take your example and halve it (10c/kWh), so it's ~$1.40/day or ~$500/year.

On Openrouter, the cheapest GLM 5.2 provider costs $3/MTok (at 44 tps). Assuming most use is output tokens, that's still the equivalent of 450k tokens/day, so we're in the same ballpark, but without the capex for two 3090s and the machine.
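To make the comparison above concrete, here is a rough sketch that converts a daily electricity budget into the number of API tokens it would buy. The helper names are hypothetical, and it treats all usage as output tokens billed at the quoted $3/MTok, which is the comment's own simplification.

```python
def tokens_for_budget(usd_budget: float, usd_per_million_tokens: float) -> float:
    """Number of tokens a budget buys at a flat per-million-token price."""
    return usd_budget / usd_per_million_tokens * 1_000_000


def hours_to_generate(tokens: float, tokens_per_second: float) -> float:
    """Wall-clock hours needed to emit `tokens` at a steady rate."""
    return tokens / tokens_per_second / 3600


# $1.40/day of electricity versus $3/MTok at 44 tokens per second.
api_tokens = tokens_for_budget(1.40, 3.0)
print(f"{api_tokens:,.0f} tokens/day for the same money")
print(f"{hours_to_generate(api_tokens, 44):.1f} h to generate them at 44 tps")
print(f"{hours_to_generate(api_tokens, 6):.1f} h to generate them at 6 tps")
```

The unrounded result lands a little above the 450k quoted in the comment, which does not change the "same ballpark" conclusion.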

Self hosted only makes economic sense if your priority is being in control / avoiding surveillance.

walrus01 13 hours ago||

That's true, there's a lot of places where power is considerably more expensive than $0.20 USD/kWh. But also the 600W figure assumes that it's fully loaded 24x7x365.

Running a system that will be 600W under max CPU usage on all cores and RAM and a few 3090-class GPUs, that same system might be only 90W or around there when idle at 0.00 unix load.

If we say: (600 * 24 * 31)/1000 = 446kWh in a month at full load 24 hours a day

But it could be less, such as: (90 * 12 * 31)/1000 = 33.48 kWh of idle time in a month, and 223kWh of "full load" 600W time in a month, if it's at full load only 12 hours a day.

If you're the only user accessing it and you only "use" it 12 hours a day, that cumulative yearly dollar figure would be almost halved. Or even less if a person is using it in bursts and intermittently throughout an 8 hour workday.
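The duty-cycle argument above can be sketched as follows. The 600 W and 90 W figures are the ones quoted in the comment; the function name and the assumption that the machine idles for the rest of each day are illustrative.

```python
def monthly_kwh(load_watts: float, idle_watts: float,
                load_hours_per_day: float, days: int = 31):
    """Return (kWh under load, kWh idle) for one month.

    Assumes the machine is either at full load or idle, never powered off.
    """
    idle_hours_per_day = 24 - load_hours_per_day
    load_kwh = load_watts * load_hours_per_day * days / 1000
    idle_kwh = idle_watts * idle_hours_per_day * days / 1000
    return load_kwh, idle_kwh


always_on = sum(monthly_kwh(600, 90, 24))
half_day_load, half_day_idle = monthly_kwh(600, 90, 12)
half_day = half_day_load + half_day_idle

print(f"24 h/day at load: {always_on:.1f} kWh")
print(f"12 h/day at load: {half_day_load:.1f} kWh + {half_day_idle:.2f} kWh idle")
print(f"ratio: {half_day / always_on:.2f}")
```

Because idle power is not zero, halving the loaded hours cuts the bill by somewhat less than half.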

nearbuy 4 hours ago|||

The usage is irrelevant if we're interested in cost per token. If you use it half as much, you get half as many tokens at half the cost. It's still $5.56 in electricity per million output tokens either way (using $0.20/kWh, adjust accordingly if you have cheaper electricity). If you use the API, you also pay half as much if you use half as much.
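The per-token figure above follows directly from power draw and generation speed. A minimal sketch, using the thread's numbers (600 W, 6 tok/s, $0.20/kWh); the function name is made up for illustration and hardware cost is deliberately left out.

```python
def usd_per_million_output_tokens(watts: float, tokens_per_second: float,
                                  usd_per_kwh: float) -> float:
    """Electricity cost of generating one million tokens at a steady rate."""
    seconds = 1_000_000 / tokens_per_second
    kwh = watts / 1000 * seconds / 3600
    return kwh * usd_per_kwh


cost = usd_per_million_output_tokens(watts=600, tokens_per_second=6, usd_per_kwh=0.20)
print(f"${cost:.2f} per million output tokens")

# Same machine with a lower measured draw or cheaper power (illustrative inputs).
print(f"${usd_per_million_output_tokens(350, 6, 0.20):.2f} at 350 W")
print(f"${usd_per_million_output_tokens(600, 6, 0.10):.2f} at $0.10/kWh")
```

Note that usage hours never appear in the formula, which is the point of the comment: the ratio depends only on watts, tokens per second, and the electricity rate.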

wqaatwt 12 hours ago||||

> person is using it in bursts and intermittently throughout an 8 hour workday.

You can’t do that with 6 tps, though.

AbsurdCensor 12 hours ago||||

I think that's the biggest difference for most. If you can amortize the hardware costs, then 'burst usage' is cheaper at home to a degree, because you are paying a fixed monthly rate elsewise. Overall though, for most it is likely cheaper to use the cloud than to run at home, but it really depends on what you want.

nomel 7 hours ago||

> because you are paying a fixed monthly rate elsewise

No, you would pay usage based rates with API, in this case. I have exactly one fixed monthly rate for the 6 AI models I have tokens available for.

re-thc 8 hours ago|||

> But also the 600W figure assumes that it's fully loaded 24x7x365.

It isn't 100% efficient. Even the best PSUs aren't.

tmountain 13 hours ago||||

Lots of people have solar. Green AI, imagine that!

cultofmetatron 13 hours ago|||

if only there was a magical place where geothermal and hydroelectric is ubiquitous and the weather is cold enough that no one is going to be complaining about free heating.

nomel 7 hours ago|||

The largest geothermal plant in the world is only 1.5GW, in the United States, which is over double all the plants combined in Iceland. The second largest is 1/3 that, in Mexico. [1]

There is no "ubiquitous" geothermal where there is also high power usage. Data centers have to go where power is, not where it can be.

[1] https://en.wikipedia.org/wiki/List_of_geothermal_power_stati...

walrus01 13 hours ago||||

To be fair, Vancouver is such a magical place in terms of electrical cost, but the cost of living and real estate are otherwise through the roof, with decrepit and nasty (would need $100k in renovations immediately if you're not treating it as a teardown) single family detached homes on the east side of the city selling for 3.2 million.

brailsafe 6 hours ago||

Shhh don't forget we have a water shortage. But it is nice to have electricity wrapped into my relatively cheap basement suite rent ;)

fghorow 7 hours ago|||

You aren't, perchance, from Iceland, are you?

matheusmoreira 12 hours ago||||

We do want privacy, and we also want to own the hardware so the US can't just turn it off whenever it feels like it.

I think the main reason not to run locally is to get the full models instead of quantized versions.

traceroute66 11 hours ago||

> We do want privacy, and we also want to own the hardware so the US can't just turn it off whenever it feels like it.

I agree and I prefer on-prem where possible. The Apple Mac Studios have been great for that although I don't have enough of them to run GLM-5.2 without heavy quantization. I'm also waiting for the Apple next product refresh which I hope will enable me to do more with less.

Meanwhile there are hosted privacy-conscious options out there. Two names to look at are Tinfoil[1] and Privatemode (from Edgeless Systems)[2].

Tinfoil[1] is, sadly, US-based. EU-sovereignty-option is on their long-term radar. But they do have GLM-5.2 today.

Privatemode[2] is a German company (Edgeless Systems) with EU-based servers. But sadly no GLM-5.2 today, it is on their mid-long term radar though.

Both Tinfoil and Privatemode operate on the same concept of the LLM operating in a secure enclave and you have end-to-end attestation and encryption.

Tinfoil have not been independently audited, it is somewhere on their long-term radar.

Privatemode have been thoroughly independently audited with documentation available on request.

Both of them are API-tokens-only. So if you're currently one of those people throwing $200 a month down the pan at Anthropic/OpenAI for a so-called-alleged 'unlimited' plan, then neither Tinfoil nor Privatemode will be the place for you.

[1] https://tinfoil.sh/ [2] https://www.privatemode.ai/

patates 9 hours ago||

> Apple next product refresh

I have this feeling that it'll be very expensive and still scarce. Normally I wouldn't say this about Apple, because their pricing is part of their brand, but this time the demand (both by data-centers and prosumers) is the force majeure.

traceroute66 9 hours ago||

> because their pricing is part of their brand

I know people usually say that about Apple, but to be fair to them on this occasion they have not hiked up their prices yet because they are clearly at present still under some old deals that they did a good job negotiating.

However, of course, at some point Apple will run out of both inventory and old-pricing manufacturing capacity. Yes, I am fully expecting some sort of price-hike like has been seen everywhere else. I am not naïve.

When that time comes it will remain a financial calculation, Apple boxes on one side versus hosted-option-costs on another, in relation to my specific use-cases.

Ultimately I still blame the chip-hoarding hyperscalers though. :)

bawana 5 hours ago||||

Even on a macStudio w 512 gig memory?

SXX 15 hours ago||||

I guess you missed the recent news. The problem is that a cloud LLM might just silently sabotage your work by downgrading the output model with no notice.

Or a cloud LLM might just refuse to sell to you because it doesn't like your passport.

yorwba 14 hours ago|||

So you're buying expensive hardware as insurance for the case that your cloud provider turns against you and you have to switch to another of the twenty offering the same model https://openrouter.ai/z-ai/glm-5.2 or in the worst case buy the same hardware later? How does that make sense?

brookst 11 hours ago|||

It’s rationalization for what people want to do anyway.

Like buying a new car today and taking on gas, parking, etc, expenses in case the bus route you’re using goes away at some point in the future. It’s not an economic decision, it’s a desire to have the new car dressed up in what-ifs.

CamperBob2 4 hours ago|||

Yes, it is understandable that people who are subject to being kicked off the bus at random times through no fault of their own, or who sometimes find that the bus slows to 8 miles per hour and makes them late for work, or who are tired of arguing with the bus driver who refuses to take them to the liquor store, the casino, or the titty bar, may aspire to own a car, even a crappy one.

Any more tortured metaphors in store for us?

swiftcoder 14 hours ago|||

This is not really a problem for the open-weight models, you can always give your money to an inference provider in a different jurisdiction

throwawayffffas 10 hours ago||||

So in my experience with two 7900XTs and models that sit fully in VRAM, it's more like 400W; the GPUs spend a lot of time waiting for each other.

DrScientist 9 hours ago||||

Depends on whether you've also gone for self-hosted electricity generation or not.

downut 5 hours ago||||

I have rooftop solar and I have been building credit with my electric utility even though the daily high temperature is well over 100F outside and a comfortable 75F inside. That includes running three AMD 12 thread 128GB systems with obsolete GPUs 24x7x365. I'm not a gamer, so 6 years ago I went low-end low-power GPUs. Boy am I dumb. Currently running the qwen3.6:27b, 35b, and gemma4:31b models just fine.

As soon as VRAM prices drop to sanity I'm going to load up and I could care less about the power draw.

Some parts of the future are absolutely great.

poulpy123 11 hours ago||||

which hyper scaler would you suggest ?

dzjkb 14 hours ago|||

how do you rent 2 3090s for $2.80/day?

zozbot234 17 hours ago|||

AIUI the llama.cpp implementation for this model is still quite half-baked due to missing support for the DSA sparse attention mechanism. This leads to running the model with a different mechanism that it has not been trained for, which has been shown to lead to lower quality and performance.

Anyway, I think GLM 5.2 in many ways is not as interesting as DeepSeek V4 series, which uses an even more advanced attention mechanism and can save a lot of memory capacity for KV cache, especially at larger contexts. Which in turn opens up wide batching especially on consumer platforms. GLM doesn't have that, in some ways it feels broadly similar to Kimi 2.6 wrt. the underlying performance architecture. Both are a bit too heavy to run reasonably at full quality on ordinary hardware.

trollbridge 9 hours ago||

Particularly DeepSeek 4.1, which they appear to be A/B testing on the API and which also seems available on the free chat interface.

It also has an input image modality, which is a game changer. The cheap Sinofrontier models have generally been lacking in this regard.

Basically, Chinese competition is fierce - DeepSeek set the pricing tier, and the question for each lab now is how to justify charging a little more.

MiMo-2.5-Pro has gone with UltraSpeed, pumping out 1000t/s for a 3X price hike.

GLM has gone with 5.2, hitting Opus levels of reasoning at a fraction of the cost.

DeepSeek will probably keep their pricing model and just keep getting better and better.

Qwen-3.7 is the dark horse. Some rumours are Alibaba is simply making these models because they need them internally.

The real question is why this level of innovation and competition isn’t happening in America or Europe. In particular I see no reason Europe doesn’t have a lab competing on these terms.

SalariedSlave 7 hours ago||

Competing and innovating in the fast moving SOTA end of the llm space requires a ruthless disregard for copyright, IP, bureaucracies, formalities, risk assurances and other slowdowns. It requires a risk tolerant, quick and large flowing investment of capital. It requires a scoped focus that is pragmatic and sharp about key concerns, and efficiently dismissive of meaningless details.

Europe can provide none of this. They will never be at the frontier of AI tech, for the same reason they were never at the frontier of any tech.

I say this as a software engineer from Europe.

leansensei 6 hours ago||

Europe was never at the frontier of any tech? Huh what now?

CamperBob2 4 hours ago|||

Not since the salad days of Nokia. Ancient history at this point.

SalariedSlave 6 hours ago|||

A hyperbole born of frustration, I admit.

Qualify it to software, rather than all tech, if you will.

dxuh 17 hours ago|||

"All it takes to run" might be fair if you paid $2400, but right now the total price is way closer to $10k (almost 5k for the RAM and 2k each for the GPUs). Today that is a lot of expensive hardware.

segmondy 17 hours ago|||

512gb 2400mhz ddr4 ram = $1600 not $5000. https://www.ebay.com/itm/188284985172 You can get creative and source 2-3 2080ti 22gb from China for about $250 a piece. You can either be resourceful and find a way or find a whole bunch of excuses.

officialchicken 14 hours ago||

> You can either be resourceful and find a way or find a whole bunch of excuses.

How about addressing this false dichotomy with the likelihood that someone who is new or interested in a tech isn't willing to drop thousands of dollars on used hardware for a whim or learning exercise.

SwellJoe 4 hours ago|||

6 tokens per second is not fit for interactive use. I find Gemma 4 (QAT 4-bit, MTP) to be tolerable at about 30 tokens per second on my old GPUs. Anything slower than 15 is annoying. I tried DS4 on my Strix halo (1-bit quantization of DeepSeek V4 Flash, the biggest model that can realistically run on 128GB, right now), and it tops out at something like 10 or 11 with a long time to first response, and that's quite painful to use. I'd definitely rather spend money to use the big models on cloud infrastructure.

And, the several thousand dollars it costs to run these things unusably slowly buys a lot of tokens on the cheap Chinese models.

pizza234 11 hours ago|||

LOL, sure this works if one has a time machine or a LOT of money to burn.

32 CPU Epyc (Epyc is required for faster memory access) + 32 GB VRAM + 512 GB RAM is stupid expensive nowadays, and in best case, it will just downgrade to "very" expensive at some point in the future.

This makes sense only if 1. one is paranoid about privacy or 2. they have money to smoke or 3. they need to work around cloud model restrictions, AND they have to do it routinely (because if not, a one-shot cloud bare metal setup is way cheaper, faster, and allows more powerful models, due to the VRAM on offer).

I did spend stupid money as well and yet, the system is 2x slower than cloud providers for comparable performance on vision tasks (I still have to test coding). Oh, and it's hot as hell.

fsuts 18 hours ago|||

6 tokens per second?

Can you put up with that? That seems very slow. I aim for 40t/s on a laptop and choose models that deliver that speed over larger, slower ones.

segmondy 18 hours ago|||

I have been putting up with it forever. We are spoiled by MixtureOfExperts. Folks were delighted to run llama3-70B at such speed. We were happy with 15-20tk/sec with 8b models, and if you could run llama3-405B at 1tk/sec you were a god. To each their own. I can live with 6 high quality tokens. If I could get a Fable equivalent model, I'll gladly take 2tk/sec if that's what it took to run it locally.

manmal 18 hours ago|||

But what is it doing for you that you couldn’t do yourself at that speed? I‘m really curious and on the fence of partly going local.

all2 17 hours ago|||

I think you would use it more like email and less like text messages, so the domain of communication shifts drastically. The other part is, you don't have to run just that model; you can offload a lot of chores to smaller models.

AussieWog93 12 hours ago||||

Not a Local LLM user, but I regularly kick off meaty jobs in Claude Code then check on them 1-2hrs later.

wqaatwt 12 hours ago||

In this case it would be 20-40 hours to accomplish the same amount of work when running locally.

Mashimo 16 hours ago|||

Run one task, while you do another? Or while you sleep / eat / rave?

manmal 10 hours ago||

While my colleagues are running 6 parallel agents at 50-100t/s each, with an actual SOTA model? Don’t you think I‘d get fired after a few weeks of that?

nozzlegear 5 hours ago|||

Do you work at Facebook and happen to find yourself in a token burning competition with your colleagues?

nijave 9 hours ago||||

I agree single digit tk/sec is painfully slow, but I also doubt anyone with these local/homelab setups is using them for work. Likely they fire off a job and check back later. That said, I've had terrible results one-shotting, so you'd need to design with a faster model or have extreme patience during the discovery/design phase.

Mashimo 8 hours ago||||

Why would you use this when your company has access to actual SOTA? I don't get it.

segmondy 7 hours ago|||

Here's a thought experiment for you. Let's say you can run 1000 agents at 10,000 tokens a second. Do you think you are going to be more productive than someone running at 6tk/sec with the same model?

In case it's not clear, you will be generating 10,000,000 tokens a second. Good luck verifying it. Token generation is not the bottleneck for creative work. If you are doing predictable work and have a good workflow and a massive dataset to process, then token speed matters. If you are performing creative work like coding, it doesn't.

froh 18 hours ago|||

do you use caveman or similar?

walrus01 14 hours ago|||

I get a lot done with something that's also approximately 6 tokens/second, if you're willing to give it a well defined set of prompts and projects to work on, leave it for an hour or two, then come back and check what it's done. And often to remember to give it something of more consequence to do for at least 3-4 hours of wall clock runtime before heading to bed.

radku 13 hours ago|||

I have pretty much almost this exact setup with 2x3090s and with slightly faster DDR4 512GB and 64 core Epyc! [0] I've been enjoying it a lot. Can't wait to give this model a try.

Apart from running local models, I use this rig as my main remote development platform. All Claude Code sessions are running there in tmux now. And my fingers can't be happier not having to deal with a constantly hot laptop. Not to mention that Claude Code is such a battery hog.

[0] https://medium.com/@rathko/i-built-an-epyc-64-core-512gb-ram...

nextaccountic 19 hours ago|||

How can you combine CPU cores and multiple GPU? Are you running some layers in cpu, others in gpu #1, and others in gpu #2? What about the bandwidth and latency between them?

Or maybe the model itself only runs on the GPUs, and the CPU memory only stores the weights for experts not currently activated? If so, then what are the 32 or 64 CPU cores for?

I'm a big fan of fully utilizing one's hardware and it's kinda sad that it's not the norm to run things on either gpu, cpu or both, dynamically choosing at runtime, for everyday software

nodja 18 hours ago|||

Pipeline parallelism. Instead of splitting layers by row/column, you split at the layer edges. So instead of having this huge bandwidth bottleneck, you only need to transfer about 4KB per token when changing devices on a model like Qwen3 30B-A3B.
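A small sketch of why the hand-off between pipeline stages is so cheap: only one hidden-state vector per token crosses each device boundary. The hidden size of 2048, the 48-layer count, and the 2-byte (bf16/fp16) activations are assumptions chosen to be consistent with the ~4KB figure above, and the helper names are illustrative rather than taken from any inference library.

```python
def split_layers(num_layers: int, num_devices: int) -> list[range]:
    """Assign contiguous blocks of layers to devices (split at layer edges)."""
    base, extra = divmod(num_layers, num_devices)
    stages, start = [], 0
    for device in range(num_devices):
        size = base + (1 if device < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


def boundary_bytes_per_token(hidden_size: int, bytes_per_element: int = 2) -> int:
    """Bytes that cross one device boundary for one token's hidden state."""
    return hidden_size * bytes_per_element


def boundary_traffic_bytes_per_s(hidden_size: int, tokens_per_second: float,
                                 num_devices: int, bytes_per_element: int = 2) -> float:
    """Total inter-device traffic during decode across all stage boundaries."""
    boundaries = num_devices - 1
    return boundary_bytes_per_token(hidden_size, bytes_per_element) * tokens_per_second * boundaries


stages = split_layers(num_layers=48, num_devices=2)   # assumed layer count
per_token = boundary_bytes_per_token(hidden_size=2048)  # assumed hidden size
traffic = boundary_traffic_bytes_per_s(2048, tokens_per_second=100, num_devices=2)
print(stages, per_token, f"{traffic / 1e6:.2f} MB/s")
```

Even at 100 tokens per second, the traffic between two devices stays well under a megabyte per second under these assumptions, far below what PCIe provides.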

xrd 17 hours ago||||

This is a good place to start reading about dual gpus.

https://github.com/noonghunna/club-3090/blob/master/docs/DUA...

nextaccountic 16 hours ago||

But in this case he used a cpu too

segmondy 18 hours ago|||

Check out llama.cpp; the entire point of the project is for us mere mortals and the GPU poor.

edg5000 19 hours ago|||

Very cool. So it's not just about GPU VRAM, which I incorrectly thought. I thought you'd need 512 GB of GPU VRAM. I don't think it cost only 2400 though; 512GB of RAM would have been more expensive back in the day. But not the mortgage-grade 200,000 which I estimated myself (which assumed running in 100% VRAM; probably overkill for a single user).

segmondy 18 hours ago|||

you can use system ram with a system like llama.cpp which offloads to system ram. token generation is a function of system bandwidth, the faster the bandwidth the better. so I'm on 8 channel 2400mhz. if I had a 12 ddr channel, I would get 1.5x the speed at 2400mhz. of course ddr5 is much faster, so a 12 ddr at 4800mhz will provide 3x the speed for token generation or roughly 18tk/sec. prompt processing is all about compute, so the more cpu cores you have the faster it can do PP.
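The scaling argument above can be sketched numerically. This assumes a 64-bit (8-byte) data bus per memory channel, treats the quoted "mhz" figures as mega-transfers per second, and takes the comment's premise that token generation scales linearly with memory bandwidth; the function names are illustrative, and the results are theoretical peaks, not measured throughput.

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int,
                        bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s for a multi-channel system."""
    return channels * mega_transfers_per_s * bus_bytes / 1000


def projected_tokens_per_s(measured_tps: float, old_gb_s: float, new_gb_s: float) -> float:
    """Scale a measured decode rate by the bandwidth ratio (linear assumption)."""
    return measured_tps * new_gb_s / old_gb_s


current = peak_bandwidth_gb_s(channels=8, mega_transfers_per_s=2400)
upgraded = peak_bandwidth_gb_s(channels=12, mega_transfers_per_s=4800)
projected = projected_tokens_per_s(6, current, upgraded)

print(f"8ch DDR4-2400:  {current:.1f} GB/s")
print(f"12ch DDR5-4800: {upgraded:.1f} GB/s ({upgraded / current:.1f}x)")
print(f"projected decode: {projected:.0f} tk/sec")
```

The 1.5x from extra channels and the 2x from the faster transfer rate multiply to the 3x stated in the comment. Prompt processing is compute-bound and would not follow this scaling.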

nijave 9 hours ago|||

Well, it's about GPU VRAM if you want something competitive with cloud-hosted offerings at the performance levels showing in benchmarks. This is a heavy quant with quality degradation and significantly lower performance.

Cloud offerings are 80-200tk/sec versus single digit tk/sec.

That said, I'm also surprised it runs at all locally. I do think it'd be painfully slow for anything interactive so you're relying on another model for a comprehensive design or you're hoping a one-shot with somewhat degraded quality turns out correctly.

edg5000 9 hours ago||

I see. So not quite usable apart for specific use cases. Maybe in a few years we'll see new hardware players and better prices.

nijave 1 hour ago||

I think we'll see

- better hardware

- more efficient model runtime algorithms/code

- smarter/more efficient models (same capability with fewer parameters)

So ideally these will all come together and help.

redox99 20 hours ago|||

That's crazy good for $2400.

ikari_pl 8 hours ago||

I can work out max 90GB to the agents. Advise. :)

draginol 8 hours ago||

The most interesting part of this to me is not the benchmark table, but the packaging.

A model like GLM-5.2 being available as GGUF, usable through llama.cpp/Ollama/vLLM/SGLang/LM Studio, and wrapped for local agent workflows changes the category. It stops being "an impressive open model exists" and starts becoming "this is something a small team can actually put into its development stack."

For instance, company buys an RX6000 setup for say $15k total. They could use this for handling data heavy sifting that would otherwise be a lot of Claude tokens.

It doesn't need to be as good as frontier-best. Just good enough.

I could see a business of people packaging this and handing it to companies who want Help Desk bots without any extra setup.

giancarlostoro 8 hours ago|

> For instance, company buys an RX6000 setup for say $15k total. They could use this for handling data heavy sifting that would otherwise be a lot of Claude tokens.

Considering they might be spending thousands per month on API costs already, dropping 15K to save on one process might not be bad. On the other hand, it's also an opportunity to sell GLM 5.2 inference at near cost to other companies for less than whatever Claude costs. In theory it costs anywhere from $0.51 to less than $2 an hour to run it, and even if you use it 24/7 that's still wildly cheaper than calling Opus, which doesn't bill per hour but per million tokens, at a drastically higher rate. Hell, you could probably bill at $5 per GPU hour and still be cheaper. Whether you're looking to self-host or sell hosting for it, it looks way cheaper regardless. I think most decent open models will continue to fit in at least 32GB of VRAM, so a 6000 Pro GPU is more than enough. Alternatively, even on a 5090 you can get a reasonable amount of inference for way less than paying for Opus; Qwen would be your friend there though.

xrd 1 day ago||

So close! My machine with 192GB RAM + RTX 3090 24GB can almost run this. It says it needs 24GB of VRAM and 256GB of RAM for MoE offloading.

https://unsloth.ai/docs/models/glm-5.2#usage-guide

In a prior thread, someone said it would take $500k in hardware:

https://news.ycombinator.com/item?id=48629970

elliotbnvl 23 hours ago||

$500k is a vast overestimation. For massive concurrency at FP8 or even BF16 maybe.

NVFP4 at reasonable speeds (~120 tok/s) and concurrency is possible at a $80/90k figure with today's prices, maybe even less. That buys you 6 RTX 6000 PRO Blackwells, a decent CPU and motherboard, power supply. 576gb of VRAM.

You could do it for under $50k if you're OK with 40 tok/s decode, ~1200 tok/s prefill.

hbbio 22 hours ago|||

Yes, a single GB300 workstation also does it, probably even more than 120tok/s.

Official price 85k...

simpaticoder 8 hours ago||

Actual price $100k and everything is very closed and proprietary. Oddly this MSI system provides "only" 252G vram and 500G ram. I would have expected more vram for this price. Also why 252 instead of 256? https://www.centralcomputer.com/msi-xpertstation-ws300-ai-wo...

throwawayffffas 11 hours ago||||

You can get a 1TB of HBM2 vram for like 10k, https://www.ebay.com/itm/177571378959

The problem is the backplane I have not managed to find a single baseboard, and getting a random baseboard to work with random modules is probably a crap shoot.

__m 22 hours ago||||

How fast will the hardware become outdated? Are there big improvements expected in the next 3 years?

easygenes 22 hours ago|||

M5 Ultra will ship before end of year, likely. Though with current RAM shortage, likely max spec will be 256GB and in short supply.

In late 2027 or early 2028, Nvidia will release Vera Rubin DGX Spark, likely with double or better the performance of current Blackwell, though unclear if memory capacity will go up much from current 128GB. Two to four of those will run models like this decently.

In 2028 we should expect Vera Rubin RTX discrete lineup, including the replacement to the RTX PRO 6000. Likely memory spec will be minimum 128GB. Good chance of up to 200GB. Two to four of those will run NVFP4 models in this class very well.

hajile 7 hours ago|||

It might be M6 Ultra and I think the real reason for stopping selling top-tier units was to avoid mid-generation price hikes and increasing demand for the more expensive next-gen systems that I assume will come with 512gb (maybe 1TB) of RAM and a massive markup to match.

jiqiren 19 hours ago||||

I hope all this speculation comes true. Right now this ram crunch is ridiculous.

digitaltrees 21 hours ago||||

I feel like the models are good enough for a decade of future work. So once you have a working setup, you can keep using it to do the work at the same level. There will be better stuff, and it may make that type of work obsolete, but if you can do useful things with it, it won’t be worth less.

Tepix 18 hours ago||||

I think there is a gap right now for running large models such as GLM 5.2 in Q4 or Q8. My hope is on Intel Crescent Island 480GB cards. Let‘s see how expensive they‘ll be.

npodbielski 9 hours ago||

480GB? Probably like 100k$ each? :D

segmondy 20 hours ago|||

P40 was released in 2016 and is still selling like hotcakes!

easygenes 21 hours ago|||

[dead]

mgambati 1 day ago|||

With 2-bit you wouldn’t get good results. The ideal range for coding is at least Q8.

kibibu 1 day ago||

According to this very article, 4-bit dynamic is essentially lossless

Aurornis 22 hours ago||

Watch out. Those claims are often made based on KL-divergence over some arbitrary corpus, not performance in the real world or benchmarks.

I’ve found that I need to go a couple steps past whatever quantizations are good enough in the KL-divergence testing to get good performance in real tasks with long context. So when Q4 is claimed to be lossless I end up with Q5 or Q6 for actual long-context tasks.
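For anyone unfamiliar with the metric being discussed, here is a minimal sketch of a KL-divergence comparison between a full-precision model's next-token distribution and a quantized one. The logits are made up for illustration; real evaluations average this over many token positions in some corpus, which is exactly why the choice of corpus matters.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; p is the reference (full-precision) distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for one token position over a 4-word vocabulary.
full_precision_logits = [2.0, 1.0, 0.1, -1.0]
quantized_logits = [1.9, 1.1, 0.0, -0.8]

p = softmax(full_precision_logits)
q = softmax(quantized_logits)
kl = kl_divergence(p, q)
print(f"KL(full || quant) = {kl:.5f} nats")
```

A small average KL says the two distributions are close on the test text. It does not by itself say how errors compound over a long generation.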

ijidak 21 hours ago|||

Crossing my fingers that this boom jumpstarts '90s-like improvements in computing hardware.

I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.

Most of the money and energy went to mobile for the last fifteen years.

Affordable local inference might be the gravy train the server, desktop, and laptop manufacturers need to get back in gear.

0xbadcafebee 20 hours ago|||

Definitely the stagnation was due to a lack of use cases, but this isn't a bad thing. We don't need most of the hardware advancement we got.

Business hardware got beefier because businesses demanded more data (or more specifically: the industry told businesses they needed more data), with no idea of what to actually do with it once they got it. To get all that data, bandwidth needed to be increased, with more iops to read/write it, more storage to keep it, and more memory and cpu to process it. But 99% of the data is junk. Companies have "data lakes" so big they need to come up with excuses to use the data, or risk somebody pointing out that they're spending a fortune hoarding bits.

Consumer hardware hasn't had a new use case since like 2012. Faster wifi for broadband & local file transfers, and higher-resolution video, are the only reasons one needed newer hardware. We actually got a resolution so high it makes no perceivable difference. And yeah we got faster CPUs and memory, but as soon as we did it got all eaten up by the most inefficient, wasteful software conceivable. Same use cases as 13 years ago, just more expensive, harder to use, and buggier. We should've gotten a new CPU architecture that was faster and more energy efficient. Finally it was delivered, but with a moat around the golden Apple.

Here we are two and a half decades into the Internet era, and my damn bluetooth earbuds and webcam microphone don't work half the time that I open a video conferencing app. Hardware can stay exactly like it is for the next few decades and I'd be happy. I just want software that works, and doesn't get continuously slower, forcing me to buy bigger hardware; or more draconian, locking me out of being able to use it how I want.

omnimus 16 hours ago||

The natural progression when performance is enough would be price. We were starting to see that but not anymore. I wonder if somebody is afraid of a future where generally useful computation is cheap.

gruez 21 hours ago||||

>I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.

No, we're running into limits of moore's law, and it's showing in prices for new nodes, where they're getting denser but not cheaper.

horsawlarway 19 hours ago||

It's true we hit limits, but I feel like a lot of it was "limits" in the sense that the tradeoff stopped being worth the cost, so we optimized in other areas.

So we hit limits on clock speed in the early 2000s (ex - the 4ghz wall) but it also turned out that mobile as the driver for sales meant no one really cared much about clock speed compared to performance/watt.

Clock speed mattered, but only relative to how many watts it took to get it (and above 4ghz... too many watts).

But we've seen a 15x improvement over the last 20 years. Performance/Watt is WAY up.

My guess is that LLMs are going to drive another "improvement cycle" in areas that we didn't care much about before.

I've built about 10 personal desktop machines (1 every ~4 years) and I can honestly say that I didn't care much about memory bandwidth prior to 2021.

In the same way that I didn't care much about how many watts my pentium 4 was using in 2005.

But now... now I care a lot about memory bandwidth. I care about memory speeds and total system ram in a manner I really, really didn't before.

So I think we're going to see a big shift to machines built on unified ram with a crazy focus on squeezing memory bandwidth and total ram capacity as far as we can.

My bet is that we'll get a similar 10-15x improvement by 2040 in unified system ram designs.

I fully expect to see 2tb unified ram desktops and 200gb unified ram phones be relatively common on a 20 year timeline, assuming we see similar levels of geopolitical stability (ex - world war 3 throws a wrench into things).

BobbyTables2 18 hours ago||||

Yeah, even Windows managed to not drive terribly dramatic upgrades in general computing (besides Windows’ absurd RAM usage and now requiring a TPM).

In the old days, Microsoft Entertainment Pack games were somewhat visibly taxing on some lower end systems.

linzhangrun 21 hours ago|||

Physical limitations of the manufacturing process may be a more significant factor, starting from the TSMC 10nm ten years ago.

bbor 10 hours ago|||

I’m kinda lost here… do y’all really have machines in your houses with hundreds of gigs of RAM?? Am I just behind the times?

The page advertises the 8-bit quant as taking ~800GB, which seems like it would require at least 3 consumer motherboards fully stacked w/ 4x64GB cards each.

Maybe “locally” has slowly come to imply “…on your homelab”?

numpad0 9 hours ago|||

DRAM prices at mid-2025 rates were ~$2.5/GB for DDR5, and ~$1.5/GB for DDR4. "Hundreds of gigs" of RAM used to be under $500. 128GB of the cheapest RAM used to be like $200. For some reason it seemed to go over a lot of people's heads that the hypothetical future machines from CS/CE textbooks were attainable for that little; there seemed to be some fixation on the idea that 16GB is all you need.

Gracana 9 hours ago||||

You don't have to have a server, workstation motherboards support lots of memory channels.

I was lucky to buy a lot of RAM before prices skyrocketed. I knew I wanted to play with this stuff, so I spent what felt like a lot of money at the time to buy 8x96GB DDR5-6400 RDIMMs. Now the same RAM costs at least 6x more.

woodrowbarlow 8 hours ago||

[dead]

oceanplexian 8 hours ago||||

As soon as Llama came out I had a realization what was coming and went all-in on hardware with the assumption open source would catch up with GPT4. Surprise, it did, we now have small models that absolutely crush GPT4 in performance.

It wasn’t that absurdly expensive for a hobby, I bought 64GB DDR4 ECC sticks between $70-$100 on eBay before everything took off. Now everyone is in here debating if open source is 1 month or 3 months behind SOTA. The future is obviously local.

nijave 9 hours ago||||

I got a 2U rackmount with 192Gi DDR4 for $1.1k USD in 2023. Around 1.5 yrs ago, server RAM could be had pretty cheap, especially slower LRDIMMs (I wanna say 512Gi DDR4 was <$500 USD). I checked a couple of old ServeTheHome threads and I'm seeing maybe around $50 per 32GB RDIMM, although I thought it was cheaper than that for a little while.

cpburns2009 10 hours ago|||

RAM wasn't expensive even a year ago. I maxed out a used Dell Precision T5610 with 128 GB DDR3 for $250 in 2021.

cheema33 1 day ago|||

I have the RAM, but not the VRAM. What kind of speed/tps could you expect from a 3090 with 24GBs of RAM? I am somewhat tempted to pick a GPU with 24GBs of RAM.

ekidd 15 hours ago|||

A GPU with 24GBs of RAM is mostly useful for running a very carefully squeezed Qwen3.6 27B (4-bit Unsloth quants, 8-bit K/V cache, possibly MTP, 128k context). This is a fun little model that's smart enough to do debugging, refactoring, and implementing "clean" specs that don't force it to make complicated design choices. I've seen it rip through a 9-year-old Terraform AWS config, and (without using the network) correctly identify nearly everything that would need to be upgraded or migrated for modern AWS. But if I give it some poorly conceived spec with lurking design headaches, then it goes on an endless thinking binge and ultimately fails.

Speed-wise, I don't have numbers, but it feels subjectively faster than Opus in Claude Code. YMMV.

Once you go above "a used 3090 at a decentish price", then I strongly recommend renting cloud GPUs or at least testing models using paid APIs. This allows testing your use case before spending piles of money.

phamilton 22 hours ago|||

Generation is basically just memory bandwidth math.

Each token has to read all the active weights. I think that's around 40B parameters active. At a 4-bit quant that's 20GB. With 100GB/s (replace with whatever your bandwidth is), you get 5 tokens per second.
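The arithmetic above can be sketched as a small function. This is a rough upper bound only: it assumes decode is purely bandwidth-bound and ignores KV cache reads, and the 40B active-parameter figure is the commenter's estimate.

```python
def decode_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gb_per_s):
    """Upper bound on decode speed when every token must read all active weights once."""
    bytes_per_weight = bits_per_weight / 8
    # Billions of parameters times bytes per parameter gives gigabytes per token.
    active_gb_per_token = active_params_billion * bytes_per_weight
    return bandwidth_gb_per_s / active_gb_per_token

rate = decode_tokens_per_second(
    active_params_billion=40, bits_per_weight=4, bandwidth_gb_per_s=100
)
print(rate)  # 5.0
```

Swap in your own bandwidth figure to get a ceiling for your machine; measured numbers will usually land below it.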

SlavikCA 18 hours ago||

And with MTP (or other speculation techniques) you can ~double that.

phamilton 8 hours ago||

MTP on a MoE is hit or miss. If you're bottlenecked on memory, MTP can increase the number of active experts (like any batch processing would), which can eat away gains from it.
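A rough way to see the effect: if each token in a verification batch picks its experts independently, the number of distinct experts that must be read grows with the batch. The sketch below assumes uniform, independent routing and a made-up expert layout (256 experts, 8 active per token), which is not GLM-5.2's actual configuration.

```python
def expected_distinct_experts(total_experts, experts_per_token, tokens_in_batch):
    """Expected number of distinct experts touched per layer, assuming each token
    picks its experts uniformly and independently (a simplification)."""
    p_untouched = (1 - experts_per_token / total_experts) ** tokens_in_batch
    return total_experts * (1 - p_untouched)

# Hypothetical MoE shape, not GLM-5.2's real configuration.
TOTAL_EXPERTS = 256
EXPERTS_PER_TOKEN = 8

for tokens in (1, 2, 4):
    experts = expected_distinct_experts(TOTAL_EXPERTS, EXPERTS_PER_TOKEN, tokens)
    print(f"{tokens} token(s) verified together: ~{experts:.1f} experts read per layer")
```

Under these assumptions, verifying two tokens at once nearly doubles the routed-expert weights read per step, so the drafted token has to be accepted often for MTP to pay off. Real routing is likely correlated between adjacent tokens, which would soften this.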

uberex 22 hours ago||

Funny I casually asked Gemini and it said 500k for unquantized with decent throughput.

stymaar 21 hours ago|||

This is why you shouldn't uncritically believe an answer from an LLM (nor any answer from a human, for that matter).

andy_ppp 19 hours ago||

But I did my research online and the sun cycle is every 11 years and something something global warming is a hoax every single year now.

nijave 9 hours ago||||

That's fair for new hardware. You probably want to prompt "homelab" or "used hardware" to compare what's in this thread.

colinsane 21 hours ago||||

i asked gemini and it replied with "Error: 400 Your prompt was blocked by safety filters. Please revise and try again."

matheusmoreira 12 hours ago|||

Safety from competition!

digitaltrees 19 hours ago|||

I asked and it said “403 forbidden - careful peon attempts to bypass the late stage capitalism api with your monetary offerings in exchange for your daily tokens will get you perma banned right to jail”.

j45 21 hours ago|||

LLMs aren't discrete calculators or estimators of things unless framed and guided to do so.

uberex 15 hours ago||

Good job I didn't use a vanilla LLM without tool use harness then.

skiing_crawling 23 hours ago||

"it can fit" on 256GB of RAM, but it will be heavily quantized and still run very slowly. The headline number is not token generation, its prompt processing. So if you get 10 tok/s and an API gives you 20-30 tok/s, it doesn't seem that bad on its face, but a mac studio or any other machine that's not loading all of it into GPU will do PP 20-50X slower than a purely GPU based setup, which is what actually makes this unusable without $50k in GPUs.

On top of that, you will still be heavily quantized.
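To make the prompt-processing point concrete, here is a sketch of end-to-end request time split into prefill and decode. The request size and all four rates are illustrative assumptions; the 30x prefill gap was picked to sit inside the 20-50X range claimed above.

```python
def request_seconds(prompt_tokens, output_tokens, prefill_tok_s, decode_tok_s):
    """Return (prefill, decode) time in seconds; the whole prompt is processed
    before the first output token appears."""
    prefill_time = prompt_tokens / prefill_tok_s
    decode_time = output_tokens / decode_tok_s
    return prefill_time, decode_time

# Hypothetical agentic-coding request: long prompt, short answer.
PROMPT_TOKENS = 50_000
OUTPUT_TOKENS = 500

# Illustrative rates only, not measurements.
setups = {
    "unified-memory box": {"prefill_tok_s": 150, "decode_tok_s": 10},
    "all-GPU rig": {"prefill_tok_s": 4500, "decode_tok_s": 40},
}

for name, rates in setups.items():
    prefill, decode = request_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, **rates)
    print(f"{name}: prefill {prefill:.0f}s + decode {decode:.0f}s = {prefill + decode:.0f}s")
```

With long prompts and short outputs, prefill dominates the wait on the slower setup even though its decode rate looks tolerable on paper.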

gerdesj 23 hours ago|

A nvidia spark thingie has 128GB unified RAM. They also have a dual port version of one of these things: https://www.nvidia.com/content/dam/en-zz/Solutions/networkin.... ie 2 x 100GB/s ports, they may even be 2 x 200GB/s. Once I've got my paws on one, I'll know more.

You can cluster these beasts too. Two and three (with two IP subnets) is fairly obvious. Four or more might need a switch depending on how much network latency affects things.

Apple seem to have forgotten about M series with gobs of RAM. I can't get the Apple shop to show more than 96GB of unified RAM and that costs a kidney.

mapontosevenths 22 hours ago|||

I have one, and I love it. That said, my buddy's Mac smokes it for inference workloads in terms of tokens per second AND it's more usable for other things.

If you are training and doing research it's great, if you want to cluster them it can't be beat, but if you just want local inference on a single box buy a mac or even a strix halo device.

colinsane 21 hours ago|||

can those macs boot linux? i've heard about Asahi but have no idea how far along they are. i've got my fleet configured with nix and sure, nix can target darwin, but there's a _lot_ of sharp edges there: i don't really want to pull that thread unless i have to...

theYipster 5 hours ago|||

Not the new ones. Only the M1 and M2 have good support for Asahi. But you really don't need it. If you need Linux, use a VM (UTM is free and is equivalent to KVM/QEMU in speed, despite being a Type-2 Hypervisor.)

mapontosevenths 20 hours ago|||

I don't know. I think he just uses LMStudio most of the time on his, but that's one place I can say the spark really shines for me.

I'm a Linux guy, but also don't always have a lot of time. The Spark comes out of the box with a nice Linux distro that's pre-configured to be easy to set up, and the guides and online resources make getting up and running trivial, even for some complex tasks. You would have to do a LOT of tinkering just to figure out some of the things the nvidia resources walk you through natively. They have guides for a ton of stuff that include the optimal settings so you don't have to figure it all out through trial and error.

Check out these "playbooks" for some examples. [0] There's a lot to be said for not having to piece all that together yourself.

[0] https://build.nvidia.com/spark

I think the time between unboxing mine, setting it up to run headless, and generating tokens was like 20 minutes total for me.

Fizz43 22 hours ago|||

which mac is smoking the spark?

theYipster 5 hours ago|||

Mine, for one. M5 Max MacBook Pro 128GB with a 4TB SSD. $5100 after a $1000 discount at Microcenter. Great deal if you can find it in stock.

pmarreck 21 hours ago|||

pretty much any of them, dude, as long as you have enough RAM, since it uses unified RAM and a powerful SoC CPU/GPU. Literally any M-class model, but the M5 is currently top tier.

dannyw 19 hours ago|||

The DGX Spark has basically the same memory bandwidth as a M5 Pro, and far more than a M5.

Only the M3 Ultra really beats it, and once you start scoping out the cost of a M3 Ultra with 128GB or 256GB, the DGX Spark doesn’t look bad after all.

entrope 10 hours ago|||

> The DGX Spark has basically the same memory bandwidth as a M5 Pro, and far more than a M5.

I see ~274 GB/sec for the DGX Spark[1], versus 307 GB/sec for M5 Pro and 460 or 614 GB/sec for M5 Max[2]. One might call 90% "basically the same", but there are nominally two tiers above "Pro".

Yes, a MacBook Pro with 128 GB and M5 Max costs $5100 (14") or $5400 (16") versus currently $4700 for the DGX Spark, but the MBP includes keyboard, mouse, battery and portability. I believe its prefill is slower and you get 2 TB vs 4 TB SSD, but overall one gives up a lot to save 10% of the cost.

[1]- https://docs.nvidia.com/dgx/dgx-spark/hardware.html [2]- https://support.apple.com/en-us/126319

pmarreck 10 hours ago|||

I looked, but a sibling comment just provided the links. ~274 GB/sec for the DGX Spark, vs. 307 GB/sec for M5 Pro, and max 614 GB/sec (!!!) for M5 Max? Why would you completely friggin’ lie about this, or at minimum, not double-check your facts before bullshitting? Plus, you get a full-fledged computer along with it!

Apple could actually be a good deal and you folks would still make up something to not justify it. In a way, it’s amazing what Apple has accomplished- Baseless negatively-tainted perception in certain influential tech circles.

(To be fair, they’re kind of earning it. I’m glad Tim “Sweet T” Cook is departing.)

Plus, my original comment got downvoted despite being factually-correct. Thanks, Reddit. Oh, wait…

mapontosevenths 20 hours ago||||

Yep. Memory bandwidth is what decides how fast LLMs generate tokens (mostly). The DGX Spark has something like 270 GB/s of memory bandwidth, and the M5 Ultra is ~615 GB/s. Theoretically DOUBLE the speed. In practice he only generates like 25% more tok/s, but that's still very impressive.

The spark can fine tune models in 1/4 the time and excels at other compute tasks in ways that Mac never can. Plus the high bandwidth ConnectX-7 ports would be like $1700 to buy on a card just for the network adapters... But for generating tokens, it just plain loses.

fsuts 18 hours ago|||

How noisy does his fan get…

pmarreck 10 hours ago||

it doesn’t get noisy at all

mapontosevenths 2 hours ago||

In case anyone was wondering my spark is basically silent as well. It's great at being ignored, if that's really important to you. I've run mine completely headless since I bought it, including setup.

justincormack 12 hours ago||||

It is 2x200Gb/s physically but the PCIe bandwidth is basically only 200Gb/s so it may as well be one, and actually it's a weird 2xPCIe4 not 1xPCIe8 so it appears in software as dual 100Gb/s. It's a bit odd.

jauntywundrkind 21 hours ago||||

200 Gb / s (not GB/s)!

(Still potentially very useful! But not magically ultra fast.)
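The bits-versus-bytes gap is easy to underestimate, so here is the conversion spelled out. It ignores protocol overhead, and the 273 GB/s local memory figure is the one quoted for the DGX Spark elsewhere in this thread.

```python
def gigabits_to_gigabytes_per_s(gbit_per_s):
    """Convert a link rate in Gb/s to GB/s (8 bits per byte, ignoring protocol overhead)."""
    return gbit_per_s / 8

link_gb_s = gigabits_to_gigabytes_per_s(200)  # one 200 Gb/s port
spark_memory_gb_s = 273                       # local memory bandwidth quoted in this thread

print(f"200 Gb/s link = {link_gb_s:.0f} GB/s")
print(f"Local memory is ~{spark_memory_gb_s / link_gb_s:.0f}x faster than the link")
```

So a clustered setup works best when only small activations cross the link per token and the weights stay in local memory on each node.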

Computer0 22 hours ago|||

128 gb of much slower ram than Apple.

dannyw 19 hours ago||

DGX Spark is ~273GB/s. That’s about M5 Pro territory, and twice as fast as the M5. You’d have to go to the M5 Max, or M3 Ultra, to get higher memory bandwidth than the Spark.

hajile 7 hours ago||

If you are trying to get more than 64gb of RAM or doing tons of inferencing, you're getting a Max or Ultra anyway.

Frannky 19 hours ago||

There is a push from multiple directions at the same time:

- new AI desktops with GB10s. They are relatively cheap and you can cluster them and load 1TB of VRAM

- Nvidia, amd, intel, Cerebras etc pushing new hardware

- oss models getting crazy good, like glm 5.2

- flash models getting very good like deepseek V4 flash

- quantizations

- harnesses being able to use different models (big for difficult stuff, small for grunt work)

So hopefully soon for the ones who want to break free from APIs, we will be able to host at home a cluster of AI desktops at a reasonable price with Opus-level capabilities, can't wait!!

khafra 12 hours ago||

I feel like "relatively" is doing a lot of work, there: at about $4k per GB10, that's $36k for a 1TB cluster. Cheap compared to equivalent H200's, but out of reach for home labs that aren't funded with OpenAI or Anthropic RSUs.

snarfy 10 hours ago||

When the AI bubble pops those hardware prices will pop too.

Tepix 9 hours ago|||

My hope is on Intel Crescent Island with 480GB. I don't need 8x H200 performance (and cost), but I would like to run GLM 5.2 Q8.

MaKey 6 hours ago||

I'd love to too, but I guess Crescent Island with 480 GB will cost something like $10-12k or even more.

matheusmoreira 12 hours ago||

Hope you're right! Can't wait!

pheggs 1 day ago||

I feel like the gap is closing to be able to run good enough models locally even for coding and I would assume it could make some companies a bit nervous. Am I wrong about that?

UncleOxidant 23 hours ago||

If we didn't have a RAM/GPU shortage right now they would be more nervous than they are. But as it is very few people are going to be able to afford a rig that can run this model effectively. That's probably not going to change for several more years yet. I think if the Z.ai folks decide to come out with a flash version of GLM-5.2 specialized for coding that came in at about 80B params, then the US frontier labs would probably be more worried. Overall, the Chinese AI companies have been showing the way to do the same amount with less (sometimes much less) and as that trend continues it's going to make the frontier labs worried - but even the Chinese AI companies are going to want to protect their moat by not releasing capable models that are significantly smaller than their current flagship models. Alibaba's Qwen seems to be there now - it's gotten mighty quiet from them lately - their latest 395B model is just too large for most folks to run at home and they don't seem to be making any noises about releasing smaller ones this time around.

gpm 23 hours ago|||

The ram/gpu shortage won't last forever though. Moreover we can be pretty confident that long-term the prices will obey Wright's law and come down in cost significantly (from the pre-shortage prices) as we learn to produce them more efficiently.

LLM companies are valued as if they're going to have some enduring monopoly that they can extract money from... GLM-5.2 and similar models make that valuation very very questionable.

UncleOxidant 23 hours ago|||

> The ram/gpu shortage won't last forever though.

No disagreement there, but it could easily last another 3 to 5 years which is a long time in tech terms.

DougN7 19 hours ago||

Long enough for them to IPO and all the execs to retire. I doubt they care beyond the IPO.

pheggs 10 minutes ago|||

thats pretty scary to me. what will the data centers be used for if people run that stuff offline? maybe new models? but will there even be any demand? I guess we will see

r0b05 12 hours ago|||

I think this is the play

mannanj 23 hours ago|||

> The ram/gpu shortage won't last forever though

Don't underestimate the market's ability to remain irrational

colinsane 20 hours ago|||

the companies which have the power to alleviate these shortages are the same companies who are profiting most from the shortage. scarcity is an asset, it's not irrational that a concentrated market will produce more of that asset.

selectodude 20 hours ago||

The solution for high prices is high prices.

If making RAM and SSDs is now cause for a 10 figure valuation, after enough time somebody will dive in.

Tepix 9 hours ago|||

What's the irrational part? There's sky high demand.

mannanj 5 hours ago||

maybe the irrational part is the amount of demand for consumer hardware, wouldn't the market for professional ML/AI used hardware go away from consumer hardware over time? (I can talk more about what I mean consumer hardware to be)

Also irrational parts of this market (would love to hear your thoughts):

- the purchase of hardware that isn't power efficient or gives an ROI for ML/AI use cases by companies buying it, who would be priced out of using that hardware over time

- many people and companies are buying the hardware due to hype and scarcity/FOMO reasons over rational reasons

bawana 5 hours ago||||

is it possible that ai companies ordered a bunch of ram just so that models cannot be run locally? they are betting new fabs wont be built before quantum takes hold.

elorant 22 hours ago||||

Very few people, but quite a lot of companies especially after per token pricing took effect and companies see their invoices skyrocketing. You pay an upfront cost once and you’re done.

dannyw 19 hours ago||||

When a large open weight model is released, a lab, startup, or a rich hobbyist can easily do logit-level distillation and create an XXb param model or whatever, and in theory you should get a really good distill.

verdverm 22 hours ago|||

I suspect the time horizon is shorter because of software advances. We are getting more capability out of smaller models

Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)

UncleOxidant 20 hours ago||

> Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)

True, Qwen3.6-27B is amazing for its size. However, it seems likely that we're not going to see any more of these smaller models from Alibaba/Qwen since several key players exited that organization a few months back.

Infernal 20 hours ago|||

Do we know where those key players went?

verdverm 19 hours ago|||

Good to know. I think the trend is clear based on the models coming out of China, and we'll see more capabilities in smaller, more efficient models.

cogman10 1 day ago|||

I don't think so. I could easily see a company deciding to host and run these models for their own development. If you have a dev team of about 10 people, a one time $50k investment in an LLM server has to be pretty tempting. Unlimited tokens, decent performance, upgrade options, and potential product integrations.

For companies wanting LLMs in their products in general, I have to think going the local llm route is even more tempting. Somewhat dumb models are more than good enough for a lot of the things people are integrating LLMs into their products for.

twelvechairs 23 hours ago|||

Surely for most the desire is just an LLM provider that doesn't store or sell their queries (including by national actors). As long as that is allowed to happen surely it's the answer for the vast majority.

matheusmoreira 12 hours ago||

> LLM provider that doesnt store or sell their queries

> As long as that is allowed to happen

It won't be. Only we can provide that, and only for ourselves.

eventualcomp 23 hours ago|||

Where is $50k coming from again?

stingraycharles 23 hours ago|||

That’s less than the monthly salary of 10 software engineers, and assuming they pay API prices, probably earns itself back in about a year.

Having said that, I don’t think it’s all that tempting for companies at all, considering this whole market is developing rapidly and it’s nearly impossible to predict where we’ll be at in a year or two.

cogman10 23 hours ago||

The hardware requirements aren't evolving and the local models have only been improving.

It's not like you'd lose capabilities, if anything this solution just gets better with time.

chatmasta 22 hours ago||

If the newer models require more/better hardware then you’ll lose capabilities.

I think you’re better off renting GPU instances and running all the software on those. It’ll be cheaper than Anthropic and OpenRouter but slightly more expensive than electricity and depreciation of hardware.

cogman10 21 hours ago||

The newer models don't require more/better hardware. There's a small army of local llm enthusiasts who are running LLMs using 3090s and H100s because they have lots of memory. Them being old isn't really that big of an issue as the compute power needed is relatively low all things considered.

The number of parameters needed for these open weight models has mostly stabilized so the actual memory requirements aren't likely to change all that much.

dannyw 19 hours ago||

Correct. The main bottleneck with LLM inference is, and has always been, memory bandwidth.

TPS = your memory bandwidth (GB/s) / active weights in GB.

That’s it for decode. That’s all.
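Read the other way around, the same relationship gives the bandwidth a target decode speed needs. A minimal sketch, assuming each generated token reads all active weights exactly once and reusing the ~20 GB figure (about 40B active parameters at 4-bit) estimated earlier in the thread:

```python
def required_bandwidth_gb_per_s(target_tok_s, active_weights_gb):
    """Bandwidth needed if each generated token reads all active weights once."""
    return target_tok_s * active_weights_gb

# 20 GB is the 4-bit, ~40B-active estimate given earlier in the thread.
ACTIVE_WEIGHTS_GB = 20

for target in (10, 35, 60):
    needed = required_bandwidth_gb_per_s(target, ACTIVE_WEIGHTS_GB)
    print(f"{target} tok/s needs at least ~{needed} GB/s")
```

This is a floor, not a prediction: KV cache reads at long context and imperfect bandwidth utilization both push the real requirement higher.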

Tepix 9 hours ago||||

$50K seems low if you want to run, say, GLM 5.2 4bit fast enough for a team of devs.

You need something like 6x RTX Pro 6000 at $11800 each plus a nice server (add $10000) = $80800 and then quite a bit of electricity.
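For anyone who wants to plug in their own prices, the sum above as a quick sketch. The 96 GB per card is an assumption inferred from the 6-card, 576 GB configuration mentioned earlier in the thread; electricity is left out.

```python
GPU_COUNT = 6
GPU_PRICE_USD = 11_800      # per-card price quoted above
SERVER_PRICE_USD = 10_000   # "nice server" allowance quoted above
VRAM_PER_GPU_GB = 96        # assumed; implied by the 6-card, 576 GB figure in this thread

total_cost = GPU_COUNT * GPU_PRICE_USD + SERVER_PRICE_USD
total_vram = GPU_COUNT * VRAM_PER_GPU_GB

print(f"${total_cost:,} for {total_vram} GB of VRAM (${total_cost / total_vram:.0f} per GB)")
```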

theYipster 5 hours ago||

You don't need all of the model in VRAM. 1 or 2 RTX Pro 6000s will do. $50K will get you there very nicely, and on a 1600 watt PSU if you go for the MAX-Q versions. (The same wattage PSU I'm typing this on, and have been using over the last 5 years.)

cogman10 23 hours ago|||

As in who pays for it or how did I arrive at that number?

For who pays for it, obviously the employer would.

For "how did I arrive at this number": a ballpark estimate from what I know about part costs. Most of that money will go towards AI cards. About $5k for the mb, cpu, power supply, etc. $45k would be for as much ram and as big/expensive nVidia cards as you can get your hands on. The B300 has 288GB of VRAM in it. Probably what you'd be after.

simplyluke 21 hours ago|||

You don't even need to run them locally for them to be a threat. Plenty of companies are looking at paying third party companies to host these models and they come in at fractions of the price of the frontier labs.

fny 1 day ago|||

The RAM requirements are still pretty painful.

yieldcrv 23 hours ago||

equilibrium in one or two more years on the consumer/prosumer side

think Apple M6 or M7 with a currently unforeseen denser memory style, 256gb RAM

a couple inference or cache improvements on the algorithmic side, using less ram for context windows and doubling token speed again

denser open source models, packing more experts for smaller active layers

it'll still be expensive but like $8,000 - $13,000 instead of $450,000 worth of B200s

stingraycharles 23 hours ago||

Fairly certain that model sizes and computational requirements will grow as the price for LLM compute drops.

3stacks 22 hours ago|||

Maybe there's a conversation to be had about how much is enough... Unless something beyond my imagination happened, I would be happy enough with Opus 4.5 levels of productivity

stingraycharles 13 hours ago||

This really sounds like “640kb should be enough”.

I’m sorry, but I just can’t imagine us running smaller models than we are using right now in 5-10 years from now.

hajile 7 hours ago||

We've already hit RAM power and size limits (about 40k electrons which is the limit before we get noise messing up the amplifier).

If a model needs 2x more memory, but serves the same number of customers, the cost is going to go up per customer to cover the increased hardware and power costs. Companies are starting to implement AI limits to keep costs under control.

Anthropic and OpenAI are rumored to be considering cutting inference prices trying to retain customers as LLMs commoditize and race to the bottom. It reminds me of the Chinese bike wars where bike-share companies were losing massive amounts of money, but kept running sales and lowering prices in an attempt to compete and drive out their competitors. The end of that story was a bunch of major bankruptcies and giant bike graveyards.

Nvidia's hard pivot to "in the near future, everyone will run their AI at home" seems to indicate that they also see the market shifting. We've already had AI ingest everything out there. The real challenge becomes how to better optimize their algorithm to get more useful data in less space.

yieldcrv 22 hours ago|||

have you seen the open source LLM space? people fulfill all niches and there are active communities at every range of RAM and all are looking for the most capable in their respective range

a lot of innovation occurring

scosman 21 hours ago|||

It's not economic to run them locally. It's amazing for privacy, and a fun hobby. But you're either looking at super slow CPU builds with $10k in RAM, $90k worth of GPUs, or a really quantized model that doesn't compare in quality.

I might build one for fun, but it's not going to change the economics alone. Still exciting it's possible.

oceanplexian 8 hours ago|||

It depends what you’re using it for. Real time interactive Claude code session? No, it’s kind of impractical.

But if you already have agent loops dialed in (For example I have one that uses a browser testing framework), it wouldn’t really affect me at all if it crunched away at 7 tokens per second all night long.

leansensei 6 hours ago|||

Not really, you can do great things without them. I've been summarizing hundreds of documents. I've added MCP servers to my internal business tools (Elixir apps) and can chat with the Nous Hermes agent over Telegram about pending orders, inventory level, historical product prices, etc., without having to click/dick around with a web UI.

Sure, it cannot replace SOTA models for agentic coding, except for small, well-scoped refactorings. But even a model like ministral-3:8b or qwen3.5:9b is a boon for so many smaller use cases!

CamouflagedKiwi 1 day ago|||

The hardware requirements to run this locally are still very high. Seems far enough off mainstream for those companies not to be too worried yet.

stymaar 21 hours ago|||

Honestly, Qwen3.6 is already what you need for the large majority of tasks.

(I only ask Opus every 5 to 10 requests, when my local Qwen fails or when the question is too world-knowledge-specific to be worth asking a local model. That way you can easily live on Claude's cheapest plan without ever hitting the usage limit.)

notatoad 22 hours ago|||

locally on what hardware? something like the new dgx spark, ryzen halo, or mac studio will cost you ~ $4k plus whatever you pay for power. at the rate AI is currently progressing, i think you'd be optimistic to consider that as having a 2 year depreciation.

for $4k, you can get 20 months of claude max 200. i'd take claude over the hardware.

anthropic will have something to worry about when you can run a local model on your macbook that can code. but i think we're quite a ways off from that.
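The "$4k buys 20 months" comparison can be sketched as a simple break-even calculation. This is a rough sketch, not a full cost model: the $4,000 and $200 figures come from the comment, while the function name and the $40/month power cost are hypothetical values chosen only to show how power shifts the result.

```python
def break_even_months(hardware_cost: float, monthly_subscription: float,
                      monthly_power_cost: float = 0.0) -> float:
    """Months after which owning the hardware costs the same as subscribing."""
    net_monthly = monthly_subscription - monthly_power_cost
    if net_monthly <= 0:
        raise ValueError("power cost must be lower than the subscription price")
    return hardware_cost / net_monthly

print(break_even_months(4000, 200))      # 20.0
print(break_even_months(4000, 200, 40))  # 25.0 (assumed $40/month for power)
```

It ignores model quality, resale value, and any change in subscription pricing, which is what the replies argue about.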

oceanplexian 8 hours ago|||

Yeah, 20 months of Claude Max until they rugpull you. I’m spending 7-10k/month in raw token costs on Claude Max. Having an alternative is a nice insurance policy.

chatmasta 22 hours ago||||

Just a hunch, but I think the most cost effective “local” deployment method right now is renting GPU clusters by the hour and running all the inference software on them yourself. This will be cheaper than capital expenditure on hardware that will depreciate and become last-gen, and cheaper than OpenRouter pay per token.

fc417fc802 17 hours ago||||

> at the rate AI is currently progressing, i think you'd be optimistic to consider that as having a 2 year depreciation.

How so? Model capability at a fixed hardware level has been consistently (and rapidly) increasing. You might or might not be able to run state of the art 2 (or 4 or whatever) years from now but you can reasonably expect the hardware to last upwards of a decade with model performance consistently improving over that time frame.

You can get a tolerable (at least by some metrics) experience using 10 year old hardware today.

tomr75 22 hours ago||||

people who can't afford Claude max 200 are using qwen 3.6 27b for local coding assistance already

c7b 17 hours ago||||

You can get a 128GB Strix Halo for under $3k. Used to be under $2k. Even if you believe it'll be completely obsolete for AI in two years, it'll still be good for many other things. Games for at least several more years, a great home server and/or desktop almost indefinitely. Plus, we might actually reach good enough levels for some AI use cases, if we're not already there.

And never underestimate the potential for enshittification. Your local rig will only deliver better performance over time as more and more tweaks come out. With cloud services expect the opposite to happen as subsidies run out. It's entirely possible that they will intersect on a bang per buck basis within two years.

SXX 14 hours ago|||

You forget that after 2 years you're still going to have said Mac Studio, which can be sold off for 30-50% of the price.

Of course it's going to lose value faster if something magical happens with hardware manufacturing, but you'll likely get at least 25% back.

On the other side, you can't really predict how valuable Claude Max is going to be in a year, because Anthropic can further enshittify it.
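To put the resale argument in numbers, here is a small sketch under stated assumptions: the 25%, 30%, and 50% resale fractions are from this comment, the $4,000 price and $200/month plan are borrowed from the parent comment, and the function name is made up for illustration.

```python
def net_hardware_cost(price: float, resale_fraction: float) -> float:
    """Purchase price minus what the machine sells for afterwards."""
    return price * (1 - resale_fraction)

# Resale fractions from the comment; price and plan cost from the parent comment.
for fraction in (0.25, 0.30, 0.50):
    net = net_hardware_cost(4000, fraction)
    months = net / 200
    print(f"{fraction:.0%} resale: net ${net:,.0f}, about {months:.0f} months of a $200 plan")
```

With any resale value at all, the hardware pays for itself in fewer months of subscription than the sticker price suggests.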

fsuts 18 hours ago||

Why do you think they are rushing to IPO?!

storus 8 hours ago||

So a minimum of 3x RTX Pro 6000 to run 1-bit at ~76% accuracy, or a Mac Studio with 512GB of RAM to run 4-bit at ~97% accuracy.

iaw 6 hours ago|

No. Unsloth has CPU offloading. It'll be slow but it'll work even with SSD offloading.

Havoc 15 hours ago||

I bet OpenAI and Anthropic hate the timing of glm 5.2.

Kinda shows they have a head start rather than a magic moat.

achrono 7 hours ago|

Nope, GLM 5.2 is only the latest and greatest in a long line of open-weights models. There are even fully open source models that are comparable to o1-mini (OLMo), or almost-fully-open ones that are comparable to o3 (Nemotron).

I'm super grateful to the open labs (who, importantly, do not have the word 'Open' in their name), all the more so to the likes of Ai2.

There is no magic moat indeed. It is math, engineering and of course copious amounts of data (and the political maneuvering required to secure it, e.g. how most everyone has trained on Anna's Archive by this point).

jessinra98 9 hours ago|

Anyone here tried both Qwen and GLM families on the same setup and found a clear winner for one task vs the other?
