Running local models is good now

Posted by jfb 13 hours ago

Running local models is good now (vickiboykis.com)

1064 points | 439 comments | page 9

frollogaston 10 hours ago|

"Good" refers to the speed and not the quality. There's so much hype about Macs being great for LLMs, but nobody seems to be seriously using them for that because the open models are unfortunately so far behind.

drchaim 12 hours ago||

I really want to try local models, but I don't have the hardware yet. I'm probably the only one here still using a 2020 Mac Mini M1 with 8GB. :/

tennfown 10 hours ago|

I have some decent specs, but I'm stuck with an AMD graphics card, which I've been told is a non-starter.

aleksandrm 7 hours ago||

Clickbait title, because running local models is still not good now.

atulmy 9 hours ago||

This is the exact reason I'm building csuite.so. Do check it out and let me know if you need early access!

Computer0 6 hours ago||

I have 16GB of VRAM and 96GB of RAM on all my computers, and I do enjoy local models. I would not use them for coding; I have experimented with it, but it is largely a waste of time on my hardware. I love local chat with different models, however. Used this way, it is much easier to experiment with the largest models near the limit of your hardware, and I do find it somewhat useful on airplanes. I have also used local models for data classification tasks, letting them run over the weekend, and the results were acceptable.

matrix12 8 hours ago||

gemma:12b at 75% of frontier? Yeah....

Mr_Eri_Atlov 8 hours ago||

I think this is a pivotal moment for LLMs.

Gemma 4 and Qwen3.6 27B aren't perfect, yet they are such a step forward from the previous generation that it's both feasible to get stuff done locally with patience and very likely that future releases will subvert cloud capabilities entirely.

Plus, they have definite reliability advantages over cloud models that can be wiped out by a government order or lobotomized to handle traffic surges.

jmyeet 8 hours ago||

It's not "good". A more accurate description would be "sometimes useful and not far from being good". The author is using pretty small models. There have been a lot of improvements that scale in any case (e.g. MTP), but ultimately this is still hardware-limited by 3 factors:

1. Memory bandwidth

2. VRAM size, which limits the size of a model you can use effectively. Yes, you can swap, but then you're taking a performance hit

3. Raw FLOPS, including quantization.
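To make the VRAM point concrete, here is a rough sketch of how quantization changes whether a model fits. It counts the weights only; the `overhead_gb` allowance for KV cache and runtime buffers is a hypothetical placeholder, not a measured figure, and the 27B size and 32GB limit are simply numbers that come up elsewhere in this thread.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in vram_gb.

    overhead_gb is an illustrative guess; real KV cache and buffer usage
    depends on context length and the runtime.
    """
    return weight_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


for bits in (16, 8, 4):
    size = weight_gb(27, bits)
    print(f"27B at {bits}-bit: ~{size:.1f} GB, "
          f"fits in 32 GB: {fits_in_vram(27, bits, 32)}")
```

Under these assumptions a 27B model does not fit in 32GB at 16-bit but does at 8-bit or 4-bit, which is why quantization and VRAM size are so tightly linked.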

Apple is interesting here because they have a shared memory model and you can currently buy Macs with up to 128GB of RAM (previously 256/512GB on Mac Studios, both discontinued). New M5 Mac Studios are expected in Q3, but that's not guaranteed. It may take until next year.

Depending on the chip, Macs top out at ~900GB/s. A 5090 or 6000 Pro has 1800GB/s. A B100 is at like 3.2TB/s. A 5090 has, depending on how you count, 5-7x the FLOPS of an M5 Pro, so a 5090 is still better than any current Max... except for the 32GB limit.
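For a sense of why memory bandwidth matters so much, a common back-of-the-envelope estimate treats single-stream decoding of a dense model as bandwidth-bound: every generated token has to read all the weights once, so tok/s is capped at roughly bandwidth divided by weight size. The sketch below applies that rule to the bandwidth figures above. It is an upper bound under that assumption only; it ignores KV cache reads, compute limits, MoE sparsity, and speedups like MTP, and the 27B/4-bit model is just an illustrative choice.

```python
def max_decode_tok_per_s(bandwidth_gb_s: float, params_billion: float,
                         bits_per_weight: float) -> float:
    """Rough ceiling on decode speed if each token reads all weights once."""
    weights_gb = params_billion * bits_per_weight / 8  # weights only, in GB
    return bandwidth_gb_s / weights_gb


# Bandwidth figures as quoted in the comment above (approximate).
for name, bandwidth in [("Mac, ~900GB/s", 900), ("5090, 1800GB/s", 1800)]:
    ceiling = max_decode_tok_per_s(bandwidth, params_billion=27, bits_per_weight=4)
    print(f"{name}: at most ~{ceiling:.0f} tok/s for a 27B model at 4-bit")
```

Real throughput will land below these ceilings, but the ratio between devices tends to track the bandwidth ratio for models that fit in memory.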

Nvidia aggressively segments the market by limiting VRAM. The RTX 6000 Pro is basically a 5090 with slightly more CUDA cores and 96GB of VRAM instead of 32GB, for $10-11k instead of $3k.

So let's project this into the future a little. The M6 Ultra/Max may well be 1TB+/s memory bandwidth with much higher FLOPS and thus actually be competitive for larger models. A 6090 in the current market will probably still have 32GB of VRAM if I had to guess. Maybe it goes up to 48GB.

But anyway, I think we're only 2-3 years away from sub-$5000 hardware that does 100-300+ tok/s on models larger than 31B. And that's going to be a game changer.

ZionBoggan 11 hours ago||

This is actually a really insightful post!

jingw222 11 hours ago|

open source must win
