Running local models is good now

Posted by jfb 6 hours ago

Running local models is good now(vickiboykis.com)

628 points | 295 commentspage 3

richbradshaw 5 hours ago|

I’m keen to understand speed here etc etc. if I bought a Mac studio with 96GB - what can I realistically run, how’s it compare to fable/opus etc and how fast is it?

Currently maxing out two Claude code accounts every x hours when working on large code migrations or setting up new iOS apps etc - most of time it’s fine but occasionally it’s mega frustrating!

simonw 4 hours ago||

I strongly recommend trying LM Studio - it's the lowest friction way to try out models, you can browse https://lmstudio.ai/models and click "Get" and then "Run in LM Studio" to download and run a model.

With 96GB I'd start with the Gemma 4 and Qwen 3.6 models. Any of those should work fine.

AbsurdCensor 4 hours ago|||

I think currently you can only get the M3 Ultra Studio with 96gb, and for coding tasks, say you rub Qwen Coder on it (which doesn't need that much ram), it's not the fastest, something like 30-40 tok/sec. Probably better with a MacBook Pro with the M5 chip. There is a website for comparing different configurations and models: https://llmcheck.net/benchmarks

pizza234 5 hours ago||

[dead]

b3ing 2 hours ago||

They are ok for simple stuff, coding is weak, chat is alright, writing is ok. But I had many of them write stories for ideas and they kept using the same names regardless of what the story was about. I can’t complain, it’s free. Can’t wait till they get even better, but for local image generation they are good, slow but just create a bunch in the background while you do other things otherwise it’s like 14.4k modems

MrKoby07 2 hours ago||

I think a lot of people just don't have specs like that, making it still painful.

ltononro 4 hours ago||

Good depends a lot. If you are in the token maxxing hype you will probably find these models very bad comparing to SOTA, unfortunately.

The good news might be: opensource models are now good (enough) for day2day usage. But is it really? I feel that companies will always naturally strive for the best and use the SOTA (as long it is not too expensive).

I see OSS models being a good backbone for companies in the future that have validated workflows and could use those for privacy or to spare costs.

IDK, might have gone a little bit off-topic here.

jlengrand 2 hours ago||

Just wanna say it's always fun and nostalgic to see authors pass by here who I was reading back when I started my career. I was reading Vicki's blogs way back, even remember learning some email parsing in python from her over 10 years ago. TY!

abalashov 3 hours ago||

And if you want to dial in a setting in between: I've switched to Kimi K2.6 (now K2.7) and DeepSeek through OpenRouter and Reasonix for pretty much everything, with no discernible loss of analytical quality or utility.

However, like many commenters, I don't really believe in vibe-coding, long-horizon agentic one-shot agentic coding, etc. and do not use LLMs for huge generation tasks that involve designing things end-to-end.

I also have an MBP with 128 GB of unified memory and do quite a bit of Qwen3.6-35B-A3B. No, it's not as smart as the aforementioned models, to say nothing of frontier, but many people seem pleasantly shocked by the number of banal tasks that do not require these.

jszymborski 2 hours ago||

I run local models and they work fine for me, but specifically for use in coding harnesses, I'm having a hard time. Tools tend to end up in the same loop, trying to `ls` the same folder or `grep` the same file, over and over and eating up the whole context. Super hard to get it to do anything but that. Any tips?

huydotnet 3 hours ago||

I love that local LLMs are being discussed more often on HN recently. But for the post, I find it strange that the author claimed they were working with local models from day 1, but wrote a post that still links to Qwen2.5 and Qwen3 in mid June 2026.

simonw 4 hours ago||

I think gemma-4-26b-a4b and Qwen3.6-35B-A3B show that there's something very interesting about a local model that does mixture-of-experts (which helps a lot with performance) and has in the order of 30 billion parameters.

These models are very capable, and use around 20-30GB of RAM while they are running.

Provided you have 64GB of RAM that leaves space for running other applications at the same time.

chrisweekly 4 hours ago|

Obtaining that 64GB RAM is a meaningful obstacle for many.

simonw 4 hours ago|||

I'm still amazed that you can run LLMs of this quality on a machine that costs less than $3,000.

I used to assume that anything GPT-4 equivalent or higher would need $30,000+ of server-class hardware.

That said... gemma-4-12b-qat is 7.15GB on disk so should run reasonably well in 16GB, that takes it down to MacBook Air territory https://lmstudio.ai/models/google/gemma-4-12b-qat

frollogaston 1 hour ago|||

Not just RAM, VRAM, right? Though they're one and the same on the Mac.

andix 2 hours ago|

Because I've seen too many people spending a lot of money on expensive hardware, without really using it in the end:

Most of those models are also available via Openrouter and many other platforms. Dirt cheap, and much faster than on consumer GPUs. Perfect to try and compare the different options.

More comments...