Running local models is good now

Posted by jfb 4 hours ago

Running local models is good now(vickiboykis.com)

495 points | 244 commentspage 2

chrismarlow9 3 hours ago|

You can use a frontier model to create a plan that's specific enough for a local model of a very small size to execute on. The more specific you are and compartmentalize tasks the "dumber" the local model can be.

Edit: Obviously you'll be using more tokens but this is the trade off for running a smaller model and running locally. Similar to time memory trade off but in token economics. Sorry I need more coffee

andix 1 hour ago||

Because I've seen too many people spending a lot of money on expensive hardware, without really using it in the end:

Most of those models are also available via Openrouter and many other platforms. Dirt cheap, and much faster than on consumer GPUs. Perfect to try and compare the different options.

pjmlp 1 hour ago||

Only if blessed with enough RAM and disk space,

> 64 GB RAM and 1TB storage

Ah ok, not something regular joe and jane happen to have lying around at home.

Additionally the whole configuration is still very much low level, bunch of CLI commands, and if the model doesn't fit for the task at hand, it starts allucinating, generating gibberish, whatever.

sparkling 8 minutes ago|

Even if i had such a machine, im not sure i would be willing to sacrifice 80% of my RAM and 50% of my disk to run a semi-okay model locally.

aquarious_ 1 hour ago||

I support local models and enjoy playing around with them, but even for personally development it is just more viable for me to pay $200 a month to Anthropic for the latest models. It seems to me with the cost of hardware needed to run local models that, for now, it is pure hobbyist and exploratory (which is fun in its own right)

segmondy 3 hours ago||

It's more than good. As of today, it's great. Those models listed in the blog are horrible compared to what you can run today, There's absolutely no reason to run those, you have Qwen3.6, Gemma4, and plenty other sized comparable models.

If you're resourceful, you can even run SOTA models. KimiK2.7, MiMo-V2.5/V2.5-Pro, MiniMax2.5/2.7/3, DeepSeekV3.1/v3.2/V4-Flash/V4Pro, GLM5.1, Step3.7-Flash, Qwen3.5-397B, Qwen3.5-122B, gpt-oss-120B

0xc0c0c0 3 hours ago||

I have used local models (around 128 gb) and the big proprietary models, and while I do want local models to win, it's important we keep the expectations of local models realistic. There are many blog posts about how local models today can fully replace some of the proprietary models and in some cases its true for the much smaller proprietary models, its very clearly much more behind the larger models.

You can be far more ambiguous with your tasks with the larger proprietary models as opposed to the local models. You can achieve the similar results with local models but you need to be much more detailed in your prompt.

One of the biggest things about running these local models is that the harness matters almost just as much as the model too. Codex is optimized for GPT models, CC is optimized for Claude, Cursor has a great harness that works very well across these providers. It took me a couple of iterations of the different harnesses to find one that would work well with the smaller Qwen models to do local coding.

failbuffer 3 hours ago|

So which harness did you end up choosing?

ngxson 2 hours ago||

My 2c: I think the "cloud vs local" debate is (maybe) a false dichotomy. In my experience, I use a hybrid approach and I've seen a huge productivity boost from it.

The cloud-based models are fine for big and complex tasks, but the pricing is ridiculous for small stuff—like summarizing a discussion or fixing a small bug. And cloud and privacy have never been a good match.

As an example, this comment itself was written with the help of Qwen3.5-4B running locally with an extension on top of llama.cpp default web UI [1]. The extension injects my browser's context directly into the conversation, which allows me to summarize things and draft up comments quickly. Speed is pretty acceptable for the size: ~5s TTFT and ~100 t/s generation, all running on a Macbook M5.

And when I want to run bigger tasks, I don't just stick to one provider. Apart from well-known closed-weight providers like OpenAI or Anthropic, I also experiment with open-weight models like GLM-5.1, DeepSeek V4, and Qwen3.6-27B, which provide quite good results for the price.

I'd argue both have value, and I don't see why anyone needs to choose one exclusively. Anyone else doing this?

[1]: https://github.com/ngxson/llama-companion

phainopepla2 2 hours ago|

Why not just use DS V4 Flash for the small stuff? Very fast and extremely cheap.

ngxson 2 hours ago||

The dsv4 flash is 158B params in total. It is possible to run locally but will require all my system RAM.

Also, a lot of my day-to-day tasks perform the same on both small and bigger models: summarize a web page, draft a response, translations, quick web search, etc.

phainopepla2 1 hour ago||

Sorry, I meant non-locally.

I'm assuming privacy is not a concern since you mentioned using Deepseek already. The cost of V4 Flash for small tasks is so minuscule as to be almost free, and you don't have to deal with a churning laptop (or even buying a high-end laptop, for someone who doesn't already have one).

I guess what I'm really asking is, what's the advantage of using these small local models if privacy isn't a concern?

ngxson 1 hour ago||

I do use both DSv4 the "normal" and the flash variant, non-locally. It works well, not exceptionally. And while it's cheap, I'd say that the difference between $1 per month vs $5 per month is not a big concern to me. IMO pricing is pretty competitive among open-weight models: https://huggingface.co/inference/models

Depending on use cases, but for me I found 2 use cases where a local model is a must and not optional:

- Running offline without internet access: for example, I have this project that allow transcribe and summarize audio in real time. I already used it in some events where wifi is not available: https://github.com/ngxson/llama.cpp-realtime-audio-recap

- Handle private personal data, for example health records. This is the same category of "privacy" that you mentioned, but I just want to bring up the fact that people value their privacy differently.

ridruejo 1 hour ago||

Local models are one of the main drivers for our installer / Desktop app for OpenClaw https://holaclaw.ai (disclaimer I am one of the founders). The smaller models are really only suitable for the most basic tasks, but if you have 32gb-64gb you can get real work done (ie complex web workflows) without third party hosted models

bthornbury 1 hour ago||

the qwopus 27b model is good for grunt work style tasks, even across multiple files. Piping a bunch of things through, small factoring changes, stuff that just takes time to type out.

I wouldn't rely on it for large stuff like codex though. I haven't tried out deepseek/kimi, if we could run those locally it would be great.

_doctor_love 3 hours ago|

"Just get a 64GB Mac with 1TB of storage!"

LOL - some of us have a budget

swatcoder 3 hours ago||

Sure, but it's also not really out of scale with the cost of a shop tool in other trades.

If you're a professional that's confident in a positive return on the investment (optimal or not), or just a hobbyist with the luxury budget for a "shop" that cost is well within norms.

That's not everybody, of course, but it's not some inconceivable fantasy. A lot of people in the tech community here on HN, specifically, end up with pretty high discretionary budgets that they pour into stuff like this.

frollogaston 40 minutes ago||

But you can get that return from a paid service too, in fact it'll be better. So just comparing costs, what's the annualized ROI on the Mac Studio assuming it means you avoid paying $240/y for Claude? Cause I can always set aside the Mac's price in some investments and pay for Claude out of that.

swatcoder 16 minutes ago||

Same with many and their shop tools in other trades.

Most hobbyists and many professionals could end up far ahead financially by leveraging makerspaces, tool rentals, and co-op shops or even by hiring out a professional to prep certain intermediates for them, but they get psychological value -- as well as flexibility, reliability, and resale opportunity -- from having their own well-outfitted shop.

And they can afford that premium, so they do. At the scale of individuals and small shops, not everything that matters gets captured in financial models.

frollogaston 8 minutes ago||

Yeah but the local model doesn't have those advantages for the coding use cases, at least not yet. In theory you could post-train one on your codebase or something, but nobody cares to do that when any vanilla coding agent service can read and understand the whole thing. I was already being very generous towards the Mac in pretending it does the same thing as the paid service.

Aside, physical tools tend to be financially advantageous to own if you're going to use them a lot. Even if the owner were targeting 0 profit, they'd have to charge more to factor in the cost of dealing with customers and increased risk of wear/damage by users who don't care as much.

amalcon 3 hours ago|||

A Strix Halo with similar RAM is considerably cheaper. Still not cheap, mind, but performance is OK (not great) and it will run more or less the same models.

AbsurdCensor 3 hours ago||

At least for me, it's been pretty great, but I bought my system when it was $1800, now looks like the same system is $2700 and out of stock. I still haven't quite been able to run 120B parameter models under Windows, but for Qwen Coder 30B, it works pretty darn well for my at home needs.

amalcon 3 hours ago||

Yeah, they have gone up a lot since I bought mine too. I did get Qwen3.5-122b running on all-GPU (on a 128GB machine) under a minimal Arch Linux setup (I do my GUI work on a much cheaper box). It worked, but Qwen3.6-35b is performing almost as well and a lot faster.

Still cheaper than a new Mac. Maybe not cheaper than a used one.

AbsurdCensor 1 hour ago||

I've certainly thought about just moving the box to Linux, but it took far to long personally to get everything running under AMD and it works 'well enough' that I don't want to make the switch. I tried playing with GAIA on it, felt a bit limited, and now have Hermes up and running, and that seems to work quite well. All the tools are changing so quickly, it's sometimes difficult to settle in on 'what's best', so I certainly can understand folks that just want to pay for a AI subscription and be done with it.

techscruggs 3 hours ago|||

He is using a 2022 M2, which you can get that for about $2k used. That is beyond reasonable.

Shekelphile 3 hours ago|||

She

psychoslave 3 hours ago|||

Global Affordability Estimate:

Top 10% of global earners (~800M people) can afford a $2,000 device without major financial strain.

Top 25% (~2B people) could afford it with some budget adjustments.

Bottom 50% (~4B people) would find it prohibitively expensive.

So for a SV top income, maybe that might look more like the weekly pet brushing budget, but for most people out there this is not that much of a no-brainer.

frollogaston 34 minutes ago||||

Bottom 50% aren't paying for Claude either, probably also don't own PCs or write code

disgruntledphd2 3 hours ago||||

The maths changes if you're working for yourself. Because I live in Europe, I've ended up working as a contractor due to the lack of a legal entity in my country. While that mostly sucked for a bunch of reasons, I was able to get a 64Gb Mac M2 a few years back with approximately a 52% discount, which was kinda nice.

weego 3 hours ago||

If you're working for yourself paying monthly is exactly the same as amortising an asset. Personally I'd rather my business just pay $100 a month than have to deal with additional hardware and software maintenance while using a depreciating asset that is break-even after 3-5 years depending on the spec.

richwater 3 hours ago|||

Yes, because the bottom 50%, mostly impoverished or near impoverished folks were spending money on Claude Code subscriptions instead /s

p-e-w 3 hours ago|||

No need. You can run the Gemma 4 and Qwen3.5 MoE models with as little as 12 GB of VRAM at 30-40 tps (Q4/Q5), and they both blow GPT-4o and DeepSeek R1 out of the water.

tjwebbnorfolk 3 hours ago|||

AI and budgets don't mix well at the moment

themythfable 3 hours ago|||

Yeah, I never had a computer that cost north of $800 until recently. While that is far from the typical HN user's budget, my bet is that it is much closer to average.

Besides those with effectively unlimited budgets for their personal compute, local models are still a long ways off.

Though, that shouldn't be conflated with the value of open-source models, which can be used by cloud providers to significantly reduce cost of intelligence.

embedding-shape 3 hours ago|||

> Yeah, I never had a computer that cost north of $800 until recently. While that is far from the typical HN user's budget, my bet is that it is much closer to average.

There are segments, everything from "Average person in world" to "Average creative professional using computers for work" and more on HN, with a wide range of costs for the hardware. HN probably skews towards the latter rather than the former, probably sitting with enterprise hardware next to them basically for fun, hard to make wider conclusions from what people here have or not.

sublinear 3 hours ago|||

If we define "typical" as the median HN budget, it's probably about the same as yours. Maybe the answer would have been different 10 or 20 years ago, but the era of truly needing a big budget PC has been over for a while.

It's just for gaming and AI now. Maybe not even gaming as much anymore.

Consider the perspective of someone who has a practically unlimited budget for PCs, doesn't game much anymore, and doesn't need AI to do their job. It's just part of getting older, and there are plenty of people in their late 30s and older on here.

anarticle 3 hours ago|||

Pros buy their own tools. This is why working for yourself is better than working for a corpo, you get to choose your weapon.

dofm 3 hours ago||

[dead]

More comments...