Posted by mpweiher 1 day ago

A guide to local coding models (www.aiforswes.com)
581 points | 341 comments | page 2
cloudhead 1 day ago|
In my experience the latest models (Opus 4.5, GPT 5.2) are _just_ starting to keep up with the problems I'm throwing at them, and I really wish they did a better job, so I think we're still 1-2 years away from local models not wasting developer time outside of CRUD web apps.
OptionOfT 1 day ago|
Eh, these things are trained on existing data. The further you are from that the worse the models get.

I've noticed that I need to be a lot more specific in those cases, up to the point where being more specific is slowing me down, partially because I don't always know what the right thing is.

cloudhead 1 day ago||
For sure, and I guess that's kind of my point -- if the OP says local coding models are now good enough, then it's probably because he's using things that are towards the middle of the distribution.
dkdcio 1 day ago||
similar for me —- also how do you get the proper double dashes —- anyway, I’d love to be able to run CLI agents fully local, but I don’t see it being good enough (relative to what you can get for pretty cheap from SOTA models) anytime soon
cloudhead 1 day ago||
What’s wrong with your keyboard haha
dkdcio 1 day ago||
iphone :/ I see others with the same problem too, oh well, at least people won’t accuse me of being an LLM probably
d4rkp4ttern 1 day ago||
I recently found myself wanting to use Claude Code and Codex-CLI with local LLMs on my MacBook Pro M1 Max 64GB. This setup can make sense for cost/privacy reasons and for non-coding tasks like writing, summarization, q/a with your private notes etc.

I found the instructions for this scattered all over the place so I put together this guide to using Claude-Code/Codex-CLI with Qwen3-30B-A3B, 80B-A3B, Nemotron-Nano and GPT-OSS spun up with Llama-server:

https://github.com/pchalasani/claude-code-tools/blob/main/do...

Llama.cpp recently started supporting Anthropic’s messages API for some models, which makes it really straightforward to use Claude Code with these LLMs, without having to resort to, say, Claude-Code-Router (an excellent library): you just set ANTHROPIC_BASE_URL.
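
For anyone wanting to sanity-check the endpoint before pointing Claude Code at it, a minimal smoke test looks roughly like this (a sketch only; the port, model name, and api-key value are placeholders that depend on how you launched llama-server):

    # Smoke test against a local llama-server exposing Anthropic's messages API.
    # Assumptions: server on localhost:8080; model name is whatever you loaded.
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/messages",
        headers={
            "content-type": "application/json",
            "x-api-key": "not-needed-locally",   # real Anthropic requires this
            "anthropic-version": "2023-06-01",
        },
        json={
            "model": "qwen3-30b-a3b",            # placeholder model name
            "max_tokens": 256,
            "messages": [{"role": "user", "content": "Say hello in one line."}],
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["content"][0]["text"])

Once that round-trips, Claude Code itself only needs ANTHROPIC_BASE_URL=http://localhost:8080 set in its environment.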

andix 1 day ago||
I wouldn't run local models on the development PC. Instead run them on a box in another room or another location. Less fan noise and it won't influence the performance of the pc you're working on.

Latency is not an issue at all for LLMs, even a few hundred ms won't matter.

Running them on the development PC itself doesn't make a lot of sense to me, except when working offline while traveling.

snoman 1 day ago|
Less of a concern these days with hardware like a Mac Studio or Nvidia dgx which are accessible and aren’t noisy at all.
andix 21 hours ago||
I'm not fully convinced that those devices don't create noise at full power. But one issue still remains: LLMs eating up compute on the device you're working on. This will always be noticeable.
embedding-shape 21 hours ago||
> because GPT-OSS frequently gave me “I cannot fulfill this request” responses when I asked it to build features.

This is something that frequently comes up and whenever I ask people to share the full prompts, I'm never able to reproduce this locally. I'm running GPT-OSS-120B with the "native" weights in MXFP4, and I've only seen "I cannot fulfill this request" when I actually expect it, not even once had that happen for a "normal" request you expect to have a proper response for.

Has anyone else come across this when not using the lower quantizations or the 20B (so GPT-OSS-120B proper in MXFP4), and could you share the exact developer/system/user prompt that triggered this issue?

Just like at launch, from my point of view, this seems to be a myth that keeps propagating, and no one can demonstrate an innocent prompt that actually triggers this issue on the weights OpenAI themselves published. The author here does seem to have hit the issue, but again there are no examples of actual prompts, so it's still impossible to reproduce.
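
For anyone who does hit it, something like this is enough to capture a shareable repro (a sketch; it assumes llama-server's OpenAI-compatible endpoint on localhost:8080 and a placeholder model name):

    # Minimal harness for capturing the exact prompt that triggers a refusal.
    # Assumes a local OpenAI-compatible server (e.g. llama-server) on :8080.
    import json
    import requests

    payload = {
        "model": "gpt-oss-120b",   # placeholder; use the name your server reports
        "messages": [
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": "Add a CSV export feature to this Flask app."},
        ],
    }
    resp = requests.post("http://localhost:8080/v1/chat/completions",
                         json=payload, timeout=300)
    reply = resp.json()["choices"][0]["message"]["content"]
    print(reply)

    # If the reply is a refusal, the payload above *is* the repro to share.
    if "cannot" in reply.lower():
        print(json.dumps(payload, indent=2))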

nzeid 1 day ago||
I appreciate the author's modesty, but the flip-flopping was a little confusing. If I'm not mistaken, the conclusion is that by "self-hosting" you save money in all cases, but you give up quality in the scenarios that call for hardware that's impractical to cobble together at home or in a laptop.

I am still toying with the notion of assembling an LLM tower with a few old GPUs but I don't use LLMs enough at the moment to justify it.

a_victorp 1 day ago|
If you ever do it, please make a guide! I've been toying with the same notion myself
suprjami 1 day ago|||
If you want to do it cheap, get a desktop motherboard with two PCIe slots and two GPUs.

Cheap tier is dual 3060 12G. Runs 24B Q6 and 32B Q4 at 16 tok/sec. The limitation is VRAM for large context: 1000 lines of code is ~20k tokens, and a 32k-token context is ~10G of VRAM.
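
Back-of-the-envelope for that last number (a sketch; the layer/head counts are assumptions for a typical 32B dense model with grouped-query attention, so check the actual model card):

    # Rough fp16 KV-cache size for a 32B-class GQA model at 32k context.
    # Assumed dims: 64 layers, 8 KV heads, head_dim 128.
    n_layers, n_kv_heads, head_dim = 64, 8, 128
    ctx, bytes_per_val = 32_768, 2                 # fp16 K and V entries

    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_val
    print(f"{kv_bytes / 2**30:.1f} GiB")           # 8.0 GiB, plus compute buffers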

Expensive tier is dual 3090 or 4090 or 5090. You'd be able to run 32B Q8 with large context, or a 70B Q6.

For software, llama.cpp and llama-swap. GGUF models from HuggingFace. It just works.

If you need more than that, you're into enterprise hardware with 4+ PCIe slots, which costs as much as a car and has the power consumption of a small country. You're better off just paying for Claude Code.

le-mark 1 day ago||
I was going to post snark such as “you could use the same hardware to also lose money mining crypto”, then realized there are a lot of crypto miners out there that could probably make more money running tokens than they do on crypto. Does such a marketplace exist?
hackstack 1 day ago||
This is essentially vast.ai, no?
MrDrMcCoy 1 day ago|||
A quick glance at their homepage says they run in "secure datacenters", so no.
gkbrk 1 day ago||
Then you glanced too quickly, vast.ai absolutely has non-datacenter GPUs.

https://vast.ai/hosting#gpu-farms-homelabs

MrDrMcCoy 3 hours ago||
Very interesting, thanks! Definitely something to consider for my environment.
whitehexagon 21 hours ago||||
SimonW used to have more articles/guides on local LLM setup, at least until he got the big toys to play with, but his site is well worth looking through. Although if you are in parts of Europe, the site is blocked at weekends, something to do with the great firewall of streamed sports.

https://simonwillison.net/

Indeed, his self-hosting inspired me to get Qwen3:32B working locally with ollama. It fits nicely on my M1 Pro 32GB (running Asahi). Output is a nice read-along speed and I haven't felt the need for anything more powerful.

I'd be more tempted by a maxed out M2 Ultra as an upgrade, vs a tower with dedicated GPU cards. The unified memory just feels right for this task. Although I noticed the second-hand value of those machines jumped massively in the last few months.

I know that people turn their noses up at local LLMs, but it more than does the job for me. Plus I decided on a New Year's resolution of no more subscriptions / Big-AdTech freebies.

satvikpendem 1 day ago|||
Jeff Geerling has (not quite but sort of) guides: https://news.ycombinator.com/item?id=46338016
a96 20 hours ago||
Also worth a look is the stuff from Donato Capitella: https://github.com/kyuz0 https://www.youtube.com/@donatocapitella https://llm-chronicles.com/ etc.
amarant 1 day ago||
Buying a maxed out MacBook Pro seems like the most expensive way to go about getting the necessary compute. Apple is notorious for overcharging for hardware, especially on RAM.

I bet you could build a stationary tower for half the price with comparable hardware specs. And unless I'm missing something you should be able to run these things on Linux.

Getting a maxed out non-apple laptop will also be cheaper for comparable hardware, if portability is important to you.

kube-system 1 day ago||
You need memory hooked up to the GPU. Apple’s unified memory is actually one of the cheaper ways to do this. On a typical x86-64 desktop, that means VRAM, and for 100+ GB of VRAM you're deep into tens of thousands of dollars.

Also, if you think Apple’s RAM prices are crazy, you might be surprised at current DDR5 pricing. The $800 that Apple charges to upgrade a MBP from 64GB to 128GB is the current price of 64GB of desktop DDR5-6000, which is actually slower memory than the 8533 MT/s you're getting in the MacBook.
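
The reason the MT/s figure matters: single-stream decode speed is roughly memory bandwidth divided by the bytes streamed per token. A rough sketch, with bus widths and model size assumed purely for illustration (128-bit dual-channel desktop vs a 512-bit unified-memory SoC):

    # Decode tokens/sec is bounded by bandwidth / model bytes read per token.
    # Bus widths and model size below are illustrative assumptions.
    def bandwidth_gbs(mt_per_s, bus_bits):
        return mt_per_s * (bus_bits / 8) / 1000    # GB/s

    desktop = bandwidth_gbs(6000, 128)    # dual-channel DDR5-6000 -> ~96 GB/s
    unified = bandwidth_gbs(8533, 512)    # wide LPDDR5X SoC       -> ~546 GB/s

    model_gb = 35                         # e.g. a ~70B model at ~4-bit
    for name, bw in [("desktop DDR5", desktop), ("unified memory", unified)]:
        print(f"{name}: ~{bw:.0f} GB/s -> ~{bw / model_gb:.1f} tok/s ceiling")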

nl 1 day ago|||
You want unified RAM.

On Linux your options are the NVidia Spark (and other vendor versions) or the AMD Ryzen AI series.

These are good options, but there are significant trade-offs. I don't think there are Ryzen AI laptops with 128GB RAM for example, and they are pricey compared to traditional PCs.

You also have limited upgradeability anyway - the RAM is soldered.

Renaud 1 day ago||
Can any x86-based system actually come with that much unified memory?

Not an Apple fanboy, but I was under the impression that having access to up to 512GB usable GPU memory was the main feature in favour of the mac.

And now with Exo, you can even break the 512GB barrier.

SamDc73 1 day ago||
If privacy is your top priority, then sure spend a few grand on hardware and run everything locally.

Personally, I run a few local models (around 30B params is the ceiling on my hardware at 8k context), and I still keep a $200 ChatGPT subscription because I'm not spending $5-6k just to run models like K2 or GLM-4.6 (they’re usable, but clearly behind OpenAI, Claude, or Gemini for my workflow).

I got excited about aescoder-4b (a model that specializes in web design only) after its DesignArena benchmarks, but it falls apart on large codebases and is mediocre at Tailwind.

That said, I think there’s real potential in small, highly specialized models, like a 4B model trained only for FastAPI, Tailwind, or a single framework. Until that actually exists and works well, I’m sticking with remote services.

eblanshey 1 day ago|
What hardware can you buy for $5k to be able to run K2? That's a huge model.
SamDc73 1 day ago||
This older HN thread shows R1 running on a ~$2k box using ~512 GB of system RAM, no GPU, at ~3.5-4.25 TPS: https://news.ycombinator.com/item?id=42897205

If you scale that setup and add a couple of used RTX 3090s with heavy memory offloading, you can technically run something in the K2 class.
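
The offloading part itself is mundane in llama.cpp land; with llama-cpp-python it's basically one layer-count knob (a sketch only; the path, quant, and layer split are placeholders, and a K2-class MoE needs a very aggressive quant to fit at all):

    # Partial offload: keep as many layers as fit across the 3090s,
    # stream the rest from system RAM. Path and counts are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="/models/kimi-k2-q2.gguf",  # hypothetical low-bit GGUF
        n_gpu_layers=20,      # however many layers fit in 2x24 GB of VRAM
        n_ctx=8192,
    )
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Write a haiku about RAM."}],
        max_tokens=64,
    )
    print(out["choices"][0]["message"]["content"])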

nl 1 day ago|||
Is 4 TPS actually useful for anything?

That's around 350,000 tokens in a day. I don't track my Claude/Codex usage, but Kilocode with the free Grok model does and I'm using between 3.3M and 50M tokens in a day (plus additional usage in Claude + Codex + Mistral Vibe + Amp Coder)

I'm trying to imagine a use case where I'd want this. Maybe running some small coding task overnight? But it just doesn't seem very useful.

SamDc73 16 hours ago|||
I only run small models (70B on my hardware gets me around 10-20 tok/s) for just random things (personal-assistant kind of stuff), but not for coding tasks.

For coding-related tasks I consume 30-80M tokens per day and I want something as fast as possible.

zarzavat 1 day ago|||
3.5-50M tokens a day? What are you doing with all those tokens?

Yesterday I asked Claude to write one function. I didn't ask it to do anything else because it wouldn't have been helpful.

nl 12 hours ago|||
https://github.com/nlothian/Vibe-Prolog chews a lot of tokens.

Have a bunch of other side projects as well as my day job.

It's pretty easy to get through lots of tokens.

KronisLV 22 hours ago|||
Here’s my own stats, for comparison: https://news.ycombinator.com/item?id=46216192

Essentially migrating codebases, implementing features, as well as all of the referencing of existing code and writing tests and various automation scripts that are needed to ensure that the code changes are okay. Over 95% of those tokens are reads, since often there’s a need for a lot of consistency and iteration.

It works pretty well if you’re not limited by a tight budget.

BoredPositron 1 day ago|||
Stop recommending 3090s; they are all but obsolete now. Not having native bf16 is a showstopper.
qayxc 1 day ago|||
Hard disagree. The difference in performance is not something you'll notice if you actually use these cards. In AI benchmarks, the RTX 3090 beats the RTX 4080 SUPER despite the latter having native BF16 support; memory bandwidth (736 GB/s on the 4080 vs 936 GB/s on the 3090) plays a major role. Additionally, the 3090 is not only the last NVIDIA consumer card to support SLI.

It's also unbeatable in price to performance, as the next best 24GiB card would be the 4090, which even used is almost triple the price these days while only offering about 25%-30% more performance in real-world AI workloads.

You can basically get an SLI-linked dual 3090 setup for less money than a single used 4090 and get about the same or even more performance and double the available VRAM.

BoredPositron 1 day ago||
If you run fp32, maybe, but no sane person does that. The tensor performance of the 3090 is also abysmal. If you run bf16 or fp8, stay away from obsolete cards. It's barely usable for LLMs and borderline garbage tier for video and image gen.
qayxc 22 hours ago||
Actual benchmarks show otherwise.

> The tensor performance of the 3090 is also abysmal.

I for one compared my 50-series card's performance to my 3090 and didn't see "abysmal performance" on the older card at all. In fact, in actual real-world use (quantised models only, no one runs big fp32 models locally), the difference in performance isn't very noticeable at all. But I'm sure you'll be able to provide actual numbers (TTFT, TPS) to prove me wrong. I don't use diffusion models, so there might be a substantial difference there (I doubt it, though), but for LLMs I can tell you for a fact that you're just wrong.

BoredPositron 21 hours ago|||
To be clear, we are not discussing small toy models, but to be fair I also don't use consumer cards. Benchmarks are out there (Phoronix, RunPod, Hugging Face, or Nvidia's own presentations) and they show at least 2x on high precision and nearly 4x on low precision, which is comparable to the uplift I see on my 6000 cards. If you don't see the performance uplift everyone else sees, there is something wrong with your setup and I don't have the time to debug it.
qayxc 15 hours ago||
> To be clear, we are not discussing small toy models but to be fair I also don't use consumer cards.

> if you don't see the performance uplift everyone else sees there is something wrong with your setup and I don't have the time to debug it.

Read these two statements and think about what might be the issue. I only run what you call "toy models" (good enough for my purposes), so of course your experience is fundamentally different from mine. Spending 5 figures on hardware just to run models locally is usually a bad investment. Repurposing old hardware OTOH is just fine to play with local models and optimise them for specific applications and workflows.

SamDc73 16 hours ago|||
Even with something like a 5090, I’d still run Q4_K_S/Q4_K_M because they’re far more resource-efficient for inference.

Also, the 3090 supports NVLink, which is actually more useful for inference speed than native BF16 support.

Maybe bf16 matters if you're training?

BoredPositron 15 hours ago||
That's a smart thing to do, considering a 5090 has native tensor-core support for 4-bit precision...
maranas 1 day ago||
Cline + RooCode with VSCode already work really well with local models like qwen3-coder or even the latest gpt-oss. It is not as plug-and-play as Claude, but it gets you to a point where you only have to do the last 5% of the work.
rynn 1 day ago|
What are you working on that you’ve had such great success with gpt-oss?

I didn’t try it long because I got frustrated waiting for it to spit out wrong answers.

But I’m open to trying again.

embedding-shape 21 hours ago|||
> What are you working on that you’ve had such great success with gpt-oss?

I'm doing programming on/off (mostly use Codex with hosted models) with GPT-OSS-120B, and with reasoning_effort set to high, it gets it right maybe 95% of the times, rarely does it get anything wrong.
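
For reference, the knob I mean is the standard reasoning-effort field in the request; whether a local server maps it onto gpt-oss's reasoning level (vs. needing it in the system prompt) is server-dependent, so treat this as a sketch:

    # Assumes a local OpenAI-compatible server on :8080 serving GPT-OSS-120B;
    # whether it honors reasoning_effort is server-dependent (an assumption here).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")
    resp = client.chat.completions.create(
        model="gpt-oss-120b",
        reasoning_effort="high",
        messages=[{"role": "user", "content": "Refactor this loop into a list comprehension: ..."}],
    )
    print(resp.choices[0].message.content)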

maranas 1 day ago|||
I use it to build some side-projects, mostly apps for mobile devices. It is really good with Swift for some reason.

I also use it to start off MVP projects that involve both frontend and API development but you have to be super verbose, unlike when using Claude. The context window is also small, so you need to know how to break it up in parts that you can put together on your own

throw-12-16 1 day ago||
I never see devs containerize their coding agents.

It seems so obvious to me, but I guess people are happy with claude living in their home directory and slurping up secrets.

onion2k 1 day ago|
The devs I work with don't put secrets in their home directories. ;)
rester324 2 hours ago|||
How do you know? Do you snoop on their work machines?
throw-12-16 1 day ago||||
many many tools default to this, claude included
littlestymaar 1 day ago|||
And where is all their software putting its data then? Unless you consider only private keys to be secrets…

(In particular the fact that Claude Code has access to your Anthropic API key is ironic given that Dario and Anthropic spend a lot of time fearmongering about how the AI could go rogue and “attempt to escape”).

ineedasername 1 day ago|
I’ve been using Qwen3 Coder 30B quantized down to IQ3_XXS to fit in < 16GB VRAM. Blazing fast, 200+ tokens per second on a 4080. I don’t ask it anything complicated, but one-off scripts to do something I’d normally have to do manually by hand or take an hour to write myself? Absolutely.

These are no more than a few dozen lines I can easily eyeball and verify with confidence. That’s done in under 60 seconds and leaves Claude Code with plenty of quota for significant tasks.
