Posted by mpweiher 1 day ago
I've noticed that I need to be a lot more specific in those cases, up to the point where being more specific is slowing me down, partially because I don't always know what the right thing is.
I found the instructions for this scattered all over the place so I put together this guide to using Claude-Code/Codex-CLI with Qwen3-30B-A3B, 80B-A3B, Nemotron-Nano and GPT-OSS spun up with Llama-server:
https://github.com/pchalasani/claude-code-tools/blob/main/do...
Llama.cpp recently started supporting Anthropic's Messages API for some models, which makes it really straightforward to use Claude Code with these LLMs without having to resort to, say, Claude-Code-Router (an excellent library): you just set ANTHROPIC_BASE_URL.
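For anyone who wants to sanity-check the endpoint before pointing Claude Code at it, here's a minimal sketch (mine, not from the linked guide): talk to a local llama-server through the Anthropic SDK, using the same ANTHROPIC_BASE_URL variable Claude Code reads. The port, API key and model name are placeholders for whatever you launched the server with.

    # Sketch: point the Anthropic SDK at a local llama-server instead of
    # api.anthropic.com. Port and model name are placeholders.
    import os
    from anthropic import Anthropic

    os.environ["ANTHROPIC_BASE_URL"] = "http://127.0.0.1:8080"  # same variable Claude Code uses

    client = Anthropic(
        base_url=os.environ["ANTHROPIC_BASE_URL"],
        api_key="sk-local",  # a local server typically doesn't check this, but the SDK wants one
    )

    reply = client.messages.create(
        model="qwen3-30b-a3b",  # whatever name your server exposes
        max_tokens=512,
        messages=[{"role": "user", "content": "Say hi in one sentence."}],
    )
    print(reply.content[0].text)

If that round-trips, exporting the same ANTHROPIC_BASE_URL in the shell you start Claude Code from is all that's left.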
Latency is not an issue at all for LLMs; even a few hundred ms won't matter.
It doesn't make a lot of sense to me, except when working offline while traveling.
This is something that frequently comes up, and whenever I ask people to share the full prompts, I'm never able to reproduce it locally. I'm running GPT-OSS-120B with the "native" weights in MXFP4, and I've only seen "I cannot fulfill this request" when I actually expect it; not once has that happened for a "normal" request you'd expect to get a proper response for.
Has anyone else come across this when not using the lower quantizations or the 20B (so GPT-OSS-120B proper in MXFP4), and could you share the exact developer/system/user prompt that triggered it?
Just like at launch, from my point of view, this seems to be a myth that keeps propagating, and no one can demonstrate an innocent prompt that actually triggers it on the weights OpenAI themselves published. The author here does seem to have actually hit the issue, but again, no examples of actual prompts, so it's still impossible to reproduce.
I am still toying with the notion of assembling an LLM tower with a few old GPUs but I don't use LLMs enough at the moment to justify it.
Cheap tier is dual 3060 12G. Runs 24B Q6 and 32B Q4 at 16 tok/sec. The limitation is VRAM for large context: 1000 lines of code is ~20k tokens, and 32k tokens of context is ~10G of VRAM.
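Rough back-of-the-envelope for where a number in that range comes from (my own sketch; the model shape below is an assumed 32B-class transformer with GQA, not anything the parent specified):

    # KV-cache sizing, back of the envelope. Shape numbers are assumptions
    # for a roughly 32B-class model with grouped-query attention.
    n_layers   = 64      # transformer blocks
    n_kv_heads = 8       # KV heads under GQA
    head_dim   = 128
    bytes_per  = 2       # fp16/bf16 cache entries
    ctx_tokens = 32_768

    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per * ctx_tokens  # 2x for K and V
    print(f"KV cache @ {ctx_tokens} tokens: {kv_bytes / 2**30:.1f} GiB")      # ~8 GiB

    weight_bytes = 32e9 * 4.5 / 8  # 32B params at ~4.5 bits/param (Q4_K_M-ish)
    print(f"Q4 weights: {weight_bytes / 2**30:.1f} GiB")                      # ~17 GiB

So on 24G of total VRAM the Q4 weights plus a 32k-token cache already leaves very little headroom, which is exactly why context is the limit on the cheap tier.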
Expensive tier is dual 3090 or 4090 or 5090. You'd be able to run 32B Q8 with large context, or a 70B Q6.
For software, llama.cpp and llama-swap. GGUF models from HuggingFace. It just works.
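The "it just works" path is short enough to sketch; the repo id, filename and flag values here are placeholders, not recommendations:

    # Sketch: fetch a GGUF from HuggingFace and hand it to llama-server.
    # Repo id, filename and flag values are placeholders.
    import subprocess
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(
        repo_id="Qwen/Qwen3-32B-GGUF",       # hypothetical repo id
        filename="Qwen3-32B-Q4_K_M.gguf",    # hypothetical quant filename
    )

    subprocess.run([
        "llama-server",
        "-m", model_path,
        "-c", "32768",    # context size -- this is what eats the VRAM
        "-ngl", "99",     # offload all layers to the GPUs
        "--port", "8080",
    ])

llama-swap then just sits in front of one or more of these configs and hot-swaps models on demand.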
If you need more than that, you're into enterprise hardware with 4+ PCIe slots, which costs as much as a car and draws the power of a small country. You're better off just paying for Claude Code.
Indeed, his self-hosting inspired me to get Qwen3:32B working locally in Ollama. Fits nicely on my M1 Pro 32GB (running Asahi). Output is a nice read-along speed and I haven't felt the need for anything more powerful.
I'd be more tempted by a maxed-out M2 Ultra as an upgrade, vs a tower with dedicated GPU cards. The unified memory just feels right for this task. Although I've noticed the second-hand value of those machines jumped massively in the last few months.
I know that people turn their noses up at local LLMs, but it more than does the job for me. Plus I made a New Year's resolution: no more subscriptions / Big-AdTech freebies.
I bet you could build a stationary tower for half the price with comparable hardware specs. And unless I'm missing something you should be able to run these things on Linux.
Getting a maxed-out non-Apple laptop will also be cheaper for comparable hardware, if portability is important to you.
Also, if you think Apple's RAM prices are crazy… you might be surprised at current DDR5 pricing. The $800 that Apple charges to upgrade an MBP from 64GB to 128GB is the current price of 64GB of desktop DDR5-6000, which is actually slower memory than the 8533 MT/s memory you're getting in the MacBook.
On Linux your options are the NVIDIA Spark (and other vendors' versions) or the AMD Ryzen AI series.
These are good options, but there are significant trade-offs. I don't think there are Ryzen AI laptops with 128GB RAM for example, and they are pricey compared to traditional PCs.
You also have limited upgradeability either way: the RAM is soldered.
Not an Apple fanboy, but I was under the impression that having access to up to 512GB of usable GPU memory was the main feature in favour of the Mac.
And now with Exo, you can even break the 512GB barrier.
Personally, I run a few local models (around 30B params is the ceiling on my hardware at 8k context), and I still keep a $200 ChatGPT subscription because I'm not spending $5-6k just to run models like K2 or GLM-4.6 (they're usable, but clearly behind OpenAI, Claude, or Gemini for my workflow).
I got excited about aescoder-4b (a model that specializes in web design only) after its DesignArena benchmarks, but it falls apart on large codebases and is mediocre at Tailwind.
That said, I think there's real potential in small, highly specialized models: a 4B model trained only for FastAPI, Tailwind, or a single framework. Until that actually exists and works well, I'm sticking with remote services.
If you scale that setup and add a couple of used RTX 3090s with heavy memory offloading, you can technically run something in the K2 class.
That's around 350,000 tokens in a day. I don't track my Claude/Codex usage, but Kilocode with the free Grok model does, and I'm using between 3.3M and 50M tokens in a day (plus additional usage in Claude + Codex + Mistral Vibe + Amp Coder).
I'm trying to imagine a use case where I'd want this. Maybe running some small coding task overnight? But it just doesn't seem very useful.
For coding-related tasks I consume 30-80M tokens per day, and I want something as fast as it gets.
Yesterday I asked Claude to write one function. I didn't ask it to do anything else because it wouldn't have been helpful.
Have a bunch of other side projects as well as my day job.
It's pretty easy to get through lots of tokens.
Essentially migrating codebases, implementing features, as well as all of the referencing of existing code and writing tests and various automation scripts that are needed to ensure that the code changes are okay. Over 95% of those tokens are reads, since often there’s a need for a lot of consistency and iteration.
It works pretty well if you’re not limited by a tight budget.
It's also unbeatable in price-to-performance, as the next best 24GiB card would be the 4090, which, even used, is almost triple the price these days while only offering about 25%-30% more performance in real-world AI workloads.
You can basically get an SLI-linked dual 3090 setup for less money than a single used 4090 and get about the same or even more performance and double the available VRAM.
> The tensor performance of the 3090 is also abysmal.
I for one compared my 50-series card's performance to my 3090 and didn't see "abysmal performance" on the older card at all. In fact, in actual real-world use (quantised models only; no one runs big fp32 models locally), the difference in performance isn't very noticeable. But I'm sure you'll be able to provide actual numbers (TTFT, TPS) to prove me wrong; a quick way to measure both is sketched at the end of this comment. I don't use diffusion models, so there might be a substantial difference there (I doubt it, though), but for LLMs I can tell you for a fact that you're just wrong.
> if you don't see the performance uplift everyone else sees there is something wrong with your setup and I don't have the time to debug it.
Read these two statements and think about what might be the issue. I only run what you call "toy models" (good enough for my purposes), so of course your experience is fundamentally different from mine. Spending five figures on hardware just to run models locally is usually a bad investment. Repurposing old hardware, OTOH, is just fine for playing with local models and optimising them for specific applications and workflows.
Also, the 3090 supports NVLink, which is actually more useful for inference speed than native BF16 support.
Maybe if you're training bf16 matters?
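For what it's worth, here's roughly how I'd collect those two numbers against any OpenAI-compatible local endpoint (llama-server, vLLM, etc.). The base URL and model name are placeholders, and counting stream chunks as tokens is only an approximation, so treat it as a rough probe rather than a proper benchmark:

    # Rough TTFT/TPS probe against an OpenAI-compatible endpoint (a sketch,
    # not a rigorous benchmark). Base URL and model name are placeholders.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")

    t0 = time.perf_counter()
    first = None
    chunks = 0
    stream = client.chat.completions.create(
        model="local",  # most single-model servers ignore this field
        messages=[{"role": "user", "content": "Write 200 words about GPUs."}],
        max_tokens=400,
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices or not chunk.choices[0].delta.content:
            continue  # skip role-only / empty chunks
        if first is None:
            first = time.perf_counter()
        chunks += 1  # roughly one chunk per generated token
    t1 = time.perf_counter()

    print(f"TTFT: {first - t0:.2f} s")
    print(f"TPS:  {chunks / (t1 - first):.1f} (approx., chunk-counted)")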
I didn’t try it long because I got frustrated waiting for it to spit out wrong answers.
But I’m open to trying again.
I do programming on and off with GPT-OSS-120B (though I mostly use Codex with hosted models), and with reasoning_effort set to high it gets things right maybe 95% of the time; it rarely gets anything wrong.
I also use it to start off MVP projects that involve both frontend and API development, but you have to be super verbose, unlike when using Claude. The context window is also small, so you need to know how to break the work up into parts that you can put together on your own.
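In case it helps anyone trying to reproduce the reasoning_effort bit above: with the OpenAI Python SDK against a local OpenAI-compatible server, I'd try passing the field through extra_body. The SDK definitely merges extra_body into the request JSON, but whether the server actually honors a "reasoning_effort" field is server-dependent, so treat both the field name and the model name below as assumptions to verify against your server's docs.

    # Sketch only: ask GPT-OSS-120B for high reasoning effort through an
    # OpenAI-compatible server. extra_body just merges extra JSON fields;
    # whether "reasoning_effort" is honored depends on the server.
    from openai import OpenAI

    client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")

    resp = client.chat.completions.create(
        model="gpt-oss-120b",  # placeholder model name
        messages=[{"role": "user", "content": "Refactor this function ..."}],
        extra_body={"reasoning_effort": "high"},  # assumption: server forwards this to the chat template
    )
    print(resp.choices[0].message.content)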
It seems so obvious to me, but I guess people are happy with Claude Code living in their home directory and slurping up secrets.
(In particular the fact that Claude Code has access to your Anthropic API key is ironic given that Dario and Anthropic spend a lot of time fearmongering about how the AI could go rogue and “attempt to escape”).
These are no more than a few dozen lines I can easily eyeball and verify with confidence; that's done in under 60 seconds and leaves Claude Code with plenty of quota for significant tasks.