Posted by mpweiher 12/21/2025

A guide to local coding models (www.aiforswes.com)
607 points | 351 comments
Roark66 12/22/2025|
I found the winning combination is to use all of them in this way:

- First, you need a vendor-agnostic tool like opencode (I had to add my own vendors, as it didn't support them out of the box properly).

- Second, you set up agents with different models. I use:

- For architecture and planning: Opus, Sonnet, GPT 5.2, Gemini 3 (depending on specifics; for example, I found GPT better at troubleshooting, Sonnet better at pure code planning, Opus better at DevOps, and Gemini the best for single-shot stuff).

- For execution of said plans: Qwen 2.5 Coder 30B (yes, in my use cases it's even better than Qwen3, despite benchmarks), Sonnet (only when absolutely necessary), and Qwen3-235B (between Qwen 2.5 and Sonnet).

- For verification: Gemini 3 Flash, Qwen3-480B, etc.

The biggest saving comes from keeping the context small and, where many turns are required, going for smaller models. For example, a single 30-minute troubleshooting session with Gemini 3 can cost $15 if you run it "normally", or $2 if you use the agents and wipe the context after most turns (which is possible thanks to tracking progress in a plan file).
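
For a concrete sense of this routing, staged calls through opencode could look like the sketch below. This is an illustration only, not Roark66's actual config: the --model flag, the provider/model IDs, and the plan-file convention are assumptions, so check opencode's docs for the real syntax.

  # 1. Architecture/planning with a frontier model, progress captured in a plan file
  opencode run --model anthropic/claude-opus \
    "Draft PLAN.md for the auth refactor: list numbered steps, touch no code."
  # 2. Execution of one step at a time with a small/cheap model, context wiped between steps
  opencode run --model local/qwen2.5-coder-30b \
    "Implement step 1 of PLAN.md, then mark it done in PLAN.md."
  # 3. Verification with a fast, cheap reviewer
  opencode run --model google/gemini-3-flash \
    "Review the diff against PLAN.md step 1 and list problems only."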

jszymborski 12/22/2025||
I just got an RTX 5090, so I thought I'd see what all the fuss was about with these AI coding tools. I've previously copy-pasted back and forth from Claude but never used the instruct models.

So I fired up Cline with gpt-oss-120b, asked it to tell me what a specific function does, and proceeded to watch it run `cat README.md` over and over again.

I'm sure it's better with the Qwen Coder models, but it was a pretty funny first look.

kelvie 12/22/2025|
gpt-oss-120b doesn't fit on a 5090 without offloading or crazy quants -- or did you mean you ran it via openrouter or something?
jszymborski 12/22/2025|||
I'm running the MXFP4 [0] quants at like 10-13 tok/s. It is actually really good. I'm starting to think it's a problem with Cline, since I just tried it with Qwen3 and the same thing happened. Turns out Cline _hates_ empty files in my projects, although they aren't required for this to happen.

[0] https://huggingface.co/blog/RakshitAralimatti/learn-ai-with-...
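
For context on the fit question: one way a 32 GB 5090 can serve the MXFP4 GGUF (a sketch, not necessarily what the parent is running) is llama.cpp with all layers pushed to the GPU but the MoE expert tensors kept in system RAM via a tensor override. The filename, context size, and regex below are placeholders that may need tweaking per build and model.

  # Serve gpt-oss-120b with llama.cpp, offloading everything except the MoE experts
  llama-server \
    -m gpt-oss-120b-MXFP4.gguf \
    --n-gpu-layers 999 \
    --override-tensor ".ffn_.*_exps.=CPU" \
    --ctx-size 32768 --port 8080
  # Cline (or any OpenAI-compatible client) can then point at http://localhost:8080/v1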

kube-system 12/22/2025|||
Sounds like a crazy quant. IME 2 bit quants are pretty dumb.
brainless 12/22/2025||
I do not spend $100/month. I pay for one Claude Pro subscription and then a (much cheaper) z.ai Coding Plan, which is about one fifth the cost.

I use Claude for all my planning, create task documents, and hand them over to GLM 4.6. It has been my workhorse as a bootstrapped founder (building nocodo, think Lovable for AI agents).

alok-g 12/22/2025|
I have heard about this approach elsewhere too. Could you please provide some more details on the setup steps and usage approach? I would like to replicate it. Thanks.
brainless 12/22/2025|||
I simply ask Claude Sonnet, via Claude Code, to use opencode. That's it! Example:

  We need to clean up code lint and format errors across multiple files. Check which files are affected using cargo commands. Please use opencode, a coding agent that is installed. Use `opencode run <prompt>` to pass in a per-file prompt to opencode, wait for it to finish, check and ask again if needed, then move to next file. Do not work on files yourself.
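
In practice, Claude then issues per-file calls of roughly this shape; the file path and wording here are hypothetical.

  # One of the per-file delegations Claude Code runs under the prompt above
  opencode run "Fix all cargo clippy and cargo fmt issues in src/server/routes.rs only. \
  Do not change behaviour; re-run the cargo commands to verify before finishing."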
baconner 12/22/2025|||
There are a few decent workflows I've tried for pairing a planning/reviewer model set (e.g. Claude, Codex, Gemini) with an execution model (e.g. GLM 4.6, flash models, etc.). All three of the options below let you live in a single coding CLI but easily swap in different models for different tasks.

- claude code router - basically lets you swap other models into the real Claude Code CLI and set up triggers for when to use which one (e.g. plan mode uses real Claude; non-plan turns, or ones with certain keywords, use GLM).

- opencode - this is what I'm mostly using now. Similar to CCR, but I find it a lot more reliable with alt models. Thinking tasks go to Claude, Gemini, or Codex, and lesser execution tasks go to GLM 4.6 (on Cerebras).

- sub-agent mcp - another cool way is to use an MCP (or a skill or a custom /command) that runs another agent CLI for certain tasks. The MCP approach is neat because then your thinker agent, like Claude, can decide when to call the execution agents, when to call in another smart model for a review of its own thinking, etc., instead of it being an explicit choice from you. So you end up with the MCP plus an AGENTS.md that instructs it to aggressively use the sub-agent MCP when it's a basic execution task, a review, ... (a minimal wrapper sketch is at the end of this comment).

I also find that with this setup, just being able to tap in an alt model when one is stuck, or to get a review from an alt model, helps keep things moving.
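
For the third option, one minimal way to wire it up without a full MCP server is a wrapper script that the thinker agent is told (via AGENTS.md, a skill, or a /command) to call for execution tasks. The script name, the --model flag, and the model ID below are illustrative assumptions rather than opencode's documented interface.

  #!/usr/bin/env sh
  # delegate.sh - hypothetical helper the planning agent shells out to for
  # basic execution tasks, so its own context stays clean.
  # The --model flag and "zai/glm-4.6" ID are illustrative; adapt to your setup.
  MODEL="${DELEGATE_MODEL:-zai/glm-4.6}"
  TASK="$1"
  exec opencode run --model "$MODEL" "$TASK"

AGENTS.md then only needs a line along the lines of: for mechanical edits and basic execution tasks, run ./delegate.sh "<task>" instead of editing files yourself.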

KronisLV 12/22/2025||
RooCode and KiloCode also have an Orchestrator mode that can create sub-tasks, and you can specify which model to use for what - and since they report their results back after finishing a task (implement X, fix Y), the context of the more expensive model doesn’t get as polluted. Probably one of the most user-friendly ways to do that.

A simpler approach without subtasks would be to just use the smart model for Ask/Plan/whatever mode and the dumb but cheap one for the Code one, so the smart model can review the results as well and suggest improvements or fixes.

ardme 12/21/2025||
Isn't the math better if you buy Nvidia stock with what you'd pay for all the hardware and then just pay $20 a month for Codex out of the annual returns?
phainopepla2 12/22/2025|
If you can see into the future and know the stock price, then sure.
Muromec 12/22/2025||
The line only ever goes up, until we all cry and find a new false messiah. Or die
NumberCruncher 12/22/2025||
I am freelancing on the side and charge 100€ by the hour. Spending roughly 100€ per month on AI subscriptions has a higher ROI for me personally than spending time on reading this article and this thread. Sometimes we forget that time is money...
threethirtytwo 12/21/2025||
I hope hardware becomes so cheap that local models become the standard.
layer8 12/22/2025||
I hope that as well, but if cloud AI keeps buying up most of the world’s GPU and RAM production, it might not come to that.
rynn 12/21/2025||
It will be like the rest of computing, some things will move to the edge and others stay on the cloud.

Best choice will depend on use cases.

lelanthran 12/22/2025|||
> It will be like the rest of computing, some things will move to the edge and others stay on the cloud.

It will become like cloud computing - some people will have a cloud bill of $10k/m to host their apps, while other people will run their app on a $15/m VPS.

Yes, the cost discrepancy will be as big as the current one we see in cloud services.

Terr_ 12/22/2025|||
I think the long term will depend on the legal/rent-seeking side.

Imagine having the hardware capacity to run things locally, but not the necessary compliance infrastructure to ensure that you aren't committing a felony under the Copyright Technofeudalism Act of 2030.

mungoman2 12/22/2025||
The money argument is IMHO not super strong here, as that Mac depreciates more per month than the subscription they want to avoid.

There may be other reasons to go local, but I would say that the proposed way is not cost effective.

There's also a fairly large risk that this HW may be sufficient now but will be too small before long. So there is a large financial risk built into this approach.

The article proposes using smaller/less capable models locally. But this argument also applies to online tools! If we settle for less capable models, even the $20/mo subscriptions won't hit their limits.

altx 12/22/2025||
It's interesting to notice that over at https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com... we default to measuring LLM coding performance as how long a human task (~5 h) a model can complete with a 50% success rate (falling back to an 80% threshold, ~0.5 h, for the second chart), while here it seems that for actual coding we really care about the last 90-100% of the costly model's performance.
ljosifov 12/23/2025||
Nah - given the ergonomics + economics, local coding models are not that viable at the moment. I like all things local, even if just for the safety of keeping a healthy, competitive ecosystem, and I can imagine really specialised use cases where I run a not-so-smart 8B model to process oodles of data on my local 7900 XTX or similar. I've got an older M2 MBP with 96 GB of (v)ram and try all the local things that fit: usually LM Studio for the speed boost of MLX-format models on Apple Silicon (as an endpoint, plus chat for a vibes test; LM Studio's omission from the OP blog post makes me question the post), or llama.cpp for GGUF (llama.cpp is the OG - an excellent, universal engine and format that recently got even better).

Looking at how agents work, the smarts of Claude Code or Codex in using their tools feel like half of their success (the other half being the underlying LLM smarts) - from the baked-in training on 'Tool Use & Interleaved Thinking' with the right tools in the right way, down to the trivial 'DO NOT fill your 100K of useful context with the random contents of a multi-MB file as prompt'.

The $20/mo plans are insanely competitive. OpenAI is generous with Codex, and in addition to the terminal that I mostly use, there is the VS Code addon as well as use in Cline or Roo. Cursor offers an in-house model that is fast and good, with insane economy reading large codebases, as well as BYOK to the latest and greatest LLMs afaik. Claude Code at $20/mo is stingy with quotas, but it can be supplemented with Z.ai standing in - glm-4.7 as of yesterday (I saw no difference between glm-4.6 and sonnet-4.5, which was already very good). It's a 3-line change to ~/.claude/settings.json to flip between Z.ai and Anthropic at will (e.g. when paused on one, switch to the other). Have not tried Cerebras' high tok/s but would love to - not waiting makes a ton of difference to productivity.
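
For anyone wondering, the 3-line flip is env overrides in ~/.claude/settings.json pointing Claude Code at Z.ai's Anthropic-compatible endpoint. The key names and URL below are from memory of Z.ai's docs (verify before using), ZAI_KEY is a placeholder, and this overwrites the file, so merge by hand if you keep other settings there.

  # Switch Claude Code to Z.ai (assumes ZAI_KEY holds your Z.ai API key)
  printf '{ "env": { "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic", "ANTHROPIC_AUTH_TOKEN": "%s" } }\n' \
    "$ZAI_KEY" > ~/.claude/settings.json
  # Delete the "env" block (or restore a backup of the file) to flip back to Anthropic.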
SpaceManNabs 12/22/2025|
I love that this article added a correction and took ownership of it. This encourages more people to blog stuff and then get more input on the parts they missed.

The best way to get the correct answer on something is posting the wrong thing. Not sure where I got this from, but I remember it was in the context of stackoverflow questions getting the correct answer in the comments of a reply :)

Props to the author for their honesty and having the impetus to blog about this in the first place.
