A guide to local coding models

Posted by mpweiher 12/21/2025

A guide to local coding models(www.aiforswes.com)

607 points | 351 commentspage 4

bilater 12/22/2025|

If you are using local models for coding you are midwiting this. Your code should be worth more than a subscription.

The only legit use case for local models is privacy.

I don't know why anyone would want to code with an intern level model when they can get a senior engineer level model for a couple of bucks more.

It DOESN'T MATTER if you're writing a simple hello world function or building out a complex feature. Just use the f*ing best model.

pcl 12/22/2025||

Or if you want to do development work while offline.

bilater 12/22/2025||

Good to have fallbacks but in reality most ppl ( at least in the west) will have internet 99% of the time.

pcl 12/22/2025||

Sure, but I am not one of them. I find myself wanting to code on trains and planes pretty often, and so local toolchains are always attractive for me.

rester324 12/23/2025|||

Is this some kind of mental problem that you want to tell people what they do and how they spend their money? Pretty jerk attitude IMO

groguzt 12/22/2025|||

"senior engineer level model" is the biggest cope I've ever seen

beeboop0 12/22/2025|||

[dead]

jgalt212 12/22/2025||

I will use a local coding model for our proprietary / trade secrets internal code when Google uses Claude for its internal code and Microsoft starts using Gemini for internal code.

The flip side of this coin is I'd be very excited if Jane Street or DE Shaw were running their trading models through Claude. Then I'd have access to billions of dollars of secrets.

Aurornis 12/22/2025||

> I'd be very excited if Jane Street or DE Shaw were running their trading models through Claude. Then I'd have access to billions of dollars of secrets.

Using Claude for inference does not mean the codebase gets pulled into their training set.

This is a tired myth that muddies up every conversation about LLMs

jgalt212 12/22/2025|||

> This is a tired myth that muddies up every conversation about LLMs

Many copyright holders, and the courts would beg to differ.

bilater 12/22/2025|||

lol yeah its weird to me why even ppl on HN can't wrap their heads around stateless calls.

jgalt212 12/22/2025||

unless you control both the client and the server, you cannot prove a call is stateless.

erusev 12/23/2025||

If you're on a Mac and want a simple and open-source way to run models locally, check out our app LlamaBarn: https://github.com/ggml-org/LlamaBarn

holyknight 12/21/2025||

your premise would've been right, if memory wouldn't skyrocketed like 400% in like 2 weeks.

jollymonATX 12/22/2025||

This is not really a guide to local coding models which is kinda disappointing. Would have been interested in a review of all the cutting edge open weight models in various applications.

Simplita 12/22/2025||

One thing that surprised us when testing local models was how much easier debugging became once we treated them as decision helpers instead of execution engines. Keeping the execution path deterministic avoided a lot of silent failures. Curious how others are handling that boundary.

Myrmornis 12/22/2025||

Can anyone give any tips for getting something that runs fairly fast under ollama? It doesn't have to be very intelligent.

When I tried gpt-oss and qwen using ollama on an M2 Mac the main problem was that they were extremely slow. But I did have a need for a free local model.

parthsareen 12/22/2025||

How much ram are you running with? Qwen3 and gpt-oss:20b punch a good bit above their weight. Personally use it for small agents.

am17an 12/22/2025||

Use llama.cpp? I get 250 toks/sec on gpt-oss using a 4090, not sure about the mac speeds

2001zhaozhao 12/22/2025||

Under current prices buying hardware just to run local models is not worth it EVER, unless you already need the hardware for other reasons or you somehow value having no one else be able to possibly see your AI usage.

Let's be generous and assume you are able to get a RTX 5090 at MSRP ($2000) and ignore the rest of your hardware, then run a model that is the optimal size for the GPU. A 5090 has one of the best throughputs in AI inference for the price, which benefits the local AI cost-efficiency in our calculations. According to this reddit post it outputs Qwen2.5-Coder 32B at 30.6 tokens/s. https://www.reddit.com/r/LocalLLaMA/comments/1ir3rsl/inferen...

It's probably quantized, but let's again be generous and assume it's not quantized any more than models on OpenRouter. Also we assume you are able to keep this GPU busy with useful work 24/7 and ignore your electricity bill. At 30.6 tokens/s you're able to generate 993M output tokens in a year, which we can conveniently round up to a billion.

Currently the cheapest Qwen2.5-Coder 32B provider on OpenRouter that doesn't train on your input runs it at $0.06/M input and $0.15/M output tokens. So it would cost $150 to serve 1B tokens via API. Let's assume input costs are similar since providers have an incentive to price both input and output proportionately to cost, so $300 total to serve the same amount of tokens as a 5090 can produce in 1 year running constantly.

Conclusion: even with EVERY assumption in favor of the local GPU user, it still takes almost 7 years for running a local LLM to become worth it. (This doesn't take into account that API prices will most likely decrease over time, but also doesn't take into account that you can sell your GPU after the breakeven period. I think these two effects should mostly cancel out.)

In the real world in OP's case, you aren't running your model 24/7 on your MacBook; it's quantized and less accurate than the one on OpenRouter; a MacBook costs more and runs AI models a lot slower than a 5090; and you do need to pay electricity bills. If you only change one assumption and run the model only 1.5 hours a day instead of 24/7, then the breakeven period already goes up to more than 100 years instead of 7 years.

Basically, unless you absolutely NEED a laptop this expensive for other reasons, don't ever do this.

rester324 12/22/2025|

These are the comments of the people who will cry a f@cking river when all the f@cking bubbles burst. You really think that it's "$300 total to serve the same amount of tokens as a 5090 can produce in 1 year running constantly"??? Maybe you forgot to read the news how much fucking money these companies are burning and losing each year. So these kind of comments as "to run local models is not worth it EVER" make me chuckle. Thanks for that!

2001zhaozhao 12/23/2025||

If I were predicting the bubble to burst and API prices to go up in the future, wouldn't it be much better to use (abuse) the cheap API pricing now and then buy some discount AI hardware that everyone's dumping on the market once the bubble actually does burst? Why would I buy local AI hardware now when it is at it's most expensive?

j45 12/22/2025||

The work and interest in local coding models reminds me of the early 3D printer community, whatever is possible may take more than average tinkering until someone makes it a lot more possible.

mungoman2 12/22/2025||

The money argument doesn't make sense here as that Mac depreciates more per month than the subscription they want to avoid.

There may be other reasons to go local, but the proposed way is not cost effective.

Ultimatt 12/22/2025|

For local MLX inference LM Studio is a much nicer option than Ollama

More comments...