Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

Posted by cloudking 19 hours ago

Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g tok/s)

1005 points | 448 commentspage 12

drnick1 14 hours ago|

Do you recommend Ollama or bare llama.cpp?

jboss10 13 hours ago||

llama.cpp It's faster and more open source. Ollama has some mixed history. I use llama-swap to emulate the Ollama experience.

shironnnn_ 13 hours ago||

if on MacOS I recommend llm-mlx which currently renders tokens 10%-15% faster than llama.cpp.

devmor 5 hours ago||

I’d be surprised if this was useful for much. Claude is already almost too slow to do anything serious I’d consider using it for outside of grunt work without parallelizing.

The only reason it’s economical is because it’s massively discounted if you’re not paying API rates.

devin 15 hours ago||

Anyone here running a tinygrad?

system2 17 hours ago||

Until I can buy an 80GB VRAM GPU, I won't attempt to do it. A local LLM is always missing something that needs a bigger model.

ColonelPhantom 11 hours ago|

Which model class requires an 80 GB VRAM GPU? From my perspective, popular models seem to be either in the ~30B range (Qwen3.6, Gemma 4), while the larger models (MiniMax, MiMo, StepFun, Deepseek) are in the multiple hundreds of billions parameters, for which 80 GB is simply too small.

You can just about reach the lower end of the latter category with a 128GB machine like a DGX Spark, Framework Desktop, or M5 Max, though those are usually not super fast. For the former category, you can easily run them fast with something like a 3090 or 5090, hell, probably even a 5060 Ti.

system2 4 hours ago|||

Video models.

CamperBob2 5 hours ago|||

This is true. There's not much point in buying only one RTX 6000. You need at least two to run anything interesting that you couldn't run on a 5090. And you can imagine where it goes from there.

christkv 17 hours ago||

Waiting for this https://github.com/antirez/ds4 to stabilize for strix halo.

w10-1 15 hours ago||

I run many models (but mainly Gemma-4) using oMLX (for caching) on a 32GB M1 max using (gasp) Xcode. For tok/sec response times, I'd say it responds faster than I could read the prompt aloud in many cases (and I'm not constantly polling the Claude status page).

For months I spent time curating the AI+harness+skills+MCP servers, but now mainly just code with it. I find myself not bothering to use Claude (but keep paying "just in case").

That's feasible in part because my prompts have very specific objectives, constraints, and suggested staging, because I want the code to be exactly as I would write it, and I want to weigh in at specific moments. I would say the speed-up is 2-4X instead of the 10X of vibe-coding greenfield projects. The problem is not the coding speed, but building something complicated that's also correct and flexible (i.e., a directional accuracy). E.g., the agents help with abandoning a less-fruitful API shape instead of sticking with what works in a local maxima.

One flaw there is that I'm still writing code that feels clean to humans, which now is probably a waste. LLM's might be happier with 10+ parameters on one API instead of a plethora of configuration objects and convenience wrappers.

lowbloodsugar 9 hours ago||

If you want to try it out before dropping $$$ on a GPU, just run something that would fit on your target GPU but online.

sometimelurker 14 hours ago||

yeah I use one one the small MTP qwens and pi

hacker_homie 11 hours ago||

I do qwen3.6 on an amd ai max laptop getting about 6-10tok/s it’s slow enough that I can follow along. It has issues with design and large piles of code. Otherwise it’s a good programming buddy.

major505 15 hours ago|

Yes. I use Owen on my MacBook m1 (16gb) daily, running inside Ollama. Works well. Is not particularly fast, and I need to create a custom imagem that sets the temperature of the model to zero starting, so I don't get over creative with its bullshit, but it works reasonable week.

Der_Einzige 13 hours ago|

Secretly the problems many people have with agentic coding are related to poor choice of sampling settings, but the world will wait several more years before this is understood well. top_p and top_k are garbage but they are intentionally kept on purpose because subsequent methods enable coherent high temperature sampling, which is an absolute no go for alignment/safety reasons.

The secret to actually good agentic outputs even with small models? Llamacpp has support for this little known sampler called "top-n sigma". You should use that, set it to 1 and set temperature to literally whatever you want (it could be infinity) and your model will just magically work to your maximum context window. That's because long context generation is a sampling problem.

More comments...