Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

Posted by cloudking 16 hours ago

Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g., tok/s).

837 points | 384 comments | page 9

hegdeezy 13 hours ago|

I have tried locally but I find that the implicit breakeven is somewhere around 1 year of use given the high power costs where I live. Not really worth it but maybe if I move some day!
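For anyone wanting to run their own numbers, a back-of-the-envelope sketch of that break-even is below. The function name and every input value are hypothetical placeholders, and it assumes a flat subscription price and a constant power draw, which real usage won't match.

```python
def breakeven_months(hardware_cost, subscription_per_month,
                     watts, hours_per_day, price_per_kwh):
    """Months until local hardware pays for itself, or None if it never does."""
    # Monthly electricity cost, assuming a 30-day month.
    power_per_month = watts / 1000 * hours_per_day * 30 * price_per_kwh
    # What you save each month by dropping the subscription.
    savings_per_month = subscription_per_month - power_per_month
    if savings_per_month <= 0:
        return None  # power alone costs as much as the subscription
    return hardware_cost / savings_per_month


# Placeholder numbers, not anyone's real bill.
print(breakeven_months(hardware_cost=2000, subscription_per_month=200,
                       watts=300, hours_per_day=8, price_per_kwh=0.40))
```

The higher the price per kWh, the smaller the monthly savings and the longer the payback, which is the effect described above.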

_davide_ 14 hours ago||

I used to mix remote and local MiniMax 2.7 (q3) on my Strix Halo. It ran at 30 tok/s tg and 220 tok/s pp... it was painfully slow, but it felt good knowing I could stay offline. Unfortunately M3, which is at opus .8 levels, is 460b parameters and doesn't even fit in 128GB of memory, let alone with a big context. Strix Halo feels like a toy for AI purposes. https://kyuz0.github.io/amd-strix-halo-toolboxes/
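A rough sketch of why 460b parameters won't fit in 128GB: weights alone take roughly parameters times bits per weight divided by 8. The helper name is made up for illustration, and the estimate ignores the KV cache, runtime overhead, and the fact that real quant formats usually spend somewhat more than their nominal bits per weight, so actual requirements would be higher.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


MEMORY_GB = 128  # memory size mentioned in the comment above

for bits in (3, 4, 8):
    size = weights_gb(460, bits)
    verdict = "fits" if size < MEMORY_GB else "does not fit"
    print(f"460B at {bits} bits/weight: ~{size:.1f} GB, {verdict} in {MEMORY_GB} GB")
```

Even at a nominal 3 bits per weight the weights alone exceed 128GB, before any context is allocated.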

sosodev 13 hours ago|

My Strix Halo board is feeling more useful and less toylike with the recent performance gains from MTP, better quantization, and general performance improvements across the stack. For example, I can run Unsloth's Gemma4-31B 4-bit QAT model at around 30tg and 200pp. I don't find that too slow at all, particularly because it's nearly full accuracy and good enough for a lot of the different stuff I throw at it.

I think it also helps that I'm using my machine to do home server stuff. It excels at all of the traditional workloads. Then I can lean on the AI to help with automation here and there. I find it deeply satisfying.

_davide_ 11 hours ago||

you can absolutely use it for some workloads, but as soon as you have some extra complexity in a big repo it'll take forever, and the economics are silly to the point that the electricity bill would be comparable to a subscription. I love having the option of running things locally if some random dude decides to pull the plug, and it gives me solace that I can have 100% private inference, but as the main driver during the day? shoot me
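To put a number on "take forever", here is a minimal sketch using the 220 pp / 30 tg figures quoted upthread. The function name and the token counts are assumptions for illustration, and it ignores prompt caching, which would cut the prefill cost on repeated context.

```python
def response_seconds(prompt_tokens, output_tokens, pp_tok_s, tg_tok_s):
    """Rough wall-clock time for one request: prefill plus generation."""
    prefill = prompt_tokens / pp_tok_s    # time to process the prompt
    generate = output_tokens / tg_tok_s   # time to produce the answer
    return prefill + generate


# Hypothetical request sizes: a small question vs. a big chunk of a repo.
for prompt_tokens in (2_000, 100_000):
    secs = response_seconds(prompt_tokens, output_tokens=1_000,
                            pp_tok_s=220, tg_tok_s=30)
    print(f"{prompt_tokens} prompt tokens: ~{secs / 60:.1f} min")
```

With a large context the prefill term dominates, which is why big-repo work feels so much slower than small one-off prompts.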

agentbc9000 9 hours ago||

Kimi K2.7 is very good. I have been testing it and it's very, very good: Fable 5 level of goodness.

bentt 8 hours ago|

Say more!

catapart 9 hours ago||

tough ask, but since we're here: has anyone done this with 16GB of VRAM? I've been getting projects finished with LM Studio, but it definitely could stand to be more efficient. lots of time wasted with trying to get models to understand a problem with so few tokens.

Rzor 3 hours ago|

RX 9060 XT 16GB here on google/gemma-4-26b-a4b-qat using LM Studio. Context 65k, 23 layers on the GPU, 7 on the CPU, model in memory, mmapped. I'm getting 23-33 tks. Started experimenting 3 days ago (with gemma-4-e4b), don't know what half those settings mean, but 26B, even quantized, feels significantly better at a few small projects I asked it to create ("create a image converter using ffmpeg in bash"[0], "create a canvas animation with real physics, no libraries"[1]).

It's faster than I can read, but it feels slow as hell. I think 40-50 tks would be much more comfortable, and I hope I can reach that when I try this on llama.cpp soon.

[0] - https://pastes.io/9gaARxE8

[1] - https://jsfiddle.net/pou4nbh9/1/

Model: https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-gg...

SugarReflex 7 hours ago||

Is anyone using Aider? Are there any decent CLI alternatives to it?

AH4oFVbPT4f8 12 hours ago||

Ollama + Hermes on M5 Max 128GB using .NET using Qwen 3.6:35b-a3b as the primary model to do the work. I might use 27b to plan what to do.

xeonax 12 hours ago|

What's .NET doing in between?

AH4oFVbPT4f8 10 hours ago||

Sorry, I meant to say I was writing .NET C# with the setup

chungus 8 hours ago||

Yup, although technically not replaced: I never used either of those products because I don't like sending my code to their black box. I have 2x24GB AMD GPUs, bought from gamers on my local marketplace; one is connected with a 40cm riser cable. Running Qwen 27B and am very happy with its performance. Q8 with 135k context (arbitrary number, I could push it to 256k). I like to use qwen 35B3A for mapping out entire code paths through our relatively complicated codebase/infra at work.

I think it's so good that I now scour the local marketplaces for good buys on 24GB cards that don't seem to have been run into the ground by miners and the like, to build an even bigger rig for parallel execution.

Power usage is also totally not an issue; an AI workload is very different from gaming.

tl;dr: llama.cpp-vulkan with opencode on AMD cards with 48GB total VRAM, on Arch btw.

SkitterKherpi 13 hours ago||

So far it has been the kind of thing where it always feels like the next version of the local models will be the one that is just good enough.

euroderf 11 hours ago||

Is anyone managing to do this on a Mac with a measly 8GB? Asking for a friend.

jwr 13 hours ago|

I tried many, many times and I keep trying. But I just don't see this happening: those tiny models that we can run on our machines (I have an M4 Max Mac, so I can reasonably run qwen3.6-35b-a3b or gemma-4-26b-a4b-qat at this time) are NOWHERE near as smart as the huge monsters like Opus/Fable. Nowhere. I can see a lot of people deluding themselves.

Sure, you can get the local models to generate plausible-looking code for simple cases. But compared to how I solve complex design problems in a large codebase with Claude Code and Opus/Fable, this isn't worth my time.
