Posted by cloudking 17 hours ago

Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g., tok/s).
932 points | 421 comments | page 10
catapart 10 hours ago|
Tough ask, but since we're here: has anyone done this with 16GB of VRAM? I've been getting projects finished with LM Studio, but it definitely could stand to be more efficient. Lots of time gets wasted trying to get models to understand a problem with so few tokens.
Rzor 4 hours ago|
RX 9060 XT 16GB here on google/gemma-4-26b-a4b-qat using LM Studio. Context 65k, 23 layers on the GPU, 7 on the CPU, model in memory, mmapped. I'm getting 23-33 tok/s. Started experimenting 3 days ago (with gemma-4-e4b), don't know what half those settings mean, but 26B, even quantized, feels significantly better at a few small projects I asked it to create ("create a image converter using ffmpeg in bash", "create a canvas animation with real physics, no libraries"[1]).

It's faster than I can read, but it feels slow as hell. I think 40-50 tok/s is probably much more comfortable, and I hope I can reach that when I try this on llama.cpp soon enough.

[0] - https://pastes.io/9gaARxE8

[1] - https://jsfiddle.net/pou4nbh9/1/

Model: https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-gg...
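
For a rough sense of what the gap between 23-33 tok/s and 40-50 tok/s means in practice, here is a minimal sketch that converts decode speed into wait time. The 2,000-token response length and the function name are illustrative assumptions, not figures from the setup above, and prompt processing time is ignored.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to stream num_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Hypothetical response length (assumption): a medium-sized code edit.
RESPONSE_TOKENS = 2000

# Rates taken from the comment: 23-33 tok/s observed, 40-50 tok/s hoped for.
for rate in (23, 33, 40, 50):
    secs = generation_seconds(RESPONSE_TOKENS, rate)
    print(f"{rate:>3} tok/s -> {secs:5.1f} s for {RESPONSE_TOKENS} tokens")
```

Under these assumptions, a 2,000-token response takes roughly 87 seconds at 23 tok/s and 40 seconds at 50 tok/s, which may be why a rate that outpaces reading can still feel slow when waiting on a complete file.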

AH4oFVbPT4f8 14 hours ago||
Ollama + Hermes on M5 Max 128GB using .NET using Qwen 3.6:35b-a3b as the primary model to do the work. I might use 27b to plan what to do.
xeonax 13 hours ago|
What's .NET doing in between?
AH4oFVbPT4f8 11 hours ago||
Sorry, I meant to say I was writing .NET C# with the setup
SkitterKherpi 15 hours ago||
So far, it has always felt like the next version of the local models will be the one that is just good enough.
SugarReflex 9 hours ago||
Is anyone using Aider? Are there any decent CLI alternatives to it?
chungus 9 hours ago||
Yup, although technically I haven't replaced anything: I never used either of those products, because I don't like sending my code to their black box. I have 2x24GB AMD GPUs, bought from gamers on my local marketplace; one is connected with a 40cm riser cable. I'm running Qwen 27B and am very happy with its performance: Q8 with 135k context (an arbitrary number, I could push it to 256k). I like to use Qwen 35B-A3B for mapping out entire code paths through our relatively complicated codebase/infra at work.

I think it's so good that I now scour the local marketplaces for good buys on 24GB cards that don't seem to have been run into the ground by miners and the like, to build an even bigger rig for parallel execution.

Power usage is also not an issue at all; an AI workload is very different from gaming.

tl;dr: llama.cpp-vulkan with opencode on AMD cards with 48GB of VRAM in total, on Arch btw.
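
A back-of-the-envelope check of why a 27B model at Q8 leaves room for long context on 48GB. This sketch assumes "Q8" means llama.cpp's Q8_0 format (32 8-bit weights plus one 16-bit scale per block, so about 8.5 bits per weight), that the model has exactly 27 billion parameters, and that every tensor uses that format. Real GGUF files and runtime overhead will differ, and the function names are hypothetical.

```python
# Bits per weight for Q8_0 (assumption: this is what "Q8" refers to).
# Each block holds 32 int8 weights plus one 16-bit scale: (32*8 + 16) / 32.
Q8_0_BITS_PER_WEIGHT = (32 * 8 + 16) / 32  # 8.5


def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    # billions of params * bits / 8 = billions of bytes = GB
    return params_billions * bits_per_weight / 8


def headroom_gb(vram_gb: float, params_billions: float, bits_per_weight: float) -> float:
    """VRAM left after loading the weights (KV cache and buffers must fit here)."""
    return vram_gb - weight_gb(params_billions, bits_per_weight)


weights = weight_gb(27, Q8_0_BITS_PER_WEIGHT)
spare = headroom_gb(48, 27, Q8_0_BITS_PER_WEIGHT)
print(f"weights: {weights:.1f} GB, headroom on 48 GB: {spare:.1f} GB")
```

Under these assumptions the weights come to about 28.7 GB, leaving about 19.3 GB across the two cards for the KV cache, compute buffers, and anything else.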

euroderf 12 hours ago||
Is anyone managing to do this on a Mac with a measly 8GB? Asking for a friend.
jwr 14 hours ago||
I tried many, many times and I keep trying. But I just don't see this happening: those tiny models that we can run on our machines (I have an M4 Max Mac, so I can reasonably run qwen3.6-35b-a3b or gemma-4-26b-a4b-qat at this time) are NOWHERE near as smart as the huge monsters like Opus/Fable. Nowhere. I can see a lot of people deluding themselves.

Sure, you can get the local models to generate plausible-looking code for simple cases. But compared to how I solve complex design problems in a large codebase with Claude Code and Opus/Fable, this isn't worth my time.

jmichaelson 14 hours ago||
I am working on exactly this issue right now. My approach is that a highly optimized harness (pi.dev) with the right backing knowledge base (a custom, self-updating wiki with lots of QC layers) can cover most of the usage patterns I have with my Claude Max 20x subscription. I use Gemma 4 26B QAT served by a custom fork of llama.cpp, with 4-8 slots of 256k context at Q8. It's a very good model when the harness keeps it on rails. In an age of 1M context windows, 256k may seem small, but it's been plenty for my work (scientific programming). A $20/month subscription to Ollama-cloud gets me good coverage of consults out to frontier models for difficult plans or debugging (again, this is all woven into my highly customized pi install).

I'm still optimizing it (with Claude, to be clear), but my testing is very encouraging. I worry a lot about companies (and the government) controlling access to machine intelligence, so local is the way to go.

devmor 3 hours ago||
I’d be surprised if this were useful for much. Without parallelizing, Claude is already almost too slow for anything serious I’d consider using it for beyond grunt work.

The only reason it’s economical is that it’s massively discounted if you’re not paying API rates.

anubhav200 14 hours ago|
Yes: llama.cpp, Qwen 27B and 35B, and llama-cpp-manager for managing model configs (https://github.com/anubhavgupta/llama-cpp-manager).