Posted by cloudking 5 hours ago

Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g. tok/s).
191 points | 126 comments | page 2
bravetraveler 1 hour ago
I'm largely 'all natural'; what little LLM usage I have is local. On a 128G Strix system, a not-super-dense Qwen or Gemma variant gets 50-80 tok/s output. I'm not subscribing to Claude/GPT/etc., even in the unlikely event that these are the last local models released; it's simply not needed.
Kostic 2 hours ago
For personal needs I connected VSCode with llama.cpp running Qwen 3.6 27B or Gemma 4 31B and it's good enough to cancel my cloud subscription.

Qwen runs on my first GPU at q4 with 176k context, going from 70 down to 50 tok/s with MTP, which is pretty good for coding.

Gemma, on the other hand, uses both GPUs, running q8 with 64k context, doing document sentiment analysis, summarization, proofreading and translation at a consistent 25 tok/s. Somewhat slow but usable for batched workflows. I might get some more once llama.cpp starts supporting MTP with tensor split mode.

I'm still using frontier LLMs at my day job since I'm not paying for them and those are obviously better. Hopefully we'll have a Sonnet 4.6/Opus 4.5 level 30B model in a year or so.

EDIT: Prompt processing starts at 800 t/s and drops to 400 t/s. In most cases my starting prompts are around 16k-24k tokens and take 60 to 90 seconds to process. Not great but acceptable.

cuttysnark 2 hours ago
I've had some success with local models by chaining "agents" together in a workflow. Each agent has a different prompt and uses a different ollama model based on its role. The project manager, schema agent (qwen3:14b), etc. don't use the same model as the coding agent (qwen2.5-coder:7b). Between each step is an orchestrator with a Playwright task, which attempts to surface errors to the agent that introduced the previous code block. Only error-free blocks are forwarded to the next workflow step.

Probably the biggest improvement was including a backend-for-agents service definition, which instructed the schema agent to produce only a manifest based on the task and to pass that off to the next agent.

In short, I split tasks up into many pieces by defining a workflow where agents are only allowed to do very specific things before their work is passed along. This keeps them grounded and capable while also creating places for me to intervene if a workflow was, say, 25% or 90% successful.
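
The gated workflow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the commenter's actual orchestrator: `Step`, `run_workflow`, and `max_retries` are hypothetical names, `run` stands in for a call to a local model, and `check` stands in for whatever validation (such as a Playwright task) surfaces errors.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str                               # role, e.g. "schema" or "coder"
    run: Callable[[str], str]               # stand-in for a local model call
    check: Callable[[str], Optional[str]]   # error message, or None if clean


def run_workflow(task: str, steps: List[Step], max_retries: int = 2) -> str:
    """Pass work through each step, forwarding only error-free output."""
    payload = task
    for step in steps:
        prompt = payload
        for _attempt in range(max_retries + 1):
            output = step.run(prompt)
            error = step.check(output)
            if error is None:
                break
            # Surface the error to the same agent that produced the output.
            prompt = f"{payload}\n\nYour previous output failed: {error}"
        else:
            # Retries exhausted: stop here so a human can intervene.
            raise RuntimeError(f"step {step.name!r} still failing: {error}")
        payload = output
    return payload
```

Raising on exhausted retries is one way to create the intervention points mentioned above; a real setup might instead pause and hand the partial result to a person.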

sowbug 12 minutes ago
Have you (or anyone else) tried letting agents compete? For example, give the same coding task to two models, or to the same model with a different seed, and have the reviewer choose the better result.

Some think the human brain works similarly: thousands of mini-brain cortical columns, each with a slightly different take on the situation, voting in a majority-rules system.
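
A minimal sketch of the competition idea above, assuming hypothetical `generate(task, seed)` and `review(task, candidate)` callables that wrap whatever local models are in use. Both names are made up for illustration; the reviewer is assumed to return a numeric score where higher is better.

```python
from collections import Counter
from typing import Callable, Iterable, List


def best_of_n(
    task: str,
    generate: Callable[[str, int], str],
    review: Callable[[str, str], float],
    seeds: Iterable[int],
) -> str:
    """Generate one candidate per seed and keep the reviewer's favorite."""
    candidates: List[str] = [generate(task, seed) for seed in seeds]
    return max(candidates, key=lambda c: review(task, c))


def majority_vote(answers: Iterable[str]) -> str:
    """Majority-rules alternative: return the most common answer."""
    return Counter(answers).most_common(1)[0][0]
```

`best_of_n` fits open-ended output such as code, where a reviewer has to judge quality; `majority_vote` only makes sense when answers can be compared for equality.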

pianopatrick 1 hour ago
I wish someone would do a benchmark and competition for this kind of workflow so we could figure out what works well.

Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."

Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.

Like "The Local AI challenge"
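
One way the proposed scoring could be combined into a single number is sketched below. The weights, the `Run` and `score` names, and the choice to measure "% correct" against all questions rather than only answered ones are assumptions made for illustration; the comment does not specify any of them.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Run:
    total_questions: int
    answered: int
    correct: int
    seconds: float


def score(
    run: Run,
    time_limit: float = 3600.0,                         # the 1 hour max
    weights: Tuple[float, float, float] = (0.2, 0.6, 0.2),  # hypothetical
) -> float:
    """Weighted mix of answered fraction, correct fraction, and time left."""
    if run.seconds > time_limit:
        return 0.0  # over the time limit: treated as disqualified
    answered_frac = run.answered / run.total_questions
    correct_frac = run.correct / run.total_questions
    time_frac = 1.0 - run.seconds / time_limit
    w_answered, w_correct, w_time = weights
    return w_answered * answered_frac + w_correct * correct_frac + w_time * time_frac
```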

xhinker2 1 hour ago
Yes, I have. 1. Two RTX 3090s on Linux 22.04. 2. Running Qwen3.6-27B Q6_K_XL GGUF. 3. Using my own harness, AZPal, which I built myself; I also wired it up with Hermes Agent, and it works fine. 4. Many times it solves problems that Codex can't solve.

https://medium.com/p/f237d575e861

arjie 3 hours ago
Not “local” and not interactive coding but sharing since it might be helpful. I have 2x RTX Pro 6000 Blackwell running DeepSeek V4 Flash. I get 160 tok/s raw but it’s a reasoning model. For my use case, I have it auto-write code and another system auto-review the code.

I occasionally use it with pi to write some code and it’s blazing fast but it’s mostly habit that keeps me with CC and Codex.

akersten 1 hour ago
> I have 2x RTX Pro 6000 Blackwell

Where did you find/order these? All the sites I can find are either out of stock, only sell to businesses, or are otherwise sketchy...

leptons 2 hours ago
Have you measured your electricity consumption for this rig? I have to wonder how much it would cost you per month.
ux266478 1 hour ago
Not nearly as much as you might think. 1.2 kW where I live translates to about $0.12/hr, and that's when running at full clip. If you have a decent solar hookup, it's a small fraction of that on a sunny day.

The expensive part is the upfront hardware cost and the electrical system upgrade you'll need to give your house.
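
For the monthly figure asked about, the numbers above can be extended with simple arithmetic. The 30-day month and the duty cycles are assumptions; the power draw and hourly cost come from the comment, and together they imply a rate of about $0.10 per kWh.

```python
def hourly_cost(power_kw: float, price_per_kwh: float) -> float:
    """Cost in dollars of running a load of power_kw for one hour."""
    return power_kw * price_per_kwh


def monthly_cost(
    power_kw: float,
    price_per_kwh: float,
    hours_per_day: float,
    days: int = 30,  # assumed month length
) -> float:
    """Cost in dollars of running the load hours_per_day for days days."""
    return hourly_cost(power_kw, price_per_kwh) * hours_per_day * days


# Rate implied by the comment: $0.12 per hour at 1.2 kW.
implied_price_per_kwh = 0.12 / 1.2

full_time = monthly_cost(1.2, implied_price_per_kwh, hours_per_day=24)
work_day = monthly_cost(1.2, implied_price_per_kwh, hours_per_day=8)
```

At that rate, running flat out around the clock comes to about $86 a month, and eight hours a day at full load to about $29.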

stymaar 2 hours ago
Yes, Qwen3.6-35B-A3B on a Strix Halo 128GB (Bosgame M5).

I have way too much VRAM for such a model, but Qwen never released a 122B version of Qwen3.6, which would be the best class of model for my hardware. At the same time, my electricity bill is negligible: this is originally a laptop chip and it shows. It consumes almost nothing while idle and a little above 120W during prompt processing.

And Qwen3.6 has been surprisingly effective for me. I still use Claude occasionally, but only for about 10% of my needs, which lets me stay well under the quota even on the cheapest plan.

Speed: ~800tps prompt processing and 50tps for token generation (with no speculative decoding).

manmal 1 hour ago
Have you tried the 27B dense version? It’s way better for coding.
anana_ 1 hour ago
Unfortunately, on Strix Halo or any similar unified-memory setup, dense models are gonna be dirt slow due to the limited memory bandwidth... But I agree, 27B is superior.
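
A rough way to see why: token generation is usually limited by how fast the active weights can be read from memory, so bandwidth divided by bytes read per token gives an upper bound. The sketch below is a back-of-the-envelope estimate only. The 256 GB/s bandwidth figure and the 0.5 bytes per parameter (roughly a 4-bit quant) are assumptions, and it ignores KV cache reads and other overhead.

```python
def max_decode_tok_per_s(
    bandwidth_gb_s: float,
    active_params_billion: float,
    bytes_per_param: float,
) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


ASSUMED_BANDWIDTH_GB_S = 256.0  # assumption, not a measured value
ASSUMED_BYTES_PER_PARAM = 0.5   # assumption: roughly a 4-bit quantization

# Dense 27B: all 27B parameters are read for every token.
dense_27b = max_decode_tok_per_s(ASSUMED_BANDWIDTH_GB_S, 27, ASSUMED_BYTES_PER_PARAM)

# MoE with about 3B active parameters per token (the "A3B" in the name).
moe_a3b = max_decode_tok_per_s(ASSUMED_BANDWIDTH_GB_S, 3, ASSUMED_BYTES_PER_PARAM)
```

Under these assumptions the dense model tops out near 19 tok/s while the MoE bound is roughly nine times higher, which matches the direction of the numbers reported in this thread.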
stymaar 50 minutes ago
Exactly. That's why I'm disappointed there wasn't a 122B version; it would be the 27B equivalent for Strix Halo users.
jmward01 1 hour ago
Has anyone been storing their cc sessions for future training data on their own models? I'd love to build a system that fine-tunes on cc sessions and a good first step is capturing my own sessions well.
abidlabs 1 hour ago
Yes! https://huggingface.co/changelog/agent-trace-viewer
jmward01 1 hour ago
Didn't realize they did this. I have avoided pushing data to huggingface. This is all -deeply- private info and I haven't really reviewed their privacy policies and the like. I'll give them a look.
HappySweeney 3 hours ago
I have an Optane and lots of RAM, so I tried full-fat models for writing a function overnight, as I get about 0.7 t/s. My current go-to test is to update a scalar bit-matrix transpose function to one using AVX-512. The cloud models all play with that like it's nothing. Kimi 2.6 and GLM 5.1 both failed miserably.
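
For readers unfamiliar with the test, a scalar bit-matrix transpose looks something like the reference below. This is not the commenter's function: the 8x8 size, the packing (row r stored in byte r, column c in bit c of that byte), and the name `transpose8x8` are assumptions chosen for illustration, and the AVX-512 version the models are asked to write is not shown.

```python
def transpose8x8(m: int) -> int:
    """Transpose an 8x8 bit matrix packed row-major into a 64-bit integer."""
    out = 0
    for row in range(8):
        for col in range(8):
            # Read bit (row, col) and write it to position (col, row).
            bit = (m >> (row * 8 + col)) & 1
            out |= bit << (col * 8 + row)
    return out
```

A slow reference like this is also useful as an oracle: a vectorized rewrite can be checked against it on random inputs.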
zaptheimpaler 1 hour ago
I tried gemma-4-26B-A4B just to see if it could help me read/sort my emails on a relatively underpowered setup (16GB VRAM + 32GB RAM), and it's not going well. The model burns 24K tokens just searching for the right tool and then dumps the email contents into context. I tried to get it to use code-mode to save context, but the code-mode implementation can't save files, so it was useless; I'm going to try switching to "ssh-mode" into my devbox container. I'm still relatively new to this, so I'm probably doing something wrong.
anana_ 1 hour ago
Perhaps try a different model? Just from anecdotal experience, I find that the Gemma models smaller than 31B do not tool call as often as they should.

Some of the benchmarks appear to back this up [0]

Of course, a lot depends on how you are using it (inference parameters, harness, prompting, etc.), but the model is quite important too.

[0]: https://artificialanalysis.ai/models/open-source/small?model...

acc_297 3 hours ago
I've been wondering lately whether it would help to take a medium-sized model and, either in the cloud or on some local setup, actually do Reinforcement Learning from Human Feedback (RLHF) on every prompt as a chore. I don't know if manually fine-tuning a model to your own usage habits would ruin it or help. Ideally, if you were diligent, you could get rid of some of the tics that make general-public models difficult to work with, e.g. being overly sycophantic, overly verbose, or prone to explaining via analogies.

But perhaps one individual's prompt feedback is never going to be enough; I'm not sure how much you need. (I know people at big companies that have purchased in-house agents fine-tuned on internal documents etc., and apparently these end up with bizarre behaviours that are not necessarily more helpful than the standard models.)

I'd like to be able to essentially edit every response given by an agent and then fine-tune on the difference between what it produced and how I edited the text. Personally, I would just remove a lot of the adjectives and try to distill the responses down to their core, but I worry, based on some of the work done by Owain Evans and other alignment researchers, that this can sometimes push agents into tricky-to-predict tendencies.
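
Capturing those edits as training data could start with something like the sketch below, which turns each (model output, edited output) pair into a preference record. The `prompt`/`chosen`/`rejected` field names follow a layout commonly used for preference tuning, but the function names and the `min_change` threshold are hypothetical, and nothing here performs the fine-tuning itself.

```python
import difflib
import json
from typing import Dict, Iterable, Optional


def edit_ratio(original: str, edited: str) -> float:
    """Fraction of the text that changed: 0.0 is identical, 1.0 is unrelated."""
    return 1.0 - difflib.SequenceMatcher(None, original, edited).ratio()


def make_preference_record(
    prompt: str,
    model_output: str,
    edited_output: str,
    min_change: float = 0.05,  # hypothetical threshold for a meaningful edit
) -> Optional[Dict[str, str]]:
    """Return a preference pair, or None if the edit was too small to matter."""
    if edit_ratio(model_output, edited_output) < min_change:
        return None
    return {"prompt": prompt, "chosen": edited_output, "rejected": model_output}


def to_jsonl(records: Iterable[Optional[Dict[str, str]]]) -> str:
    """Serialize the kept records as one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records if r is not None)
```

Skipping near-identical pairs keeps the dataset focused on responses that actually needed changes, such as the adjective-heavy ones mentioned above.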

htrp 54 minutes ago
Cursor is doing that (I think with Fireworks as their provider).

https://cursor.com/blog/real-time-rl-for-composer

rolisz 3 hours ago
I'm interested in trying something similar. I was thinking of doing this for my OpenClaw agent.

About Owain Evans's work: I think he did SFT. On Twitter someone was saying that RL is not as susceptible to what he showed. I'd like to try that.
