Top
Best
New

Posted by cloudking 13 hours ago

Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g tok/s)
735 points | 351 commentspage 7
ryandrake 11 hours ago|
Always a bit disappointed in the details in these kinds of threads. When you do get answers, they're never specific enough to try out on your own. It'll be something like "I use Qwen 3.5 and get great results!" OK but what quantization are you using? What llama parameters? What context size? What GPU are you running it on, and how much VRAM does it have? Are you hosting it on a separate box, or running it locally on your dev machine? What coding agent tool are you using, and how is it configured / hooked up to the model?
riazrizvi 10 hours ago||
All you get here is some market signal from 1 or 2 posts if you already know how to do it. Most of these responses are garbage.
porkloin 10 hours ago||
I have good results with this setup:

Hardware:

- GPU: AMD 7900xtx, 24gb vram

- CPU: AMD 5950x, AM4

- RAM: 64gb DDR4 3600

Software:

- OS: Bazzite (atomic fedora - this machine is running Steam "big picture" mode on my TV when not in use for LLM tasks)

- Virtualization: Podman Quadlets, which allows me to run container images as managed systemd units

- Network: tailscale

- Inference: llama.cpp vulkan (better performance than ROCM, though I'm keeping an eye on it in the future)

- LLM API surface: llama-swap (running as a podman quadlet exposed via tailscale svc) allows running multiple models on a single endpoint.

- Web/Chat Access: open-webui (running as podman quadlet exposed via tailscale svc) allows me to access any of the models I'm using for coding harness access for chat/general purpose queries via web browser. I also have the "conduit" app for my iPhone that allows me to hit the same models from my phone.

Models:

- Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf - Unsloth Q4 quant of the qwen 3.6 27B model weights, with MTP enabled. MTP is important as it improves the speed the model can run at.

- Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf - Unsloth Q4 quant of 35B-A3B. Not MTP right now because I was having some issues with it?

- gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf - Gemma 4, which I use sometimes via open-webui instead of Qwen, but I generally think Qwen does a better job

Flags (specific for Qwen 27b, since that's primary model):

- `-ngl 99` offload all layers to GPU

- `-c 80000` 80K context window. I'd like this to be higher, but since my GPU also has to run the desktop session for the machine, I need to leave some VRAM overhead to keep the desktop from OOM-ing

- `-np 1` single slot (no parallel request handling)

- `--no-context-shift` error instead of silently sliding the context window when full

- `--cache-reuse 256` reuse cached prefix in chunks of 256 tokens (prompt cache)

- `-b 2048` logical batch size (tokens per submission)

- `-ub 1024` physical micro-batch (per GPU pass)

- `--cache-type-k q8_0 --cache-type-v q8_0` symmetric 8-bit K/V cache. Q8 is as low as I've been able to go without getting some issues with tool calling

- `-fa on` flash attention

- `--spec-type draft-mtp` use the model's built-in MTP as the draft model

- `--spec-draft-n-max 3` propose up to 3 draft tokens per step

- `--spec-draft-n-min 0` allow zero drafts if confidence is low

- `--spec-draft-type-k q8_0 --spec-draft-type-v q8_0` KV quant for the draft path

- `--reasoning-format deepseek` parse <think> blocks in proper format

- `--chat-template-kwargs '{"enable_thinking": true}'` turns on Qwen's thinking mode on by default (clients can override)

- `--jinja` use the GGUF's Jinja chat template

- `--temp 0.6` moderate randomness (Qwen recommended value for coding)

- `--top-p 0.95` nucleus sampling (Qwen recommended value for coding)

- `--top-k 20` top-20 candidates (Qwen recommended value for coding)

- `--min-p 0.0 disabled (Qwen recommended value for coding)

Performance (27b, primary model):

- ~65t/s for token generation

- ~600 t/s for prompt processing.

- If these numbers don't mean much to you, perceptually this feels about on-par with cloud model speed, maybe slightly faster.

- ~30s cold start when swapping from a different model or starting up session from idle via llama-swap.

I have llama-swap set up to unload the model after 10 min of idle, because I sometimes use this machine for gaming as well. A little annoying, but a small price to pay to be able to use the machine for other stuff (gaming) when I'm not using it with coding tasks.

CLI/Harness:

- Crush harness (https://github.com/charmbracelet/crush) less feature rich than Claude Code, but with a smaller system prompt and better built-in LSP support. I point it at the tailnet DNS (https://llama.<tailnet>:<port>)

- Headroom (https://github.com/chopratejas/headroom) to maximize the 80k context window

- Exa MCP for web search (https://exa.ai/) this alone makes the model far more useable. It's shocking how often the official claude code or codex harness get botblocked on web fetches, and the results of a good web fetch can be the difference between a good turn and a bad turn.

A lot of people get hung up on whether Qwen 3.x models are "as smart as" some parallel Anthropic model. Most people seem to agree it's somewhere between Haiku 4.5 and Sonnet 4.5. Personally, I think the biggest thing that makes the Qwen 3.x series of models _feel_ good to use for coding workflows is that its the first time that tool calling actually works consistently on local models. If tool calling is busted even 5% of the time, it can totally ruin the flow. I think that's also why people tend to say the "harness is more important than the model" or whatever. I have a few other models set up but 27B with MTP is the best compromise of speed and quality that I've found.

This setup works well enough for me that I dropped my personal Claude Code subscription. At work I'm still using frontier models, but personally I don't feel like I need that much power for anything I work on in my personal life. I'm "lucky" that I made the random financially unwise choice to buy a 7900XTX in late 2022 for $1k as a gaming card. I had no clue it would actually be a pretty decent LLM card 3-4 years later.

Edit: sorry for the horrible formatting, I always forget that HN doesn't actually do markdown :(

ryandrake 9 hours ago||
Now that's what I'm talking about! Very cool, thank you for the detailed response.
627467 8 hours ago||
So, everyone has different context, but how free is free running these local models? Like having a power hungry machine always on in the cupboard?

How much does this ware out the hardware?

Also, if privacy is the main reason for running local models, why not use venice.ai and equivalent?

qu0b 7 hours ago||
I'm using deepseek V4 on two rtx 6000 pros and its working great. Opus is so slow that I get deepseek to do most of the work and Opus is only used to validate and help plan.
Lwerewolf 11 hours ago||
mbp16 m5 max 128gb, antirez/ds4, deepseekv4-flash. Works well for relatively dense (let's say <20k LoC per project) C codebases that are essentially a bunch of custom specialized stores, http servers, network infra, media transformers, etc.

Runs through Pi with a custom prompt (basically "don't speculate blindly, isolate things, make them traceable and measurable, then verify") and behind a pretty restrictive bwrap setup - RO bind everything other than ~/.pi, cdw and a separate tmpfs, unshare almost everything other than the network - for which I use a network namespace that only allows tcp connections to a specific ip and port (i.e the inference mac) - i.e. netns exec into bwrap.

Can't compare it to SOTA or higher-requirements models on what I work on - policy. That said, on a bunch of test pieces - it obviously isn't gpt-5.5, it definitely lags behind k2.6/glm/ds4-pro, but it absolutely is usable. Of course, on such codebases, forget about one-shotting or trusting it blindly or anything of the sort - you ask it, guide it, restart the context from time to time to have a "fresh dice roll" and to keep the context small and clean, etc. Compared to anything smaller (incl. all the usual local qwen models) - on a test piece, it figured out that memfd and mmap were used for setting up a ring buffer with natural wraparound handling (double mapping the first page at the end) and didn't tell me "this is for sharing memory between processes" or some other BS.

Performance as described in the tables in the readme here: https://github.com/antirez/ds4 ...with a bit less than half that at "low power" (30w). Both are usable.

overgard 8 hours ago||
I haven't yet, but I just bought a 128GB M5 Max 40 core which I'm hoping can do it (if not, it's a good laptop regardless, I actually need that amount of RAM for non-LLM stuff)
kristianpaul 8 hours ago||
Qwen3.6 35B on gigabyte aitop (spark clone) but be very specif what you ask and how should be solved

Nemotron super 3 110B works well for 1M context long vibecoding sessions

I also use Pi harness with no extension

jmward01 9 hours ago||
Has anyone been storing their cc sessions for future training data on their own models? I'd love to build a system that fine-tunes on cc sessions and a good first step is capturing my own sessions well.
abidlabs 9 hours ago|
Yes! https://huggingface.co/changelog/agent-trace-viewer
jmward01 9 hours ago||
Didn't realize they did this. I have avoided pushing data to huggingface. This is all -deeply- private info and I haven't really reviewed their privacy policies and the like. I'll give them a look.
shironnnn_ 9 hours ago||
I use SpecKit to create a very detailed plan with a high amount of specificity using paid Claude plan.

Then I give it to local LLM (eg: Qwen / Gemma 4) via CLI. This is possible through usage of llm-mlx on Mac (or ollama on any machine given sufficient on hardware) which serve OpenAPI endpoints compatible for Aider (CLI) or Visual Studio Code to vibe along with the agentic coding assistant.

The paid products have an advantage but are not necessary if you don't mind to be more-involved with the process and have low expectations.

mark_l_watson 9 hours ago||
I would like to say I run 100% local, but I use Opus + Gemini Pro cumulatively for 3 or 4 hours a week. I also like to use DeepSeek v4 flash with OpenCode for small quick tasks.

I did just publish a free to read online book "The Rise of Local Coding Agents" [1] where I document my setup that I enjoy using. I use little-coder (built on pi) and have good results for small Python and TypeScript applications. I struggle getting good results with Common Lisp and Clojure.

For me, the problem with all local LLM-basic coding agents is slow runtime.

[1] https://leanpub.com/read/local-coding-agents

SugarReflex 5 hours ago|
Is anyone using Aider? Is there any decent CLI alternatives to it?
More comments...