Posted by cloudking 13 hours ago
Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?
Hardware:
- GPU: AMD 7900xtx, 24gb vram
- CPU: AMD 5950x, AM4
- RAM: 64gb DDR4 3600
Software:
- OS: Bazzite (atomic fedora - this machine is running Steam "big picture" mode on my TV when not in use for LLM tasks)
- Virtualization: Podman Quadlets, which allows me to run container images as managed systemd units
- Network: tailscale
- Inference: llama.cpp vulkan (better performance than ROCM, though I'm keeping an eye on it in the future)
- LLM API surface: llama-swap (running as a podman quadlet exposed via tailscale svc) allows running multiple models on a single endpoint.
- Web/Chat Access: open-webui (running as podman quadlet exposed via tailscale svc) allows me to access any of the models I'm using for coding harness access for chat/general purpose queries via web browser. I also have the "conduit" app for my iPhone that allows me to hit the same models from my phone.
Models:
- Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf - Unsloth Q4 quant of the qwen 3.6 27B model weights, with MTP enabled. MTP is important as it improves the speed the model can run at.
- Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf - Unsloth Q4 quant of 35B-A3B. Not MTP right now because I was having some issues with it?
- gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf - Gemma 4, which I use sometimes via open-webui instead of Qwen, but I generally think Qwen does a better job
Flags (specific for Qwen 27b, since that's primary model):
- `-ngl 99` offload all layers to GPU
- `-c 80000` 80K context window. I'd like this to be higher, but since my GPU also has to run the desktop session for the machine, I need to leave some VRAM overhead to keep the desktop from OOM-ing
- `-np 1` single slot (no parallel request handling)
- `--no-context-shift` error instead of silently sliding the context window when full
- `--cache-reuse 256` reuse cached prefix in chunks of 256 tokens (prompt cache)
- `-b 2048` logical batch size (tokens per submission)
- `-ub 1024` physical micro-batch (per GPU pass)
- `--cache-type-k q8_0 --cache-type-v q8_0` symmetric 8-bit K/V cache. Q8 is as low as I've been able to go without getting some issues with tool calling
- `-fa on` flash attention
- `--spec-type draft-mtp` use the model's built-in MTP as the draft model
- `--spec-draft-n-max 3` propose up to 3 draft tokens per step
- `--spec-draft-n-min 0` allow zero drafts if confidence is low
- `--spec-draft-type-k q8_0 --spec-draft-type-v q8_0` KV quant for the draft path
- `--reasoning-format deepseek` parse <think> blocks in proper format
- `--chat-template-kwargs '{"enable_thinking": true}'` turns on Qwen's thinking mode on by default (clients can override)
- `--jinja` use the GGUF's Jinja chat template
- `--temp 0.6` moderate randomness (Qwen recommended value for coding)
- `--top-p 0.95` nucleus sampling (Qwen recommended value for coding)
- `--top-k 20` top-20 candidates (Qwen recommended value for coding)
- `--min-p 0.0 disabled (Qwen recommended value for coding)
Performance (27b, primary model):
- ~65t/s for token generation
- ~600 t/s for prompt processing.
- If these numbers don't mean much to you, perceptually this feels about on-par with cloud model speed, maybe slightly faster.
- ~30s cold start when swapping from a different model or starting up session from idle via llama-swap.
I have llama-swap set up to unload the model after 10 min of idle, because I sometimes use this machine for gaming as well. A little annoying, but a small price to pay to be able to use the machine for other stuff (gaming) when I'm not using it with coding tasks.
CLI/Harness:
- Crush harness (https://github.com/charmbracelet/crush) less feature rich than Claude Code, but with a smaller system prompt and better built-in LSP support. I point it at the tailnet DNS (https://llama.<tailnet>:<port>)
- Headroom (https://github.com/chopratejas/headroom) to maximize the 80k context window
- Exa MCP for web search (https://exa.ai/) this alone makes the model far more useable. It's shocking how often the official claude code or codex harness get botblocked on web fetches, and the results of a good web fetch can be the difference between a good turn and a bad turn.
A lot of people get hung up on whether Qwen 3.x models are "as smart as" some parallel Anthropic model. Most people seem to agree it's somewhere between Haiku 4.5 and Sonnet 4.5. Personally, I think the biggest thing that makes the Qwen 3.x series of models _feel_ good to use for coding workflows is that its the first time that tool calling actually works consistently on local models. If tool calling is busted even 5% of the time, it can totally ruin the flow. I think that's also why people tend to say the "harness is more important than the model" or whatever. I have a few other models set up but 27B with MTP is the best compromise of speed and quality that I've found.
This setup works well enough for me that I dropped my personal Claude Code subscription. At work I'm still using frontier models, but personally I don't feel like I need that much power for anything I work on in my personal life. I'm "lucky" that I made the random financially unwise choice to buy a 7900XTX in late 2022 for $1k as a gaming card. I had no clue it would actually be a pretty decent LLM card 3-4 years later.
Edit: sorry for the horrible formatting, I always forget that HN doesn't actually do markdown :(
How much does this ware out the hardware?
Also, if privacy is the main reason for running local models, why not use venice.ai and equivalent?
Runs through Pi with a custom prompt (basically "don't speculate blindly, isolate things, make them traceable and measurable, then verify") and behind a pretty restrictive bwrap setup - RO bind everything other than ~/.pi, cdw and a separate tmpfs, unshare almost everything other than the network - for which I use a network namespace that only allows tcp connections to a specific ip and port (i.e the inference mac) - i.e. netns exec into bwrap.
Can't compare it to SOTA or higher-requirements models on what I work on - policy. That said, on a bunch of test pieces - it obviously isn't gpt-5.5, it definitely lags behind k2.6/glm/ds4-pro, but it absolutely is usable. Of course, on such codebases, forget about one-shotting or trusting it blindly or anything of the sort - you ask it, guide it, restart the context from time to time to have a "fresh dice roll" and to keep the context small and clean, etc. Compared to anything smaller (incl. all the usual local qwen models) - on a test piece, it figured out that memfd and mmap were used for setting up a ring buffer with natural wraparound handling (double mapping the first page at the end) and didn't tell me "this is for sharing memory between processes" or some other BS.
Performance as described in the tables in the readme here: https://github.com/antirez/ds4 ...with a bit less than half that at "low power" (30w). Both are usable.
Nemotron super 3 110B works well for 1M context long vibecoding sessions
I also use Pi harness with no extension
Then I give it to local LLM (eg: Qwen / Gemma 4) via CLI. This is possible through usage of llm-mlx on Mac (or ollama on any machine given sufficient on hardware) which serve OpenAPI endpoints compatible for Aider (CLI) or Visual Studio Code to vibe along with the agentic coding assistant.
The paid products have an advantage but are not necessary if you don't mind to be more-involved with the process and have low expectations.
I did just publish a free to read online book "The Rise of Local Coding Agents" [1] where I document my setup that I enjoy using. I use little-coder (built on pi) and have good results for small Python and TypeScript applications. I struggle getting good results with Common Lisp and Clojure.
For me, the problem with all local LLM-basic coding agents is slow runtime.