Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

Posted by cloudking 12 hours ago

Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g., tok/s).

735 points | 351 comments | page 6

ndom91 7 hours ago

Not 100%; I still fall back to Claude for most day-job stuff. But I've been trying to use Qwen 3.6 and Gemma 4 on my Framework Desktop mainboard (Strix Halo) as much as possible.

I've been working on an ops-style tool for local LLM inference: proxying, API keys, request logging, model rewriting, and much, much more.

https://github.com/ndom91/llama-dash

derekered 5 hours ago

I'm using Qwen 3.6 on my MacBook Pro M5 Pro with 48GB RAM for any work that I'm particularly privacy-conscious about, such as my journaling. It's been working great! I don't have any direct comparisons, but I've been satisfied with the results.

russelg 2 hours ago

I've got the same spec. Are you running the 27B or the 35B-A3B? I found the 27B was unusably slow (like 10-15 t/s, not to mention the prefill times).
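To make the prefill-versus-decode point concrete, here is a rough back-of-the-envelope sketch. It assumes a simple two-phase model (the prompt is processed at one rate, the output is generated at another); the function names and the prefill rate of 200 t/s are hypothetical placeholders, not measurements from this thread. Only the 10-15 t/s decode range comes from the comment above.

```python
def response_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Estimate wall-clock time for one request: prefill phase plus decode phase."""
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("token rates must be positive")
    prefill_time = prompt_tokens / prefill_tps   # time spent reading the prompt
    decode_time = output_tokens / decode_tps     # time spent generating the reply
    return prefill_time + decode_time


def effective_tps(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Output tokens per second once prefill time is counted against the request."""
    total = response_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps)
    return output_tokens / total


if __name__ == "__main__":
    # Hypothetical agentic-coding request: a long prompt and a short reply.
    # 12.5 t/s is the midpoint of the 10-15 t/s quoted above; 200 t/s prefill is assumed.
    secs = response_seconds(8000, 500, prefill_tps=200, decode_tps=12.5)
    rate = effective_tps(8000, 500, prefill_tps=200, decode_tps=12.5)
    print(f"total: {secs:.1f} s, effective output rate: {rate:.2f} t/s")
```

With long prompts, the prefill term can easily match or exceed the decode term, which is why a headline t/s figure on its own tends to understate how slow a coding session feels.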

dabinat 9 hours ago

There’s evidence that combining models can achieve frontier-level performance (e.g. OpenRouter Fusion). I’m wondering if that’s the more realistic option: combine Opus with a local model to save on token costs.

rvnx 7 hours ago

I'm starting to believe that adding more and more and more thinking tokens is the hack that works (this is what gave birth to Fable).

bArray 8 hours ago

I'm in the middle of building my own based on LiquidAI/LFM2.5-1.2B-Instruct [1]. I run it on the CPU locally and get reasonable performance. I'm currently using it to solve small problems, but I'm expanding it daily.

[1] https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct

julianlam 3 hours ago

Of course.

Qwen 3.6 35B-A3B on a Framework 13 with 32GB of memory.

Running llama.cpp, 15 tokens per second. Outputs code and text faster than I can parse.

tumetab1 10 hours ago

Not yet. I tried Gemma 4 on an Apple M4, but the tok/s is significantly lower than the cloud offering.

Also, the lack of enterprise tooling to help select an appropriate model and run a local LLM does not help.

anonymousiam 10 hours ago

This was posted shortly after your Ask HN post:

My Homelab AI Dev Platform

https://news.ycombinator.com/item?id=48542433

xhinker2 8 hours ago

Yes, I have.

1. Two RTX 3090s on Linux 22.04.

2. Running Qwen3.6-27B Q6_K_XL GGUF.

3. Using my own harness, AZPal, which I built myself. I've also wired it up with Hermes Agent, and it works fine.

4. It often solves problems that Codex can't.

https://medium.com/p/f237d575e861

whartung 8 hours ago

Will the inevitable M5 releases from Apple change this equation in any meaningful way?

I'm waiting to swap out my last-gen Intel iMac for a new M5 mini of some kind, with an eye to running some models locally. I envision a mini (heh) arms race of simply swapping out an M(X-1) for an M(X) annually as this field shakes out.

mv4 8 hours ago

I've been using MiniMax M2.7 with vLLM on my dual Nvidia Spark cluster. It's slow (<20 tps) but functional for most of my use cases.

cmrdporcupine 1 hour ago

I was just looking, and it should be possible to run this one at a 3-bit quant on my single Spark? Maybe? Depending on context size? Assuming 3-bit doesn't totally lobotomize it.
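The "does it fit?" question above is mostly arithmetic, so here is a minimal sketch of it. Everything numeric in the example is an assumption: the parameter count is a hypothetical placeholder (the thread does not state the model's size), and the memory capacity and the reserve for KV cache and runtime overhead are parameters to fill in for your own machine and context length. Real 3-bit formats also tend to average somewhat more than 3 bits per weight, so treat the result as a lower bound.

```python
def weight_gib(total_params_billion, bits_per_weight):
    """Approximate size of the quantized weights alone, in GiB."""
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(total_params_billion, bits_per_weight, memory_gib, reserve_gib):
    """True if the weights plus a reserve (KV cache, runtime, OS) fit in memory.

    reserve_gib grows with context size, which is why the answer depends on it.
    """
    return weight_gib(total_params_billion, bits_per_weight) + reserve_gib <= memory_gib


if __name__ == "__main__":
    # Hypothetical numbers only: a 200B-parameter model on a 128 GiB machine,
    # keeping 20 GiB back for context and everything else.
    for bits in (3, 4, 8):
        size = weight_gib(200, bits)
        ok = fits(200, bits, memory_gib=128, reserve_gib=20)
        print(f"{bits}-bit: ~{size:.1f} GiB of weights, fits: {ok}")
```

For a mixture-of-experts model, the total parameter count (not the active count) is what has to fit, since all experts need to be resident.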
