
Posted by cloudking 8 hours ago

Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g. tok/s).
480 points | 245 comments | page 4
627467 2 hours ago|
Everyone's context is different, but how free is "free" when running these local models? Doesn't it mean having a power-hungry machine always on in the cupboard?

How much does this wear out the hardware?

Also, if privacy is the main reason for running local models, why not use venice.ai or an equivalent?

ndom91 3 hours ago||
Not 100%; I still fall back to Claude for most day-job stuff. But I've been trying to use Qwen 3.6 and Gemma 4 on my Framework Desktop mainboard (Strix Halo) as much as possible.

I've been working on an ops-style tool for local LLM inference. It handles proxying, API keys, request logging, model rewriting, and much more.

https://github.com/ndom91/llama-dash

mitchell_h 6 hours ago||
Tried. The context windows just weren't big enough.
coder543 4 hours ago||
Qwen3.6-27B supports a 1 million token context window.

Of course, you need the right hardware to run a context window like that: it takes about 100GB of memory on my DGX Spark with a full f16 KV cache on the q4_k_xl model.
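
For anyone who wants to estimate this for their own hardware, here is a rough sketch of the usual KV cache arithmetic. It assumes a plain transformer where every layer keeps full keys and values; models with hybrid or sliding-window attention will need less. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real Qwen3.6-27B configuration, so read the actual values from the model's config before trusting the result.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_element=2):
    """Estimate KV cache size for a standard attention stack.

    bytes_per_element=2 corresponds to an f16 cache.
    """
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=131072)
    print(f"{size / 2**30:.1f} GiB")
```

The cache grows linearly with context length, so going from 128K to 1M tokens multiplies this figure by about eight, on top of the memory for the weights themselves.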

lysace 5 hours ago|||
Got a similar result (my RTX 4070 only has 12 GB). I'm curious whether 24/32 GB improves this enough to make it useful.
tobyhinloopen 5 hours ago||
Try running it from RAM on the CPU.

It's slower, but you can run them.

lysace 4 hours ago||
Good idea for evaluating the models, thanks.
deadbabe 6 hours ago||
Use more direct prompts instead of open-ended ones.
bArray 4 hours ago||
I'm in the middle of building my own based on LiquidAI/LFM2.5-1.2B-Instruct [1]. I run it locally on the CPU and get reasonable performance. I'm currently using it to solve small problems, but I'm expanding it daily.

[1] https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct

anubhav200 5 hours ago||
Yes: llama.cpp, Qwen 27B and 35B, and Claude Code. I use llama-cpp-manager to manage llama.cpp configs (https://github.com/anubhavgupta/llama-cpp-manager).
zaptheimpaler 4 hours ago||
I tried gemma-4-26B-A4B just to see if it could help me read/sort my emails on a relatively underpowered setup (16GB VRAM + 32GB RAM), and it's not going well. The model burns 24K tokens just searching for the right tool and then dumps the email contents into context. I tried to get it to use code-mode to save context, but the code-mode implementation can't save files, so it was useless; I'm going to try switching to "ssh-mode" into my devbox container. I'm still relatively new to this, so I'm probably doing something wrong.
anana_ 4 hours ago|
Perhaps try a different model? Just from anecdotal experience, I find that the Gemma models smaller than 31B do not tool call as often as they should.

Some of the benchmarks appear to back this up [0].

Of course, a lot depends on how you are using it (inference parameters, harness, prompting, etc.), but the model is quite important too.

[0]: https://artificialanalysis.ai/models/open-source/small?model...

kristianpaul 3 hours ago||
Qwen3.6 35B on a Gigabyte AI TOP (Spark clone), but be very specific about what you ask and how it should be solved.

Nemotron super 3 110B works well for long, 1M-context vibecoding sessions.

I also use the Pi harness with no extensions.

pianopatrick 4 hours ago||
I wish someone would do a benchmark and competition for this kind of workflow so we could figure out what works well.

Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."

Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.

Like "The Local AI challenge"
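
A minimal sketch of how a run in such a challenge might be summarized, reporting the three proposed metrics side by side. The `Run` fields, the 1-hour default, and the choice to compute "% correct" over all questions rather than only the answered ones are assumptions for illustration; no combined score is defined, because the proposal does not specify one.

```python
from dataclasses import dataclass


@dataclass
class Run:
    total_questions: int
    answered: int
    correct: int
    seconds: float  # wall-clock time the contestant took


def summarize(run, time_limit_s=3600.0):
    """Report the three proposed metrics for one contestant's run."""
    if run.total_questions <= 0:
        raise ValueError("total_questions must be positive")
    return {
        "pct_answered": 100.0 * run.answered / run.total_questions,
        # Assumption: correctness is measured against all questions.
        "pct_correct": 100.0 * run.correct / run.total_questions,
        "seconds": run.seconds,
        "within_limit": run.seconds <= time_limit_s,
    }


if __name__ == "__main__":
    print(summarize(Run(total_questions=200, answered=150, correct=120, seconds=1800.0)))
```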

overgard 3 hours ago||
I haven't yet, but I just bought a 128GB M5 Max 40 core, which I'm hoping can do it (if not, it's a good laptop regardless; I actually need that amount of RAM for non-LLM stuff).
BiraIgnacio 5 hours ago|
I tried for a bit with llama.cpp + Qwen + Mac Pro, but the results were very poor (both quality and speed).

I considered investing in better hardware, but after doing the math, it is cheaper for me to pay for DeepSeek (yeah, I know not everyone can do that).
