Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

Posted by cloudking 18 hours ago

Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g., tok/s).

963 points | 431 comments | page 11

deployementeng 2 hours ago|

partially yes.

anubhav200 15 hours ago||

Yes: llama.cpp, Qwen 27B and 35B, and llama-cpp-manager for managing model configs (https://github.com/anubhavgupta/llama-cpp-manager).

Razengan 16 hours ago||

Related: Are there any viable distributed AI models?

Like how we've had SETI@home, Folding@home, BitTorrent, etc. People are clearly willing to donate their computer resources to distributed projects.

Maybe in a dAI network anyone could submit content for training on, and each user running a "node" could have their own custom private conditions on which type of content to accept for training or inference.

Like someone who dislikes anime could say "never accept anime related content or queries" so their node would basically opt-out from any data or questions about anime.

joshuamoyers 16 hours ago||

I think it'd be very hard to achieve viable tokens/s or get arithmetic intensity to be high enough in general, since many cases in existing training and inference are memory bandwidth limited. Definitely seems possible to conceptually have a slow pipeline that is distributed though.
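To put a rough number on the memory-bandwidth point, here is a back-of-the-envelope sketch. It assumes single-stream decoding has to read every active weight once per generated token, so bandwidth divided by bytes per token gives an upper bound on tok/s. The function name and the example figures (27B parameters, 4-bit weights, 1000 GB/s) are illustrative assumptions, not measurements, and the estimate ignores KV-cache reads, batching, and compute time.

```python
def max_decode_tokens_per_second(active_params_billion, bytes_per_param, mem_bandwidth_gb_s):
    """Upper bound on single-stream decode speed for a bandwidth-bound model.

    Assumes every active weight is read from memory once per generated token.
    """
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return (mem_bandwidth_gb_s * 1e9) / bytes_per_token


# Hypothetical example: 27B dense model, 4-bit weights (0.5 bytes/param),
# on a device with 1000 GB/s of memory bandwidth (assumed figure).
estimate = max_decode_tokens_per_second(27, 0.5, 1000)
print(f"upper bound: {estimate:.1f} tok/s")
```

Real throughput lands below this bound, but the shape of the formula shows why slower memory or a slower link anywhere on the critical path drags tok/s down so quickly.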

SimianSci 14 hours ago||

This is unlikely to happen in any meaningful fashion for quite some time.

(TL;DR: Distributed compute for models will require hardware at a level only really possible with data centers at the moment.)

Token generation demands so much from a single GPU that it will often saturate the bandwidth of consumer-grade interconnects like PCIe. That fundamentally implies that distributing a model's compute across vast distances is too much of a challenge without significant infrastructure.

To give an example: when we split a model's compute between two separate cards on a single workstation, we don't end up with 2x the compute bandwidth for the model. Instead, the increase is something small like 20%, depending on the model, because the interconnect (PCIe on consumer hardware) quickly becomes so saturated with data being copied between the two GPUs that it becomes a bottleneck. And remember that this happens locally over PCIe, which caps out at around 20-35 GB/s depending on the generation of the motherboard.

Model performance is very much tied to having the fastest, highest-bandwidth single card available, so as to keep data transfer operations to a minimum, because the sheer volume of data the model needs in order to run is immense. I simply can't imagine how slow and unusable a model would be if the copy operations its compute requires had to run over a network. Network speeds are not reliably distributed across the globe, so there would be significant performance loss, and the unreliability would demand extra overhead for data verification.

The dream of distributed AI is a ways off.
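As a rough illustration of the gap between a local interconnect and the open internet, the sketch below compares how long one per-token handoff between two pipeline stages might take. Everything here is an assumption for illustration: the 16 KiB payload (a hypothetical hidden size of 8192 in fp16), the 100 Mbit/s link, and the 50 ms latency are made-up figures, and only the 20 GB/s PCIe number comes from the low end of the range quoted above.

```python
def transfer_seconds(payload_bytes, bandwidth_bytes_per_s, latency_s=0.0):
    """Time to move one payload: fixed latency plus bytes divided by bandwidth."""
    return latency_s + payload_bytes / bandwidth_bytes_per_s


# Hypothetical per-token activation: hidden size 8192, 2 bytes per value (fp16).
payload = 8192 * 2

# Low end of the PCIe range quoted above; local latency ignored for simplicity.
pcie = transfer_seconds(payload, 20e9)

# Assumed home connection: 100 Mbit/s (12.5 MB/s) with 50 ms one-way latency.
wan = transfer_seconds(payload, 12.5e6, latency_s=0.05)

print(f"PCIe handoff: {pcie * 1e6:.2f} microseconds")
print(f"WAN handoff:  {wan * 1e3:.2f} milliseconds")
print(f"WAN is roughly {wan / pcie:,.0f}x slower per handoff")
```

Under these assumptions the fixed latency dominates the network case, and since each generated token has to cross every hop in sequence, the cost compounds with the number of nodes.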

salutonmundo 10 hours ago||

it's called your damn brain.

devmor 4 hours ago||

I’d be surprised if this was useful for much. Claude is already almost too slow to do anything serious I’d consider using it for outside of grunt work without parallelizing.

The only reason it’s economical is because it’s massively discounted if you’re not paying API rates.

wmedrano 13 hours ago||

No, but I use GLM5.1 instead of Claude/GPT.

drnick1 13 hours ago||

Do you recommend Ollama or bare llama.cpp?

jboss10 12 hours ago||

llama.cpp. It's faster and more open source. Ollama has some mixed history. I use llama-swap to emulate the Ollama experience.

shironnnn_ 12 hours ago||

If you're on macOS, I recommend llm-mlx, which currently generates tokens 10-15% faster than llama.cpp.

devin 14 hours ago||

Anyone here running a tinygrad?

system2 16 hours ago||

Until I can buy an 80GB VRAM GPU, I won't attempt to do it. A local LLM is always missing something that needs a bigger model.

ColonelPhantom 9 hours ago|

Which model class requires an 80 GB VRAM GPU? From my perspective, popular models seem to fall into two groups: those in the ~30B range (Qwen3.6, Gemma 4), and the larger models (MiniMax, MiMo, StepFun, Deepseek) with multiple hundreds of billions of parameters, for which 80 GB is simply too small.

You can just about reach the lower end of the latter category with a 128GB machine like a DGX Spark, Framework Desktop, or M5 Max, though those are usually not super fast. For the former category, you can easily run them fast with something like a 3090 or 5090, hell, probably even a 5060 Ti.
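The sizing argument is simple arithmetic, sketched below: weight memory is roughly parameter count times bits per parameter. The helper names, the 4-bit quantization, and the 200B stand-in for "the lower end of the larger models" are assumptions for illustration, and the estimate deliberately ignores KV cache, activations, and runtime overhead, so real headroom is smaller.

```python
def weights_gb(params_billion, bits_per_param):
    """Approximate weight footprint in GB (decimal), ignoring KV cache and overhead."""
    return params_billion * bits_per_param / 8


def fits(params_billion, bits_per_param, memory_gb):
    """True if the weights alone fit in the given memory budget."""
    return weights_gb(params_billion, bits_per_param) <= memory_gb


# Hypothetical cases at 4-bit quantization: (params in billions, memory in GB).
for params, memory in [(30, 24), (200, 80), (200, 128)]:
    size = weights_gb(params, 4)
    verdict = "fits" if fits(params, 4, memory) else "does not fit"
    print(f"{params}B @ 4-bit = {size:.0f} GB of weights: {verdict} in {memory} GB")
```

With these numbers a ~30B model fits on a 24 GB card, while a 200B model overflows 80 GB but squeezes into 128 GB, which matches the two groups described above.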

system2 2 hours ago|||

Video models.

CamperBob2 4 hours ago|||

This is true. There's not much point in buying only one RTX 6000. You need at least two to run anything interesting that you couldn't run on a 5090. And you can imagine where it goes from there.

christkv 16 hours ago|

Waiting for https://github.com/antirez/ds4 to stabilize for Strix Halo.
