Posted by threeturn 3 days ago
Ask HN: Who uses open LLMs and coding assistants locally? Share setup and laptop
Which model(s) are you running (e.g., via Ollama, LM Studio, or others) and which open-source coding assistant/integration (for example, a VS Code plugin) are you using?
What laptop hardware do you have (CPU, GPU/NPU, memory, discrete or integrated GPU, OS) and how does it perform for your workflow?
What kinds of tasks do you use it for (code completion, refactoring, debugging, code review) and how reliable is it (what works well / where it falls short)?
I'm conducting my own investigation, which I'll be happy to share as well once it's done.
Thanks! Andrea.
I got sleep working by disabling the webcam in the BIOS for now.
On both I have lemonade-server set up to launch at system start. At work I use Qwen3 Coder 30B-A3B with continue.dev. It serves me well in 90% of cases.
At home I have 128 GB of RAM and have been trying gpt-oss 120B a bit. I host Open WebUI on it and connect via HTTPS and WireGuard, so I can use it as a PWA on my phone. I love not needing to think about where my data goes. But I would like to allow parallel requests, so I need to tinker a bit more. Maybe llama-swap is enough.
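For anyone curious, the client side of all this is tiny, since everything speaks the OpenAI-compatible API. A rough sketch only; the port and model id here are assumptions, so check what your own server lists under /v1/models:

    # Minimal sketch of talking to a local OpenAI-compatible server
    # (lemonade-server / llama.cpp style). Port and model id are assumptions.
    import requests

    BASE_URL = "http://localhost:8000/v1"   # adjust to your server's port
    MODEL = "qwen3-coder-30b-a3b"           # adjust to whatever /v1/models reports

    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": "Write a one-line docstring for bubble sort."}],
            "max_tokens": 128,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])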
I just need to figure out how to deal with context length. My models stop or go into an infinite loop after a number of messages, but then I often just start a new chat.
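Starting a new chat is basically the manual version of this: keep a rough budget of recent messages and drop the oldest ones before each request. A sketch only; the characters-per-token estimate and the numbers are guesses, not how lemonade-server actually behaves:

    # Crude history trimming to stay under the model's context window.
    # The 4-characters-per-token estimate and the budgets are rough guesses.
    def trim_history(messages, max_context_tokens=8192, reserve_for_reply=1024):
        budget = max_context_tokens - reserve_for_reply
        kept, used = [], 0
        for msg in reversed(messages):              # walk newest-first
            approx_tokens = len(msg["content"]) // 4 + 8
            if kept and used + approx_tokens > budget:
                break
            kept.append(msg)
            used += approx_tokens
        return list(reversed(kept))                 # back to chronological order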
Lemonade-server runs on llama.cpp; vLLM seems to scale better, though it is not as easy to set up.
Unsloth GGUFs are a great resource for models.
Also, for Strix Halo, check out the kyuz0 repositories! They cover image generation too; I haven't tried those yet, but the benchmarks are awesome and there's lots to learn from them. The Framework forum can be useful as well.
https://github.com/kyuz0/amd-strix-halo-toolboxes Also nice: https://llm-tracker.info/ It links to a benchmark site that lists models by size. I prefer such resources, since it makes it easy to see which ones fit in my RAM (even though I have this silly rule of thumb: a billion parameters ≈ a GB of RAM).
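The rule of thumb is just parameters times bytes per weight, so a quantized model lands well under the headline number. Very rough, and it ignores KV cache and runtime overhead:

    # Back-of-the-envelope weight memory: parameters * bytes per weight.
    # At ~8 bits per weight, 1B parameters is ~1 GB, hence the rule of thumb.
    def approx_weight_gb(params_billion, bits_per_weight=8.0):
        return params_billion * bits_per_weight / 8

    print(approx_weight_gb(30, 4.5))    # ~17 GB for a 30B model at roughly Q4
    print(approx_weight_gb(120, 4.5))   # ~68 GB for a 120B model at roughly Q4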
Btw, even an AMD HX 370 with non-soldered RAM can get some nice t/s on smaller models. That can be helpful enough when you're disconnected from the internet and don't know how to style an SVG :)
Thanks for opening up this topic! Lots of food for thought :)
Give it time, we'll get there, but not anytime soon.
My development flow takes a lot of RAM (and yes, I can run it minimally by editing in the terminal with language servers turned off), so I wouldn't consider running the local LLM on the same computer.
It's sort of like doing all your work on an 80386. Can it be made to work? Probably. Are you going to learn a whole lot making it work? Without a doubt! Are you going to be the fastest dev on the team? No.
I love local models for some use cases. However, for coding there is a big gap between the quality of models you can run at home and those you can't (at least on hardware I can afford), like GLM 4.6, Sonnet 4.5, GPT-5 Codex, and Qwen3 Coder 480B.
What makes local coding models compelling?
>> motivation
It's the only way to be sure it's not being trained on.
Most people never come up with any truly novel ideas to code. That's fine. There's no point in those people not submitting their projects to LLM providers.
This lack of creativity is so prevalent that many people believe it is not possible to come up with new ideas (variants: it's all been tried before; or it would inevitably be tried by someone else anyway; or people will copy anyway).
Some people do come up with new stuff, though. And (sometimes) they don't want to be trained on. That is the main edge IMO, for running local models.
In a word: competition.
Note, this is distinct from fearing copying by humans (or agents) with LLMs at their disposal. This is about not seeding patterns more directly into the code being trained on.
Most people would say, forget that, just move fast and gain dominance. And they might not be wrong. Time may tell. But the reason can still stand as a compelling motivation, at least theoretically.
Tangential: IANAL, but I imagine there's some kind of parallel concept around code/concept "property ownership". If you literally send your code to a 3P LLM, I'm guessing they have rights to it, and some otherwise handwavy (quasi-important) IP ownership might become suspect. We are possibly in a post-IP world (for some decades now, depending on who's talking), but not everybody agrees on that currently, AFAICT.
Re: creative competition - that's interesting. I open source much of my creative work, so I guess that's never been a concern of mine.
Paying money for probabilistically generated tokens is effectively gambling. I don't like to gamble.
But is there really no host you trust not to keep your data? Big tech with no-log guarantees and contractual liability? Companies with no-log guarantees and a clear inference business model to protect, like Together or Fireworks? Their motives seem aligned.
I'd run locally if I could without compromise. But the gap from GLM 4.5 Air to GLM 4.6 is huge for productivity.
Why take a chance?
This all day long.
Plus I like to see what can be done without relying on big tech (relying on someone to create an LLM that I can use notwithstanding).
I learn a lot about how LLMs work and how to work with them.
I can also ask my dumbest questions to a local model and get a response faster, without burning tokens that count towards usage limits on the hosted services I use for actual work.
Definitely a hobby-category activity though, don't feel you're missing out on some big advantage (yet, anyway) unless you feel a great desire to set fire to thousands of dollars in exchange for spending your evenings untangling CUDA driver issues and wondering if that weird smell is your GPU melting. Some people are into that sort of thing, though.
Anyone can chime in! I just want a working local model that is at least as good as Sonnet 4.5, or even 3.x.
My only complaint is that agent mode needs good token generation speed, so I only use agent mode on the RTX machine.
I grew up on 9600 baud, so I'm cool with watching the text crawl.
I had to create a custom image of llama.cpp compiled with Vulkan so the LLMs can access the GPU on my MacBook Air M4 from inside containers for inference. It's much faster, like 8-10x faster than without.
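If you want to sanity check the speedup yourself, timing a request against the server is enough to see the difference. A rough sketch only; the port is an assumption and counting words instead of tokens is just an approximation (the server's own timing logs are more accurate):

    # Very rough throughput check against a local llama.cpp server.
    # Port is an assumption; word counting is only a proxy for tokens/sec.
    import time
    import requests

    URL = "http://localhost:8080/v1/chat/completions"   # adjust to your container's mapped port

    start = time.time()
    resp = requests.post(URL, json={
        "model": "local",   # assumption: with a single model loaded the name is not critical
        "messages": [{"role": "user", "content": "Explain CSS flexbox in about 200 words."}],
        "max_tokens": 300,
    }, timeout=300)
    elapsed = time.time() - start

    text = resp.json()["choices"][0]["message"]["content"]
    print(f"~{len(text.split()) / elapsed:.1f} words/sec over {elapsed:.1f}s")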
To be honest, so far I've been using mostly cloud models for coding; the local models haven't been that great.
Some more details on the blog: https://markjgsmith.com/posts/2025/10/12/just-use-llamacpp
In terms of models, qwen2.5-coder:3b is a good compromise for autocomplete; as the agent, choose pretty much the biggest SOTA model you can run.
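If you're pulling that tag through Ollama (which the qwen2.5-coder:3b naming suggests), a quick completion is a single request. A sketch, assuming Ollama's default port and that the model is already pulled:

    # Quick single-shot completion against a local Ollama instance.
    # Assumes the default port 11434 and that qwen2.5-coder:3b has been pulled.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen2.5-coder:3b",
            "prompt": "def fibonacci(n):",
            "stream": False,
        },
        timeout=60,
    )
    print(resp.json()["response"])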