Posted by threeturn 2 days ago
Ask HN: Who uses open LLMs and coding assistants locally? Share setup and laptop
Which model(s) are you running, on which runtime (e.g., Ollama, LM Studio, or others), and which open-source coding assistant/integration (for example, a VS Code plugin) are you using?
What laptop hardware do you have (CPU, GPU/NPU, memory, discrete or integrated GPU, OS), and how does it perform for your workflow?
What kinds of tasks do you use it for (code completion, refactoring, debugging, code review), and how reliable is it (what works well / where it falls short)?
I'm conducting my own investigation, which I'll be happy to share here once it's done.
Thanks! Andrea.
My setup:
- MacBook Pro (M3 Max)
- Neovim
- https://github.com/huggingface/llm.nvim

Models I typically use:
- mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx
- mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit

The key advantage of llm.nvim is that it cancels generation when you continue typing, so invalidated completions don't waste time. That keeps completion latency predictable (about 1.5 seconds for me).
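For anyone who wants to poke at those MLX models outside the editor, here is a minimal sketch using the mlx-lm Python package (pip install mlx-lm, Apple silicon only); the prompt and token limit are placeholders, not part of the setup above:

    # Minimal sketch: run one of the quantized MLX community models locally.
    # Requires `pip install mlx-lm` and Apple silicon; the prompt is a placeholder.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit")

    # Instruct models expect their chat template, so build the prompt from messages.
    messages = [{"role": "user", "content": "Write a Python function that parses an ISO 8601 date string."}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    print(generate(model, tokenizer, prompt=prompt, max_tokens=256))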
I haven't yet found a local model that fits on a 64GB Mac or a 128GB Spark and seems good enough to reliably run bash-in-a-loop over multiple turns, but maybe I haven't tried the right combination of models and tools.
If there's any insight you can share about your AGENTS.md prompting, it may also be helpful for others!
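Not from the parent commenter, but to make the bash-in-a-loop idea concrete, here is a rough sketch of the pattern using the ollama Python client (pip install ollama, recent version assumed); the model name and the RUN/DONE convention are made up for illustration, and a real agent would sandbox or confirm commands before running them:

    # Rough sketch of "bash in a loop": the model proposes one shell command per
    # turn, we run it and feed the output back, until it says DONE.
    # Assumes a local Ollama server and a recent ollama Python client.
    import subprocess
    import ollama

    SYSTEM = (
        "You are operating a shell. Each turn, reply with exactly one line "
        "starting with 'RUN:' followed by a command, or reply 'DONE' when finished."
    )
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "List the five largest files under ./src."},
    ]

    for _ in range(10):  # hard cap on turns
        reply = ollama.chat(model="qwen3-coder:30b", messages=messages).message.content
        messages.append({"role": "assistant", "content": reply})
        if "DONE" in reply:
            break
        command = next(
            (line[len("RUN:"):].strip() for line in reply.splitlines() if line.startswith("RUN:")),
            None,
        )
        if command is None:
            messages.append({"role": "user", "content": "Reply with 'RUN: <command>' or 'DONE'."})
            continue
        # No sandboxing here -- a real agent should isolate or confirm commands.
        result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=60)
        messages.append({"role": "user", "content": f"stdout:\n{result.stdout}\nstderr:\n{result.stderr}"})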
Kept it simple: ollama, and whatever the latest model in fashion is [when I'm looking]. I feel silly naming any one in particular; I make them compete. Most of the time I don't bother at all: I know the docs I need.
If anyone has suggestions for other models, I'm open to them: as an experiment I asked it to design a new LaTeX resumé for me, and it struggled for two hours with the request to put my name prominently at the top in a grey box, with my email and phone number beside it.
Not only are they a lot more recent than Gemma, they seem really good at tool calling, so probably a good fit for coding tools. I haven't personally tried them for that, though.
The actual page is here: https://huggingface.co/ibm-granite/granite-4.0-h-1b
Yes, the granite 4 models are on ollama:
https://ollama.com/library/granite4
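To make the tool-calling point concrete, here is a minimal sketch using the ollama Python client against that granite4 tag (a recent ollama-python is assumed); the get_weather tool is a toy made up for illustration:

    # Minimal tool-calling sketch with the ollama Python client and granite4.
    # The tool itself is a toy; the point is the request/response shape.
    import ollama

    def get_weather(city: str) -> str:
        return f"It is 12C and raining in {city}."  # stubbed result

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    messages = [{"role": "user", "content": "What's the weather in Oslo right now?"}]
    response = ollama.chat(model="granite4", messages=messages, tools=tools)

    # If the model decided to call the tool, execute it locally.
    for call in response.message.tool_calls or []:
        if call.function.name == "get_weather":
            print(get_weather(**call.function.arguments))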
> but my interest is specifically in privacy respecting LLMs -- my goal is to run the most powerful one I can on my personal machine
The HF Spaces demo for granite 4 nano does run on your local machine, using Transformers.js and ONNX. After downloading the model weights you can disconnect from the internet and things should still work. It's all happening in your browser, locally.
Of course ollama is preferable for your own dev environment. But ONNX and Transformers.js are amazingly useful for edge deployment and for easily sharing things with non-technical users. When I want to bundle up a little demo I typically just do that instead of the old way I did things (bundle it all up on a server and eat the inference cost).
Also my "dev enviornment" is vi -- I come from infosec (so basically a glorified sysadmin) so I'm mostly making little bash and python scripts, so I'm learning a lot of new things about software engineering as I explore this space :-)
Edit: Hey, which of the models on that page were you referring to? I'm grabbing one now that's apparently double-digit GB. Or were you saying they're not CPU/RAM intensive but still a bit big?
I'm running mainly GPT-OSS-120b/20b depending on the task, Magistral for multimodal stuff, and some smaller models I've fine-tuned myself for specific tasks.
All the software I use for this is implemented by myself, though I started out by basically calling out to llama.cpp, as it was the simplest and fastest option that let me integrate it into my own software without requiring a GUI.
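For anyone wanting to start the same way, a minimal sketch of calling out to llama.cpp from Python via the llama-cpp-python bindings; the GGUF path, context size, and offload settings are placeholders, not the parent's actual configuration:

    # Minimal sketch: embed llama.cpp in your own tooling via llama-cpp-python.
    # Model path and parameters are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/gpt-oss-20b-Q4_K_M.gguf",  # any local GGUF file
        n_ctx=8192,        # context window
        n_gpu_layers=-1,   # offload all layers to the GPU if they fit
        verbose=False,
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize what a MoE layer does."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])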
I also use Codex and Claude Code from time to time for some mindless work: Codex is hooked up to my local GPT-OSS-120b, while Claude Code uses Sonnet.
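Codex itself is configured through its own config, but the underlying pattern (pointing an OpenAI-compatible client at a local server) looks roughly like this; the base URL, port, and model name below are assumptions for a local llama-server or Ollama /v1 endpoint:

    # General pattern for hooking OpenAI-style tools to a local model:
    # point the client at a local OpenAI-compatible server instead of api.openai.com.
    # Base URL, port, and model name are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

    resp = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[{"role": "user", "content": "Explain what this regex matches: ^\\d{4}-\\d{2}-\\d{2}$"}],
    )
    print(resp.choices[0].message.content)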
> What laptop hardware do you have (CPU, GPU/NPU, memory, discrete or integrated GPU, OS), and how does it perform for your workflow?
Desktop: Ryzen 9 5950X, 128GB of RAM, RTX Pro 6000 Blackwell (96GB VRAM). It performs very well, and I can run most of the models I use daily at the same time; if I want a really large context, I run just GPT-OSS-120B with max context, which ends up taking ~70GB of VRAM.
> What kinds of tasks do you use it for (code completion, refactoring, debugging, code review), and how reliable is it (what works well / where it falls short)?
Almost anything and everything, but mostly coding. Beyond that: general questions, researching topics, troubleshooting issues with my local infrastructure, troubleshooting things in my other hobbies, and a bunch of other stuff. As long as you give the local LLM access to a search tool (I use YaCy + my own adapter), local models work better for me than the hosted ones, mainly because of the speed and because I have better control over the inference.
It does fall short on really complicated stuff. Right now I'm doing CUDA programming, creating a fused MoE kernel for inference in Rust, and it's a bit tricky: there are a lot of moving parts and I don't understand the subject 100%, and at that point it gets hit or miss. You really need a proper understanding of what you're using the LLM for, otherwise it breaks down quickly. Divide and conquer, as always, helps a lot.
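On the search-tool point above: the adapter can be quite small. Here is a rough sketch of one (not the parent's actual YaCy adapter), assuming a local YaCy instance on its default port and its yacysearch.json API; the field names follow YaCy's JSON output and are worth double-checking against your version:

    # Rough sketch of a local search tool for an LLM: query a local YaCy
    # instance and return a compact text block the model can read.
    # Assumes YaCy on its default port (8090); field names are assumptions.
    import requests

    def web_search(query: str, max_results: int = 5) -> str:
        resp = requests.get(
            "http://localhost:8090/yacysearch.json",
            params={"query": query, "maximumRecords": max_results},
            timeout=10,
        )
        resp.raise_for_status()
        items = resp.json()["channels"][0]["items"][:max_results]
        return "\n\n".join(
            f"{item.get('title', '')}\n{item.get('link', '')}\n{item.get('description', '')}"
            for item in items
        )

    if __name__ == "__main__":
        print(web_search("fused MoE kernel CUDA"))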
I have to say "continue" constantly.
I use it to do simple text-based tasks occasionally if my Internet is down or ChatGPT is down.
I also use it in VS Code to help with code completion using the Continue extension.
I also created a Firefox extension so I can use Open WebUI in my browser by pressing Cmd+Shift+Space when I'm browsing the web and want to ask a question: https://addons.mozilla.org/en-US/firefox/addon/foxyai/
I'm 50% brainstorming ideas with it, asking critical questions and learning something new. The other half is actual development, where I describe very clearly what I know I'll need (usually as TODOs in comments) and it writes those snippets, which is my preferred way of AI assistance. I stay in the driver's seat; the model is the copilot. Human-in-the-loop and such. It has worked really well for my website development, other personal projects, and even professionally (my work laptop has its own Open WebUI account for separation).
I've read that GPT-OSS:20b is still a very powerful model; I believe it fits in your Mac's RAM as well and should still be quite fast. For me personally, only the more complex questions require a better model than the local ones, and then I'm often wondering whether an LLM is the right tool for that level of complexity.
In more cases than expected, the M1/M2 Ultras are still quite capable, especially in performance per watt, as well as in their ability to serve a single user.
The Mac Studio also gives you more computational power per dollar than the laptops.
Depending on your needs, the M5s might be worth waiting for, but anything from the M2 Max onward is quite capable with enough RAM. Even the M1 Max continues to be a workhorse.