Posted by threeturn 2 days ago
Ask HN: Who uses open LLMs and coding assistants locally? Share setup and laptop
Which model(s) are you running, on which runtime (e.g., Ollama, LM Studio, or others), and which open-source coding assistant/integration (for example, a VS Code plugin) are you using?
What laptop hardware do you have (CPU, GPU/NPU, memory, discrete or integrated GPU, OS), and how does it perform for your workflow?
What kinds of tasks do you use it for (code completion, refactoring, debugging, code review), and how reliable is it (what works well / where it falls short)?
I'm conducting my own investigation, which I'll be happy to share here once it's done.
Thanks! Andrea.
My setup:
- MacBook Pro (M3 Max)
- Neovim
- https://github.com/huggingface/llm.nvim

Models I typically use:
- mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx
- mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit

The key advantage of llm.nvim is that it cancels generation when you continue typing, so invalidated completions don't waste time. That keeps completion latency predictable (about 1.5 seconds for me).
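For anyone who wants to poke at those MLX models outside the editor, here is a minimal sketch using the mlx-lm Python package (pip install mlx-lm, Apple silicon only); the prompt and token limit are placeholders, not part of the setup above:

    # Minimal sketch: run one of the quantized MLX community models locally.
    # Requires `pip install mlx-lm` and Apple silicon; the prompt is a placeholder.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit")

    # Instruct models expect their chat template, so build the prompt from messages.
    messages = [{"role": "user", "content": "Write a Python function that parses an ISO 8601 date string."}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    print(generate(model, tokenizer, prompt=prompt, max_tokens=256))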
I haven't yet found a local model that fits on a 64GB Mac or a 128GB Spark and seems good enough to reliably run bash-in-a-loop over multiple turns, but maybe I haven't tried the right combination of models and tools.
If there's any insight you can share about your AGENTS.md prompting, it may also be helpful for others!
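Not from the parent commenter, but to make the bash-in-a-loop idea concrete, here is a rough sketch of the pattern using the ollama Python client (pip install ollama, recent version assumed); the model name and the RUN/DONE convention are made up for illustration, and a real agent would sandbox or confirm commands before running them:

    # Rough sketch of "bash in a loop": the model proposes one shell command per
    # turn, we run it and feed the output back, until it says DONE.
    # Assumes a local Ollama server and a recent ollama Python client.
    import subprocess
    import ollama

    SYSTEM = (
        "You are operating a shell. Each turn, reply with exactly one line "
        "starting with 'RUN:' followed by a command, or reply 'DONE' when finished."
    )
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "List the five largest files under ./src."},
    ]

    for _ in range(10):  # hard cap on turns
        reply = ollama.chat(model="qwen3-coder:30b", messages=messages).message.content
        messages.append({"role": "assistant", "content": reply})
        if "DONE" in reply:
            break
        command = next(
            (line[len("RUN:"):].strip() for line in reply.splitlines() if line.startswith("RUN:")),
            None,
        )
        if command is None:
            messages.append({"role": "user", "content": "Reply with 'RUN: <command>' or 'DONE'."})
            continue
        # No sandboxing here -- a real agent should isolate or confirm commands.
        result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=60)
        messages.append({"role": "user", "content": f"stdout:\n{result.stdout}\nstderr:\n{result.stderr}"})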
Kept it simple: ollama, and whatever the latest model in fashion is [when I'm looking]. I feel silly naming any one in particular; I make them compete. Most of the time I don't bother at all: I know the docs I need.
If anyone has suggestions for other models, I'm open to them: as an experiment I asked it to design a new LaTeX resumé for me, and it struggled for two hours with the request to put my name prominently at the top in a grey box, with my email and phone number beside it.
Not only are they a lot more recent than Gemma, they seem really good at tool calling, so probably a good fit for coding tools. I haven't personally tried them for that, though.
The actual page is here: https://huggingface.co/ibm-granite/granite-4.0-h-1b
Yes, the granite 4 models are on ollama:
https://ollama.com/library/granite4
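To make the tool-calling point concrete, here is a minimal sketch using the ollama Python client against that granite4 tag (a recent ollama-python is assumed); the get_weather tool is a toy made up for illustration:

    # Minimal tool-calling sketch with the ollama Python client and granite4.
    # The tool itself is a toy; the point is the request/response shape.
    import ollama

    def get_weather(city: str) -> str:
        return f"It is 12C and raining in {city}."  # stubbed result

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    messages = [{"role": "user", "content": "What's the weather in Oslo right now?"}]
    response = ollama.chat(model="granite4", messages=messages, tools=tools)

    # If the model decided to call the tool, execute it locally.
    for call in response.message.tool_calls or []:
        if call.function.name == "get_weather":
            print(get_weather(**call.function.arguments))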
> but my interest is specifically in privacy respecting LLMs -- my goal is to run the most powerful one I can on my personal machine
The HF Spaces demo for granite 4 nano does run on your local machine, using Transformers.js and ONNX. After downloading the model weights you can disconnect from the internet and things should still work. It's all happening in your browser, locally.
Of course ollama is preferable for your own dev environment. But ONNX and Transformers.js are amazingly useful for edge deployment and for easily sharing things with non-technical users. When I want to bundle up a little demo I typically just do that instead of the old way I did things (bundle it all up on a server and eat the inference cost).
Also my "dev enviornment" is vi -- I come from infosec (so basically a glorified sysadmin) so I'm mostly making little bash and python scripts, so I'm learning a lot of new things about software engineering as I explore this space :-)
Edit: Hey, which of the models on that page were you referring to? I'm grabbing one now that's apparently double-digit GB. Or were you saying they're not CPU/RAM intensive but still a bit big?
I'm running mainly GPT-OSS-120b/20b depending on the task, Magistral for multimodal stuff, and some smaller models I've fine-tuned myself for specific tasks.
All the software I use for this is implemented by myself, though I started out by basically calling out to llama.cpp, as it was the simplest and fastest option that let me integrate it into my own software without requiring a GUI.
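For anyone wanting to start the same way, a minimal sketch of calling out to llama.cpp from Python via the llama-cpp-python bindings; the GGUF path, context size, and offload settings are placeholders, not the parent's actual configuration:

    # Minimal sketch: embed llama.cpp in your own tooling via llama-cpp-python.
    # Model path and parameters are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/gpt-oss-20b-Q4_K_M.gguf",  # any local GGUF file
        n_ctx=8192,        # context window
        n_gpu_layers=-1,   # offload all layers to the GPU if they fit
        verbose=False,
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize what a MoE layer does."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])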
I also use Codex and Claude Code from time to time for some mindless work: Codex is hooked up to my local GPT-OSS-120b, while Claude Code uses Sonnet.
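Codex itself is configured through its own config, but the underlying pattern (pointing an OpenAI-compatible client at a local server) looks roughly like this; the base URL, port, and model name below are assumptions for a local llama-server or Ollama /v1 endpoint:

    # General pattern for hooking OpenAI-style tools to a local model:
    # point the client at a local OpenAI-compatible server instead of api.openai.com.
    # Base URL, port, and model name are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

    resp = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[{"role": "user", "content": "Explain what this regex matches: ^\\d{4}-\\d{2}-\\d{2}$"}],
    )
    print(resp.choices[0].message.content)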
> What laptop hardware do you have (CPU, GPU/NPU, memory, discrete or integrated GPU, OS), and how does it perform for your workflow?
Desktop: Ryzen 9 5950X, 128GB of RAM, RTX Pro 6000 Blackwell (96GB VRAM). It performs very well, and I can run most of the models I use daily at the same time; if I want a really large context, I run just GPT-OSS-120B with max context, which ends up taking ~70GB of VRAM.
> What kinds of tasks do you use it for (code completion, refactoring, debugging, code review), and how reliable is it (what works well / where it falls short)?
Almost anything and everything, but mostly coding. Beyond that: general questions, researching topics, troubleshooting issues with my local infrastructure, troubleshooting things in my other hobbies, and a bunch of other stuff. As long as you give the local LLM access to a search tool (I use YaCy + my own adapter), local models work better for me than the hosted ones, mainly because of the speed and because I have better control over the inference.
It does fall short on really complicated stuff. Right now I'm doing CUDA programming, creating a fused MoE kernel for inference in Rust, and it's a bit tricky: there are a lot of moving parts and I don't understand the subject 100%, and at that point it gets hit or miss. You really need a proper understanding of what you're using the LLM for, otherwise it breaks down quickly. Divide and conquer, as always, helps a lot.
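On the search-tool point above: the adapter can be quite small. Here is a rough sketch of one (not the parent's actual YaCy adapter), assuming a local YaCy instance on its default port and its yacysearch.json API; the field names follow YaCy's JSON output and are worth double-checking against your version:

    # Rough sketch of a local search tool for an LLM: query a local YaCy
    # instance and return a compact text block the model can read.
    # Assumes YaCy on its default port (8090); field names are assumptions.
    import requests

    def web_search(query: str, max_results: int = 5) -> str:
        resp = requests.get(
            "http://localhost:8090/yacysearch.json",
            params={"query": query, "maximumRecords": max_results},
            timeout=10,
        )
        resp.raise_for_status()
        items = resp.json()["channels"][0]["items"][:max_results]
        return "\n\n".join(
            f"{item.get('title', '')}\n{item.get('link', '')}\n{item.get('description', '')}"
            for item in items
        )

    if __name__ == "__main__":
        print(web_search("fused MoE kernel CUDA"))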
I have to say "continue" constantly.
I use it to do simple text-based tasks occasionally if my Internet is down or ChatGPT is down.
I also use it in VS Code to help with code completion using the Continue extension.
I also created a Firefox extension so I can use Open WebUI in my browser by pressing Cmd+Shift+Space when I'm browsing the web and want to ask a question: https://addons.mozilla.org/en-US/firefox/addon/foxyai/
I'm 50% brainstorming ideas with it, asking critical questions and learning something new. The other half is actual development, where I describe very clearly what I know I'll need (usually as TODOs in comments) and it writes those snippets, which is my preferred way of AI assistance. I stay in the driver's seat; the model is the copilot. Human-in-the-loop and such. It has worked really well for my website development, other personal projects, and even professionally (my work laptop has its own Open WebUI account for separation).
I've read that GPT-OSS:20b is still a very powerful model; I believe it fits in your Mac's RAM as well and should still be quite fast. For me personally, only the more complex questions require a better model than the local ones, and then I'm often wondering whether an LLM is the right tool for that level of complexity.
In more cases than expected, the M1/M2 Ultras are still quite capable, especially in performance per watt, as well as in their ability to serve a single user.
The Mac Studio also gives you more computational power per dollar than the laptops.
Depending on your needs, the M5s might be worth waiting for, but anything from the M2 Max onward is quite capable with enough RAM. Even the M1 Max continues to be a workhorse.