Posted by threeturn 10/31/2025

Ask HN: Who uses open LLMs and coding assistants locally? Share setup and laptop

Dear Hackers, I’m interested in your real-world workflows for using open-source LLMs and open-source coding assistants on your laptop (not just cloud/enterprise SaaS). Specifically:

Which model(s) are you running, with which runtime (e.g., Ollama, LM Studio, or others), and which open-source coding assistant/integration (for example, a VS Code plugin) are you using?

What laptop hardware do you have (CPU, GPU/NPU, memory, discrete or integrated GPU, OS), and how does it perform for your workflow?

What kinds of tasks do you use it for (code completion, refactoring, debugging, code review), and how reliable is it (what works well / where it falls short)?

I'm conducting my own investigation, which I will be happy to share as well once it's done.

Thanks! Andrea.

350 points | 192 comments
hacker_homie 10/31/2025|
Any Strix Halo laptop. I have been using the HP ZBook Ultra G1a with 128GB of unified memory, mostly with the 20B-parameter models, though it can load larger ones. I find local models (gpt-oss 20B) are good quick references, but if you want to refactor or something like that you need a bigger model. I'm running llama.cpp directly and using the API it offers with neovim's avante plugin or a CLI tool like aichat; it comes with a basic web interface as well.
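
For anyone scripting against a similar setup, a minimal sketch of calling the llama.cpp server's OpenAI-compatible endpoint from Python; the port, dummy API key, and model name here are assumptions (llama-server defaults to port 8080 and serves whatever model it was started with):

    # Sketch: talk to a local llama-server over its OpenAI-compatible API.
    # Assumes something like `llama-server -m gpt-oss-20b.gguf --port 8080` is already running.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="gpt-oss-20b",  # placeholder; the server uses the model it was launched with
        messages=[{"role": "user", "content": "Explain this error: KeyError: 'id'"}],
        max_tokens=256,
    )
    print(resp.choices[0].message.content)
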
zamadatix 10/31/2025|
Do you run into hibernation/sleep issues under current mainline Linux kernels by chance? I have this laptop and that's the only thing which isn't working out of the box for me on the Linux side, but it works fine in Windows. I know it's officially supported under the Ubuntu LTS, but I was hoping that wouldn't be needed as I do want a newer+customized kernel.
hacker_homie 10/31/2025||
Under current kernels (6.17) there seems to be an issue with the webcam driver: https://bugzilla.kernel.org/show_bug.cgi?id=220702. It looks like there are still some issues with sleep/webcam at this time; they might be fixed by the 6.18 release.

I got sleep working by disabling the webcam in the BIOS for now.

zamadatix 11/2/2025||
Well shucks, my sleep was still broken after disabling that :/. Will have to keep poking at it - thanks!
sho 10/31/2025||
Real-world workflows? I'm all for local LLMs, and I tinker with them all the time, but for productive coding use no local LLM approaches the cloud, and it's not even close. There's no magic trick or combination of pieces; it just turns out that a quarter million dollars' worth of H200s is much, much better than anything a normal person could possibly deploy at home.

Give it time, we'll get there, but not anytime soon.

exac 10/31/2025||
I thought you would just use another computer in your house for those workflows?

My development flow takes a lot of RAM (and yes, I can run it minimally by editing in the terminal with language servers turned off), so I wouldn't consider running the local LLM on the same computer.

sho 11/1/2025||
It's not about which of your computers you run it on, it's about the relative capability of any system you're likely to own vs. what a cloud provider can do. The difference is hilarious - probably 100x. Knowing that, unless you have good reasons (and experimenting/playing around IS a good reason), not many people would choose to actually base their everyday workflow on an all-local setup.

It's sort of like doing all your work on an 80386. Can it be made to work? Probably. Are you going to learn a whole lot making it work? Without a doubt! Are you going to be the fastest dev on the team? No.

starik36 10/31/2025||
You are right. This is the current situation. Plus the downside is that your laptop heats up like a furnace if you use the local LLM a lot.
saubeidl 10/31/2025||
I think local LLMs and laptops are not really compatible for anything useful. You're gonna want a bigger box and have your laptop connect to that.
dennemark 10/31/2025||
I have an AMD Strix Halo (395) in my work laptop (HP Ultrabook G1A) as well as at home in a Framework Desktop.

On both I have set up lemonade-server to run on system start. At work I use Qwen3 Coder 30B-A3B with continue.dev. It serves me well in 90% of cases.

At home I have 128GB RAM and have been trying GPT-OSS 120B a bit. I host Open WebUI on it and connect via HTTPS and WireGuard, so I can use it as a PWA on my phone. I love not needing to think about where my data goes. But I would like to allow parallel requests, so I need to tinker a bit more. Maybe llama-swap is enough.

I just need to figure out how to deal with context length. My models stop or go into an infinite loop after some messages, but then I often just start a new chat.

lemonade-server runs on llama.cpp; vLLM seems to scale better, though, but is not as easy to set up.

Unsloth GGUFs are a great resource for models.

Also, for Strix Halo, check out the kyuz0 repositories! They also cover image gen; I didn't try those yet, but the benchmarks are awesome and there's lots to learn from. The Framework forum can be useful, too.

https://github.com/kyuz0/amd-strix-halo-toolboxes Also nice: https://llm-tracker.info/ It links to a benchmark site that lists models by size. I prefer such resources, since it makes it easy to see which ones fit in my RAM (even though I have this silly rule of thumb: a billion parameters ≈ a GB of RAM).
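
For what it's worth, a rough sketch of the arithmetic behind that rule of thumb; the quantization levels below are ballpark assumptions, and KV cache plus runtime overhead add a few GB on top:

    # Rough memory estimate for quantized model weights only (KV cache and
    # runtime overhead not included - budget a few extra GB for those).
    def approx_weights_gb(params_billions: float, bits_per_weight: float) -> float:
        return params_billions * (bits_per_weight / 8)  # 1B params at 8-bit ~= 1 GB

    for name, params, bits in [
        ("20B model at ~4-bit", 20, 4.5),
        ("30B model at ~4-bit", 30, 4.5),
        ("120B model at ~4-bit", 120, 4.5),
    ]:
        print(f"{name}: ~{approx_weights_gb(params, bits):.0f} GB of weights")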

Btw, even an AMD HX 370 with non-soldered RAM can get some nice t/s for smaller models. That can be helpful enough when you're disconnected from the internet and don't know how to style an SVG :)

Thanks for opening up this topic! Lots of food for thought :)

Balinares 11/1/2025|
Does Qwen3 Coder do a good job invoking its tools as appropriate for you? Under continue.dev at least, I've found I need to remind it constantly.
scosman 10/31/2025||
What is folks' motivation for using local coding models? Is it privacy, and that there's no cloud host you trust?

I love local models for some use cases. However, for coding there is a big gap between the quality of models you can run at home and those you can't (at least on hardware I can afford), like GLM 4.6, Sonnet 4.5, Codex 5, and Qwen3 Coder 480B.

What makes local coding models compelling?

realityfactchex 10/31/2025||
> compelling

>> motivation

It's the only way to be sure it's not being trained on.

Most people never come up with any truly novel ideas to code. That's fine. There's no point in those people not submitting their projects to LLM providers.

This lack of creativity is so prevalent that many people believe it is not possible to come up with new ideas (variants: it's all been tried before; or: it would inevitably be tried by someone else anyway; or: people will copy anyway).

Some people do come up with new stuff, though. And (sometimes) they don't want to be trained on. That, IMO, is the main edge of running local models.

In a word: competition.

Note, this is distinct from fearing copying by humans (or agents) with LLMs at their disposal. This is about not seeding patterns more directly into the code being trained on.

Most people would say, forget that, just move fast and gain dominance. And they might not be wrong. Time may tell. But the reason can still stand as a compelling motivation, at least theoretically.

Tangential: IANAL, but I imagine there's some kind of parallel concept around code/concept "property ownership". If you literally send your code to a 3P LLM, I'm guessing they gain some rights to it, and otherwise handwavy (but quasi-important) IP ownership might become suspect. We are possibly in a post-IP world (for some decades now, depending on who's talking), but not everybody agrees on that currently, AFAICT.

scosman 11/2/2025||
Several providers guarantee that they don't train on, or even retain, a copy of your data. You are right that they could be lying, but some are big enough that it would be catastrophic for them from a liability point of view.

Re:creative competition - that’s interesting. I open source much of my creative work so I guess that’s never been a concern of mine.

jckahn 10/31/2025|||
I don't ever want to be dependent on a cloud service to be productive, and I don't want to have to pay money to experiment with code.

Paying money for probabilistically generated tokens is effectively gambling. I don't like to gamble.

nprateem 10/31/2025||
Where did you get your free GPU from?
nicce 10/31/2025|||
The problem is the same as owning the house vs. renting.
jckahn 10/31/2025||||
I just use my AMD Framework 13 and 24GB M4 Mac mini. They run gpt-oss models, but only the 20b fits on the mini.
serf 11/1/2025|||
GPUs can do other things. Cloud service LLM providers cannot.
voakbasda 10/31/2025|||
Zero trust in remote systems run by others with unknowable or questionable motives.
scosman 10/31/2025|||
Makes sense that you'd run locally then.

But is there really no host you trust not to keep data? Big tech with no-log guarantees and contractual liability? Companies with no-log guarantees and a clear inference business model to protect, like Together/Fireworks? The motives seem aligned.

I'd run locally if I could without compromise. But the gap from GLM 4.5 Air to GLM 4.6 is huge for productivity.

xemdetia 10/31/2025||
This really isn't an all-or-nothing sort of situation. Many of the AI players have a proven record of simply not following existing norms. Until there is a consumer-oriented player who doesn't presume that training on my private data and ideas is permitted, it only makes sense to do some things locally. Beyond that, many of the companies providing AI have weird limits or limitations that interrupt me. I just know that, as an individual or a fledgling company, I am simply not big enough to fight some of these players and win, and compliance around companies running AI transparently is too new for me to rely on, so the rules of engagement are all over the place. Also, don't forget that in a few years, when the dust settles, the company with the policy you like is highly likely to be consumed by a company that may not share the same ethics, but your data will still be held by them.

Why take a chance?

fm2606 10/31/2025|||
> Zero trust in remote systems run by others with unknowable or questionable motives.

This all day long.

Plus, I like to see what can be done without relying on big tech (notwithstanding that I'm relying on someone to create an LLM I can use).

zargon 10/31/2025|||
Another reason along with the others is that the output quality of the top commercial models varies wildly with time. They start strong and then deteriorate. The providers keep changing the model and/or its configuration without changing the name. With a local open weights model, you can learn each model's strengths and it can't be taken away with an update.
brailsafe 10/31/2025|||
I don't run any locally, but when I was thinking about investing in a setup, it would just have been to have the tool offline. I haven't found the online subscription models useful enough, beyond occasional random tedious implementations, to consider investing in either online or offline LLMs long-term, and I've reverted to normal programming for the most part, since it just keeps me more engaged.
IanCal 10/31/2025||
Something to consider is using a middleman like OpenRouter: you can buy some credits and then use them at whatever provider through them - no subscription, just pay-as-you-go. For a few ad hoc things you can put a few bucks in and not worry about some monthly thing.
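
A minimal sketch of that pay-as-you-go route in Python, using the OpenAI-compatible client; the model slug is just an example:

    # Sketch: pay-as-you-go inference through OpenRouter's OpenAI-compatible API.
    # OPENROUTER_API_KEY comes from your OpenRouter account; the model slug is an example.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

    resp = client.chat.completions.create(
        model="qwen/qwen3-coder",  # any listed model; tokens are billed against your credits
        messages=[{"role": "user", "content": "Write a regex that matches ISO 8601 dates."}],
    )
    print(resp.choices[0].message.content)
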
johnisgood 10/31/2025|||
What setup would you (or other people) recommend for a local model, and which model, if I want something like Claude Sonnet 4.5 (or actually, earlier versions, which seemed to be better)?

Anyone can chime in! I just want a working local model that is at least as good as Sonnet 4.5, or 3.x.

scosman 10/31/2025||
Nothing open is quite as good as Sonnet 4.5 and Codex 5. GLM 4.6, MiniMax M2, DeepSeek v3.2, Kimi K2, and Qwen3 Coder are close. But those are hundreds of billions of parameters, so running them locally is very, very expensive.
johnisgood 10/31/2025||
That is unfortunate. I will never be able to afford hardware that could run them. :(
garethsprice 10/31/2025|||
It's fun for me. This is a good enough reason to do anything.

I learn a lot about how LLMs work and how to work with them.

I can also ask my dumbest questions to a local model and get a response faster, without burning tokens that count towards usage limits on the hosted services I use for actual work.

Definitely a hobby-category activity though; don't feel you're missing out on some big advantage (yet, anyway) unless you feel a great desire to set fire to thousands of dollars in exchange for spending your evenings untangling CUDA driver issues and wondering if that weird smell is your GPU melting. Some people are into that sort of thing, though.

nprateem 10/31/2025||
Deep-seated paranoia, delusions of grandeur, bragging rights, etc, etc.
Greenpants 11/1/2025||
I got a personal Mac Studio M4 Max with 128GB RAM for a silent, relatively power-efficient yet powerful home server. It runs Ollama + Open WebUI with GPT-OSS 120b as well as GLM4.5-Air (default quantisations). I rarely ever use ChatGPT anymore. Love that all data stays at home. I connect remotely only via VPN (my phone enables this automatically via Tasker).

I'm 50% brainstorming ideas with it, asking critical questions and learning something new. The other half is actual development, where I describe very clearly what I know I'll need (usually in TODOs in comments) and it writes those snippets, which is my preferred way of AI assistance. I stay in the driver's seat; the model becomes the copilot. Human-in-the-loop and such. It has worked really well for my website development, other personal projects, and even professionally (my work laptop has its own Open WebUI account for separation).
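
As an illustration of that TODO-in-comments style, a made-up sketch (the function, fields, and task are purely hypothetical): spell out the spec as TODO comments, hand the block to the model, and have it fill in only the marked section.

    # Hypothetical example of TODO-driven prompting: the comments are the spec.
    import json
    from pathlib import Path

    def load_sessions(log_dir: Path) -> list[dict]:
        """Collect per-session stats from JSON log files."""
        sessions = []
        for path in sorted(log_dir.glob("*.json")):
            # TODO: parse the file as JSON and skip files that fail to parse
            # TODO: keep only records where record["event"] == "session_end"
            # TODO: append {"id": record["id"], "duration_s": record["duration_s"]} to sessions
            ...
        return sessions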

mark_l_watson 11/1/2025|
I like your method of adding TODOs in your code and then using a model - I am going to try that. I only have a 32GB M2 Mac, so I have to use Ollama Cloud to run some of the larger models, but that said, I am surprised by what I can do 'all local', and it really is magical running everything on my own hardware when I can.
Greenpants 11/1/2025||
The TODOs really help me get my logic sorted out first in pseudocode. Glad to inspire someone else with it!

I've read that GPT-OSS 20B is still a very powerful model; I believe it fits in your Mac's RAM as well and should still be quite fast. For me personally, only the more complex questions require a better model than the local ones, and then I'm often wondering whether LLMs are the right tool to solve that complexity.

brendoelfrendo 10/31/2025||
I keep mine pretty simple: my desktop at home has an AMD 7900 XT with 20GB VRAM. I use Ollama to run local models and point Zed's AI integration at it. Right now I'm mostly running Devstral 24B or an older Qwen 2.5 Coder 14B. Looking at it, I might be able to squeak by running Qwen 3 Coder 30B, so I might give it a try.
reactordev 10/31/2025||
I use LM Studio with GGUF models running on either my Apple MacBook Air M1 (it's... ok) or my Alienware x17 R2 with an RTX 3080 and a Core i9 (runs like autocomplete), in VS Code using Continue.dev.

My only complaint is that agent mode needs good token generation, so I only use agent mode on the RTX machine.

I grew up on 9600 baud, so I'm cool with watching the text crawl.

mjgs 10/31/2025||
I use podman compose to spin up an Open WebUI container and various llama.cpp containers, one for each model. Nothing fancy like a proxy or anything; I just connect directly. I also use the Continue extension inside VS Code, and I always use devcontainers when I'm working with any LLMs.

I had to create a custom llama.cpp image compiled with Vulkan so the LLMs can access the GPU on my MacBook Air M4 from inside the containers for inference. It's much faster - like 8-10x faster than without.

To be honest, so far I've mostly been using cloud models for coding; the local models haven't been that great.

Some more details on the blog: https://markjgsmith.com/posts/2025/10/12/just-use-llamacpp

wongarsu 10/31/2025|
$work has a GPU server running Ollama; I connect to it using the continue.dev VS Code extension. Just ignore the login prompts and set up models via the config.yaml.

In terms of models, qwen2.5-coder:3b is a good compromise for autocomplete; as the agent, choose pretty much the biggest SOTA model you can run.
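
A minimal sketch of hitting a shared Ollama server like that from Python with the ollama client package; the hostname is a placeholder:

    # Sketch: query a remote Ollama server (hostname is a placeholder; 11434 is Ollama's default port).
    from ollama import Client

    client = Client(host="http://gpu-server.internal:11434")

    resp = client.chat(
        model="qwen2.5-coder:3b",
        messages=[{"role": "user", "content": "Add type hints to: def add(a, b): return a + b"}],
    )
    print(resp["message"]["content"])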
