
Posted by threeturn 3 days ago

Ask HN: Who uses open LLMs and coding assistants locally? Share setup and laptop

Dear Hackers, I’m interested in your real-world workflows for using open-source LLMs and open-source coding assistants on your laptop (not just cloud/enterprise SaaS). Specifically:

Which model(s) are you running (e.g., Ollama, LM Studio, or others) and which open-source coding assistant/integration (for example, a VS Code plugin) are you using?

What laptop hardware do you have (CPU, GPU/NPU, memory, whether discrete GPU or integrated, OS) and how does it perform for your workflow?

What kinds of tasks do you use it for (code completion, refactoring, debugging, code review) and how reliable is it (what works well / where it falls short)?

I'm conducting my own investigation, which I'll be happy to share as well when it's done.

Thanks! Andrea.

338 points | 184 comments
vinhnx 3 days ago|
> Which model(s) are you running (e.g., Ollama, LM Studio, or others) and which open-source coding assistant/integration (for example, a VS Code plugin) you’re using?

Open-source coding assistant: VT Code (my own coding agent -- github.com/vinhnx/vtcode). Model: gpt-oss-120b, remotely hosted via Ollama Cloud (experimental).

> What laptop hardware do you have (CPU, GPU/NPU, memory, whether discrete GPU or integrated, OS) and how it performs for your workflow?

MacBook Pro M1

> What kinds of tasks you use it for (code completion, refactoring, debugging, code review) and how reliable it is (what works well / where it falls short).

Full agentic coding workflows (debugging, refactoring, refining, and sandboxed test execution). VT Code is still in preview and under active development, but it is mostly stable.
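
For reference, a minimal sketch of what the model side looks like against an Ollama-style /api/chat endpoint; the host, model tag, and prompt here are assumptions (a local install listens on localhost:11434, and the hosted cloud variant needs its own host and auth):

    import requests

    # Minimal chat call against an Ollama-style /api/chat endpoint.
    # Host and model tag are assumptions: a local install listens on
    # localhost:11434; the cloud variant needs its own host and API key.
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "gpt-oss:120b",  # assumed tag; use whatever `ollama list` shows
            "messages": [{"role": "user", "content": "Explain this stack trace: ..."}],
            "stream": False,
        },
    )
    print(resp.json()["message"]["content"])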

jdthedisciple 3 days ago|
Wait, Ollama Cloud has a free tier?

Sounds too good. Where's the catch? And is it private?

bradfa 3 days ago|||
The catch is that Ollama Cloud is likely to increase prices and/or lower usage limits soon. The free tier has more restrictions than their $20/mo tier. They claim not to store anything (https://ollama.com/cloud), but you'll have to clarify what you mean by "private" (your model likely runs on shared hardware with other users).
vinhnx 3 days ago||
I agree, "free" usage can come with tradeoffs. But for side projects and experiments, to access open-source models like gpt-oss that my machine can't run, I think I'll accept it.
bradfa 2 days ago||
My experience with the free tier and qwen3-coder cloud is that the hourly limit gets you about 250k input tokens, and then your usage is paused until the hour is up. Enough to try something very small.
vinhnx 3 days ago|||
Yeah, Ollama recently announced Cloud. I think it's still in beta and free, and usage is generous too, enough to build and hack on. But I'm not sure about data training; I don't see a setting to turn that off.
BirAdam 3 days ago||
Mac Studio, M4 Max

LM Studio + gpt-oss + aider

Works quite quickly. Sometimes I just chat with it via LM Studio when I need a general idea for how to proceed with an issue. Otherwise, I typically use aider to do some pair programming work. It isn't always accurate, but it's often at least useful.
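
If anyone wants to point their own scripts at the same local model, LM Studio exposes an OpenAI-compatible server; a minimal sketch, where the port and model ID are assumptions (check what LM Studio shows in its server tab):

    from openai import OpenAI

    # LM Studio's local server speaks the OpenAI API; the port and
    # model ID below are assumptions -- check LM Studio's server tab.
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    resp = client.chat.completions.create(
        model="openai/gpt-oss-20b",  # assumed ID; use the one LM Studio lists
        messages=[{"role": "user", "content": "Sketch a plan to refactor this module."}],
    )
    print(resp.choices[0].message.content)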

__mharrison__ 3 days ago||
I have a MBP with 128GB.

Here's the pull request I made to Aider for using local models:

https://github.com/Aider-AI/aider/issues/4526

dnel 2 days ago||
I recently picked up a Threadripper 3960X, 256GB of DDR4, and an RTX 2080 Ti 11GB, running Debian 13 with Open WebUI and Ollama.

It runs well, not much different from Claude etc., but I'm still learning the ropes and how to get the best out of it and local LLMs in general. Having tonnes of memory is nice for switching models in Ollama quickly since everything stays in cache.

The GPU memory is the weak point though, so I'm mostly using models up to ~18B parameters that can fit in the VRAM.
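
A quick way to see what you've pulled and roughly what fits: Ollama's local API lists each model with its size on disk. A sketch, assuming a default install on localhost (actual VRAM use also depends on context length and KV cache):

    import requests

    VRAM_GIB = 11  # RTX 2080 Ti

    # /api/tags lists locally pulled models with their size in bytes.
    models = requests.get("http://localhost:11434/api/tags").json()["models"]
    for m in sorted(models, key=lambda x: x["size"]):
        size_gib = m["size"] / 2**30
        fits = "fits" if size_gib < VRAM_GIB else "spills to RAM"
        print(f"{m['name']:40s} {size_gib:5.1f} GiB  ({fits})")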

dboreham 3 days ago||
I've run smaller models (I forget which ones; this was about a year ago) on my laptop just to see what happened. I was quite surprised that I could get them to write simple Python programs. Very surprised, actually, which led me to re-evaluate my thinking on LLMs in general. Since then I've been using the regular hosted services, since for now I don't see a worthwhile tradeoff in running models locally. Apart from the hardware needed, I'd expect to be constantly downloading O(100GB) model files as they improve on a weekly basis, and I don't have the internet capacity for that.
disambiguation 3 days ago||
Not my build and not for coding, but I've seen some experimental builds (gpt-oss-20b on a 32GB Mac mini) with Kiwix integration to make what is essentially a highly capable local, private search engine.
stuxnet79 3 days ago|
Any resources you can share for these experimental builds? This is something I was looking into setting up at some point. I'd love to take a look at examples in the wild to gauge if it's worth my time / money.

As an aside, if we ever reach a point where it's possible to run an OSS 20B model at reasonable inference speed on a MacBook Pro type of form factor, then the future is definitely here!

disambiguation 2 days ago||
In reference to this post I saw a few weeks ago:

https://lemmy.zip/post/50193734

(Lemmy is a reddit style forum)

The author mainly demos their "custom tools" and doesn't elaborate further, but IMO it's still an impressive showcase for an offline setup.

I think the big hint is "Open WebUI", which supports native function calls.

Some more searching and I found this: https://pypi.org/project/llm-tools-kiwix/

It's possible the future is now, assuming you have an M-series Mac with enough RAM. My sense is that you need ~1GB of RAM for every 1B parameters, so 32GB should in theory work here. I think Macs also get a performance boost over other hardware due to unified memory.
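
That heuristic is roughly the 8-bit case. A quick back-of-the-envelope sketch (the 1.2x overhead factor for KV cache and runtime is a guess, and MoE models complicate things):

    # Rough weight-memory estimate: params * bytes per weight, plus some
    # overhead for KV cache / runtime (the 1.2x factor is a guess).
    def est_gib(params_b, bits_per_weight, overhead=1.2):
        return params_b * 1e9 * bits_per_weight / 8 / 2**30 * overhead

    for bits in (16, 8, 4):
        print(f"20B model @ {bits}-bit: ~{est_gib(20, bits):.0f} GiB")
    # ~45 GiB at fp16, ~22 GiB at 8-bit, ~11 GiB at 4-bit --
    # so a 20B model on a 32GB Mac is comfortable at 8-bit or below.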

Spitballing aside, I'm in the same boat, saving my money, waiting for the right time. If it isn't viable already, it's damn close.

stuxnet79 22 minutes ago||
It seems like the ecosystem around these tools has matured quite rapidly. I am somewhat familiar with Open WebUI; however, the last time I played around with it, I got the sense that it was merely a front-end to Ollama and the llm command-line tool, and that it didn't have any capabilities beyond that.

I got spooked when the Ollama team started monetizing, so I ended up doing more research into llama.cpp and realized it could do everything I wanted, including serving up a web front end. Once I discovered this I sort of lost interest in Open WebUI.

I'll have to revisit all these tools again to see what's possible in the current moment.

> My sense is that you need ~1GB of RAM for every 1B parameters, so 32GB should in theory work here. I think Macs also get a performance boost over other hardware due to unified memory.

This is a handy heuristic to work with, and the links you sent will keep me busy for the next little while. Thanks!

sharms 3 days ago||
FWIW I bought the M4 Max with 128GB and it is useful for local LLMs for OCR; I don't find it as useful for coding (à la Codex / Claude Code) with local LLMs. I find that even with GPT-5 / Claude 4.5 Sonnet trust is low, and local LLMs lower it just enough to not be as useful. The heat is also a factor: Apple makes great hardware, but I don't believe it is designed for continuous usage the way a desktop is.
KETpXDDzR 1 day ago||
I tried using Ollama to run an LLM on my A6000 for Cursor. It fits completely in VRAM. Nevertheless, it was significantly slower than Claude 4.5 Opus. Also, the support in Cursor for local models is really bad.
baby_souffle 3 days ago||
Good quality still needs more power than what a laptop can do. The local llama subreddit has a lot of people doing well with local rigs, but they are absolutely not laptop size.
codingbear 3 days ago|
I use local models for code completion only, which means models supporting FIM tokens.

My current setup is the llama-vscode plugin + llama-server running Qwen/Qwen2.5-Coder-7B-Instruct. It gives very fast completions, and I don't have to worry about internet outages taking me out of the zone.
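
For anyone curious what the plugin is doing under the hood, here is a minimal sketch of a raw fill-in-the-middle request against llama-server using Qwen2.5-Coder's FIM tokens (the port and sampling settings are assumptions; the plugin handles all of this for you):

    import requests

    prefix = "def parse_config(path):\n    "
    suffix = "\n    return config\n"

    # Qwen2.5-Coder's fill-in-the-middle template: the model generates
    # the code that belongs between prefix and suffix.
    prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

    resp = requests.post(
        "http://localhost:8080/completion",  # llama-server's default port (assumed)
        json={"prompt": prompt, "n_predict": 64, "temperature": 0},
    )
    print(resp.json()["content"])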

I do wish Qwen3 had released a 7B model supporting FIM tokens. 7B seems to be the sweet spot for fast and usable completions.

Mostlygeek 3 days ago|
qwen3-coder-30B-A3B supports FIM and should be faster than the 7B if you've got the VRAM.

I use bartowski's Q8 quant across dual 3090s and it gets up to 100 tok/sec. The Q4 quant on a single 3090 is very fast and decently smart.
