Posted by threeturn 2 days ago
Ask HN: Who uses open LLMs and coding assistants locally? Share setup and laptop
Which model(s) are you running (e.g., via Ollama, LM Studio, or others), and which open-source coding assistant/integration (for example, a VS Code plugin) are you using?
What laptop hardware do you have (CPU, GPU/NPU, memory, discrete or integrated GPU, OS), and how does it perform for your workflow?
What kinds of tasks do you use it for (code completion, refactoring, debugging, code review), and how reliable is it (what works well / where it falls short)?
I'm conducting my own investigation, which I'll be happy to share here when it's done.
Thanks! Andrea.
I guess you could get a Ryzen AI Max+ with 128GB RAM to try and do that locally, but non-NVIDIA hardware is incredibly slow for coding use, since the prompts get very large and prompt processing takes far longer. Then again, gpt-oss is a sparse model, so maybe it won't be that bad.
Also, just to point it out: if you use OpenRouter with things like Aider or Roo Code, you can flag your account to only use providers with a zero-data-retention policy if you're truly concerned about anyone training on your source code. GPT-5 and Claude are infinitely better, faster and cheaper than anything I can do locally, and I have a monster setup.
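For reference, this is roughly what the per-request version of that looks like (the account-wide ZDR toggle lives in the OpenRouter settings; the provider block below is my reading of their provider-routing options, so double-check their docs, and the model slug is just an example):

    # Rough sketch: ask OpenRouter to route only to providers that don't retain/train on your data.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",  # your OpenRouter key
    )

    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",  # example model slug
        messages=[{"role": "user", "content": "Refactor this function to be iterative."}],
        # Provider preference; as I understand it, this restricts routing to
        # providers that don't collect/train on prompts -- verify against the docs.
        extra_body={"provider": {"data_collection": "deny"}},
    )
    print(resp.choices[0].message.content)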
I ran this on an i7 with 64 GB of RAM and an old NVIDIA card with 8 GB of VRAM.
EDIT: Forgot to say what the RAG system was doing: answering a 50-question multiple-choice test about GCP and cloud engineering.
Yup, I agree: easily the best local model you can run today on local hardware, especially when reasoning_effort is set to "high", though "medium" does very well too.
I think people missed how great it is because a bunch of the runners botched their implementations at launch; it wasn't until 2-3 weeks later that you could evaluate it properly. Once I could run the evaluations myself on my own tasks, it really became evident how much better it is.
If you haven't tried it yet, or you tried it very early after the release, do yourself a favor and try it again with updated runners.
- Need batching + the highest total throughput? vLLM. Complicated to deploy and install though, and you need special versions for top performance with GPT-OSS
- Easiest to manage + fast enough: llama.cpp. Easier to deploy as well (just a binary) and super fast; I'm getting ~260 tok/s on an RTX Pro 6000 for the 20B version
- Easiest for people who aren't used to running shell commands, need a GUI, and don't care much about performance: Ollama
Then if you really wanna go fast, try to get TensorRT running on your setup; I think that's pretty much the fastest GPT-OSS can go currently.
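Worth noting that all three expose an OpenAI-compatible HTTP API, so client code barely changes when you swap runners. A minimal sketch, assuming llama-server is already running gpt-oss-20b on its default port 8080 (for Ollama you'd swap the base_url to http://localhost:11434/v1):

    # Minimal client against a local OpenAI-compatible server (llama.cpp's llama-server here).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

    resp = client.chat.completions.create(
        model="gpt-oss-20b",  # llama-server mostly ignores this; Ollama/vLLM want the exact model name
        messages=[{"role": "user", "content": "Write a Python function that parses an ISO 8601 date."}],
    )
    print(resp.choices[0].message.content)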
If you could share the scripts you used to gather the GCP documentation, that'd be great. I've had an idea to do something like this, and the part I don't want to deal with is getting the data.
For parsing and vectorizing of the GCP docs I used a Python script. For reading each quiz question, getting a text embedding and submitting to an LLM, I used Spring AI.
It was all roll your own.
But like I stated in my original post, I deleted it without a backup or version control. It was the wrong directory that I deleted. Rookie mistake; I know better.
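From memory, the vectorizing step looked roughly like this. This is a reconstruction, not the original script, and it assumes an Ollama embedding model like nomic-embed-text pulled locally; the file names are hypothetical:

    # Rough reconstruction of the "parse and vectorize the GCP docs" step (not the original script).
    import json
    import requests

    OLLAMA_URL = "http://localhost:11434/api/embeddings"
    EMBED_MODEL = "nomic-embed-text"  # assumption: any local embedding model works

    def embed(text: str) -> list[float]:
        r = requests.post(OLLAMA_URL, json={"model": EMBED_MODEL, "prompt": text})
        r.raise_for_status()
        return r.json()["embedding"]

    def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
        # Naive fixed-size chunking with overlap; good enough for docs pages.
        return [text[i:i + size] for i in range(0, len(text), size - overlap)]

    index = []  # list of (embedding, chunk_text) pairs
    for path in ["gcp_docs/compute.txt", "gcp_docs/iam.txt"]:  # hypothetical files
        with open(path, encoding="utf-8") as f:
            for piece in chunk(f.read()):
                index.append((embed(piece), piece))

    with open("gcp_index.json", "w") as f:
        json.dump([{"embedding": e, "text": t} for e, t in index], f)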
I'm about to try this out lol
The 20b model is not great, so I'm hoping 120b is the golden ticket.
Mentions 120b is runnable on 8GB VRAM too: "Note that even with just 8GB of VRAM, we can adjust the CPU layers so that we can run the large 120B model too"
The quality and accuracy of the responses differ vastly between the two, though, if tok/s isn't your biggest priority, especially when using reasoning_effort "high". 20B works great for small-ish text summarization and title generation, but on even moderately difficult programming tasks, 20B fails repeatedly while 120B gets it right on the first try.
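For anyone wondering what "adjust the CPU layers" translates to in practice: with llama-cpp-python it's the n_gpu_layers knob, so whatever fits in your 8 GB of VRAM goes on the GPU and the rest stays on the CPU. A rough sketch, with a hypothetical local GGUF path:

    # Sketch: partial GPU offload so a model bigger than VRAM still runs (slowly).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./gpt-oss-120b-Q4.gguf",  # hypothetical local GGUF file
        n_gpu_layers=12,   # tune upward until you hit out-of-memory on 8 GB VRAM
        n_ctx=8192,        # context window; bigger costs more memory
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain what a KV cache is in two sentences."}]
    )
    print(out["choices"][0]["message"]["content"])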
And like a dumbass I accidentally deleted the directory, and it wasn't backed up or under version control.
Either way, I do know for a fact that the gpt-oss-XXb model beat ChatGPT by one answer: it was 46/50 at 6 minutes and 47/50 at 1+ hour. I remember because I was blown away that I could get that kind of result running locally, and I had texted a friend about it.
I was really impressed, but disappointed at the huge disparity in time between the two.
Ingested the election laws of all 50 states, the territories, and the federal government.
Goal: map out each feature of the election process and deal with the (in)consistent terminology sprouted by different university-trained public administrations. This is the crux of the hallucinations: getting a diagram of ballot handling and its terminology.
Then maybe tackle the multitude of ways election irregularities happen, or at least point out integrity gaps at various locales.
https://figshare.com/articles/presentation/Election_Frauds_v...
1. $ npm install -g @openai/codex
2. $ brew install ollama; ollama serve
3. $ ollama pull gpt-oss:20b
4. $ codex --oss -m gpt-oss:20b
This runs locally without Internet. Idk if there’s telemetry for codex, but you should be able to turn that off if so.
You need an M1 Mac or better with at least 24GB of GPU memory. The model is pretty big, about 16GB of disk space in ~/.ollama
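If you want to convince yourself it's really all local, you can hit Ollama's API on localhost directly (default port 11434); something like:

    # Quick sanity check that Ollama is serving on localhost and has the model pulled.
    import requests

    tags = requests.get("http://localhost:11434/api/tags").json()
    print([m["name"] for m in tags.get("models", [])])  # should include "gpt-oss:20b"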
Be careful: the 120b model is 1.5× better than this 20b variant, but has roughly 5× the hardware requirements.
Anyone happen to know what that means exactly? The install instructions at the top seem to indicate it's already available on desktop?
But to use that TUI you need a desktop, or at least a laptop I guess, so the distinction doesn't make sense. Are they referring to the GUI as the "Desktop Version"? I've never heard it put that way before, if so.
For VS Code I use continue.dev, as it allows me to set my own (short) system prompt. I get around 50 tokens/sec generation and 550 t/s prompt processing.
When given well-defined small tasks, it is as good as any frontier model.
I like the speed and low latency and the availability while on the plane/train or off-grid.
Also decent FIM with the llama.cpp VSCode plugin.
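For context, FIM here means fill-in-the-middle completion; llama.cpp's server exposes an /infill endpoint for it. A rough sketch of what the plugin is doing under the hood (endpoint and field names are from memory of the llama-server docs, so verify locally):

    # Sketch of a fill-in-the-middle (FIM) request against a running llama-server.
    import requests

    resp = requests.post(
        "http://localhost:8080/infill",
        json={
            "input_prefix": "def median(values):\n    s = sorted(values)\n",
            "input_suffix": "\n    return result\n",
            "n_predict": 64,
        },
    )
    print(resp.json()["content"])  # the code the model fills in between prefix and suffix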
If I need more intelligence my personal favourites are Claude and Deepseek via API.
On 128GB I would definitely run a larger model, probably with ~10B active parameters. All depends how many tokens per second is comfortable for you.
To get an idea of the speed difference, there is a benchmark page for llama.cpp on Apple silicon here: https://github.com/ggml-org/llama.cpp/discussions/4167
About quant selection: https://gist.github.com/Artefact2/b5f810600771265fc1e3944228...
And my workaround for 'shortening' prompt processing time: I load the files I want to work on (usually 1-3) into context with the instruction "read the code and wait." While the LLM is doing the prompt processing, I write my instructions for what I want done. Usually the LLM is long finished with PP before I'm finished writing, and thanks to KV caching it then answers almost instantly.
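In client terms the trick is just keeping the same conversation prefix so the server can reuse the cached KV state. A sketch of what I mean, against any local OpenAI-compatible server (e.g. llama-server); the file paths are hypothetical:

    # Sketch: prime the KV cache with the files first, then send the real instruction.
    from openai import OpenAI
    from pathlib import Path

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")
    model = "local-model"  # llama-server largely ignores the name

    files = "\n\n".join(Path(p).read_text() for p in ["src/parser.py", "src/lexer.py"])  # hypothetical
    history = [{"role": "user", "content": files + "\n\nRead the code and wait."}]

    # First call: the server does the expensive prompt processing now.
    first = client.chat.completions.create(model=model, messages=history)
    history.append({"role": "assistant", "content": first.choices[0].message.content})

    # The follow-up shares the same prefix, so the cached KV state makes it near-instant.
    history.append({"role": "user", "content": "Now refactor the tokenizer to support string escapes."})
    answer = client.chat.completions.create(model=model, messages=history)
    print(answer.choices[0].message.content)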
There is an open issue about adding support for Qwen3 which I have been monitoring; would love to use Qwen3 if possible. Issue: https://github.com/ggml-org/llama.vscode/issues/55
https://www.youtube.com/@AZisk
At this point, pretty much all he does is review workstations for running LLMs and other machine-learning-adjacent tasks.
I'm not his target demographic, but because I'm a dev, his videos are constantly recommended to me on YouTube. He's a good presenter and his advice makes a lot of sense.
> He's a good presenter and his advice makes a lot of sense.

Agree.
Not that I think he bases his answers on who is sponsoring him, but I feel he couldn't do a lot of the stuff he does without sponsors. If the sponsors aren't supplying him with all that hardware, then, in my opinion, he is taking a significant risk buying it out of pocket and hoping that the money he makes from YT covers it (which I am sure it does, several times over). But there is no guarantee that YT revenue will cover the costs; that's the point I'm making.
Then again, he does use the hardware in other videos, so it isn't like he is banking on a single video to cover the costs.
Models
gpt-oss-120b, Meta Llama 3.2, or Gemma
(just depends on what I’m doing)
Hardware
- Apple M4 Max (128 GB RAM)
paired with a GPD Win 4 running Ubuntu 24.04 over USB-C networking
Software
- Claude Code
- RA.Aid
- llama.cpp
For CUDA computing, I use an older NVIDIA RTX 2080 in an old System76 workstation.
Process
I create a good INSTRUCTIONS.md for Claude/RA.Aid that specifies the task and production process, with a task list it maintains. I use Claude Agents with an Agent Organizer that helps determine which agents to use. It creates the architecture, PRD and security design, writes the code, and then lints, tests and does a code review.

Performance

**openai/gpt-oss-120b** — MLX (MXFP4), ~66 tokens/sec @ Hugging Face: `lmstudio-community/gpt-oss-120b-MLX-8bit`
**google/gemma-3-27b** — MLX (4-bit), ~27 tokens/sec @ Hugging Face: `mlx-community/gemma-3-27b-it-qat-4bit`
**qwen/qwen3-coder-30b** — MLX (8-bit), ~78 tokens/sec @ Hugging Face: `Qwen/Qwen3-Coder-30B-A3B-Instruct`
Will reply back and add Meta Llama performance shortly. Here's the Claude agent markdown:
https://github.com/lst97/claude-code-sub-agents/blob/main/ag...
Edit: Updated from the old Pastebin link to the GitHub version. Attribution found: lst97 on GitHub
I have a MacBook Pro with an M4 Pro chip and 24GB of RAM. I believe only 16GB of it is usable by the models, so I can run the smaller GPT-OSS model, the 20B (iirc), but not the big one. It can do a bit, but the context window fills up quickly, so I find myself switching contexts often enough. I do wonder whether a maxed-out MacBook Pro could run larger context windows; then I could easily code all day with it offline.
I do think Macs are phenomenal at running local LLMs if you get the right one.
The Studio Ultras are surprisingly strong as well for a pretty monitor stand.
What does the prompt processing speed look like today? I think it was either the M3 or M4 with 128GB: trying to run even slightly longer prompts took forever in the initial prompt processing, so whatever speed gain you got at inference basically didn't matter. Maybe it works better today?
Platform: LMStudio (primarily) & Ollama
Models:
- qwen/qwen3-coder-30b A3B Instruct 8-bit MLX
- mlx-community/gpt-oss-120b-MXFP4-Q8
For code generation especially for larger projects, these models aren't as good as the cutting edge foundation models. For summarizing local git repos/libraries, generating documentation and simple offline command-line tool-use they do a good job.
I find these communities quite vibrant and helpful too:
I have recently added Claude skills to it, so all the Claude skills can be executed locally on your Mac too.
The Qwen3-coder model you use is pretty good. You can enable the LM Studio API, install the qwen CLI, and point it at the API endpoint. This basically gives you functionality similar to Claude Code.
I agree that the code quality is not on par with gpt5-codex and Claude. I also haven't tried z.ai's models locally yet; I think GLM 4.5 Air should be able to run on a Mac of that size.
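Pointing any OpenAI-compatible client at LM Studio works the same way the qwen CLI does; with the local server enabled (port 1234 by default), it's basically just a base URL plus whatever model identifier LM Studio lists:

    # Sketch: talk to LM Studio's local OpenAI-compatible server directly.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # key is ignored locally

    resp = client.chat.completions.create(
        model="qwen/qwen3-coder-30b",  # whatever identifier LM Studio shows for the loaded model
        messages=[{"role": "user", "content": "Write a unit test for an LRU cache class."}],
    )
    print(resp.choices[0].message.content)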
For README generation I like gemma3-27b-it-qat and gpt-oss-120b.
I get a steady stream of tokens, slightly slower than my reading pace, which I find more than fast enough. In fact I'd only replace it with the exact same machine, or maybe an M2 + Asahi with enough RAM to run the bigger Qwen3 model.
I saw qwen3-coder mentioned here; I didn't know about that one. Anyone got any thoughts on how it compares to qwen3? Will it also fit in 32GB?
I'm not interested in agents or tool integration, and I especially won't use anything cloud. I like to own my env and code top to bottom. Having also switched to Kate and Fossil, it feels like my perfect dev environment.
Currently using an older Ollama, but I will switch to llama.cpp now that Ollama has pivoted away from offline-only. I got llama.cpp installed, but I'm not sure how to reuse my models from Ollama; I thought Ollama was just a wrapper, but they seem to be different model formats?
[edit] Be sure to use it plugged in; Linux is a bit battery-heavy, and Qwen3 will pull 60W+ and flatten a battery real fast.
It's not as smart as dense 32B for general tasks, but theoretically should be better for the sort of coding tasks from StackExchange.
Here's my ollama config:
https://github.com/woile/nix-config/blob/main/hosts/aconcagu...
I'm not an AI power user. I like to code, and I like the AI to autocomplete snippets that are "logical". I don't use agents, and for that, it's good enough.