Posted by threeturn 2 days ago

Ask HN: Who uses open LLMs and coding assistants locally? Share setup and laptop

Dear Hackers, I’m interested in your real-world workflows for using open-source LLMs and open-source coding assistants on your laptop (not just cloud/enterprise SaaS). Specifically:

Which model(s) are you running, with which runtime (e.g., Ollama, LM Studio, or others), and which open-source coding assistant/integration (for example, a VS Code plugin) are you using?

What laptop hardware do you have (CPU, GPU/NPU, memory, discrete or integrated GPU, OS), and how does it perform for your workflow?

What kinds of tasks do you use it for (code completion, refactoring, debugging, code review), and how reliable is it (what works well / where it falls short)?

I'm conducting my own investigation, which I'll be happy to share here when it's done.

Thanks! Andrea.

334 points | 181 comments
lreeves 2 days ago|
I sometimes still code with a local LLM but can't imagine doing it on a laptop. I have a server that has GPUs and runs llama.cpp behind llama-swap (letting me switch between models quickly). The best local coding setup I've been able to do so far is using Aider with gpt-oss-120b.
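
For anyone wanting to replicate that, the rough wiring looks something like this (port and model name are placeholders for whatever your llama-swap config exposes):

  export OPENAI_API_BASE=http://localhost:8080/v1   # llama-swap's OpenAI-compatible endpoint
  export OPENAI_API_KEY=none                        # llama.cpp doesn't check the key
  aider --model openai/gpt-oss-120b                 # "openai/" prefix = generic OpenAI-compatible backend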

I guess you could get a Ryzen AI Max+ with 128GB RAM to try and do that locally, but non-NVIDIA hardware is incredibly slow for coding usage, since the prompts become very large and take exponentially longer to process. Then again, gpt-oss is a sparse model, so maybe it won't be that bad.

Also, just to point it out: if you use OpenRouter with things like Aider or roocode or whatever, you can flag your account to only use providers with a zero-data-retention policy if you are truly concerned about anyone training on your source code. GPT-5 and Claude are infinitely better, faster and cheaper than anything I can do locally, and I have a monster setup.

fm2606 2 days ago||
gpt-oss-120b is amazing. I created a RAG agent over most of the GCP documentation (separate download, parsing, chunking, etc.). ChatGPT finished a 50-question quiz in 6 minutes with a score of 46/50. gpt-oss-120b took over an hour but got 47/50. All the other local LLMs I tried were small and performed way worse, like less than 50% correct.

I ran this on an i7 with 64GB of RAM and an old NVIDIA card with 8GB of VRAM.

EDIT: Forgot to say what the RAG system was doing, which was answering a 50-question multiple-choice test about GCP and cloud engineering.

embedding-shape 2 days ago|||
> gpt-oss-120b is amazing

Yup, I agree: easily the best local model you can run on local hardware today, especially when reasoning_effort is set to "high", though "medium" does very well too.

I think people missed out on how great it was because a bunch of the runners botched their implementations at launch; it wasn't until 2-3 weeks afterward that you could properly evaluate it. Once I could run the evaluations myself on my own tasks, it really became evident how much better it is.

If you haven't tried it yet, or you tried it very early after the release, do yourself a favor and try it again with updated runners.

giorgioz 1 day ago||||
On what hardware do you manage to run gpt-oss-120b locally?
whatreason 1 day ago||||
What do you use to run gpt-oss here? Ollama, vLLM, etc.?
embedding-shape 1 day ago||
Not the parent, but I'm a frequent user of GPT-OSS and have tried all the different ways of running it. The choice goes something like this:

- Need batching + highest total throughput? vLLM. It's complicated to deploy and install though, and you need special versions for top performance with GPT-OSS.

- Easiest to manage + fast enough: llama.cpp. It's easier to deploy as well (just a binary) and super fast; I'm getting ~260 tok/s on an RTX Pro 6000 for the 20B version (rough launch command below).

- Easiest for people not used to running shell commands, or who need a GUI and don't care much about performance: Ollama.

Then if you really wanna go fast, try to get TensorRT running on your setup, and I think that's pretty much the fastest GPT-OSS can go currently.
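
For the llama.cpp route, a minimal launch looks roughly like this (a reasonably recent build is assumed; flags can differ by version):

  # serve the 20B with a 16k context on the default OpenAI-compatible endpoint
  llama-server -hf ggml-org/gpt-oss-20b-GGUF --jinja -c 16384 --port 8080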

rovr138 1 day ago||||
> I created a RAG agent to hold most of GCP documentation (separate download, parsing, chunking, etc)

If you could share the scripts you used to gather the GCP documentation, that'd be great. I've had an idea to do something like this, and the part I don't want to deal with is getting the data.

fm2606 18 hours ago||
I tried scripts but got blocked. I used wget to download them.
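
Roughly the usual mirror incantation, something like the below (not the exact flags, and expect the docs site to rate-limit or block you):

  wget --mirror --no-parent --convert-links --adjust-extension --wait=1 \
       https://cloud.google.com/docs/
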
gkfasdfasdf 1 day ago||||
What were you using for RAG? Did you build your own or use some off-the-shelf solution (e.g. OpenWebUI)?
fm2606 18 hours ago||
I used pgvector, chunking on paragraphs. The answers I saved to a flat text file and then parsed into what I needed.

For parsing and vectorizing of the GCP docs I used a Python script. For reading each quiz question, getting a text embedding and submitting to an LLM, I used Spring AI.

It was all roll your own.

But like I stated in my original post, I deleted it without a backup or version control. It was the wrong directory that I deleted. Rookie mistake, and I know better.

lacoolj 1 day ago||||
you can run the 120b model on an 8GB GPU? or are you running this on CPU with the 64GB RAM?

I'm about to try this out lol

The 20b model is not great, so I'm hoping 120b is the golden ticket.

ThatPlayer 1 day ago|||
With MoE models like gpt-oss, you can run some layers on the CPU (and some on GPU): https://github.com/ggml-org/llama.cpp/discussions/15396

Mentions 120b is runnable on 8GB VRAM too: "Note that even with just 8GB of VRAM, we can adjust the CPU layers so that we can run the large 120B model too"
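
The trick, per that discussion, is to keep attention on the GPU and push the MoE expert weights to the CPU. A rough sketch, assuming a recent llama.cpp (the flag spelling varies by version; older builds use -ot ".ffn_.*_exps.=CPU" instead):

  # keep the MoE expert weights of ~30 layers on the CPU; attention and the rest stay on the 8GB GPU
  llama-server -hf ggml-org/gpt-oss-120b-GGUF --n-cpu-moe 30 -c 8192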

gunalx 1 day ago||||
I have in many cases had better results with the 20b model over the 120b model, mostly because it is faster and I can iterate on prompts quicker to coerce it into following instructions.
embedding-shape 1 day ago||
> had better results with the 20b model, over the 120b model

The quality and accuracy of the responses differ vastly between the two though, if tok/s isn't your biggest priority, especially when using reasoning_effort "high". 20B works great for small-ish text summarization and title generation, but for even moderately difficult programming tasks, 20B fails repeatedly while 120B gets it right on the first try.

fm2606 1 day ago||||
With everything I run, even the small models, some amount goes to the GPU and the rest to RAM.
fm2606 1 day ago|||
Hmmm...now that you say that, it might have been the 20b model.

And like a dumbass I accidentally deleted the directory, and it wasn't backed up or under version control.

Either way, I do know for a fact that the gpt-oss-XXb model beat ChatGPT by one answer: 46/50 at 6 minutes versus 47/50 at 1+ hour. I remember because I was blown away that I could get that type of result running locally, and I had texted a friend about it.

I was really impressed, but disappointed at the huge disparity in time between the two.

adastra22 1 day ago|||
What quantization settings?
neilv 1 day ago||
https://github.com/mostlygeek/llama-swap
egberts1 1 day ago||
Ollama, 16-CPU Xeon E6320 (old), 1.9GHz, 120GB DDRAM4, 240TB RAID5 SSDs, on a Dell Precision T710 ("The Beast"). NO GPU. 20b (noooooot faaahst at all). Pure CPU bound. Tweaked for 256KB chunking into RAG.

Ingested the election laws of the 50 states, the territories, and the Federal government.

Goal: mapping out each feature of the election process and dealing with the (in)consistent terminologies sprouted by different university-trained public administrators. This is the crux of the hallucinations: getting a diagram of ballot handling and its terminologies.

Then maybe tackle the multitude of ways election irregularities happen, or at least point out integrity gaps in various locales.

https://figshare.com/articles/presentation/Election_Frauds_v...

banku_brougham 1 day ago|
bravo, this is a great use of talent for society
gcr 1 day ago||
For new folks, you can get a local code agent running on your Mac like this:

1. $ npm install -g @openai/codex

2. $ brew install ollama; ollama serve

3. $ ollama pull gpt-oss:20b

4. $ codex --oss -m gpt-oss:20b

This runs locally without Internet. Idk if there’s telemetry for codex, but you should be able to turn that off if so.

You need an M1 Mac or better with at least 24GB of GPU memory. The model is pretty big, about 16GB of disk space in ~/.ollama

Be careful: the 120b model is 1.5× better than this 20b variant, but it has about 5× the hardware requirements.

windexh8er 1 day ago||
I've been really impressed by OpenCode [0]. The limitations of the frontier TUIs are removed, and it is feature-complete and performant compared to Codex or Claude Code.

[0] https://opencode.ai/

embedding-shape 1 day ago||
> OpenCode will be available on desktop soon

Anyone happen to know what that means exactly? The install instructions at the top seem to indicate it is already available on desktop?

windexh8er 1 day ago||
It's a terminal only (TUI) tool today. They're releasing a graphical (GUI) version in the future.
embedding-shape 1 day ago||
> It's a terminal only (TUI) tool today.

But to use that TUI you need a desktop, or at least a laptop I guess, so that distinction doesn't make sense. Are they referring to the GUI as the "desktop version"? Never heard it put that way before, if so.

nickthegreek 1 day ago|||
Have you been able to build or iterate on anything of value using just the 20b to vibe code?
abacadaba 1 day ago|||
As much as I've been using LLMs via API all day every day, being able to run one locally on my MacBook Air and talk to my laptop still feels like magic.
giancarlostoro 1 day ago||
LM Studio is even easier, and things like JetBrains IDEs will sync to LM Studio, same with Zed.
dust42 2 days ago||
On a MacBook Pro 64GB I use the Qwen3-Coder-30B-A3B Q4 quant with llama.cpp.

For VS Code I use continue.dev, as it allows me to set my own (short) system prompt. I get around 50 tokens/sec generation and 550 t/s prompt processing.

When given well-defined small tasks, it is as good as any frontier model.

I like the speed and low latency and the availability while on the plane/train or off-grid.

Also decent FIM with the llama.cpp VSCode plugin.

If I need more intelligence my personal favourites are Claude and Deepseek via API.

redblacktree 2 days ago||
Would you use a different quant with a 128 GB machine? Could you link the specific download you used on huggingface? I find a lot of the options there to be confusing.
dust42 1 day ago|||
I usually use unsloth quants, in this case https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-... - the Q4_K_M variant.

On 128GB I would definitely run a larger model, probably one with ~10B active parameters. It all depends on how many tokens per second are comfortable for you.

To get an idea of the speed difference, there is a benchmark page for llama.cpp on Apple silicon here: https://github.com/ggml-org/llama.cpp/discussions/4167

About quant selection: https://gist.github.com/Artefact2/b5f810600771265fc1e3944228...

And my workaround for 'shortening' prompt processing time: I load the files I want to work on (usually 1-3) into context with the instruction "read the code and wait". While the LLM is doing the prompt processing, I write out the instructions for what I want done. Usually the LLM has long finished with PP before I'm done writing. Thanks to KV caching, the LLM then answers almost instantly.
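
(For reference, pulling and serving that quant straight from Hugging Face with llama.cpp looks roughly like the below; the repo:quant syntax needs a reasonably recent build, and the repo name is my reading of the truncated link above.)

  llama-server -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M --jinja -c 32768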

tommy_axle 2 days ago|||
Not the OP, but yes, you can definitely get a bigger quant like Q6 if it makes a difference, or go with a bigger-parameter model like gpt-oss-120B. A 70B would probably be great for a 128GB machine, though I don't think Qwen has one. You can search Hugging Face for the model you're interested in, often with "gguf" in the query, to get a ready-to-go build (e.g. https://huggingface.co/ggml-org/gpt-oss-120b-GGUF/tree/main). Otherwise it's not a big deal to quantize it yourself using llama.cpp.
Xenograph 1 day ago|||
Have you tried continue.dev's new open completion model [1]? How does it compare to llama.vscode FIM with qwen?

[1] https://blog.continue.dev/instinct/

codingbear 1 day ago||
How are you running Qwen3 with llama.vscode? I am still using qwen-2.5-7b.

There is an open issue about adding support for Qwen3 which I have been monitoring; I would love to use Qwen3 if possible. Issue: https://github.com/ggml-org/llama.vscode/issues/55

softfalcon 2 days ago||
For anyone who wants to see some real workstations that do this, you may want to check out Alex Ziskind's channel on YouTube:

https://www.youtube.com/@AZisk

At this point, pretty much all he does is review workstations for running LLMs and other machine-learning-adjacent tasks.

I'm not his target demographic, but because I'm a dev, his videos are constantly recommended to me on YouTube. He's a good presenter and his advice makes a lot of sense.

zargon 1 day ago||
His reviews are very amateurish and slapdash. He has no test methodology and just throws together random datapoints that can't be compared to anything. In his Pro 6000 video he compares it with an M3 128GB with empty context. Then he runs a large context (a first for him!) on the 6000 and notes how long prompt processing takes, and never mentions the M3 again!
embedding-shape 1 day ago||
Seems pretty spot on for what you'd expect from YouTube developers; most of the channels are like that, to be honest. Something about "developer" + "YouTube" just seems to spawn low-effort content that tries to create drama rather than being a fair replacement for a blog post that teaches you something.
fm2606 2 days ago|||
> I'm not his target demographic

Me either, and I am a dev as well.

> He's a good presenter and his advice makes a lot of sense.

Agree.

Not that I think he forms his opinions based on who is sponsoring him, but I feel he couldn't do a lot of the stuff he does without sponsors. If the sponsors aren't supplying him with all that hardware, then, in my opinion, he is taking a significant risk buying it all out of pocket and hoping that the money he makes from YT covers it (which I am sure it does, several times over). But there is no guarantee that the money he makes from YT will cover the costs; that's the point I'm making.

But, then again, he does use the hardware in other videos, so it isn't like he is banking on a single video to cover the costs.

hereme888 1 day ago||
Dude... what a good YT channel. The guy is no nonsense, straight to the point. Thanks.
jetsnoc 2 days ago||

  Models
    gpt-oss-120b, Meta Llama 3.2, or Gemma
    (just depends on what I’m doing)

  Hardware
    - Apple M4 Max (128 GB RAM)
      paired with a GPD Win 4 running Ubuntu 24.04 over USB-C networking

  Software
    - Claude Code
    - RA.Aid
    - llama.cpp

  For CUDA computing, I use an older NVIDIA RTX 2080 in an old System76 workstation.

  Process

    I create a good INSTRUCTIONS.md for Claude/RA.Aid that specifies the task and the production process, with a task list it maintains. I use Claude Agents with an Agent Organizer that helps determine which agents to use. It creates the architecture, PRD and security design, writes the code, and then lints, tests and does a code review.
Infernal 1 day ago||
What does the GPD Win 4 do in this scenario? Is there a step w/ Agent Organizer that decides if a task can go to a smaller model on the Win 4 vs a larger model on your Mac?
altcognito 2 days ago|||
What sorts of token/s are you getting with each model?
jetsnoc 1 day ago||
Model performance summary:

  **openai/gpt-oss-120b** — MLX (MXFP4), ~66 tokens/sec @ Hugging Face: `lmstudio-community/gpt-oss-120b-MLX-8bit`

  **google/gemma-3-27b** — MLX (4-bit), ~27 tokens/sec @ Hugging Face: `mlx-community/gemma-3-27b-it-qat-4bit`

  **qwen/qwen3-coder-30b** — MLX (8-bit), ~78 tokens/sec @ Hugging Face: `Qwen/Qwen3-Coder-30B-A3B-Instruct`

Will reply back and add Meta Llama performance shortly.
CubsFan1060 2 days ago||
What is the Agent Organizer you use?
jetsnoc 2 days ago||
It’s a Claude agent prompt. I don’t recall who originally shared it, so I can’t yet attribute the source, but I’ll track that down shortly and add proper attribution here.

Here’s the Claude agent markdown:

https://github.com/lst97/claude-code-sub-agents/blob/main/ag...

Edit: Updated from the old Pastebin link to the GitHub version. Attribution found: lst97 on GitHub

nicce 1 day ago||
Looks like the Claude agent prompt was itself written by Claude...
giancarlostoro 2 days ago||
If you're going to get a MacBook, get the Pro; it has a built-in fan, and you don't want the heat just sitting there in the MacBook Air. Same with the Mac mini: get the Studio instead, it has a fan and the mini does not. I don't know about you, but I wouldn't want my brand-new laptop/desktop heating up the entire time I'm coding with zero cool-off. If you go the Mac route, I recommend getting TG Pro; the default fan settings on the Mac are awful and don't kick in soon enough, and TG Pro lets you make them a little more "sensitive" to those temperature shifts. It's about $20 if I remember correctly, but worth it.

I have a MacBook Pro with an M4 Pro chip and 24GB of RAM. I believe only 16GB of it is usable by the models, so I can run the GPT-OSS 20B model (iirc), i.e. the smaller one. It can do a bit, but the context window fills up quickly, so I do find myself starting new context windows often enough. I do wonder if a maxed-out MacBook Pro would be able to run larger context windows; then I could easily code all day with it offline.

I do think Macs are phenomenal at running local LLMs if you get the right one.

amonroe805-2 2 days ago||
Quick correction: The mac mini does have a fan. Studio is definitely more capable due to bigger, better chips, but my understanding is the mini is generally not at risk of thermal throttling with the chips you can buy it with. The decision for desktop macs really just comes down to how much chip you want to pay for.
suprjami 1 day ago|||
Correction: if you're going to get a Mac, then get a Max or Ultra with as much memory as possible; the increase in RAM bandwidth will make running larger models viable.
Terretta 2 days ago|||
And yes, for context windows / cached context, the MacBook Pro with 128GB memory is a mind boggling laptop.

The Studio Ultras are surprisingly strong as well for a pretty monitor stand.

embedding-shape 2 days ago|||
> I do think Macs are phenomenal at running local LLMs if you get the right one.

What does the prompt processing speed look like today? I think it was either an M3 or an M4 with 128GB; trying to run even slightly longer prompts took forever in the initial prompt processing, so whatever speed gain you got at inference basically didn't matter. Maybe it works better today?

giancarlostoro 1 day ago||
I have only ever used the M4 (on my wife's MacBook Air) and the M4 Pro (on my MacBook Pro), and the speeds were reasonable. I was able to tie LM Studio into PyCharm and ask it questions about code, but my context window kept running out; I don't think the 24GB model is the right choice. The key thing you also have to look out for is that I might have 24GB of RAM, but only 16GB of it can be used as VRAM. That makes me more competitive than my 3080 in terms of VRAM, though the 3080 could probably run circles around my M4 Pro if it wanted to.
trailbits 1 day ago||
The default context window in Ollama or LM Studio is small, but you can easily quadruple the default size while running gpt-oss-20b on a 24GB Mac.
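
(For example, with Ollama something like the below works; the environment variable assumes a recent build, and the numbers are just illustrative.)

  # raise the server-wide default context length
  OLLAMA_CONTEXT_LENGTH=16384 ollama serve
  # or per session, inside `ollama run gpt-oss:20b`:
  #   /set parameter num_ctx 16384
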
erikig 1 day ago||
Hardware: MacBook Pro M4 Max, 128GB

Platform: LMStudio (primarily) & Ollama

Models:

- qwen/qwen3-coder-30b A3B Instruct 8-bit MLX

- mlx-community/gpt-oss-120b-MXFP4-Q8

For code generation, especially on larger projects, these models aren't as good as the cutting-edge foundation models. For summarizing local git repos/libraries, generating documentation and simple offline command-line tool use, they do a good job.

I find these communities quite vibrant and helpful too:

- https://www.reddit.com/r/LocalLLM/

- https://www.reddit.com/r/LocalLLaMA/

mkagenius 1 day ago||
Since you are on a Mac, if you need some kind of code-execution sandbox, check out Coderunner [1], which is based on Apple's container tool and provides a way to execute any LLM-generated code without risking arbitrary code execution on your machine.

I have recently added Claude skills to it, so all the Claude skills can be executed locally on your Mac too.

1. https://github.com/instavm/coderunner

shell0x 1 day ago||
I have a Mac Studio with the M4 Max and 128GB RAM

The Qwen3-coder model you use is pretty good. You can enable the LM Studio API, install the qwen CLI and point it at the API endpoint. This basically gives you functionality similar to Claude Code.
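
(Roughly like this; the exact variable names depend on the qwen CLI version, and the port is LM Studio's default.)

  export OPENAI_BASE_URL=http://localhost:1234/v1   # LM Studio's local server
  export OPENAI_API_KEY=lm-studio                   # any non-empty value
  export OPENAI_MODEL=qwen/qwen3-coder-30b
  qwen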

I agree that the code quality is not on par with gpt5-codex and Claude. I also haven't tried z.ai's models locally yet. On a Mac of that size, GLM 4.5 Air should be able to run.

For README generation I like gemma3-27b-it-qat and gpt-oss-120b.

whitehexagon 1 day ago||
Qwen3:32b on an MBP M1 Pro 32GB running Asahi Linux. Mainly command line, for some help with ARMv8 assembly and some SoC stuff (this week, explaining the I2C protocol). I couldn't find any good intro on the web-of-ads. It's not much help with Zig, but then nothing seems to keep up with Zig at the moment.

I get a steady stream of tokens, slightly slower than my reading pace, which I find is more than fast enough. In fact I'd only replace it with the exact same, or maybe an M2 + Asahi with enough RAM to run the bigger Qwen3 model.

I saw qwen3-coder mentioned here. I didn't know about that one. Anyone got any thoughts on how it compares to qwen3? Will it also fit in 32GB?

I'm not interested in agents or tool integration, and I especially won't use anything cloud. I like to own my env and code top-to-bottom. Having also switched to Kate and Fossil, it feels like my perfect dev environment.

Currently using an older Ollama, but I will switch to llama.cpp now that Ollama has pivoted away from offline-only. I got llama.cpp installed, but I'm not sure how to reuse my models from Ollama; I thought Ollama was just a wrapper, but they seem to be different model formats?

[edit] Be sure to use it plugged in; Linux is a bit battery-heavy, and Qwen3 will pull 60W+ and flatten a battery real fast.

suprjami 1 day ago|
Qwen 3 Coder is a 30B model but an MoE with 3B active parameters, so it should be much faster for you. Give it a try.

It's not as smart as the dense 32B for general tasks, but theoretically it should be better for the sort of coding tasks that come from StackExchange.

woile 2 days ago|
I just got an AMD AI 9 HX 370 with 128GB RAM from laptopwithlinux.com and I've started using Zed + Ollama. I'm super happy with the machine and the service.

Here's my ollama config:

https://github.com/woile/nix-config/blob/main/hosts/aconcagu...

I'm not an AI power user. I like to code, and I like the AI to autocomplete snippets that are "logical". I don't use agents, and for that, it's good enough.

altcognito 2 days ago|
What sorts of token/s are you getting with qwen/gemma?