Posted by threeturn 2 days ago
Ask HN: Who uses open LLMs and coding assistants locally? Share setup and laptop
Which model(s) are you running (e.g., via Ollama, LM Studio, or others), and which open-source coding assistant/integration (for example, a VS Code plugin) are you using?
What laptop hardware do you have (CPU, GPU/NPU, memory, discrete or integrated GPU, OS), and how does it perform for your workflow?
What kinds of tasks do you use it for (code completion, refactoring, debugging, code review), and how reliable is it (what works well / where it falls short)?
I'm conducting my own investigation, which I'll be happy to share here when it's done.
Thanks! Andrea.
I guess you could get a Ryzen AI Max+ with 128GB RAM to try and do that locally, but non-NVIDIA hardware is incredibly slow for coding use, since the prompts get very large and prompt processing takes far longer. Then again, gpt-oss is a sparse model, so maybe it won't be that bad.
Also, just to point it out: if you use OpenRouter with things like Aider or Roo Code, you can flag your account to only use providers with a zero-data-retention policy if you're truly concerned about anyone training on your source code. GPT-5 and Claude are infinitely better, faster and cheaper than anything I can do locally, and I have a monster setup.
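For reference, this is roughly what the per-request version of that looks like (the account-wide ZDR toggle lives in the OpenRouter settings; the provider block below is my reading of their provider-routing options, so double-check their docs, and the model slug is just an example):

    # Rough sketch: ask OpenRouter to route only to providers that don't retain/train on your data.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",  # your OpenRouter key
    )

    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",  # example model slug
        messages=[{"role": "user", "content": "Refactor this function to be iterative."}],
        # Provider preference; as I understand it, this restricts routing to
        # providers that don't collect/train on prompts -- verify against the docs.
        extra_body={"provider": {"data_collection": "deny"}},
    )
    print(resp.choices[0].message.content)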
I ran this on an i7 with 64 GB of RAM and an old NVIDIA card with 8 GB of VRAM.
EDIT: Forgot to say what the RAG system was doing: answering a 50-question multiple-choice test about GCP and cloud engineering.
Yup, I agree: easily the best local model you can run today on local hardware, especially when reasoning_effort is set to "high", though "medium" does very well too.
I think people missed how great it is because a bunch of the runners botched their implementations at launch; it wasn't until 2-3 weeks later that you could evaluate it properly. Once I could run the evaluations myself on my own tasks, it really became evident how much better it is.
If you haven't tried it yet, or you tried it very early after the release, do yourself a favor and try it again with updated runners.
- Need batching + the highest total throughput? vLLM. Complicated to deploy and install though, and you need special versions for top performance with GPT-OSS
- Easiest to manage + fast enough: llama.cpp. Easier to deploy as well (just a binary) and super fast; I'm getting ~260 tok/s on an RTX Pro 6000 for the 20B version
- Easiest for people who aren't used to running shell commands, need a GUI, and don't care much about performance: Ollama
Then if you really wanna go fast, try to get TensorRT running on your setup; I think that's pretty much the fastest GPT-OSS can go currently.
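Worth noting that all three expose an OpenAI-compatible HTTP API, so client code barely changes when you swap runners. A minimal sketch, assuming llama-server is already running gpt-oss-20b on its default port 8080 (for Ollama you'd swap the base_url to http://localhost:11434/v1):

    # Minimal client against a local OpenAI-compatible server (llama.cpp's llama-server here).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

    resp = client.chat.completions.create(
        model="gpt-oss-20b",  # llama-server mostly ignores this; Ollama/vLLM want the exact model name
        messages=[{"role": "user", "content": "Write a Python function that parses an ISO 8601 date."}],
    )
    print(resp.choices[0].message.content)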
If you could share the scripts you used to gather the GCP documentation, that'd be great. I've had an idea to do something like this, and the part I don't want to deal with is getting the data.
For parsing and vectorizing of the GCP docs I used a Python script. For reading each quiz question, getting a text embedding and submitting to an LLM, I used Spring AI.
It was all roll your own.
But like I stated in my original post, I deleted it without a backup or version control. It was the wrong directory that I deleted. Rookie mistake; I know better.
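From memory, the vectorizing step looked roughly like this. This is a reconstruction, not the original script, and it assumes an Ollama embedding model like nomic-embed-text pulled locally; the file names are hypothetical:

    # Rough reconstruction of the "parse and vectorize the GCP docs" step (not the original script).
    import json
    import requests

    OLLAMA_URL = "http://localhost:11434/api/embeddings"
    EMBED_MODEL = "nomic-embed-text"  # assumption: any local embedding model works

    def embed(text: str) -> list[float]:
        r = requests.post(OLLAMA_URL, json={"model": EMBED_MODEL, "prompt": text})
        r.raise_for_status()
        return r.json()["embedding"]

    def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
        # Naive fixed-size chunking with overlap; good enough for docs pages.
        return [text[i:i + size] for i in range(0, len(text), size - overlap)]

    index = []  # list of (embedding, chunk_text) pairs
    for path in ["gcp_docs/compute.txt", "gcp_docs/iam.txt"]:  # hypothetical files
        with open(path, encoding="utf-8") as f:
            for piece in chunk(f.read()):
                index.append((embed(piece), piece))

    with open("gcp_index.json", "w") as f:
        json.dump([{"embedding": e, "text": t} for e, t in index], f)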
I'm about to try this out lol
The 20b model is not great, so I'm hoping 120b is the golden ticket.
Mentions 120b is runnable on 8GB VRAM too: "Note that even with just 8GB of VRAM, we can adjust the CPU layers so that we can run the large 120B model too"
The quality and accuracy of the responses differ vastly between the two, though, if tok/s isn't your biggest priority, especially when using reasoning_effort "high". 20B works great for small-ish text summarization and title generation, but on even moderately difficult programming tasks, 20B fails repeatedly while 120B gets it right on the first try.
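For anyone wondering what "adjust the CPU layers" translates to in practice: with llama-cpp-python it's the n_gpu_layers knob, so whatever fits in your 8 GB of VRAM goes on the GPU and the rest stays on the CPU. A rough sketch, with a hypothetical local GGUF path:

    # Sketch: partial GPU offload so a model bigger than VRAM still runs (slowly).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./gpt-oss-120b-Q4.gguf",  # hypothetical local GGUF file
        n_gpu_layers=12,   # tune upward until you hit out-of-memory on 8 GB VRAM
        n_ctx=8192,        # context window; bigger costs more memory
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain what a KV cache is in two sentences."}]
    )
    print(out["choices"][0]["message"]["content"])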
And like a dumbass I accidentally deleted the directory, and it wasn't backed up or under version control.
Either way, I do know for a fact that the gpt-oss-XXb model beat ChatGPT by one answer: it was 46/50 at 6 minutes and 47/50 at 1+ hour. I remember because I was blown away that I could get that kind of result running locally, and I had texted a friend about it.
I was really impressed, but disappointed at the huge disparity in time between the two.
Ingested the election laws of all 50 states, the territories, and the federal government.
Goal: map out each feature of the election process and deal with the (in)consistent terminology sprouted by different university-trained public administrations. This is the crux of the hallucinations: getting a diagram of ballot handling and its terminology.
Then maybe tackle the multitude of ways election irregularities happen, or at least point out integrity gaps at various locales.
https://figshare.com/articles/presentation/Election_Frauds_v...
1. $ npm install -g @openai/codex
2. $ brew install ollama; ollama serve
3. $ ollama pull gpt-oss:20b
4. $ codex --oss -m gpt-oss:20b
This runs locally without Internet. Idk if there’s telemetry for codex, but you should be able to turn that off if so.
You need an M1 Mac or better with at least 24GB of GPU memory. The model is pretty big, about 16GB of disk space in ~/.ollama
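If you want to convince yourself it's really all local, you can hit Ollama's API on localhost directly (default port 11434); something like:

    # Quick sanity check that Ollama is serving on localhost and has the model pulled.
    import requests

    tags = requests.get("http://localhost:11434/api/tags").json()
    print([m["name"] for m in tags.get("models", [])])  # should include "gpt-oss:20b"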
Be careful: the 120b model is 1.5× better than this 20b variant, but has roughly 5× the hardware requirements.
Anyone happen to know what that means exactly? The install instructions at the top seem to indicate it's already available on desktop?
But to use that TUI you need a desktop, or at least a laptop I guess, so the distinction doesn't make sense. Are they referring to the GUI as the "Desktop Version"? I've never heard it put that way before, if so.
For VS Code I use continue.dev, as it allows me to set my own (short) system prompt. I get around 50 tokens/sec generation and 550 t/s prompt processing.
When given well-defined small tasks, it is as good as any frontier model.
I like the speed and low latency and the availability while on the plane/train or off-grid.
Also decent FIM with the llama.cpp VSCode plugin.
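For context, FIM here means fill-in-the-middle completion; llama.cpp's server exposes an /infill endpoint for it. A rough sketch of what the plugin is doing under the hood (endpoint and field names are from memory of the llama-server docs, so verify locally):

    # Sketch of a fill-in-the-middle (FIM) request against a running llama-server.
    import requests

    resp = requests.post(
        "http://localhost:8080/infill",
        json={
            "input_prefix": "def median(values):\n    s = sorted(values)\n",
            "input_suffix": "\n    return result\n",
            "n_predict": 64,
        },
    )
    print(resp.json()["content"])  # the code the model fills in between prefix and suffix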
If I need more intelligence my personal favourites are Claude and Deepseek via API.
On 128GB I would definitely run a larger model, probably with ~10B active parameters. All depends how many tokens per second is comfortable for you.
To get an idea of the speed difference, there is a benchmark page for llama.cpp on Apple silicon here: https://github.com/ggml-org/llama.cpp/discussions/4167
About quant selection: https://gist.github.com/Artefact2/b5f810600771265fc1e3944228...
And my workaround for 'shortening' prompt processing time: I load the files I want to work on (usually 1-3) into context with the instruction "read the code and wait." While the LLM is doing the prompt processing, I write my instructions for what I want done. Usually the LLM is long finished with PP before I'm finished writing, and thanks to KV caching it then answers almost instantly.
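In client terms the trick is just keeping the same conversation prefix so the server can reuse the cached KV state. A sketch of what I mean, against any local OpenAI-compatible server (e.g. llama-server); the file paths are hypothetical:

    # Sketch: prime the KV cache with the files first, then send the real instruction.
    from openai import OpenAI
    from pathlib import Path

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")
    model = "local-model"  # llama-server largely ignores the name

    files = "\n\n".join(Path(p).read_text() for p in ["src/parser.py", "src/lexer.py"])  # hypothetical
    history = [{"role": "user", "content": files + "\n\nRead the code and wait."}]

    # First call: the server does the expensive prompt processing now.
    first = client.chat.completions.create(model=model, messages=history)
    history.append({"role": "assistant", "content": first.choices[0].message.content})

    # The follow-up shares the same prefix, so the cached KV state makes it near-instant.
    history.append({"role": "user", "content": "Now refactor the tokenizer to support string escapes."})
    answer = client.chat.completions.create(model=model, messages=history)
    print(answer.choices[0].message.content)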
There is an open issue about adding support for Qwen3 which I have been monitoring; would love to use Qwen3 if possible. Issue: https://github.com/ggml-org/llama.vscode/issues/55
https://www.youtube.com/@AZisk
At this point, pretty much all he does is review workstations for running LLMs and other machine-learning-adjacent tasks.
I'm not his target demographic, but because I'm a dev, his videos are constantly recommended to me on YouTube. He's a good presenter and his advice makes a lot of sense.
> He's a good presenter and his advice makes a lot of sense.

Agree.
Not that I think he bases his answers on who is sponsoring him, but I feel he couldn't do a lot of the stuff he does without sponsors. If the sponsors aren't supplying him with all that hardware, then, in my opinion, he is taking a significant risk buying it out of pocket and hoping that the money he makes from YT covers it (which I am sure it does, several times over). But there is no guarantee that YT revenue will cover the costs; that's the point I'm making.
Then again, he does use the hardware in other videos, so it isn't like he is banking on a single video to cover the costs.
Models
gpt-oss-120b, Meta Llama 3.2, or Gemma
(just depends on what I’m doing)
Hardware
- Apple M4 Max (128 GB RAM)
paired with a GPD Win 4 running Ubuntu 24.04 over USB-C networking
Software
- Claude Code
- RA.Aid
- llama.cpp
For CUDA computing, I use an older NVIDIA RTX 2080 in an old System76 workstation.
Process
I create a good INSTRUCTIONS.md for Claude/RA.Aid that specifies the task and production process, with a task list it maintains. I use Claude Agents with an Agent Organizer that helps determine which agents to use. It creates the architecture, PRD and security design, writes the code, and then lints, tests and does a code review.

Performance

**openai/gpt-oss-120b** — MLX (MXFP4), ~66 tokens/sec @ Hugging Face: `lmstudio-community/gpt-oss-120b-MLX-8bit`
**google/gemma-3-27b** — MLX (4-bit), ~27 tokens/sec @ Hugging Face: `mlx-community/gemma-3-27b-it-qat-4bit`
**qwen/qwen3-coder-30b** — MLX (8-bit), ~78 tokens/sec @ Hugging Face: `Qwen/Qwen3-Coder-30B-A3B-Instruct`
Will reply back and add Meta Llama performance shortly. Here's the Claude agent markdown:
https://github.com/lst97/claude-code-sub-agents/blob/main/ag...
Edit: Updated from the old Pastebin link to the GitHub version. Attribution found: lst97 on GitHub
I have a MacBook Pro with an M4 Pro chip and 24GB of RAM. I believe only 16GB of it is usable by the models, so I can run the smaller GPT-OSS model, the 20B (iirc), but not the big one. It can do a bit, but the context window fills up quickly, so I find myself switching contexts often enough. I do wonder whether a maxed-out MacBook Pro could run larger context windows; then I could easily code all day with it offline.
I do think Macs are phenomenal at running local LLMs if you get the right one.
The Studio Ultras are surprisingly strong as well for a pretty monitor stand.
What does the prompt processing speed look like today? I think it was either the M3 or M4 with 128GB: trying to run even slightly longer prompts took forever in the initial prompt processing, so whatever speed gain you got at inference basically didn't matter. Maybe it works better today?
Platform: LMStudio (primarily) & Ollama
Models:
- qwen/qwen3-coder-30b A3B Instruct 8-bit MLX
- mlx-community/gpt-oss-120b-MXFP4-Q8
For code generation especially for larger projects, these models aren't as good as the cutting edge foundation models. For summarizing local git repos/libraries, generating documentation and simple offline command-line tool-use they do a good job.
I find these communities quite vibrant and helpful too:
I have recently added Claude skills to it, so all the Claude skills can be executed locally on your Mac too.
The Qwen3-coder model you use is pretty good. You can enable the LM Studio API, install the qwen CLI, and point it at the API endpoint. This basically gives you functionality similar to Claude Code.
I agree that the code quality is not on par with gpt5-codex and Claude. I also haven't tried z.ai's models locally yet; I think GLM 4.5 Air should be able to run on a Mac of that size.
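Pointing any OpenAI-compatible client at LM Studio works the same way the qwen CLI does; with the local server enabled (port 1234 by default), it's basically just a base URL plus whatever model identifier LM Studio lists:

    # Sketch: talk to LM Studio's local OpenAI-compatible server directly.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # key is ignored locally

    resp = client.chat.completions.create(
        model="qwen/qwen3-coder-30b",  # whatever identifier LM Studio shows for the loaded model
        messages=[{"role": "user", "content": "Write a unit test for an LRU cache class."}],
    )
    print(resp.choices[0].message.content)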
For README generation I like gemma3-27b-it-qat and gpt-oss-120b.
I get a steady stream of tokens, slightly slower than my reading pace, which I find more than fast enough. In fact I'd only replace it with the exact same machine, or maybe an M2 + Asahi with enough RAM to run the bigger Qwen3 model.
I saw qwen3-coder mentioned here; I didn't know about that one. Anyone got any thoughts on how it compares to qwen3? Will it also fit in 32GB?
I'm not interested in agents or tool integration, and I especially won't use anything cloud. I like to own my env and code top to bottom. Having also switched to Kate and Fossil, it feels like my perfect dev environment.
Currently using an older Ollama, but I will switch to llama.cpp now that Ollama has pivoted away from offline-only. I got llama.cpp installed, but I'm not sure how to reuse my models from Ollama; I thought Ollama was just a wrapper, but they seem to be different model formats?
[edit] Be sure to use it plugged in; Linux is a bit battery-heavy, and Qwen3 will pull 60W+ and flatten a battery real fast.
It's not as smart as dense 32B for general tasks, but theoretically should be better for the sort of coding tasks from StackExchange.
Here's my ollama config:
https://github.com/woile/nix-config/blob/main/hosts/aconcagu...
I'm not an AI power user. I like to code, and I like the AI to autocomplete snippets that are "logical". I don't use agents, and for that, it's good enough.