Running local models on an M4 with 24GB memory

Posted by shintoist 11 hours ago

Running local models on an M4 with 24GB memory (jola.dev)

336 points | 107 comments

soganess 10 hours ago|

Getting so close to good!

I consider Gemma 4 31B (dense / no MoE) the new baseline for local models. It's obviously worse than the frontier models, but it feels less like a science experiment than any previous local model I’ve run, including GPT OSS 120B and Nemotron Super 120B.

On my M5 Max with 128 GB of RAM and the full 256K context window, I see RAM use spike to about 70 GB, with something like 14 GB of system overhead. A 64 GB Panther Lake machine with the full Arc B390, or a 48 GB Snapdragon X2 Elite machine, could probably run it with a 128K to 256K context window. Maybe you can squeeze it into 32GB (27.5GB usable) with a 32K context window?
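A rough way to sanity-check figures like these is to add the size of the quantized weights to the size of the KV cache. The sketch below is a back-of-the-envelope estimate only: the layer count, KV head count and head dimension are placeholder values rather than Gemma 4's real architecture, it assumes full attention in every layer with a 16-bit cache, and it ignores runtime and OS overhead.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: float = 2.0) -> float:
    """Approximate KV cache size in GB for full attention in every layer.

    The factor of 2 accounts for storing both keys and values.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 1e9


# Hypothetical architecture numbers, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128

for ctx in (32_000, 128_000, 256_000):
    total = weights_gb(31, 4.5) + kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, ctx)
    print(f"{ctx:>7} tokens: ~{total:.1f} GB")
```

Swapping in the real architecture values and the cache precision you actually run (q8 halves the cache term) gives a first guess at whether a given context length fits.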

Even last year, seeing this kinda performance on a mainstream-ish/plus configuration would have seemed like a pipe dream.

thot_experiment 7 hours ago||

Gemma 4 IS good. I've literally had it get a thing right that Opus 4.7 missed. The edges are ragged, and I'm reliably finding use cases where it's basically equivalent. Ultimately the metric is "what can I RELY on it to do". Opus definitely knows a lot more and can sometimes do much more complex tasks, but especially when you're good about feeding the context, Gemma is amazing. The difference between the sets of things I trust the two models to do is surprisingly small. I've had some insanely good runs recently working on my personal tooling as well as random projects. It's the first local model that can reliably be left to implement features in agentic mode on non-trivial projects.

https://thot-experiment.github.io/gradient-gemma4-31b/

This is a relatively complex piece of tooling built entirely by Gemma 4 inside OpenCode where I manually intervened maybe only 4 times over the course of a few hours.

running Q6_K_XL, 128k context @ q8 ~ 800tok/s read 16tok/sec write

eagerly awaiting turboquant and MTP in llama.cpp, should take me to 256k and 25-30tok/s if the rumors are true
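To put those rates in perspective, here is a quick calculation that takes the quoted ~800 tok/s read and 16 tok/s write at face value. It assumes both rates stay constant across the whole context, which is optimistic since they usually drop as the context fills, and the output and warm-turn token counts are made up for illustration.

```python
def turn_seconds(prompt_tokens: int, output_tokens: int,
                 read_tok_s: float, write_tok_s: float) -> float:
    """Time for one turn: prompt processing (prefill) plus generation (decode)."""
    return prompt_tokens / read_tok_s + output_tokens / write_tok_s


# Filling the full 128k context from scratch, then writing 1,000 tokens.
cold = turn_seconds(128_000, 1_000, read_tok_s=800, write_tok_s=16)
print(f"cold 128k prompt: {cold / 60:.1f} min")

# With earlier context cached, only new tokens need prefill (assumed 2,000 here).
warm = turn_seconds(2_000, 1_000, read_tok_s=800, write_tok_s=16)
print(f"warm turn: {warm:.0f} s")
```

At these rates generation dominates a warm turn, which is why a jump from 16 to 25-30 tok/s would be very noticeable in agentic use.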

thot_experiment 5 hours ago||

Re-posting this from a buried comment for visibility because it's just so fucking impressive to me.

I went to the store to buy mixers and while I was out Gemma 4 31b got pretty far along with reverse engineering the bluetooth protocol of a desk thermometer I have. I forgot to turn on the web search tool, so it just went at it, writing more and more specific diagnostic logging/probing tools over the course of like 8 turns. It connected to the thermometer, scanned the characteristics and had made a dump of the bluetooth notification data. When I got back it was theorizing about how the data might be encoded in the bluetooth characteristics and it got into an infinite loop. (local models aren't perfect and I never said they were) I turned on the websearch tool and told it to "pick up the project where it left off", it read the directory, did a couple googles and had a working script to print temperature, humidity and battery state in like 3 turns. Reading back through its chain of thought I'm pretty sure it would have been able to get it eventually without googling.

idk, I thought I was a cool and smart engineer type for being able to do stuff like this, if my GPUs being able to do this more or less unsupervised isn't impressive I guess fuck me lol.

AntiUSAbah 3 hours ago|||

It definitely is, and just a few years ago it was unheard of.

And we progress on so many different frontiers in parallel: Agent harness, Agent model, hardware etc.

hparadiz 1 hour ago|||

A technology indistinguishable from magic.

gertlabs 4 hours ago|||

The small Qwen 3.6 models handle context a little better than Gemma 4, but Gemma 4 26B in particular has such small and efficient solutions which are really smart for its weight class. I was so impressed with its performance in our benchmark upon release that I wrote a blog post about it [0], although its position on the leaderboard later fell a bit as we ran it in more long context agentic coding environments.

[0] https://gertlabs.com/blog/gemma-4-economics

pdyc 4 hours ago|||

I use a smaller model, gemma e2b, for most of my editing and it works surprisingly well. The workflow is planning with SOTA models and execution via small models. If you plan properly and don't leave ambiguity for the smaller model, it works well.

2ndorderthought 1 hour ago||

Out of curiosity have you tried other small models? The e2b for me was unusable. Llama3.2 3b was better and that thing is a year old and I rarely use it now too.

discordance 8 hours ago|||

Could you please share your time to first token and tok/s?

isomorphic 5 hours ago|||

M4 Pro 64GB (14 CPU / 20 GPU), Gemma 4 31B Q4_K_M GGUF, LM Studio: time to first token 0.92s, 11.56 tokens/s.

Edit: For comparison with the other poster, same setup as above, but with Gemma 4 31B Instruct 8bit MLX (not sure if exactly the same model): time to first token 4.62s, 7.20 tokens/s; with a different prompt, 1.17s and 7.24 tokens/s.
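Numbers like these are easier to compare when everyone computes them the same way. Below is a minimal sketch of one common convention, assuming you have recorded the arrival time of each streamed token yourself; tools such as LM Studio may define tok/s differently, for instance by including or excluding prompt processing.

```python
from typing import Sequence


def summarize_stream(request_start: float,
                     token_times: Sequence[float]) -> tuple[float, float]:
    """Return (time_to_first_token_s, decode_tokens_per_s).

    token_times are the arrival times of each generated token, in seconds,
    on the same clock as request_start (e.g. time.perf_counter()).
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    decode_time = token_times[-1] - token_times[0]
    if len(token_times) < 2 or decode_time <= 0:
        return ttft, 0.0
    # Decode rate counts the intervals after the first token, excluding prefill.
    return ttft, (len(token_times) - 1) / decode_time


# Synthetic example: request sent at t=0, four tokens arriving 0.5 s apart.
ttft, rate = summarize_stream(0.0, [1.0, 1.5, 2.0, 2.5])
print(f"time to first token {ttft:.2f}s, {rate:.2f} tokens/s")
```

Keeping time to first token separate from the decode rate matters because, as the two prompts above show, the first varies a lot with prompt length while the second stays roughly constant.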

zozbot234 3 hours ago||

Could you (or anyone with the same hardware) try antirez's ds4 and report how gracefully it degrades with only the 64GB RAM? Obviously it's going to be dog slow at best for any single inference flow, but can you meaningfully improve on that by running many sessions in parallel? (Ideally you'd need roughly on the order of model sparsity in order to get meaningful sharing of MoE weights, but whether that's genuinely achievable is anyone's guess!)

ls612 7 hours ago|||

I’m on an M2 Max and get 10 tok/s with Gemma 4 8bit MLX

plufz 5 hours ago||

Does gemma work better than qwen3 in your experience?

2ndorderthought 1 hour ago||

Not in mine. I see a lot of people talking about Gemma on here but in my circles pretty much everyone else is running qwen.

busfahrer 6 minutes ago||

I am considering an M5 Pro (18/20C) MacBook with 64GB of RAM, but I'm having a really hard time finding benchmarks of real world models:

Could somebody please provide some tokens-per-second numbers for example for Qwen 3.6 35B/A3B, specifically for Q4 and Q6 quants?

quacker 9 hours ago||

I could have used this article before I spent the weekend arriving at the same conclusion!

Same laptop, and my contrived test was having it fix 50 or so lint errors in a small vibe-coded C++ repo. I wanted it to be able to handle a bunch of small tasks without getting stuck too often.

GPT OSS 20B was usable but slow, and actually frequently made mistakes like adding or duplicating statements unnecessarily, listing things as fixed without editing the code, and so on.

Qwen 3.5 9B with Opencode was much faster and actually able to work through a majority of the lint warnings without getting stuck, even through compaction and it fixed every warning with a correct edit.

I tried 4bit MLX quants of Qwen 3.5 9B but it eventually would crash due to insufficient memory. I switched to GGUF, which I run with llama.cpp, and it runs without crashing.

It is absolutely not comparable to frontier models. It’s way slower and gets basic info wrong and really can’t handle non trivial tasks in one go. I asked it for an architecture summary of the project and it claimed use of a library that isn’t present anywhere in the repo. So YMMV, but it’s still nice to have and hopefully the local LLM story can get much better on modest hardware over time.

solenoid0937 9 hours ago||

> It is absolutely not comparable to frontier models.

This is not said often enough.

Yes, local LLMs are great! But reading most HN posts on the subject, you'd think they're within reach of Opus 4.7.

There is a very small, very vocal, very passionate crowd that dramatically overstates the capabilities of local LLMs on HN.

thot_experiment 7 hours ago|||

Very different from my experience; Gemma 31b just solved a physics problem Opus 4.7 gave up on. I definitely don't think they're equivalent in general. Opus for sure is way smarter and way more likely to get things right on the edge, but it's still quite likely to get things wrong too, so it doesn't make it that useful for a lot of stuff. Conversely there are so many things that you would use an LLM for that they will both reliably oneshot. Especially in agentic mode where you have ground truth feedback between turns, the difference gets quite small for a lot of tasks.

That all being said I've spent hundreds (maybe thousands?) of hours on this stuff over the past few years so I don't see a lot of the rough edges. I really believe capability is there, Gemma 4 31B is a useful agent for all sorts of stuff, and anything you can reasonably expect an LLM to oneshot Qwen 3.6 35b MoE will handle at like 90tok/sec, absolutely fantastic for tasks that don't require a huge amount of precision.

fg137 6 hours ago||

Sure. Sample size = 1.

2ndorderthought 57 minutes ago|||

The models op is using are from a year ago. The big breakthroughs happened in April this past month

thot_experiment 6 hours ago|||

It may surprise you but over thousands of hours I have actually gathered more than one sample.

EDIT: Here's another sample for ya. I went to the store to buy mixers and while I was out Gemma 4 31b got pretty far along with reverse engineering the bluetooth protocol of a desk thermometer I have. I forgot to turn on the web search tool, so it just went at it, writing more and more specific diagnostic logging/probing tools over the course of like 8 turns. It connected to the thermometer, scanned the characteristics and had made a dump of the bluetooth notification data. When I got back it was theorizing about how the data might be encoded in the bluetooth characteristics and it got into an infinite loop. (local models aren't perfect and I never said they were) I turned on the websearch tool and told it to "pick up the project where it left off", it read the directory, did a couple googles and had a working script to print temperature, humidity and battery state in like 3 turns. Reading back through its chain of thought I'm pretty sure it would have been able to get it eventually without googling.

idk, I thought I was a cool and smart engineer type for being able to do stuff like this, if my GPUs being able to do this more or less unsupervised isn't impressive I guess fuck me lol.

fg137 6 hours ago||||

This.

I have seen way too many people who are overly optimistic about local LLMs.

Having spent a decent amount of time playing with them on consumer Nvidia GPUs, I understand well that they are not going to be widely usable any time soon. Unfortunately not many people share that view.

2ndorderthought 55 minutes ago|||

So the cofounder of Hugging Face made a post about qwen 3.6 being at Claude level of performance for the lols?

When were you trying local models? The model releases from April 2026 are a serious change in performance.

close04 3 hours ago||||

Not this. Let's reframe the problem. How many years behind do you think they are? By all accounts Gemma 4 is better than a frontier model from 3 years ago. Back then we were wowed by frontier models but when the local model reaches the same performance it's no good anymore, because you moved the target?

Relatively speaking local models might always be behind the curve compared to frontier ones. You can tell by the hardware needed to run each. But in absolute terms they're already past the performance threshold everyone praised in the past.

Right now in a lab somewhere there's a model that's probably better than anything else. There's a ChatGPT 5.6, an Opus 4.8. Knowing that do you suddenly feel a wave of disappointment at the current frontier models?

tommoneytools 53 minutes ago|||

[dead]

AntiUSAbah 3 hours ago||||

You are missing context.

A local model is as good as a frontier model for responding in a Signal thread with you, which requires basic tool calling.

A local model is as good as a frontier model at writing a joke.

A local model is as good as a frontier model at responding to an email.

Not sure what needs to be said often enough; no one without a clue would play around with a local model setup and completely ignore frontier models and their capabilities?!

HDBaseT 8 hours ago||||

At least in my experience, local models are very far away from models like Opus 4.7 or ChatGPT 5.5 in coding and problem solving areas.

I find them useful in basic research and learning and question asking tasks. Although at the same time, a Wikipedia page read or a few Google searches likely could accomplish the same and has been able to for decades.

darkstar_16 2 hours ago||

I think you're doing it wrong. Use the frontier models for the research, planning etc. and once you have a plan, give it to a local model for implementation.

2ndorderthought 58 minutes ago||||

The guy is running potato models!

ActorNightly 2 hours ago|||

I'm like 50% convinced that these people are paid by Apple to promote their products. Because the conversation is always just about being able to execute models (even larger ones) on Mac hardware with unified memory, but nobody ever mentions that inference speed is unusably slow.

You can have good local LLM performance through agents, but you need fast inference. Generally, 2x 3090s or at a minimum 2x 3080s (you need two to speed up the prefill processing that builds the KV cache). Ironically, you just need to be good at prompt engineering, which has a close real-world analogue in managing low-skilled people to complete tasks.

2ndorderthought 58 minutes ago|||

Try qwen3.6 35b a3b, not qwen3.5 9b. It's completely different.

layoric 9 hours ago||

Honestly surprised to hear that GPT OSS 20B runs slow on mac hardware. It's absolutely one of the fastest models I've run on local GPUs for its size, but only tried Nvidia cards.

Edit: TIL it is MoE and only has 3.6B active, explains a lot.
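The reason active parameters matter so much: token generation is usually limited by memory bandwidth, because each new token requires reading roughly every active weight once. A minimal sketch of that ceiling follows. The bandwidth figure and the 4.5 bits per weight are placeholders to replace with your own hardware and quantization, and the estimate ignores KV cache reads and other overhead, so real speeds come in lower.

```python
def decode_ceiling_tok_s(active_params_billion: float,
                         bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    active_gb = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / active_gb


BANDWIDTH_GB_S = 120.0  # placeholder; substitute your machine's memory bandwidth

moe = decode_ceiling_tok_s(3.6, 4.5, BANDWIDTH_GB_S)    # 3.6B active, per the comment
dense = decode_ceiling_tok_s(9.0, 4.5, BANDWIDTH_GB_S)  # a dense 9B for comparison
print(f"MoE, 3.6B active: <= {moe:.0f} tok/s")
print(f"dense 9B:         <= {dense:.0f} tok/s")
```

The ratio between the two ceilings is just the ratio of active parameters, which is why a 20B MoE can out-generate a dense 9B even though it needs more memory to hold.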

quacker 8 hours ago||

Yeah, I'm probably wrong there. GPT OSS 20B is certainly much faster than some other models I've tried. I actually gave GPT OSS 20B a few prompts just now and it seems to respond as fast or faster than Qwen 3.5 9B. But I needed many more prompts for GPT OSS 20B to complete my contrived task, so progress felt much slower.

dizlexic 14 minutes ago||

Thanks for sharing. I made a post earlier on bluesky describing my random setup on 32gb M2 studio. I'd love feedback. I'm a monkey and if I don't see I can't do.

https://bsky.app/profile/mooresolutions.io/post/3mliilyf2i22...

nl 10 hours ago||

I think it's useful to be realistic about what you can do with a local model, especially something as small as the 9B the author is using. A 9B model is around the level of Sonnet 3.6 - it can do autocomplete and small functions but it loses track trying to understand large problems.

But they are interesting and fun to play with! I do a LOT of work on local agent harnesses etc, mostly for fun.

My current project is a zero install agent: https://gemma-agent-explainer.nicklothian.com/ - Python, SQL and React all run completely in browser. Gemma E4B is recommended for the best experience!

This is under heavy development, needs Chrome for both HTML5 Filesystem API support and LiteRT (although most Chromium based browsers can be made to work with it)

It's different to most agents because it is zero install: the model runs in the browser using LiteRT/LiteLLM (which gives better performance than Transformers.js), and Filesystem API gives it optional sandbox access to a directory to read from.

It is self documenting - you can ask questions like "How is the system prompt used" in the live help pane and it has access to its own source code.

There's quite a lot there: press "Tour" to see it all.

Will be open source next week.

furyofantares 6 hours ago||

But I was doing a lot more than autocomplete and small functions with Sonnet 3.5.

potatoman22 6 hours ago|||

Not to be nitpicky, but many of the 4-12b models are somewhere between GPT-3.5 and GPT-4o-mini. It's hard to find a good comparison though, because the benchmarks people score models against change so often. For reference, Sonnet 3.6 came out about a year after GPT 3.5

nl 5 hours ago||

Don't worry about being nitpicky! I'm going to out-nitpick you....

Actually....

I write and publish my own benchmark for this stuff. It's an agentic SQL benchmark which isn't in the training data yet and I've found can separate frontier models from close-followers (the only models to get 100% are Opus 4.6 and GPT 5.5).

The best small model I've found is a fine-tune of Qwen 3.5 9B which scores 18/25: https://sql-benchmark.nicklothian.com/?highlight=Jackrong_Qw...

Haiku 4.5 scores 20/25, and Haiku is certainly better than Sonnet 3.6. GPT 3.5 scores 13/25.

ai_fry_ur_brain 10 hours ago||

[flagged]

nl 10 hours ago||

I think knowledge is power.

I think that the more people who try local models (especially the larger ones) the better.

I sometimes get the impression that many people claiming that local models are as good as frontier models work in "token poor" environments. If you can't build large-scale programs using at least Opus 4.5+ then it's difficult to compare. They compare something like Qwen 27B with Sonnet and see that it is nearly as good, but miss that the frontier models are a lot better.

That knowledge is power, too.

I personally can help make local models more accessible. I can't make Opus cheaper.

bachmeier 10 hours ago||

> I sometimes get the impression that many people claiming that local models are as good as frontier models work in "token poor" environments. If you can't build large-scale programs using at least Opus 4.5+ then it's difficult to compare.

I sometimes get the impression that people posting comments on HN don't realize that LLMs do more than vibe coding.

BubbleRings 9 hours ago||

Yeah no kidding. For instance, if you are an independent inventor trying to write a patent while keeping your patent lawyer expenses to a minimum, you want to write as much of the first draft(s) of the patent as you can yourself. (You’ll save billable hours with your patent lawyer, and you’ll end up with a better patent because you’ll communicate your innovations more clearly to your lawyer.)

However, and this is the big thing, you absolutely do not want to be asking a SOTA LLM for help with the language in your patent application. This is because describing your invention to a web based LLM could be considered a public “disclosure” of your invention, which (after a one year grace period goes by) could put your invention in the public domain, basically, and thereby prevent you (or anyone else) from being able to ever patent the invention. Plus, you know, a random unscrupulous employee at the SOTA company could be reviewing logs and notice your great idea, and file a patent on it before you do, and remember, the United States patent office went to “first to file” in 2013.

Oh and don’t take legal advice from random people on the internet by the way.

solenoid0937 9 hours ago||

> This is because describing your invention to a web based LLM could be considered a public “disclosure” of your invention, which, (after a one year grace period goes by), could put your invention in the public domain, basically—and thereby prevent you (or anyone else) from being able to ever patent the invention.

This is simply not true. Even if it were true (and again, it's not) you could simply use zero data retention APIs.

No one at the big model companies is trawling through your chats to steal your patents. It's not only illegal and against their own terms of service, but these people have better uses of their time.

PAndreew 2 hours ago||

Critics are (rightly) pointing to the fact that these models are not on par with SOTA for complex coding tasks. But many seem to forget that a large part of white collar office work is Excel crushing, file moving, translating dry legal documents, e-mail drafting, PPT drudgery, etc. These are absolutely doable with 30-35b+ models with the added benefit of keeping company data private.

2ndorderthought 50 minutes ago||

I think the conclusion is flawed here? Sure qwen3.5 9b is nowhere near the sota models. It's 9b and was made a year ago? Everyone talking about local models is pumped about the models released in April this year. Qwen 3.6 27b and qwen 35b a3b if you have a sad GPU. Those are comparable to sota models, seriously.

tjoff 2 hours ago||

Arguably excel and legal are much worse than code because catching the mistakes can be much harder.

Case in point, JPMorgan London Whale incident, $6 billion loss caused by an excel error...

PAndreew 2 hours ago||

Yes... I mean organisations have to adapt to this new working scheme. First they need new processes (maybe borrowed from SW development) that enable them to triage work products on a risk/reward scale. For example my wife works on medical device tenders. It is obligatory to translate every frikkin Word document to our native language, which in the end no one will read. Do we use LLMs to do the translation? Hell yeah. For a critical legal document? Eeee. Also I think enablers like special harnesses should be developed/improved with these folks in mind, for example by building hooks into the harness that force the LLM to test/review/sample its output. So yes it's a complex topic, but my point was rather that the inherent capabilities of medium-large-ish open LLMs are sufficient for let's say 70-80% of such office work, and it's a huge market.
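A harness hook of the kind described here can be quite small. The sketch below is one hypothetical shape for it, not any existing harness's API: `generate` stands in for whatever function calls the model, and `must_keep_figures` is an invented example check for translation work.

```python
import re
from typing import Callable, Optional

# A check returns None if the output passes, or a message describing the problem.
Check = Callable[[str], Optional[str]]


def run_with_checks(generate: Callable[[str], str], prompt: str,
                    checks: list[Check], max_attempts: int = 3) -> str:
    """Call the model, run every check on its output, and retry with feedback."""
    current_prompt = prompt
    problems: list[str] = []
    for _ in range(max_attempts):
        output = generate(current_prompt)
        problems = [msg for check in checks if (msg := check(output))]
        if not problems:
            return output
        # Feed the failures back so the next attempt can correct them.
        current_prompt = prompt + "\n\nFix these problems:\n" + "\n".join(problems)
    raise RuntimeError(f"still failing after {max_attempts} attempts: {problems}")


def must_keep_figures(source: str) -> Check:
    """Hypothetical check for translations: every number in the source must survive."""
    numbers = set(re.findall(r"\d+(?:[.,]\d+)?", source))

    def check(output: str) -> Optional[str]:
        missing = sorted(n for n in numbers if n not in output)
        return f"missing figures: {missing}" if missing else None

    return check
```

Cheap mechanical checks like this do not prove a translation is right, but they catch the class of silent mistakes that is hardest for a human reviewer to spot, and anything that still fails after the retries can be routed to a person.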

sourc3 11 hours ago||

I am running qwen 3.6 9b quantized model on my m4 pro 48gb and it is barely useful to do some basic pi.dev/cc driven development. I think 128gb desktops are the sweet setup to actually get meaningful work done. However, getting your hands on one of these machines is difficult at the moment.

As much fun as it is to run these things locally don’t forget that your time is not free. I am slowly migrating my use cases to openrouter and run the largest qwen model for < $2-3/day with serious use for personal projects.

carbocation 10 hours ago||

Was the choice of such a small model driven by a desire for high tok/sec? I ask because an m4 pro 48gb machine can run larger models (if model intelligence is the thing that would make it more useful).

sourc3 10 hours ago||

Yes that was my goal. Also noticed a huge performance gain going from ollama to mlx. Your mileage may vary.

elij 10 hours ago|||

I'm using the 30b MOE model on same spec with 65k tokens as a sub agent with tooling and it absolutely writes decent code. The dense 9b I agree wasn't great.

sjones671 10 hours ago|||

Thanks for saying this. There's so much nonsense out there online about local models being better than Opus 4.7 and the like. It's just not true for regular users.

I have a brand new M5 MacBook Pro - top end with all the specs and I've tried local models and they're barely functional.

Yukonv 10 hours ago|||

What models and quantizations have you been trying? I've had great success with the larger Qwen 3.x models at 6-bit levels. Using 6 bit quantization is really the bare minimum to give local models a fair shot at agentic flows. Once you start pushing below that the models become more "dumb" from the limited bit space.

SecretDreams 9 hours ago|||

The main benefits for local are:

1) control 2) privacy 3) transparent cost model

Cloud has tremendous value for speed, plug and play, and performance. You need to decide how those compete with the benefits of local - both today, and a year from now, e.g.

hparadiz 10 hours ago|||

How does it (the openrouter version) compare to ChatGPT 5.5 or Claude Opus 4.6?

sourc3 10 hours ago||

Good enough. It gets 60-70% of the work I need done for a lot less $ (keep in mind I am using these for personal projects that don’t generate revenue). If I was using it with the hopes of making money I think I would just use Codex at this point.

rapatel0 9 hours ago||

I got qwen3.6:27B running on my 4090 (24GB) with ~128K context leveraging some of the recent turboquant/rotorquant memory optimizations for activations. Highly suggest going up to that. the q4_xl+rotorquant combo is pretty good.

Some reference code if you want to throw your agent at it. https://github.com/rapatel0/rq-models

tjpnz 14 minutes ago||

How about an M4 with 16GB of memory?

canpan 11 hours ago|

Recent models (Qwen 3.6 and Gemma) can really do coding locally. Feels like SOTA from maybe a year ago? But you would want about 32-40GB total memory. 24GB is just a bit short of that. A gaming PC with 16GB graphics card and 32GB RAM brings you very close to a usable coding system.

wktmeow 9 hours ago||

That’s the exact ram/vram combo of my desktop - what model would you suggest for that gaming pc setup?

canpan 7 hours ago||

I would recommend starting with Qwen 3.6 35B at maybe Q5; it should be fast in that setup. For intelligence, Qwen 3.6 27b is smarter but will run much slower. Others also mention Gemma 4, which might be worth a try.

solenoid0937 9 hours ago|||

> Feels like SOTA from maybe a year ago?

Agree but only for small projects. SOTA from a year ago still wins on larger projects

ai_fry_ur_brain 10 hours ago|||

"Coding system" "can really do coding locally"

Vibe coders out here thinking all software development is solved because they made an (ugly and unoriginal) dashboard for their SaaS clone and their single column with 3x3 feature card landing page that's identical to every other vibe coder's "startup"

DrBenCarson 10 hours ago||

How are you using that RAM with the GPU?

canpan 10 hours ago||

Llama.cpp with automatic offload to main memory. You can also use Ollama, it is easier, but slower.

reverius42 5 hours ago||

For those who want a GUI, LM Studio does this too (with llama.cpp as the backend I think). I'm getting great (albeit slow) results with Qwen3.6-35B MoE on 8GB GPU RAM, 40GB system RAM.
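When setting the split by hand, llama.cpp's `--n-gpu-layers` option controls how many layers go to the GPU while the rest stay in system RAM. A rough sketch for picking a starting value, assuming all layers are the same size and reserving some VRAM for the KV cache and buffers; the model size, layer count and reserve below are placeholders, and MoE models can use other placement strategies that this does not model.

```python
def gpu_layers_that_fit(model_gb: float, n_layers: int,
                        vram_gb: float, reserve_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in VRAM after a reserve."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Placeholder numbers: a 24 GB quantized model with 48 layers on a 16 GB card.
n = gpu_layers_that_fit(model_gb=24.0, n_layers=48, vram_gb=16.0)
print(f"try --n-gpu-layers {n}")
```

Treat the result as a first guess and lower it if the runtime reports out-of-memory errors at your chosen context length.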
