Posted by shintoist 12 hours ago

Running local models on an M4 with 24GB memory (jola.dev)
336 points | 107 comments
y42 6 hours ago|
Having an M3 with 36 GByte I was under the assumption, that I can utilize like Qwen and similar models. It's quite easy to set up, you can use pi or hermes for CLI access, or "Continue" to use it in VS Code. You can choose between omlx, Ollama and even more to run the model itself. It's no rocket science, but the results are also not satisfying.

I use it occasionally for very easy tasks: fixing typos or updating metadata in blog posts. So yeah, it improves productivity. But coding-wise it's far from Codex, Claude, et al.
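
A minimal sketch of that kind of setup, assuming Ollama is running on its default port and you've already pulled a Qwen build (the model tag below is illustrative, not a specific recommendation):

```python
# Minimal sketch: query a local model served by Ollama (default port 11434).
# Assumes you've pulled a model first, e.g. `ollama pull qwen2.5-coder`;
# swap in whatever tag you actually have installed.
import json
import urllib.request

def ask_local(prompt: str, model: str = "qwen2.5-coder") -> str:
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# The kind of "very easy task" mentioned above: typo fixing.
print(ask_local("Fix the typos in this sentence: Teh quick brwon fox."))
```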

ThomasBb 7 hours ago||
Beyond the models getting better, there are still huge gains available on the inference-engine side with new tricks like Dflash, MRT, turboquant; for some use cases these can multiply the speeds. There are even some model-specific optimized kernels, like for DeepSeek 4 flash, that seem wild.

Makes me feel we are nowhere near the optimum yet.

Examples: https://dasroot.net/posts/2026/05/gemma-4-speed-hacks-mtp-df...

https://x.com/bindureddy/status/2052982206344409242?s=46

brrrrrm 7 hours ago|
what's MRT?
ThomasBb 2 hours ago||
Sorry, autocorrect got me there: MTP is what I meant.
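
For anyone wondering why MTP-style tricks can multiply speed: a toy sketch of speculative decoding, where a cheap draft guesses several tokens and the big model verifies them all in one pass. Both "models" below are stand-ins for illustration, not a real implementation:

```python
import random

# Toy stand-ins: a cheap "draft" that guesses and a "big" model treated as
# ground truth. In real MTP/speculative decoding the draft is far cheaper
# per token, so fewer big-model passes means a real speedup.
TARGET = "the quick brown fox jumps over the lazy dog".split()

def big_model_token(pos):            # one expensive call -> one correct token
    return TARGET[pos]

def draft_tokens(pos, k):            # cheap guesses, right ~80% of the time
    return [t if random.random() < 0.8 else "???" for t in TARGET[pos:pos + k]]

def speculative_decode(k=4):
    out, big_passes = [], 0
    while len(out) < len(TARGET):
        draft = draft_tokens(len(out), k)
        big_passes += 1                  # big model checks all k drafts at once
        for tok in draft:
            if tok == big_model_token(len(out)):
                out.append(tok)          # accept the matching prefix
            else:
                out.append(big_model_token(len(out)))  # big model supplies the fix
                break
    return out, big_passes

tokens, passes = speculative_decode()
print(" ".join(tokens))
print(f"big-model passes: {passes} vs {len(TARGET)} for plain decoding")
```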
nu11ptr 11 hours ago||
Still trying to understand if a MacBook Pro M5 Max with 128GB is likely going to be able to run coding models well enough that I can cancel my Codex subscription, or even go down to the $20/month plan.
guessmyname 11 hours ago||
A 128GiB MacBook Pro in Canada is what, north of CAD $11k after tax? That’s around USD $7k. At $20/month for a cloud AI subscription, you’re looking at almost 30 years of service for the same money.

How long do people realistically expect a laptop to stay competitive with SOTA local models? Especially in a space where model sizes, context windows, and inference requirements keep moving every year.

And even if the hardware lasts, the local experience usually doesn’t. A heavily quantized local model running at tolerable speeds on consumer hardware is still nowhere near frontier hosted models in reasoning, coding, multimodal capability, tool use, or reliability.

The economics just don’t make sense to me unless you specifically need offline inference, privacy guarantees, or low latency for a niche workflow. Otherwise you’re tying up $10k upfront to run an approximation of what you can already access through a subscription that continuously improves over time.

You could literally put the difference into index funds and probably cover the subscription indefinitely from the returns alone, even accounting for gradual price increases.
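
The back-of-envelope on those numbers (the 4% real return is an assumption, not advice):

```python
# Break-even math for the figures above: ~USD $7k of hardware vs $20/month.
hardware_usd = 7_000
sub_per_month = 20

breakeven_months = hardware_usd / sub_per_month
print(f"break-even: {breakeven_months:.0f} months (~{breakeven_months / 12:.0f} years)")

# Index-fund angle: assume a 4% real annual return on the same $7k.
annual_return = hardware_usd * 0.04
print(f"returns ${annual_return:.0f}/yr vs ${sub_per_month * 12}/yr subscription")
```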

dale_glass 45 minutes ago|||
Buy a Framework Desktop with 128 GB instead. It's half the price, and I bought mine for even less before RAM prices went crazy.
2ndorderthought 1 hour ago||||
If you already have a desktop, you can buy a used GPU for under $400, run Qwen 3.6 A3B, and get by just fine for a majority of frontier tasks. Why do you need to spend $10k on a laptop? We are swimming in e-waste.
tom_ 10 hours ago||||
But what if you were going to buy a laptop anyway? Obviously you can't do anything with less than 64 GBytes these days, so the question is just whether you go for the jump to 128.

In the UK, it's currently an extra £800 to get a 128 GB vs the 64 GB equivalent. So that's more like 3 years of Claude - I think? - assuming current prices stay the same.

Or: you might just feel like £800 isn't an unjustifiable amount of money (one way or another), and tick the box, on the basis that it might just work out. As the saying goes, in for 459,900 pennies, in for £5,399...

gabagool 8 hours ago|||
> Obviously you can't do anything with less than 64 GBytes these days

I don't think that's true. Plenty of people can run basic workflows at 8GB on the MacBook Neo and most others are fine at 16 GB.

nu11ptr 1 hour ago||
I am a developer, as many of us on here are. I currently have 32GB of RAM and am constantly fighting swap. 64GB would be min even w/o local model.
winrid 5 hours ago||||
I rebuilt the entire FastComments moderation UI two years ago with WebStorm on my 16GB ThinkPad. 64GB is nice but not needed. I wonder whether, if every dev didn't use an M4 Pro, software would be so resource-hungry...
jval43 7 hours ago|||
Realistically it's a 48 GB M5 Pro vs a 128 GB M5 Max, due to constraints on how you can configure them. So a more substantial difference of ~$2k US.
nu11ptr 10 hours ago||||
You are assuming I'd only get it for that. That would probably just be the straw that broke the camel's back, but I'm already thinking about a purchase even if that doesn't work out.
knollimar 9 hours ago||||
You have to use it a lot, to the point where you'd regularly be exceeding what the subscriptions give you.
brcmthrowaway 9 hours ago|||
This is one of the best takedowns of local models I've ever seen.

I just hate paying money for cloud subscriptions, and work has given me a decent laptop

Yukonv 11 hours ago||
I've been using Qwen 3.6 27b along with various other models for the last month, and it's capable enough at writing code that I haven't needed a subscription for 95% of what I throw at it. As one example, I've been using it to write extensions for Pi to expand its toolkit without much fuss. Is it as fast or SOTA? No, but you can't ignore how functional it is on hardware you own. Where it can begin to struggle is with overly open-ended prompts or investigating complex technical issues. At that level its knowledge is not high enough to solve those problems on its own.
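
One nice property of running models this way: most local servers expose an OpenAI-compatible endpoint, so existing tooling can point at them. A minimal sketch, assuming an Ollama-style /v1 shim on the default port; the model tag is illustrative, not the exact build mentioned above:

```python
# Sketch: reuse an OpenAI-style client against a local server.
# Ollama, LM Studio, and similar expose an OpenAI-compatible /v1 endpoint;
# the api_key just needs to be a non-empty placeholder for local use.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")

resp = client.chat.completions.create(
    model="qwen2.5-coder:32b",  # illustrative tag; use whatever you have pulled
    messages=[{"role": "user", "content": "Write a shell one-liner to count TODOs."}],
)
print(resp.choices[0].message.content)
```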
isaisabella 6 hours ago||
I'd rather spend thousands of dollars on a Mac than subscribe to an API. A local model lets me do my work any time and anywhere, without worrying about privacy leaks.
amelius 3 hours ago||
I'd pick a much more open system with more capabilities for a little bit more money, e.g. a Jetson Orin 64GB (unified memory). Runs Linux out of the box.
MinimalAction 8 hours ago||
Well, but if I have a MacBook Air M4 with 16GB, I don't know what useful models I can run.
jen20 2 hours ago|
`brew install llmfit`
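
Whatever tool does the checking, the sizing rule of thumb is the same: weights take roughly params × bytes-per-weight, plus runtime overhead. A rough sketch (the 1.2x overhead and 75% usable-RAM factors are assumptions, not measured numbers):

```python
# Back-of-envelope: will a quantized model fit in a given amount of RAM?
# weights ≈ params * bytes_per_weight; 1.2x covers KV cache / runtime
# overhead and 75% of RAM is left for the model, both rough assumptions.
def fits(params_b: float, bits: int, ram_gb: int) -> str:
    need_gb = params_b * (bits / 8) * 1.2
    verdict = "fits" if need_gb < ram_gb * 0.75 else "tight/no"
    return f"{params_b:g}B @ {bits}-bit ≈ {need_gb:.1f} GB -> {verdict} in {ram_gb} GB"

for params, bits in [(7, 4), (14, 4), (27, 4), (7, 8)]:
    print(fits(params, bits, 16))
```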
kristianpaul 6 hours ago||
Good to keep hideThinkingBlock at its default; it's on purpose, so you can steer the model.
rtpg 11 hours ago||
What kinda harness do people use with these local models? I am quite happy with the Claude Code permission model and interface in general for coding stuff (for chat-y interfaces I have no real opinion).
spike021 10 hours ago||
I'll have to try some more. I've been playing with gpt-oss 20b on my M4 24GB but it hasn't been the best experience.
stuaxo 4 hours ago|
"What does work is a more interactive workflow where you’re clearly communicating with the model step by step, and giving it a lot of guidance. I’m sure that sounds pointless to many of you, why use a model where you have to babysit it as it works, but I actually found that it encouraged me to be more engaged. "

This sort of thing is key to knowing what's going on and not having your brain fully atrophy.
