Posted by shintoist 13 hours ago

Running local models on an M4 with 24GB memory (jola.dev)
354 points | 115 comments
kristianpaul 7 hours ago|
Good to keep the hideThinkingBlock default; it's on purpose, so you're able to steer the model.
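(A minimal sketch of what a client-side hideThinkingBlock toggle might do, assuming the model emits its reasoning inside <think>...</think> tags as many local reasoning models do; the tag format and behavior here are assumptions, not the article's actual implementation.)

    import re

    THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

    def render(output: str, hide_thinking_block: bool = True) -> str:
        # When hidden, strip the reasoning span; when visible, the user can
        # read the chain of thought and steer the model mid-conversation.
        return THINK_RE.sub("", output).strip() if hide_thinking_block else output

    raw = "<think>User likely wants brevity.</think>Here is the short answer."
    print(render(raw, hide_thinking_block=False))  # reasoning visible, steerable
    print(render(raw))                             # reasoning hidden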
spike021 11 hours ago||
I'll have to try some more. I've been playing with gpt-oss 20b on my M4 24GB but it hasn't been the best experience.
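(For context on this kind of setup: a minimal sketch of querying a locally served gpt-oss 20b over an OpenAI-compatible endpoint, assuming a local server such as Ollama or LM Studio is already running the model; the base_url and model tag below are assumptions that depend on your server.)

    from openai import OpenAI

    # Local servers such as Ollama (port 11434) or LM Studio (port 1234)
    # expose an OpenAI-compatible API; point base_url at whichever you run.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    resp = client.chat.completions.create(
        model="gpt-oss:20b",  # model tag as registered with the local server
        messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
    )
    print(resp.choices[0].message.content)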
reillyse 11 hours ago||
So, I'm interested: how many people are running higher-end AI models locally? I figure if I'm spending $800/month on tokens, I could build a pretty beefy local machine for the cost of a few months' spend. What is people's experience with, say, a $5k server custom-built (and used only) for running an AI model?
entrope 11 hours ago||
You will likely have to compromise on memory bandwidth or capacity under $10k. The Radeon R9700 has 32 GB of VRAM and is pretty cheap (~$1500 right now), which is what I primarily use. My home desktop has 128 GB RAM and my laptop has 96 GB RAM, but bandwidth limits make most models slow on those CPUs. Models with multi-token prediction are somewhat usable on them: Nemotron 3 Super runs reasonably well on my desktop but does poorly on the agentic coding tasks I've given it; my laptop can run Qwen3.6-27B reasonably well with a version of llama.cpp patched for MTP support; but usually I run Qwen3.6-27B on my R9700.

vLLM might support two or three R9700s on some OS, but I've not been able to get it to run at all on Ubuntu 26.04: the system ROCm version is apparently different from what's in the container images, and system OpenMPI v5.0 finally removed the C++ bindings that were deprecated in 2005 but are still linked from some Python wheel that vLLM (probably indirectly) imports.
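(A back-of-envelope sketch of the bandwidth ceiling described above: token generation is memory-bound, since each decoded token streams roughly the full set of active weights through memory, so tokens/s can't exceed bandwidth divided by model size. It also suggests why multi-token prediction helps: each weight pass yields more than one token. The hardware and model figures below are illustrative assumptions, not benchmarks.)

    def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
        # Decode is memory-bound: every token reads ~all active weights once,
        # so throughput is capped at bandwidth / bytes-per-token.
        return bandwidth_gb_s / model_gb

    # Illustrative numbers (assumptions, not measurements):
    configs = {
        "dual-channel DDR5 desktop (~90 GB/s), 27B dense @ Q4 (~16 GB)": (90, 16),
        "Radeon R9700 GDDR6 (~640 GB/s), 27B dense @ Q4 (~16 GB)": (640, 16),
    }
    for name, (bw, size) in configs.items():
        print(f"{name}: <= {max_tokens_per_sec(bw, size):.0f} tok/s")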

If you are spending $800/month on tokens, you are likely to notice degradation with local models compared to near-frontier models. The models I can run locally are consistently worse than Claude Sonnet 4.6 (again, for the work I give them), although Qwen3.6 does feel almost like magic for its size because it can do a lot. The really big open-weight models should be better, but they want 200+ GB RAM, which will need a correspondingly expensive CPU.

adornKey 4 hours ago|||
I'm running a server in the $5k league, and the results are very good: I get about 150 tokens/s from Qwen3 for coding, and about 50 tokens/s from the newer non-MoE Qwens.

I wouldn't bother with less than 32GB of VRAM. With 16GB you can already run something usable, but 32GB gives you much more power. 9B and 14B models are only interesting if you want to tune models yourself. The sweet spot now seems to be around 27B-35B.
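(A rough sketch of the VRAM arithmetic behind the 32 GB recommendation: weights at a given quantization, plus KV cache and runtime overhead. The bits-per-weight and overhead figures are simplifying assumptions.)

    def est_vram_gb(params_b: float, bits_per_weight: float,
                    kv_cache_gb: float = 2.0, overhead_gb: float = 1.5) -> float:
        # Weights: params (billions) * bits / 8 = GB; add KV cache + runtime overhead.
        return params_b * bits_per_weight / 8 + kv_cache_gb + overhead_gb

    for size_b in (9, 14, 27, 35):
        q4 = est_vram_gb(size_b, 4.5)   # ~Q4_K_M-style quantization
        q8 = est_vram_gb(size_b, 8.5)   # ~Q8_0-style quantization
        print(f"{size_b}B: ~{q4:.0f} GB @ Q4, ~{q8:.0f} GB @ Q8")

(On that arithmetic, a 27B-35B model at Q4 fits comfortably in 32 GB but not in 16 GB, which matches the sweet spot above.)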

2ndorderthought 2 hours ago||
Check in with /r/LocalLLaMA. There are 100GB VRAM setups built from complete e-waste, all the way down to single 8GB GPU inference machines. It depends on what you want and can afford.
stuaxo 5 hours ago||
"What does work is a more interactive workflow where you’re clearly communicating with the model step by step, and giving it a lot of guidance. I’m sure that sounds pointless to many of you, why use a model where you have to babysit it as it works, but I actually found that it encouraged me to be more engaged. "

This sort of thing is key to knowing what's going on and not having your brain fully atrophy.

BubbleRings 11 hours ago||
People do use SOTA LLMs for other things besides computer programming.

For instance, if you are an independent inventor trying to write a patent while keeping your patent lawyer expenses to a minimum, you want to write as much of the first draft(s) of the patent as possible yourself. (You’ll save billable hours with your patent lawyer, and you’ll end up with a better patent because you’ll communicate your innovations more clearly to your lawyer.)

However, and this is the big thing, you absolutely do not want to be asking a SOTA LLM for help with the language in your patent application. This is because describing your invention to a web-based LLM could be considered a public “disclosure” of your invention, which (after a one-year grace period goes by) could put your invention in the public domain, basically, and thereby prevent you (or anyone else) from ever being able to patent the invention. Plus, you know, a random unscrupulous employee at the SOTA company could be reviewing logs, notice your great idea, and file a patent on it before you do. Remember, the United States patent office went to “first inventor to file” in 2013.

Oh and don’t take legal advice from random people on the internet by the way.

dempedempe 10 hours ago|
It takes people years to learn how to write a good patent. If you gave your lawyer your attempt at writing your own patent, they might use the info to understand what you want (you're right about that), but a good lawyer would probably just start from scratch.

Imagine you're a contractor. You have a client who knows nothing about software development that wants you to write some software for them. They give you some code they generated with an LLM to get you started. Would you use the code or start over?

NBJack 13 hours ago|
I'm puzzled. The M4, as far as I know, doesn't have 24GB. Did the author mean an M40?
tra3 13 hours ago||
There’s definitely an option with 24 gigs of RAM: https://support.apple.com/en-ca/121552
NBJack 9 hours ago||
Ah, thank you. I was assuming a Nvidia Tesla M4.
sertsa 13 hours ago|||
M4 Mac Mini w/24GB sitting right here on my desk.
NBJack 9 hours ago||
Thanks; I assumed the author was talking about an Nvidia Tesla M4 (hence my confusion and assumption that they meant the M40 series, which has 24GB of VRAM).
spoonyvoid7 13 hours ago||
M4 = M4 MacBook Pro
teaearlgraycold 12 hours ago||
Or Air