A few lessons learned:
1. small models like the new qwen3.5:9b can be fantastic for local tool use, information extraction, and many other embedded applications.
2. For coding tools, just use Google Antigravity and gemini-cli, or, Anthropic Claude, or...
Now to be clear, I have spent perhaps 100 hours in the last year configuring local models for coding using Emacs, Claude Code (configured for local), etc. However, I am retired and this time was a lot of fun for me: lot's of efforts trying to maximize local only results. I don't recommend it for others.
I do recommend getting very good at using embedded local models in small practical applications. Sweet spot.
If you want a general knowledge model for answering questions or a coding agent, nothing you can run on your MacBook will come close to the frontier models. It's going to be an exercise in frustration if you try to use local models that way. But there are a lot of useful applications for local-sized models when it comes to describing and categorizing unstructured data.
There's also the Ryzen AI Max+ 395 that has 128GB unified in laptop form factor.
Only Apple has the unique dynamic allocation though.
I'm not sure what exactly you're referring to with "Only Apple has the unique dynamic allocation though." On Strix Halo you set the fixed VRAM size to 512 MB in the BIOS, and you set a few Linux kernel params that enable dynamic allocation to whatever limit you want (I'm using 110 GB max at the moment). LLMs can use up to that much when loaded, but it's shared fully dynamically with regular RAM and is instantly available for regular system use when you unload the LLM.
(In terms of intelligence, they tend to score similarly to a dense model that's as big as the geometric mean of the full model size and the active parameters, i.e. for GPT-OSS-20B, it's roughly as smart as a sqrt(20b*3.6b) ≈ 8.5b dense model, but produces tokens 2x faster.)
For MoE models, it should be using the active parameters in memory bandwidth computation, not the total parameters.
> A Mixture of Experts model splits its parameters into groups called "experts." On each token, only a few experts are active — for example, Mixtral 8x7B has 46.7B total parameters but only activates ~12.9B per token. This means you get the quality of a larger model with the speed of a smaller one. The tradeoff: the full model still needs to fit in memory, even though only part of it runs at inference time.
> A dense model activates all its parameters for every token — what you see is what you get. A MoE model has more total parameters but only uses a subset per token. Dense models are simpler and more predictable in terms of memory/speed. MoE models can punch above their weight in quality but need more VRAM than their active parameter count suggests.
> GPT-OSS-20B has 3.6B active parameters, so it should perform similarly to a 3-4B dense model, while requiring enough VRAM to fit the whole 20B model.
First, the token generation speed is going to be comparable, but not the prefil speed (context processing is going to be much slower on a big MoE than on a small dense model).
Second, without speculative decoding, it is correct to say that a small dense model and a bigger MoE with the same amount of active parameters are going to be roughly as fast. But if you use a small dense model you will see token generation performance improvements with speculative decoding (up to x3 the speed), whereas you probably won't gain much from speculative decoding on a MoE model (because two consecutive tokens won't trigger the same “experts”, so you'd need to load more weight to the compute units, using more bandwidth).
So by doing so, this calculator is telling you that you should be running entirely dense models, and sparse MoE models that maybe both faster and perform better are not recommended.
But since this is a tech forum, I assumed some people would be interested by the correction on the details that were wrong.
"What is the highest-quality model that I can run on my hardware, with tok/s greater than <x>, and context limit greater than <y>"
(My personal approach has just devolved into guess-and-check, which is time consuming.) When using TFA/llmfit, I am immediately skeptical because I already know that Qwen 3.5 27B Q6 @ 100k context works great on my machine, but it's buried behind relatively obsolete suggestions like the Qwen 2.5 series.
I'm assuming this is because the tok/s is much higher, but I don't really get much marginal utility out of tok/s speeds beyond ~50 t/s, and there's no way to sort results by quality.
Well, granted my project is trying to do this in a way that works across multiple devices and supports multiple models to find the best “quality” and the best allocation. And this puts an exponential over the project.
But “quality” is the hard part. In this case I’m just choosing the largest quants.
You can check out here how it does that: https://github.com/AlexsJones/llmfit/blob/main/llmfit-core/s...
To detect NVIDIA GPUs, for example: https://github.com/AlexsJones/llmfit/blob/main/llmfit-core/s...
In this case it just runs the command "nvidia-smi".
Note: llmfit is not web-based.
I too was a little surprised by this. My browser (Vivladi) makes a big deal about how privacy-conscious they are, but apparently browser fingerprinting is not on their radar.
> Estimates based on browser APIs. Actual specs may vary
It looks like I can run more local LLMs than I thought, I'll have to give some of those a try. I have decent memory (96GB) but my M2 Max MBP is a few years old now and I figured it would be getting inadequate for the latest models. But llmfit thinks it's a really good fit for the vast majority of them. Interesting!
Love the idea though!
EDIT: Okay the whole thing is nonsense and just some rough guesswork or asking an LLM to estimate the values. You should have real data (I'm sure people here can help) and put ESTIMATE next to any of the combinations you are guessing.
Preliminary testing did not come to that conclusion.
> Apple’s New M5 Max Changes the Local AI Story
For the lazy: that's less then 3x: 1.8 * 3 = 5.4
A couple suggestions:
1. I have an M3 Ultra with 256GB of memory, but the options list only goes up to 192GB. The M3 Ultra supports up to 512GB. 2. It'd be great if I could flip this around and choose a model, and then see the performance for all the different processors. Would help making buying decisions!
Super impressive comparisons, and correlates with my perception having three seperate generations of GPU (from your list pulldown). Thanks for including the "old AMD" Polaris chipsets, which are actually still much faster than lower-spec Apple silicon. I have Ollama3.1 on a VEGA64 and it really is twice as fast as an M2Pro...
----
For anybody that thinks installing a local LLM is complicated: it's not (so long as you have more than one computer, don't tinker on your primary workhorse). I am a blue collar electrician (admittedly: geeky); no more difficult than installing linux.
It says I have an Arc 750 with 2 GB of shared RAM, because that's the GPU that renders my browser...but I actually have an RTX1000 Ada with 6 GB of GDDR6. It's kind of like an RTX 4050 (not listed in the dropdowns) with lower thermal limits. I also have 64 GB of LPDDR5 main memory.
It works - Qwen3 Coder Next, Devstral Small, Qwen3.5 4B, and others can run locally on my laptop in near real-time. They're not quite as good as the latest models, and I've tried some bigger ones (up to 24GB, it produces tokens about half as fast as I can type...which is disappointingly slow) that are slower but smarter.
But I don't run out of tokens.
The tool is very nice though.
Currently, Nemotron 3 Super using Unsloth's UD Q4_K_XL quant is running nearly everything I do locally (replacing Qwen3.5 122b)