I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude code well enough to be useful.
Maybe this will be the one? This Unsloth guide from a sibling comment suggests it might be: https://unsloth.ai/docs/models/qwen3-coder-next
This distinction is important because some "we support local model" tools have things like ollama orchestration or use the llama.cpp libraries to connect to models on the same physical machine.
That's not my definition of local. Mine is "local network". so call it the "LAN model" until we come up with something better. "Self-host" exists but this usually means more "open-weights" as opposed to clamping the performance of the model.
It should be defined as ~sub-$10k, using Steve Jobs megapenny unit.
Essentially classify things as how many megapennies of spend a machine is that won't OOM on it.
That's what I mean when I say local: running inference for 'free' somewhere on hardware I control that's at most single digit thousands of dollars. And if I was feeling fancy, could potentially fine-tune on the days scale.
A modern 5090 build-out with a threadripper, nvme, 256GB RAM, this will run you about 10k +/- 1k. The MLX route is about $6000 out the door after tax (m3-ultra 60 core with 256GB).
Lastly it's not just "number of parameters". Not all 32B Q4_K_M models load at the same rate or use the same amount of memory. The internal architecture matters and the active parameter count + quantization is becoming a poorer approximation given the SOTA innovations.
What might be needed is some standardized eval benchmark against standardized hardware classes with basic real world tasks like toolcalling, code generation, and document procesing. There's plenty of "good enough" models out there for a large category of every day tasks, now I want to find out what runs the best
Take a gen6 thinkpad P14s/macbook pro and a 5090/mac studio, run the benchmark and then we can say something like "time-to-first-token/token-per-second/memory-used/total-time-of-test" and rate this as independent from how accurate the model was.
Because honestly I don't care about 0.2 tps for my use cases although I've spoken with many who are fine with numbers like that.
At least the people I've talked to they talk about how if they have a very high confidence score that the model will succeed they don't mind the wait.
Essentially a task failure is 1 in 10, I want to monitor and retry.
If it's 1 in 1000, then I can walk away.
The reality is most people don't have a bearing on what this order of magnitude actually is for a given task. So unless you have high confidence in your confidence score, slow is useless
But sometimes you do...
I am fine renting an H100 (or whatever), as long as I theoretically have access to and own everything running.
I do not want my career to become dependent upon Anthropic.
Honestly, the best thing for "open" might be for us to build open pipes and services and models where we can rent cloud. Large models will outpace small models: LLMs, video models, "world" models, etc.
I'd even be fine time-sharing a running instance of a large model in a large cloud. As long as all the constituent pieces are open where I could (in theory) distill it, run it myself, spin up my own copy, etc.
I do not deny that big models are superior. But I worry about the power the large hyperscalers are getting while we focus on small "open" models that really can't match the big ones.
We should focus on competing with large models, not artisanal homebrew stuff that is irrelevant.
There's just not a good way to visualize the compute needed, with all the nuance that exists. I think that trying to create these abstractions are what leads to people impulse buying resource-constrained hardware and getting frustrated. The autoscalers have a huge advantage in this field that homelabbers will never be able to match.
Would it not help with the DDR4 example though if we had more "real world" tests?
The bigger takeaway (IMO) is that there will never really be hardware that scales like Claude or ChatGPT does. I love local AI, but it stresses the fundamental limits of on-device compute.
I use opencode and have done a few toy projects and little changes in small repositories and can get pretty speedy and stable experience up to a 64k context.
It would probably fall apart if I wanted to use it on larger projects, but I've often set tasks running on it, stepped away for an hour, and had a solution when I return. It's definitely useful for smaller project, scaffolding, basic bug fixes, extra UI tweaks etc.
I don't think "usable" a binary thing though. I know you write lot about this, but it'd be interesting to understand what you're asking the local models to do, and what is it about what they do that you consider unusable on a relative monster of a laptop?
What's interesting to me about this model is how good it allegedly is with no thinking mode. That's my main complaint about qwen3:30b, how verbose its reasoning is. For the size it's astonishing otherwise.
I'm hoping for an experience where I can tell my computer to do a thing - write a code, check for logged errors, find something in a bunch of files - and I get an answer a few moments later.
Setting a task and then coming back to see if it worked an hour later is too much friction for me!
I've had mild success with GPT-OSS-120b (MXFP4, ends up taking ~66GB of VRAM for me with llama.cpp) and Codex.
I'm wondering if maybe one could crowdsource chat logs for GPT-OSS-120b running with Codex, then seed another post-training run to fine-tune the 20b variant with the good runs from 120b, if that'd make a big difference. Both models with the reasoning_effort set to high are actually quite good compared to other downloadable models, although the 120b is just about out of reach for 64GB so getting the 20b better for specific use cases seems like it'd be useful.
I wonder if it has to do with the message format, since it should be able to do tool use afaict.
In a day or two I'll release my answer to this problem. But, I'm curious, have you had a different experience where tool use works in one of these CLIs with a small local model?
Who pays for a free model? GPU training isn't free!
I remember early on people saying 100B+ models will run on your phone like nowish. They were completely wrong and I don't think it's going to ever really change.
People always will want the fastest, best, easiest setup method.
"Good enough" massively changes when your marketing team is managing k8s clusters with frontier systems in the near future.
People do not care about the fastest and best past a point.
Let's use transportation as an analogy. If all you have is a horse, a car is a massive improvement. And when cars were just invented, a car with a 40mph top speed was a massive improvement over one with a 20mph top speed and everyone swapped.
While cars with 200mph top speeds exist, most people don't buy them. We all collectively decided that for most of us, most of the time, a top speed of 110-120 was plenty, and that envelope stopped being pushed for consumer vehicles.
If what currently takes Claude Opus 10 minutes to do can be done is 30ms, then making something that can do it in 20ms isn't going to be enough to get everyone to pay a bunch of extra money for.
Companies will buy the cheapest thing that meets their needs. SOTA models right now are much better than the previous generation but we have been seeing diminishing returns in the jump sizes with each of the last couple generations. If the gap between current and last gen shrinks enough, then people won't pay extra for current gen if they don't need it. Just like right now you might use Sonnet or Haiku if you don't think you need Opus.
That said, I'm not sure if this capability is only achievable in huge frontier models, I would be perfectly content using a model that can do this (acting as a force multiplier), and not much else.
Phones are never going to run the largest models locally because they just don't have the size, but we're seeing improvements in capability at small sizes over time that mean that you can run a model on your phone now that would have required hundreds of billions of parameters less than 6 years ago.
When that happens, the most powerful AI will be whichever has the most virtuous cycles going with as wide a set of active users as possible. Free will be hard to compete with because raising the price will exclude the users that make it work.
Until then though, I think you're right that open will lag.
When there are no other downsides, sure. But when the frontier companies start tightening the thumbscrews, price will influence what people consider good enough.
Just the other day I was reading a paper about ANNs whose connections aren't strictly feedforward but, rather, circular connections proliferate. It increases expressiveness at the (huge) cost of eliminating the current gradient descent algorithms. As compute gets cheaper and cheaper, these things will become feasible (greater expressiveness, after all, equates to greater intelligence).
I was just talking at a high level. If transformers are HDD technology, maybe there's SSD right around the corner that's a paradigm shift for the whole industry (but for the average user just looks like better/smarter models). It's a very new field, and it's not unrealistic that major discoveries shake things up in the next decade or less.
That would imply they'd have to be actually smarter than humans, not just faster and be able to scale infinitely. IMHO that's still very far away..
On M1 64GB Q4KM on llama.cpp gives only 20Tok/s while on MLX it is more than twice as fast. However, MLX has problems with kv cache consistency and especially with branching. So while in theory it is twice as fast as llama.cpp it often does the PP all over again which completely trashes performance especially with agentic coding.
So the agony is to decide whether to endure half the possible speed but getting much better kv-caching in return. Or to have twice the speed but then often you have again to sit through prompt processing.
But who knows, maybe Qwen gives them a hand? (hint,hint)
Now if you are not happy with the last answer, you maybe want to simply regenerate it or change your last question - this is branching of the conversation. Llama.cpp is capable of re-using the KV cache up to that point while MLX does not (I am using MLX server from MLX community project). I haven't tried with LMStudio. Maybe worth a try, thanks for the heads-up.
Perhaps I'm grossly wrong -- I guess time will tell.
There is also the counter-intuitive phenomenon where training a model on a wider variety of content than apparently necessary for the task makes it better somehow. For example, models trained only on English content exhibit measurably worse performance at writing sensible English than those trained on a handful of languages, even when controlling for the size of the training set. It doesn't make sense to me, but it probably does to credentialed AI researchers who know what's going on under the hood.
To do well as an LLM you want to end up with the weights that gets furthest in the direction of "reasoning".
So assume that with just one language there's a possibility to get stuck in local optima of weights that do well on the English test set but which doesn't reason well.
If you then take the same model size but it has to manage to learn several languages, with the same number of weights, this would eliminate a lot of those local optima because if you don't manage to get the weights into a regime where real reasoning/deeper concepts is "understood" then it's not possible to do well with several languages with the same number of weights.
And if you speak several languages that would naturally bring in more abstraction, that the concept of "cat" is different from the word "cat" in a given language, and so on.
i.e. there is a lot of commonality between programming languages just as there is between human languages, so training on one language would be beneficial to competency in other languages.
I assumed that is what was catered for with "even when controlling for the size of the training set".
I.e. assuming I am reading it right: That it is better to get the same data as 25% in 4 languages, than 100% in one language.
FP8: https://huggingface.co/Qwen/Qwen3-Coder-Next-FP8
Sequential (single request)
Prompt Gen Prompt Processing Token Gen
Tokens Tokens (tokens/sec) (tokens/sec)
------ ------ ----------------- -----------
521 49 3,157 44.2
1,033 83 3,917 43.7
2,057 77 3,937 43.6
4,105 77 4,453 43.2
8,201 77 4,710 42.2
Parallel (concurrent requests)
pp4096+tg128 (4K context, 128 gen):
n t/s
-- ----
1 28.5
2 39.0
4 50.4
8 57.5
16 61.4
32 62.0
pp8192+tg128 (8K context, 128 gen):
n t/s
-- ----
1 21.6
2 27.1
4 31.9
8 32.7
16 33.7
32 31.7System info:
$ ./llama-server --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
version: 7897 (3dd95914d)
built with GNU 11.4.0 for Linux x86_64
llama.cpp command-line: $ ./llama-server --host 0.0.0.0 --port 2000 --no-warmup \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on \
--ctx-size 32768Not as good as running the entire thing on the GPU, of course.
Asking from a place of pure ignorance here, because I don't see the answer on HF or in your docs: Why would I (or anyone) want to run this instead of Qwen3's own GGUFs?
The green/yellow/red indicators for different levels of hardware support are really helpful, but far from enough IMO.
I guess it would make sense to have something like max context size/quants that fit fully on common configs with gpus, dual gpus, unified ram on mac etc.
Is your Q8_0 file the same as the one hosted directly on the Qwen GGUF page?
Great work as always btw!
brew upgrade llama.cpp # or brew install if you don't have it yet
Then: llama-cli \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
That opened a CLI interface. For a web UI on port 8080 along with an OpenAI chat completions compatible endpoint do this: llama-server \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
It's using about 28GB of RAM.Even for my usual toy coding problems it would get simple things wrong and require some poking to get to it.
A few times it got stuck in thinking loops and I had to cancel prompts.
This was using the recommended settings from the unsloth repository. It's always possible that there are some bugs in early implementations that need to be fixed later, but so far I don't see any reason to believe this is actually a Sonnet 4.5 level model.
3.7 was not all that great. 4 was decent for specific things, especially self contained stuff like tests, but couldn't do a good job with more complex work. 4.5 is now excellent at many things.
If it's around the perf of 3.7, that's interesting but not amazing. If it's around 4, that's useful.
Of course you get degraded performance with this.
Importantly, this isn’t just throwing more data at the problem in an unstructured way, afaik companies are getting as many got histories as they can and doing something along the lines of, get an llm to checkpoint pull requests, features etc and convert those into plausible input prompts, then run deep rl with something which passes the acceptance criteria / tests as the reward signal.
There's no reason for a coding model to contain all of ao3 and wikipedia =)
If we knew how to create a SOTA coding model by just putting coding stuff in there, that is how we would build SOTA coding models.
I had not considered that, seems like a great solution for local models that may be more resource-constrained.
Besides, programming is far from just knowing how to autocomplete syntax, you need a model that's proficient in the fields that the automation is placed in, otherwise they'll be no help in actually automating it.
I'm more bullish about small and medium sized models + efficient tool calling than I'm about LLMs too large to be run at home without $20k of hardware.
The model doesn't need to have the full knowledge of everything built into it when it has the toolset to fetch, cache and read any information available.
Chinese labs are acting as a disruption against Altman etcs attempt to create big tech monopolies, and that's why some of us cheer for them.
Video is speed up. I ran it through LM Studio and then OpenCode. Wrote a bit about how I set it all up here: https://www.tommyjepsen.com/blog/run-llm-locally-for-coding
Does anyone see a reason to still use elevenlabs etc. ?
Overall, it's allowed me to maintain more consistent workflows as I'm less dependent on Opus. Now that Mastra has introduced the concept of Workspaces, which allow for more agentic development, this approach has become even more powerful.
Related: as an actual magician, although no longer performing professionally, I was telling another magician friend the other day that IMHO, LLMs are the single greatest magic trick ever invented judging by pure deceptive power. Two reasons:
1. Great magic tricks exploit flaws in human perception and reasoning by seeming to be something they aren't. The best leverage more than one. By their nature, LLMs perfectly exploit the ways humans assess intelligence in themselves and others - knowledge recall, verbal agility, pattern recognition, confident articulation, etc. No other magic trick stacks so many parallel exploits at once.
2. But even the greatest magic tricks don't fool their inventors. David Copperfield doesn't suspect the lady may be floating by magic. Yet, some AI researchers believe the largest, most complex LLMs actually demonstrate emergent thinking and even consciousness. It's so deceptive it even fools people who know how it works. To me, that's a great fucking trick.
Are there any clear winners per domain? Code, voice-to-text, text-to-voice, text editing, image generation, text summarization, business-text-generation, music synthesis, whatever.