Posted by mft_ 8 hours ago
I've had great success (~20 t/s) running it on a M1 Ultra with room for 256k context. Here are some lm-evaluation-harness results I ran against it:
mmlu: 87.86%
gpqa diamond: 82.32%
gsm8k: 86.43%
ifeval: 75.90%
More details of my experience:
- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...
- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...
- https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...
Overall an excellent model to have for offline inference.
In my experience the 2-bit quants can produce sensible output for short prompts, but they aren’t useful for doing real work in longer sessions.
This project couldn’t even get useful JSON out of the model because it can’t produce the right token for quotes:
> 2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable.
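That failure mode is easy to reproduce with any JSON parser: once the quotes around keys degrade into backslashes, the payload is simply not valid JSON (the strings below are illustrative, not actual model output):

```python
import json

good = '{"name": "get_weather"}'   # what the model should emit
bad = r'{\name\: "get_weather"}'   # the reported 2-bit failure pattern

assert json.loads(good)["name"] == "get_weather"  # parses fine

try:
    json.loads(bad)
    parsed = True
except json.JSONDecodeError:
    parsed = False  # tool-call payload is unusable
```

Any tool-calling harness will reject the second string outright, which is why this single token error makes the whole agentic loop unreliable.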
Note that not all quants are the same at a certain BPW. The smol-IQ2_XS quant I linked is pretty dynamic, with some tensors having q8_0 type, some q6_k and some q4_k (while the majority is iq2_xs). In my testing, this smol-IQ2_XS quant is the best available at this BPW range.
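As a rough illustration of why such mixes land at a given BPW (the tensor fractions below are made up, not the actual smol-IQ2_XS recipe), the effective bits-per-weight is just the size-weighted average of the component quant types:

```python
# Hypothetical parameter fractions for a mixed quant; the real
# smol-IQ2_XS recipe differs -- this only shows the averaging.
mix = {
    "q8_0":   (0.05, 8.5),   # (fraction of params, approx bits/weight)
    "q6_k":   (0.05, 6.56),
    "q4_k":   (0.10, 4.5),
    "iq2_xs": (0.80, 2.31),
}

effective_bpw = sum(frac * bpw for frac, bpw in mix.values())
print(f"effective bits per weight ≈ {effective_bpw:.2f}")  # → ≈ 3.05
```

Keeping the most sensitive tensors (embeddings, attention) at higher precision costs little in total size while, in practice, preserving much more quality than a uniform 2-bit quant.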
Eventually I might try a more practical eval such as terminal bench.
This is always the problem with the 2-bit and even 3-bit quants: They look promising in short sessions but then you try to do real work and realize they’re a waste of time.
Running a smaller dense model like 27B produces better results than 2-bit quants of larger models in my experience.
It would be nice to see a scientific assessment of that statement.
In my anecdotal experience I’ve been happier with Q6 and dealing with the tradeoffs that come with it over Q4 for Qwen3.5 27B.
They did reduce the number of experts, so maybe that was it?
By the way, it's been a long time since I last saw your username. You're the guy who launched Neovim! What a success. Of all the Kickstarter/Bountysource campaigns I've been a tiny part of, that one had the best outcome. I use it every day.
I ran llama-bench a couple of weeks ago when there was a big speed improvement on llama.cpp (https://github.com/ggml-org/llama.cpp/pull/20361#issuecommen...):
% llama-bench -m ~/ml-models/huggingface/ubergarm/Qwen3.5-397B-A17B-GGUF/smol-IQ2_XS/Qwen3.5-397B-A17B-smol-IQ2_XS-00001-of-00004.gguf -fa 1 -t 1 -ngl 99 -b 2048 -ub 2048 -d 0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,150000,200000,250000
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.008 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 134217.73 MB
| model                          |       size |     params | backend    | threads | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -------: | -: | --------------: | -------------------: |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 | 189.67 ± 1.98 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 | 19.98 ± 0.01 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d10000 | 168.92 ± 0.55 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d10000 | 18.93 ± 0.02 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d20000 | 152.42 ± 0.22 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d20000 | 17.87 ± 0.01 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d30000 | 139.37 ± 0.28 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d30000 | 17.12 ± 0.01 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d40000 | 128.38 ± 0.33 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d40000 | 16.38 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d50000 | 118.07 ± 0.55 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d50000 | 15.66 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d60000 | 108.44 ± 0.38 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d60000 | 14.98 ± 0.01 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d70000 | 98.85 ± 0.18 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d70000 | 14.36 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d80000 | 91.39 ± 0.49 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d80000 | 13.84 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d90000 | 85.76 ± 0.24 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d90000 | 13.30 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d100000 | 80.19 ± 0.83 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d100000 | 12.82 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d150000 | 54.46 ± 0.33 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d150000 | 10.17 ± 0.09 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d200000 | 47.05 ± 0.15 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d200000 | 9.04 ± 0.02 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d250000 | 40.71 ± 0.26 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d250000 | 8.01 ± 0.02 |
build: d28961d81 (8299)
So it starts at 20 t/s tg and 190 t/s pp with an empty context, and ends at 8 t/s tg and 40 t/s pp with a 250k-token prefill. I suspect there are still a lot of optimizations to be implemented for Qwen 3.5 in llama.cpp; I wouldn't be surprised to reach 25 t/s in a few months.
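From the pp/tg numbers in the table above you can do a back-of-the-envelope single-turn latency estimate (prompt tokens / pp rate + output tokens / tg rate; the 2000/500 token counts are just illustrative):

```python
def turn_latency(prompt_tokens, output_tokens, pp_rate, tg_rate):
    """Rough single-turn latency in seconds, ignoring fixed overheads."""
    return prompt_tokens / pp_rate + output_tokens / tg_rate

# Empty context: pp ~190 t/s, tg ~20 t/s (values from the table above)
print(turn_latency(2000, 500, 189.67, 19.98))   # ~35 s
# At 100k depth: pp ~80 t/s, tg ~12.8 t/s
print(turn_latency(2000, 500, 80.19, 12.82))    # ~64 s
```

Note that most of the slowdown at depth comes from token generation, since the 500 output tokens dominate the total even though the prompt is 4x larger.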
> You're the guy who launched Neovim!
That's me ;D
> I use it every day.
So have I, for the past 12 years! Though I admit that in the past year I've greatly reduced the amount of code I write by hand :/
@justinmk deserves the credit for this!
Have you compared against MLX? Sometimes I’m getting much faster responses but it feels like the quality is worse (eg tool calls not working, etc)
I don't think MLX supports similar 2-bit quants, so I never tried 397B with MLX.
However I did try 4-bit MLX with other Qwen 3.5 models and yes, it is significantly faster. I still prefer llama.cpp due to it being an all-in-one package:
- SOTA dynamic quants (especially ik_llama.cpp)
- amazing web UI with MCP support
- Anthropic/OpenAI compatible endpoints (meaning it can be used with virtually any harness)
- JSON constrained output, which basically ensures tool call correctness
- routing mode
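For the JSON-constrained-output point, here is a sketch of a request payload for llama-server's OpenAI-compatible endpoint (exact `response_format` support varies by llama.cpp version; the tool-call schema is hypothetical, and the request is only constructed, not sent):

```python
import json

# Hypothetical tool-call schema; llama-server can constrain sampling
# so the output must conform to it.
payload = {
    "model": "qwen3.5-397b",
    "messages": [{"role": "user", "content": "Call get_weather for Paris."}],
    "response_format": {
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "arguments": {"type": "object"},
            },
            "required": ["name", "arguments"],
        },
    },
}

body = json.dumps(payload)  # ready to POST to /v1/chat/completions
```

Because the grammar is enforced at sampling time, even a heavily quantized model cannot emit the malformed `\name\` tokens; the sampler simply never offers them.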
Update: I just did a quick asitop test while inferencing and the GPU power was averaging 53.55 W.
This is some interesting work, but applying such extreme measures to LLMs to get them to run severely degrades quality. I know he claims negligible quality loss, but in my experience 2-bit quantizations are completely useless for real work. You can get them to respond to prompts, but they lose their intelligence and will go around in circles.
He also shows 5-6 tokens per second. Again, that’s impressive for a large model on limited hardware, but it’s very slow. Between the severely degraded model abilities and the extremely slow output, the 397B result should be considered an attempt at proving something can technically run, not evidence that it can run well and produce output you’d expect from a 397B model.
He even mentions the obvious problems with his changes:
> 2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable.
So right out of the gate this isn’t useful if you want to do anything with it. He could have tried smaller models or less aggressive quantization to get actually useful output from the model, but it wouldn’t look as impressive. It’s honestly getting kind of exhausting to read all of these AI-coded (admitted in the link) and AI-written papers made more for resume building. It would have been more interesting to see this work applied to running a useful model that hadn’t been lobotomized, instead of applying tricks to get an impressive headline but useless output.
Hand written... by GPT? ;)
To render movies we happily wait for the computer to calculate how lights bounce around, for hours even days.
So why not do the same with AIs? Ask big questions of big models and get the answer to the universe tomorrow?
Most LLM use cases are about accelerating workflows. If you have to wait all night for a response and then possibly discover that it took the wrong direction, misunderstood your intent, or your prompt was missing some key information then you have to start over.
I don’t let LLMs write my code, but I do a lot of codebase exploration, review, and throwaway prototyping. I have hundreds to maybe thousands of turns in LLM conversations each day. If I had to wait 10X or 100X as long, it wouldn’t be useful. I’d be more productive ignoring a slow LLM and doing it all myself.
If you have to wait overnight because the model is offloading to disk, that's a model you wouldn't have been able to run otherwise without very expensive hardware. You haven't really lost anything. If anything, it's even easier to check on what a model is doing during a partial inference or agentic workload if the inference process is slower.
This exact problem exists for rendering: you realize after a long render that an object was missing from the background, and the costly frame is now useless. To counter that, you make multiple "draft" renders first to make sure everything is in the frame and your parameters are properly tuned.
Even with a MoE model, which has to move a relatively small portion of the weights around, you do end up quite bandwidth constrained though.
It’s workable for mixture-of-experts models, but performance falls off a cliff as soon as the model overflows out of the GPU into system RAM. There is another performance cliff when the model has to be fetched from disk on every pass.
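The cliffs are mostly a bandwidth roofline: per decoded token, every active parameter must be read once, so tokens/s is bounded by bandwidth divided by active bytes. A sketch with illustrative numbers (17B active params as in A17B; the bandwidth figures per tier are ballpark assumptions, not measurements):

```python
def roofline_tps(active_params_b, bits_per_weight, bandwidth_gb_s):
    """Upper bound on tokens/s when decode is memory-bandwidth bound."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

active = 17  # A17B: ~17B active params per token
for tier, bw in [("GPU unified memory (800 GB/s)", 800),
                 ("system RAM (80 GB/s)", 80),
                 ("NVMe SSD (7 GB/s)", 7)]:
    print(f"{tier}: ~{roofline_tps(active, 4.5, bw):.1f} tok/s max")
```

Each tier drops throughput by roughly an order of magnitude, which is exactly the cliff shape people observe when a model spills out of fast memory.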
Outside of that the SSD is idling.
Table 3 shows, for K=4 experts, an IO of 943 MB/tok at 3.15 tok/s, giving an average IO of 2970 MB/s, far below what the SSD could do.
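The arithmetic is just the product of the two figures:

```python
io_per_token_mb = 943   # Table 3, K=4 experts
tokens_per_s = 3.15

avg_io = io_per_token_mb * tokens_per_s  # MB/s of sustained reads
print(f"{avg_io:.0f} MB/s")  # ≈ 2970 MB/s, under a modern NVMe's peak
```

So the SSD has headroom left; the bottleneck is that reads and compute are serialized rather than overlapped.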
I'm not sure, but not all expert weights are used immediately. Maybe they could do async reads for the down tensors, parallelizing compute with IO.
Not sure if this works on a Mac; I only tested my larger-than-RAM setup on Linux with io_uring O_DIRECT reads, and I saw that about 20% of total reads finish while my fused up/gate matmul is already running.
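The io_uring + O_DIRECT pipeline itself is Linux-specific, but the overlap idea can be sketched portably: kick off the next layers' expert-weight reads on a background thread while the current matmul runs (plain threads and sleeps stand in for async disk IO here; all names are hypothetical):

```python
import queue
import threading
import time

def prefetch(layer_ids, out_q):
    """Stand-in for async O_DIRECT reads of per-layer expert weights."""
    for lid in layer_ids:
        time.sleep(0.01)                 # pretend disk latency
        out_q.put((lid, f"weights-{lid}"))

def run_model(n_layers=4):
    q = queue.Queue()
    t = threading.Thread(target=prefetch, args=(range(n_layers), q))
    t.start()                            # reads begin before compute
    results = []
    for _ in range(n_layers):
        lid, w = q.get()                 # block until this layer arrives
        results.append(f"matmul({w})")   # compute overlaps later reads
    t.join()
    return results

print(run_model())
```

While layer 0's matmul runs, reads for layers 1..3 are already in flight, which is the same effect as the ~20% of reads completing during the fused matmul described above.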
Edit: Typos