Posted by cmitsakis 8 hours ago
You can try to offload the experts to the CPU with llama.cpp (--cpu-moe), which should give you quite a lot of extra context space, at a lower token generation speed.
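As a rough sketch of what that looks like (the model filename and context size here are placeholders I've picked for illustration, not from the thread):

```shell
# Hypothetical llama.cpp invocation: --cpu-moe keeps the MoE expert
# tensors in system RAM while attention and the dense layers stay on
# the GPU, freeing VRAM for KV cache/context at some generation-speed cost.
llama-server -m Qwen3.5-35B-A3B-Q4_K_M.gguf --cpu-moe -c 32768
```

Since only a few experts are active per token in an A3B-style model, the speed hit is usually smaller than offloading arbitrary layers would be.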
All that said you could probably squeeze it onto a 36GB Mac. A lot of people run this size model on 24GB GPUs, at 4-5 bits per weight quantization and maybe with reduced context size.
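The back-of-envelope math for why that fits (pure arithmetic, not a measurement) is just weights times bits per weight:

```shell
# 35e9 weights * 4.5 bits, divided by 8 bits-per-byte, expressed in GiB.
awk 'BEGIN { printf "%.1f GiB\n", 35e9 * 4.5 / 8 / 1024^3 }'
# prints "18.3 GiB"
```

That leaves a few GB on a 24GB card for the KV cache, which is why people shrink the context window to make it all fit.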
I wonder, though: do Macs have swap? Could unused experts be offloaded to swap?
[0] https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF?show_fil...
Output after I exit the llama-server command:
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - MTL0 (Apple M3 Pro) | 28753 = 14607 + (14145 = 6262 + 4553 + 3329) + 0 |
llama_memory_breakdown_print: | - Host                 |          2779 =   666 +      0 +    2112 |

                        │ Qwen 3.6 35B-A3B │ Haiku 4.5
────────────────────────┼──────────────────┼────────────────────────
SWE-Bench Verified      │ 73.4             │ 66.6
────────────────────────┼──────────────────┼────────────────────────
SWE-Bench Multilingual  │ 67.2             │ 64.7
────────────────────────┼──────────────────┼────────────────────────
SWE-Bench Pro           │ 49.5             │ 39.45
────────────────────────┼──────────────────┼────────────────────────
Terminal Bench 2.0      │ 51.5             │ 61.2 (Warp), 27.5 (CC)
────────────────────────┼──────────────────┼────────────────────────
LiveCodeBench           │ 80.4             │ 41.92
These are, of course, all public benchmarks, so I'd expect some memorization/overfitting. In my experience, the proprietary models usually have a bit of an advantage in real-world tasks. Even Qwen3.5 35B A3B benchmarks roughly on par with Haiku 4.5, so Qwen3.6 should be a noticeable step up.
https://artificialanalysis.ai/models?models=gpt-oss-120b%2Cg...
No, these benchmarks are not perfect, but short of trying it yourself, this is the best we've got.
Compared to the frontier coding models like Opus 4.7 and GPT 5.4, Qwen3.6 35B A3B is not going to feel smart at all, but for something that can run quickly at home... it is impressive how far this stuff has come.
My current machine is a used M1 MacBook Pro with 16GB of RAM.
I thought this was all I was ever going to need, but wanting to run really nice models locally has me thinking about upgrading.
Although, part of me wants to see how far I could get with my trusty laptop.
As I mostly use the non-open models, I have no idea what these numbers mean.