
Posted by cmitsakis 8 hours ago

Qwen3.6-35B-A3B: Agentic coding power, now open to all (qwen.ai)
716 points | 338 comments
fooblaster 8 hours ago|
Honestly, this is the AI software I actually look forward to seeing. No hype about it being too dangerous to release. No IPO pumping hype. No subscription fees. I am so pumped to try this!
wrxd 6 hours ago|
Same here. I really hope that in the near future local models will be good enough, and hardware fast enough to run them, for them to become viable for most use cases.
vlapec 1 hour ago||
No need to hope; it is inevitable.
abhikul0 8 hours ago||
I hope the other sizes are coming too (9B for me). Can't fit much context with this one on a 36GB Mac.
mhitza 7 hours ago||
It's a MoE model; the A3B stands for 3 billion active parameters, like the recent Gemma 4.

You can try offloading the experts to the CPU with llama.cpp (--cpu-moe); that should give you quite a bit of extra context space, at a lower token generation speed.
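As a concrete sketch (flag names from recent llama.cpp builds; the model filename is illustrative):

```shell
# Keep the MoE expert tensors on the CPU, attention/dense layers on the GPU
llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf --cpu-moe -c 32768

# Or offload only the experts of the first 20 layers to the CPU
llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf --n-cpu-moe 20 -c 32768
```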

abhikul0 7 hours ago|||
Mac has unified memory, so 36GB is 36GB for everything: GPU and CPU.
zozbot234 7 hours ago|||
CPU-MoE still helps via mmap. It shouldn't hurt token-gen speed much on a Mac, since the CPU has access to most (though not all) of the unified memory bandwidth, which is the bottleneck.
abhikul0 6 hours ago||
I'll try that, but llama-server has mmap on by default and the model still takes up its full size in RAM; not sure what's going on.
zozbot234 6 hours ago||
Try running CPU-only inference to troubleshoot that. GPU layers will likely just ignore mmap.
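A troubleshooting sketch (`-ngl` controls GPU layer offload; both flags are standard llama.cpp options, the model filename is a placeholder):

```shell
# CPU-only: weights stay mmap'd and page-shared, so resident memory should look much lower
llama-server -m model.gguf -ngl 0

# For comparison, disable mmap entirely and watch resident memory jump
llama-server -m model.gguf -ngl 0 --no-mmap
```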
mhitza 7 hours ago|||
For sure, I was running on autopilot with that reply. Though at Q4 I would expect it to fit; the 24B-A4B Gemma model without CPU offloading got up to 18GB of VRAM usage.
dgb23 7 hours ago||||
Should I expect the same memory footprint from a model with N active parameters as from one with simply N total parameters?
daemonologist 7 hours ago|||
No - this model has the weights memory footprint of a 35B model (you do save a little bit on the KV cache, which will be smaller than the total size suggests). The lower number of active parameters gives you faster inference, including lower memory bandwidth utilization, which makes it viable to offload the weights for the experts onto slower memory. On a Mac, with unified memory, this doesn't really help you. (Unless you want to offload to nonvolatile storage, but it would still be painfully slow.)

All that said, you could probably squeeze it onto a 36GB Mac. A lot of people run models of this size on 24GB GPUs, at 4-5 bits per weight quantization and maybe with reduced context size.
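A back-of-envelope check of that claim (the parameter count comes from the model name; the bits-per-weight figures are approximate GGUF averages including quantization block overhead, not exact values):

```python
# Approximate weight memory for a 35B-parameter model at common
# GGUF quantization levels.
PARAMS = 35e9

def weights_gib(bits_per_weight: float) -> float:
    """Approximate in-RAM size of the weights in GiB."""
    return PARAMS * bits_per_weight / 8 / 2**30

for name, bpw in [("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8)]:
    print(f"{name}: ~{weights_gib(bpw):.1f} GiB")
```

At ~4.8 bits per weight that works out to roughly 20 GiB for the weights alone, which is why the 24GB-GPU figure in the thread is plausible with a modest context.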

pdyc 7 hours ago|||
I don't get it: the Mac has unified memory, so how would offloading experts to the CPU help?
bee_rider 7 hours ago||
I bet the poster just didn’t remember that important detail about Macs, it is kind of unusual from a normal computer point of view.

I wonder though, do Macs have swap? Could unused experts be offloaded to swap?

abhikul0 7 hours ago||
Of course swap is there as a fallback, but I hate using it lol, as I don't want to degrade SSD longevity.
pdyc 7 hours ago||
Can you elaborate? You can use a quantized version; would context still be an issue with it?
abhikul0 7 hours ago|||
A usable quant, Q5_K_M imo, takes up ~26GB[0], which leaves around ~6-7GB for context and other running programs, which is not much.

[0] https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF?show_fil...

nickthegreek 7 hours ago|||
context is always an issue with local models and consumer hardware.
pdyc 7 hours ago||
Correct, but it should be some ratio of model size: if the model is x GB, the max context should occupy x times some constant of RAM. For a quantized version, assuming it's 18GB at Q4, this Mac should be able to support 64-128k context.
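The "some constant" is mostly the KV cache, which grows linearly with context length. A sketch of the usual estimate; all the architecture numbers below (layers, KV heads, head dim) are illustrative assumptions, not this model's actual config:

```python
# Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim
# * bytes per element, per token of context.
def kv_cache_gib(n_ctx: int, n_layers: int = 48, n_kv_heads: int = 4,
                 head_dim: int = 128, bytes_per_elt: int = 2) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt
    return n_ctx * per_token / 2**30

print(f"64k ctx, fp16 KV: ~{kv_cache_gib(65536):.1f} GiB")
print(f"64k ctx, Q8 KV:   ~{kv_cache_gib(65536, bytes_per_elt=1):.1f} GiB")
```

With those assumed dimensions, 64k of fp16 KV cache is around 6 GiB, halved by Q8 KV quantization, which matches the "model size plus a few GB" intuition above.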
abhikul0 7 hours ago||
For the 9B model, I can use the full context with Q8_0 KV. This uses around ~16GB, while still leaving a comfortable headroom.

Output after I exit the llama-server command:

  llama_memory_breakdown_print: | memory breakdown [MiB]  | total    free     self   model   context   compute    unaccounted |
  llama_memory_breakdown_print: |   - MTL0 (Apple M3 Pro) | 28753 = 14607 + (14145 =  6262 +    4553 +    3329) +           0 |
  llama_memory_breakdown_print: |   - Host                |                   2779 =   666 +       0 +    2112                |
KronisLV 4 hours ago||
I wonder how this one compares to Qwen3 Coder Next (the 80B A3B model), since you'd think that even though it's older, it having more parameters would make it more useful for agentic and development use cases: https://huggingface.co/collections/Qwen/qwen3-coder-next
the__alchemist 2 hours ago||
Is this the hybrid variant of Gwent and Quen? I hope this is in The Witcher IV!
jake-coworker 7 hours ago||
This is surprisingly close to Haiku quality, but open - and Haiku is quite a capable model (many of the Claude Code subagents use it).
wild_egg 7 hours ago|
Where did you see a Haiku comparison? Haiku 4.5 was my daily driver for a month or so before Opus 4.5 dropped, and I'd be unreasonably happy if a local model could give me similar capability.
daemonologist 6 hours ago|||
I didn't see a direct comparison, but there's some overlap in the published benchmarks:

                           │ Qwen 3.6 35B-A3B │ Haiku 4.5               
   ────────────────────────┼──────────────────┼──────────────────────── 
    SWE-Bench Verified     │ 73.4             │ 66.6                    
   ────────────────────────┼──────────────────┼──────────────────────── 
    SWE-Bench Multilingual │ 67.2             │ 64.7                    
   ────────────────────────┼──────────────────┼──────────────────────── 
    SWE-Bench Pro          │ 49.5             │ 39.45                   
   ────────────────────────┼──────────────────┼──────────────────────── 
    Terminal Bench 2.0     │ 51.5             │ 61.2 (Warp), 27.5 (CC)  
   ────────────────────────┼──────────────────┼──────────────────────── 
    LiveCodeBench          │ 80.4             │ 41.92                   

These are of course all public benchmarks though - I'd expect there to be some memorization/overfitting happening. The proprietary models usually have a bit of an advantage in real-world tasks in my experience.
coder543 6 hours ago||||
Artificial Analysis hasn't posted their independent analysis of Qwen3.6 35B A3B yet, but Alibaba's benchmarks paint it as being on par with Qwen3.5 27B (or better in some cases).

Even Qwen3.5 35B A3B benchmarks roughly on par with Haiku 4.5, so Qwen3.6 should be a noticeable step up.

https://artificialanalysis.ai/models?models=gpt-oss-120b%2Cg...

No, these benchmarks are not perfect, but short of trying it yourself, this is the best we've got.

Compared to the frontier coding models like Opus 4.7 and GPT 5.4, Qwen3.6 35B A3B is not going to feel smart at all, but for something that can run quickly at home... it is impressive how far this stuff has come.

deaux 3 hours ago|||
I find Gemma 4 26B A4B better than Haiku 4.5 and that's smaller than this one.
giantg2 3 hours ago||
I can't wait to see some smaller sizes. I would love to run some sort of coding-centric agent on a local TPU or GPU instead of having to pay, even if it's slower.
codeugo 2 hours ago||
Are we going to get to the point where a local model can do almost what sonnet 4.6 can do?
intothemild 2 hours ago||
We're already there IMHO, if you have enough RAM. Even the ~32GB crowd can run models that beat Sonnet 4.5.
bluerooibos 2 hours ago||
Of course we are. And Opus 4.6+. It's a matter of when, not if.
cyrialize 4 hours ago||
My last laptop was a used 2012 T530.

My current is a used M1 MacBook Pro with 16GB of RAM.

I thought this was all I was ever going to need, but wanting to run really nice models locally has me thinking about upgrading.

Although, part of me wants to see how far I could get with my trusty laptop.

bigyabai 4 hours ago|
Your current laptop is still a fine thin client. Unless you program in the woods, it's probably cheapest to build a home inference box and route it over Tailscale or something.
system2 2 hours ago||
Or just an API server for all other devices to connect and do stuff with it.
zengid 2 hours ago||
Any tips for running it locally within an agent harness? Maybe using pi or opencode?
stratos123 1 hour ago|
It pretty much just works. Run the unsloth quant in llama.cpp and hook it up to pi. There are a few minor annoyances, like no support for thinking effort. It also defaults to "interleaved thinking" (thinking blocks get stripped from context); set `"chat_template_kwargs": {"preserve_thinking": True},` if you interrupt the model often and don't want it to forget what it was thinking.
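A sketch of where that setting goes, as a request body for llama-server's OpenAI-compatible /v1/chat/completions endpoint. The `chat_template_kwargs` field is the one from the comment above; the model name and message are illustrative:

```python
import json

# Build the chat request body; `chat_template_kwargs` keeps prior thinking
# blocks in context instead of stripping them between turns.
payload = {
    "model": "qwen3.6-35b-a3b",
    "messages": [{"role": "user", "content": "Refactor this function."}],
    "chat_template_kwargs": {"preserve_thinking": True},
}

body = json.dumps(payload)  # what you'd POST, e.g. via curl or requests
print(body)
```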
amelius 5 hours ago|
Looks like they compare only to open models, unfortunately.

As I am using mostly the non-open models, I have no idea what these numbers mean.
