
Posted by cmitsakis 11 hours ago

Qwen3.6-35B-A3B: Agentic coding power, now open to all (qwen.ai)
851 points | 402 comments
999900000999 8 hours ago|
Looking to move off Ollama on openSUSE Tumbleweed.

Should I use brew to install llama.cpp, or zypper to install the Tumbleweed package?

badsectoracula 5 hours ago||
You can compile it from source. All you need to do is clone the repository, run `cmake -B build -DGGML_VULKAN=1` (add other backends if you want) followed by `cmake --build build --config Release`, and you get all the llama tools in `build/bin` (including `llama-server`, which provides a web-based interface). There is a `docs/build.md` with more detailed info, especially if you need another backend. At least on my RX 7900 XTX I see no difference in performance between Vulkan and ROCm, and the former is much more stable and compatible -- I tried ROCm for a bit thinking it'd be much faster, but it only ended up being much more annoying, as some models would OOM on it while they worked on Vulkan. If you're on NVIDIA hardware all this may sound quaint though :-P
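The steps described above, as one sketch (the Vulkan flag and model path are examples, not the only option -- check `docs/build.md` for your hardware):

```shell
# Clone and build llama.cpp from source, Vulkan backend as an example;
# swap the -DGGML_VULKAN=1 flag for another backend if needed (see docs/build.md).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release

# All tools land in build/bin, including the web-based server
# (the model path below is a placeholder):
./build/bin/llama-server -m /path/to/model.gguf
```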
999900000999 3 hours ago||
Cool, I assume this is how adults use LLMs.

I'm on an NVIDIA GPU, but I want to be able to combine VRAM with system memory.

stratos123 3 hours ago|||
Why not just download the binaries from github releases?
rexreed 7 hours ago||
Why are you looking to move off Ollama? Just curious because I'm using Ollama and the cloud models (Kimi 2.5 and Minimax 2.7) which I'm having lots of good success with.
999900000999 6 hours ago||
Ollama co-mingles online and local models, which defeats the purpose for me.
rexreed 3 hours ago||
You can disable all cloud models in your Ollama settings if you just want everything local. You don't have to use the cloud models unless you explicitly request them.
solomatov 7 hours ago||
Has anyone tried both it and Gemma 4? Does it feel better than Gemma 4?
incomingpain 10 hours ago||
Wowzers, we were worried Qwen was going to suffer after losing several high-profile people on the team, but that's a huge drop.

Is it better than 27B?

segmondy 18 minutes ago||
This is obviously a continuation training of 3.5, it's not a new model architecture but an incremental improvement.
adrian_b 10 hours ago||
Their previous model Qwen3.5 was available in many sizes, from very small sizes intended for smartphones, to medium sizes like 27B and big sizes like 122B and 397B.

This model is the first that is provided with open weights from their newer family of models Qwen3.6.

Judging from its medium size, Qwen/Qwen3.6-35B-A3B is intended as a superior replacement of Qwen/Qwen3.5-27B.

It remains to be seen whether they will also publish replacements for the bigger 122B and 397B models in the future.

The older Qwen3.5 models can also be found in uncensored modifications. It also remains to be seen whether it will be easy to uncensor Qwen3.6, because for some recent models, like Kimi-K2.5, the methods used to remove censoring from older LLMs no longer worked.

mft_ 10 hours ago|||
There was also Qwen3.5-35B-A3B in the previous generation: https://huggingface.co/Qwen/Qwen3.5-35B-A3B
storus 8 hours ago|||
> Qwen/Qwen3.6-35B-A3B is intended as a superior replacement of Qwen/Qwen3.5-27B

Not at all, Qwen3.5-27B was much better than Qwen3.5-35B-A3B (dense vs MoE).

rubiquity 4 hours ago|||
Not sure why you're being downvoted; I guess it's because of how your reply is worded. Anyway, Qwen3.6-35B-A3B should have intelligence on par with a 10.25B-parameter dense model, so yes, Qwen3.5-27B is still going to outperform it in terms of quality of output, especially for long-horizon tasks.
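The 10.25B figure above appears to come from the common rule-of-thumb that an MoE's "dense-equivalent" capacity is roughly the geometric mean of total and active parameters -- a heuristic, not an official Qwen number. A quick check:

```shell
# Geometric-mean heuristic for MoE "dense-equivalent" size:
# sqrt(total_params * active_params), here 35B total / 3B active.
awk 'BEGIN { printf "%.2f\n", sqrt(35 * 3) }'
# prints 10.25
```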
mudkipdev 8 hours ago|||
Re-read that
storus 7 hours ago||
You should. 3.5 MoE was worse than 3.5 dense, so expecting 3.6 MoE to be superior to 3.5 dense is questionable; one could argue that 3.6 dense (not yet released) would be superior to 3.5 dense.
spuz 5 hours ago||
Ok but you made a claim about the new model by stating a fact about the old model. It's easy to see how you appeared to be talking about different things. As for the claim, Qwen do indeed say that their new 3.6 MoE model is on a par with the old 3.5 dense model:

> Despite its efficiency, Qwen3.6-35B-A3B delivers outstanding agentic coding performance, surpassing its predecessor Qwen3.5-35B-A3B by a wide margin and rivaling much larger dense models such as Qwen3.5-27B.

https://qwen.ai/blog?id=qwen3.6-35b-a3b

storus 3 hours ago||
This says a slightly different thing:

https://x.com/alibaba_qwen/status/2044768734234243427?s=48&t...

If you look, on many benchmarks the old dense model is still ahead, but in a couple of benchmarks the new 35B demolishes the old 27B. Hence "rivaling", so YMMV.

psim1 7 hours ago||
(Please don't downvote - serious question) Are Chinese models generally accepted for use within US companies? The company I work for won't allow Qwen.
DiabloD3 7 hours ago||
There is a difference between Chinese model and Chinese service.

Your company is most likely banning the use of foreign services, but it wouldn't make sense to ban the model, since the model would be run locally.

I wouldn't allow my employees to use a foreign service either if my company had location-specific laws it had to follow (i.e., financial, medical, or privacy laws, such as those in the EU).

That said, I'm not sure I'd allow them to use any AI product either, locally inferred on-prem or not: I need my employees to _not_ make mistakes, not automate mistake making.

kelsey98765431 7 hours ago||
In the private sector, yes. Anything that touches the public sector (government) starts to raise supply-chain concerns, and they want all-American-made models.
zengid 5 hours ago||
Any tips for running it locally within an agent harness? Maybe using pi or opencode?
stratos123 3 hours ago|
It pretty much just works. Run the Unsloth quant in llama.cpp and hook it up to pi. There are a bunch of minor annoyances, like no support for thinking effort. It also defaults to "interleaved thinking" (thinking blocks get stripped from context); set `"chat_template_kwargs": {"preserve_thinking": true}` if you interrupt the model often and don't want it to forget what it was thinking.
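As a sketch of where that setting would go -- the endpoint, port, and the `preserve_thinking` field are taken from the comment above and assumed to match your llama.cpp build, so verify against your version:

```shell
# Hypothetical request to a local llama-server on its default port,
# passing preserve_thinking through chat_template_kwargs.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "hello"}],
    "chat_template_kwargs": {"preserve_thinking": true}
  }'
```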
andy_ppp 7 hours ago||
Do we know if other models have started detecting and poisoning training/fine tuning that these Chinese models seem to use for alignment, I’d certainly be doing some naughty stuff to keep my moat if I was Anthropic or OpenAI…
storus 3 hours ago|
They no longer show reasoning traces and are throttling more aggressively.
zozbot234 3 hours ago||
They never showed full reasoning traces, just post-hoc summaries.
storus 2 hours ago||
DeepSeek still shows them; it sometimes says "I am ChatGPT", and Claude sometimes says "I am DeepSeek", so the distillation went both ways.
tmaly 7 hours ago||
What is the minimum VRAM this can run on, given it is MoE?
mncharity 5 hours ago|
Fwiw, with its predecessor, Qwen3.5-35B-A3B-Q6_K.gguf, on a laptop with 6 GB VRAM and 32 GB RAM, with default llama.cpp settings, I get 20 t/s generation.
rubiquity 4 hours ago||
Have you tried running llama.cpp with Unified Memory Access[1] so your iGPU can seamlessly grab some of the RAM? The environment variable is prefixed with CUDA but this is not CUDA specific. It made a pretty significant difference (> 40% tg/s) on my Ryzen 7840U laptop.

1 - https://github.com/ggml-org/llama.cpp/blob/master/docs/build...
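A minimal usage sketch, assuming a local llama.cpp build (the model path is a placeholder, and despite the CUDA prefix the variable reportedly isn't CUDA-specific):

```shell
# Runtime environment variable, no rebuild needed: lets the GPU backend
# spill into system RAM instead of failing when VRAM runs out.
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-server -m /path/to/model.gguf
```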

zozbot234 4 hours ago|||
Your link seems to be describing a runtime environment variable, it doesn't need a separate build from source. I'm not sure though (1) why this info is in build.md which should be specific to the building process, rather than some separate documentation; and (2) if this really isn't CUDA-specific, why the canonical GGML variable name isn't GGML_ENABLE_UNIFIED_MEMORY , with the _CUDA_ variant treated as a legacy alias. AIUI, both of these should be addressed with pull requests for llama.cpp and/or the ggml library itself.
rubiquity 3 hours ago||
You are right that it is an environment variable, and that's how I have it set in my nix config. Thanks for correcting that.

Unfortunately llama.cpp is somewhat notorious for having lackluster docs. Most of the CLI tools don't even tell you what they are for.

mncharity 2 hours ago||
Hmm. Perhaps there's a niche for a "The Missing Guide to llama.cpp"? Getting started, I did things like wrapping llama-cli in a pty... and only later noticing a --simple-io argument. I wonder if "living documents" are a thing yet, where LLMs keep an eye on repo and fora, and update a doc autonomously.
mncharity 2 hours ago|||
I hadn't tried that, thanks! I found simply defining GGML_CUDA_ENABLE_UNIFIED_MEMORY, whether 1, 0, or "", was a 10x hit to 2 t/s. Perhaps because the laptop's RAM is already so over-committed there. But with the much smaller 4B Qwen3.5-4B-Q8_0.gguf, it doubled performance from 20 to 40+ t/s! Tnx! (an old Quadro RTX 3000 rather than an iGPU)
zoobab 10 hours ago||
"open source"

give me the training data?

tjwebbnorfolk 10 hours ago||
The training data is the entire internet. How do you propose they ship that to you?
thrance 8 hours ago||
As a zip archive of however they store it in their database?
flux3125 10 hours ago||
You ARE the training data
smcl 1 hour ago|
fuck off: https://news.ycombinator.com/item?id=47796830
More comments...