Posted by cmitsakis 10 hours ago

Qwen3.6-35B-A3B: Agentic coding power, now open to all (qwen.ai)
808 points | 376 comments | page 3
Glemllksdf 7 hours ago|
I tried Gemma 4 A4B and was surprised how hard it is to use for agentic stuff on an RTX 4090 with 24 GB of VRAM.

Balancing the KV cache and context eats VRAM super fast.

giantg2 5 hours ago||
I can't wait to see some smaller sizes. I would love to run some sort of coding-centric agent on a local TPU or GPU instead of having to pay, even if it's slower.
adrian_b 9 hours ago||
Available for download:

https://huggingface.co/Qwen/Qwen3.6-35B-A3B

dataflow 9 hours ago||
I'm a newbie here and I'm lost on how I'm supposed to use these models for coding. When I use them with Continue in VSCode and start typing basic C:

  #include <stdio.h>
  int m
I get nonsensical autocompletions like:

  #include <stdio.h>
  int m</fim_prefix>
What is going on?
sosodev 8 hours ago||
This isn't an autocomplete model. It's built to be used with an agentic coding harness like Pi or OpenCode.
zackangelo 8 hours ago|||
They are, but the IDE needs to be integrated with them.

Qwen specifically calls out FIM (“fill in the middle”) support on the model card and you can see it getting confused and posting the control tokens in the example here.
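For context on why the raw tokens leak through: a FIM-capable model expects the editor to wrap the text around the cursor in special control tokens, and the model then generates only the missing middle. A minimal sketch of the prompt assembly (the token names are assumed from the convention Qwen's coder models have used; check the model card for the exact ones):

```python
# Assemble a fill-in-the-middle (FIM) prompt from the code around the
# cursor. The <|fim_*|> control tokens follow the convention used by
# Qwen's coder models; other models may use different token names.
prefix = "#include <stdio.h>\nint m"  # code before the cursor
suffix = "\n"                         # code after the cursor
prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
print(prompt)  # the model is expected to emit only the missing middle
```

When the IDE sends the wrong template (or passes the tokens through as plain text), the model can end up echoing control tokens back into the completion, which is what the stray `</fim_prefix>` in the example output looks like.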

sosodev 8 hours ago||
Oh, that’s interesting. Thanks for the correction. I didn’t know such heavily post-trained models could still do good ol' fashioned autocomplete.
JokerDan 7 hours ago|||
And even for those models trained for tool calling and agentic flows, mileage may vary depending on lots of factors. I've been playing around with smaller local models (anything that fits on a 4090 + 64 GB RAM) and it seems to be a lottery: a) whether it works at all, and b) how long it will work for.

Sometimes they don't manage any tool calls and fall over right off the bat; other times they manage a few tool calls and then start spewing nonsense. Some can manage sub-agents for a while, then fall apart. I just can't seem to get any consistently decent output on more 'consumer/home PC' type hardware. Mostly been using either pi or OpenCode for this testing.

Jeff_Brown 8 hours ago|||
This might sound snarky but in all earnestness, try talking to an AI about your experience using it.
woctordho 8 hours ago|||
Choose the correct FIM (Fill In the Middle) template for Qwen in Continue. All recent Qwen models are actually trained with FIM capability and you can use them.
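For reference, this is the kind of entry meant. A sketch of Continue's `config.json` (the field names are from Continue's docs as I remember them, and the model identifier is a placeholder, so double-check both against the current Continue documentation):

```json
{
  "tabAutocompleteModel": {
    "title": "Qwen3.6-35B-A3B",
    "provider": "ollama",
    "model": "qwen3.6:35b-a3b"
  }
}
```

The important part is registering the model for autocomplete rather than chat, so that Continue applies a FIM prompt template instead of a chat template.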
recov 8 hours ago||
I would use something like zeta-2 instead - https://huggingface.co/bartowski/zed-industries_zeta-2-GGUF
syntaxing 6 hours ago||
Is it worth running speculative decoding on small active models like this? Or does MTP make speculative decoding unnecessary?
kombine 9 hours ago||
What kind of hardware (preferably non-Apple) can run this model? What about 122B?
daemonologist 8 hours ago||
The 3B active is small enough that it's decently fast even with experts offloaded to system memory. Any PC with a modern (>=8 GB) GPU and sufficient system memory (at least ~24 GB) will be able to run it okay; I'm pretty happy with just a 7800 XT and DDR4. If you want faster inference you could probably squeeze it into a 24 GB GPU (3090/4090 or 7900 XTX) but 32 GB would be a lot more comfortable (5090 or Radeon Pro).

122B is a more difficult proposition. (Also, keep in mind the 3.6 122B hasn't been released yet and might never be.) With 10B active parameters offloading will be slower - you'd probably want at least 4 channels of DDR5, or 3x 32GB GPUs, or a very expensive Nvidia Pro 6000 Blackwell.

ru552 8 hours ago|||
You won't like it, but the answer is Apple. The reason is the unified memory. The GPU can access all 32 GB, 64 GB, 128 GB, 256 GB, etc. of RAM.

An easy way (napkin math) to know if you can run a model based on its parameter size is to treat the parameter count as the number of GB that needs to fit in GPU RAM: a 35B model needs at least 35 GB of GPU RAM. This is a very simplified way of looking at it and YES, someone is going to say you can offload to CPU, but no one wants to wait 5 seconds for 1 token.
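That napkin math, written out for common quantization levels (weights only; KV cache and runtime overhead are extra):

```python
# Napkin math: memory needed just for the weights, ignoring the
# KV cache, context, and runtime overhead.
def est_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB at a given quantization."""
    return params_billion * bits_per_weight / 8

# A 35B model at a few common quantization levels:
for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{name}: ~{est_vram_gb(35, bits):.1f} GB")
```

Note that "parameter count as GB" corresponds to 8 bits per weight; FP16 doubles the figure and Q4 roughly halves it.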

samtheprogram 8 hours ago|||
That estimate doesn't account for context, which is very important for tool use and coding.

I used this napkin math for image generation, since the context (prompts) were so small, but I think it's misleading at best for most uses.
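To put a rough number on the context cost: the KV cache stores keys and values for every layer and KV head per token, so it grows linearly with context length. A sketch of the formula, with made-up architecture numbers (not this model's actual config):

```python
# KV cache size for a transformer. The 2x accounts for storing both
# keys and values; bytes_per_elem is 2 for FP16 cache entries.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# e.g. a hypothetical 48-layer model, 8 KV heads of dim 128, 128k context:
print(f"~{kv_cache_gb(48, 8, 128, 131072):.1f} GB of KV cache")
```

For long coding sessions the cache can rival the weights themselves, which is why the parameter-count rule undercounts.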

sliken 7 hours ago|||
> You won't like it, but the answer is Apple.

Or strix halo.

Seems rather oversimplified.

Quants come at different levels; for Qwen3.6 they range from 10 GB to 38.5 GB.

Qwen supports a context length of 262,144 natively, which can be extended to 1,010,000, and of course the context length can always be shortened.

Just use one of the calculators and you'll get a much more useful number.

3836293648 2 hours ago||
What Strix Halo system has unified memory? A quick google says it's just a static vram allocation in ram, not that CPU and GPU can actively share memory at runtime
terramex 8 hours ago|||
I run Gemma 4 26B-A4B with 256k context (the maximum) on a Radeon 9070 XT (16 GB VRAM) + 64 GB RAM with partial GPU offload (using the recommended LM Studio settings) at a very reasonable 35 tokens per second. This model is similar in size, so I expect similar performance.
mildred593 8 hours ago|||
I can run this on an AMD Framework laptop: a Ryzen 7 (I don't have Ryzen AI, just a Ryzen 7 7840U) with 32+48 GB DDR. The Ryzen unified memory is enough; I get at least 26 GB of VRAM.

Fedora 43 and LM Studio with Vulkan llama.cpp

rhdunn 8 hours ago|||
The Q5 quantization (26.6 GB) should easily run on a 32 GB 5090. The Q4 (22.4 GB) should fit on a 24 GB 4090, but you may need to drop down to Q3 (16.8 GB) when factoring in the context.

You can also run those on smaller cards by configuring the number of layers on the GPU. That should allow you to run the Q4/Q5 version on a 4090, or on older cards.

You could also run it entirely on the CPU/in RAM if you have 32GB (or ideally 64GB) of RAM.

The more you run in RAM the slower the inference.
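A sketch of what that layer split looks like with llama.cpp's server (the GGUF filename here is a placeholder; `-ngl` is llama.cpp's flag for the number of layers to place on the GPU):

```shell
# -m:   path to the quantized model (placeholder filename)
# -ngl: layers to keep on the GPU; lower it until the model fits
# -c:   context size, which also costs VRAM via the KV cache
./llama-server -m qwen3.6-35b-a3b-Q4_K_M.gguf -ngl 24 -c 32768
```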

canpan 8 hours ago|||
Any good gaming PC can run the 35B-A3B model with llama.cpp and RAM offloading; a high-end gaming PC can run it at higher speeds. For the 122B you need a lot of memory, which is expensive now, and it will be much slower since you need to use mostly system RAM.
bigyabai 8 hours ago||
Seconding this. You can get A3B/A4B models to run at 10+ tok/sec on a modern 6 or 8 GB GPU with 32k context if you optimize things well. The cheapest way to run this model at larger contexts is probably a 12 GB RTX 3060.
bildung 8 hours ago||
I currently run Qwen3.5-122B (Q4) on a Strix Halo (Bosgame M5) and am pretty happy with it. Obviously much slower than hosted models. I get ~20 t/s with empty context and am down to about 14 t/s with 100k of context filled.

No tuning at all, just apt install rocm and rebuilding llama.cpp every week or so.

999900000999 7 hours ago||
Looking to move off Ollama on openSUSE Tumbleweed.

Should I use brew to install llama.cpp, or zypper to install the Tumbleweed package?

badsectoracula 4 hours ago||
You can compile it from source. All you need to do is clone the repository, run `cmake -B build -DGGML_VULKAN=1` (add other backends if you want) followed by `cmake --build build --config Release`, and you get all the llama tools in `build/bin` (including `llama-server`, which provides a web-based interface). There is a `docs/build.md` with more detailed info, especially if you need another backend. At least on my RX 7900 XTX I see no difference in performance between Vulkan and ROCm, and the former is much more stable and compatible. I tried ROCm for a bit thinking it'd be much faster, but it only ended up being much more annoying, as some models would OOM on it while they worked on Vulkan. If you're on NVIDIA hardware all this may sound quaint though :-P
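Collected into copy-pasteable form (repository URL assumed to be the current official llama.cpp home):

```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=1        # add other backends here if needed
cmake --build build --config Release
# the tools end up in build/bin/, including build/bin/llama-server
```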
999900000999 1 hour ago||
Cool, I assume this is how adults use LLMs.

I'm on an NVIDIA GPU, but I want to be able to combine VRAM with system memory.

stratos123 2 hours ago|||
Why not just download the binaries from github releases?
rexreed 6 hours ago||
Why are you looking to move off Ollama? Just curious because I'm using Ollama and the cloud models (Kimi 2.5 and Minimax 2.7) which I'm having lots of good success with.
999900000999 5 hours ago||
Ollama commingles online and local models, which defeats the purpose for me.
rexreed 2 hours ago||
You can disable all cloud models in your Ollama settings if you just want all local. You don't have to use the cloud models unless you explicitly request them.
ghc 9 hours ago||
How does this compare to gpt-oss-120b? It seems weird to leave it out.
vyr 8 hours ago||
GPT-OSS 120B (really 117B-A5.1B) is a lot bigger. A better comparison would be to the 20B (21B-A3.6B).
7734128 6 hours ago||
OSS-120 is too old to be relevant, and four times the size.
zengid 3 hours ago||
Any tips for running it locally within an agent harness? Maybe using pi or OpenCode?
stratos123 2 hours ago|
It pretty much just works. Run the unsloth quant in llama.cpp and hook it up to pi. There are a bunch of minor annoyances, like no support for thinking effort. It also defaults to “interleaved thinking” (thinking blocks get stripped from context); set `"chat_template_kwargs": {"preserve_thinking": true}` if you interrupt the model often and don't want it to forget what it was thinking.
solomatov 6 hours ago|
Has anyone tried both this and Gemma 4? Does it feel better than Gemma 4?