
Posted by cmitsakis 6 hours ago

Qwen3.6-35B-A3B: Agentic coding power, now open to all (qwen.ai)
716 points | 338 comments
simonw 2 hours ago|
I've been running this on my laptop with the Unsloth 20.9GB GGUF in LM Studio: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/mai...

It drew a better pelican riding a bicycle than Opus 4.7 did! https://simonwillison.net/2026/Apr/16/qwen-beats-opus/

kelnos 1 minute ago||
[delayed]
jubilanti 2 hours ago|||
I wonder when "pelican riding a bicycle" will become useless as an evaluation task. The point was that it was something weird nobody had ever really thought about before - not in the benchmarks, not even something a team would run internally. But now I'd bet this is internally one of the new Shirley Cards.
abustamam 1 hour ago|||
Simon has an article on this

https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

amelius 26 minutes ago||||
Yeah try it with something else, or e.g. add a tiger to the back seat.
rafaelmn 1 hour ago||||
I mean, look at the result where he asked about a unicycle - the model couldn't even keep the spokes inside the wheel. Keeping them there would be rudimentary if it had "learned" what it means to draw a bicycle wheel and could transfer that to a unicycle.
duzer65657 57 minutes ago||
It's the frame that's surprisingly - and consistently - wrong. You'd think two triangles would be pretty easy to repro; once you get that, the rest is easy. It's not like he's asking "draw a pelican on a four-bar linkage suspension mountainbike..."
Reddit_MLP2 26 minutes ago||
This is older, but even humans don't have a great concept of how a bicycle works... https://twistedsifter.com/2016/04/artist-asks-people-to-draw...
MagicMoonlight 1 hour ago|||
They’ll hardcode it in 4.8, just like they do when they need to “fix” other issues
rdslw 1 hour ago|||
Interesting, I just tried this very model (unsloth, Q8, so in theory more capable than Simon's Q4) and got these three "pelicans" - definitely NOT Opus quality. LM Studio, via Simon's llm, but not apple/mlx. Of course the same short prompt.

Simon, any ideas?

https://ibb.co/gFvwzf7M

https://ibb.co/dYHRC3y

https://ibb.co/FLc6kggm (tried here temperature 0.7 instead of pure defaults)

bertili 2 hours ago|||
It's fascinating that a $999 Mac Mini (M4 32GB), at roughly the same wattage as a human brain, gets us this far.
cyclopeanutopia 2 hours ago|||
But the fact that you also gave a win to Qwen on the flamingo is pretty outrageous! :)

The right one looks much better, plus adding sunglasses without prompting is not that great. Hopefully it won't add some backdoor to the generated code without asking. ;)

simonw 2 hours ago||
I love how the Chinese models often have an unprompted predilection to add flair.

GLM-5.1 added a sparkling earring to a north Virginia opossum the other day and I was delighted: https://simonwillison.net/2026/Apr/7/glm-51/

culi 1 hour ago|||
the more I look at these images the more convinced I become that world models are the major missing piece and that these really are ultimately just stochastic sentence machines. Maybe Chomsky was right
prirun 1 hour ago|||
The flamingo on Qwen's unicycle is sitting on the tire, not the seat. That wins because of sunglasses?
evilduck 46 minutes ago||
Can a benchmark meant as a joke not use a fun interpretation of results? The Qwen result has far better style points. Fun sunglasses, a shadow, a better ground, a better sky, clouds, flowers, etc.

If we want to get nitty gritty about the details of a joke, a flamingo probably couldn't physically sit on a unicycle's seat and also reach the pedals anyways.

MeteorMarc 1 hour ago|||
Interesting, Qwen has the pelican riding in the left lane. Coincidence, or does it have something to do with the workers providing the RL data?
rubiquity 1 hour ago||
Could be on a bike path where bikes are on the left and pedestrians to the right.
jamwise 2 hours ago|||
I've had some really gnarly SVGs from Claude. Here's what I got after many iterations trying to draw a hand: https://imgur.com/a/X4Jqius
giantg2 1 hour ago||
Probably because all the training material of humans drawing hands are garbage haha.
danielhanchen 2 hours ago|||
Oh that is pretty good! And the SVG one!
slekker 2 hours ago||
How does it do with the "car wash" benchmark? :D
bertili 6 hours ago||
A relief to see the Qwen team still publishing open weights, after the kneecapping [1] and departures of Junyang Lin and others [2]!

[1] https://news.ycombinator.com/item?id=47246746 [2] https://news.ycombinator.com/item?id=47249343

zozbot234 5 hours ago||
This is just one model in the Qwen 3.6 series. They will most likely release the other small sizes (not much sense in keeping them proprietary) and perhaps their 122A10B size also, but the flagship 397A17B size seems to have been excluded.
bertili 5 hours ago|||
Is there any source for these claims?
zozbot234 5 hours ago|||
https://x.com/ChujieZheng/status/2039909917323383036 is the pre-release poll they did. ~397B was not a listed choice and plenty of people took it as a signal that it might not be up for release.
anonova 5 hours ago|||
A Qwen research member had a poll on X asking what Qwen 3.6 sizes people wanted to see:

https://x.com/ChujieZheng/status/2039909917323383036

Likely to drive engagement, but the poll excluded the large model size.

kylehotchkiss 3 hours ago||||
How many people/hackernews can run a 397b param model at home? Probably like 20-30.
jubilanti 2 hours ago|||
You can rent a cloud H200 with 140GB VRAM in a server with 256GB system ram for $3-4/hr.
ydj 32 minutes ago||||
Running the mxfp4 unsloth quant of qwen3.5-397b-a17b, I get 40 t/s prefill, 20 t/s decode.

AMD Threadripper Pro 9965WX, 256 GB DDR5-5600, RTX 4090.

bitbckt 1 hour ago||||
I'm running it on dual DGX Sparks.
kridsdale3 2 hours ago||||
I can (barely, but sustainably) run Q3.5 397B on my Mac Studio with 256GB unified. It cost $10,000 but that's well within reach for most people who are here, I expect.
qlm 2 hours ago|||
Hacker News moment
toxik 2 hours ago||||
$10k is well outside my budget for frivolous computer purchases.
bdangubic 2 hours ago||
99.97% of HN users are nodding… :)
hparadiz 35 minutes ago||
There are so many good local uses for these models that I fully expect a standard workstation 10 years from now to start at 128GB of RAM and include at least one workstation-class inference device.
SlavikCA 2 hours ago||||
I'm running it on my Intel Xeon W5 with 256GB of DDR5 and 72GB of Nvidia VRAM. Paid $7-8k for this system; it would probably cost twice as much now.

Using UD-IQ4_NL quants.

Getting 13 t/s. Using it with thinking disabled.

rwmj 2 hours ago|||
For some reason you were being downvoted but I enjoy hearing how people are running open weights models at home (NOT in the cloud), and what kind of hardware they need, even if it's out of my price range.
r-w 3 hours ago||||
OpenRouter.
mistercheese 2 hours ago|||
Yeah, I think there are benefits to third-party providers being able to run the large models, with stronger guarantees about ZDR and knowing where they are hosted. So open weights for even the large models we can't personally serve on our laptops are still useful.
parsimo2010 2 hours ago|||
If you're running it from OpenRouter, you might as well use Qwen3.6 Plus; you don't need to be picky about a particular model size of 3.6. If you just want the 397b version to save money, pick a cheaper model like M2.7.
stavros 2 hours ago|||
It doesn't matter how many can run it now, it's about freedom. Having a large open weights model available allows you to do things you can't do with closed models.
stingraycharles 5 hours ago|||
397A17B = 397B total weights, 17B per expert?
zackangelo 5 hours ago|||
17b per token. So when you’re generating a single stream of text (“decoding”) 17b parameters are active.

If you’re decoding multiple streams, it will be 17b per stream (some tokens will use the same expert, so there is some overlap).

When the model is ingesting the prompt (“prefilling”) it’s looking at many tokens at once, so the number of active parameters will be larger.

wongarsu 5 hours ago||||
397B params, 17B activated at the same time

Those 17B might be split among multiple experts that are activated simultaneously

littlestymaar 5 hours ago|||
That's not how it works. Many people get confused by the "expert" naming, when in reality the key part of the original name, "sparse mixture of experts", is sparse.

Experts are just chunks of each layer's MLP that are only partially activated by each token. There are thousands of "experts" in such a model (for Qwen3-30BA3, it was 48 layers x 128 "experts" per layer with only 8 active per token).
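For anyone who wants to see what "8 active out of 128 per layer" means mechanically, here's a minimal Python sketch - the shapes, names and softmax-over-top-k routing are illustrative, not Qwen's actual implementation:

    import numpy as np

    def moe_layer(x, router_w, experts, top_k=8):
        """x: (d,) activation for one token; experts: list of 128 small MLPs (callables)."""
        logits = router_w @ x                          # one routing score per expert
        top = np.argsort(logits)[-top_k:]              # keep only the 8 best-scoring experts
        w = np.exp(logits[top] - logits[top].max())
        w /= w.sum()                                   # softmax over the chosen experts only
        # Only these 8 experts' weights are read for this token; the other 120
        # stay untouched - that's where the small "active parameter" count comes from.
        return sum(wi * experts[i](x) for wi, i in zip(w, top))

    d = 16
    rng = np.random.default_rng(0)
    experts = [(lambda x, W=rng.normal(size=(d, d)): W @ x) for _ in range(128)]
    router_w = rng.normal(size=(128, d))
    print(moe_layer(rng.normal(size=d), router_w, experts).shape)  # (16,)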

guitcastro 6 hours ago||
I really wish they released qwen-image 2.0 as open weights.
homebrewer 6 hours ago||
Already quantized/converted into a sane format by Unsloth:

https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

Aurornis 4 hours ago||
Unsloth is great for uploading quants quickly to experiment with, but everyone should know that they almost always revise their quants after testing.

If you download the release day quants with a tool that doesn’t automatically check HF for new versions you should check back again in a week to look for updated versions.

Sometimes the launch-day quantizations have major problems, which leads to early adopters dismissing useful models. You have to wait for everyone to test and fix bugs before giving a model a real evaluation.

danielhanchen 4 hours ago|||
We re-uploaded Gemma4 4 times - 3 times were due to 20 llama.cpp bug fixes, some of which we helped solve as well. The 4th was an official Gemma chat template improvement from Google themselves, so these were out of our hands. All providers had to re-fix their uploads, so not just us.

For MiniMax 2.7 - there were NaNs, but it wasn't just ours - all quant providers had it - we identified that 38% of bartowski's quants had NaNs; ours was 22%. We identified a fix, and have already fixed ours, see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax.... Bartowski has not, but is working on it. We always share our investigations.

For Qwen3.5 - we shared our 7TB of research artifacts showing which layers not to quantize - all providers' quants were suboptimal, not broken - the ssm_out and ssm_* tensors were the issue - we're now the best in terms of KLD and disk space - see https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwe...

On other fixes, we also fixed bugs in many OSS models like Gemma 1, Gemma 3, Llama chat template fixes, Mistral, and many more.

It might seem these issues are due to us, but it's because we publicize them and tell people to update. 95% of them are not related to us, but as good open source stewards, we should update everyone.

evilduck 3 hours ago|||
I just wanted to express gratitude to you guys, you do great work. However, it is a little annoying to have to redownload big models, and keeping up with the AI news and community sentiment is a full-time job. I wish there was some mechanism somewhere (on your site or Huggingface or something) for displaying feedback or confidence in a model being "ready for general use" before kicking off 100+ GB model downloads.
danielhanchen 3 hours ago|||
Hey thanks - yes agreed - for now we do:

1. Split metadata into shard 0 (~10MB) for huge models, so chat template fixes only require re-downloading that shard - however, sometimes fixes cause a recalculation of the imatrix, which means all quants have to be re-made

2. Add HF discussion posts on each model talking about what changed, and on our Reddit and Twitter

3. Hugging Face XET now has de-duplicated downloading of shards, so generally redownloading 100GB models should be much faster - it splits the 100GB into small chunks and hashes them, and only downloads the chunks which have changed

evilduck 44 minutes ago||
Ah thanks, I wasn't aware of #3, that should be a huge boon.
CamperBob2 2 hours ago|||
Best policy is to just wait a couple of weeks after a major model is released. It's frustrating to have to re-download tens or hundreds of GB every few days, but the quant producers have no choice but to release early and often if they want to maintain their reputation.

Ideally the labs releasing the open models would work with Unsloth and the llama.cpp maintainers in advance to work out the bugs up front. That does sometimes happen, but not always.

danielhanchen 2 hours ago||
Yep agreed at least 1 week is a good idea :)

We do get early access to nearly all models, and we do find the most pressing issues sometimes. But sadly some issues are really hard to find and diagnose :(

magicalhippo 53 minutes ago||||
Appreciate the work of your team very much.

Though chat templates seem like they need a better solution. So many issues, seems quite fragile.

sowbug 4 hours ago||||
Please publish sha256sums of the merged GGUFs in the model descriptions. Otherwise it's hard to tell if the version we have is the latest.
danielhanchen 4 hours ago|||
Yep, we can do that - we'll probably add a table. In general we post in the discussions of model pages - for eg https://huggingface.co/unsloth/MiniMax-M2.7-GGUF/discussions...

HF also provides SHA256 for eg https://huggingface.co/unsloth/MiniMax-M2.7-GGUF/blob/main/U... is 92986e39a0c0b5f12c2c9b6a811dad59e3317caaf1b7ad5c7f0d7d12abc4a6e8

But agreed it's probs better to place them in a table

sowbug 4 hours ago||
Thanks! I know about HF's chunk checksums, but HF doesn't publish (or possibly even know) the merged checksums.
danielhanchen 3 hours ago||
Oh, for multi-file models? Hmm, ok, let me check that out
zargon 2 hours ago|||
Why do you merge the GGUFs? The 50 GB files are more manageable (IMO) and you can verify checksums as you say.
sowbug 1 hour ago||
I admit it's a habit that's probably weeks out of date. Earlier engines barfed on split GGUFs, but support is a lot better now. Frontends didn't always infer the model name correctly from the first chunk's filename, but once llama.cpp added the models.ini feature, that objection went away.

The purist in me feels the 50GB chunks are a temporary artifact of Hugging Face's uploading requirements, and the authoritative model file should be the merged one. I am unable to articulate any practical reason why this matters.
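If the merged hashes do get published, verifying locally is straightforward; a minimal Python sketch (the filename is just a placeholder):

    import hashlib

    def sha256_of(path: str, chunk_bytes: int = 64 * 1024 * 1024) -> str:
        """Stream the file in chunks so a 20-70 GB GGUF doesn't need to fit in RAM."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_bytes), b""):
                h.update(chunk)
        return h.hexdigest()

    # compare against whatever checksum the uploader lists for the merged GGUF
    print(sha256_of("Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf"))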

dist-epoch 4 hours ago|||
What do you think about creating a tool which can just patch the template embedded in the .gguf file instead of forcing a re-download? The whole file hash can be checked afterwards.
danielhanchen 3 hours ago||
Sadly it's not always chat template fixes :( But yes we now split the first shard as pure metadata (10MB) for huge models - these include the chat template etc - so you only need to download that.

For serious fixes, sadly we have to re-compute imatrix since the activation patterns have changed - this sadly makes the entire quant change a lot, hence you have to re-download :(

embedding-shape 4 hours ago||||
Not to mention that almost every model release has some (at least minor) issue in the prompt template and/or the runtime itself, so even if they (not unsloth specifically, in general) claim "Day 0 support", do pay extra attention to actual quality, as it takes a week or two before issues have been hammered out.
danielhanchen 4 hours ago||
Yes this is fair - we try our best to communicate issues - I think we're mostly the only ones doing the communication that model A or B has been fixed etc.

We try our best as model distributors to fix them on day 0 or 1, but 95% of issues aren't our issues - as you mentioned it's the chat template or runtime etc

i5heu 45 minutes ago||||
Thank you very much for this comment! I was not aware of that.
fuddle 3 hours ago||||
I don't understand why the open-source model providers don't also publish the quantized versions.
danielhanchen 3 hours ago||
They sometimes do! Qwen, Google etc do them!
canarias_mate 1 hour ago|||
[dead]
torginus 2 hours ago|||
Why doesn't Qwen itself release the quantized model? My impression is that quantization is a highly nontrivial process that can degrade the model in non-obvious ways, so it's best handled by the people who actually built the model; otherwise the results might be disappointing.

Users of the quantized model might even be led to think that the model sucks because the quantized version does.

bityard 2 hours ago|||
Model developers release open-weight models for all sorts of reasons, but the most common reason is to share their work with the greater AI research community. Sure, they might allow or even encourage personal and commercial use of the model, but they don't necessarily want to be responsible for end-user support.

An imperfect analogy might be the Linux kernel. Linus publishes official releases as a tagged source tree but most people who use Linux run a kernel that has been tweaked, built, and packaged by someone else.

That said, models often DO come from the factory in multiple quants. Here's the FP8 quant for Qwen3.6 for example: https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8

Unsloth and other organizations produce a wider variety of quants than upstream to fit a wider variety of hardware, and so end users can make their own size/quality trade-offs as needed.

halJordan 1 hour ago|||
Quantization is an extraordinarily trivial process. Especially if you're doing it with llama.cpp (which unsloth obviously does).

Qwen did release an fp8 version, which is a quantized version.

palmotea 5 hours ago|||
How much VRAM does it need? I haven't run a local model yet, but I did recently pick up a 16GB GPU, before they were discontinued.
WithinReason 5 hours ago|||
It's on the page:

  Precision  Quantization Tag File Size
  1-bit      UD-IQ1_M         10 GB
  2-bit      UD-IQ2_XXS       10.8 GB
             UD-Q2_K_XL       12.3 GB
  3-bit      UD-IQ3_XXS       13.2 GB
             UD-Q3_K_XL       16.8 GB
  4-bit      UD-IQ4_XS        17.7 GB
             UD-Q4_K_XL       22.4 GB
  5-bit      UD-Q5_K_XL       26.6 GB
  16-bit     BF16             69.4 GB
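A rough way to sanity-check those numbers (my own back-of-the-envelope, not from the model card): file size is roughly total parameters times average bits per weight, and the dynamic quants spend a few extra bits on the important layers:

    def gguf_size_gb(total_params_billion: float, avg_bits_per_weight: float) -> float:
        # bytes = params * bits / 8; "GB" here is 10^9 bytes
        return total_params_billion * 1e9 * avg_bits_per_weight / 8 / 1e9

    print(round(gguf_size_gb(35, 4.5), 1))   # ~19.7 GB, between the two 4-bit rows above
    print(round(gguf_size_gb(35, 16.0), 1))  # 70.0 GB, close to the 69.4 GB BF16 row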
Aurornis 4 hours ago|||
Additional VRAM is needed for context.

This model is a MoE model with only 3B active parameters per token, which works well with partial CPU offload. So in practice you can run the -A(N)B models on systems that have a little less VRAM than you'd otherwise need. The more you offload to the CPU, the slower it becomes, though.

Glemllksdf 4 hours ago||
Isn't that some kind of gambling, if you offload random experts onto the CPU?

Or is it only layers? But that would affect all experts.

dragonwriter 4 hours ago||
Pretty sure all partial offload systems I’ve seen work by layers, but there might be something else out there.
est 3 hours ago||||
I really want to know what M, K, XL, XS mean in this context and how to choose.

I searched all the unsloth docs and there seems to be no explanation at all.

tredre3 10 minutes ago|||
Q4_K is a type of quantization. It means that all weights will be at a minimum of 4 bits, using the K method.

But if you're willing to give more bits to only certain important weights, you get to preserve a lot more quality for not that much more space.

The S/M/L/XL is what tells you how many tensors get to use more bits.

The difference between S and M is generally noticeable (on benchmarks). The difference between M and L/XL is less so, let alone in real use (ymmv).

Here's an example of the contents of Q4_K_S, Q4_K_M, and Q4_K_L:

    S
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  136 tensors
    llama_model_loader: - type q5_0:   43 tensors
    llama_model_loader: - type q5_1:   17 tensors
    llama_model_loader: - type q6_K:   15 tensors
    llama_model_loader: - type q8_0:   55 tensors
    M
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  106 tensors
    llama_model_loader: - type q5_0:   32 tensors
    llama_model_loader: - type q5_K:   30 tensors
    llama_model_loader: - type q6_K:   15 tensors
    llama_model_loader: - type q8_0:   83 tensors
    L
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  106 tensors
    llama_model_loader: - type q5_0:   32 tensors
    llama_model_loader: - type q5_K:   30 tensors
    llama_model_loader: - type q6_K:   14 tensors
    llama_model_loader: - type q8_0:   84 tensors
huydotnet 3 hours ago|||
They are different quantization types, you can read more here https://huggingface.co/docs/hub/gguf#quantization-types
JKCalhoun 4 hours ago||||
"16-bit BF16 69.4 GB"

Is that (BF16) a 16-bit float?

mtklein 3 hours ago|||
Yes, it's a "Brain float", basically an ordinary 32-bit float with the low 16 mantissa bits cut off. Exact same range as fp32, lower precision, and not the same as the other fp16, which has less exponent and more mantissa.
Gracana 3 hours ago||||
https://en.wikipedia.org/wiki/Bfloat16_floating-point_format

Yes, however it’s a different format from standard fp16, it trades precision for greater dynamic range.

WithinReason 3 hours ago|||
yes, it has 8 exponent bits like float32, instead of 5 like float16
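A quick way to see the truncation in plain Python - this uses simple truncation of the low 16 bits, whereas real BF16 conversion typically rounds to nearest even, so treat it as an illustration:

    import struct

    def fp32_to_bf16_bits(x: float) -> int:
        """Reinterpret as 32 bits, keep the top 16: sign + 8 exponent + 7 mantissa bits."""
        return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

    def bf16_bits_to_fp32(b: int) -> float:
        """Pad the low 16 bits with zeros and reinterpret as a float again."""
        return struct.unpack("<f", struct.pack("<I", b << 16))[0]

    x = 3.14159265
    print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))  # 3.140625: same range as fp32, less precision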
palmotea 5 hours ago|||
Thanks! I'd scanned the main content but I'd been blind to the sidebar on the far right.
tommy_axle 4 hours ago||||
Pick a decent quant (4-6KM) then use llama-fit-params and try it yourself to see if it's giving you what you need.
gunalx 1 hour ago||
I have found llama-fit sometimes just selects a way too conservative load, with VRAM to spare.
zozbot234 5 hours ago||||
Should run just fine with CPU-MoE and mmap, but inference might be a bit slow if you have little RAM.
Ladioss 4 hours ago||||
You can run a 25-30B model easily if you use Q3 or Q4 quants and llama-server with a pretty long list of options.
trvz 5 hours ago|||
If you have to ask, then your GPU is too small.

With 16 GB you'll only be able to run a very compressed variant with noticeable quality loss.

coder543 5 hours ago|||
Not true. With a MoE, you can offload quite a bit of the model to the CPU without losing a ton of performance. 16GB should be fine to run the 4-bit (or larger) model at decent speeds. The --n-cpu-moe parameter is the key one on llama-server, if you're not just using -fit on.
boppo1 1 hour ago||
I've been way out of the local game for a while now, what's the best way to run models for a fairly technical user? I was using llama.cpp in the command line before and using bash files for prompts.
palmotea 5 hours ago||||
> If you have to ask then your GPU is too small.

What's the minimum memory you need to run a decent model? Is it pretty much only doable by people running Macs with unified memory?

giobox 5 hours ago|||
It's worth noting now there are other machines than just Apple that combine a powerful SoC with a large pool of unified memory for local AI use:

> https://www.dell.com/en-us/shop/cty/pdp/spd/dell-pro-max-fcm...

> https://marketplace.nvidia.com/en-us/enterprise/personal-ai-...

> https://frame.work/products/desktop-diy-amd-aimax300/configu...

etc.

But yes, a modern SoC-style system with large unified memory pool is still one of the best ways to do it.

TechSquidTV 5 hours ago||||
My Mac Studio with 96GB of RAM is maybe just at the low end of passable. It's actually extremely good for local image generation; I could fairly comfortably replace something like Nano Banana on my machine.

But I don't need Nano Banana very much, I need code. While it can generate code, there's no way I would ever opt to use a local model on my machine for that. It makes so much more sense to spend $100 on Codex; it's genuinely not worth discussing.

For non-thinking tasks it would be a bit slower, but a viable alternative for sure.

slopinthebag 3 hours ago||
You just need to adjust your workflow to use the smaller models for coding. It's primarily just a case of holding them wrong if you end up with worse outputs.
jchw 5 hours ago||||
32 GiB of VRAM is possible to acquire for less than $1000 if you go for the Arc Pro B70. I have two of them. The tokens/sec is nowhere near AMD or NVIDIA high end, but it's unexpectedly kind of decent to use. (I probably need to figure out vLLM though, as it doesn't seem like llama.cpp is able to do them justice, even seemingly with split mode = row. But still, 30 t/s on Gemma 4 (the 26B MoE, not dense) is pretty usable, and you can fit a full 256k context.)

When I get home today I totally look forward to trying the unsloth variants of this out (assuming I can get it working in anything.) I expect due to the limited active parameter count it should perform very well. It's obviously going to be a long time before you can run current frontier quality models at home for less than the price of a car, but it does seem like it is bound to happen. (As long as we don't allow general purpose computers to die or become inaccessible. Surely...)

zozbot234 5 hours ago|||
New versions of llama.cpp have experimental split-tensor parallelism, but it really only helps with slow compute and a very fast interconnect, which doesn't describe many consumer-grade systems. For most users, pipeline parallelism will be their best bet for making use of multi-GPU setups.
jchw 4 hours ago||
Yeah, I was doing split tensor and it seemed like a wash. The Arc B70s are not huge on compute.

Right now I'm only able to run them in PCI-e 5.0 x8 which might not be sufficient. But, a cheap older Xeon or TR seems silly since PCI-e 4.0 x16 isn't theoretically more bandwidth than PCI-e 5.0 x8. So it seems like if that is really still bottlenecked, I'll just have to bite the bullet and set up a modern HEDT build. With RAM prices... I am not sure there is a world where it could ever be worth it. At that point, seems like you may as well go for an obscenely priced NVIDIA or AMD datacenter card instead and retrofit it with consumer friendly thermal solutions. So... I'm definitely a bit conflicted.

I do like the Arc Pro B70 so far. It's not a performance monster, but it's quiet and relatively low power, and I haven't run into any instability. (The AMDGPU drivers have made amazing strides, but... the stability is not legendary. :)

I'll have to do a bit of analysis and make sure there really is an interconnect bottleneck first, versus a PEBKAC. Could be dropping more lanes than expected for one reason or another too.

zozbot234 4 hours ago||
You could fit your HEDT with minimum RAM and a combination of Optane storage (for swapping system RAM with minimum wear) and fast NAND (for offloading large read-only data). If you have abundant physical PCIe slots it ought to be feasible.
dist-epoch 4 hours ago|||
NVIDIA 5070 Ti can run Gemma 4 26B at 4-bit at 120 tk/s.

The Arc Pro B70 seems unexpectedly slow? Or are you using 8-bit/16-bit quants?

jchw 3 hours ago||
Unfortunately it really is running this slow with Llama.cpp, but of course that's with Vulkan mode. The VRAM capacity is definitely where it shines, rather than compute power. I am pretty sure that this isn't really optimal use of the cards, especially since I believe we should be able to get decent, if still sublinear, scaling with multiple cards. I am not really a machine learning expert, I'm curious to see if I can manage to trace down some performance issues. (I've already seen a couple issues get squashed since I first started testing this.)

I've heard that vLLM performs much better, scaling particularly better in the multi GPU case. The 4x B70 setup may actually be decent for the money given that, but probably worth waiting on it to see how the situation progresses rather than buying on a promise of potential.

A cursory Google search does seem to indicate that in my particular case interconnect bandwidth shouldn't actually be a constraint, so I doubt tensor level parallelism is working as expected.

bfivyvysj 5 hours ago||||
A bit like asking how long is a piece of string.
latentsea 5 hours ago|||
It's twice as long as from one end to the middle.
palmotea 5 hours ago|||
More like "about how long of a string do I need to run between two houses in the densest residential neighborhood of single-family homes in the US?"
layer8 5 hours ago||||
It’s also doable with AMD Strix Halo.
angoragoats 5 hours ago||||
Macs with unified memory are economical in terms of $/GB of video memory, and they match an optimized/home built GPU setup in efficiency (W/token), but they are slow in terms of absolute performance.

With this model, since the number of active parameters is low, I would think that you would be fine running it on your 16GB card, as long as you have, say 32GB of regular system memory. Temper your expectations about speed with this setup, as your system memory and CPU are multiple times slower than the GPU, so when layers spill over you will slow down.

To avoid this, there's no need to buy a Mac -- a second 16GB GPU would do the trick just fine, and the combined dual GPU setup will likely be faster than a cheap mac like a Mac mini. Pay attention to your PCIe slots, but as long as you have at least an x4 slot for the second GPU, you'll be fine (LLM inference doesn't need x8 or x16).

utilize1808 5 hours ago||||
Obviously going to depend on your definition of "decent". My impression so far is that you will need between 90GB to 100GB of memory to run medium sized (31B dense or ~110B MoE) models with some quantization enabled.
cjbgkagh 5 hours ago||
I’m running Gemma4 31B (Q8) on my 2 4090s (48GB) with no problem.
Glemllksdf 4 hours ago||
I have the same setup, but I tried paperclip ai with it, and it seems to me that either I'm unable to set it up properly or multiple agents struggle with this setup. Especially as it seems that paperclip ai and opencode (used for the connection) blow up the context to 20-30k.

Any tips around your setup running this?

I use lmstudio with default settings and prioritization instead of split.

cjbgkagh 3 hours ago||
I asked AI for help setting it up. I use 128k context for 31B and 256k context for 26B4A. Ollama worked out of the box for me but I wanted more control with llama.cpp.

My command for llama-server:

llama-server -m /models/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf -ngl 99 -sm layer -ts 10,12 --jinja --flash-attn on --cont-batching -np 1 -c 262144 -b 4096 -ub 512 -ctk q8_0 -ctv q8_0 --host 0.0.0.0 --port 8080 --timeout 18000

littlestymaar 5 hours ago|||
No, GP is excessively restrictive. Llama.cpp supports RAM offloading out of the box.

It's going to be slower than if you put everything on your GPU but it would work.

And if it's too slow for your taste you can try the quantized version (some Q3 variant should fit) and see how well it works for you.

gunalx 1 hour ago||||
Running Q3 XXS with either full or quantized context on a 16GB GPU, and it still has pretty decent quality and fits fine with up to 64k context.
FusionX 5 hours ago|||
Aren't 4-bit models decent? Since this is an MoE model, I'm assuming it should have a respectable tk/s, similar to previous MoE models.
sander1095 4 hours ago|||
I sense that I don't really understand enough of your comment to know why this is important. I hope you can explain some things to me:

- Why is Qwen's default "quantization" setup "bad"?
- Who is Unsloth?
- Why is their format better? What gains does a better format give? What are the downsides of a bad format?
- What is quantization? Granted, I can look this up myself, but I thought I'd ask for the full picture for other readers.

danielhanchen 4 hours ago|||
Oh hey - we're actually the 4th largest distributor of OSS AI models in GB downloads - see https://huggingface.co/unsloth

https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs might be helpful. You might have heard of the 1-bit dynamic DeepSeek quants (we did those) - not all layers can be 1-bit - important ones are kept in 8-bit or 16-bit, and we show it still works well.

dist-epoch 4 hours ago||||
The default Qwen "quantization" is not "bad", it's "large".

Unsloth releases lower-quality versions of the model (Qwen in this case). Think about taking a 95% quality JPEG and converting it to a 40% quality JPEG.

Models are quantized to lower quality/size so they can run on cheaper/consumer GPUs.

est 3 hours ago|||
hey, you can do a bit of research yourself and tell us your results!
halJordan 1 hour ago|||
There's absolutely nothing wrong or insane with a safetensors file. It might be less convenient than a single-file GGUF, but that's just laziness, not insanity.
txtsd 5 hours ago|||
So I can use this in claude code with `ollama run claude`?
Ladioss 4 hours ago|||
More like `ollama launch claude --model qwen3.6:latest`

Also you need to check your context size: Ollama defaults to 4K if you have <24 GB of VRAM, and you need 64K minimum if you want claude to be able to at least lift a finger.

Patrick_Devine 2 hours ago|||
If you're on a Mac, use the MLX backend versions which are considerably faster than the GGML based versions (including llama.cpp) and you don't need to fiddle with the context size. The models are `qwen3.6:35b-a3b-nvfp4`, `qwen3.6:35b-a3b-mxfp8`, and `qwen3.6:35b-a3b-mlx-bf16`.
txtsd 1 hour ago|||
I only have 16GB VRAM, and my system uses ~4GB from that. What are my options? I got this one: `Qwen3.6-35B-A3B-UD-IQ2_XXS.gguf`
nunodonato 3 hours ago||||
https://sleepingrobots.com/dreams/stop-using-ollama/
pj_mukh 5 hours ago|||
have you found a model that does this with usable speeds on an M2/M3?
postalcoder 5 hours ago||
On a M4 MBP ollama's qwen3.5:35b-a3b-coding-nvfp4 runs incredibly fast when in the claude/codex harness. M2/M3 should be similar.

It's incomparably faster than any other model (i.e. it's actually usable without cope). Caching makes a huge difference.

terataiijo 5 hours ago||
lmao they are so fast yooo
ttul 5 hours ago|||
Yes. How do they do it? Literally they must have PagerDuty set up to alert the team the second one of the labs releases anything.
beernet 5 hours ago|||
They obviously collaborate with some of the labs prior to the official release date.
sigbottle 5 hours ago||
That... is a more plausible explanation I didn't think of.
danielhanchen 5 hours ago||
Yes we collab with them!
qskousen 2 hours ago||
Sorry, this is a bit of a tangent, but I noticed you also released UD quants of ERNIE-Image the same day it was released, which as I understand requires generating a bunch of images. I've been working to do something similar with my CLI program ggufy, and was curious if you had any info you could share on the kind of compute you put into that, and whether you generate full images or look at latents?
sigbottle 5 hours ago|||
Is quantization a mostly solved pipeline at this point? I thought that architectures were varied and weird enough that you can't just click a button, say "go optimize these weights", and go. New models ship new code they want to run, right, so you'd have to analyze the code and insert the quantization at the right places, automatically, then make sure that doesn't degrade perf?

Maybe I just don't understand how quantization works, but I thought quantization was a very nasty problem involving a lot of plumbing.

bildung 5 hours ago||||
Bad QA :/ They had a bunch of broken quantizations in the last releases
danielhanchen 5 hours ago||
1. Gemma-4 we re-uploaded 4 times - 3 times were due to 10-20 llama.cpp bug fixes - we had to notify people to update to the correct ones. The 4th was an official Gemma chat template improvement from Google themselves.

2. Qwen3.5 - we shared our 7TB of research artifacts showing which layers not to quantize - all providers' quants were under-optimized, not broken - the ssm_out and ssm_* tensors were the issue - we're now the best in terms of KLD and disk space

3. MiniMax 2.7 - we swiftly fixed it due to NaN PPL - we found the issue in all quants regardless of provider - so it affected everyone not just us. We wrote a post on it, and fixed it - others have taken our fix and fixed their quants, whilst some haven't updated.

Note we also fixed bugs in many OSS models like Gemma 1, Gemma 3, Llama chat template fixes, Mistral, and many more.

Unfortunately sometimes quants break, but we fix them quickly, and 95% of the time these are out of our hands.

We fix them swiftly and write up blogs on what happened. Other providers simply take our blogs and fixes and re-apply and re-use them.

rohansood15 4 hours ago|||
Thanks for all the amazing work Daniel. I remember you guys being late to OH because you were working on weights released the night before - and it's great to see you guys keep up the speed!
danielhanchen 4 hours ago||
Oh thanks haha :) We try our best to get model releases out the door! :) Hope you're doing great!
bildung 5 hours ago|||
Fair enough, appreciate the detailed response! Can you elaborate why other quantizations weren't affected (e.g. bartowski)? Simply because they were straight Q4 etc. for every layer?
danielhanchen 4 hours ago||
No, Bartowski's are more affected (38% NaN) than ours (22%) - for MiniMax 2.7 see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax...

We already fixed ours. Bart hasn't yet but is still working on it following our findings.

blk.61.ffn_down_exps in Q4_K or Q5_K failed - it must be in Q6_K otherwise it overflows.

For the others, yes layers in some precision don't work. For eg Qwen3.5 ssm_out must be minimum Q4-Q6_K.

ssm_alpha and ssm_beta must be Q8_0 or higher.

Again Bart and others apply our findings - see https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwe...

bildung 4 hours ago||
Thanks again, TIL
danielhanchen 4 hours ago||
Thanks!
ekianjo 5 hours ago|||
yeah and often their quants are broken. They had to update their Gemma4 quants like 4 times in the past 2 weeks.
danielhanchen 5 hours ago||
No it's not our fault - re our 4 uploads - the first 3 are due to llama.cpp fixing bugs - this was out of our control (we're llama.cpp contributors, but not the main devs) - we could have waited, but it's best to update when multiple (10-20) bugs are fixed.

The 4th is Google themselves improving the chat template for tool calling for Gemma.

https://github.com/ggml-org/llama.cpp/issues/21255 was another issue: CUDA 13.2 was broken - this was NVIDIA's CUDA compiler itself breaking - fully out of our hands - but we provided a solution for it.

kanemcgrath 17 minutes ago||
I have been using Qwen3.5-35B-A3B a lot in local testing, and it is by far the most capable model that can fit on my machine. I think quantization technology has really upped its game around these models, and there were two quants that blew me away:

Mudler APEX-I-Quality, and later Byteshape Q3_K_S-3.40bpw.

Both made claims that seemed too good to be true, but I couldn't find any traces of lobotomization during long agent coding loops. With the byteshape quant I am up to 40+ t/s, which is a speed that makes agents much more pleasant. On an RTX 3060 12GB and 32GB of system RAM, I went from slamming all my available memory to having like 14GB to spare.

mtct88 6 hours ago||
Nice release from the Qwen team.

Small openweight coding models are, imho, the way to go for custom agents tailored to the specific needs of dev shops that are restricted from accessing public models.

I'm thinking about banking and healthcare sector development agencies, for example.

It's a shame this remains a market largely overlooked by Western players, Mistral being the only one moving in that direction.

lelanthran 5 hours ago||
> It's a shame this remains a market largely overlooked by Western players, Mistral being the only one moving in that direction.

I've said in a recent comment that Mistral is the only one of the current players who appears to be moving towards a sustainable business - all the other AI companies are simply looking for a big payday, not to operate sustainably.

gunalx 1 hour ago||
Meta with the Llama series as well; they just didn't manage to keep upping the game with and after Llama 4.
Aurornis 4 hours ago|||
I play with the small open weight models and I disagree. They are fun, but they are not in the same class as hosted models running on big hardware.

If some organization forbade external models they should invest in the hardware to run bigger open models. The small models are a waste of time for serious work when there are more capable models available.

NitpickLawyer 6 hours ago|||
I agree with the sentiment, but these models aren't suited for that. You can run much bigger models on prem with ~100k of hardware, and those can actually be useful in real-world tasks. These small models are fun to play with, but are nowhere close to solving the needs of a dev shop working in healthcare or banking, sadly.
kennethops 6 hours ago|||
I love the idea of building competitor to open weight models but damn is this an expensive game to play
smrtinsert 5 hours ago||
How true is this? How does a regulated industry confirm the model itself wasn't trained with malicious intent?
ndriscoll 5 hours ago||
Why would it matter if the model is trained with malicious intent? It's a pure function. The harness controls security policies.
coppsilgold 1 hour ago||
Much like a developer can insert a backdoor as a "bug" so can an LLM that was trained to do it.

One way you could probably do it is by identifying a commonly used library that can be misused in a way that would allow some kind of time-of-check to time-of-use (TOCTOU) exploit. Then you train the LLM to use the library incorrectly in this way.
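The classic check-then-use shape of that bug class, as a hypothetical Python snippet (purely illustrative of the pattern, not anything a model has been shown to emit):

    import os

    def read_config(path: str) -> str:
        # "Check": looks safe at review time...
        if os.path.exists(path) and not os.path.islink(path):
            # ...but between the check and the open() below, the path can be
            # swapped (e.g. replaced with a symlink) by another process.
            with open(path) as f:  # "Use"
                return f.read()
        raise FileNotFoundError(path)

    # A race-free version would open first and validate the resulting file
    # descriptor (os.open with O_NOFOLLOW, then os.fstat) instead of the path.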

alecco 4 hours ago||
Related interesting find on Qwen.

"Qwen's base models live in a very exam-heavy basin - distinct from other base models like llama/gemma. Shown below are the embeddings from randomly sampled rollouts from ambiguous initial words like "The" and "A":"

https://xcancel.com/N8Programs/status/2044408755790508113

armanj 6 hours ago||
I recall a Qwen exec posted a public poll on Twitter asking which Qwen3.6 model people wanted to see open-sourced, and the 27b variant was by far the most popular choice. Not sure why they ignored it lol.
zozbot234 5 hours ago||
The 27B model is dense. Releasing a dense model first would be terrible marketing, whereas 35A3B is a lot smarter and more quick-witted by comparison!
arxell 5 hours ago|||
Each has its pros and cons. Dense models of equivalent total size obviously do run slower if all else is equal; however, the fact is that 35A3B is absolutely not 'a lot smarter'. In fact, if you set aside the slower inference rates, Qwen3.5 27B is arguably more intelligent and reliable. I use both regularly on a Strix Halo system... Just see the comparison table here: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF . The problem you have to acknowledge if running locally (especially for coding tasks) is that your primary bottleneck quickly becomes prompt processing (NOT token generation), and here the differences between dense and MoE are variable and usually negligible.
nunodonato 3 hours ago|||
I was hoping this would be the model to replace our Qwen3.5-27B, but the difference is marginal. Too risky; I'll pass and wait for the release of a dense version.
Mikealcl 3 hours ago|||
Could you explain why prompt processing is the bottleneck, please? I've seen this behavior but I don't understand why.
zozbot234 3 hours ago||
You should be able to save a lot on prefill by stashing KV-cache shared prefixes (since KV-cache for plain transformers is an append-only structure) to near-line bulk storage and fetching them in as needed. Not sure why local AI engines don't do this already since it's a natural extension of session save/restore and what's usually called prompt caching.
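A minimal sketch of the bookkeeping side of that idea, assuming an engine that can dump and restore KV-cache state (llama.cpp has session/state save-load along these lines; the helper names here are hypothetical):

    import hashlib, json
    from pathlib import Path

    CACHE_DIR = Path("kv-cache")

    def prefix_key(tokens: list[int]) -> str:
        return hashlib.sha256(json.dumps(tokens).encode()).hexdigest()

    def save_prefix(tokens: list[int], kv_state: bytes) -> None:
        CACHE_DIR.mkdir(exist_ok=True)
        (CACHE_DIR / prefix_key(tokens)).write_bytes(kv_state)

    def longest_cached_prefix(tokens: list[int]):
        """Walk down from the full prompt; return the longest stored prefix and its file."""
        for n in range(len(tokens), 0, -1):
            path = CACHE_DIR / prefix_key(tokens[:n])
            if path.exists():
                return n, path
        return 0, None

    # On a new request: restore the KV state for the matched prefix, then prefill
    # only tokens[n:] instead of re-processing the whole prompt.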
halJordan 1 hour ago||||
That makes no sense. If you were just going to release the "more hype-able because it's quicker" model, then why have a poll?
JKCalhoun 4 hours ago||||
"…whereas 35A3B is a lot smarter…"

Must. Parse. Is this a 35 billion parameter model that needs only 3 billion parameters to be active? (Trying to keep up with this stuff.)

EDIT: A later comment seems to clarify:

"It's a MoE model and the A3B stands for 3 Billion active parameters…"

Miraste 5 hours ago|||
What? 35B-A3B is not nearly as smart as 27B.
ekianjo 5 hours ago|||
yeah the 27B feels like something completely different. If you use it on long context tasks it performs WAY better than 35b-a3b
Der_Einzige 4 hours ago||
I've been telling analysts/investors for a long time that dense architectures aren't "worse" than sparse MoEs and to continue to anticipate the see-saw of releases on those two sub-architectures. Glad to continuously be vindicated on this one.

For those who don't believe me. Go take a look at the logprobs of a MoE model and a dense model and let me know if you can notice anything. Researchers sure did.

zkmon 5 hours ago|||
Yes.
arunkant 5 hours ago|||
Probably coming next
zkmon 5 hours ago||
I'm guessing 3.5-27b would beat 3.6-35b. MoE is a bad idea, because for the same VRAM the 27b would leave a lot more room for context, and the quality of work directly depends on context size, not just the "B" number.
zozbot234 5 hours ago|||
MoE is not a bad idea for local inference if you have fast storage to offload to, and this is quickly becoming feasible with PCIe 5.0 interconnect.
perbu 3 hours ago|||
MoE is excellent for unified-memory inference hardware like the DGX Spark, Mac Studio, etc. Large memory size means you can have quite a few B's, and the smaller experts keep those tokens flowing fast.
cpburns2009 1 hour ago||
Anyone else getting gibberish when running unsloth/Qwen3.6-35B-A3B-GGUF:UD-IQ4_XS on CUDA (llama.cpp b8815)? UD-Q4_K_XL is fine, as is Vulkan in general.
seemaze 5 hours ago||
Fingers crossed for mid and larger models as well. I'd personally love to see Qwen3.6-122B-A10B.
Vespasian 1 hour ago|
That would be really great. Though 3.5 122B is already doing a lot of work in our setup.
rvnx 5 hours ago|
China won again in terms of openness