I assume you mean outperforms in speed on the same model, not in usability compared to other more capable models.
(For those who are getting their hopes up on using local LLMs to be any replacement for Sonnet or Opus.)
Personally though, I find Qwen useless for anything but coding tasks because if its insufferable sycophancy. It's like 4o dialed up to 20, every reply starts with "You are absolutely right" with zero self awareness. And for coding, only the best model available is usually sensible to use otherwise it's just wasted time.
persona: brief rude senior
Is it really just more training data? I doubt it’s architecture improvements, or at the very least, I imagine any architecture improvements are marginal.
The sweet spot isn't in the "hundreds of billions" range, it's much lower than that.
Anyways your perception of a model's "quality" is determined by careful post-training.
On my M3 Air w/ 24GB of memory 27B is 2 tok/s but 35B A3B is 14-22 tok/s which is actually usable.
35B A3B is faster but didn't do too well in my limited testing.
Also seemed to ignore fairly simple instructions in CLAUDE.md about building and running tests.
Here's how I got the 35B model to work: https://gist.github.com/danthedaniel/c1542c65469fb1caafabe13...
The 35B model is still pretty slow on my machine but it's cool to see it working.
I have a 16GB GPU as well, but have never run a local model so far. According to the table in the article, 9B and 8-bit -> 13 GB and 27B and 3-bit seem to fit inside the memory. Or is there more space required for context etc?
Inference engines like llama.cpp will offload model and context to system ram for you, at the cost of performance. A MoE like 35B-A3B might serve you better than the ones mentioned, even if it doesn't fit entirely on the GPU. I suggest testing all three. Perhaps even 122-A10B if you have plenty of system ram.
Q4 is a common baseline for simple tasks on local models. I like to step up to Q5/Q6 for anything involving tool use on the smallish models I can run (9B and 35B-A3B).
Larger models tolerate lower quants better than small ones, 27B might be usable at 3 bpw where 9B or 4B wouldn't. You can also quantize the context. On llama.cpp you'd set the flags -fa on, -ctk x and ctv y. -h to see valid parameters. K is more sensitive to quantization than V, don't bother lowering it past q8_0. KV quantization is allegedly broken for Qwen 3.5 right now, but I can't tell.
IQ4_XS 5.17 GB, Q4_K_S 5.39 GB, IQ4_NL 5.37 GB, Q4_0 5.38 GB, Q4_1 5.84 GB, Q4_K_M 5.68 GB, UD-Q4_K_XL 5.97 GB
And no explanation for what they are and what tradeoffs they have, but in the turorial it explicitly used Q4_K_XL with llama.cpp .
I'm using a macmini m4 16GB and so far my prefered model is Qwen3-4B-Instruct-2507-Q4_K_M although a bit chat but my test with Qwen3.5-4B-UD-Q4_K_XL shows it's a lot more chat, I'm basically using it in chat mode for basic man style questions.
I understand that each user has it's own specific needs but would be nice to have a place that have a list of typical models/hardware listed with it's common config parameters and memory usage.
Even on redit specific channels it's a bit of nightmare of loot of talk but no concrete config/usage clear examples.
I'm floowing this topic heavilly for the last 3 months and I see more confusion than clarification.
Right now I'm getting good cost/benefit results with the qwen cli with coder-model in the cloud and watching constantly to see when a local model on affordable hardware with enviroment firendly energy comsumption arrives.
https://www.localscore.ai from Mozilla Buolders was supposed to be this, but there are not enough users I guess, I didn't find any Qwen 3.5 entries yet
It may be interesting to try a 6bit quant of qwen3.5-35b-a3b - I had pretty good results with it running it on a single 4090 - for obvious reasons I didn’t try it on the old mac.
I am using 8bit quant of qwen3.5-27b as more or less the main engine for the past ~week and am quite happy with it - but that requires more memory/gpu power.
HTH.
1 │ DeepSeek API -- 100%
2 │ qwen3.5:35b-a3b-q8_0 (thinking) -- 92.5%
3 │ qwen3.5:35b-a3b-q4_K_M (thinking) -- 90.0%
4 │ qwen3.5:35b-a3b-q8_0 (no-think) -- 81.3%
5 │ qwen3.5:27b-q8_0 (thinking) -- 75.3%
I expected the 27B dense model to score higher. Disclaimer: those numbers are from one-shot replies evaluations, the model was not put in a context where it could reiterate as an agent.Imo though, going below 4 bits for anything that's less than 70B is not worth the degradation. BF/FP16 and Q8 are usually indistinguishable except for vision encoders (mmproj) and for really small models, like under 2B.
I built https://github.com/brainless/dwata to submit for Google Gemini Hackathon, and focused on an agent that would replace email content with regex to extract financial data. I used Gemini 3 Flash.
After submitting to the contest, I kept working on branch: reverse-template-based-financial-data-extraction to use Ministral 3:3b. I moved away from regex detection to a reverse template generation. Like Jinja2 syntax but in reverse, from the source email.
Financial data extraction now works OK ish and I am constantly improving this to aim for a launch soon. I will try with Qwen 3.5 Small, maybe 4b model. Both Ministral 3:3b and Qwen 3.5 Small:4b will fit on the smallest Mac Mini M4 or a RTX 3060 6GB (I have these devices). dwata should be able to process all sorts of financial data, transaction and meta-data (vendor, reference #), at a pretty nice speed. Keep it running a couple hours and you can go through 20K or 30K emails. All local!
This was a qwen3-coder-next 35B model on M4 Max with 64GB which seems to be 51GB size according to ollama. Have not yet tried the variants from the TFA.
made me laugh, especially in the context of LLMs.