
Posted by vinhnx 14 hours ago

Qwen3-Max-Thinking (qwen.ai)
421 points | 386 comments
ndom91 9 hours ago|
Not released on Huggingface? :sadge:
elinear 11 hours ago||
Benchmarks pasted here, with top scores highlighted. Overall Qwen Max is pretty competitive with the others here.

  Capability                            Benchmark           GPT-5.2-Thinking   Claude-Opus-4.5   Gemini 3 Pro   DeepSeek V3.2   Qwen3-Max-Thinking
  Knowledge                             MMLUPro             87.4               89.5              *89.8*         85.0            85.7            
  Knowledge                             MMLURedux           95.0               95.6              *95.9*         94.5            92.8            
  Knowledge                             CEval               90.5               92.2              93.4           92.9            *93.7*      
  STEM                                  GPQA                *92.4*             87.0              91.9           82.4            87.4           
  STEM                                  HLE                 35.5               30.8              *37.5*         25.1            30.2           
  Reasoning                             LiveCodeBench v6    87.7               84.8              *90.7*         80.8            85.9           
  Reasoning                             HMMT Feb 25         *99.4*             -                 97.5           92.5            98.0            
  Reasoning                             HMMT Nov 25         -                  -                 93.3           90.2            *94.7*      
  Reasoning                             IMOAnswerBench      *86.3*             84.0              83.3           78.3            83.9           
  Agentic Coding                        SWE Verified        80.0               *80.9*            76.2           73.1            75.3           
  Agentic Search                        HLE (w/ tools)      45.5               43.2              45.8           40.8            *49.8*     
  Instruction Following & Alignment     IFBench             *75.4*             58.0              70.4           60.7            70.9           
  Instruction Following & Alignment     MultiChallenge      57.9               54.2              *64.2*         47.3            63.3           
  Instruction Following & Alignment     ArenaHard v2        80.6               76.7              81.7           66.5            *90.2*      
  Tool Use                              Tau² Bench          80.9               *85.7*            85.4           80.3            82.1           
  Tool Use                              BFCLV4              63.1               *77.5*            72.5           61.2            67.7            
  Tool Use                              Vita Bench          38.2               *56.3*            51.6           44.1            40.9           
  Tool Use                              Deep Planning       *44.6*             33.9              23.3           21.6            28.7           
  Long Context                          AALCR               72.7               *74.0*            70.7           65.0            68.7
igravious 8 hours ago||
The title of the article is: “Pushing Qwen3-Max-Thinking Beyond its Limits”
airstrike 13 hours ago||
2026 will be the year of open and/or small models.
acessoproibido 12 hours ago||
What makes you say that? This is neither open nor small
airstrike 11 hours ago||
open as in you can run it yourself
Squarex 11 hours ago||
you can't run this yourself... max has no open weights
airstrike 7 hours ago||
For now
DeathArrow 13 hours ago||
Mandatory pelican on bicycle: https://www.svgviewer.dev/s/U6nJNr1Z
kennykartman 13 hours ago||
Ha ha, I was curious about that! I wonder if (when? if not already) some company is using some version of this in their training set. I'm still impressed that this benchmark has been out for so long and yet produces this kind of (ugly?) result.
saberience 13 hours ago|||
No one cares about optimizing for this, because it's a stupid benchmark.

It doesn't mean anything. No frontier lab is trying hard to improve the way its model produces SVG format files.

I would also add, the frontier labs are spending all their post-training time on working on the shit that is actually making them money: i.e. writing code and improving tool calling.

The Pelican on a bicycle thing is funny, yes, but it doesn't really translate into more revenue for AI labs so there's a reason it's not radically improving over time.

simonw 13 hours ago|||
+1 to "it's a stupid benchmark".
esafak 10 hours ago||
You can always suggest a new one ;)
obidee2 12 hours ago||||
Why stupid? Vector images are widely used and extremely useful, both directly and to render raster images at different scales. It's also highly connected with spatial and geometric reasoning and precision, which would open up a whole new class of problems these models could tackle. Sure, it's secondary to raster image analysis and generation, but I'm curious why it would be stupid to pursue.
storystarling 12 hours ago||||
I suspect there is actually quite a bit of money on the table here. For those of us running print-on-demand workflows, the current raster-to-vector pipeline is incredibly brittle and expensive to maintain. Reliable native SVG generation would solve a massive architectural headache for physical product creation.
lofaszvanitt 13 hours ago|||
It shows that these are nowhere near anything resembling human intelligence. You wouldn't have to optimize for anything if it were a general intelligence of sorts.
CamperBob2 13 hours ago||
Here's a pencil and paper. Let's see your SVG pelican.
vladms 12 hours ago|||
So you think if we gave a pencil and paper to the model it would do better?

I don't think SVG is the problem. It just shows that models are fragile (nothing new): even if they can (probably) make a good PNG with a pelican on a bike, and they can (probably) write some good SVG, they don't "transfer" things because they don't "understand" them.

I do expect models to fail randomly at tasks that are not "average and common", so for me personally the benchmark is not very useful (and that does not mean they can't work, just that I would not bet on it). If there are people who think "if an LLM outputted an SVG for my request, it means it can output an SVG for every image", there might be some value.

zebomon 12 hours ago|||
This exactly. I don't understand the argument that seems to be, if it were real intelligence, it would never have to learn anything. It's machine learning, not machine magic.
CamperBob2 12 hours ago||
One aspect worth considering is that, given a human who knows HTML and graphics coding but who had never heard of SVG, they could be expected to perform such a task (eventually) if given a chance to train on SVG from the spec.

Current-gen LLMs might be able to do that with in-context learning, but if limited to pretraining alone, or even pretraining followed by post-training, would one book be enough to impart genuine SVG composition and interpretation skills to the model weights themselves?

My understanding is that the answer would be no, a single copy of the SVG spec would not be anywhere near enough to make the resulting base model any good at SVG authorship. Quite a few other examples and references would be needed in either pretraining, post-training or both.

So one measure of AGI -- necessary but not sufficient on its own -- might be the ability to gain knowledge and skills with no more exposure to training material than a human student would be given. We shouldn't have to feed it terabytes of highly-redundant training material, as we do now, and spend hundreds of GWh to make it stick. Of course that could change by 5 PM today, the way things are going...

NitpickLawyer 13 hours ago||||
It would be trivial to detect such gaming, tho. That's the beauty of the test, and that's why they're probably not doing it. If a model draws "perfect" (whatever that means) pelicans on a bike, you start testing for owls riding a lawnmower, or crows riding a unicycle, or x _verb_ on y ...
Sharlin 13 hours ago||
It could still be special-case RLHF trained, just not up to perfection.
derefr 12 hours ago|||
It’d be difficult to use in any automated process, as the judgement for how good one of these renditions is, is very qualitative.

You could try to rasterize the SVG and then use an image2text model to describe it, but I suspect it would just “see through” any flaws in the depiction and describe it as “a pelican on a bicycle” anyway.
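A toy illustration of the point above, assuming only Python's standard `xml.etree` module (the sample SVG and function name are made up for this sketch): a purely structural check can confirm a generated SVG parses and count its drawing primitives, but it says nothing about whether the rendering actually looks like a pelican. That qualitative step would still need rasterization plus a vision model.

```python
# Toy structural check on a generated SVG (illustrative only).
# It verifies the markup parses and counts drawing primitives, but it
# cannot judge whether the picture resembles a pelican on a bicycle.
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"
PRIMITIVES = {"path", "circle", "ellipse", "rect", "line", "polygon", "polyline"}

def count_primitives(svg_text: str) -> int:
    root = ET.fromstring(svg_text)
    # Strip the SVG namespace from each tag before matching.
    return sum(1 for el in root.iter()
               if el.tag.removeprefix(SVG_NS) in PRIMITIVES)

sample = ('<svg xmlns="http://www.w3.org/2000/svg">'
          '<circle r="5"/><path d="M0 0 L10 10"/></svg>')
print(count_primitives(sample))  # 2
```

A model could easily game a check like this with random shapes, which is exactly why the benchmark resists automation.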

lofaszvanitt 13 hours ago||
A salivating pelican :D.
lysace 13 hours ago||
I tried it at https://chat.qwen.ai/.

Prompt: "What happened on Tiananmen square in 1989?"

Reply: "Oops! There was an issue connecting to Qwen3-Max. Content Security Warning: The input text data may contain inappropriate content."

overfeed 12 hours ago||
Go ahead and ask ChatGPT who Jonathan Turley is, you'll get a similar error "Unable to process response".

It turns out "AI company avoids legal jeopardy" is universal behavior.

eunos 8 hours ago|||
Now I'm intrigued why a free-speech attorney (per his wiki) kinda spooks an AI model.
tstrimple 4 hours ago||
Sounds like ChatGPT was making up stories about him being a sexual predator.

https://jonathanturley.org/2023/04/06/defamed-by-chatgpt-my-...

vladms 12 hours ago||||
Try Mistral (it works for the examples here, at least). It probably has the normal protections around making harmful things, but I find it quite bad if, in a country, it's illegal to even mention certain names or events.

Yes, each LLM might give the thing a certain tone (like "Tiananmen was a protest with some people injured"), but completely forbidding mentioning them seems to just ask for the Streisand effect.

Imustaskforhelp 12 hours ago||||
> Jonathan Turley

Agreed just tested it out on Chatgpt. Surprising.

Then I asked it on Qwen 3 Max (this model) and it answered.

I mean, I have always said: ask Chinese models American questions and American models Chinese questions.

I agree the Tiananmen Square thing isn't a good look for China, but neither is the Jonathan Turley thing for ChatGPT.

I think sacrifices are made on both sides, and the main thing is still how good they are at general-purpose things like actual coding, not Jonathan Turley/Tiananmen Square. Most likely people aren't going to ask, or have the common sense not to ask, Tiananmen Square as a genuine question to Chinese models, and likewise American-censored topics to American models, I guess. Plus there are European models like Mistral for such questions, which is what I would recommend lol (or South Korea's model too, maybe).

Let's see how good qwen is at "real coding"

lysace 12 hours ago|||
This one seems to be related to an individual who was incorrectly smeared by chatgpt. (Edited.)

> The AI chatbot fabricated a sexual harassment scandal involving a law professor--and cited a fake Washington Post article as evidence.

https://www.washingtonpost.com/technology/2023/04/05/chatgpt...

That is way different. Let's review:

a) The Chinese Communist Party builds an LLM that refuses to talk about their previous crimes against humanity.

b) Some Americans build an LLM. They make some mistakes: their LLM points out an innocent law professor as a criminal. It also invents a fictitious Washington Post article.

The law professor threatens legal action. The American creators of the LLM begin censoring the name of the professor in their service to make the threat go away.

Nice curveball though. Damn.

overfeed 12 hours ago||
As I said earlier - both subjects present legal jeopardy in the respective jurisdictions, and both result in unexplained errors to the users.
WarmWash 11 hours ago||
But you can use pretty much any other model or search engine to learn about Turley.

China's orders come from the government. Turley is a guy that OpenAI found its models incorrectly smearing, so they cut him out.

I don't think a single company debugging its model and a national government dictating speech are a genuine comparison.

tekno45 13 hours ago|||
ask who was responsible for the insurrection on january 6th
lysace 13 hours ago||
You do it, my IP is now flagged (tried incognito and clearing cookies) - they want to have my phone number to let me continue using it after that one prompt.
tekno45 12 hours ago||
thats even funnier. thanks for the update.
asciii 13 hours ago|||
This is what I find hilarious when these articles assess "factual" knowledge.

We are at the realm of semantic / symbolic where even the release article needs some meta discussion.

It's quite the litmus test of LLMs. LLMs just carry humanity's flaws.

lysace 13 hours ago||
(Edited, sorry.)

Yes, of course LLMs are shaped by their creators. Qwen is made by Alibaba Group. They are essentially one with the CCP.

Erlangen 12 hours ago|||
It even censors content related to the GDR. I asked a question about the travel restrictions mentioned in Jenny Erpenbeck's novel Kairos, and it displayed a content security warning as well.
lifetimerubyist 13 hours ago|||
What happens when you run one of their open-weight models of the same family locally?
cmrdporcupine 6 hours ago|||
They will often try to negotiate you out of talking about it if you keep pressing. Watching their thinking about it is fascinating.

It is deep deep deeply programmed around an "ethical system" which forbids it from talking about it.

lysace 13 hours ago|||
Last time I tried something like that with an offline Qwen model I received a non-answer, no matter how hard I prompted it.
USAyesUSA 12 hours ago||
[dead]
maximgeorge 11 hours ago||
[dead]
sciencesama 12 hours ago||
What RAM and minimum system requirements do you need to run this on a personal system?
jen729w 12 hours ago|
If you have to ask, you don't have it.
xcodevn 12 hours ago|
I'm not familiar with these open-source models. My bias is that they're heavily benchmaxxing and not really helpful in practice. Can someone with a lot of experience using these, as well as Claude Opus 4.5 or Codex 5.2 models, confirm whether they're actually on the same level? Or are they not that useful in practice?

P.S. I realize Qwen3-Max-Thinking isn't actually an open-weight model (only accessible via API), but I'm still curious how it compares.

miroljub 12 hours ago||
I don't know where your impression about benchmaxxing comes from. Why would you assume closed models are not benchmaxxing? Being closed and commercial, they have more incentive to fake it than the open models.
segmondy 12 hours ago|||
You are not familiar, yet you claim a bias. Bias based on what? I have used pretty much only open-source models for the last 2 years. I occasionally give OpenAI and Anthropic a try to see how good they are, but I stopped supporting them when they started calling for regulation of open models. I haven't seen folks get ahead of me with closed models. I'm keeping up just fine with these free open models.
orangebread 12 hours ago||
I haven't used qwen3 max yet, but my gut feeling is that they are benchmaxxing. If I were to rate the open models worth using by rank it'd be:

- Minimax

- GLM

- Deepseek

segmondy 12 hours ago||
Your ranking is way off; Deepseek crushes Minimax and GLM. It's not even a competition.
orangebread 12 hours ago|||
Yeah, I get that there's nuance between all of them. I ranked Minimax higher for its agentic capabilities. In my own usage, Minimax's tool calling is stronger than Deepseek's and GLM's.