| Capability | Benchmark | GPT-5.2-Thinking | Claude-Opus-4.5 | Gemini 3 Pro | DeepSeek V3.2 | Qwen3-Max-Thinking |
|---|---|---|---|---|---|---|
| Knowledge | MMLU-Pro | 87.4 | 89.5 | *89.8* | 85.0 | 85.7 |
| Knowledge | MMLU-Redux | 95.0 | 95.6 | *95.9* | 94.5 | 92.8 |
| Knowledge | C-Eval | 90.5 | 92.2 | 93.4 | 92.9 | *93.7* |
| STEM | GPQA | *92.4* | 87.0 | 91.9 | 82.4 | 87.4 |
| STEM | HLE | 35.5 | 30.8 | *37.5* | 25.1 | 30.2 |
| Reasoning | LiveCodeBench v6 | 87.7 | 84.8 | *90.7* | 80.8 | 85.9 |
| Reasoning | HMMT Feb 25 | *99.4* | - | 97.5 | 92.5 | 98.0 |
| Reasoning | HMMT Nov 25 | - | - | 93.3 | 90.2 | *94.7* |
| Reasoning | IMO-AnswerBench | *86.3* | 84.0 | 83.3 | 78.3 | 83.9 |
| Agentic Coding | SWE-bench Verified | 80.0 | *80.9* | 76.2 | 73.1 | 75.3 |
| Agentic Search | HLE (w/ tools) | 45.5 | 43.2 | 45.8 | 40.8 | *49.8* |
| Instruction Following & Alignment | IFBench | *75.4* | 58.0 | 70.4 | 60.7 | 70.9 |
| Instruction Following & Alignment | MultiChallenge | 57.9 | 54.2 | *64.2* | 47.3 | 63.3 |
| Instruction Following & Alignment | ArenaHard v2 | 80.6 | 76.7 | 81.7 | 66.5 | *90.2* |
| Tool Use | Tau² Bench | 80.9 | *85.7* | 85.4 | 80.3 | 82.1 |
| Tool Use | BFCL V4 | 63.1 | *77.5* | 72.5 | 61.2 | 67.7 |
| Tool Use | Vita Bench | 38.2 | *56.3* | 51.6 | 44.1 | 40.9 |
| Tool Use | Deep Planning | *44.6* | 33.9 | 23.3 | 21.6 | 28.7 |
| Long Context | AALCR | 72.7 | *74.0* | 70.7 | 65.0 | 68.7 |

*Asterisks mark the best score in each row; "-" means no reported result.*

It doesn't mean anything. No frontier lab is trying hard to improve the way its model produces SVG files.
I would also add that the frontier labs are spending all their post-training time working on the shit that is actually making them money: i.e., writing code and improving tool calling.
The pelican-on-a-bicycle thing is funny, yes, but it doesn't really translate into more revenue for AI labs, so there's a reason it isn't radically improving over time.
I don't think SVG is the problem. It just shows that models are fragile (nothing new): even if they can (probably) make a good PNG of a pelican on a bike, and they can (probably) write some good SVG, they don't "transfer" skills between the two because they don't "understand" them.
I do expect models to fail randomly on tasks that aren't "average and common", so for me personally the benchmark isn't very useful (which doesn't mean they can't work, just that I wouldn't bet on it). But if there are people who think "the LLM produced an SVG for my request, so it can produce an SVG for any image", then there might be some value in it.
Current-gen LLMs might be able to do that with in-context learning, but if limited to pretraining alone, or even pretraining followed by post-training, would one book be enough to impart genuine SVG composition and interpretation skills to the model weights themselves?
My understanding is that the answer would be no, a single copy of the SVG spec would not be anywhere near enough to make the resulting base model any good at SVG authorship. Quite a few other examples and references would be needed in either pretraining, post-training or both.
So one measure of AGI -- necessary but not sufficient on its own -- might be the ability to gain knowledge and skills with no more exposure to training material than a human student would be given. We shouldn't have to feed it terabytes of highly-redundant training material, as we do now, and spend hundreds of GWh to make it stick. Of course that could change by 5 PM today, the way things are going...
You could try to rasterize the SVG and then use an image2text model to describe it, but I suspect it would just “see through” any flaws in the depiction and describe it as “a pelican on a bicycle” anyway.
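As a minimal sketch of that check, assuming cairosvg for rasterization and an off-the-shelf BLIP captioning model from Hugging Face (the inline SVG below is a hand-written stand-in for model output, not anything a model actually produced):

```python
# Rasterize an SVG string, then caption the raster with an image2text model.
# Assumes cairosvg, Pillow, and transformers are installed.
import io

import cairosvg
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

# Crude hand-written "pelican on a bicycle" used as a placeholder input.
PELICAN_SVG = """
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 120">
  <!-- bicycle: two wheels and a bare frame -->
  <circle cx="55" cy="90" r="22" fill="none" stroke="black"/>
  <circle cx="145" cy="90" r="22" fill="none" stroke="black"/>
  <path d="M55 90 L95 55 L145 90 M95 55 L120 55" stroke="black" fill="none"/>
  <!-- pelican: body, head, and an oversized bill -->
  <ellipse cx="100" cy="40" rx="20" ry="13" fill="white" stroke="black"/>
  <circle cx="122" cy="26" r="7" fill="white" stroke="black"/>
  <path d="M128 26 L155 33 L128 34 Z" fill="orange" stroke="black"/>
</svg>
"""

def caption_svg(svg_source: str) -> str:
    """Rasterize an SVG string to PNG, then describe it with BLIP."""
    png_bytes = cairosvg.svg2png(
        bytestring=svg_source.encode("utf-8"),
        output_width=512, output_height=512,
        background_color="white",  # avoid black strokes vanishing on transparency
    )
    image = Image.open(io.BytesIO(png_bytes)).convert("RGB")

    processor = BlipProcessor.from_pretrained(
        "Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained(
        "Salesforce/blip-image-captioning-base")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

if __name__ == "__main__":
    print(caption_svg(PELICAN_SVG))
```

If the caption still comes back as something like "a bird on a bicycle" when the SVG is visibly mangled, that's the "seeing through the flaws" problem in action.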
Prompt: "What happened on Tiananmen square in 1989?"
Reply: "Oops! There was an issue connecting to Qwen3-Max. Content Security Warning: The input text data may contain inappropriate content."
It turns out "AI company avoids legal jeopardy" is universal behavior.
https://jonathanturley.org/2023/04/06/defamed-by-chatgpt-my-...
Yes, each LLM might give the thing a certain tone (like "Tiananmen was a protest with some people injured"), but completely forbidding any mention of it seems to just invite the Streisand effect.
Agreed, just tested it out on ChatGPT. Surprising.
Then I asked the same thing on Qwen3-Max (this model) and it answered.
I mean, I have always said: ask Chinese models American questions and American models Chinese questions.
I agree the Tiananmen Square thing isn't a good look for China, but the Jonathan Turley thing isn't one for ChatGPT either.
I think sacrifices are made on both sides, and the main thing is still how good they are at general-purpose work like actual coding, not Jonathan Turley / Tiananmen Square. Most people aren't going to ask those questions, or will have the common sense not to pose Tiananmen Square as a genuine question to a Chinese model, and likewise American censorship to American models, I guess. Plus there are European models like Mistral for such questions, which is what I would recommend lol (or maybe South Korea's model too).
Let's see how good Qwen is at "real coding".
> The AI chatbot fabricated a sexual harassment scandal involving a law professor--and cited a fake Washington Post article as evidence.
https://www.washingtonpost.com/technology/2023/04/05/chatgpt...
That is way different. Let's review:
a) The Chinese Communist Party builds an LLM that refuses to talk about their previous crimes against humanity.
b) Some Americans build an LLM. They make some mistakes: their LLM points to an innocent law professor as a criminal. It also invents a fictitious Washington Post article.
The law professor threatens legal action. The American creators of the LLM begin censoring the professor's name in their service to make the threat go away.
Nice curveball though. Damn.
China's orders come from the government. Turley is a guy OpenAI found its models incorrectly smearing, so they cut him out.
I don't think a single company debugging its model and a national government dictating speech are a genuine comparison.
We are in the realm of the semantic/symbolic, where even the release article needs some meta discussion.
It's quite the litmus test for LLMs. LLMs just carry humanity's flaws.
Yes, of course LLMs are shaped by their creators. Qwen is made by Alibaba Group. They are essentially one with the CCP.
It is deeply, deeply programmed around an "ethical system" which forbids it from talking about it.
P.S. I realize Qwen3-Max-Thinking isn't actually an open-weight model (it's only accessible via API), but I'm still curious how it compares to the open-weight options:
- MiniMax
- GLM
- DeepSeek