Posted by vinhnx 1/26/2026

Qwen3-Max-Thinking (qwen.ai)
502 points | 424 comments
deepakkumarb 1/27/2026|
I get that these approaches work, and they’re totally valid engineering trade-offs. But I don’t think they’re the same thing as real model improvements. If we’re just throwing more tokens, longer chains of thought, or extra tools at the problem, that feels more like brute force than genuine progress.

And that distinction matters in practice. If getting slightly better answers means using 5–10× more tokens or a bunch of external calls, the costs add up fast. That doesn’t scale well in the real world. It’s hard to call something a breakthrough when quality goes up but the bill and latency go up just as much.
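To make that concrete, here's a rough back-of-the-envelope in Python. The per-token price and request volume are made-up placeholders, not any vendor's actual numbers; the point is just how the multiplier compounds:

    # Back-of-the-envelope: a "thinking" run burning ~8x more output tokens
    # vs. a plain completion. All numbers below are hypothetical placeholders.
    PRICE_PER_1M_OUTPUT_TOKENS = 10.00       # hypothetical USD rate
    REQUESTS_PER_DAY = 50_000
    PLAIN_TOKENS = 500                       # typical short answer
    THINKING_TOKENS = PLAIN_TOKENS * 8       # long chain of thought

    def daily_cost(tokens_per_request: int) -> float:
        total = REQUESTS_PER_DAY * tokens_per_request
        return total / 1_000_000 * PRICE_PER_1M_OUTPUT_TOKENS

    print(f"plain:    ${daily_cost(PLAIN_TOKENS):,.2f}/day")      # $250.00/day
    print(f"thinking: ${daily_cost(THINKING_TOKENS):,.2f}/day")   # $2,000.00/day

Same model, same requests, roughly an 8x bigger bill before you've even counted the extra latency.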

I also think we should be careful about reading too much into benchmarks. A lot of them reward clever prompting and tool orchestration more than actual general intelligence. Once you factor in reliability, speed, and cost, the story often looks less impressive.

DeathArrow 1/26/2026||
Mandatory pelican on bicycle: https://www.svgviewer.dev/s/U6nJNr1Z
kennykartman 1/26/2026||
Ah ah, I was curious about that! I wonder if (when? if not already) some company is using some version of this in their training set. I'm still impressed by the fact that this benchmark has been out for so long and yet produces this kind of (ugly?) result.
NitpickLawyer 1/26/2026|||
It would be trivial to detect such gaming, tho. That's the beauty of the test, and that's why they're probably not doing it. If a model draws "perfect" (whatever that means) pelicans on a bike, you start testing for owls riding a lawnmower, or crows riding a unicycle, or x _verb_ on y ...
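Spinning up those probe variations is basically free; a quick sketch (the word lists here are just examples I made up, not any existing benchmark's):

    # Generate "x <verb> y" prompt variations to probe whether a model has
    # been special-cased on one famous prompt. Word lists are arbitrary.
    from itertools import product

    subjects = ["pelican", "owl", "crow", "walrus"]
    verbs = ["riding", "balancing on", "juggling"]
    objects_ = ["a bicycle", "a lawnmower", "a unicycle", "a skateboard"]

    prompts = [
        f"Generate an SVG of a {s} {v} {o}"
        for s, v, o in product(subjects, verbs, objects_)
    ]
    print(len(prompts), "probe prompts, e.g.:", prompts[0])

If the pelican looks suspiciously polished but the other 47 prompts fall apart, that's your answer.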
kennykartman 1/27/2026|||
Sure, I agree! I didn't mean I expected better results because LLMs improved significantly in their visual-spatial reasoning, but simply because I expected more people to be drawing SVGs of pelicans on bikes and more LLMs to be ingesting them. That's what I find a bit surprising.
Sharlin 1/26/2026|||
It could still be special-case RLHF trained, just not up to perfection.
saberience 1/26/2026||||
Because no one cares about optimizing for this because it's a stupid benchmark.

It doesn't mean anything. No frontier lab is trying hard to improve the way its model produces SVG format files.

I would also add, the frontier labs are spending all their post-training time on working on the shit that is actually making them money: i.e. writing code and improving tool calling.

The Pelican on a bicycle thing is funny, yes, but it doesn't really translate into more revenue for AI labs so there's a reason it's not radically improving over time.

obidee2 1/26/2026|||
Why stupid? Vector images are widely used and extremely useful, both directly and for rendering raster images at different scales. It's also highly connected with spatial and geometric reasoning and precision, which would open up a whole new class of problems these models could tackle. Sure, it's secondary to raster image analysis and generation, but I'm curious why it would be stupid to pursue.
simonw 1/26/2026||||
+1 to "it's a stupid benchmark".
esafak 1/26/2026||
You can always suggest a new one ;)
lofaszvanitt 1/26/2026||||
It shows that these are nowhere near anything resembling human intelligence. You wouldn't have to optimize for anything if it were a general intelligence of sorts.
CamperBob2 1/26/2026||
Here's a pencil and paper. Let's see your SVG pelican.
vladms 1/26/2026|||
So you think that if you gave the model a pencil and paper, it would do better?

I don't think SVG is the problem. It just shows that models are fragile (nothing new): even if they can (probably) make a good PNG of a pelican on a bike, and they can (probably) write some good SVG, they don't "transfer" these skills because they don't "understand" them.

I do expect models to fail randomly at tasks that aren't "average and common", so for me personally the benchmark is not very useful (and that doesn't mean they can't work, just that I wouldn't bet on it). If there are people who think "if an LLM outputted an SVG for my request, it means it can output an SVG for every image", there might be some value in it.

zebomon 1/26/2026|||
This exactly. I don't understand the argument, which seems to be: if it were real intelligence, it would never have to learn anything. It's machine learning, not machine magic.
CamperBob2 1/26/2026||
One aspect worth considering is that a human who knows HTML and graphics coding but has never heard of SVG could be expected to perform such a task (eventually) if given a chance to train on SVG from the spec.

Current-gen LLMs might be able to do that with in-context learning, but if limited to pretraining alone, or even pretraining followed by post-training, would one book be enough to impart genuine SVG composition and interpretation skills to the model weights themselves?

My understanding is that the answer would be no, a single copy of the SVG spec would not be anywhere near enough to make the resulting base model any good at SVG authorship. Quite a few other examples and references would be needed in either pretraining, post-training or both.

So one measure of AGI -- necessary but not sufficient on its own -- might be the ability to gain knowledge and skills with no more exposure to training material than a human student would be given. We shouldn't have to feed it terabytes of highly-redundant training material, as we do now, and spend hundreds of GWh to make it stick. Of course that could change by 5 PM today, the way things are going...

storystarling 1/26/2026|||
I suspect there is actually quite a bit of money on the table here. For those of us running print-on-demand workflows, the current raster-to-vector pipeline is incredibly brittle and expensive to maintain. Reliable native SVG generation would solve a massive architectural headache for physical product creation.
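For context, the kind of pipeline I mean is roughly this; a sketch assuming ImageMagick's convert and potrace are on the PATH (the flags shown are the common ones, but check them against your versions):

    # Typical raster->vector step: convert the bitmap to grayscale PGM with
    # ImageMagick, then trace it to SVG with potrace. Every stage has knobs
    # (threshold, speckle size, curve tolerance) that break on the next batch.
    import subprocess

    def raster_to_svg(png_path: str, svg_path: str) -> None:
        pgm_path = png_path.rsplit(".", 1)[0] + ".pgm"
        # Grayscale PGM is one of the bitmap formats potrace accepts as input.
        subprocess.run(["convert", png_path, "-colorspace", "Gray", pgm_path],
                       check=True)
        # -s: SVG backend, -t 10: drop speckles up to 10 pixels (tuning-sensitive).
        subprocess.run(["potrace", "-s", "-t", "10", "-o", svg_path, pgm_path],
                       check=True)

    raster_to_svg("design.png", "design.svg")

A model that emitted clean SVG directly would let you delete all of that, plus the manual cleanup pass that usually follows it.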
derefr 1/26/2026|||
It’d be difficult to use in any automated process, since judging how good one of these renditions is remains very qualitative.

You could try to rasterize the SVG and then use an image2text model to describe it, but I suspect it would just “see through” any flaws in the depiction and describe it as “a pelican on a bicycle” anyway.
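Roughly what I have in mind, as a sketch: cairosvg for rasterization plus an OpenAI-style vision call. The judge model name and the scoring prompt are placeholders of mine, and the failure mode is exactly what you describe -- the judge may charitably "see" a pelican anyway:

    # Rasterize the model's SVG output and ask a vision model to grade it.
    import base64
    import cairosvg
    from openai import OpenAI

    def judge_svg(svg_text: str) -> str:
        png_bytes = cairosvg.svg2png(bytestring=svg_text.encode("utf-8"))
        data_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode()
        resp = OpenAI().chat.completions.create(
            model="gpt-4o-mini",  # placeholder judge model
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Score 0-10 how clearly this image shows a pelican "
                             "riding a bicycle, then justify the score."},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }],
        )
        return resp.choices[0].message.content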

lofaszvanitt 1/26/2026||
A salivating pelican :D.
Alifatisk 1/26/2026||
Can't wait for the benchmark at Artificial Analysis. The Qwen team doesn't seem to have updated the information about this new model yet: https://chat.qwen.ai/settings/model. I tried getting an API key from Alibaba Cloud, but the number of steps after creating an account made me stop; it was too much. It shouldn't be this difficult.

Incredible work anyways!

jokab 1/27/2026|
It seems to me that they don't want our money.
ytrt54e 1/26/2026||
I cannot even open the page; maybe I am blacklisted for asking about Tiananmen Square when their AI first hit the news?
moffkalast 1/26/2026|
Attention citizen! -10000 social credit
gcr 1/26/2026||
Is there an open-source release accompanying this announcement or is this a proprietary model for the time being?
treefry 1/26/2026||
Have they adopted a new strategy of no longer open-sourcing their largest and strongest models?
ilaksh 1/26/2026||
That's not new -- Qwen 3 Max, for example, has been closed.
gunalx 1/26/2026||
new? They have done this a long time.
Mashimo 1/26/2026||
I tried to search and could not find anything. Do they offer subscriptions? Or only pay per token?
esafak 1/26/2026|
I think they don't. I'd wait for the Cerebras release; they have a subscription offering called Cerebras Code for $50/month. https://www.cerebras.ai/pricing
pier25 1/26/2026||
Tried it and it's super slow compared to other LLMs.

I imagine the Alibaba infra is being hammered hard.

ilaksh 1/26/2026|
Well, but it's also deliberately doing a ton of thinking, right?
dajonker 1/26/2026|
These LLM benchmarks are like interviews for software engineers. They get drilled on advanced algorithms for distributed computing and they ace the questions. But then it turns out that the job is to add a button to the user interface, and they use new Tailwind classes instead of reusing the existing ones, so it's just not quite right.