
Posted by vinhnx 1/26/2026

Qwen3-Max-Thinking (qwen.ai)
502 points | 424 comments
roughly 1/26/2026|
One thing I’m becoming curious about with these models is the token counts to achieve these results - things like “better reasoning” and “more tool usage” aren’t “model improvements” in what I think would be understood as the colloquial sense; they’re techniques for using the model more to better steer the model, and they’re closer to “spend more to get more” than “get more for less.” They’re still valuable, but they operate on a different economic tradeoff than the one I think we’re used to talking about in tech.
Sol- 1/26/2026||
I also find the implications of this for AGI interesting. If very compute-intensive reasoning leads to very powerful AI, the world might remain the same for at least a few years even after the breakthrough, because the inference compute simply cannot keep up.

You might want millions of geniuses in a data center, but perhaps you can only afford one and haven't built out enough compute? Might sound ridiculous to the critics of the current data center build-out, but doesn't seem impossible to me.

roughly 1/26/2026||
I've been pretty skeptical of LLMs as the solution to AGI already, mostly just because the limits of what the models seem capable of doing seem to be lower than we were hoping (glibly, I think they're pretty good at replicating what humans do when we're running on autopilot, so they've hit the floor of human cognition, but I don't think they're capable of hitting the ceiling). That said, I think LLMs will be a component of whatever AGI winds up being - there's too much "there" there for them to be a total dead end - but, echoing the commenter below and taking an analogy to the brain, it feels like "many well-trained models, plus some as-yet unknown coordinator process" is likely where we're going to land here - in other words, to take the Kahneman & Tversky framing, I think the LLMs are making a fair pass at "system 1" thinking, but I don't think we know what the "system 2" component is, and without something in that bucket we're not getting to AGI.
marcd35 1/26/2026|||
I'm no expert, and I actually asked Google Gemini a similar question yesterday - "how much more energy is consumed by running every query through Gemini AI versus traditional search?" Turns out that the AI result is actually on par with, if not more efficient (power-wise) than, traditional search. I think it said it's the equivalent power of watching 5 seconds of TV per search.

I also asked perplexity to give a report of the most notable ARXIV papers. This one was at the top of the list -

"The most consequential intellectual development on arXiv is Sara Hooker's "On the Slow Death of Scaling," which systematically dismantles the decade-long consensus that computational scale drives progress. Hooker demonstrates that smaller models—Llama-3 8B and Aya 23 8B—now routinely outperform models with orders of magnitude more parameters, such as Falcon 180B and BLOOM 176B. This inversion suggests that the future of AI development will be determined not by raw compute, but by algorithmic innovations: instruction finetuning, model distillation, chain-of-thought reasoning, preference training, and retrieval-augmented generation. The implications are profound—progress is no longer the exclusive domain of well-capitalized labs, and academia can meaningfully compete again."

roughly 1/26/2026|||
I’m… deeply suspicious of Gemini’s ability to make that assessment.

I do broadly agree that smaller, better tuned models are likely to be the future, if only because the economics of the large models seem somewhat suspect right now, and also the ability to run models on cheaper hardware’s likely to expand their usability and the use cases they can profitably address.

lelandbatey 1/26/2026||||
Some external context on those approximate claims:

- Run a 1500W USA microwave for 10 seconds: 15,000 joules

- Llama 3.1 405B text generation prompts: On average 6,706 joules total, for each response

- Stable Diffusion 3 Medium generating a 1024 x 1024 pixel image w/ 50 diffusion steps: about 4,402 joules

[1] - MIT Technology Review, 2025-05-20 https://www.technologyreview.com/2025/05/20/1116327/ai-energ...
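
A quick back-of-envelope in Python to put those figures on a common scale (a sketch; the joule values are the ones quoted above, the conversion constants are standard):

    J_PER_KWH = 3.6e6     # joules per kilowatt-hour
    MICROWAVE_W = 1500    # the 1500W USA microwave above

    figures = [
        ("10s of 1500W microwave", 15000),
        ("Llama 3.1 405B response", 6706),
        ("SD3 Medium 1024x1024 image", 4402),
    ]
    for label, joules in figures:
        print(f"{label}: {joules} J = "
              f"{joules / MICROWAVE_W:.1f}s of microwave = "
              f"{joules / J_PER_KWH * 1000:.2f} Wh")

So a single big-model text response lands around 2 Wh, i.e. a few seconds of microwave time.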

wongarsu 1/26/2026||
A single Google search in 2009: about 1,000 joules

Couldn't find a more up-to-date number; everyone just keeps repeating that 0.0003 kWh figure from 2009:

https://googleblog.blogspot.com/2009/01/powering-google-sear...
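
To compare with the joule figures above, the unit conversion (standard constant, nothing assumed):

    # 0.0003 kWh per 2009-era search, in joules
    print(0.0003 * 3.6e6)  # -> 1080.0, i.e. the ~1,000 J above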

827a 1/26/2026||||
Conceptually, the training process is like building a massive and highly compressed index of all known results. You can't outright ignore the power usage to build this index, but at the very least once you have it, in theory traversing it could be more efficient than the competing indexes that power Google search. It's a data structure that's perfectly tailored to semantic processing.

Though, once the LLM has to engage a hypothetical "google search" or "web search" tool to supplement its own internal knowledge, I think the efficiency obviously goes out the window. I suspect that Google is doing this every time you engage with Gemini in Search AI Mode.

ainch 1/27/2026||||
It's a good paper by Hooker, but that specific comparison is shoddy. Llama and Aya were both trained by significantly more competent labs, on different datasets than Falcon and BLOOM. The takeaway there is "it doesn't matter if you have loads of parameters if you don't know what you're doing."

If we compare apples to apples, e.g. across Claude models, the larger Opus still happily outperforms its smaller counterparts.

mrandish 1/26/2026|||
> the token counts to achieve these results

I've also been increasingly curious about better metrics to objectively assess relative model progress. In addition to the decreasing ability of standardized benchmarks to identify meaningful differences in the real-world utility of output, it's getting harder to hold input variables constant for apples-to-apples comparison. Knowing which model scores higher on a composite of diverse benchmarks isn't useful without adjusting for GPU usage, energy, speed, cost, etc.

retinaros 1/26/2026|||
Yes. Reasoning has a lot of scammy features. Just look at the number of tokens needed to answer on a benchmark and you will see that some models are just awful.
nielsole 1/26/2026||
Pareto frontier is the term you are looking for
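
For anyone unfamiliar: a model sits on the Pareto frontier if no other model is both cheaper (fewer tokens) and better (higher score). A minimal sketch in Python - the model names and numbers are made up for illustration:

    # (avg tokens per answer, benchmark score) -- illustrative values only
    models = {
        "model-a": (1_200, 61.0),
        "model-b": (9_500, 63.5),
        "model-c": (2_000, 58.0),
        "model-d": (15_000, 62.0),
    }

    def pareto_frontier(entries):
        """Keep entries that no other entry beats on both axes at once."""
        frontier = []
        for name, (tok, score) in entries.items():
            dominated = any(
                t <= tok and s >= score and (t, s) != (tok, score)
                for t, s in entries.values()
            )
            if not dominated:
                frontier.append(name)
        return frontier

    print(pareto_frontier(models))  # ['model-a', 'model-b']

model-c and model-d drop off the frontier: something else beats each of them on tokens and score simultaneously, which is exactly the "lots of tokens, awful answers" failure mode above.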
torginus 1/26/2026||
It just occurred to me that it underperforms Opus 4.5 on benchmarks when search is not enabled, but outperforms it when it is - is it possible that the Chinese internet has better-quality content available?

My problem with deep research tends to be that it searches the internet, and most of the stuff it turns up is the half-baked garbage that gets repeated on every topic.

dsign 1/26/2026||
Hm, interesting. I use Kagi assistant with search (by Kagi), and it has a search filter that allows the model to search only academic articles. So far it has not disappointed. Of course the cynic in me thinks it's only a matter of time before there's so much AI-generated garbage even in academic articles that it will eventually become worthless. But when that turns into a serious problem, we will find some sort of solution (probably one involving tons of roller ball pens and in-person meaty handshakes).
Aurornis 1/27/2026|||
> is it possible that the Chinese internet has better-quality content available?

That’s a huge leap of logic.

The simpler explanation is that it has better searching functionality and performance.

The models are multi-lingual and can parse results from global websites just fine.

torginus 1/27/2026||
Yes, I'm not familiar with the Chinese internet; however, I've found that on expert topics, textbooks far outperform most internet content, with the sole exception of Wikipedia, which also sometimes has almost professional/academic-quality material on some topics.

I think the existence of Wikipedia is a red herring; there's no historical inevitability that people will band together to curate a high-quality encyclopedia on every imaginable topic.

There might be similar, even broader/better efforts on the Chinese internet we (I) know nothing about.

It also might be that Chinese search engines are better than Google at finding high quality data.

But I reiterate - these search-based LLMs kinda suck in the West, because Google kinda sucks. Every use of deep research usually ended up with the model citing the same crap articles and data you could find on Google manually, but whereas I could tell the data was no good, the AI took it at face value.

exe34 1/26/2026||
maybe they don't have Reddit?
fragmede 1/26/2026||
They have http://v2ex.com though.
Aqua0 1/30/2026||
Unsurprising site. https://tieba.baidu.com/ could be of the same scale as Reddit.
isusmelj 1/26/2026||
I just wanted to check whether there is any information about the pricing. Is it the same as Qwen Max? Also, I noticed on the pricing page of Alibaba Cloud that the models are significantly cheaper within mainland China. Does anyone know why? https://www.alibabacloud.com/help/en/model-studio/models?spm...
QianXuesen 1/26/2026||
There’s a domestic AI price war in China, plus pricing in mainland China benefits from lower cost structures and very substantial government support, e.g. local compute-power vouchers and subsidies designed to make AI infrastructure cheaper for domestic businesses and drive widespread adoption. https://www.notebookcheck.net/China-expands-AI-subsidies-wit...
chrishare 1/27/2026||
All of this is true, and credit assignment is hard, but the brutal competition between Chinese firms, especially in manufacturing, differentiates them from, and advances them over, economies in the West. It makes investment hard as profits are competed away (blasphemy in Thiel's worldview), but it is excellent for consumers, both local and global.
specialist 1/27/2026||
Yes and: Good for the nations underwriting all that domestic competition. Playbook followed by Japan, South Korea, etc, and most recently China.
epolanski 1/26/2026|||
I guess they want to partially subsidize local developers?

Maybe that's a requirement from whoever funds them, probably public money.

segmondy 1/26/2026||
Seriously? Does Netflix or Spotify cost the same everywhere around the world? They earn less and their buying power is less.
vineyardmike 1/26/2026|||
The costs of Netflix and Spotify are licensing. Offering the subscription at half price to additional users is non-cannibalizing and a way to get more revenue from the same content.

The cost of LLMs is the infrastructure. Unless someone can buy/power/run compute cheaper (Google w/ TPUs, locales with cheap electricity, etc.), there won't be a meaningful difference in costs.
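
A rough sketch of that cost structure in Python; every number below is an assumption for illustration, not a real price sheet:

    gpu_hour_usd = 2.50             # assumed GPU-hour rental price
    tokens_per_sec_per_stream = 50  # assumed decode speed per request
    batch_size = 32                 # concurrent requests sharing the GPU

    tokens_per_gpu_hour = tokens_per_sec_per_stream * batch_size * 3600
    usd_per_million_tokens = gpu_hour_usd / tokens_per_gpu_hour * 1e6
    print(f"~${usd_per_million_tokens:.2f} per million tokens")  # ~$0.43

The only levers in that arithmetic are cheaper GPU-hours and more tokens per GPU-hour, which is what the reply below is about.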

storystarling 1/27/2026||
That assumes inference efficiency is static, which isn't really the case. Between aggressive quantization, speculative decoding, and better batching strategies, the cost per token can vary wildly on the exact same hardware. I suspect the margins right now come from architecture choices as much as raw power costs.
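
To make the speculative-decoding lever concrete, a toy simulation (draft length and acceptance rate are assumed, purely illustrative):

    import random
    random.seed(0)

    k = 4           # draft tokens proposed per verification pass
    p_accept = 0.7  # assumed chance a draft token matches the big model

    def tokens_per_expensive_pass():
        accepted = 0
        for _ in range(k):
            if random.random() < p_accept:
                accepted += 1
            else:
                break
        return accepted + 1  # a rejected pass still yields one real token

    n = 100_000
    avg = sum(tokens_per_expensive_pass() for _ in range(n)) / n
    print(f"~{avg:.2f} tokens per big-model pass, instead of 1")  # ~2.77

Same weights, same hardware, yet (ignoring the draft model's own cost) nearly 3x fewer big-model forward passes per generated token.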
epolanski 1/26/2026|||
Sure, and so do professional tools like Microsoft Teams, or compute, in different parts of the world.
KlayLay 1/26/2026|||
It could be that energy is a lot cheaper in China, but it could be other reasons, too.
yomansat 1/26/2026||
Slightly off-topic: "surveillance pricing" is a term being used more often, whereby even hotel room prices vary based on where you're booking from, what terms you searched for, etc.

Here's a short video on the subject:

https://youtube.com/shorts/vfIqzUrk40k?si=JQsFBtyKTQz5mYYC

syntaxing 1/26/2026||
Hacker News strongly believes Opus 4.5 is the de facto standard and that China has consistently been 8+ months behind. Curious how this performs. It’ll be a big inflection point if it performs as well as its benchmarks.
Flavius 1/26/2026||
Based on their own published benchmarks, it appears that this model is at least 6 months behind.
spwa4 1/26/2026||
Strange how things evolve. When ChatGPT started, it had about a 2-year head start over Google's best proprietary model, and more than 2 years over open source models.

Now they have to be lucky to be 6 months ahead of an open model with at most half the parameter count, trained on 1%-2% of the hardware US models are trained on.

rglullis 1/26/2026|||
And more than that, the need for people/businesses to pay a premium for SOTA is getting smaller and smaller.

I thought that OpenAI was doomed the moment Zuckerberg showed he was serious about commoditizing LLMs. Even if Llama wasn't the GPT killer, it showed that there was no secret formula and that OpenAI had no moat.

NitpickLawyer 1/26/2026||
> that OpenAI had no moat.

Eh. It's at least debatable. There is a moat in compute (this was openly stated at a recent meeting of AI tech CEOs in China). And a bit of a moat in architecture and know-how (oAI's gpt-oss is still best in class, and if rumours are to be believed, it was mostly trained on synthetic data, a la phi4 but with better data). And there are still moats around data (see the Gemini family, especially Gemini 3).

But if you can conjure up compute, data and basic arch, you get xAI which is up there with the other 3 labs in SotA-like performance. So I'd say there are some moats, but they aren't as safe as they'd thought they'd be in 2023, for sure.

rbtprograms 1/26/2026||||
It seems they believed that superior models would be the moat, but when DeepSeek essentially replicated o1 they switched to the ecosystem as the moat.
DeathArrow 1/27/2026|||
>Now they have to be lucky to be 6 months ahead of an open model with at most half the parameter count, trained on 1%-2% of the hardware US models are trained on.

Maybe there's a limit to training, and throwing more hardware at it yields very little improvement?

oersted 1/26/2026||
In my experience GPT-5.2 with extra-high thinking is consistently a bit better and significantly cheaper (even when I use the Fast version which is 2x the price in Cursor).

The HN obsession with Claude Code might be a bit biased by people trying to justify their expensive subscriptions to themselves.

However, Opus 4.5 is much faster and very high quality too, and that ends up mattering more in practice. I end up using it much more and paying a dear but worthwhile price for it.

PS: Despite what the benchmarks say, I find Gemini 3 Pro and Flash to be a step below Claude and GPT, although still great compared to the state of the art last year, and very fast and cheap. Gemini also seems to have a less AI-sounding writing style.

I am aware this is all quite vague and anecdotal, just my two cents.

I do think these kinds of opinions are valuable. Benchmarks are a useful reference, but they do give the illusion of certainty to something that is fundamentally much harder to measure and quite subjective.

manmal 1/26/2026|||
Better, yes, but cheaper - only when looking at API costs I guess? Who in their right mind uses the API instead of the subsidized plans? There, Opus is way cheaper in terms of subsidized tokens.
sandos 1/27/2026||||
I've been using GPT-5.1, 5.1-codex, 5.1-codex-max, and GPT-5.2 the last few weeks. Then I got tipped off about Opus, and that it was supposed to be awesome. The problem is I can clearly see the old patterns of "Oooh, I found the issue!" in the middle of the stream, long before it has found the real issue I was asking about, and the results are not very good. The GPT family, to me, is better.

I was especially impressed by 5.1-codex-max for a webapp, but that is of course where these models in general shine. Still, it was freakish: I'd never before had 15-20 iterations (with hundreds of lines added each time) where I did not have to correct anything.

anonzzzies 1/26/2026||||
You are using Opus via API? $200/mo is nothing for what I get for it, so I'm not sure how it is considered expensive. I guess it is how you use it; I hit the limits every day. Using the API, I would indeed be paying through the nose, but why would anyone?
keyle 1/26/2026|||
My experience exactly.
boutell 1/27/2026||
The most important benchmark:

https://boutell.dev/misc/qwen3-max-pelican.svg

I used Simon Willison's usual prompt.

It thought for over 2 minutes (free account). The commentary was even more glowing than the image.

It has a certain charm.

siliconc0w 1/26/2026||
I don't see a Hugging Face link; is Qwen no longer releasing their models?
dust42 1/26/2026||
Max was always closed.
behnamoh 1/26/2026||
So the only way to run it is by using Qwen's API? No thanks. At least with Kimi and GLM, I can use Fireworks/whatever to avoid sending data to China.
cmrdporcupine 1/26/2026||
When I looked earlier, Qwen claimed to have DCs in Singapore and (I think?) the US, but now I can't seem to find where I saw that.

Whether that means anything, I dunno.

tosh 1/26/2026||
afaiu not all of their models are open-weight releases; this one, so far, is not open weight (?)
sidchilling 1/26/2026||
What would be a good coding model to run on an M3 Pro (18GB) to get a Codex-like workflow and quality? Essentially, I am running out of quota quickly when using Codex-High in VSCode on the $20 ChatGPT plan and am looking for cheaper/free alternatives (even if a little slower, but same quality). Any pointers?
duffyjp 1/26/2026|||
Nothing. This summer I set up a dual 16GB GPU / 64GB RAM system and nothing I could run was even remotely close. Big models that didn't fit in 32GB of VRAM had marginally better results but were at least an order of magnitude slower than what you'd pay for, and still much worse in quality.

I gave one of the GPUs to my kid to play games on.

Tostino 1/26/2026||
Yup, even with 2x 24gb GPUs, it's impossible to get anywhere close to the big models in terms of quality and speed, for a fraction of the cost.
mirekrusin 1/26/2026||
I'm running unsloth/GLM-4.7-Flash-GGUF:UD-Q8_K_XL via llama.cpp on 2x 24G 4090s which fits perfectly with 198k context at 120 tokens/s – the model itself is really good.
fsiefken 1/26/2026||
I can confirm; running glm-4.7-flash-7e-qx54g-hi-mlx here, a 22GB model @ q5, on an M4 Max MacBook Pro at 59 tokens/s.
medvezhenok 1/26/2026||||
Short answer: there is none. You can't get frontier-level performance from any open source model, much less one that would work on an M3 Pro.

If you had more like 200GB of RAM you might be able to run something like MiniMax M2.1 to get last-gen performance at something resembling usable speed - but it's still a far cry from Codex on high.

mittermayr 1/26/2026||||
At the moment, I think the best you can do is qwen3-coder:30b - it works, and it's nice to get some fully-local LLM coding up and running, but you'll quickly realize that you've long tasted the sweet forbidden nectar that is hosted LLMs. Unfortunately.
evilduck 1/26/2026||||
They are spending hundreds of billions of dollars on data centers filled with GPUs that each cost more than an average car, and then months on training models, to serve your current $20/mo plan. Do you legitimately think there's a cheaper or free alternative of the same quality?

I guess you could technically run the huge leading open-weight models using large disks as RAM and get close to the "same quality", but at "heat death of the universe" speeds.
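
To put a number on those speeds, a back-of-envelope sketch; every value is an assumption for illustration:

    active_params = 37e9   # assumed active params per token for a large MoE
    bytes_per_param = 1    # 8-bit quantized weights
    nvme_bps = 5e9         # assumed NVMe sequential read, bytes/sec

    s_per_token = active_params * bytes_per_param / nvme_bps
    print(f"~{s_per_token:.1f} s/token, ~{3600 / s_per_token:.0f} tokens/hour")

Streaming the active weights from disk for every token lands around 7 seconds per token under these assumptions - caching helps, but you stay orders of magnitude below a GPU-resident model.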

tosh 1/26/2026||||
18GB of RAM is a bit tight

with 32GB of RAM:

qwen3-coder and GLM 4.7 Flash are both impressive 30B-parameter models

not on the level of GPT-5.2 Codex, but small enough to run locally (w/ 32GB RAM, 4-bit quantized) and quite capable

but I think it is just a matter of time until we get quite capable coding models that can run with less RAM

adam_patarino 1/27/2026||
ahem ... cortex.build

Current test version runs in 8GB @ 60 tok/s. Lmk if you want to join our early tester group!

jgoodhcg 1/26/2026||||
Z.ai has GLM-4.7. It's almost as good, for about $8/mo.
margorczynski 1/26/2026||
Not sure if it's just me, but at least for my use cases (software development, small-to-medium projects), Claude Opus + Claude Code beats OpenCode + GLM 4.7 by quite a margin. At least for me, Claude "gets it" eventually, while GLM will get stuck in a loop, not understanding what the problem is or what I expect.
zamalek 1/26/2026||
Right, GLM is close, but not close enough. If I have to spend $200 for the Opus fallback, I may as well just use Opus all the time. Still an unbelievable option if $200 is a luxury; the price-per-quality is absurd.
Mashimo 1/26/2026||||
A local model with 18GB of RAM that has the same quality as Codex high? Yeah, nah mate.

The best might be GLM 4.7 Flash, and I doubt it's close to what you want.

atwrk 1/26/2026||||
"run" as in run locally? There's not much you can do with that little RAM.

If remote models are OK, you could have a look at MiniMax M2.1 (minimax.io), GLM from z.ai, or Qwen3 Coder. You should be able to use all of these with your local OpenAI app.

marcd35 1/26/2026|||
Antigravity is solid and has a generous free tier.
ezekiel68 1/26/2026||
Last autumn I tried Qwen3-coder via CLI agents like Trae to help add significant advanced features to a Rust codebase. It consistently outperformed (at the time) Gemini 2.5 Pro and Claude Opus 3.5 with its ability to generate and refactor code such that the system stayed coherent and improved in performance and efficiency (this included adding Linux shared-memory IPC calls and using x86_64 SIMD intrinsics in Rust).

I was very impressed, but I racked up a big bill (for me, in the hundreds of dollars per month) because I insisted on using the Alibaba provider to get the highest context window size and token cache.

mohsen1 1/26/2026||
Is this available on OpenRouter yet? I want it to go head-to-head against Gemini 3 Flash, which is the king of playing Mafia so far

https://mafia-arena.com

ilaksh 1/26/2026||
I don't think so. Just checked like five minutes ago. Probably before tomorrow though.
culi 1/26/2026||
See also

* https://lmarena.ai/leaderboard — crowd-sourced head-to-head battles between models using ELO

* https://dashboard.safe.ai/ — CAIS' incredible dashboard (cited in OP)

* https://clocks.brianmoore.com/ — a visual comparison of how well models can draw a clock. A new clock is drawn every minute

* https://eqbench.com/ — emotional intelligence benchmarks for LLMs

* https://www.ocrarena.ai/battle — OCR battles, ELO

arendtio 1/26/2026||
> By scaling up model parameters and leveraging substantial computational resources

So, how large is that new model?

marcd35 1/26/2026||
While Qwen2.5 was pre-trained on 18 trillion tokens, Qwen3 uses nearly twice that amount, with approximately 36 trillion tokens covering 119 languages and dialects.

https://qwen.ai/blog?id=qwen3

arendtio 1/26/2026||
Thanks for the info, but I don't think it answers the question. I mean, you could train a 20-node network on 36 trillion tokens. It wouldn't make much sense, but you could. So I was asking more about the number of nodes/parameters, or the file size in GB.

In addition, there seem to be many different versions of Qwen3. E.g. here the list from ollama library: https://ollama.com/library/qwen3/tags

gunalx 1/26/2026||
This is the Max series of models, with unreleased weights, so it's probably larger than the largest released one. Also, when referring to models, use Hugging Face or ModelScope (wherever the model is published); Ollama is a really poor source of model info. They have some bad naming (like confusing people on the DeepSeek R1 models), renaming, and more on model names, and they default to q4 quants, which is a good sweet spot but really degrades performance compared to the raw weights.
throwaw12 1/26/2026|
Aghhh, in my earlier comments I wished they'd release a model which outperforms Opus 4.5 in agentic coding; it seems I should wait more. But I am hopeful.
wyldfire 1/26/2026||
By the time they release something that outperforms Opus 4.5, Opus 5.2 will have been released which will probably be the new state-of-the-art.

But these open weight models are tremendously valuable contributions regardless.

wqaatwt 1/26/2026||
Qwen 3 Max wasn’t originally open - or did they release it?
frankc 1/26/2026|||
One of the ways the Chinese companies are keeping up is by training their models on the outputs of the American frontier models. I'm not saying they don't innovate in other ways, but this is part of how they caught up quickly. However, it pretty much means they are always going to lag.
Onavo 1/26/2026|||
Does the model collapse proof still hold water these days?
CuriouslyC 1/26/2026||||
Not true, for one very simple reason: AI model capabilities are spiky. Chinese models can SFT off American frontier outputs and use them for LLM-as-judge RL, as you note, but if they then choose to RL on a different capability than Western labs, they'll be better at that thing (while being worse at the things they don't RL on).
aurareturn 1/26/2026||||
They are. There is no way to lead unless China has access to as much compute power.
jyscao 1/26/2026||
They likely will lead in compute power in the medium term future, since they’re definitely the country with the highest energy generation capacity at this point. Now they just need to catch up on the hardware front, which I believe they’ve also made significant progress on over the last few years.
anonzzzies 1/26/2026||
What is the progress on that front? People here on HN usually say China is very far from competitive in the CPU/GPU space; I cannot really find objective sources I can read; it is either China saying it is coming, or the West saying it's 10+ years behind.
MaxPock 1/27/2026|||
If that's how it is done, we'd have very many models from all manner of countries. I mean, how difficult is distillation for India, Japan, and the EU?
WarmWash 1/26/2026|||
The Chinese just distill western SOTA models to level up their models, because they are badly compute constrained.

If you were pulling someone much weaker than you behind yourself in a race, they would be right on your heels, but also not really a threat. Unless they can figure out a more efficient way to run before you do.

esafak 1/26/2026||
But it is a threat when the performance difference is not worth the cost in the customers' eyes.
OGEnthusiast 1/26/2026|||
Check out the GLM models, they are excellent
khimaros 1/26/2026||
MiniMax M2.1 rivals GLM 4.7 and fits in 128GB with 100k context at 3-bit quantization.
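
The footprint math behind "fits in 128GB at 3-bit", as a sketch (the parameter count is an assumption for illustration):

    total_params = 230e9   # assumed total parameter count
    bits = 3               # 3-bit quantization
    weights_gb = total_params * bits / 8 / 1e9
    print(f"weights: ~{weights_gb:.0f} GB")  # ~86 GB

At 3 bits per weight the weights alone come in well under 128GB; the remainder goes to the KV cache for that 100k context.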
auspiv 1/26/2026|||
There have been a couple of "studies" comparing various frontier-tier AIs that have led to the conclusion that Chinese models are somewhere around 7-9 months behind US models. Another comment says that Opus will be at 5.2 by the time Qwen matches Opus 4.5. That's accurate, and there is some data to show by how much.
lofaszvanitt 1/26/2026||
Like these benchmarks mean anything.