Posted by ritzaco 14 hours ago

GLM 5.2 vs. Opus(techstackups.com)
423 points | 287 comments | page 3
pietz 12 hours ago|
GLM 5.2 has one big issue that will limit its meaningful success and that's the value of their coding subscription.

Yes, in terms of API pricing, GLM 5.2 outperforms the competition. But the only people that use API billing for their coding work are large corporations, where these highly subsidized subscriptions are being phased out.

At the same time, none of these companies will use a Chinese API for their employees.

For individuals and smaller teams, Z.ai's coding subscription is outperformed by Anthropic and OpenAI. You probably get around the same usage with Claude, but Codex definitely offers more usage for the amount you pay.

We can debate how much Z.ai has closed the gap to GPT5.5 and Opus 4.8, but if I could freely choose between them in a world where they all cost the same, I simply wouldn't choose GLM.

So the important question becomes: How good will the offering from Z.ai get with GLM 5.3 or 6, and how much will OpenAI and Anthropic cripple their current offerings in the near future?

Certhas 12 hours ago||
My impression is that individual subscriptions are the loss leading hook. The money is made on Enterprise token contracts.

Employees and students used to coding with thousands of dollars worth of tokens (on a 20/100 dollar plan) will push enterprise to spend.

Having a Chinese model that is competitive won't displace this enterprise spend. But an open model hosted in the US/EU might.

The existence of GLM 5.2 puts a ceiling on how much OpenAI/Anthropic can charge for API Access.

LUmBULtERA 10 hours ago|||
> My impression is that individual subscriptions are the loss leading hook

Except there is no evidence of this at all, just people comparing API and subscription pricing. The leaked financial info for OpenAI shows inference is profitable right now, though it does not show a distinction between subscription and API revenue... but if subscription revenue were so lossy, it would be hard for total inference to still be profitable.

CuriouslyC 10 hours ago|||
Anthropic has indicated in the past that API gross margins are ~60%. This might have improved since then, though competition from OAI puts a ceiling on that.
LUmBULtERA 10 hours ago||
Subscription inference can also be cheaper than the cost of API inference if the provider wants it to -- providers can do flexible scheduling for subscription inference for example, around API inference, to lower its cost and get better utilization of the hardware.
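A minimal sketch of the scheduling idea above, assuming a provider simply serves API requests first and fills the remaining batch capacity with subscription requests. The class, tier names, and request IDs are hypothetical and not taken from any provider's actual system.

```python
import heapq
from itertools import count

# Assumed policy: a lower tier number is served first.
API, SUBSCRIPTION = 0, 1


class InferenceQueue:
    """Toy two-tier queue: API traffic first, subscription traffic fills the rest."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker keeps FIFO order within a tier

    def submit(self, tier, request_id):
        heapq.heappush(self._heap, (tier, next(self._order), request_id))

    def next_batch(self, capacity):
        """Pop up to `capacity` requests, highest-priority tier first."""
        batch = []
        while self._heap and len(batch) < capacity:
            _, _, request_id = heapq.heappop(self._heap)
            batch.append(request_id)
        return batch


queue = InferenceQueue()
queue.submit(SUBSCRIPTION, "sub-1")
queue.submit(API, "api-1")
queue.submit(SUBSCRIPTION, "sub-2")
queue.submit(API, "api-2")

first = queue.next_batch(capacity=3)   # API requests, then one subscription request
second = queue.next_batch(capacity=3)  # the remaining subscription request
print(first, second)
```

In this toy version, subscription requests wait whenever API traffic is queued, which is one way the hardware could stay busy without adding capacity for subscribers.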
Certhas 10 hours ago|||
I did clearly say "my impression is". And you have no evidence to the contrary. We don't even reliably know how many subscribers vs. enterprise customers they have. And the OpenAI leak doesn't even cleanly say that inference is profitable from what I can tell... The better evidence that it probably is profitable is the prices charged by open weight model providers.
LUmBULtERA 10 hours ago||
Fair enough, there is not strong specific evidence to the contrary except about overall inference being profitable for OpenAI (as well as the open weight model providers hosted throughout the world).
fbnszb 9 hours ago||||
> The existence of GLM 5.2 puts a ceiling on how much OpenAI/Anthropic can charge for API Access.

I believe this is the reason why we can even have this debate. Without this kind of competition we would not have these subsidies.

pietz 10 hours ago|||
To be clear, I agree with this and they have my unlimited support pushing for relevance of open source models. GLM 5.2 is amazing and I couldn't be more excited.

I just think that as of today, most people will not find a good reason to switch to GLM.

twobitshifter 10 hours ago|||
Taking a view from outside the USA, European companies just had Fable taken away due to US export controls, and before that Anthropic announced it is holding their data for 30 days. There is immediate value to these firms to build their infrastructure around an AI that won’t be pulled away from them. And outside of Europe, other countries are more price sensitive and don’t have the same fear of building relationships with Chinese companies.
WarmWash 7 hours ago|||
There is no such thing as a relationship with "chinese companies". In China there is just the State, and that is it.

If the world needs any more evidence of Europe's short-sightedness, it would be them running to China to spite the US (instead of creating fertile grounds for their own tech).

metobehonest 6 hours ago||
No one is running to China to "spite the US". Recent geopolitical developments have shown the US to be a violent, unpredictable and unreliable partner.
SubiculumCode 8 hours ago|||
And you have that guarantee from Xi?
bornfreddy 5 hours ago||
With open weights? Yes. It might hallucinate a backdoor somewhere (not that you can trust any model about that), but it will still work.
edg5000 10 hours ago|||
This is an important point. I suspect API pricing will eventually disappear just like how paying for an MMS disappeared. It's an antiquated model. The bulk of the work is being done on "coding plans" is my wild guess.

It's annoying that the plans are so restrictive beyond usage limits. Understandable maybe, but annoying. In practice, only Anthropic (and maybe Google) are really restrictive though. They really scared me away with their policy of charging API rates after the fact if they consider your usage not TOS-aligned. This might be an ungrounded fear that I have, but I feel this is something they'd do so they scared me away.

HarHarVeryFunny 9 hours ago|||
> But the only people that use API billing for their coding work are large corporations

As well as people using 3rd party harnesses like OpenCode.

> At the same time, none of these companies will use a Chinese API for their employees

So who is Amazon Bedrock (which serves GLM) targeting?

Individuals are presumably going with one of the cheaper US providers such as DeepInfra ($0.18/M cached input for GLM vs $0.50 for Opus) or Fireworks AI.
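As a rough illustration of what those per-token prices mean in practice, here is a small sketch using only the two cached-input prices quoted above. The monthly token volume is a made-up assumption, and uncached input and output tokens are not priced because the comment does not give those figures.

```python
# Prices per million cached input tokens, as quoted in the comment above.
PRICE_PER_M_CACHED_INPUT = {"GLM (DeepInfra)": 0.18, "Opus": 0.50}


def cached_input_cost(tokens: int, price_per_million: float) -> float:
    """Return the dollar cost of `tokens` cached input tokens."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical workload (assumption): 200M cached input tokens per month.
MONTHLY_CACHED_TOKENS = 200_000_000

costs = {
    name: cached_input_cost(MONTHLY_CACHED_TOKENS, price)
    for name, price in PRICE_PER_M_CACHED_INPUT.items()
}
for name, cost in costs.items():
    print(f"{name}: ${cost:.2f}")
```

For this assumed volume the cached-input bill comes to about $36 for GLM versus $100 for Opus.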

veber-alex 8 hours ago|||
The value of these models is that you can run them on your own hardware.

A company can buy an NVIDIA B300 and serve its developers in-house with unlimited tokens.

tw1984 8 hours ago|||
> At the same time, none of these companies will use a Chinese API for their employees.

nice try but you intentionally ignored the entire Chinese market & Chinese big corporates. there are 130 Chinese companies in the fortune 500 list, with an average revenue of 80 billion USD each. do you think they are going to sign up for Claude, Codex or GLM? now consider South East Asia, Africa, Middle East, Middle Asia and South America, tell me why their large corporates won't be using GLM API billings?

your western centric view of the world is totally out of date, like it or not, 2026 is vastly different from 1996, the US no longer controls high tech whatsoever.

tpm 10 hours ago|||
Also, I was testing out GLM 5.2 using Openrouter because that's where I've got an account with some money. When I then wanted to perhaps subscribe for a better deal at z.ai, their infra was clearly overloaded, to the point that 5.2 was timing out on 100% of chat requests. So perhaps I will try later, when the infrastructure catches up with the model capability. Only then can I make sure their subscription is worth it.
jauntywundrkind 10 hours ago||
I'm on glm pro subscription and I get so so so much more usage than Claude or Codex! I hammer on glm all day. It's a more expensive plan, but I would need a much much much bigger plan for codex or Claude to do what I do.
mellosouls 5 hours ago||
> GLM-5.2 cost a fraction as much. Opus finished in half the time and shipped a cleaner game

This implies Opus was potentially much (?) better value.

GLM cost a quarter but Opus was twice as fast. So we are already at GLM actually costing half when you compare on time, without even considering the extra effort and time it would take to get Opus-par results.

It's good to have cheaper options and very impressive to see the Chinese continue to set open standards in this field, but the article is maybe a little over-generous.
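The back-of-the-envelope comparison above can be written out as follows. The two ratios come from the comment; multiplying cost by time is one assumed way to weigh them, and the dollar and hour figures in the break-even example are hypothetical.

```python
# Ratios from the comment: GLM relative to Opus.
glm_cost_ratio = 0.25  # "GLM cost a quarter"
glm_time_ratio = 2.0   # "Opus was twice as fast"

# Assumption: weigh cost and wall-clock time equally by multiplying them.
glm_cost_time_product = glm_cost_ratio * glm_time_ratio
print(glm_cost_time_product)  # 0.5, i.e. "costing half when you compare on time"


def break_even_hourly_rate(cheap_cost, cheap_hours, fast_cost, fast_hours):
    """Value of one hour at which the cheaper, slower option stops being cheaper overall."""
    return (fast_cost - cheap_cost) / (cheap_hours - fast_hours)


# Hypothetical figures: GLM at $1 over 1 hour, Opus at $4 over 0.5 hours.
rate = break_even_hourly_rate(cheap_cost=1.0, cheap_hours=1.0, fast_cost=4.0, fast_hours=0.5)
print(rate)  # 6.0 dollars per hour
```

Under these assumed numbers, the faster model is the better deal once an hour of waiting is worth more than $6. If waiting costs nothing, the cheaper model wins.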

InsideOutSanta 4 hours ago|
For me, time doesn't matter for LLMs. I can start a bunch of tasks, and I'll review the PRs when they're done. Faster is nicer, but if the task gets done correctly, I'm good.
mellosouls 1 hour ago||
Me too, I just think the comparison was a bit simplistic, at least in the expression of it.
jkwang 13 hours ago||
GLM-5.2 is quietly becoming the most interesting open model release this year. The coding benchmarks are surprisingly close to frontier models at a fraction of the inference cost.
em500 13 hours ago||
We've had the great small Qwen 3.6 early April that many could actually run on their laptop. Then similar from Google a few weeks later (Gemma4, better in prose, worse in code). Then the super cheap large Deepseek V4 a few weeks later. Then antirez DS4 build that made that actually runnable on MacBooks and Mac Studios. And now the "near-frontier / near-Opus" GLM 5.2.

For people who follow open LLMs, none of these were quiet and all were the most interesting open model release for a few days/weeks. In one or two months, it will be some other model again. Now I do appreciate the real rapid improvements in open models. But there's also a ton of hype and fast-fashion around all of this.

CuriouslyC 10 hours ago||
The difference here is that those small models are impressive, but not super useful. Deepseek 4 is impressively cheap for the intelligence, but not reliable enough to daily drive unless your time has low value.

GLM passes a meaningful threshold of reliability/utility that puts it in a different category for real work. Just like Opus really took off after passing a threshold with 4.5. It's the first open model to do that.

kgeist 39 minutes ago|||
Qwen3.6-27b is surprisingly good for tasks that need modifying an existing repo by analogy with the existing code. For example, you have an existing CRUD app and want to add a new domain model and expose it via the API. Qwen3.6 analyzes how things are done in the project and usually makes it work flawlessly in one shot, and the code is more or less what you expected. Qwen3.6 only struggles with non-trivial code or when you bootstrap a project from scratch (due to the lack of world knowledge, it's a small model after all). But how often do you write non-trivial code or projects from scratch?

I once gave Sonnet 4.6 and Qwen 3.6 the same real-world task to compare: "extend the existing code with this new requirement". Qwen3.6-27b perfectly followed the existing conventions, while Sonnet 4.6 invented its own conventions that were rejected during CR by another dev (i.e. he basically chose Qwen3.6's output in a blind test). Qwen3.6-27b, run locally, also managed to finish faster on that task (mostly because Sonnet 4.6 made tool calling errors and removed some code by accident, so it spent additional time reverting its errors, and got somewhat confused in the process).

We already have production code running live that was written entirely by Qwen3.6-27b. Although, we plan to move to self-hosting GLM5.2 because it's more versatile.

hnfong 8 hours ago|||
Qwen models are super useful for those running local.

And there are valid reasons to run local, even if performance (quality and speed) aren't best.

epolanski 13 hours ago||
To me DS 4 is still the most interesting due to much lower costs. Also DS 4 training isn't done yet.

From my Opus vs DS 4 Pro personal benchmarks, 16 different real-life work tasks, DS 4 has performed as well as Opus 4.8 high overall, but with a few drawbacks:

- on the 16 tasks, one needed several prompts to be steered back into the topic

- its review capabilities seem much worse

- DS4 had the cleanly better solution in 3 cases out of 16, with Opus "only" doing cleanly better 2 times out of 16. But still, I want to emphasize, it's the worst-case scenarios that imho matter the most, not the best ones, and on that front Opus outperformed.

That being said, I spent less than $2 on the API over 4 days of work, which is more or less what I would've spent with Anthropic APIs on less than one task.

greyman 13 hours ago||
>On output tokens, GLM-5.2 is less than a fifth the price of Opus.

Opus is the most expensive model under pay-as-you-go pricing, but IMO a fair comparison should include subscription prices as well. For example, when one has $100 Claude Max and uses it up through the month, it might not be more expensive than GLM, or at least not 5x.

Aozora7 13 hours ago||
There is, for example, OpenCode Go subscription, which for $10 a month gives you a decently generous quota of GLM-5.2, among other models.

And z.ai themselves also have subscriptions.

sourcecodeplz 12 hours ago||
to be exact, it gives you USD 60 of usage of open models.
KronisLV 12 hours ago|||
> For example when one has $100 Claude Max and use it up through the month, it might not be more expensive than GLM, or at least not 5x.

https://z.ai/subscribe

I’m currently trying to figure out whether a downgrade from Max 5x to Pro in combination with one of those would save me money and if so, how much.

Edit: seems like Anthropic Pro + GLM Pro (Yearly) would let me almost halve my costs of Anthropic Max 5x. Only concerns are about GLM 5.2 not having vision support and also being kinda slower and also not being as good as Opus.

CuriouslyC 10 hours ago||
I'm considering shifting to the OpenAI $20 plan + GLM. OAI has the best computer use, vision support and the best programming intelligence of any model short of Mythos/Fable, and the quota is a lot more generous than the Anthropic $20 plan.
jameswhitford 13 hours ago|||
Yes, this is true. This test was run on a $20 Pro Claude subscription. I would definitely love to try using both models on the highest plans for a whole month and compare the two. Great format for a future head-to-head comparison.
buster 13 hours ago|||
Is it fair when the one is heavily subsidized and the other one is not?

I think it's most fair to compare the plain token pricing that is used by everyone.

fooster 7 minutes ago|||
I don't think it is fair to say that Opus or GPT 5.5 are subsidized. Inference for both Anthropic and OpenAI is very profitable.
esperent 13 hours ago||||
> Is it fair when the one is heavily subsidized

As a consumer, yes, it's totally fair. All that matters to me is the price I pay at the pump, not whether that price is "real" or not.

usef- 13 hours ago|||
Z.ai is also believed to be "subsidised". Its parent company is running at a large loss right now.

Anthropic have claimed they expect their first profitable quarter this year -- they may have bigger margins on their raw API than you realise.

stavros 10 hours ago||
We're all sure they have big margins in their raw API, it's the subscription we're claiming is subsidised.
usef- 9 hours ago||
Oh I know. But people often point to the API usage cost as an indicator of the magnitude of subsidisation, or to say that the big labs are far less efficient than cheap competitors.

I'm saying that this is not necessarily the case. They do a lot of optimisation and don't have the same price pressure to lower margins. They may not be losing as much on subscriptions as people think.

stavros 9 hours ago||
Oh hm, I've never seen this. API prices have always been exorbitant, I'm sure they're making good margins on that. Let's hope they aren't losing as much on subscriptions, because I'm not ready for everything to be API costs.
lithiumii 13 hours ago||
GLM has subscription plans too.
fooster 8 minutes ago|||
There are lots of subscription plans with access to GLM 5.2.
linzhangrun 13 hours ago|||
Out of stock, unavailable
maxdo 8 hours ago||
So the benchmark is: two models with different harnesses produced very different results.

The GLM game was completely broken. The Opus game was OK at first glance, but also had bugs.

Different models with different costs produced different, imperfect results. How is that “close”? :)

Also on costs: GLM burns more tokens on average than Opus. GPT5.5 burns fewer, surprisingly.

zkmon 13 hours ago||
Cost difference matters most as cost optimization is the whole point of AI. Time difference (30 min vs 1 hr) is not a deal-breaker. The small precision gap on the first iteration does not matter for 99% of the work that happens in real world.
jameswhitford 12 hours ago|
Yes I 100% agree. Time-taken can be improved (with harnesses, subagent workflows etc.) and varies based on task.
TurdF3rguson 13 hours ago||
Pretty clearly it's beating Opus at [web dev](https://www.gptbased.com/) - on price, on score.. I mean what else is there?
myaccountonhn 11 hours ago||
The article states it's not multimodal. I guess for webdev that means you can't take a screenshot to indicate errors etc.
jofzar 12 hours ago|||
I hate to be that guy, but real privacy policy on training data/it being hosted somewhere where I'm not worried about secrets being stored/leaked.
HPsquared 12 hours ago|||
Open weights win on that front surely?
jofzar 12 hours ago||
Assuming I have 20k to run my own version of GLM?
mcintyre1994 11 hours ago|||
I guess the idea is that you probably can, or will be able to, find a host that you trust at least as much as you trust Anthropic.
Havoc 12 hours ago||||
Realistically you’d need to rotate secrets anyway once it moves from dev to production regardless of model provider
CuriouslyC 10 hours ago||||
2016 me would agree, but 2026 me looks at Trump and Dario, and at China, sees basically no ethical difference (or possibly even an ethical deficit for America) and considers that perhaps it's better to go with the option that isn't trying to hoodwink me with bullshit platitudes and flag waving while doing whatever they want in actuality.
neonstatic 3 hours ago||||
I'm sorry, is this criticism of Z.AI and China or Anthropic and the US? Not that there is much of a difference these days..
dkersten 12 hours ago|||
It's on other providers, like Together.ai
trick-or-treat 13 hours ago||
Latency? Just saying there's other things to consider.
doe88 7 hours ago||
To me, one-shot prompting is as relevant as Strava's KOM is for cycling. I'm more interested in a good cycling performance after a 3-hour ride than in a straight-up 30-minute record effort.
stavarotti 7 hours ago||
This style of comparison is decent at showing capability, but it doesn't really show me what I truly want: a sounding board and implementer with senior engineer-level execution. When I look back at all the teams that I've been part of, the best outcomes came from white-boarding (sometimes in the metaphorical sense) with one or two people, at times arguing, then finally compromising on a plan. Instead of synthetic benchmarks that try to be objective, I wonder if there's a way to test this, or maybe I'm opining on a way of working that will soon be gone?
CuriouslyC 10 hours ago|
You should repeat this experiment but with progressively more detail in the initial prompt. Claude's secret sauce is taking weakly specified prompts and making passable things from them, but as the degrees of freedom in the prompt go down Claude starts to disobey while other models close in on the intent.
jameswhitford 10 hours ago|
That is a great suggestion that I am definitely going to look into, thanks!
Babooz 10 hours ago||
Nice comparison, but perhaps a more informative one would be to keep the harness the same and use Claude Code for both models. In your comparison, the differences could be due to many harness design decisions.