Top
Best
New

Posted by meetpateltech 6 hours ago

GLM-5: From Vibe Coding to Agentic Engineering(z.ai)
348 points | 189 comments
Alifatisk 53 minutes ago|
While GLM-5 seems impressive, this release also included lots of new cool stuff!

> GLM-5 can turn text or source materials directly into .docx, .pdf, and .xlsx files—PRDs, lesson plans, exams, spreadsheets, financial reports, run sheets, menus, and more.

A new type of model has joined the series, GLM-5-Coder.

GLM-5 was trained on Huawei Ascend, last time when DeepSeek tried to use this chip, it flopped and they resorted to Nvidia again. This time seems like a success.

Looks like they also released their own agentic IDE, https://zcode.z.ai

I don’t know if anyone else knows this but Z.ai also released new tools excluding the Chat! There’s Zread (https://zread.ai), OCR (seems new? https://ocr.z.ai), GLM-Image gen https://image.z.ai and Voice cloning https://audio.z.ai

If you go to chat.z.ai, there is a new toggle in the prompt field, you can now toggle between chat/agentic. It is only visible when you switch to GLM-5.

Very fascinating stuff!

erwald 4 minutes ago|
Where did you read that it was trained on Ascends? I've only seen information suggesting that you can run inference with Ascends, which is obviously a very different thing.
simonw 5 hours ago||
Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07...

Solid bird, not a great bicycle frame.

btown 5 hours ago||
Thank you for continuing to maintain the only benchmarking system that matters!

Context for the unaware: https://simonwillison.net/tags/pelican-riding-a-bicycle/

l_eo 2 hours ago|||
They will start to max this benchmark as well at some point.
ljm 45 minutes ago||
It's not a benchmark though, right? Because there's no control group or reference.

It's just an experiment on how different models interpret a vague prompt. "Generate an SVG of a pelican riding a bicycle" is loaded with ambiguity. It's practically designed to generate 'interesting' results because the prompt is not specific.

It also happens to be an example of the least practical way to engage with an LLM. It's no more capable of reading your mind than anyone or anything else.

I argue that, in the service of AI, there is a lot of flexibility being created around the scientific method.

tylervigen 38 minutes ago|||
For 2026 SOTA models I think that is fair.

For the last generation of models, and for today's flash/mini models, I think there is still a not-unreasonable binary question ("is this a pelican on a bicycle?") that you can answer by just looking at the result: https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/

interstice 34 minutes ago|||
So if it can generate exactly what you had in mind based presumably on the most subtle of cues like your personal quirks from a few sentences that could be _terrifying_, right?
gabiruh 3 hours ago|||
It's interesting how some features, such as green grass, a blue sky, clouds, and the sun, are ubiquitous among all of these models' responses.
derefr 1 hour ago|||
It is odd, yeah.

I'm guessing both humans and LLMs would tend to get the "vibe" from the pelican task, that they're essentially being asked to create something like a child's crayon drawing. And that "vibe" then brings with it associations with all the types of things children might normally include in a drawing.

btown 1 hour ago|||
If you were a pelican, wouldn't you want to go cycling on a sunny day?

Do electric pelicans dream of touching electric grass?

solarized 2 hours ago|||
This Pelican benchmark has become irrelevant. SVG is already ubiquitous.

We need a new, authentic scenario.

viraptor 2 hours ago|||
Like identifying names of skateboard tricks from the description? https://skatebench.t3.gg/
alargemoose 1 hour ago|||
I don’t care how practical it may or may not be, this is my new favorite LLM benchmark
stevage 1 hour ago||||
I couldn't find an about page or similar?
viraptor 1 hour ago||
Here's the public sample https://github.com/T3-Content/skatebench/blob/main/bench/tes...

I don't think there's a good description anywhere. https://youtube.com/@t3dotgg talks about it from time to time.

hmottestad 1 hour ago|||
o3-pro is better than 5.2 pro! And GPT 5 high is best. Really quite interesting.
echelon 1 hour ago|||

  1. Take the top ten searches on Google Trends 
     (on day of new model release)
  2. Concatenate
  3. SHA-1 hash them
  4. Use this as a seed to perform random noun-verb 
     lookup in an agreed upon large sized dictionary. 
  5. Construct a sentence using an agreed upon stable 
     algorithm that generates reasonably coherent prompts
     from an immensely deep probability space.
That's the prompt. Every existing model is given that prompt and compared side-by-side.

You can generate a few such sentences for more samples.

Alternatively, take the top ten F500 stock performers. Some easy signal that provides enough randomness but is easy to agree upon and doesn't provide enough time to game.

It's also something teams can pre-generate candidate problems for to attempt improvement across the board. But they won't have the exact questions on test day.

_joel 5 hours ago|||
Now this is the test that matters, cheers Simon.
pwython 4 hours ago|||
How many pelican riding bicycle SVGs were there before this test existed? What if the training data is being polluted with all these wonky results...
bwilliams18 2 hours ago|||
I'd argue that a models ability to ignore/manage/sift through the noise added to the training set from other LLMs increases in importance and value as time goes on.
nerdsniper 3 hours ago|||
You're correct. It's not as useful as it (ever?) was as a measure of performance...but it's fun and brings me joy.
RC_ITR 1 hour ago||
The bird not having wings, but all of us calling it a 'solid bird' is one of the most telling examples of the AI expectations gap yet. We even see its own reasoning say it needs 'webbed feet' which are nowhere to be found in the image.

This pattern of considering 90% accuracy (like the level we've seemingly we've stalled out on for the MMLU and AIME) to be 'solved' is really concerning for me.

AGI has to be 100% right 100% of the time to be AGI and we aren't being tough enough on these systems in our evaluations. We're moving on to new and impressive tasks toward some imagined AGI goal without even trying to find out if we can make true Artificial Niche Intelligence.

Rudybega 1 hour ago||
MMLU performance caps out around 90% because there are tons of errors in the actual test set. There's a pretty solid post on it here: https://www.reddit.com/r/LocalLLaMA/comments/163x2wc/philip_...

As far as I can tell for AIME, pretty much every frontier model gets 100% https://llm-stats.com/benchmarks/aime-2025

Aurornis 5 hours ago||
The benchmarks are impressive, but it's comparing to last generation models (Opus 4.5 and GPT-5.2). The competitor models are new, but they would have easily had enough time to re-run the benchmarks and update the press release by now.

Although it doesn't really matter much. All of the open weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.

dongobread 3 hours ago||
What a strangely hostile statement on an open weight model. Running like 20 benchmark evaluations isn't trivial by itself, and even updating visuals and press statements can take a few days at a tech company. It's literally been 5 days since this "new generation" of models released. GPT-5.3(-codex) can't even be called via API, so it's impossible to test for some benchmarks.

I notice the people who endlessly praise closed-source models never actually USE open weight models, or assume their drop-in prompting methods and workflow will just work for other model families. Especially true for SWEs who used Claude Code first and now think every other model is horrible because they're ONLY used to prompting Claude. It's quite scary to see how people develop this level of worship for a proprietary product that is openly distrusting of users. I am not saying this is true or not of the parent poster, but something I notice in general.

As someone who uses GLM-4.7 a good bit, it's easily at Sonnet 4.5 tier - have not tried GLM-5 but it would be surprising if it wasn't at Opus 4.5 level given the massive parameter increase.

maxdo 45 minutes ago|||
but even opus 4.5 is history now, codex-5-3 and opus 4.6 are one more step forward. The opus itself caused paradigm shift, from writing code with AI, to ai is writing code with human.

open weight models are not there at all yet.

apimade 1 hour ago|||
Isn’t trivial? How is it not completely automated at this point?
InsideOutSanta 5 hours ago|||
> it's comparing to last generation models (Opus 4.5 and GPT-5.2).

If it's anywhere close to those models, I couldn't possibly be happier. Going from GLM-4.7 to something comparable to 4.5 or 5.2 would be an absolutely crazy improvement.

Aurornis 5 hours ago||
> Going from GLM-4.7 to something comparable to 4.5 or 5.2 would be an absolutely crazy improvement.

Before you get too excited, GLM-4.7 outperformed Opus 4.5 on some benchmarks too - https://www.cerebras.ai/blog/glm-4-7 See the LiveCodeBench comparison

The benchmarks of the open weights models are always more impressive than the performance. Everyone is competing for attention and market share so the incentives to benchmaxx are out of control.

InsideOutSanta 4 hours ago|||
Sure. My sole point is that calling Opus 4.5 and GPT-5.2 "last generation models" is discounting how good they are. In fact, in my experience, Opus 4.6 isn't much of an improvement over 4.5 for agentic coding.

I'm not immediately discounting Z.ai's claims because they showed with GLM-4.7 that they can do quite a lot with very little. And Kimi K2.5 is genuinely a great model, so it's possible for Chinese open-weight models to compete with proprietary high-end American models.

GorbachevyChase 3 hours ago|||
From a user perspective, I would consider Opus 4.6 somewhat of a regression. You can exhaust your the five hour limit in less than half an hour on, and I used up the weekly limit in just two days. The outputs did not feel significantly better than Opus 4.5 and that only feels smarter than Sonnet by degrees. This is running a single session on a pro plan. I don’t get paid to program, so API cost matter to me. The experience was irritating enough to make me start looking for an alternative, and maybe GLM is the way to go for hobby users.
Aurornis 4 hours ago|||
I think there are two types of people in these conversations:

Those of us who just want to get work done don't care about comparisons to old models, we just want to know what's good right now. Issuing a press release comparing to old models when they had enough time to re-run the benchmarks and update the imagery is a calculated move where they hope readers won't notice.

There's another type of discussion where some just want to talk about how impressive it is that a model came close to some other model. I think that's interesting, too, but less so when the models are so big that I can't run them locally anyway. It's useful for making purchasing decisions for someone trying to keep token costs as low as possible, but for actual coding work I've never found it useful to use anything other than the best available hosted models at the time.

buu700 2 hours ago|||
It's high-interest to me because open models are the ultimate backstop. If the SOTA hosted models all suddenly blow up or ban me, open models mitigate the consequence from "catastrophe" to "no more than six to nine months of regression". The idea that I could run a ~GPT-5-class model on my own hardware (given sufficient capex) or cloud hardware under my control is awesome.
ffsm8 4 hours ago|||
For the record, opus 4.6 was released less then a week ago.

That you think corporations are anything close to quick enough to update their communications on public releases like this only shows that you've never worked in corporate

miroljub 3 hours ago|||
Yeah, I'm sure closed source model vendors are doing everything within their power to dumb down benchmarks, so they can look like underdogs and play a pity game against open weight models.

Let's have a serious discussion. Just because Claude PR department coined the term benchmaxxing, we we should not be using it unless they shell out some serious monetes.

dust42 5 hours ago|||
They are all just token generators without any intelligence. There is so little difference nowadays that I think in a blind test nobody will be able to differentiate the models - whether open source or closed source.

Today's meme was this question: "The car wash is only 50 meters from my house. I want to get my car washed, should I drive there or walk?"

Here is Claude's answer just right now:

"Walk! At only 50 meters (about 150 feet), it would take you less than a minute to walk there. Driving such a short distance would actually be less convenient - by the time you get in the car, start it, drive, and park, you could already be there on foot. Plus, you'd save fuel and your car would be right there waiting for you after the wash is done."

Here is ChatGPT, also right now:

"Since it’s only *50 meters* (about 160 feet), walking makes a lot more sense.

Here’s why:

* *It’s extremely close* — you’ll get there in under a minute. * *Driving would actually be less efficient* — by the time you start the car, move it, and park, you’re already there. * *Cold starts aren’t great for your engine* — very short drives don’t let the engine warm up properly. * *It saves fuel and emissions*, even if it’s a tiny amount. * You avoid the hassle of maneuvering such a short distance.

The only time driving might make sense is if:

* The weather is terrible (heavy rain, snow, extreme cold). * You have mobility issues. * The car wash requires you to stay in the vehicle and you don’t want to leave it there.

Otherwise, this is a perfect “walk it over” situation. "

Please save us all that talk about frontier and SOTA and that only the closedAI models are any good and the others are all so bad and benchmaxxed. For most purposes a Toyota is just as good as a BMW or a Merc or whatever luxury brand tickles your fancy. Even worse, the lastest 80B Qwen Next is not far from Opus 4.6 but runs on my laptop.

cherryteastain 4 hours ago|||
Gemini 3 Flash ("Fast" in the web app) did not have trouble with this:

Since you presumably want the car to be clean, the answer depends entirely on how the car is getting washed. The Breakdown If it’s a Professional Car Wash: You’ll need to drive. Bringing the car to the wash is generally the prerequisite for them washing it. Plus, walking 50 meters just to ask them to come pick up a car that is visible from their front door might result in some very confused looks from the staff. If it’s a Self-Service Pressure Washer: Again, drive. Dragging a 1,500kg vehicle 50 meters by hand is an incredible workout, but probably not the "quick chore" you had in mind. If you are hiring a Mobile Detailer: Walk. Stroll over there, tell them where you're parked, and walk back home to watch TV while they do the work in your driveway.

vardalab 1 hour ago||||
This is really just a meme. People don't know how to use these tools.

Here is the response from Gpt-5.2 using my default custom instructions in the mac desktop app.

OBJECTIVE: Decide whether to drive or walk to a car wash ~50 meters from home, given typical constraints (car must be present for wash).

APPROACH: Use common car-wash workflows + short-distance driving considerations (warm engine, time, parking/queue). No web sources needed (stable, local decision).

OUTPUT: • Default choice: Drive. A car wash generally requires the car on-site, and 50 meters of driving is trivial. • When walking makes sense: • You’re going to drop the car off (hand-wash/detail) and pick it up later. • You want to avoid moving the car now (ice, tight parking, you’ve been drinking, etc.) and someone else will handle the car. • Micro-guidance: • If it’s an automatic wash you stay in, drive. • If it’s detail/hand wash with a wait, drive there, then walk home (or just wait).

CONFIDENCE: 92/100 Assumptions: It’s a normal car wash that requires your car to be there; conditions are safe to drive 50 meters.

pegasus 5 minutes ago|||
Interesting, what were the instructions if you don't mind sharing?
reliabilityguy 58 minutes ago||||
Your objective has explicit instruction that car has to be present for a wash. Quite a difference from the original phrasing where the model has to figure it out.
J_cst 23 minutes ago|||
That's the answer of his LLM which has decomposed the question and built the answer following the op prompt obviously. I think you didn't get it.
bwat49 27 minutes ago|||
> Your objective has explicit instruction that car has to be present for a wash.

Which is exactly how you're supposed to prompt an LLM, is the fact that giving a vague prompt gives poor results really suprising?

reliabilityguy 22 minutes ago||
In this case, with such a simple task, why even bother to prompt it?

The whole idea of this question is to show that pretty often implicit assumptions are not discovered by the LLM.

sph 1 hour ago|||
"You're holding it wrong."
j-bos 4 hours ago||||
GLM 4.7: "You should drive.

Since the goal is to get your car washed, the car needs to be at the car wash. If you walk, you will arrive at the car wash, but your car will still be sitting at home"

Are you sure that question is from this year?

qwytw 1 hour ago||||
Some snippets from Kimi's 2.5 answer:

"" [...] Since you need to get your car washed, you have to bring the car to the car wash—walking there without the vehicle won't accomplish your goal [...] If it's a self-service wash, you could theoretically push the car 50 meters if it's safe and flat (unusual, but possible) [..] Consider whether you really need that specific car wash, or if a mobile detailing service might come to you [...] """

Which seems slightly (unintentionally) funny.

But to be fair all the Gemini (including flash) and GPT models I tried did understand the quesiton.

haute_cuisine 4 hours ago||||
Doesn't seem to be the case, gpt 5.2 thinking replies: To get the car washed, the car has to be at the car wash — so unless you’re planning to push it like a shopping cart, you’ll need to drive it those 50 meters.
Aurornis 4 hours ago||||
If you're asking simple riddles, you shouldn't be paying for SOTA frontier models with long context.

This is a silly test for the big coding models.

This is like saying "all calculators are the same, nobody needs a TI-89!" and then adding 1+2 on a pocket calculator to prove your point.

Balinares 2 hours ago|||
I find it's a great test, actually. There are lots of "should I take the car" decisions in putting together software that's supposed to do things, and with poor judgement in how the things should be done, you typically end up with the software equivalent of a Rube-Goldberg machine that harnesses elephants to your car and uses mice to scare the elephants toward the car wash while you walk. After all, it's a short distance, isn't it?
grey-area 4 hours ago|||
No it’s like having a calculator which is unable to perform simple arithmetic, but lots of people think it is amazing and sentient and want to talk about that instead of why it can’t add 2 + 2.
viraptor 42 minutes ago||
We know why it's not going to do precise math and why you can have better experience asking for an app solving the math problem you want. There's no point talking about it - it's documented in many places for people who are actually interested.
Scene_Cast2 4 hours ago||||
I just ran this with Gemini 3 Pro, Opus 4.6, and Grok 4 (the models I personally find the smartest for my work). All three answered correctly.
miroljub 3 hours ago||
They had plenty of time to update their system prompts so they don't be embarrassed.

I noticed whenever such meme comes out, if you check immediately you can reproduce it yourself, but after a free hours it's already updated.

selcuka 3 minutes ago|||
I tested it on Claude and only Opus answers it correctly. Haiku and Sonnet can't. The would've updated the system prompts for all 3.
viraptor 2 hours ago||||
I think you're seriously underestimating how much effort the fine tuning at their scale takes and what impact it has. They don't pack every edge case into the system prompt either. It's not like they update the model every few hours or even care about memes. If they seriously did, they'd force-delegate spelling questions to tool calls.
suzzer99 1 hour ago||
Could it be the model is constantly searching its own name for memes, or checking common places like HN and updating accordingly? I have no idea how real-time these things are, just asking.
viraptor 47 minutes ago||
The model doesn't do anything on its own. And it's usually months in between new model snapshots.
qiine 2 hours ago||||
The road to AGI is weirder than anticipated
UrineSqueegee 1 hour ago|||
thats not how it works
uncharted9 3 hours ago||||
I tried this prompt with all the major AI models that I know and have installed, and only GLM from Z.ai and Gemini 3 Flash could answer it. Even in that case, Gemini 3 Flash gave a bit more nuanced answer, but every other model like Claude Opus 4.5, Chat GPT, Grok - everything failed.
reliabilityguy 1 hour ago||||
Gemini Pro:

You should definitely drive.

If you walk there, your car will still be dirty back at your house! Since the goal is to get the car washed, you have to take it with you.

PS fantastic question!

bonoboTP 4 hours ago||||
It's unclear where the car is currently from your phrasing. If you add that the car is in your garage, it says you'll need to drive to get the car into the wash.
RohMin 2 hours ago||
Do you think the average person would need this sort of clarification? How many of us would have recommended to walk?
bonoboTP 1 hour ago||
Do you think this is a fundamentally unbridge-able limitation of LLMs? Do you know where we were just a year ago? Can you imagine that this will get better with upcoming releases? It's like when Gary Marcus was confidently stating that AI (at least current paradigm) will never be able to generate an image of a horse riding an astronaut. (Or full wineglasses or arbitrary clocks).
king_phil 4 hours ago||||
Gemini 3 Pro:

This is a classic logistical puzzle!

Unless you have a very unique way of carrying your vehicle, you should definitely drive.

If you walk there, you'll arrive at the car wash, but your car will still be dirty back at your house. You need to take the car with you to get it washed.

Would you like me to check the weather forecast for $mytown to see if it's a good day for a car wash?

interloxia 4 hours ago||
For me, various forms of Gemini respond with "Unless you are planning on carrying the car there" which I find to be just sassy enough to be amusing.
smokel 4 hours ago||||
> They are all just token generators without any intelligence.

Maybe, but "intelligence" doesn't have a clear, agreed definition. And calling them "just token generators" skips over how complex that generation actually is.

esafak 4 hours ago||||
This is a great comment. The frontier labs ought to be embarrassed they can't get this right. A human would not make the same mistake, after thinking for a second.
izucken 1 hour ago||||
Tsk tsk thats adversarial prompting. Hostile even. And thats goalposting which is totally not cool, man.
Hammershaft 3 hours ago||||
Claude 4.6 got it first try:

"You’ll want to drive — you need the car at the car wash for them to wash it!

Even though it’s just 50 meters, the car has to be there. Enjoy the fresh clean car on the short drive back! "

UrineSqueegee 1 hour ago||||
this is a bot comment or just ragebait
rockinghigh 4 hours ago||||
How is this riddle relevant to a coding model?
esafak 4 hours ago||
It's not a coding model. Go to https://chat.z.ai/ and you'll see it is presented as a generalist.
zozbot234 4 hours ago|||
1. Analyze the User's Request:

Distance: 50 meters (very short).

Goal: Get the car washed.

Decision: Drive or walk?

2. Evaluate the Options:

Option A: Walk

Pros: Healthy, environmentally friendly, no need to find parking (though it's close), easy to chat with neighbors.

Cons: The car is left at the car wash. The user can't "walk" the car to the car wash. The car must be moved to the car wash to be washed.

Option B: Drive

Pros: Moves the car to the location where the service is performed. It's the only logical way to get the car washed.

Cons: Driving 50 meters is technically bad for a cold engine (short trips), uses a negligible amount of fuel, might seem lazy.

3. Identify the Logical Flaw/Humor:

The user is asking if they should walk to get their car washed.

Unless the car wash involves someone coming to their house (mobile wash) or carrying the car, the car must be at the car wash.

Walking to the car wash leaves the car at home. Walking back from the car wash leaves the car at the car wash.

Therefore, the user must drive the car to the car wash.

4. Formulate the Answer:

Direct Answer: You should drive.

Reasoning: The car needs to be at the car wash to be washed. You cannot walk the car there.

Nuance/Context: 50 meters is a very short distance. Driving a car such a short distance is generally not great for the engine (doesn't allow oil to circulate/warm up properly), but it is necessary in this specific case.

Alternative (if applicable): Is there a mobile wash option? If so, walking isn't even needed. But assuming a standard car wash bay: Drive.

5. Drafting the response (Internal Monologue/Trial Phrases):

Draft 1: You have to drive. You can't wash the car if it's at your house and you are at the car wash.

Draft 2: You should drive. Since the car needs to be washed, it needs to be at the car wash. Walking there doesn't help get the car clean.

Refining for clarity and wit: Point out the obvious - the car is the object being washed. It needs to move.

Final Polish (similar to the provided good output):

State clearly: You should drive.

Explain why: The car needs to be there.

Add a practical note: Driving 50 meters is a bit silly for engine health, but unavoidable here.

Consider the scenario of "walking back": Maybe the user meant "walk back"? No, the prompt says "drive there or walk there". The intent is transportation to the location.

6. Constructing the final output:

"You should definitely drive."

"Why? Because the car wash cleans the car, so the car needs to be at the car wash. If you walk, the car stays at home."

"Additional thought: Driving 50 meters is technically a 'cold start' short trip which isn't ideal for the engine, but it's the only way to get the service done!"

Final Output Generation (matching the logic above).

You should drive.

Here is why:

1. The car needs to be there: You are getting the car washed, so the car must be at the car wash. You cannot walk the car there by yourself.

2. Efficiency: Driving 50 meters takes about 10 seconds. Walking 50 meters takes about a minute, but you would be leaving the car behind.

The only exception is if you are planning to get a mobile car wash (where they come to you) or if you are dropping the car off and walking back home while they clean it. But to simply "get there," you must drive.

(The above output is from GLM-5)

throwup238 5 hours ago|||
> Although it doesn't really matter much. All of the open weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.

Agreed. I think the problem is that while they can innovate at algorithms and training efficiency, the human part of RLHF just doesn't scale and they can't afford the massive amount of custom data created and purchased by the frontier labs.

IIRC it was the application of RLHF which solved a lot of the broken syntax generated by LLMs like unbalanced braces and I still see lots of these little problems in every open source model I try. I don't think I've seen broken syntax from the frontier models in over a year from Codex or Claude.

algorithm314 5 hours ago|||
Can't they just run the output through a compiler to get feedback? Syntax errors seem easier to get right.
NitpickLawyer 4 hours ago|||
The difference is in scaling. The top US labs have oom more compute available than chinese labs. The difference in general tasks is obvious once you use them. It used to be said that open models are ~6mo behind SotA a year go, but with the new RL paradigm, I'd say the gap is growing. With less compute they have to focus on narrow tasks, resort to poor man's distillation and that leads to models that show benchmaxxing behavior.

That being said, this model is MIT licensed, so it's a net benefit regardless of being benchmaxxed or not.

rockinghigh 4 hours ago|||
They do. Pretty much all agentic models call linting, compiling and testing tools as part of their flow.
ej88 5 hours ago|||
the new meta is purchasing rl environments where models can be self-corrected (e.g. a compiler will error) after sft + rlhf ran into diminishing returns. although theres still lots of demand for "real world" data for actually economically valuable tasks
miki123211 2 hours ago|||
Anthropic, OpenAI and Google have real user data that they can use to influence their models. Chinese labs have benchmarks. Once you realize this, it's obvious why this is the case.

You can have self-hosted models. You can have models that improve based on your needs. You can't have both.

viraptor 2 hours ago||
zAI, minimax and Kimi have plenty of subscriber usage on their own platforms. They get real data just as well. Less or it maybe but it's there.
Art9681 13 minutes ago||
I'm going to claim that the majority of those users are optimizing for cost and not correctness and therefore the quality of data collected from those sessions is questionable. If you're working on something of consequence, you're not using those platforms. If you're a tinkerer pinching pennies, sure.
ionelaipatioaei 5 hours ago|||
I think the only advantage that closed models have are the tools around them (claude code and codex). At this point if forced I could totally live with open models only if needed.
evv 5 hours ago|||
The tooling is totally replicated in open source. OpenCode and Letta are two notable examples, but there are surely more. I'm hacking on one in the evenings.

OpenCode in particular has huge community support around it- possibly more than Claude Code.

ionelaipatioaei 4 hours ago|||
I know, I use OpenCode daily but it still feels like it's missing something - codex in my opinion is way better at coding but I honestly feel like that's because OpenAI controls both the model and the harness so they're able to fine tune everything to work together much better.
Daviey 4 hours ago|||
It's there now, `opencode models --refresh`
hamdingers 1 hour ago||||
GLM works wonderfully with Claude, just have to set some environment variables and you're off to the races.
quikoa 4 hours ago|||
If tooling really is an advantage why isn't it possible to use the API with a subscription and save money?
ionelaipatioaei 4 hours ago||
In my opinion it is because if you control both the model and the harness then you're able to tune everything to work together much better.
cmrdporcupine 5 hours ago|||
I tried GLM 5 by API earlier this morning and was impressed.

Particularly for tool use.

yieldcrv 5 hours ago||
come on guys, you were using Opus 4.5 literally a week ago and don't even like 4.6

something that is at parity with Opus 4.5 can ship everything you did in the last 8 weeks, ya know... when 4.5 came out

just remember to put all of this in perspective, most of the engineers and people here haven't even noticed any of this stuff and if they have are too stubborn or policy constrained to use it - and the open source nature of the GLM series helps the policy constrained organizations since they can theoretically run it internally or on prem.

Aurornis 4 hours ago||
> something that is at parity with Opus 4.5

You're assuming the conclusion

The previous GLM-4.7 was also supposed to be better than Sonnet and even match or beat Opus 4.5 in some benchmarks ( https://www.cerebras.ai/blog/glm-4-7 ) but in real world use it didn't perform at that level.

You can't read the benchmarks alone any more.

pcwelder 5 hours ago||
It's live on openrouter now.

In my personal benchmark it's bad. So far the benchmark has been a really good indicator of instruction following and agentic behaviour in general.

To those who are curious, the benchmark is just the ability of model to follow a custom tool calling format. I ask it to using coding tasks using chat.md [1] + mcps. And so far it's just not able to follow it at all.

[1] https://github.com/rusiaaman/chat.md

manofmanysmiles 5 hours ago||
I love the idea of chat.md.

I'm developing a personal text editor with vim keybindings and paused work because I couldn't think of a good interface that felt right. This could be it.

I think I'll update my editor to do something like this but with intelligent "collapsing" of extra text to reduce visual noise.

data-ottawa 4 hours ago|||
Custom tool calling formats are iffy in my experience. The models are all reinforcement learned to follow specific ones, so it’s always a battle and feels to me like using the tool wrong.

Have you had good results with the other frontier models?

nolist_policy 5 hours ago|||
Could also be the provider that is bad. Happens way too often on OpenRouter.
pcwelder 4 hours ago||
I had added z-ai in allow list explicitly and verified that it's the one being used.
sergiotapia 4 hours ago||
Be careful with openrouter. They routinely host quantized versions of models via their listed providers and the models just suck because of that. Use the original providers only.
nullbyte 3 hours ago||
I specifically do not use the CN/SG based original provider simply because I don't want my personal data traveling across the pacific. I try to only stay on US providers. Openrouter shows you what the quantization of each provider is, so you can choose a domestic one that's FP8 if you want
justinparus 5 hours ago||
Been using GLM-4.7 for a couple weeks now. Anecdotally, it’s comparable to sonnet, but requires a little bit more instruction and clarity to get things right. For bigger complex changes I still use anthropic’s family, but for very concise and well defined smaller tasks the price of GLM-4.7 is hard to beat.
monooso 3 hours ago||
This aligns very closely with my experience.

When left to its own devices, GLM-4.7 frequently tries to build the world. It's also less capable at figuring out stumbling blocks on its own without spiralling.

For small, well-defined tasks, it's broadly comparable to Sonnet.

Given how incredibly cheap it is, it's useful even as a secondary model.

rapind 3 hours ago||
Anecdotal, but I've been locked to Sonnet for the past 6-8 months just because they always seem to introduce throttling bugs with Opus where it starts to devour tokens or falls over. Very interested once open models close the gap to about 6 months.
Havoc 2 hours ago||
Been playing with it in opencode for a bit and pretty impressed so far. Certainly more of an incremental improvement than a big bang change, but it does seem better a good bit better than 4.7, which in turn was a modest but real improvement over 4.6.

Certainly seems to remember things better and is more stable on long running tasks.

mohsen1 34 minutes ago||
I am using it with Claude Code and so far so good. Can't tell if it's as good as Opus 4.6 or not yet
cherryteastain 4 hours ago||
What is truly amazing here is the fact that they trained this entirely on Huawei Ascend chips per reporting [1]. Hence we can conclude the semiconductor to model Chinese tech stack is only 3 months behind the US, considering Opus 4.5 released in November. (Excluding the lithography equipment here, as SMIC still uses older ASML DUV machines) This is huge especially since just a few months ago it was reported that Deepseek were not using Huawei chips due to technical issues [2].

US attempts to contain Chinese AI tech totally failed. Not only that, they cost Nvidia possibly trillions of dollars of exports over the next decade, as the Chinese govt called the American bluff and now actively disallow imports of Nvidia chips as a direct result of past sanctions [3]. At a time when Trump admin is trying to do whatever it can to reduce the US trade imbalance with China.

[1] https://tech.yahoo.com/ai/articles/chinas-ai-startup-zhipu-r...

[2] https://www.techradar.com/pro/chaos-at-deepseek-as-r2-launch...

[3] https://www.reuters.com/world/china/chinas-customs-agents-to...

bertili 2 hours ago||
Exactly. The emperor has no clothes. The largest investments in US tech in history and yet there less than a year of moat. OpenAI or Anthropic will not be able to compete with Chinese server farms and so the US strategy is misplaced investments that will come home to roast.

And we will have Deepseek 4 in a few days...

raingrove 49 minutes ago|||
To be fair, the US ban on Nvidia chip exports to China began under the Biden administration in 2022. By the time Trump took office, it was already too late.
mark_l_watson 4 hours ago|||
US Secretary of State Bressent just publicly said that the US needs to get along and cooperate with China. His tone was so different than previously in the last year that I listened to the video clip twice.

Obviously for the average US tax payer getting along with China is in our interests - not so much our economic elites.

I use both Chinese and US models, and Mistral in Proton’s private chat. I think it makes sense for us to be flexible and not get locked in.

Havoc 2 hours ago||
>His tone was so different than previously in the last year that I listened to the video clip twice.

US bluff got called. A year back it looked like US held all the cards and could squeeze others without negative consequences. i.e. have cake and eat it too

Since then: China has not backed down, Europe is talking de-dollarization, BRICS is starting to find a new gear on separate financial system, merciless mocking across the board, zero progress on ukraine, fed wobbled, focus on gold as alternate to US fiat, nato wobbled, endless scandals, reputation for TACO, weak employment, tariff chaos, calls for withdrawal of gold from US's safekeeping, chatter about dumping US bonds, multiple major countries being quite explicit about telling trump to get fucked

Not at all surprised there is a more modest tone...none of this is going the "without negative consequences" way

>Mistral in Proton’s private chat

TIL

throw101010 1 hour ago||
Who could have predicted that cooperation with decades old allies would be more fruitful than spitting in their faces and threatening them on a weekly basis both economically and militarily... really nobody /s

And yes, the consequence is strengthening the actual enemies of the USA, their AI progress is just one symptom of this disastrous US administration and the incompetence of Donald Trump. He really is the worst President of the USA ever, even if you were to just judge him on his leadership regarding technology... and I'm saying this while he is giving a speech about his "clean beautiful coal" right now in the White House.

re-thc 4 hours ago|||
> What is truly amazing here is the fact that they trained this entirely on Huawei Ascend chips

Has any of these outfits ever publicly stated they used Nvidia chips? As in the non-officially obtained 1s. No.

> US attempts to contain Chinese AI tech totally failed. Not only that, they cost Nvidia possibly trillions of dollars of exports over the next decade, as the Chinese govt called the American bluff and now actively disallow imports of Nvidia chips

Sort of. It's all a front. On both sides. China still ALWAYS had access to Nvidia chips - whether that's the "smuggled" 1s or they run it in another country. It's not costing Nvidia much. The opening of China sales for Nvidia likewise isn't as much of a boon. It's already included.

> At a time when Trump admin is trying to do whatever it can to reduce the US trade imbalance with China

Again, it's a front. It's about news and headlines. Just like when China banned lobsters from a certain country, the only thing that happened was that they went to Hong Kong or elsewhere, got rebadged and still went in.

cherryteastain 4 hours ago||
> Has any of these outfits ever publicly stated they used Nvidia chips? As in the non-officially obtained 1s. No.

Uh yes? Deepseek explicitly said they used H800s [1]. Those were not banned btw, at the time. Then US banned them too. Then US was like 'uhh okay maybe you can have the H200', but then China said not interested.

[1] https://arxiv.org/pdf/2412.19437

re-thc 4 hours ago||
> Uh yes? Deepseek explicitly said they used H800s [1]. Those were not banned btw, at the time

Then they haven't. I said the non-officially obtained 1s that they can't / won't mention i.e. those Blackwells etc...

seydor 4 hours ago||
We can conclude that they ll flood the world with huawei inference chips from Temu and create worldwide AI pollution
kristianp 2 hours ago||
So that was pony alpha (1). Now what's Aurora Alpha?

(1) https://openrouter.ai/openrouter/pony-alpha

esafak 6 hours ago|
I got fed up with GLM-4.7 after using it for a few weeks; it was slow through z.ai and not as good as the benchmarks lead me to believe (esp. with regards to instruction following) but I'm willing to give it another try.
esafak 1 hour ago||
I forgot to add that GLM loves to perform destructive operations; it'll happily git reset/push to main. Put it on a very tight leash.
valvar 5 hours ago|||
Try Cerberas
odie5533 1 hour ago|||
I spent $10 in 2 minutes with that and gave up
w4yai 5 hours ago||
Synthetic is a bless when it comes to providing OSS models (including GLM), their team is responsive, no downtime or any issue for the last 6 months.

Full list of models provided : https://dev.synthetic.new/docs/api/models

Referal link if you're interested in trying it for free, and discount for the first month : https://synthetic.new/?referral=kwjqga9QYoUgpZV

More comments...