
Posted by zixuanlimit 5 hours ago

GLM-5.1: Towards Long-Horizon Tasks (z.ai)
273 points | 83 comments
gavinray 3 hours ago|
I find the "8 hour Linux Desktop" bit disingenuous; the fine print reveals it's a browser page:

  > "build a Linux-style desktop environment as a web application"
They claim "50 applications from scratch", but "Browser" and a bunch of the other apps are likely all <iframe> elements.

We all know that building a spec-compliant browser alone is a herculean task.

MrPowerGamerBR 2 hours ago||
In my opinion it would be way cooler if it actually created a real Linux desktop environment instead of only a replica.

Would it succeed? Probably not, but it would be way more interesting, even if it didn't work.

I find things like Claude's C compiler way more interesting: even though CCC is objectively bad (the code is messy, it generates very bad unoptimized output, etc.), it's at least something cool and shows that, with some human guidance, it could generate something even better.

bredren 3 hours ago||
It is a big claim without the source and prompting.
jaggs 3 hours ago||
How does it compare to Kimi 2.5 or Qwen 3.6 Plus?
eis 3 hours ago||
The blog post has a benchmark comparison table with these two in it.
jaggs 2 hours ago||
Thanks, I missed that. It's very interesting. They're quite close, but I found Qwen 3.6 Plus was just marginally better than Kimi 2.5. Looking at the stats, though, I'll definitely give GLM 5.1 a try now. [edit: even though, looking at it, it's not cheap and has a much smaller context size. And I can't tell about tool use.]
DeathArrow 3 hours ago||
How it compares to Kimi 2.5 or Qwen 3.6 Plus I don't know, but I ran GLM 5 (not 5.1) side by side with Qwen 3.5 Plus and it was visibly better.
bigyabai 4 hours ago||
It's an okay model. My biggest issue using GLM 5.1 in OpenCode is that it loses coherency over longer contexts. When you crest 128k tokens, there's a high chance that the model will start spouting gibberish until you compact the history.

For short-term bugfixing and tweaks though, it does about what I'd expect from Sonnet for a pretty low price.

cassianoleal 3 hours ago||
I've done some very long sessions on OpenCode with Dynamic Context Pruning. Highly recommend it.

https://github.com/Opencode-DCP/opencode-dynamic-context-pru...
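For anyone curious what that kind of pruning amounts to, here's a toy sketch of the general idea (this is not the plugin's actual code; the message shape and the chars/4 token estimate are my own assumptions): keep the system prompt and the newest turns, and once a token budget is exceeded, drop stale tool outputs first since they dominate context bloat.

```python
def prune_context(messages, budget=100_000,
                  est_tokens=lambda m: len(m["content"]) // 4):
    """Keep system messages plus the most recent turns under a token budget,
    preferentially dropping old tool outputs."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    kept, used = [], sum(est_tokens(m) for m in system)
    for m in reversed(rest):              # walk newest -> oldest
        cost = est_tokens(m)
        if used + cost > budget and m["role"] == "tool":
            continue                      # stale tool output: first to go
        if used + cost > budget:
            break                         # even prose no longer fits
        kept.append(m)
        used += cost

    return system + kept[::-1]            # restore chronological order
```

The point is just that pruning is selective (tool noise goes, conversational intent stays), unlike compaction, which summarizes everything.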

embedding-shape 4 hours ago|||
> It's an okay model. My biggest issue using GLM 5.1 in OpenCode is that it loses coherency over longer contexts

Since the entire purpose, focus and motivation of this model seems to have been "coherency over longer contexts", doesn't that issue make it not an OK model? It's bad at the thing it's supposed to be good at, no?

wolttam 4 hours ago||
long(er) contexts (than the previous model)

It does devolve into gibberish at long context (~120k+ tokens by my estimation but I haven't properly measured), but this is still by far the best bang-for-buck value model I have used for coding.

It's a fine model.

disiplus 2 hours ago|||
I have GLM and Kimi. Kimi was better in most cases and was my replacement for Claude when I ran out of tokens. Now I'm finding myself using GLM more than Kimi. It's funny that GLM vs Kimi is like Codex vs Claude, where GLM and Codex are better for backend and Kimi and Claude more for frontend.

As Kimi did a huge amount of Claude distillation, that seems to be somewhat backed by the data:

https://www.anthropic.com/news/detecting-and-preventing-dist...

verdverm 4 hours ago|||
Have you tried gemma4?

I'm curious how the bang for buck ratio works in comparison. My initial tests for coding tasks have been positive and I can run it at home. Bigger models I assume are still better on harder tasks.

whimblepop 4 hours ago|||
That's not much, at least for the way I'm currently using LLMs. I have them do some Nix work (both debugging and coding) where accuracy and quality matter to me, so they're instructed to behave as I would when it comes to docs, always consulting certain docs and source code in a specific order. It's not unusual for them to chew through 200k - 600k tokens in a single session before they solve everything I want them to. That's what I currently think of when I think of "long horizon within a single context window".

So I need them to not only not devolve into gibberish, but remain smart enough to be useful at contexts several times longer than that.

jauntywundrkind 4 hours ago|||
Chiming in to second this issue. It is wildly frustrating.

I suspect that this isn't the model, but something that z.ai is doing with hosting it. At launch I was relieved to find glm-5.1 was stable even as the context window filled all the way up (~200k). Whereas glm-5, while it could still talk and think, had forgotten the finer points of tool use to the point where it was making grievous errors as it went (burning gobs of tokens to fix duplicate code problems).

However, some really brutal changes happened sometime in the last two or three months: the parent problem emerged, and emerged hard, out of nowhere. Worse, for me, it seemed to hit around 60k context, which was heinous: I was honestly a bit despondent that my z.ai subscription had become so effectively useless, and that I could only work on small problems.

Thankfully the coherency barrier rose significantly around three weeks ago. It now seems to lose its mind and emit chaotic non-sentence gibberish around 100k for me. GLM-5 was already getting pretty shaky at that point, so I feel like I at least have some kind of parity. But glm-5 was at least speaking and thinking in real sentences, so I could keep conversing with it somewhat, whereas glm-5.1 seems to go from perfectly level-headed and working fine to total breakdown all of a sudden, a hard switch, at such a predictable context window size.

It seems so very probable to me that it isn't the model making this happen: it's the hosting. There's some KV cache issue, or they're trying to expand the context window in some way, or to switch from a small-context serving pool to a big-context one, or something infrastructure-wise that falls flat and collapses. Seeing the window so clearly change from 200k to 60k to 100k is both hope and misery.

I've been leaving some breadcrumbs on Bluesky as I go. It's been brutal to see. Especially having tasted a working glm-5.1. I don't super want to pay API rates to someone else, but I fully expect this situation to not reproduce on other hosting, and may well spend the money to try and see. https://bsky.app/profile/jauntywk.bsky.social/post/3mhxep7ek...

It's all such a shame, because aside from totally going mad and speaking unpunctuated gibberish, glm-5.1 is clearly very, very good and I trust it enormously.

throwdbaaway 26 minutes ago|||
https://github.com/THUDM/IndexCache - Might be an expected issue while rolling this out. They don't have enough compute, and have to innovate.
ummzokbro 1 hour ago||||
This.

GLM5 also had this issue. When it was free on OpenRouter / Kilo the model was rock solid, though it did degrade gracefully after 100k tokens. Same at launch with z.ai, aside from regular timeouts.

Somewhere around early-to-mid March, z.ai did something significant to GLM5 - like KV quanting, or model quanting, or both.

After that it's been Russian roulette. Sometimes it works flawlessly, but very often (1/4 or 1/5 of the time) thinking tokens spill into the main context, and if you don't spot it happening it can do real damage - heavily corrupting files, deleting whole directories.

You can see the pain by visiting the z.ai Discord - it's filled with reports of the issue, yet radio silence from z.ai.

Tellingly, despite the model being open source, not a single provider will sell you access at anything approaching the plans z.ai offers. The numbers just don't work, so your choice is either to pay significantly more per token and get reliability, or to put up with the bait and switch.

girvo 28 minutes ago||||
This doesn’t help you, but GLM-5 stays coherent far longer on Alibaba’s coding plan/infra. You can’t get that coding plan anymore though unfortunately!
esseph 3 hours ago|||
> "aside from totally going mad & speaking unpuncutaed gibberish [...] I trust it enormously."

The bar is very low :(

jauntywundrkind 3 hours ago||
I see where you are coming from.

But I used 70m tokens yesterday on glm-5.1 (thanks, GLM, for having good observability of your token usage, unlike OpenAI; dunno about Anthropic). And I got incredible, beautiful results that I super trust. It's done amazing work.

This limitation feels very shady and artificial to me, and I don't love it, but I also feel like I'm working somewhat effectively within the constraints. It does put a huge damper on people running more autonomous agentic systems, unless they have Pi or other systems that can more self-adaptively improve the harness.

HumanOstrich 3 hours ago|||
I wonder if running the compaction in a degraded state produces a subpar summary to continue with.
gunalx 53 minutes ago||
Indeed it does. Once I see the degraded state I revert to the last task and run a compact before starting up again.
azuanrb 4 hours ago|||
Have you compared it with using Claude Code as the harness? It performs much better than opencode for me.
nkko 3 hours ago||
[dead]
dang 4 hours ago||
[stub for offtopicness]

[[you guys, please don't post like this to HN - it will just irritate the community and get you flamed]]

smith7018 4 hours ago||
Hmm, three spam comments posted within 9 minutes of each other. The accounts were created 15 minutes ago, 51 days ago, and 3 months ago.

Interesting.

Hopefully these aren't bots created by Z.AI because GLM doesn't need fake engagement.

dang 4 hours ago|||
These comments are probably either by friends of the OP or perhaps associated with the project somehow, which is against HN's rules but not the kind of attack we're mostly concerned with these days. Old-fashioned voting rings and booster comments aren't existential threats and actually bring up somewhat nostalgic feelings at the moment!

Thanks for watching out for the quality of HN...

ray__ 4 hours ago||
Would love to read a Tell HN post about the kinds of attacks you are concerned with!
dang 6 minutes ago||
For example, there are rings of accounts posting generated comments, presumably in order to build karma for spammy or (let's be kind) promotional reasons. There are also plenty of spam rings that create tons of accounts and whatnot.

These are different from the submitter-passed-a-link-to-friends kind of upvoting and booster comments, which feel quaint by comparison. In this case people usually don't know they are breaking HN's rules, which is why they don't try to hide it.

tadfisher 4 hours ago||||
I moderate a medium-sized development subreddit. The sheer volume of spam advertising some AI SaaS company has skyrocketed over the past few months, like 10000%. Comment spam is now a service you can purchase [0][1], and I would not be surprised if Z.ai engaged some marketing firm which ended up purchasing this service.

There are YC members in the current batch who are spamming us right now [2]. They are all obvious engagement-bait questions which are conveniently answered with references to the SaaS.

[0]: https://www.reddit.com/r/DoneDirtCheap/comments/1n5gubz/get_...

[1]: https://www.reddit.com/r/AIJobs/comments/1oxjfjs/hiring_paid...

[2]: https://www.reddit.com/r/androiddev/comments/1sdyijs/no_code...

greenavocado 4 hours ago|||
Z.ai Discord is filled to the brim with people experiencing capacity issues. I had to cancel my subscription with Z.ai because the service was totally unusable. Their Discord is a graveyard of failures. I switched to Alibaba Cloud for GLM but now they hiked their coding plan to $50 a month which is 2.5x more expensive than ChatGPT Plus. Totally insane.
sourcecodeplz 3 hours ago||
Everyone has started either hiking their prices or limiting tokens; the gravy train is over. Glad we have open models that we can host; sad that RAM is so expensive...
zendi 5 hours ago|||
[flagged]
louszbd 5 hours ago|||
[flagged]
seven2928 5 hours ago||
[flagged]
EddyAI 2 hours ago||
[dead]
aplomb1026 4 hours ago||
[dead]
andrewmcwatters 4 hours ago||
[dead]
maxdo 2 hours ago|
One of the benchmaxxed models. Every time I tried it, it wasn't on par even with other open source models.