GLM-5.1: Towards Long-Horizon Tasks

Posted by zixuanlimit 3 hours ago

GLM-5.1: Towards Long-Horizon Tasks(z.ai)

273 points | 83 comments

Yukonv 3 hours ago|

Unsloth quantizations are available on release as well. [0] The IQ4_XS is a massive 361 GB with the 754B parameters. This is definitely a model your average local LLM enthusiast is not going to be able to run even with high end hardware.

[0] https://huggingface.co/unsloth/GLM-5.1-GGUF

zozbot234 1 hour ago|

SSD offload is always a possibility with good software support. Of course you might easily object that the model would not be "running" then, more like crawling. Still you'd be able to execute it locally and get it to respond after some time.

Meanwhile we're even seeing emerging 'engram' and 'inner-layer embedding parameters' techniques where the possibility of SSD offload is planned for in advance when developing the architecture.

adrian_b 15 minutes ago||

For conversational purposes that may be too slow, but as a coding assistant this should work, especially if many tasks are batched, so that they may progress simultaneously through a single pass over the SSD data.

alex7o 2 hours ago||

To be honest I am a bit sad as, glm5.1 is producing mich better typescript than opus or codex imo, but no matter what it does sometimes go into shizo mode at some point over longer contexts. Not always tho I have had multiple session go over 200k and be fine.

InsideOutSanta 11 minutes ago||

I just set the context window to 100k and manage it actively (e.g. I compact it regularly or make it write out documentation of its current state and start a new session).

For me, Opus 4.6 isn't working quite right currently, and I often use GLM 5.1 instead. I'd prefer to use peak Opus over GLM 5.1, but GLM 5.1 is an adequate fallback. It's incredible how good open-weight models have gotten.

disiplus 1 hour ago|||

When it works and its not slow it can impress. Like yesterday it solved something that kimi k2.5 could not. and kimi was best open source model for me. But it still slow sometimes. I have z.ai and kimi subscription when i run out of tokens for claude (max) and codex(plus).

i have a feeling its nearing opus 4.5 level if they could fix it getting crazy after like 100k tokens.

MegagramEnjoyer 2 hours ago|||

Why is that sad? A free and open source model outperforming their closed source counterparts is always a win for the users

KaoruAoiShiho 1 hour ago||

The non-awesome context window is the sad part, but I think a better harness can deal with this.

DeathArrow 1 hour ago|||

After the context gets to 100k tokens you should open a new session or run /compact.

varispeed 1 hour ago|||

Isn't the same with opus nowadays?

cmrdporcupine 1 hour ago||

I honestly still hold onto habits from earlier days of Claude & Codex usage and tend to wipe / compact my context frequently. I don't trust the era of big giant contexts, frankly, even on the frontier models.

calgoo 59 minutes ago||

I also feel like its helping me on the big models these days with claude giving so many issues.

johnfn 1 hour ago||

GLM-5.0 is the real deal as far as open source models go. In our internal benchmarks it consistently outperforms other open source models, and was on par with things like GPT-5.2. Note that we don't use it for coding - we use it for more fuzzy tasks.

sourcecodeplz 1 hour ago||

Yep, haven't tried 5.1 but for my PHP coding, GLM-5 is 99% the same as Sonnet/Opus/GPT-5 levels. It is unbelievably strong for what it costs, not to mention you can run it locally.

deepsquirrelnet 1 hour ago||

I am working on a large scale dataset for producing agent traces for Python <> cython conversion with tooling, and it is second only to gemini pro 3.1 in acceptance rates (16% vs 26%).

Mid-sized models like gpt-oss minimax and qwen3.5 122b are around 6%, and gemma4 31b around 7% (but much slower).

I haven’t tried Opus or ChatGPT due to high costs on openrouter for this application.

kamranjon 25 minutes ago||

I'm crossing my fingers they release a flash version of this. GLM 4.7 Flash is the main model I use locally for agentic coding work, it's pretty incredible. Didn't find anything in the release about it - but hoping it's on the horizon.

winterqt 1 hour ago||

Comments here seem to be talking like they've used this model for longer than a few hours -- is this true, or are y'all just sharing your initial thoughts?

stavros 1 hour ago||

My local tennis court's reservation website was broken and I couldn't cancel a reservation, and I asked GLM-5.1 if it can figure out the API. Five minutes later, I check and it had found a /cancel.php URL that accepted an ID but the ID wasn't exposed anywhere, so it found and was exploiting a blind SQL injection vulnerability to find my reservation ID.

Overeager, but I was really really impressed.

disiplus 53 minutes ago|||

Yeah it seems they did not align it to much, at least for now. Yesterday it helped me bypass the bot detection on a local marketplace. that i wanted to scrap some listing for my personal alerting system. Al the others failed but glm5.1 found a set of parameters and tweaks how to make my browser in container not be detected.

mikkupikku 17 minutes ago||||

Unfathomably based.

arcanemachiner 1 hour ago||||

That is both amazing and terrifying.

bglazer 1 hour ago|||

This is insane, I love it.

KaoruAoiShiho 1 hour ago|||

Blog post is new but the model is about 2 weeks in public.

BeetleB 1 hour ago||

It's been out for a while.

minimaxir 1 hour ago||

The focus on the speed of the agent generated code as a measure of model quality is unusual and interesting. I've been focusing on intentionally benchmaxxing agentic projects (e.g. "create benchmarks, get a baseline, then make the benchmarks 1.4x faster or better without cheating the benchmarks or causing any regression in output quality") and Opus 4.6 does it very well: in Rust, it can find enough low-level optimizations to make already-fast Rust code up to 6x faster while still passing all tests.

It's a fun way to quantify the real-world performance between models that's more practical and actionable.

RickHull 2 hours ago||

I am on their "Coding Lite" plan, which I got a lot of use out of for a few months, but it has been seriously gimped now. Obvious quantization issues, going in circles, flipping from X to !X, injecting chinese characters. It is useless now for any serious coding work.

InsideOutSanta 6 minutes ago||

My impression is that different users get vastly different service, possibly based on location. I live in Western Europe, and it works perfectly for me. Never had a single timeout or noticeable quality degradation. My brother lives in East Asia, and it's unusable for him. Some days, it just literally does not work, no API calls are successful. Other days, it's slow or seems dumber than it should be.

unicornfinder 2 hours ago|||

I'm on their pro plan and I respectfully disagree - it's genuinely excellent with GLM 5.1 so long as you remember to /compact once it hits around 100k tokens. At that point it's pretty much broken and entirely unusable, but if you keep context under about 100k it's genuinely on par with Opus for me, and in some ways it's arguably better.

airstrike 2 hours ago|||

100k tokens it's basically nothing these days. Claude Opus 4.6M with 1M context windows is just a different ball game

wild_egg 33 minutes ago|||

The Dumb Zone for Opus has always started at 80-100k tokens. The 1M token window just made the dumb zone bigger. Probably fine if the work isn't complicated but really I never want an Opus session to go much beyond 100k.

plandis 1 hour ago||||

Claude Opus can use a 1M context window but I’ve found it to degrade significantly past 250k in practice.

bredren 1 hour ago||||

I had thought this, but my experience initially was that performance degradation began getting noticeable not long after crossing the old 250k barrier.

So, it has been convenient to not have hard stops / allow for extra but I still try to /clear at an actual 25% of the 1M anyhow.

This is in contrast to my use of the 1M opus model this past fall over the API, which seemed to perform more steadily.

braebo 1 hour ago||||

The cost per message increases with context while quality decreases so it’s still generally good to practice strategic context engineering. Even with cross-repo changes on enterprise systems, it’s uncommon to need more than 100k (unless I’m using playwright mcp for testing).

syntaxing 1 hour ago||||

I’m genuinely surprised. I use copilot at work which is capped at 128K regardless of model and it’s a monorepo. Admittedly I know our code base really well so I can point towards different things quickly directly but I don’t think I ever needed compacting more than a handful in the past year. Let alone 1M tokens.

operatingthetan 1 hour ago||||

The context windows of these Chinese open-source subscriptions (GLM, Minimax, Kimi) is too small and I'm guessing it's because they are trying to keep them cheap to run. Fine for openclaw, not so much for coding.

arcanemachiner 1 hour ago||||

Personal opinions follow:

Claude Opus at 150K context starts getting dumber and dumber.

Claude Opus at 200K+ is mentally retarded. Abandon hope and start wrapping up the session.

thawab 1 hour ago|||

Don’t want to disappoint you, but above 200k opus memory is like a gold fish. You need to be below 150k to get good research and implementation.

arcanemachiner 1 hour ago||

Oh nice, I just wrote pretty much the same comment above yours.

kay_o 2 hours ago|||

Is manual compation absolutely mandatory ?

jauntywundrkind 1 hour ago||

I haven't screenshotted to alas, but it goes from being a perfectly reasonable chatty LLM, to suddenly spewing words and nonsense characters around this threshold, at least for me as a z.ai pro (mid tier) user.

For around a month the limit seemed to be a little over 60k! I was despondent!!

What's worse is that when it launched it was stable across the context window. My (wild) guess is that the model is stable but z.ai is doing something wonky with infrastructure, that they are trying to move from one context window to another or have some kv cache issues or some such, and it doesn't really work. If you fork or cancel in OpenCode there's a chance you see the issue much earlier, which feels like some other kind of hint about kv caching, maybe it not porting well between different shaped systems.

More maliciously minded, this artificial limit also gives them an artificial way to dial in system load. Just not delivering the context window the model has reduces the work of what they have to host?

But to the question: yes compaction is absolutely required. The ai can't even speak it's just a jumbled stream of words and punctuation once this hits. Is manual compaction required? One could find a way to build this into the harness, so no, it's a limitation of our tooling that our tooling doesn't work around the stated context window being (effectively) a lie.

I'd really like to see this improved! At least it's not 60-65k anymore; those were soul crushing weeks, where I felt like my treasured celebrated joyful z.ai plan was now near worthless.

There's a thread https://news.ycombinator.com/item?id=47678279 , and I have more extensive history / comments on what I've seen there.

The question is: will this reproduce on other hosts, now that glm-5.1 is released? I expect the issue is going to be z.ai specific, given what I've seen (200k works -> 60k -> 100k context windows working on glm-5.1).

calgoo 1 hour ago|||

I have gone back to having it create a todo.md file and break it into very small tasks. Then i just loop over each task with a clear context, and it works fine. a design.md or similar also helps, but most of the time i just have that all in a README.md file. I was also suspicious around the 100k almost to the token for it to start doing loops etc.

disiplus 58 minutes ago|||

basically my expirience as well. Sometimes it can break past 100k and be ok, but mostly it breaks down.

kay_o 2 hours ago|||

I am on the mid tier Coding plan to trying it out for the sake of curiosity.

During off peak hour a simple 3 line CSS change took over 50 minutes and it routinely times out mid-tool and leaves dangling XML and tool uses everywhere, overwriting files badly or patching duplicate lines into files

harias 20 minutes ago||

Off peak for China or US

kay_o 19 minutes ago||

Off peak for China. Off peak times are only in one timezone

satvikpendem 2 hours ago|||

Every model seems that way, going back to even GPT 3 and 4, the company comes out with a very impressive model that then regresses over a few months as the company tries to rein in inference costs through quantization and other methods.

wolttam 2 hours ago|||

This is surprising to me. Maybe because I'm on Pro, and not Lite. I signed up last week and managed to get a ton of good work done with 5.1. I think I did run into the odd quantization quirk, but overall: $30 well spent

Mashimo 2 hours ago|||

I'm also on the lite plan and have been using 5.1 for a few days now. It works fine for me.

But it's all casual side projects.

Edit: I often to /compact at around 100 000 token or switch to a new session. Maybe that is why.

LaurensBER 1 hour ago|||

I'm on their lite plan as well and I've been using it for my OpenClaw. It had some issues but it also one-shotted a very impressive dashboard for my Twitter bookmarks.

For the price this is a pretty damn impressive model.

cmrdporcupine 1 hour ago|||

Is there any advantage to their fixed payment plans at all vs just using this model via 3rd party providers via openrouter, given how relatively cheap they tend to be on a per-token basis?

Providers like DeepInfra are already giving access to 5.1 https://deepinfra.com/zai-org/GLM-5.1

$1.40 in $4.40 out $0.26 cached

/ 1M tokens

That's more expensive than other models, but not terrible, and will go down over time, and is far far cheaper than Opus or Sonnet or GPT.

I haven't had any bad luck with DeepInfra in particular with quantization or rate limiting. But I've only heard bad things about people who used z.ai directly.

esafak 1 hour ago|||

I'm on their Lite plan and I see some of this too. It is also slow. I use it as a backup.

benterix 2 hours ago|||

> Obvious quantization issues

Devil's advocate: why shouldn't they do it if OpenAI, Anthropic and Google get away with playing this game?

margorczynski 2 hours ago||

It has been useless for long time when compared to Opus or even something like Kimi. The saving grace was that it was dirt cheap but that doesn't matter if it can't do what I want even after many repeated tries and trying to push it to a correct solution.

kirby88 2 hours ago||

I wonder how that compare to harness methods like MAKER https://www.cognizant.com/us/en/ai-lab/blog/maker

tgtweak 1 hour ago||

Share the harness for that browser linux OS task :)

DeathArrow 1 hour ago|

I am already subscribed to their GLM Coding Pro monthly plan and working with GLM 5.1 coupled with Open Code is such a pleasure! I will cancel my Cursor subscription.

More comments...