Posted by scrlk 1/19/2026

GLM-4.7-Flash(huggingface.co)
378 points | 135 comments
dajonker 1/19/2026|
Great, I've been experimenting with OpenCode and running local 30B-A3B models on llama.cpp (4-bit) on a 32 GB GPU, so there's plenty of VRAM left for 128k context. So far Qwen3-Coder gives me the best results. Nemotron 3 Nano is supposed to benchmark better, but it doesn't really show for the kind of work I throw at it, mostly "write tests for this and that method which are not covered yet". Will give this a try once someone has quantized it in ~4-bit GGUF.
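
A minimal llama-server sketch of that kind of setup; the GGUF filename here is only illustrative of a Qwen3-Coder quant:

    # fully offload a ~4-bit 30B-A3B quant, leaving VRAM for 128k context
    llama-server -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
        -ngl 999 --ctx-size 131072 --port 8080
OpenCode (or any other client) can then point at the OpenAI-compatible endpoint on that port.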

Codex is notably higher quality but also has me waiting forever. Hopefully these small models get better and better, not just at benchmarks.

dajonker 1/20/2026||
Update: I'm experiencing issues with OpenCode and this model. I have built the latest llama.cpp and followed the Unsloth guide, but it's not usable at the moment:

- Tool calling doesn't work properly with OpenCode

- It repeats itself very quickly. This is addressed in the Unsloth guide and can be "fixed" by setting --dry-multiplier to 1.1 or higher (sketch below)

- It makes a lot of spelling errors such as replacing class/file name characters with "1". Or when I asked it to check AGENTS.md it tried to open AGANTS.md

I tried both the Q4_K_XL and Q5_K_XL quantizations and they both suffer from these issues.
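
For anyone reproducing this, the DRY workaround from the Unsloth guide is roughly the following sketch (quant filename illustrative; --dry-multiplier is llama-server's DRY sampling flag):

    # raise the DRY multiplier to suppress the repetition loops
    llama-server -m GLM-4.7-Flash-UD-Q4_K_XL.gguf \
        -ngl 999 --ctx-size 32768 --dry-multiplier 1.1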

eblanshey 1/22/2026||
There is a new update on HF:

> Jan 21 update: llama.cpp fixed a bug that caused looping and poor outputs. We updated the GGUFs - please re-download the model for much better outputs.

dajonker 1/26/2026|||
Yes! This update works great. Seems to be pretty good at first glance. I'll have to set up an interesting task and see how different models approach the problem.
philippelh 1/23/2026|||
After re-downloading the model, do not use --dry-multiplier... and also, don't ask me how I know...
latchkey 1/19/2026|||
https://huggingface.co/unsloth/GLM-4.7-GGUF

This user has also done a bunch of good quants:

https://huggingface.co/0xSero

WanderPanda 1/19/2026|||
I find it hard to trust post-training quantizations. Why don't they run benchmarks to see the degradation in performance? It sketches me out, because automatically running a suite of benchmarks should be the easiest thing to do.
Miraste 1/19/2026||
Unsloth doesn't seem to do this for every new model, but they did publish a report on their quant methods and the performance loss they cause.

https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs

It isn't much until you get down to very small quants.

dajonker 1/19/2026|||
Yes, I usually run Unsloth models; however, you are now linking to the big model (355B-A32B), which I can't run on my consumer hardware.

The flash model in this thread is more than 10x smaller (30B).

a_e_k 1/19/2026|||
When the Unsloth quant of the flash model does appear, it should show up as unsloth/... on this page:

https://huggingface.co/models?other=base_model:quantized:zai...

Probably as:

https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
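
Once it's up, grabbing a single quant file rather than the whole repo should look something like this (repo and file names are guesses, since the quant doesn't exist yet):

    # hypothetical: download just one quant once the repo appears
    huggingface-cli download unsloth/GLM-4.7-Flash-GGUF \
        GLM-4.7-Flash-UD-Q4_K_XL.gguf --local-dir .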

homarp 1/19/2026|||
It's a new architecture, not yet implemented in llama.cpp.

issue to follow: https://github.com/ggml-org/llama.cpp/issues/18931

dumbmrblah 1/19/2026|||
One thing to consider is that this version is a new architecture, so it’ll take time for llama.cpp to get updated. Similar to how it was with Qwen Next.
cristoperb 1/19/2026||
Apparently it is the same as the DeepseekV3 architecture and already supported by llama.cpp once the new name is added. Here's the PR: https://github.com/ggml-org/llama.cpp/pull/18936
khimaros 1/20/2026||
has been merged
latchkey 1/19/2026|||
There are a bunch of 4-bit quants in the GGUF link, and 0xSero has some smaller stuff too. Might still be too big, though, and you'll need to un-GPU-poor yourself.
disiplus 1/19/2026||
Yeah, there's no way to run 4.7 on 32 GB of VRAM. This Flash version is something I'm also waiting to try later tonight.
omneity 1/19/2026||
Why not? Run it with vLLM latest and enable 4bit quantization with bnb, and it will quantize the original safetensors on the fly and fit your vram.
disiplus 1/19/2026||
Because of how huge GLM 4.7 is: https://huggingface.co/zai-org/GLM-4.7
omneity 1/19/2026||
Except this is GLM 4.7 Flash, which has 32B total params, 3B active. It should fit with a decent context window of 40k or so in 20 GB of RAM at 4-bit weight quantization, and you can save even more by quantizing the activations and KV cache to 8-bit.
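
A rough sketch of that, assuming vLLM's bitsandbytes path handles this architecture (model id and limits are guesses):

    # on-the-fly 4-bit bnb quantization of the original safetensors,
    # plus an fp8 KV cache to save more memory
    vllm serve zai-org/GLM-4.7-Flash \
        --quantization bitsandbytes --kv-cache-dtype fp8 \
        --max-model-len 40960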
disiplus 1/19/2026||
Yes, but the parent link was to the big GLM 4.7, which had a bunch of GGUFs. The new one did not at the point of posting, nor does it now. I'm waiting for the Unsloth guys for 4.7 Flash.
behnamoh 1/19/2026||
> Codex is notably higher quality but also has me waiting forever.

And while it usually leads to higher-quality output, sometimes it doesn't, and I'm left with BS AI slop that would have taken Opus just a couple of minutes to generate anyway.

polyrand 1/19/2026||
I've been using z.ai models through their coding plan (incredible price/performance ratio), and since GLM-4.7 I'm even more confident with the results it gives me. I use it both with regular claude-code and opencode (more opencode lately, since claude-code is obviously designed to work much better with Anthropic models).

Also notice that this is the "-Flash" version. They were previously at 4.5-Flash (they skipped 4.6-Flash). This is supposed to be equivalent to Haiku. Even on their coding plan docs, they mention this model is supposed to be used for `ANTHROPIC_DEFAULT_HAIKU_MODEL`.

RickHull 1/19/2026||
Same, I got 12 months of subscription for $28 total (promo offer), with 5x the usage limits of the $20/month Claude Pro plan. I have only used it with claude code so far.
theshrike79 1/21/2026|||
This offer was so stupidly cheap there was no point in NOT getting it :D
stogot 1/19/2026|||
Do they still have that promo offer?
Mashimo 1/19/2026|||
Looks like they have something for 29 USD with 3x the claude code usage: https://z.ai/subscribe
victorbjorklund 1/19/2026||
How has the performance been lately? I heard some people say they changed their limits, likely making it almost unusable.
chewz 1/19/2026||
Never had any problems with Z.ai models.

However, they use more thinking internally, and that makes them seem slow.

vessenes 1/19/2026||
Looks like solid incremental improvements. The UI oneshot demos are a big improvement over 4.6. Open models continue to lag roughly a year on benchmarks; pretty exciting over the long term. As always, GLM is really big - 355B parameters with 31B active, so it’s a tough one to self-host. It’s a good candidate for a cerebras endpoint in my mind - getting sonnet 4.x (x<5) quality with ultra low latency seems appealing.
HumanOstrich 1/19/2026||
I tried Cerebras with GLM-4.7 (not Flash) yesterday using paid API credits ($10). They have rate limits per-minute and it counts cached tokens against it so you'll get limited in the first few seconds of every minute, then you have to wait the rest of the minute. So they're "fast" at 1000 tok/sec - but not really for practical usage. You effectively get <50 tok/sec with rate limits and being penalized for cached tokens.

They also charge full price for the same cached tokens on every request/response, so I burned through $4 for 1 relatively simple coding task - would've cost <$0.50 using GPT-5.2-Codex or any other model besides Opus and maybe Sonnet that supports caching. And it would've been much faster.

twalla 1/19/2026|||
I hope cerebras figures out a way to be worth the premium - seeing two pages of written content output in the literal blink of an eye is magical.
mlyle 1/19/2026||||
The pay-per-use API sucks. If you end up on the $50/mo plan, it's better, with caveats:

1 million tokens per minute, 24 million tokens per day. BUT: cached tokens count in full, so if you have 100,000 tokens of context you can burn a minute's worth of tokens in just a few requests.

solarkraft 1/20/2026|||
It’s wild that cached tokens count full - what’s in it for you to care about caching at all then? Is the processing speed gain significant?
mlyle 1/20/2026||
Not really worth it, in general. It does reduce latency a little. In practice, you do have a continuing context, though, so you end up using it whether you care or not.
indigodaddy 1/21/2026|||
Try a nano-gpt subscription. Not going to be as fast as cerebras obviously but it's $8/mo for 60,000 requests
Miraste 1/19/2026||||
I wonder why they chose per minute? That method of rate limiting would seem to defeat their entire value proposition.
p91paul 1/20/2026||
In general, with per minute rate limiting you limit load spikes, and load spikes are what you pay for: they force you to ramp up your capacity, and usually you are then slow to ramp down to avoid paying the ramp up cost too many times. A VM might boot relatively fast, but loading a large model into GPU memory takes time.
cmrdporcupine 1/20/2026||||
I use GLM 4.7 with DeepInfra.com and it's extremely reasonable, though maybe a bit on the slower side. But faster than DeepSeek 3.2 and about the same quality.

It's even cheaper to just use it through z.ai themselves I think.

Imustaskforhelp 1/19/2026|||
I know this might not be the most effective use case, but I ended up using the "try AI" feature in Cerebras, which opens a window in the browser.

Yes, it has some restrictions, but it still works for free. I have a private repository where I set up a Puppeteer instance so that I can input something in a CLI and get the output back in the CLI as well.

With current agents, I don't see why I couldn't just expand that with a cheap model (I think MiniMax 2.1 is pretty good for agents) and have the agent write the files, do the work, and run in a loop.

I think the repository might have gotten deleted after I reset my old system, but I can look for it if this interests you.

Cerebras is such a good company. I talked to their CEO on Discord once and have been following them for over a year or two now. I hope they don't get enshittified by the recent OpenAI deal and that they improve their developer experience, because people want to pay them; for now I resorted to a shenanigan that was free (though honestly I was mostly curious whether the Puppeteer idea was even possible, and I didn't really use it much after building it).

pseudony 1/19/2026|||
I hear this said, but never substantiated. Indeed, I think our big issue right now is making actual benchmarks relevant to our own workloads.

Due to US foreign policy, I quit Claude yesterday and picked up MiniMax M2.1. We wrote a whole design spec for a project I've previously written a spec for with Claude (with some changes to the architecture this time; adjacent, not the same).

My gut feel? I prefer MiniMax M2.1 with OpenCode to Claude. Easiest boycott ever.

(I even picked the 10 USD plan; it was fine for now.)

Workaccount2 1/19/2026|||
Unless one of the open-model labs has a breakthrough, they will always lag. Their main trick is distilling the SOTA models.

People talk about these models like they are "catching up"; they don't see that they are just trailers hooked up to a truck that's pulling them along.

runako 1/19/2026|||
FWIW this is what Linux and the early open-source databases (e.g. PostgreSQL and MySQL) did.

They usually lagged for large sets of users: Linux was not as advanced as Solaris, PostgreSQL lacked important features contained in Oracle. The practical effect of this is that it puts the proprietary implementation on a treadmill of improvement where there are two likely outcomes: 1) the rate of improvement slows enough to let the OSS catch up or 2) improvement continues, but smaller subsets of people need the further improvements so the OSS becomes "good enough." (This is similar to how most people now do not pay attention to CPU speeds because they got "fast enough" for most people well over a decade ago.)

weslleyskah 1/19/2026||
You know, this is also the case of Proxmox vs. VMWare.

Proxmox became good and reliable enough as an open-source alternative for server management. Especially for the Linux enthusiasts out there.

irthomasthomas 1/19/2026||||
Deepseek 3.2 scores gold at IMO and others. Google had to use parallel reasoning to do that with gemini, and the public version still only achieves silver.
skrebbel 1/19/2026|||
How does this work? Do they buy lots of openai credits and then hit their api billions of times and somehow try to train on the results?
g-mork 1/19/2026|||
Don't forget the plethora of middleman chat services with liberal logging policies. I've no doubt there's a whole subindustry lurking in here.
skrebbel 1/19/2026||
I wasn't judging, I was asking how it works. Why would OpenAI/Anthropic/Google let a competitor scrape their results in sufficient amounts to train their own thing?
victorbjorklund 1/19/2026||
I think the point is that they can't really stop it. Let's say I purchase API credits and then resell them to DeepSeek.

That's going to be pretty hard for OpenAI to figure out, and even if they do figure it out and stop me, there will be thousands of other companies willing to do that arbitrage. (Just for the record, I'm not doing this, but I'm sure people are.)

They would need to be very restrictive about who is allowed to use the API, and that would kill their growth, because then customers would just go to Google or another provider that is less restrictive.

skrebbel 1/20/2026||
Yeah but are we all just speculating or is it accepted knowledge that this is actually happening?
sally_glance 1/20/2026|||
Speculation, I think, because for one, those supposed proxy providers would have to offer some kind of pricing advantage over the original provider. Maybe I missed them, but where are the X0% cheaper SOTA model proxies?

Number two, I'm not sure random samples collected across even a moderately large number of users make a great base of training examples for distillation. I would expect they need more focused samples over very specific areas to achieve good results.

skrebbel 1/20/2026||
Thanks. In that case my conclusion is that all the people saying these models are "distilling SOTA models" are, by extension, also speculating. How can you distill what you don't have?
sally_glance 1/20/2026||
The only way I can think of is paying to synthesize training data with SOTA models yourself. But yeah, I'm not aware of anyone publicly sharing that they did, so it's also speculation.

The economics probably work out, though; collecting, cleaning and preparing original datasets is very cumbersome.

What we do know for sure is that the SOTA providers are distilling their own models; I remember reading about this at least for Gemini (Flash is distilled) and Meta.

mike_hearn 1/20/2026|||
OpenAI implemented ID verification for their API at some point and I think they stated that this was the reason.
behnamoh 1/19/2026|||
> The UI oneshot demos are a big improvement over 4.6.

This is a terrible "test" of model quality. All these models fail when your UI is out of distribution; Codex gets close but still fails.

mckirk 1/19/2026|||
Note that this is the Flash variant, which is only 31B parameters in total.

And yet, in terms of coding performance (at least as measured by SWE-Bench Verified), it seems to be roughly on par with o3/GPT-5 mini, which would be pretty impressive if it translated to real-world usage, for something you can realistically run at home.

ttoinou 1/19/2026|||
Sonnet was already very good a year ago; are the open-weights models right now as good?
jasonjmcghee 1/19/2026|||
Fwiw Sonnet 4.5 is very far ahead of where sonnet was a year ago
cmrdporcupine 1/20/2026|||
From my experience, Kimi K2, GLM 4.7 (not flash, full), Mistral Large 3, and DeepSeek are all about Sonnet 4 level. I prefer GLM of the bunch.

If you were happy with Claude at its Sonnet 3.7 & 4 levels 6 months ago, you'll be fine with them as a substitute.

But they're nowhere near Opus 4.5

montroser 1/19/2026||
> SWE-bench Verified 59.2

This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.

achierius 1/19/2026||
I think most have moved past SWE-Bench Verified as a benchmark worth tracking -- it only tracks a few repos, contains only a small number of languages, and probably more importantly papers have come out showing a significant degree of memorization in current models, e.g. models knowing the filepath of the file containing the bug when prompted only with the issue description and without having access to the actual filesystem. SWE-Bench Pro seems much more promising though doesn't avoid all of the problems with the above.
robbies 1/19/2026||
What do you like to use instead? I’ve used the aider leaderboard a couple times, but it didn’t really stick with me
NitpickLawyer 1/19/2026|||
swe-REbench is interesting. The "RE" stands for re-testing after the models were launched. They periodically gather new issues from live repos on github, and have a slider where you can see the scores for all issues in a given interval. So if you wait ~2 months you can see how the models perform on new (to them) real-world issues.

It's still not as accurate as benchmarks on your own workflows, but it's better than the original benchmark. Or any other public benchmarks.

khimaros 1/20/2026|||
Terminal Bench 2.0
primaprashant 1/20/2026||
You should check out Devstral 2 Small [1]. It's 24B and scores 68.0% on SWE-bench Verified.

[1]: https://mistral.ai/news/devstral-2-vibe-cli

Palmik 1/20/2026|||
To be clear, GLM 4.7 Flash is MoE with 30B total params but <4B active params. While Devstral Small is 24B dense (all params active, all the time). GLM 4.7 Flash is much much cheaper, inference wise.
dajonker 1/20/2026|||
I don't know whether it just doesn't work well in GGUF / llama.cpp + OpenCode, but I can't get anything useful out of Devstral 2 24B running locally. Probably a skill issue on my end, but I'm not very impressed. Benchmarks are nice, but they don't always translate to real-life usefulness.
bilsbie 1/19/2026||
What’s the significance of this for someone out of the loop?
epolanski 1/19/2026|
You can run GPT-5-mini-level AI on your MacBook with 32 GB of RAM.

You can get LLMs as a service for cheaper.

E.g., this model costs less than a tenth of Haiku 4.5.

baranmelik 1/19/2026||
For anyone who’s already running this locally: what’s the simplest setup right now (tooling + quant format)? If you have a working command, would love to see it.
johndough 1/19/2026||
I've been running it with llama-server from llama.cpp (compiled for CUDA backend, but there are also prebuilt binaries and instructions for other backends in the README) using the Q4_K_M quant from ngxson on Lubuntu with an RTX 3090:

https://github.com/ggml-org/llama.cpp/releases

https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/blob/main/G...

https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#sup...

    llama-server -ngl 999 --ctx-size 32768 -m GLM-4.7-Flash-Q4_K_M.gguf
You can then chat with it at http://127.0.0.1:8080 or use the OpenAI-compatible API at http://127.0.0.1:8080/v1/chat/completions
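
For example, a minimal request against that endpoint (llama-server hosts a single model, so no model field should be needed):

    curl http://127.0.0.1:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"messages": [{"role": "user", "content": "Write hello world in C"}]}'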

Seems to work okay, but there usually are subtle bugs in the implementation or chat template when a new model is released, so it might be worthwhile to update both model and server in a few days.

mistercheph 1/19/2026||
I think the recently introduced -fit option, which is on by default, means it's no longer necessary to pass -ngl. You can also probably drop -c, which is "0" by default and reads metadata from the GGUF to get the model's advertised context size.
johndough 1/19/2026||
I had already removed three parameters which were no longer needed, but I hadn't yet heard that the other two had also become superfluous. Thank you for the update! llama.cpp sure develops quickly.
ljouhet 1/19/2026|||
Something like

    ollama run hf.co/ngxson/GLM-4.7-Flash-GGUF:Q4_K_M
It's really fast! But, for now it outputs garbage because there is no (good) template. So I'll wait for a model/template on ollama.com
jmorgan 1/19/2026||
It's available (with tool parsing, etc.): https://ollama.com/library/glm-4.7-flash but requires 0.14.3 which is in pre-release (and available on Ollama's GitHub repo)
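
Presumably then it's just the following once you're on the pre-release build (name taken from the library page, so unverified):

    ollama run glm-4.7-flash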
zackify 1/19/2026|||
LM Studio: search for 4.7-flash and install from the mlx-community.
pixelmelt 1/19/2026||
I would look into running a 4 bit quant using llama cpp (or any of its wrappers)
cmrdporcupine 1/20/2026||
On my ASUS GB10 (like the NVIDIA Spark), prompting it to write a Fibonacci function in Scala:

HEAD of Ollama with Q8_0 vs. vLLM with BF16 and FP8.

BF16 was predictably bad. Surprised FP8 performed so poorly, but I might not have things tuned that well. New at this.

  ┌─────────┬───────────┬──────────┬───────────┐
  │         │ vLLM BF16 │ vLLM FP8 │ Ollama Q8 │
  ├─────────┼───────────┼──────────┼───────────┤
  │ Tok/sec │ 13-17     │ 11-19    │ 32        │
  ├─────────┼───────────┼──────────┼───────────┤
  │ Memory  │ ~62GB     │ ~28GB    │ ~32GB     │
  └─────────┴───────────┴──────────┴───────────┘
Most importantly, it actually worked nicely in OpenCode, which I couldn't get Nemotron to do.
montroser 1/19/2026||
This is their blurb about the release:

    We’ve launched GLM-4.7-Flash, a lightweight and efficient model designed as the free-tier version of GLM-4.7, delivering strong performance across coding, reasoning, and generative tasks with low latency and high throughput.

    The update brings competitive coding capabilities at its scale, offering best-in-class general abilities in writing, translation, long-form content, role play, and aesthetic outputs for high-frequency and real-time use cases.
https://docs.z.ai/release-notes/new-released
z2 1/19/2026||
The two notes from this year are accidentally marked as 2025; the website posts may actually be hand-crafted.
linolevan 1/19/2026||
Tried it within LM Studio on my M4 MacBook Pro – it feels dramatically worse than gpt-oss-20b. Of the two (code) prompts I've tried so far, it started spitting out invalid code and got stuck in a repeating loop for both. It's possible that LM Studio quantizes the model in such a manner that it explodes, but so far not a great first impression.
tgtweak 1/19/2026|
Are you using the full BF16 model or a quantized mlx4?
linolevan 1/20/2026||
Not sure what the default is – whatever that was. It's probably the quantized mlx4 if I had to guess.
esafak 1/19/2026|
When I want fast I reach for Gemini, or Cerebras: https://www.cerebras.ai/blog/glm-4-7

GLM 4.7 is good enough to be a daily driver but it does frustrate me at times with poor instruction following.

mgambati 1/19/2026|
Good instruction following is the number one thing for me that makes Opus 4.5 so good. Hope the next release improves this.