
Posted by scrlk 1/19/2026

GLM-4.7-Flash (huggingface.co)
378 points | 135 comments
andhuman 1/19/2026|
Gave it four of my vibe questions around general knowledge and it didn’t do great. Maybe expected with a model as small as this one. Once support in llama.cpp is out I will take it for a spin.
XCSme 1/19/2026||
Seems to be marginally better than gpt-20b, but this is 30b?
strangescript 1/19/2026||
I find gpt-oss 20b very benchmaxxed and as soon as a solution isn't clear it will hallucinate.
blurbleblurble 1/19/2026||
Every time I've tried to actually use gpt-oss 20b it's just gotten stuck in weird feedback loops reminiscent of the time when HAL got shut down back in the year 2001. And these are very simple tests, e.g. trying to get it to check today's date from the time tool so it fetches more recent search results from the arxiv tool.
lostmsu 1/19/2026||
It actually seems worse. gpt-oss-20b is only 11 GB because it is prequantized in mxfp4, while GLM-4.7-Flash is 62 GB. In that sense GLM is closer to, and actually slightly larger than, gpt-oss-120b, which is 59 GB.
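
A rough sanity check on those size numbers: on-disk size is roughly parameter count times bits per weight. A minimal sketch, where the parameter counts are approximations, mxfp4 is treated as ~4.25 bits/weight including block scales, and the ~30B figure for GLM-4.7-Flash is the one floated upthread:

  # Back-of-the-envelope checkpoint sizes in GB: params * bits_per_weight / 8.
  # All inputs here are approximations, not official figures.
  def approx_gb(params_billion: float, bits_per_weight: float) -> float:
      return params_billion * bits_per_weight / 8

  print(approx_gb(21, 4.25))   # gpt-oss-20b, mxfp4   -> ~11 GB
  print(approx_gb(117, 4.25))  # gpt-oss-120b, mxfp4  -> ~62 GB
  print(approx_gb(30, 16))     # GLM-4.7-Flash, bf16  -> ~60 GB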

Also, on SWE-Bench Verified the gpt-oss model card puts 20b at 60.7 (GLM claims they measured 34 for that model) and 120b at 62.7, whereas GLM reports 59.7.

aziis98 1/19/2026||
I hope we get some good A1B models, as I'm currently GPU poor and can only do inference on CPU for now.
yowlingcat 1/19/2026|
It may be worth taking a look at LFM [1]. I haven't had the need to use it so far (I'm running on Apple silicon day to day, so my dailies are usually the 30B+ MoEs), but I've heard good things from folks using it as a daily on their phones. YMMV.

[1] https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct

pixelmelt 1/19/2026||
I'm glad they're still releasing models despite going public
twelvechess 1/19/2026||
Excited to test this out. We need a SOTA 8B model bad though!
cipehr 1/19/2026||
Is essentialai/rnj-1 not the latest attempt at that?

https://huggingface.co/EssentialAI/rnj-1

metalliqaz 1/19/2026||
I tried this model and if I recall correctly it was horribly over-trained on Python test questions, to the point that if you asked for C code it would say something like "you asked for C code but specified the answer must be in Python, so here is the Python", even though I never once mentioned Python.
piyh 1/19/2026||
https://docs.mistral.ai/models/ministral-3-8b-25-12
twelvechess 1/19/2026||
Thanks, I will try this out.
epolanski 1/19/2026||
Any cloud vendor offering this model? I would like to try it.
PhilippGille 1/19/2026||
z.ai itself, or Novita for now, but others will probably follow soon

https://openrouter.ai/z-ai/glm-4.7-flash/providers
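
For a quick test against OpenRouter's OpenAI-compatible endpoint, something like this should work; a minimal sketch, where the model slug comes from the providers page above and an OPENROUTER_API_KEY env var is assumed to be set:

  # Minimal chat completion against OpenRouter for GLM-4.7-Flash.
  import os
  from openai import OpenAI

  client = OpenAI(
      base_url="https://openrouter.ai/api/v1",
      api_key=os.environ["OPENROUTER_API_KEY"],
  )
  resp = client.chat.completions.create(
      model="z-ai/glm-4.7-flash",
      messages=[{"role": "user", "content": "Say hi in one sentence."}],
  )
  print(resp.choices[0].message.content)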

sdrinf 1/19/2026|||
Note: I strongly recommend against using Novita - their main gig is serving quantized versions of models so they can offer them cheaper / at better latency, but if you run an eval against other providers vs Novita, you can spot the quality degradation. This is nowhere marked or displayed in their offering.

Tolerating this is very bad form from OpenRouter, as they default-select the lowest price - meaning people who just jump into using OpenRouter and don't know about this fuckery get facepalm'd by the perceived model quality.
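
If you do go through OpenRouter, you can sidestep the default lowest-price routing by pinning providers explicitly in the request. A sketch of the provider routing options as I understand them; the provider slug here is illustrative, check the providers page for the actual names:

  # Pin providers and disable fallbacks so the request can't silently route
  # to whichever host is cheapest. Provider names are illustrative.
  import os
  import requests

  resp = requests.post(
      "https://openrouter.ai/api/v1/chat/completions",
      headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
      json={
          "model": "z-ai/glm-4.7-flash",
          "messages": [{"role": "user", "content": "ping"}],
          "provider": {"order": ["z-ai"], "allow_fallbacks": False},
      },
  )
  print(resp.json()["choices"][0]["message"]["content"])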

epolanski 1/19/2026|||
Interesting, it costs less than a tenth of what Haiku does.
saratogacx 1/19/2026||
GLM itself is quite inexpensive. A year's sub to their coding plan is only $29 and works with a bunch of tools. I use it heavily as an "I don't want to spend my Anthropic credits" day-to-day model (mostly using Crush).
latchkey 1/19/2026|||
We don't have a lot of GPUs available right now, but it is not crazy hard to get it running on our MI300x. Depending on your quant, you probably want a 4x.

ssh admin.hotaisle.app

Yes, it should be easier to just get a VM with this pre-installed. Working on that.

omneity 1/19/2026||
Unless you're using Docker, or vLLM is otherwise provided prebuilt against the ROCm dependencies, it's going to be time consuming.

It took me quite some time to figure out the magic combination of versions and commits, and to build each dependency successfully to run on an MI325x.

latchkey 1/19/2026||
Agreed, the OOB experience kind of sucks.

Here is the magic (assuming a 4x)...

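  # Pull and start the ROCm vLLM nightly container with access to the AMD GPU devices.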
  docker run -it --rm \
  --pull=always \
  --ipc=host \
  --network=host \
  --privileged \
  --cap-add=CAP_SYS_ADMIN \
  --device=/dev/kfd \
  --device=/dev/dri \
  --device=/dev/mem \
  --group-add render \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v /home/hotaisle:/mnt/data \
  -v /root/.cache:/mnt/model \
  rocm/vllm-dev:nightly
  
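  # Inside the container: point the HF cache at the mounted volume so model downloads persist on the host.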
  mv /root/.cache /root/.cache.foo
  ln -s /mnt/model /root/.cache
  
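  # Serve the FP8 checkpoint tensor-parallel across 4 GPUs, with an FP8 KV cache and MTP speculative decoding.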
  VLLM_ROCM_USE_AITER=1 vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --quantization fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --load-format fastsafetensors \
  --enable-expert-parallel \
  --allowed-local-media-path / \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --mm-encoder-tp-mode data
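
Once it's up, vLLM exposes the usual OpenAI-compatible API on port 8000 by default, and since no --api-key is set above any placeholder key works. A quick smoke test might look like this (a sketch, assuming the default served model name):

  # Smoke-test the local vLLM endpoint started by the command above.
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
  resp = client.chat.completions.create(
      model="zai-org/GLM-4.7-FP8",
      messages=[{"role": "user", "content": "hello"}],
  )
  print(resp.choices[0].message.content)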
Der_Einzige 1/20/2026||
Speculative decoding isn’t needed at all, right? Why include the final bits about it?
latchkey 1/20/2026|||
https://docs.vllm.ai/projects/recipes/en/latest/GLM/GLM.html...
foobar10000 1/20/2026|||
GLM 4.7 supports it - and in my experience with Claude Code, an 80-plus percent hit rate on the speculative tokens is reasonable. So it is a significant speedup.
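
A rough intuition for why the hit rate matters: with one MTP draft token per step, the expected number of tokens emitted per target-model forward pass is about 1 + acceptance_rate, so ~80% acceptance is close to a 1.8x reduction in target passes, ignoring the cost of the draft head itself. A toy calculation:

  # Idealized speculative-decoding throughput with k draft tokens per step
  # and an independent per-token acceptance probability; draft cost ignored.
  def expected_tokens_per_step(acceptance_rate: float, k: int = 1) -> float:
      return sum(acceptance_rate**i for i in range(k + 1))

  print(expected_tokens_per_step(0.8, k=1))  # ~1.8 tokens per target pass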
dvs13 1/19/2026|||
https://huggingface.co/inference/models?model=zai-org%2FGLM-... :)
xena 1/19/2026||
The model literally came out less than a couple of hours ago; it's going to take people a while to tool it up for their inference platforms.
idiliv 1/19/2026||
Sometimes model developers coordinate with inference platforms to time releases in sync.