Codex is notably higher quality but also has me waiting forever. Hopefully these small models get better and better, not just at benchmarks.
- Tool calling doesn't work properly with OpenCode
- It repeats itself very quickly. This is addressed in the Unsloth guide and can be "fixed" by setting --dry-multiplier to 1.1 or higher (sample command after this list)
- It makes a lot of spelling errors such as replacing class/file name characters with "1". Or when I asked it to check AGENTS.md it tried to open AGANTS.md
I tried both the Q4_K_XL and Q5_K_XL quantizations and they both suffer from these issues.
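If you're serving it with llama.cpp's llama-server, the DRY sampler is just a flag, so something like this should apply the guide's suggestion (the quant filename here is only an example; use whichever file you downloaded):
llama-server -ngl 999 --ctx-size 32768 --dry-multiplier 1.1 -m GLM-4.7-Flash-UD-Q4_K_XL.gguf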
> Jan 21 update: llama.cpp fixed a bug that caused looping and poor outputs. We updated the GGUFs - please re-download the model for much better outputs.
This user has also done a bunch of good quants:
https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs
The quality loss isn't much until you get down to very small quants.
The flash model in this thread is more than 10x smaller (30B).
https://huggingface.co/models?other=base_model:quantized:zai...
Probably as:
issue to follow: https://github.com/ggml-org/llama.cpp/issues/18931
And while it usually leads to higher-quality output, sometimes it doesn't, and I'm left with BS AI slop that would have taken Opus just a couple of minutes to generate anyway.
Also notice that this is the "-Flash" version. They were previously at 4.5-Flash (they skipped 4.6-Flash). This is supposed to be equivalent to Haiku. Even on their coding plan docs, they mention this model is supposed to be used for `ANTHROPIC_DEFAULT_HAIKU_MODEL`.
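For anyone wiring that up in Claude Code, the mapping is just an environment variable; a minimal sketch, assuming the model id on their side is glm-4.7-flash (check their coding plan docs for the exact name and base URL):
export ANTHROPIC_DEFAULT_HAIKU_MODEL=glm-4.7-flash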
However they are using more thinking internally and that makes them seem slow.
They also charge full price for the same cached tokens on every request/response, so I burned through $4 on one relatively simple coding task. It would've cost under $0.50 using GPT-5.2-Codex or any other model that supports caching (besides Opus and maybe Sonnet), and it would've been much faster.
1 million tokens per minute, 24 million tokens per day. BUT: cached tokens count in full, so with 100,000 tokens of context, ten requests already burn through the entire per-minute allowance.
It's even cheaper to just use it through z.ai themselves I think.
Yes, it has some restrictions as well, but it still works for free. I have a private repository where I ended up building a Puppeteer instance that lets me type something into a CLI and get the output back in the CLI.
With current agents, I don't see why I couldn't just expand that with a cheap model (I think MiniMax M2.1 is pretty good for agents) and have the agent write the files, do the work, and run in a loop.
I think the repository might have gotten deleted after I reset my old system, but I can look for it if this interests you.
Cerebras is such a good company. I talked to their CEO on Discord once and have been following them for over a year or two now. I hope they don't get enshittified by the recent OpenAI deal, and that they improve their developer experience, because people want to pay them; instead I had to pull a shenanigan to get it for free (though honestly I was just curious whether the Puppeteer idea was even possible, and I barely used it after building it).
Due to US foreign policy, I quit Claude yesterday and picked up MiniMax M2.1. We wrote a whole design spec for a project I've previously written a spec for with Claude (with some changes to the architecture this time; adjacent, not the same).
My gut feel? I prefer MiniMax M2.1 with OpenCode to Claude. Easiest boycott ever.
(I even picked the $10 plan; it was fine for now.)
People talk about these models like they're "catching up"; they don't see that they're just trailers hitched to a truck that's pulling them along.
They usually lagged for large sets of users: Linux was not as advanced as Solaris, PostgreSQL lacked important features contained in Oracle. The practical effect of this is that it puts the proprietary implementation on a treadmill of improvement where there are two likely outcomes: 1) the rate of improvement slows enough to let the OSS catch up or 2) improvement continues, but smaller subsets of people need the further improvements so the OSS becomes "good enough." (This is similar to how most people now do not pay attention to CPU speeds because they got "fast enough" for most people well over a decade ago.)
Proxmox became good and reliable enough as an open-source alternative for server management. Especially for the Linux enthusiasts out there.
That's going to be pretty hard for OpenAI to figure out and even if they figure it out and they stop me there will be thousands of other companies willing to do that arbitrage. (Just for the record, I'm not doing this, but I'm sure people are.)
They would need to be very restrictive about who is and isn't allowed to use the API, and that would kill their growth, because customers would just go to Google or another provider that is less restrictive.
Number two, I'm not sure random samples collected from even a moderately large number of users make a great base of training examples for distillation. I would expect they need more focused samples in very specific areas to achieve good results.
The economics probably work out, though: collecting, cleaning, and preparing original datasets is very cumbersome.
What we do know for sure is that the SOTA providers are distilling their own models; I remember reading about this at least for Gemini (Flash is distilled) and Meta.
This is a terrible "test" of model quality. All these models fail when your UI is out of distribution; Codex gets close but still fails.
And yet, in terms of coding performance (at least as measured by SWE-Bench Verified), it seems to be roughly on par with o3/GPT-5 mini, which would be pretty impressive if it translated to real-world usage, for something you can realistically run at home.
If you were happy with Claude at its Sonnet 3.7 & 4 levels 6 months ago, you'll be fine with them as a substitute.
But they're nowhere near Opus 4.5
This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.
It's still not as accurate as benchmarking on your own workflows, but it's better than the original benchmark, or any other public benchmark.
You can get LLM as a service for cheaper.
E.g. This model costs less than a tenth of Haiku 4.5.
https://github.com/ggml-org/llama.cpp/releases
https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/blob/main/G...
https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#sup...
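To pull the GGUF from the command line, something like this should work (standard Hugging Face resolve URL for the Q4_K_M file used below):
curl -L -O https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/resolve/main/GLM-4.7-Flash-Q4_K_M.gguf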
llama-server -ngl 999 --ctx-size 32768 -m GLM-4.7-Flash-Q4_K_M.gguf
You can then chat with it at http://127.0.0.1:8080 or use the OpenAI-compatible API at http://127.0.0.1:8080/v1/chat/completions
Seems to work okay, but there usually are subtle bugs in the implementation or chat template when a new model is released, so it might be worthwhile to update both the model and the server in a few days.
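For example, a minimal request against the OpenAI-compatible endpoint looks something like this (the prompt is a placeholder, and llama-server doesn't require a model field since it serves a single model):
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a one-line hello world in Python"}]}'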
ollama run hf.co/ngxson/GLM-4.7-Flash-GGUF:Q4_K_M
It's really fast! But for now it outputs garbage because there is no (good) template, so I'll wait for a model/template on ollama.com.
HEAD of ollama with Q8_0 vs vLLM with BF16 and FP8 after.
BF16 was predictably bad. I'm surprised FP8 performed so poorly, but I might not have things tuned that well; I'm new at this. (Rough launch commands are sketched after the table.)
┌─────────┬───────────┬──────────┬───────────┐
│ │ vLLM BF16 │ vLLM FP8 │ Ollama Q8 │
├─────────┼───────────┼──────────┼───────────┤
│ Tok/sec │ 13-17 │ 11-19 │ 32 │
├─────────┼───────────┼──────────┼───────────┤
│ Memory │ ~62GB │ ~28GB │ ~32GB │
└─────────┴───────────┴──────────┴───────────┘
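If you want to reproduce a comparison like this, the launch commands would look roughly as follows (the Hugging Face model id and the Q8_0 tag are assumptions on my part; adjust to whatever you actually pull):
vllm serve zai-org/GLM-4.7-Flash --max-model-len 32768                      # BF16, default dtype of the published weights
vllm serve zai-org/GLM-4.7-Flash --quantization fp8 --max-model-len 32768   # online FP8 quantization
ollama run hf.co/ngxson/GLM-4.7-Flash-GGUF:Q8_0                             # Q8_0 GGUF through ollama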
Most importantly, it actually worked nicely in opencode, which I couldn't get Nemotron to do.
We've launched GLM-4.7-Flash, a lightweight and efficient model designed as the free-tier version of GLM-4.7, delivering strong performance across coding, reasoning, and generative tasks with low latency and high throughput.
The update brings competitive coding capabilities at its scale, offering best-in-class general abilities in writing, translation, long-form content, role play, and aesthetic outputs for high-frequency and real-time use cases.
https://docs.z.ai/release-notes/new-released
GLM 4.7 is good enough to be a daily driver, but it does frustrate me at times with poor instruction following.