Posted by hexagr 2 days ago
Those models aren’t comparable to Opus, or even weaker models like MiniMax, but for certain tasks (focused context and prompts, strict workflows, single purpose requests) you absolutely can use these models and get insane speeds.
I constantly push Opus and GPT, and they are getting better. But I still have to do the hardest parts myself. I would not mind waiting 10-15 minutes for the right 20 lines of code!
I use Haskell because I find laziness to be a super power. I can solve so many problems in the most straightforward way, and then laziness saves my butt w.r.t. performance.
I use Haskell because it is a better C than C is. The foreign function interface is brilliant, and I can take C primitives and apply all the abstraction mechanisms from Haskell to them. My latest project has been OpenGL based, so lots of caring about byte alignments and shovelling data to the GPU. But all this can be automated with clever use of type classes and Generics (Haskell's super cool meta system of data types).
I use Haskell because I love applying abstractions to make code which describes the problem, and then the compiler finds the solution.
I don’t do programming for embedded, so I am rarely memory constrained. I also understand Haskell memory usage quite well, and can get myself out of trouble.
It's... suboptimal, but hopefully that's a reason to hope... if Google get themselves together for 3.5 Pro / the next Flash.
Feedback loops for prototyping could become even quicker.
15k tokens/s would get me feeling like it's actually worth splitting out worktrees to try several approaches to a problem.
Quality wise, Anthropic gives me the best results (Opus for almost everything; I have sub-agents with fresh context review its work, and after 2-10 loops that usually finds most issues). Token amount wise for agentic work, DeepSeek V4 is up there. What Cerebras is doing is pretty cool though, apparently they even have prompt caching now like the other big providers: https://inference-docs.cerebras.ai/capabilities/prompt-cachi... At the same time, producing bad code faster was annoying in a uniquely new way.
Wish they'd update the models with their subscription, it could genuinely be great with the proper harness. Like if they can run GLM 4.7, surely they could at least get DeepSeek V4 Flash with a big context window going as a starting point. How can you have so much money to make your own chips, but can't run modern models that you can get for free? It's like they don't want people to use their subscription.
Recently I've also been considering downgrading to Pro and using DeepSeek V4 Pro for anything but the more complex tasks, and I basically wrote a little utility to hook Claude Code up with 3rd party providers better: https://ccode.kronis.dev/ or tbh I could also just use OpenCode on the CLI or maybe something like KiloCode in Visual Studio Code (sadly RooCode got retired, liked their UI/UX a lot too).
I guess where I'm going with all this is that most of the SOTA or near-SOTA models are pretty okay and if you want, you should either get their more affordable plans for a month and experiment, or maybe hook up whatever tools you have with something like OpenRouter and try out a bunch of them: https://openrouter.ai/ (though some of their providers quantize the models a lot, look out for that) Personally I'd also add the new Kimi and GLM models to the list of the ones to try out.
Paying for API tokens isn't really financially good long term for anyone but companies and eventually most folks just settle on a subscription of some sort, since those are heavily subsidized and more cost effective.
But GLM is good enough for many small tasks, certainly enough to get a taste for Cerebras’ high speeds!
[edit: actually that’s just their general models, I can’t see what Cerebras code offers. It was Qwen-coder when it launched but I don’t know what it is now. I think GLM 4.7 but I’m not completely sure]
This was also what I used at the time, the Qwen 3 Coder 480b on Cerebras. Worked great and was so stupidly fast it made me realize that if the hardware can be at that level and commercially available (say in 5-10 years), for that price, then we will have entirely new bottlenecks. Human review at the pace it was going is completely impossible.
So 75 tokens/s is ~ 300 chars per second which is the speed you'd get with a 2400 baud modem
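For anyone who wants to check that arithmetic, here is a rough sketch. It assumes about 4 characters per token (a common rule of thumb, not a measured value) and treats "2400 baud" as 2400 bits per second. The function names and constants are just for illustration.

```python
# Back-of-envelope conversion from LLM token throughput to serial-line speed.
# All constants below are assumptions, not measurements.
CHARS_PER_TOKEN = 4      # rough rule of thumb for English text
BITS_PER_CHAR_RAW = 8    # 8 data bits per character, no framing overhead
BITS_PER_CHAR_8N1 = 10   # 8 data bits plus one start bit and one stop bit


def chars_per_second(tokens_per_second, chars_per_token=CHARS_PER_TOKEN):
    """Approximate characters per second for a given token rate."""
    return tokens_per_second * chars_per_token


def equivalent_bps(tokens_per_second, bits_per_char):
    """Bits per second a serial link would need to match the token rate."""
    return chars_per_second(tokens_per_second) * bits_per_char


if __name__ == "__main__":
    # 75 tokens/s is the figure from the comment above; 15_000 is the
    # "15k tokens/s" figure mentioned earlier in the thread.
    for rate in (75, 15_000):
        print(
            f"{rate} tokens/s ~ {chars_per_second(rate)} chars/s, "
            f"{equivalent_bps(rate, BITS_PER_CHAR_RAW)} bit/s raw, "
            f"{equivalent_bps(rate, BITS_PER_CHAR_8N1)} bit/s with 8N1 framing"
        )
```

With 8 bits per character, 75 tokens/s does come out at 2400 bit/s. If you count start and stop bits (8N1), a 2400 bit/s line only carries about 240 chars/s, so the comparison is in the right ballpark either way.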