Posted by NicoConstant 6 hours ago
Our process has been, and will continue to be, a sequence of (tedious) R&D experiments where the GPU never behaves as expected when pushed to its limits in ways no one really tested before (I still have nightmares of the L3 cache cross-IOD bottlenecks on MI300X).
IMHO, we did solve the multi-GPU memory bandwidth scaling problem, and thus the linear scaling of the size of the model towards infinity. But the main difficulties will come from keeping the speed, with steady and continuous memory streaming, while implementing the much more complex architecture of modern frontier MoEs (attention compression tricks, hash layers, routing logic, etc.).
each time getting 3300+ tps.
Regarding model performance, we plan to support the latest frontier models (this tech preview is about the speed of the engine).
I am 100% for using local models instead of sending someone else all my data and paying for the privilege of doing so, but this article is misleading.
I can get a 27B model to kick out 40 tok/s on 16 GB of VRAM. This is the area ripe for development.
If you can’t connect a monitor, it isn’t a standard GPU, at least not in the way people have spoken about GPUs until a few years ago.
Sorry for the confusion
For instant code generation, 400-500 tok/s should be sufficient, though most frontier models give us closer to 70 tok/s.
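To put those rates in wall-clock terms, here is a rough sketch. It assumes a steady decode rate and ignores prefill and network latency; the 2,000-token completion size is a made-up example, not a figure from the thread.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady decode rate (prefill and network ignored)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Hypothetical 2,000-token code completion at rates quoted in this thread.
COMPLETION_TOKENS = 2000
for rate in (70, 400, 500, 3300):
    seconds = generation_seconds(COMPLETION_TOKENS, rate)
    print(f"{rate:>5} tok/s -> {seconds:6.2f} s")
```

At 70 tok/s that completion takes close to half a minute; at 400-500 tok/s it is a few seconds.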
But joking aside, I don't think we even know yet what becomes possible at very high tokens-per-second rates, provided the whole ecosystem behind it can keep up.
You could literally implement the same solution 100x, benchmark all of them, and keep only the best result.
You could build and architect a whole stack in parallel.
You could do massive thinking-token / chain-of-thought runs.
You could let the LLM analyse everything around you while you type. Like it could tell you that this might create a bug in a different file and why.
We could start doing some type of Monte Carlo search with this.
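The "implement it 100x and keep the best" idea is basically best-of-N sampling. A minimal sketch, assuming you have some `generate` call that returns a candidate and some `score` function that benchmarks it. Both are hypothetical placeholders here, and the toy versions below only stand in for a real model call and a real benchmark.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def best_of_n(
    generate: Callable[[int], str],
    score: Callable[[str], float],
    n: int = 100,
    workers: int = 8,
) -> Tuple[str, float]:
    """Generate n candidates in parallel, score each one, return the best."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Each candidate gets its own seed so the runs differ.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        candidates = list(pool.map(generate, range(n)))
    scored = [(score(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy stand-ins (assumptions): a real setup would call the model and run a benchmark.
def toy_generate(seed: int) -> str:
    return str(random.Random(seed).randint(0, 1000))


def toy_score(candidate: str) -> float:
    return -abs(int(candidate) - 500)  # closer to 500 is "better"


best, best_score = best_of_n(toy_generate, toy_score, n=100)
print(best, best_score)
```

The faster the engine, the larger the `n` you can afford before the wait gets annoying, which is the whole point of the comment above.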
The math does check out, though, for supporting large frontier MoE models at similar speeds.
At batch size 1, GPT-OSS-120B has 5.1B active parameters - in FP8, it's in the same size ballpark as our 2B model in FP16 (5.1 GB vs 4 GB).
DeepSeek V4 Flash has 13B in mixed FP4/FP8.
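The size comparison is just active parameters times bytes per parameter. A quick sketch of that arithmetic, counting weights only (no KV cache or activations) and taking 1 GB as 1e9 bytes. The FP4/FP8 split for the 13B figure isn't given here, so the code only brackets it between all-FP4 and all-FP8.

```python
# Bytes per parameter for each weight format.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def active_weight_gb(active_params_billions: float, dtype: str) -> float:
    """Weight footprint in GB (1e9 bytes) of the active parameters, weights only."""
    return active_params_billions * BYTES_PER_PARAM[dtype]


gpt_oss_fp8 = active_weight_gb(5.1, "fp8")    # 5.1B active params in FP8
small_fp16 = active_weight_gb(2.0, "fp16")    # 2B model in FP16
# Mixed FP4/FP8: the exact split is unknown, so compute bounds only.
mixed_low = active_weight_gb(13.0, "fp4")
mixed_high = active_weight_gb(13.0, "fp8")

print(f"5.1B @ FP8 : {gpt_oss_fp8:.1f} GB")
print(f"2B @ FP16  : {small_fp16:.1f} GB")
print(f"13B mixed  : between {mixed_low:.1f} and {mixed_high:.1f} GB")
```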
Check out the math at the end of our blog post: https://blog.kog.ai/real-time-llm-inference-on-standard-gpus...