
Posted by NicoConstant 6 hours ago

Real-time LLM Inference on Standard GPUs: 3k tokens/s per request (blog.kog.ai)
111 points | 59 comments
robmccoll 4 hours ago|
Making these claims on a 2B parameter model seems a bit like seeing linear scalability from 1 to 4 cores and then assuming 256 cores will give you a 256x speedup. Or demonstrating massive improvement on datasets that fit in cache and then assuming the same improvements will be present on problem sizes that span the memory of multiple machines. Something tells me that scaling to larger models will be more difficult than assumed.
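One way to make the cores analogy concrete is Amdahl's law, which the comment does not name; it is swapped in here purely as an illustration. The sketch below assumes a hypothetical 2% serial fraction and shows how scaling can look almost linear at 4 workers and still fall far short of 256x at 256 workers.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Ideal speedup under Amdahl's law for a fixed problem size."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # The serial part never shrinks; only the parallel part is divided by workers.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


# Assumption for illustration only: 2% of the work cannot be parallelized.
SERIAL_FRACTION = 0.02

for n in (1, 4, 256):
    print(f"{n:>3} workers: {amdahl_speedup(SERIAL_FRACTION, n):.2f}x")
```

With these assumed numbers the speedup is about 3.77x at 4 workers but only about 41.97x at 256, which is the kind of gap the comment warns about. It says nothing about how this particular engine will actually scale.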
gaeld 4 hours ago|
Yeah, I agree: I'm actually not expecting it to be easy, and there will certainly be several unknown unknowns we'll discover along the way.

Our process has been, and will continue to be, a sequence of (tedious) R&D experiments where the GPU never behaves as expected when pushed to its limits in ways no-one really tested before (I still have nightmares of the L3 cache cross-IOD bottlenecks on MI300X).

IMHO, we did solve the multi-GPU memory bandwidth scaling problem, and thus the linear scaling of the size of the model towards infinity. But the main difficulties will come from keeping the speed, with steady and continuous memory streaming, while implementing the much more complex architecture of modern frontier MoEs (attention compression tricks, hash layers, routing logic, etc.)

paul-rohan 2 hours ago||
I had to test it myself to believe this unreal inference speed.

Each time I got 3300+ tps.

frankensteins 2 hours ago||
I have a naive question here. First, the token speed is very impressive, but why is this the highlight? I would prefer to see the actual model performance.
gaeld 1 hour ago|
Token generation speed matters for sequential agentic workflows, like software engineering / vibe coding, where a lot of reasoning tokens, code generation, refactoring, testing, etc. happen in a loop before an actual outcome is served to the user.

As for model performance, we plan to support the latest frontier models (this tech preview is about the speed of the engine).
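A rough back-of-the-envelope sketch of why generation speed dominates a sequential loop. The step count and tokens per step are assumptions chosen for illustration, and the rates (70, 500, and 3,000 tok/s) are the figures quoted in this thread and in the article title. Prefill, tool calls, and network latency are deliberately ignored.

```python
def loop_seconds(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Pure generation time for a sequential loop where each step waits on the previous one."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return steps * tokens_per_step / tokens_per_second


# Assumptions for illustration only: a 20-step agent loop, 2,000 generated tokens per step.
STEPS = 20
TOKENS_PER_STEP = 2_000

for rate in (70, 500, 3_000):
    seconds = loop_seconds(STEPS, TOKENS_PER_STEP, rate)
    print(f"{rate:>5} tok/s -> {seconds:7.1f} s of generation")
```

Under these assumptions the same loop takes roughly 571 s at 70 tok/s, 80 s at 500 tok/s, and about 13 s at 3,000 tok/s.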

irishcoffee 5 hours ago||
The NVIDIA H200 is not a standard GPU. Eight of them in a box with a CPU and RAM cost close to the same as a house.

I am 100% all about using local models instead of sending someone else all my data and paying for the privilege of doing so, but this article is misleading.

I can get a 27B model to kick out 40 tok/s on 16 GB of VRAM. This is the area ripe for development.

If you can’t connect a monitor, it isn’t a standard GPU, at least not in the way people have spoken about GPUs until a few years ago.

gaeld 5 hours ago|
I guess you were thinking of consumer GPUs. We are indeed talking about standard datacenter GPUs.

Sorry for the confusion.

embedding-shape 4 hours ago|||
Do you think maybe changing your article's title from "Real-time LLM Inference on Standard GPUs" to "Real-time LLM Inference on Standard Datacenter GPUs" might make sense here? More people seem confused by the title than not, and you could clear this up relatively easily, at least on your website, although it might be too late to fix the HN title.
gaeld 4 hours ago||
YES - I just updated the title of our article according to your suggestion.
irishcoffee 4 hours ago|||
Oh, it isn't confusing, it is misleading. A standard GPU lets you connect a monitor. A datacenter GPU lets you do headless math.
gaeld 4 hours ago||
I updated the article title accordingly
bcjdjsndon 2 hours ago||
Standard != Datacentre
bartkappenburg 4 hours ago||
Is this the new gateway to a "Model On a Chip"? Is it possible to etch the weights on silicon and get a very efficient way to use an LLM?
kirtivr 5 hours ago||
I can think of real-time video, shader generation, and real-time worldbuilding-type problems that could require such high token throughput.

For instant code generation, 400-500 tok/s should be sufficient, though most frontier models give us closer to 70 tok/s.

Gomotono 4 hours ago|
That sounds a little bit like "640 KB of memory is enough", and then someone invented Electron ;P

But joking aside, I don't think we even know yet what is possible once you hit very high tokens-per-second numbers, provided the whole ecosystem behind it can handle them.

You could literally implement the same solution 100x, benchmark all of them, and keep only the best result.

You could build and architect a whole stack in parallel.

You could generate massive amounts of thinking tokens / chain of thought.

You could let the LLM analyse everything around you while you type. Like it could tell you that this might create a bug in a different file and why.

We could start doing some type of Monte Carlo search with this.
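The "implement it 100x and keep the best" idea is essentially best-of-N selection. Here is a minimal sketch under stated assumptions: `fake_generate` and `fake_score` are hypothetical stand-ins (a seeded random number plays the role of a generated solution's measured runtime), not a real LLM call or a real benchmark harness.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def best_of_n(generate: Callable[[int], T], score: Callable[[T], float], n: int) -> T:
    """Generate n candidates and return the one with the highest score."""
    if n < 1:
        raise ValueError("n must be >= 1")
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=score)


def fake_generate(seed: int) -> float:
    """Hypothetical stand-in for 'ask the model for a solution and benchmark it'.

    Returns a simulated runtime in milliseconds, deterministic per seed.
    """
    return random.Random(seed).uniform(1.0, 10.0)


def fake_score(runtime_ms: float) -> float:
    """Hypothetical scoring rule: lower runtime is better, so negate it."""
    return -runtime_ms


best = best_of_n(fake_generate, fake_score, n=100)
print(f"best simulated runtime out of 100 candidates: {best:.2f} ms")
```

In a real setup the cost is dominated by `generate`, which is where very high token throughput would pay off; the selection step itself is trivial.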

ekianjo 3 hours ago||
Title is pure bait. Where has "Datacenter GPU" gone?
LoganDark 5 hours ago||
I feel the comparison to Groq is unfair. They're running much larger models (orders of magnitude) and still reaching competitive speeds.
gaeld 5 hours ago|
Fair point - this tech preview is about the speed (hence the small dense model, which was easier to implement).

The math does check out, though, for supporting large frontier MoE models at similar speeds.

At batch size 1, GPT-OSS-120B has 5.1B active parameters - in FP8, it's in the same size ballpark as our 2B model in FP16 (5.1 GB vs 4 GB).

DeepSeek V4 Flash has 13B active parameters in mixed FP4/FP8.

Check out the math at the end of our blog post: https://blog.kog.ai/real-time-llm-inference-on-standard-gpus...
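The size comparison in this comment can be reproduced with a few lines of arithmetic. The sketch below counts only the weight bytes that must be read per generated token at batch size 1 (KV cache and activations are ignored), and then derives a simple memory-bandwidth upper bound on decode speed. The bandwidth value is an illustrative assumption, not a figure from the thread or the blog post.

```python
def weight_gigabytes(active_params_billion: float, bytes_per_param: float) -> float:
    """Weight bytes touched per generated token at batch size 1, in GB (1e9 bytes)."""
    return active_params_billion * bytes_per_param


def bandwidth_bound_tokens_per_second(weight_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on decode speed if every token must stream all active weights once."""
    if weight_gb <= 0:
        raise ValueError("weight_gb must be positive")
    return bandwidth_gb_per_s / weight_gb


dense_2b_fp16 = weight_gigabytes(2.0, 2.0)  # 2B params at 2 bytes each
moe_5p1b_fp8 = weight_gigabytes(5.1, 1.0)   # 5.1B active params at 1 byte each

# Assumption for illustration only: 4,800 GB/s of usable memory bandwidth.
ASSUMED_BANDWIDTH_GB_S = 4_800.0

for name, gb in (("2B dense, FP16", dense_2b_fp16), ("5.1B active, FP8", moe_5p1b_fp8)):
    bound = bandwidth_bound_tokens_per_second(gb, ASSUMED_BANDWIDTH_GB_S)
    print(f"{name}: {gb:.1f} GB per token, at most {bound:.0f} tok/s")
```

This is only the bandwidth ceiling under the stated assumption; routing overhead, attention, and multi-GPU communication would all lower the achievable number.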
