Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

Posted by NicoConstant 5 hours ago

Link: blog.kog.ai

111 points | 59 comments

mungoman2 4 hours ago|

This looks very interesting. Possible to get those rates without exotic hardware.

But I have to say that the comparison is not really fair. It pits a 2B model against frontier models that are likely hundreds of times larger. Also, Taalas, with their 15,000 tok/s inference, is suspiciously missing from the comparison.

We need to see a comparison using this framework with useful models, which at present seems to mean ~30B.

gaeld 4 hours ago||

Great points.

We strove to be as fair as possible in the benchmark, but it's indeed not perfect. Taalas should have been added in the dedicated hardware section, even though they use 3-bit quantization while we are on FP16 (to be fair in both directions) and they burn the model directly onto the card.

Our tech preview is about the speed (hence the small dense model, it was easier to implement).

The math does check out, though, for supporting large frontier MoE models at similar speeds:

- At batch size 1, GPT-OSS-120B has 5.1B active parameters. In FP8, that is in the same size ballpark as our 2B model in FP16 (5.1 GB vs 4 GB).

- DeepSeek V4 Flash has 13B in mixed FP4/FP8, so let's say roughly 3x bigger than 4 GB. In theory we could reach >1,000 tok/s on it with MI300X/H200 and up to 4k on next-generation GPUs.

Check out the math at the end of our blog post:

https://blog.kog.ai/real-time-llm-inference-on-standard-gpus...
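The size comparison in this comment can be reproduced with a few lines of arithmetic. This is a minimal sketch, not Kog's code: the helper name `weight_gb` and the format table are assumptions made for illustration, and the mixed FP4/FP8 case is shown only as a range because the thread does not give the exact split.

```python
# Storage width of one parameter in each numeric format, in bytes.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_gb(active_params_billions: float, fmt: str) -> float:
    """Approximate GB of weights touched per token (hypothetical helper).

    Billions of parameters times bytes per parameter gives gigabytes directly.
    """
    return active_params_billions * BYTES_PER_PARAM[fmt]


if __name__ == "__main__":
    # Figures quoted in the thread: 2B dense in FP16, 5.1B active in FP8.
    print(f"2B dense, FP16:       {weight_gb(2.0, 'fp16'):.1f} GB")
    print(f"5.1B active, FP8:     {weight_gb(5.1, 'fp8'):.1f} GB")
    # 13B in mixed FP4/FP8: only the all-FP4 and all-FP8 bounds are certain.
    low, high = weight_gb(13.0, "fp4"), weight_gb(13.0, "fp8")
    print(f"13B, mixed FP4/FP8:   between {low:.1f} and {high:.1f} GB")
```

The range for the 13B case brackets the "around 3x bigger than 4 GB" estimate given in the comment.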

Imustaskforhelp 2 hours ago||

Your playground/write-up is very interesting, and I would be really interested to see something like the DeepSeek V4 Flash model (49B) running as you are suggesting.

I haven't read the article yet, though I hope to, but I have a question: can this approach also work for trillion-parameter or other very large models, or is there a wall that makes it valuable only for smaller models?

That being said, it's still really incredible. These small models are getting really good for many use cases, so speed becomes their bottleneck. With greater speeds on consumer hardware, I think it's going to be amazing work!

gaeld 1 hour ago||

Thanks for the comment and the question!

The last section of the article lays out the scaling laws that apply when porting this approach to another model. In a nutshell, DeepSeek V4 Pro with 49B active params is close to the upper bound.

Also worth noting that our results are currently for standard datacenter GPUs. On consumer hardware, though the same low-level optimization approach applies, the bandwidth limitations will cap the achievable speed.

kirtivr 4 hours ago|||

They got 1K tok/s with Deepseek v4 Pro. That's kinda cool..

gaeld 4 hours ago||

Thanks. To be fair, this number is what we expect to get once we port DeepSeek V4 to our engine on the upcoming generation of GPUs!

hirako2000 3 hours ago|||

Fallacies look interesting? As if we weren't getting dubious claims every day?

cyanydeez 4 hours ago||

Likely the small model lets whatever fuzzer they designed to poke the GPUs find optimizations much faster.

They seem to think it scales up because they're shortening the stack.

stymaar 38 minutes ago||

This is very cool.

I have been lamenting for a while that the memory-bandwidth <-> tps relationship pretty much held for small models on consumer cards, but not at all on datacenter hardware.

It's great to see that with proper care on the inference engine implementation the relationship can be restored.
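The memory-bandwidth <-> tps relationship mentioned here is usually stated as a simple upper bound: at batch size 1, each decoded token has to stream the active weights once, so speed cannot exceed bandwidth divided by weight size. The sketch below assumes exactly that and ignores KV-cache and activation traffic. The function name is hypothetical, and the 4.8 TB/s figure is an approximate per-GPU spec for an H200 used purely as an illustrative input.

```python
def bandwidth_bound_tps(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-request decode speed (tokens/s).

    Assumes every token reads all active weights once and nothing else
    (no KV cache, no activations, no overhead), so real speeds are lower.
    """
    return bandwidth_bytes_per_s / weight_bytes


if __name__ == "__main__":
    weights = 4e9     # 2B parameters in FP16, as quoted in the thread
    one_gpu = 4.8e12  # assumed per-GPU memory bandwidth in bytes/s
    for gpus in (1, 2, 4, 8):
        # Idealised tensor parallelism: each GPU streams 1/gpus of the weights.
        bound = bandwidth_bound_tps(weights, one_gpu * gpus)
        print(f"{gpus} GPU(s): at most {bound:,.0f} tok/s")
```

How close a real engine gets to this bound is the relationship the comment says had been missing on datacenter hardware.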

gaeld 4 hours ago||

Follow-up reading for the most technical and research-minded people here:

Monokernel deep dive (GPU Engineering): http://blog.kog.ai/building-a-single-kernel-latency-optimize...

Delayed Tensor Parallelism (research): http://blog.kog.ai/delayed-tensor-parallelism-for-faster-tra...

To try the speed on the playground: http://playground.kog.ai

zozbot234 37 minutes ago|

It looks like DTP is a distinct architectural choice that would require training new models accordingly? This wouldn't be able to just run inference for existing models.

gaeld 17 minutes ago||

Totally, though DTP is not required for this kind of speed. Standard TP also works.

DTP is something we built for our roadmap in order to get to extremely high speeds (like 10k+ tokens/s). When the budget is under 10 µs per layer, any little overhead matters.

For 1k to 5k tokens/s, regular TP still works because we are able to optimize the inter-GPU all-reduce collectives to under 3 µs, which lets us keep streaming model weights into shared memory, registers and caches while the GPUs exchange data.
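The per-layer budget mentioned above follows from dividing the time available per token by the number of layers. The sketch below is illustrative only: the layer count is a hypothetical value (the thread does not state it), two all-reduces per layer is the usual count for standard tensor parallelism, and the share computed here is a worst case that assumes the collectives are not overlapped with weight streaming, whereas the comment says they are.

```python
def per_layer_budget_us(tokens_per_s: float, num_layers: int) -> float:
    """Microseconds available per layer for one token at a target speed."""
    return 1e6 / (tokens_per_s * num_layers)


def allreduce_share(tokens_per_s: float, num_layers: int,
                    allreduce_us: float = 3.0,
                    allreduces_per_layer: int = 2) -> float:
    """Fraction of the per-layer budget spent in all-reduces if not overlapped."""
    budget = per_layer_budget_us(tokens_per_s, num_layers)
    return (allreduce_us * allreduces_per_layer) / budget


if __name__ == "__main__":
    layers = 20  # hypothetical layer count, chosen only for the example
    for tps in (1_000, 5_000, 10_000):
        budget = per_layer_budget_us(tps, layers)
        share = allreduce_share(tps, layers)
        print(f"{tps:>6} tok/s: {budget:5.1f} µs/layer, "
              f"non-overlapped all-reduces would use {share:.0%}")
```

Under these assumptions the collectives fit comfortably at 1k tokens/s but exceed the whole budget at 10k, which is consistent with the comment's point that every small overhead matters at that speed.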

rashkov 51 minutes ago||

Don't miss trying their demo: https://playground.kog.ai/

Feels like a preview of the future

bcjdjsndon 1 hour ago||

H200 isn't a standard GPU at all

infocollector 11 minutes ago|

I think they accidentally left out “standard data-center GPUs” from the title. That probably needs fixing. My “standard” GPU is still a 3090

867-5309 4 hours ago||

> Standard GPUs

> 8× NVIDIA H200

Oras 4 hours ago|||

As in not custom chips like Groq and Cerebras. Did you expect a single GPU chip to reach 3k tps?

embedding-shape 4 hours ago|||

I think many would assume "not enterprise" or "not datacenter grade" when someone says "Standard GPUs", but maybe that specific phrase has a specific meaning I'm not familiar with.

Edit: I just tried a 4B model on an RTX Pro 6000, getting ~500 tok/s with llama.cpp without even trying to optimize or change anything, just default settings. I'm sure with vLLM it'd be a lot faster already, still before manually tuning configs. I wouldn't call that card a "Standard GPU" either FWIW, but it makes the claimed performance numbers feel less exciting, especially given the hardware they were using.

ismailmaj 4 hours ago||

I expected a 4090, maybe 2. I did not expect 8xH200 for a 2B model.

gaeld 3 hours ago||

Great points, let me clarify:

- model size: 2B is just for this preview (it was faster to implement), our article explains how we expect to support large frontier MoE at 1,000 to 5,000 tokens/s

- reaching 500 tok/s, or even up to ~1,000 tok/s, on a consumer GPU card is possible with existing inference engines like vLLM. But there is a ceiling.

The hard part comes when you try to be faster than that: these frameworks won't scale higher just by adding GPUs or using faster GPUs. There is a "glass ceiling" due to microseconds lost everywhere in the stack (grid syncs, inter-GPU comms, kernel launches, CPU sampling, etc.).

All our work at Kog is about removing these bottlenecks.
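The "glass ceiling" argument can be illustrated with a toy latency model: time per token is the time to stream the weights plus a fixed overhead that does not shrink when bandwidth grows. This is a sketch under stated assumptions, not a measurement. The function names are hypothetical, and the 1 ms overhead and 4 TB/s bandwidth are made-up inputs chosen only to show the shape of the curve.

```python
def tokens_per_s(weight_bytes: float, bandwidth_bytes_per_s: float,
                 overhead_s_per_token: float) -> float:
    """Decode speed when each token costs weight streaming plus fixed overhead."""
    stream_time = weight_bytes / bandwidth_bytes_per_s
    return 1.0 / (stream_time + overhead_s_per_token)


def overhead_ceiling(overhead_s_per_token: float) -> float:
    """Speed limit as bandwidth grows without bound: only the overhead remains."""
    return 1.0 / overhead_s_per_token


if __name__ == "__main__":
    weights = 4e9    # 2B parameters in FP16, as quoted in the thread
    overhead = 1e-3  # hypothetical 1 ms/token of launches, syncs, sampling
    base_bw = 4e12   # hypothetical 4 TB/s of memory bandwidth per GPU
    for gpus in (1, 2, 4, 8, 64):
        tps = tokens_per_s(weights, base_bw * gpus, overhead)
        print(f"{gpus:>2}x bandwidth: {tps:7.1f} tok/s")
    print(f"ceiling: {overhead_ceiling(overhead):.1f} tok/s")
```

In this model, adding bandwidth gives diminishing returns and never passes the ceiling; only reducing the per-token overhead raises it.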

bcjdjsndon 1 hour ago||

That doesn't clarify anything lol. It's a bit click baity.

bcjdjsndon 1 hour ago||||

> Did you expect a single GPU chip to reach 3k tps?

Did the article headline not say Standard GPU?

WithinReason 3 hours ago||||

So what would be the above-standard GPUs that they are excluding, then? Cerebras is not a GPU.

imputation 4 hours ago||

Everyone beholden to a data center or subject to the installation on the corner of your property of course. Keep up with the times... /s

0-bad-sectors 4 hours ago||

When I read "Standard GPUs" in the title I got excited for a second then I read the article itself..

roosgit 4 hours ago||

Yeah, it should have been "Datacenter GPUs" or "Nvidia and AMD GPUs".

Oras 4 hours ago||

what did you have in mind when you read "Standard GPUs"?

yjftsjthsd-h 2 hours ago|||

The GPU in my desktop. (A normal-ish decent gaming machine that runs LLMs and txt2img well enough.)

In contrast, not enterprise GPUs that cost as much as a car.

gaeld 4 hours ago||||

I guess you were thinking of consumer GPUs. We are indeed talking about standard datacenter GPUs.

nightski 3 minutes ago|||

How would you classify a datacenter GPU as standard/non-standard? That doesn't seem to be a meaningful distinction. It's click bait.

deflator 1 hour ago|||

What a lot of us on here are salivating for is the ability to run these on prosumer hardware at home. So we tend to jump to the conclusion that "standard" means "consumer-grade" because that's what we want to see. Still, very cool work!

gaeld 8 minutes ago||

thank you deflator, I understand this now! much appreciated

bcjdjsndon 1 hour ago|||

You know, Radeon 9800 pro ago

cataflam 53 minutes ago||

Congrats gaeld and team

The demo is very impressive!

Disclaimer: I've known the founder for a while. This is as legitimate as it gets in deep tech: real years of research and engineering behind this, not vaporware.

CastFX 3 hours ago||

Looks super promising! A couple of questions:

For new open weights models, will you need to adapt model code and optimization for your inference engine by hand?

It's true that BS=1 is king when it comes to agentic workflows; however, these kinds of systems serve multiple requests concurrently with dynamic batching. Do you think it will scale as well?

Any plans to release it open source?

Congrats again on the release

gaeld 2 hours ago|

Thanks a lot! Much appreciated.

To answer your questions:

- yes, we rewrite the whole model code (while keeping the same logic) in CUDA/HIP and assembly, in order to optimize by hand for each GPU type. It's quite tedious for sure, but I guess this is the price to pay to get these kinds of results.

- the batching question is a great one. In agentic systems, there is probably a trade-off between sequential thinking/iterations vs parallel exploration of multiple solutions. Also, there could just be multiple independent tasks running in parallel, depending on the use case.

We plan to support a small amount of batching, but it quickly becomes a trade-off vs speed. Pick one for your use case, I guess.

Also to consider: because we answer requests much faster, we are also able to process lots of them without needing high batches - and scaling on multiple nodes is possible.

- open sourcing: maybe, maybe not. I'm still undecided on this. We are a small startup and I'm told that giving our IP away might be shooting ourselves in the foot. On the other hand, I think it could be of great benefit to the community and for us... we'll see

ilaksh 4 hours ago|

Could be amazing, but it's hard to judge if it will really work with, say, a 27B model or larger. We can already get pretty good speed with a 2B model.

gaeld 4 hours ago|

thanks! we explain how it scales to larger models in the last section of the OP blog post

bcjdjsndon 1 hour ago||

Shame you stopped short of actually benchmarking that scale though, eh?

gaeld 13 minutes ago||

will do - we are a small team and it takes time to implement and optimize a new model, whatever the size.
