How fast is N tokens per second really?

Posted by hexagr 2 days ago

How fast is N tokens per second really? (mikeveerman.github.io)

265 points | 67 comments

SXX 4 hours ago|

I think your demo needs more realistic thinking logs, because thinking usually burns at least 2x to 3x as many tokens as the code itself, and much more for harder tasks.

unglaublich 4 hours ago||

Indeed, at 30tok/s make it pause for 20 seconds while "thinking" is streaming (and hidden); that's the real experience.

sig_kill 1 hour ago|||

You should check out https://tokey.ai, I made it a few months ago and it has all of these suggestions.

redox99 3 hours ago||

Yes, it should use actual output from some of the open models.

charles_irl 3 hours ago||

Very cool!

> Unless you've actually watched tokens stream at those rates, the numbers are hard to internalize. This is the rendering.

I built something similar recently, for the same reason: https://modal.com/llm-almanac/token-timing-simulator.

I like that the output rendering is closer to typical UIs -- syntax highlighting in code mode, tool calls, dim-italic reasoning.

One feature mine has that the author, or anyone else who vibe codes their own version after seeing this, might like to steal is modeling the distribution of output latencies. My implementation is hacky (log-normal roughly estimated from p50, p90, and p99 values), but still, when you set those to realistic values, it recreates the "jitter" you see in many LLM UIs.
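A minimal sketch of the log-normal idea, not the simulator's actual code. It assumes only two of the three percentiles are used: a log-normal has two parameters, so p50 and p90 pin it down, and p99 can serve as a sanity check on the fit. The function names and the 30 ms / 60 ms example values are made up for illustration.

```python
import math
import random
from statistics import NormalDist


def fit_lognormal(p50: float, p90: float) -> tuple[float, float]:
    """Return (mu, sigma) of a log-normal with the given median and 90th percentile.

    If ln(X) ~ Normal(mu, sigma), then p50 = exp(mu) and
    p90 = exp(mu + z90 * sigma), where z90 is the standard normal 0.90 quantile.
    """
    if not 0 < p50 < p90:
        raise ValueError("need 0 < p50 < p90")
    mu = math.log(p50)
    z90 = NormalDist().inv_cdf(0.90)
    sigma = (math.log(p90) - mu) / z90
    return mu, sigma


def sample_delays(p50: float, p90: float, n: int, seed: int = 0) -> list[float]:
    """Draw n inter-token delays (same unit as p50/p90) from the fitted log-normal."""
    rng = random.Random(seed)
    mu, sigma = fit_lognormal(p50, p90)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]
```

Sleeping for each sampled delay between tokens, e.g. `sample_delays(0.030, 0.060, 500)` for a hypothetical 30 ms median and 60 ms p90, gives the uneven cadence instead of a metronome.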

antirez is right that generation tok/s isn't flat as a function of context length, which is a weakness of both simulators.

ricardobeat 4 hours ago||

It's interesting how even 5 tok/s is still much faster than you'd typically type, but feels glacially slow for an agent.

On the other hand, I've been using Mimo and Minimax a lot recently. They routinely reach 100-150 tokens per second and that feels too fast, to the point where it's hard to keep up with what it's actually doing. Great for subagents though.

danbruc 3 hours ago||

> They routinely reach 100-150 tokens per second and that feels too fast, to the point where it's hard to keep up with what it's actually doing.

There is no way you can follow what is going on even at 30 tokens per second. Maybe you can maintain a rough idea of what is going on for some tens of seconds but that is probably about it. Follow it in any detail, no chance. Reason about what you read, absolutely no chance.

> 800 tok/s — Cerebras-class, where the bottleneck is your eyeballs

I do not understand why they say this. I am not sure if it is even true. 800 tokens sounds like a page of text and I would assume you can look at one page per second without hitting any limitation of your eyes. Or is the resolution of the human eye not good enough to see an entire page at once, so that you have to scan it with the fovea? Scrolling text might of course hit the temporal resolution limit. But why does this even matter? Your brain cannot process anything close to the amount of information your eyes can take in.

3form 1 hour ago|||

The angular diameter of detailed seeing is very small - something like 1-2 degrees from what I was reading (matches my experience). That's the only area where you can reasonably read, the rest is only good for making out rough shape. So scanning it is.

travisjungroth 1 hour ago||||

On top of the other comments, this reads like a half-joke.

moralestapia 2 hours ago|||

>I do not understand why they say this.

Click on 800.

Try to read the text.

You'll understand.

danbruc 1 hour ago||

Because it is scrolling. If they showed one page of text while filling the next one in the background, the result would probably be somewhat like flicking through a book at one page per second. You still cannot read one page per second, but you would not be limited by your eyes being unable to recognize the quickly scrolling text.

EDIT: As others have pointed out, and as I have now read up on, it is an illusion that you can see all the text on a page at once; that is beyond the resolution limit of the human eye. To actually see all the words, you have to scan the page and that takes several seconds. From the numbers I have seen, it seems that the ultimate limit is probably below 30 tokens per second, no matter what, even using rapid serial visual presentation to cut out eye movements. Even 10 to 20 tokens per second is probably pushing it and unsustainable for many, if not most, people.
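To put the numbers in this EDIT on a more familiar scale, here is a rough conversion between tokens per second and words per minute. The 0.75 words-per-token ratio is a common rule of thumb for English prose and is an assumption here, not something from the thread; real tokenizers vary by model, and code tokenizes very differently from prose.

```python
# Assumption: roughly 0.75 English words per token (rule of thumb, not a measurement).
WORDS_PER_TOKEN = 0.75


def tokens_per_second_to_wpm(tok_per_s: float,
                             words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert a streaming rate in tokens/second to words/minute."""
    return tok_per_s * words_per_token * 60


def wpm_to_tokens_per_second(wpm: float,
                             words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert a reading rate in words/minute to tokens/second."""
    return wpm / words_per_token / 60
```

Under that assumption, 30 tok/s corresponds to 1350 words per minute, and 10 to 20 tok/s to 450 to 900 words per minute.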

SpyCoder77 37 minutes ago||

Did someone say rapid serial visual presentation? I made a tool for that! https://wordflashreader.vercel.app

metalliqaz 1 hour ago||

I run models in the ~120B class on my old server (96GB DDR4) and it manages about 3-3.5 tok/sec. It is indeed painfully slow to watch, but I find if I walk away or bury the window and do something else, it always seems to be done when I check back.

jerf 4 hours ago||

I'm flashing back to using a 1200 baud modem when the world was on 28.8k. Modems are much more regular-looking, though, since each character is a character. Unless you count color changes and such, which you only really notice at 1200...

aurareturn 4 hours ago||

We truly are in the dial up era of GenAI.

Aurornis 4 hours ago||

Cool visualization, but most of the token generation in my sessions doesn't go to output code or even the text I see. Reasoning tokens make up most of the output. That can only occur after processing the input files and context.

For non-trivial work I go through hundreds of thousands of tokens (combined prefill + tg of course) before even getting to some useful text output.

I mostly use LLMs for exploration and studies, rarely code generation. Prefill matters heavily for this. Even with a prefill rate in the high hundreds or low thousands of tokens per second, I spend a lot of time waiting on the LLM (doing other things, not twiddling thumbs).

unglaublich 4 hours ago||

30tok/s looks fine when you're just streaming code, but the issue is that there's a lot of background noise like tool-calling conventions, metadata, "thinking", etc.

antirez 4 hours ago||

Token/sec only makes sense once you tell me four things:

1. decoding t/s, that is, when the model is generating text in the autoregressive fashion.

2. prefill t/s, that is, prompt processing speed.

3. What is the slope of those two numbers as the context size increases? An implementation that decodes at 50t/s with 2k context but decodes at 7t/s at 100k context is going to be a lot less useful than it seems at first glance for a large number of real-world use cases.

4. What's your use case? Reading a huge text and then producing a small output like fraud probability=12%? Or reading a small question and generating a lot of text? Whether a model is usable at a given prefill/decoding speed changes substantially with the answer.
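Point 3 can be made concrete with a small sketch. It takes the two example figures from the list (50 t/s at 2k context, 7 t/s at 100k) and assumes the decode rate falls linearly between them, which is purely an illustrative assumption; real implementations may degrade along a different curve. The function names are hypothetical.

```python
def decode_rate(context_tokens: int,
                lo: tuple[int, float] = (2_000, 50.0),
                hi: tuple[int, float] = (100_000, 7.0)) -> float:
    """Decode speed in tok/s at a given context size.

    Linearly interpolates between two (context, rate) points and clamps
    outside them. The linear shape is an assumption, not a measurement.
    """
    (c0, r0), (c1, r1) = lo, hi
    if context_tokens <= c0:
        return r0
    if context_tokens >= c1:
        return r1
    frac = (context_tokens - c0) / (c1 - c0)
    return r0 + frac * (r1 - r0)


def generation_seconds(start_context: int, new_tokens: int) -> float:
    """Time to emit new_tokens, with the context growing by one per token."""
    return sum(1.0 / decode_rate(start_context + i) for i in range(new_tokens))
```

With these figures, the same 700-token answer takes about 14 seconds on a short context and about 100 seconds once the context is already at 100k.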

For instance my DS4F inference on the DGX Spark does prefill at 350 t/s and at 200 t/s on already large contexts. But decodes at 13 t/s.

On the Mac Ultra the prefill is like 400 t/s and decoding 35 t/s.

The two systems can perform dramatically differently or almost the same based on the use case. In general for local inference to be acceptable, even if slow, you want at least 100 t/s prefill, at least 10 t/s generation. To be ok-ish from 200 to 400 t/s prefill, 15-25 t/s generation. To be a wonderful experience thousands of t/s prefill, 100 t/s generation.
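A back-of-the-envelope sketch of why the two machines can look nearly identical or very different. It uses the prefill and decode figures quoted above, treats both rates as flat (ignoring the slope from point 3), and the two workloads are hypothetical sizes picked for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    prefill_tps: float  # prompt processing speed, tokens/second
    decode_tps: float   # generation speed, tokens/second


def total_seconds(m: Machine, prompt_tokens: int, output_tokens: int) -> float:
    """End-to-end time: read the whole prompt, then generate the output."""
    return prompt_tokens / m.prefill_tps + output_tokens / m.decode_tps


# Rates as quoted in the comment above.
SPARK = Machine("DGX Spark", prefill_tps=350.0, decode_tps=13.0)
MAC = Machine("Mac Ultra", prefill_tps=400.0, decode_tps=35.0)

# Hypothetical workloads: (prompt tokens, output tokens).
READ_HEAVY = (50_000, 20)    # huge input, tiny answer
WRITE_HEAVY = (200, 2_000)   # short question, long answer
```

For the read-heavy case this gives roughly 144 s on the Spark versus 126 s on the Mac, almost the same. For the write-heavy case it is roughly 154 s versus 58 s.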

gcr 4 hours ago||

Agreed. Prefill kills me for local model work. The model reads much faster than it writes, but I'd love to get a sense for how fast the model can read large source conversations.

zozbot234 3 hours ago|||

> For instance my DS4F inference on the DGX Spark does prefill at 350 t/s and at 200 t/s on already large contexts. But decodes at 13 t/s.

You should run a multi-session batched decode on that DGX unless your 13 t/s decode is already running into thermal or power limits, which I don't believe it is. (To be clear, this is a real issue on Apple Silicon machines: batched decode does not seem to unlock higher aggregate tok/s unless you're specifically trying to mitigate the drawbacks of slow streamed inference. Especially on the M5 laptops, thermal/power throttling places an early limit on your total compute.

The jury is still out on Strix Halo, but I think batched decode may turn out to be quite useful there since the bandwidth bottleneck is even more constraining there.)

sig_kill 1 hour ago||

You should check out https://tokey.ai, I made it a few months ago and it has all of these suggestions.

bjelkeman-again 5 hours ago||

Interesting. It seems to me that with that speed (20-30) on local hardware the real issue is quality of output, not tokens per sec.

NitpickLawyer 5 hours ago|

It really depends. With the new "thinking" models they usually spend some time before writing the final answer. If they "think" for 1k tokens, that's a minute of spinning wheel you're gonna see for each question. Add that to the prompt processing, and diminishing speeds as context increases, and it becomes really slow for longer sessions.
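The arithmetic behind the spinning wheel, as a rough sketch: the wait before the first visible answer token is the prompt processing time plus the hidden thinking tokens divided by the decode rate. The function name and parameters are made up for illustration, and both rates are assumed constant.

```python
def seconds_before_answer(prompt_tokens: int,
                          thinking_tokens: int,
                          prefill_tps: float,
                          decode_tps: float) -> float:
    """Wait before the first visible answer token.

    Assumes the whole prompt is processed first, then all thinking tokens
    are generated at the decode rate before any answer text appears.
    """
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("rates must be positive")
    return prompt_tokens / prefill_tps + thinking_tokens / decode_tps
```

Ignoring the prompt, 1k thinking tokens cost 50 seconds at 20 tok/s and about 33 seconds at 30 tok/s, so "a minute" is the right order of magnitude for local hardware.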

mudkipdev 4 hours ago||

Reminds me of the possibility of running DeepSeek at 3-4 t/s with SSD streaming. That could be viable if you are running something overnight, for example.

zozbot234 3 hours ago||

The nice thing about DeepSeek and off-memory streaming is that you ought to be able to batch multiple sessions of it in parallel. Each individual session would slow down from streaming incrementally more active weights from disk, but your total tok/s would ultimately only be limited by compute. Other models have trouble doing this, because the KV cache takes too much space in RAM (and increases wear-and-tear if stored on disk) even for somewhat limited context.

adampzakaria 4 hours ago|

This is awesome!! I use Cursor and I've been trending towards medium thinking models as much as possible - I don't like the dev cadence with something like opus 4.7 (thinking: very high) (great for some tasks, like complex plans). Eventually I'd like to make my way to open models and open harness, and this tool or something like it could help me understand what performance I'd need for productive work - bookmarked!
