MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

Posted by gainsurier 5 hours ago

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second (mimo.xiaomi.com)

365 points | 255 comments | page 3

__natty__ 4 hours ago

With this at 1k tps and Kimi K2.6 at 1k tps on Cerebras, I believe we are entering the next stage of LLMs, where companies will also compete on throughput.

PhunkyPhil 4 hours ago

Obligatory taalas mention:

https://taalas.com/

Despite the performative UI components they have a shipped (demo) product:

https://chatjimmy.ai/

This is only Llama 3.1 8B with a very small context window, but at 17k tokens per second it's likely enough to reliably call tools, which would make a huge difference in agentic applications. Assuming they can bake in better models, I'm just as bullish, or even more so, on this, since it opens up edge computing with extremely low power requirements.

High tok/s is the future IMO.

kilroy123 2 hours ago

My dream is Claude or Codex running at this speed.

moffkalast 5 hours ago

42B active params, sliding window attention. There's your tradeoff.

vlovich123 5 hours ago

Sliding window is for the draft model, not for the main one. It's 42B active params because it's a sparse MoE, which is a common technique for larger models to avoid getting bottlenecked by memory bandwidth.
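To put rough numbers on the memory-bandwidth point, here is a back-of-the-envelope sketch. It assumes decode speed is limited only by how fast the weights touched per token can be read, that all weights are stored at 4 bits, and that the accelerator has 8 TB/s of memory bandwidth. The bandwidth figure and the function name are assumptions for illustration, not anything from the post, and the sketch ignores KV cache reads, batching, multi-GPU sharding and speculative decoding.

```python
def decode_ceiling_tps(params_read_per_token: float,
                       bytes_per_param: float,
                       bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    bytes_per_token = params_read_per_token * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


FP4_BYTES = 0.5            # 4-bit weights; assumes every weight is stored in FP4
ASSUMED_BANDWIDTH = 8e12   # 8 TB/s, an assumed figure, not taken from the post

# 42B active params per token (sparse MoE) vs. a hypothetical dense 1T model
sparse = decode_ceiling_tps(42e9, FP4_BYTES, ASSUMED_BANDWIDTH)
dense = decode_ceiling_tps(1e12, FP4_BYTES, ASSUMED_BANDWIDTH)

print(f"sparse MoE ceiling: {sparse:.0f} tok/s")
print(f"dense 1T ceiling:   {dense:.0f} tok/s")
print(f"ratio: {sparse / dense:.1f}x")
```

The absolute numbers depend entirely on the assumed bandwidth. The ratio is the useful part: reading 42B instead of 1T parameters per token is roughly 24x fewer bytes, whatever the hardware.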

moffkalast 4 hours ago

Seems to be for both according to the spec [0], maybe it's wrong though.

128 sounds really tiny, I wonder if they mean some kind of blocks?

[0] https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#4...
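For reference, if 128 means tokens rather than blocks (an assumption, the spec may define it differently), a causal sliding-window mask looks roughly like the sketch below. The function name is made up for illustration, and whether the window counts the current token varies between implementations; here it does.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Causal: a query never sees later positions. The window counts the
    current token, so each query sees at most `window` positions.
    """
    return [
        [0 <= q - k < window for k in range(seq_len)]
        for q in range(seq_len)
    ]


# Tiny example so the pattern is visible; the model card quotes a window of 128.
mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row shows a band of at most three allowed positions ending at the diagonal, so per-token attention cost and cache size stay constant in those layers no matter how long the context gets.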

E-Reverance 4 hours ago

> It uses 384 routed experts (top-8) with hybrid attention (full-attention + sliding-window 128 at 6:1 ratio) over 70 layers (1 dense + 69 MoE)

https://recipes.vllm.ai/XiaomiMiMo/MiMo-V2.5-Pro
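To illustrate what "384 routed experts (top-8)" means per token, here is a generic top-k routing sketch. It is not MiMo's actual router: the softmax-over-selected-experts weighting, the random logits and the function name are all assumptions for illustration, and real routers often add shared experts or different scoring functions.

```python
import math
import random


def route_top_k(gate_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)          # subtract max for stability
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


NUM_EXPERTS, TOP_K = 384, 8        # figures from the quoted recipe
rng = random.Random(0)             # random logits stand in for a learned router
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

weights = route_top_k(logits, TOP_K)
print(f"{len(weights)} of {NUM_EXPERTS} routed experts used per token "
      f"({TOP_K / NUM_EXPERTS:.1%})")
```

Only the selected experts' weights are read and multiplied for a given token in each MoE layer, which is how a 1T-parameter model ends up with a much smaller active parameter count.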

bearjaws 4 hours ago

Given how "smart" some of the 26b dense models are now, I would not be surprised to see a strong 40b MoE.

isusmelj 4 hours ago

No note about the specific GPU they use. One might speculate. B200? H200? H100?

h14h 4 hours ago

The gated "ultra-speed" phenomenon seen here and with the Cerebras Kimi K2.6 release, while understandable, is somewhat troubling IMO.

Getting ~1000 TPS on near-frontier intelligence is a step change, and enables whole new use-cases for applications. Seeing limited compute resources beget selective access makes me worry for the future of competition.

pullshark91 4 hours ago

It's interesting but not game-changing IMO. Speed here is not a bottleneck.

elar_verole 5 hours ago

Yeah, this seems to be the easiest path to overall agent efficiency in the short term.

harel 4 hours ago

There are a few things in life that I can't fully grasp the appeal of. One is the constant need to exhibit growth. As if being massive and staying massive is not good enough, one has to always and continuously grow. The other is constant speed increases. We're already operating at 50x speed. My output is much wider and so much faster that I am sometimes my own bottleneck. And now, as if that is not enough, we want more speed. "I want a full software product from scratch in 12 seconds, because 5 minutes is too long and I've got things to do..."

Really?

sidrag22 4 hours ago

Different use cases for different people. Some people are nurturing a code base and ensuring it doesn't become a gross mess, so they become the bottleneck. Some people are just trying to prompt stuff into existence and don't know what SQL is.

I think this site often overlooks that second group and how large it likely is.

philipkglass 4 hours ago

I remember when I had to wait minutes to get a high resolution image over a dialup connection. When computer and communications hardware advanced enough that I could get 30 high resolution images every second, there were brand new uses. In the case of LLMs, I could imagine that much faster operations allow you to introduce them as parts of systems that need to react to the real world at high speed, like factory equipment. Showing that a model can do the usual LLM tasks at extremely high speed is just a demo proving that the approach works.

anothereng 1 hour ago

Yeah, at very high speed the agent can code the solution when you ask it for something on the go. Imagine it being able to build a feature as fast as a website loads. Sometime in the future that would feel like magic.

harel 4 hours ago

The example in the video was the generation of a dashboard app of some sort. I can do that with a "normal speed" Claude in a few minutes. The difference is a few minutes. This is compared to a few weeks in old-school development time. I don't have a problem with taking it a little "slow" (as in, a few minutes) and lending my thought to it rather than just going for fast generation and who knows what's inside. I get your use case, but it is a specialised one, and not the one 90% of people will think of. Everyone wants that fast app in 12 seconds... or so it seems from my being downvoted on that comment.

aburayhanalif 2 hours ago

It is good, I think.
