Posted by gainsurier 4 hours ago

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second(mimo.xiaomi.com)
276 points | 197 comments | page 2
GodelNumbering 2 hours ago|
Below is the part I found most interesting

> "However, naively applying FP4 across the entire model causes degradation in complex reasoning, logic, and code generation. Given the MoE (Mixture of Experts) architecture of Xiaomi MiMo-V2.5-Pro — where Experts constitute the vast majority of parameters and exhibit the highest tolerance to quantization — we selectively quantize only the MoE Experts to FP4 while preserving original precision for all other modules. Through FP4 QAT (Quantization-Aware Training), we dramatically reduce model size and maximize hardware bandwidth utilization while keeping the model's overall capability essentially on par with the original, as shown below"
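
To make the "experts only" idea concrete, here is a minimal sketch of my own, not Xiaomi's code. It rounds weights to the FP4 (E2M1) value grid with one shared scale per block, and only touches parameters whose name contains `.experts.`. The parameter names, the block size of 32, and the plain float scale are all assumptions for illustration. This is after-the-fact rounding, whereas QAT would apply this kind of rounding in the forward pass during training.

```python
# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quant_fp4(block):
    """Round one block of floats to the FP4 grid using a shared per-block scale."""
    peak = max(abs(x) for x in block)
    if peak == 0.0:
        return list(block)
    # Assumption: a plain float scale that maps the largest magnitude onto 6.0.
    scale = peak / FP4_GRID[-1]
    out = []
    for x in block:
        # Pick the nearest representable magnitude, then restore sign and scale.
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append((mag if x >= 0 else -mag) * scale)
    return out


def quantize_experts_only(params, block_size=32):
    """params maps a (hypothetical) parameter name to a flat list of floats.

    Only entries whose name contains '.experts.' are rounded to FP4;
    everything else (attention, router, embeddings, ...) is copied unchanged.
    """
    result = {}
    for name, values in params.items():
        if ".experts." in name:
            quantized = []
            for i in range(0, len(values), block_size):
                quantized.extend(fake_quant_fp4(values[i:i + block_size]))
            result[name] = quantized
        else:
            result[name] = list(values)
    return result
```

The rounding error lands only on the expert weights, which is the part of the model the quote says tolerates it best.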

maxloh 3 hours ago||
The generation speed in the demo video is crazy, to say the least, and completely beyond my impressions of LLMs.

The Xiaomi team really brought something to the table.

ilaksh 1 hour ago|
I think these types of demo videos should allow people to get a sense of super intelligence. Because it's very hard to imagine something that is say three times as smart as you -- by definition you wouldn't be able to comprehend its thoughts -- but this shows clearly what something that can think 100 times faster than you is like.
irthomasthomas 3 hours ago||
I don't understand, given all they say, why this would not be made available to everyone at once? Why the limited release? They should have no trouble scaling it if it runs on a single rack.
gekoxyz 3 hours ago||
Maybe they don't have enough racks. The news indicates that China isn't in a really good situation with GPUs, so they probably want to keep most of them for other stuff. Also, since the price is so cheap, they probably want to use the other GPUs for stuff that has higher margins.
jdthedisciple 3 hours ago|||
Because presumably then it won't be 1000 t/s for everyone anymore given hardware limitations?
ilaksh 1 hour ago|||
It uses significantly more resources obviously. And/or they have to configure or reconfigure servers for it, which takes time, and doesn't make sense until they have proven the demand at the higher price point.
boutell 2 hours ago|||
I wonder about this too. The other objections miss the point: if it's faster, and otherwise the same, and doesn't require different hardware, then why not just announce that the standard tier of MiMo-v2.5-Pro is now ridiculously fast and raise the price? What does "limited high speed resources" mean if it runs on the same hardware as the rest of their pool?

I think the answer is that there's a tradeoff here where additional throughput for a single person can be achieved only by tying up more resources than a normal request would, even when you take into account the fact that the normal request takes longer to finish. I'm not an expert, but some of the optimizations they describe, particularly the parallel prediction stuff, sound like they could take up extra resources.

HarHarVeryFunny 3 hours ago|||
Maybe they only have a finite number of racks ;-)
slaw 2 hours ago||
Chinese companies are blocked from buying modern ASML lithography machines. The most modern scanner China is still allowed to buy is NXT:1980i from 2015.
pants2 2 hours ago||
With a tps and a token price you can calculate approx. price per hour of running the model!

$2.61/M tokens * 1,000 tok/s = $9.40/hr

That would be pretty cheap for an 8-GPU node which would typically run around $45/hr or more. Guess this depends on how many parallel streams it can handle.
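
The same arithmetic as a small sketch, in case anyone wants to plug in other numbers. The function names are made up, and the `streams` parameter assumes every parallel stream runs at the full speed and is billed at the same per-token price, which may not hold in practice.

```python
def dollars_per_hour(price_per_million_tokens, tokens_per_second, streams=1):
    """Revenue per hour if `streams` requests each generate at full speed."""
    tokens_per_hour = tokens_per_second * 3600 * streams
    return price_per_million_tokens * tokens_per_hour / 1_000_000


def streams_to_match(node_cost_per_hour, price_per_million_tokens, tokens_per_second):
    """How many full-speed streams it takes to cover a given hourly node cost."""
    single = dollars_per_hour(price_per_million_tokens, tokens_per_second)
    return node_cost_per_hour / single


if __name__ == "__main__":
    # Numbers from the comment above: $2.61/M tokens, 1,000 tok/s, $45/hr node.
    print(round(dollars_per_hour(2.61, 1000), 2))
    print(round(streams_to_match(45, 2.61, 1000), 2))
```

At these numbers a single stream brings in about $9.40/hr, so a $45/hr node would need a bit under five full-speed streams to cover its cost.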

minraws 3 hours ago||
Assuming they mean 8xA100 or similar, that's some rather insane performance, and at just 3x the cost, it's still quite cheap-ish. With some optimisations this might be quite interesting.

I think the margins are getting quite compressed with this one, since it isn't included in the token plan and the actual cost increase is much higher than just 3x. But still fairly decent.

throwa356262 3 hours ago||
Suspect this will be included once out of beta but at a higher credit/token ratio.

Remember, these guys are not VC backed. Anything they do must break even

JayStavis 3 hours ago|||
> must break even

Understand the spirit of this, but probably not true. I don't think Xiaomi, or any big tech company, needs to break even on their new model releases.

varispeed 3 hours ago|||
Chinese "companies" are not companies in the western sense, but more like government departments with capitalist styling to deceive the western audience.

From that point of view, they have as much money as they need. That's why there is no "VC", because Chinese government assumes that role.

throwaway67678 2 hours ago||
Huge L for free market economies if true
Qdulf 2 hours ago||
Must be Blackwell for native fp4 support.
jbellis 2 hours ago||
it is hard to understand what the actually meaningful innovations are here / what TileRT is bringing to the table.

- dflash: new-ish but February is ancient by the standards of the pace of AI innovation lately, I guess applying it to a 1T model is new-ish in the sense that the dflash researchers don't have the hw budget to prove that out

- persistent engine kernel: this is like CUDA 101

- warp specialization: I think this just means "keep different gpu resources all busy w/ pipelining" which is CUDA 201, some of it is even baked into pytorch now

- MXFP4 QAT: not new

- TileRT: hard to tell what this actually does, there's a PyPI wheel with support for DS 3.2 and GLM 5 but binary only

0xbadcafebee 2 hours ago||
This is the value prop of Groq and Cerebras. They don't have the best models, but they have the fastest inference, and Groq has both the lowest cost and fastest speed.
npn 3 hours ago||
How?

edit: now that I've read the article fully, it seems they use some very effective MTP algorithm, and somehow the quality is still decent enough.

though I doubt that the quality really only drops a bit like they claimed. maybe for the benchmarks, but for general use heavily quantized models very often show worse results.

2001zhaozhao 1 hour ago||
i wonder if it will be possible to hardcode a model with some kind of MTP-adjacent algorithm to use a smaller portion of it to generate most of the tokens but route to the real experts every once in a while to steer it towards good thinking directions. (Perhaps this is done only when it's generating its thinking block, and the training takes it into account)

Could result in very high efficiency and still good intelligence without having to resort to fundamental adjustments like going to a diffusion LLM

npn 1 hour ago||
I doubt you can do that. MTP magic happens because for texts, we have a lot of low value fixed tokens that almost always get generated in the sequence (like punctuation, function words, language keywords etc). for the most important ones (the entities, the content words, variables) you still need the full model.

so there is always a maximum limit for how well MTP can do.
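
A rough way to see the limit, using the usual simplified speculative-decoding model rather than anything from the article: assume each drafted token is accepted independently with probability `p`, and the full model always contributes one token per verification step. Real acceptance is not independent per token (easy tokens cluster), so treat this as a sketch only.

```python
def expected_tokens_per_step(p, k):
    """Expected tokens emitted per full-model step with k drafted tokens.

    Assumes each drafted token is accepted independently with probability p
    and drafting stops at the first rejection: 1 + p + p^2 + ... + p^k.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return sum(p ** i for i in range(k + 1))


def speedup_ceiling(p):
    """Limit of expected_tokens_per_step as k grows without bound: 1 / (1 - p)."""
    if p >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - p)


if __name__ == "__main__":
    for k in (1, 2, 4, 8, 16):
        print(k, round(expected_tokens_per_step(0.8, k), 3))
    print("ceiling", round(speedup_ceiling(0.8), 3))
```

Under that assumption, drafting more tokens gives diminishing returns: with 80% acceptance you never average more than 5 tokens per full-model step, no matter how many you draft.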

lostmsu 2 hours ago||
They say they are using https://github.com/tile-ai/TileRT

- persistent CUDA kernel

- tiled processing with overlapping read/writes

- model designed with specific constraints in mind

aitchnyu 30 minutes ago||
Excuse me, do aliens live among us? 17 commits, 99% Python and multiplying the speed of GLM, Deepseek V4, MiMO 2.5?
qsera 3 hours ago||
Tokens per second is the "Megapixels" of AI marketing!
Octoth0rpe 3 hours ago|
I mean, sure, in the sense that it's a real and meaningful number for most of the spectrum on offer, and only gets silly when the number gets too high? There's a pretty big usability difference between 10 t/s and 100 t/s, and I can imagine similarly for 100->1000. I don't know about > 1000, but let's not pretend that the number is meaningless.
__natty__ 3 hours ago|
With this at 1k tps and Kimi 2.6 1k tps by Cerebras, I believe we are entering the next stage of LLMs, where companies will also compete on throughput