Posted by gainsurier 4 hours ago
> "However, naively applying FP4 across the entire model causes degradation in complex reasoning, logic, and code generation. Given the MoE (Mixture of Experts) architecture of Xiaomi MiMo-V2.5-Pro — where Experts constitute the vast majority of parameters and exhibit the highest tolerance to quantization — we selectively quantize only the MoE Experts to FP4 while preserving original precision for all other modules. Through FP4 QAT (Quantization-Aware Training), we dramatically reduce model size and maximize hardware bandwidth utilization while keeping the model's overall capability essentially on par with the original, as shown below"
The Xiaomi team really brought something to the table.
I think the answer is that there's a tradeoff here where additional throughput for a single person can be achieved only by tying up more resources than a normal request would, even when you take into account the fact that the normal request takes longer to finish. I'm not an expert, but some of the optimizations they describe, particularly the parallel prediction stuff, sound like they could take up extra resources.
$2.61/M tokens * 1,000 tok/s = $9.40/hr
That would be pretty cheap for an 8-GPU node which would typically run around $45/hr or more. Guess this depends on how many parallel streams it can handle.
I think the margins are getting quite compressed with this one, since it isn't included in token plan and the actual costs increase are much higher than just 3x. But still fairly decent.
Remember, these guys are not VC backed. Anything they do must break even
Understand the spirit of this, but probably not true. I don't think Xiaomi, or any big tech company, needs to break even on their new model releases.
From that point of view, they have as much money as they need. That's why there is no "VC", because Chinese government assumes that role.
- dflash: new-ish but February is ancient by the standards of the pace of AI innovation lately, I guess applying it to a 1T model is new-ish in the sense that the dflash researchers don't have the hw budget to prove that out - persistent engine kernel: this is like CUDA 101 - warp specialization: I think this just means "keep different gpu resources all busy w/ pipelining" which is CUDA 201, some of it is even baked into pytorch now - MXFP4 QAT: not new - TileRT: hard to tell what this actually does, there's a PyPi wheel with support for DS 3.2 and GLM 5 but binary only
edit: now I read the article fully, seems like they utilize some very effective MTP algorithm. and somehow the quality is still decent enough.
though, I doubt that the quality really only drip a bit like they claimed. maybe for the benchmarks, but for general uses the heavily quantized models very often so worse result.
Could result in very high efficiency and still good intelligence without having to resort to fundamental adjustments like going to a diffusion LLM
so there is alwasy a maximum limit for how well MTP can do.
- persistent CUDA kernel
- tiled processing with overlapping read/writes
- model designed with specific constraints in mind