Posted by gainsurier 5 hours ago
Despite the performative UI components, they have a shipped (demo) product:
This is only 3.1 8B with a very small context window, but at 17k tokens per second it's likely enough to reliably call tools, which would make a huge difference in agentic applications. Assuming they can bake in better models, I'm just as bullish or even more so on this, considering it opens up edge computing at an extremely low power requirement.
High tok/s is the future IMO.
128 sounds really tiny. I wonder if they mean some kind of blocks?
[0] https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#4...
> It uses 384 routed experts (top-8) with hybrid attention (full-attention + sliding-window 128 at 6:1 ratio) over 70 layers (1 dense + 69 MoE)
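For anyone trying to picture what "sliding-window 128" would mean if it counts tokens rather than blocks, here is a rough sketch of a causal sliding-window mask. This is my own illustration, not code from the model card: the function names are made up, and I'm assuming the window is measured in tokens and includes the current token (conventions differ on that).

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Causal sliding-window mask (hypothetical helper, not from the model card).

    mask[q][k] is True when query position q may attend to key position k.
    Assumption: the window counts tokens and includes the query token itself.
    """
    return [
        [0 <= q - k < window for k in range(seq_len)]
        for q in range(seq_len)
    ]


def stacked_reach(n_layers: int, window: int) -> int:
    """How many positions back information can travel through n_layers
    consecutive sliding-window layers: each layer adds (window - 1) positions."""
    return n_layers * (window - 1)


if __name__ == "__main__":
    mask = sliding_window_mask(seq_len=300, window=128)
    # Number of keys visible to the first and the last query position.
    print(sum(mask[0]), sum(mask[299]))
    # Reach after a few stacked sliding-window layers (layer count is arbitrary here).
    print(stacked_reach(n_layers=6, window=128))
```

Under that reading, a single sliding-window layer only sees the last 128 tokens, but stacking such layers extends the reach, and the interleaved full-attention layers mentioned in the quote see the whole context anyway.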
Getting ~1000 TPS on near-frontier intelligence is a step change, and enables whole new use-cases for applications. Seeing limited compute resources beget selective access makes me worry for the future of competition.
Really?
I think this site often overlooks that second group and how large it likely is.