MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

mimo.xiaomi.com - 248 points - 173 comments - 8472 seconds ago

Comments (38)

goyozi - 4763 seconds ago
Fast AI seems genuinely exciting and somewhat unsettling to me. Right now Claude is faster than me on some tasks but we’re at least close. I have a prompt to clean up a PR that’s been running for 1h now and I expect it to take another few. It’s hard to imagine what the workflow would look like if it was near-instant. On the one hand, it might be easier to focus. Some prompts take so long that I start to multitask and regret it later. On the other, AI that takes a few seconds to a few minutes at most to solve what used to take hours or days? That’s a game changer and I don’t even know where we fit in.
dakiol - 2578 seconds ago
So, regarding the productivity argument: I don't get it. It doesn't really matter (for regular employees) that you can now do in 2h what used to take 2 days. Why? Because it's not like you get the rest of the day for yourself. You still have to work 8h/day as usual. But now the pattern is different: instead of enjoying the craft and digging deeper into problems over the span of 2 days, now you are rushing to some slot machine in the hope that it gives you the right answer with the right prompt.
So, if anything, I would say it's worse for us. Obviously, it's the complete opposite for corporations and executives: they are loving the AI situation so much!
amunozo - 6194 seconds ago
These price and speed optimizations from Chinese providers, combined with the rising prices from American ones, will change the game sooner rather than later. Many companies are already running into problems with their AI bills.
gertlabs - 3872 seconds ago
MiMo V2.5 Pro (regular speed) remains the strongest open weights agentic coding model we've tested -- it's been interesting to see how little attention it has received relative to some lower performing releases. And the "fast mode" pricing is very competitive here.
Data at https://gertlabs.com/rankings
kingstnap - 6381 seconds ago
Given that MiMo is as cheap as Deepseek ( previous discussion: https://news.ycombinator.com/item?id=48282814 ) multiplying that by 3x for ultra speed is still shockingly cheap.
serpix - 6239 seconds ago
I may sound like a shill, but exponential growth and all. We are going to get near-instant software from a prompt, generate multiple versions, and then choose the best one.
Discussions about choosing the library with the best syntactic-sugar method naming are just as crazy as suggesting we type in assembly.
prplfsh - 4538 seconds ago
This will be really powerful for voice. Being able to reason makes LLMs so much smarter, but with voice your latency budget is so tight that you typically can't spare the time.
eli - 5347 seconds ago
Neat. The frontier models have gotten pretty impressive, but they're all a bit too slow for interactive, human-in-the-loop coding. It incentivizes vibecoding and running multiple agents in parallel. A fast agent feels more like a partner.
For a while I was running Cerebras GLM 4.7 for a bunch of tasks. Not a very smart model, but it's fantastic to have a live prototype of a site up and be able to type "make the fonts bigger. No not that big" and see it change in real time. And MiMo 2.5 is a lot more capable than GLM 4.7.
GodelNumbering - 1989 seconds ago
Below is the part I found most interesting
> "However, naively applying FP4 across the entire model causes degradation in complex reasoning, logic, and code generation. Given the MoE (Mixture of Experts) architecture of Xiaomi MiMo-V2.5-Pro — where Experts constitute the vast majority of parameters and exhibit the highest tolerance to quantization — we selectively quantize only the MoE Experts to FP4 while preserving original precision for all other modules. Through FP4 QAT (Quantization-Aware Training), we dramatically reduce model size and maximize hardware bandwidth utilization while keeping the model's overall capability essentially on par with the original, as shown below"
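For anyone trying to picture what "quantize only the MoE Experts" means mechanically, here is a minimal sketch in plain Python. It is not Xiaomi's code. The assumptions are mine: expert tensors are identified by a hypothetical `.experts.` substring in the parameter name, blocks of 32 values share one power-of-two scale (in the spirit of MXFP4), and the function names are made up for illustration. Real QAT would apply this rounding in the forward pass during training; the sketch only shows the rounding step itself.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign is handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quant_fp4(values, block=32):
    """Round values to FP4 with one power-of-two scale per block, then dequantize."""
    out = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        peak = max(abs(v) for v in chunk)
        if peak == 0.0:
            out.extend([0.0] * len(chunk))
            continue
        # Shared scale: 2 is the exponent of the largest FP4 magnitude (6 = 1.5 * 2**2).
        scale = 2.0 ** (math.floor(math.log2(peak)) - 2)
        for v in chunk:
            # Clamp to the largest representable magnitude, then snap to the grid.
            mag = min(abs(v) / scale, FP4_GRID[-1])
            nearest = min(FP4_GRID, key=lambda g: abs(g - mag))
            out.append(math.copysign(nearest * scale, v))
    return out


def quantize_experts_only(params):
    """Quantize tensors named like MoE experts; copy everything else unchanged."""
    return {
        name: fake_quant_fp4(w) if ".experts." in name else list(w)
        for name, w in params.items()
    }
```

Calling `quantize_experts_only` on a dict that mixes attention and expert weights leaves the attention values bit-for-bit identical and snaps only the expert values onto the coarse grid, which is the selective behaviour the quote describes.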
Oras - 5133 seconds ago
1k TPS is great, but I’m more fascinated by the amount of AI generated comments in this thread!
scosman - 6096 seconds ago
Cerebras is trialing Kimi K2.6 at 3000t/s (invite only). I'm excited for when the fast hardware gets more mainstream for frontier models. Models designed for speed on Nvidia are a nice addition that could bridge the gap.
maxloh - 6660 seconds ago
The generation speed in the demo video is crazy, to say the least, and completely beyond my impressions of LLMs.
The Xiaomi team really brought something to the table.
pants2 - 1785 seconds ago
With a tps and a token price you can calculate approx. price per hour of running the model!
$2.61/M tokens * 1,000 tok/s * 3,600 s/hr = $9.40/hr
That would be pretty cheap for an 8-GPU node which would typically run around $45/hr or more. Guess this depends on how many parallel streams it can handle.
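A quick sketch of that arithmetic in Python, for anyone who wants to plug in other numbers. The $2.61/M price, 1,000 tok/s speed, and $45/hr node cost are the figures from this comment; the function names are made up for illustration, and the break-even figure assumes every stream generates tokens continuously, which real traffic will not.

```python
def hourly_revenue_usd(price_per_million_tokens, tokens_per_second):
    """Revenue from one stream that generates tokens non-stop for an hour."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_million_tokens * tokens_per_hour / 1_000_000


def break_even_streams(node_cost_per_hour, price_per_million_tokens, tokens_per_second):
    """Number of always-busy parallel streams needed to cover the node's hourly cost."""
    per_stream = hourly_revenue_usd(price_per_million_tokens, tokens_per_second)
    return node_cost_per_hour / per_stream


per_stream = hourly_revenue_usd(2.61, 1000)
streams = break_even_streams(45.0, 2.61, 1000)
print(f"one stream: ${per_stream:.2f}/hr, break-even: {streams:.1f} streams")
```

With these inputs a single stream brings in about $9.40/hr, so a $45/hr node would need roughly five fully loaded streams to break even.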
irthomasthomas - 6616 seconds ago
I don't understand, given all they say, why this would not be made available to everyone at once? Why the limited release? They should have no trouble scaling it if it runs on a single rack.
jbellis - 2112 seconds ago
it is hard to understand what the actually meaningful innovations are here / what TileRT is bringing to the table.
- dflash: new-ish but February is ancient by the standards of the pace of AI innovation lately, I guess applying it to a 1T model is new-ish in the sense that the dflash researchers don't have the hw budget to prove that out
- persistent engine kernel: this is like CUDA 101
- warp specialization: I think this just means "keep different gpu resources all busy w/ pipelining" which is CUDA 201, some of it is even baked into pytorch now
- MXFP4 QAT: not new
- TileRT: hard to tell what this actually does, there's a PyPI wheel with support for DS 3.2 and GLM 5 but binary only
0xbadcafebee - 1853 seconds ago
This is the value prop of Groq and Cerebras. They don't have the best models, but they have the fastest inference, and Groq has both the lowest cost and fastest speed.
minraws - 6791 seconds ago
Assuming they mean 8xA100 or similar, that's some rather insane performance, and at just 3x the cost, it's still quite cheap-ish. With some optimisations this might be quite interesting.
I think the margins are getting quite compressed with this one, since it isn't included in the token plan and the actual cost increase is much higher than just 3x. But still fairly decent.
npn - 6648 seconds ago
How?
edit: now that I've read the article fully, it seems they use some very effective MTP algorithm, and somehow the quality is still decent enough.
Though I doubt the quality really only drops a bit like they claim. Maybe on the benchmarks, but in general use heavily quantized models very often show worse results.
qsera - 5533 seconds ago
Tokens per second is the "Megapixels" of AI marketing!
isusmelj - 3480 seconds ago
No note about the specific GPU they use. One might speculate. B200? H200? H100?
__natty__ - 5669 seconds ago
With this at 1k tps and Kimi 2.6 1k tps by Cerebras, I believe we are entering the next stage of LLMs, where companies will also compete on throughput
pullshark91 - 4049 seconds ago
It's interesting but not game-changing IMO. Speed here is not a bottleneck.
h14h - 4688 seconds ago
The gated "ultra-speed" phenomenon seen here and with the Cerebras Kimi K2.6 release, while understandable, is somewhat troubling IMO.
Getting ~1000 TPS on near-frontier intelligence is a step change, and enables whole new use-cases for applications. Seeing limited compute resources beget selective access makes me worry for the future of competition.
moffkalast - 6644 seconds ago
42B active params, sliding window attention. There's your tradeoff.
PhunkyPhil - 3262 seconds ago
Obligatory taalas mention:
https://taalas.com/
Despite the performative UI components they have a shipped (demo) product:
https://chatjimmy.ai/
This is only 3.1 8B with a very small context window, but at 17k tokens per second it's likely enough to reliably call tools, which would make a huge difference in agentic applications. Assuming they can bake in better models, I'm just as bullish or even more so on this, considering it opens up edge computing given the extremely low power requirement.
High tok/s is the future IMO.
elar_verole - 6889 seconds ago
Yeah, this seems to be the easiest path for overall agents efficiency in the short term
- 6676 seconds ago
harel - 5389 seconds ago
There are a few things in life where I can't fully grasp why they are so sought after. One is the constant need to exhibit growth. As if being massive and staying massive is not good enough, one has to always and continuously grow. The other is constant speed increases. We're already operating at 50x speed. My output is much wider and so much faster, I am sometimes my own bottleneck. And now, as if that is not enough, we want more speed. "I want a full software product from scratch in 12 seconds, because 5 minutes is too long and I got things to do..."
Really?
holoduke - 5799 seconds ago
Speed is indeed the next big thing that should happen with frontier LLMs. Current models, but 1000 times faster, would be super useful. Earlier this week it took Claude at least a full week with two Max subscriptions to solve a complex issue where we wanted to mimic an occlusion mapping variant used in the game Crimson Desert. Pretty complex mathematical challenge. With an ultra-fast LLM and a proper self-verification process it would be awesome.
desireco42 - 3128 seconds ago
I didn't use their pro speed, but regular Mimo-v2.5, not even pro, seems really fast. I have plenty of tokens and subscriptions but this is really impressive. I really don't need another one, but I am tempted simply because it works so fast. I can't imagine how fast this fast service can be.
trilogic - 4606 seconds ago
Pfff time wasting.
1. Password between 8-16 characters, and this and that... What???
2. Captcha after captcha, come on
3. Service unavailable: "This service is not available in your region yet."
Are you kidding me. Come back when you are ready for the users. I was hoping to try it, what a frustration.
GaggiX - 6093 seconds ago
If MiMo v2.5 Pro can run at >1000tk/s on GPUs then I will soon expect the same from OpenAI/Anthropic/Google.
slopinthebag - 7011 seconds ago
I hope this is the next frontier AI labs push. Even the open models are smart enough, and they’re cheap enough, now if they can be fast enough they can make certain workflows possible and allow us to remain in flow state while we use them.
m00dy - 7031 seconds ago
boom!
aplomb1026 - 803 seconds ago
[flagged]
maxothex - 6468 seconds ago
[flagged]
FastAnchor - 5627 seconds ago
[dead]
atemerev - 7025 seconds ago
I test all Chinese models with the prompt "What happened on Tiananmen Square on June 4th, 1989?" MiMo-2.5-Pro so far passes the test (explains the event correctly), both on DeepInfra and Xiaomi providers. So not bad.