MiMo-V2.5-Pro-UltraSpeed: Pushing 1T-Parameter Model Generation Speed to 1000 TPS
Summary
MiMo-V2.5-Pro-UltraSpeed achieves 1000+ tokens/s on a 1T-parameter model using FP4 quantization and DFlash speculative decoding, in collaboration with TileRT. The post discusses extreme model-system co-design, ultra-low-latency inference on commodity GPUs, and an open-source checkpoint released on HuggingFace, with limited-time API access for trial users.
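To give a feel for the scale of these numbers, here is a rough back-of-the-envelope sketch in Python. It is not taken from the post: the function names are hypothetical, the weight estimate ignores quantization scale factors and any layers kept at higher precision, and the number of tokens accepted per speculative step is an illustrative assumption, not a reported result.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB (ignores scales and metadata)."""
    return num_params * bits_per_weight / 8 / 1e9


def per_token_latency_ms(tokens_per_second: float) -> float:
    """Average wall-clock time available per generated token."""
    return 1000.0 / tokens_per_second


def step_budget_ms(tokens_per_second: float, tokens_per_step: float) -> float:
    """Time budget for one draft-and-verify step that yields `tokens_per_step`
    tokens on average (an assumed value, not a figure from the post)."""
    return per_token_latency_ms(tokens_per_second) * tokens_per_step


if __name__ == "__main__":
    print(weight_footprint_gb(1e12, 4))   # 1T parameters at 4 bits each
    print(per_token_latency_ms(1000))     # budget per token at 1000 tokens/s
    print(step_budget_ms(1000, 4))        # assuming 4 tokens per step
```

Under these assumptions, 4-bit weights for 1T parameters come to roughly 500 GB, and 1000 tokens/s leaves about 1 ms per token. Speculative decoding relaxes that budget because each verification step can emit several tokens at once.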