DigiNews

Tech Watch by Johan Denoyer


Real-time LLM Inference on Standard GPUs: 3,000 tokens/s per request

Quality: 8/10 Relevance: 9/10

Summary

Kog AI demonstrates real-time LLM inference on standard GPUs, reaching 3,000 tokens per second per request on 8× MI300X and 2,100 on 8× H200 by combining a monokernel runtime with hardware-aware optimizations. The piece identifies memory bandwidth as the primary bottleneck for single-request decoding and details a co-design approach spanning model architecture, runtime, and GPU code, with plans to scale to larger MoE models.
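To make the memory-bandwidth point concrete, here is a rough back-of-the-envelope sketch, not Kog AI's method. It assumes that decoding one token for a single request requires streaming all active weights from GPU memory once, so throughput is capped at aggregate bandwidth divided by bytes read per token. The function name, the 2-billion-parameter model size, and the 16-bit weight format are hypothetical choices for illustration; the per-GPU bandwidth values are vendor peak specifications, not figures from the article, and the bound ignores KV-cache reads, inter-GPU communication, and kernel launch overhead.

```python
def decode_tokens_per_second_bound(
    active_params: float,
    bytes_per_param: float,
    bandwidth_bytes_per_s_per_gpu: float,
    num_gpus: int,
) -> float:
    """Upper bound on single-request decode speed, assuming each generated
    token requires reading every active weight from memory exactly once."""
    bytes_per_token = active_params * bytes_per_param
    aggregate_bandwidth = bandwidth_bytes_per_s_per_gpu * num_gpus
    return aggregate_bandwidth / bytes_per_token


# Hypothetical model: 2e9 active parameters stored in 16-bit (2 bytes each).
ACTIVE_PARAMS = 2e9
BYTES_PER_PARAM = 2

# Vendor peak memory bandwidth per GPU (assumption, not from the article).
MI300X_BW = 5.3e12  # bytes/s
H200_BW = 4.8e12    # bytes/s

mi300x_bound = decode_tokens_per_second_bound(ACTIVE_PARAMS, BYTES_PER_PARAM, MI300X_BW, 8)
h200_bound = decode_tokens_per_second_bound(ACTIVE_PARAMS, BYTES_PER_PARAM, H200_BW, 8)

print(f"8x MI300X bound: {mi300x_bound:,.0f} tokens/s")
print(f"8x H200 bound:   {h200_bound:,.0f} tokens/s")
```

Under these assumptions the bound scales linearly with bandwidth and inversely with model size, which is why a single-request decoder tends to be limited by how fast weights can be read rather than by arithmetic throughput.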

🚀 Service built by Johan Denoyer