Real-time LLM Inference on Standard GPUs: 3,000 tokens/s per request
Summary
Kog AI demonstrates real-time LLM inference on standard GPUs, reaching 3,000 tokens per second per request on 8× MI300X and 2,100 on 8× H200 with a monokernel runtime and hardware-aware optimizations. The piece identifies memory bandwidth as the primary limiter for single-request decoding and details a co-design approach spanning model architecture, runtime, and GPU code, with plans to scale to larger MoE models.
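To make the memory-bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes the simplest roofline model for single-request decoding: each generated token requires streaming a fixed number of bytes (mostly weights) from HBM once, so the token rate cannot exceed aggregate bandwidth divided by bytes read per token. The bandwidth figures are the vendors' published peak numbers (about 5.3 TB/s per MI300X and 4.8 TB/s per H200), not measured values, and the model size is a hypothetical placeholder, not Kog AI's model. The function names are illustrative.

```python
def max_decode_tokens_per_s(num_gpus: int, peak_tb_per_s: float, bytes_per_token: float) -> float:
    """Bandwidth-bound ceiling on single-request decode speed.

    Assumes perfect sharding across GPUs and that every token streams
    `bytes_per_token` bytes from HBM exactly once (an idealization).
    """
    aggregate_bytes_per_s = num_gpus * peak_tb_per_s * 1e12
    return aggregate_bytes_per_s / bytes_per_token


def bytes_budget_per_token(num_gpus: int, peak_tb_per_s: float, tokens_per_s: float) -> float:
    """Inverse view: how many bytes can be read per token at a given rate."""
    return num_gpus * peak_tb_per_s * 1e12 / tokens_per_s


# Published peak HBM bandwidth per GPU, in TB/s (vendor figures, not measurements).
MI300X_TB_S = 5.3
H200_TB_S = 4.8

# Hypothetical model: 2e9 parameters stored in 2 bytes each (assumption for illustration).
hypothetical_weight_bytes = 2e9 * 2

ceiling_mi300x = max_decode_tokens_per_s(8, MI300X_TB_S, hypothetical_weight_bytes)
ceiling_h200 = max_decode_tokens_per_s(8, H200_TB_S, hypothetical_weight_bytes)

# Memory traffic available per token at the rates reported in the summary.
budget_mi300x = bytes_budget_per_token(8, MI300X_TB_S, 3000)
budget_h200 = bytes_budget_per_token(8, H200_TB_S, 2100)

if __name__ == "__main__":
    print(f"Ceiling on 8x MI300X: {ceiling_mi300x:,.0f} tokens/s")
    print(f"Ceiling on 8x H200:   {ceiling_h200:,.0f} tokens/s")
    print(f"Per-token budget at 3,000 tok/s on 8x MI300X: {budget_mi300x / 1e9:.1f} GB")
    print(f"Per-token budget at 2,100 tok/s on 8x H200:   {budget_h200 / 1e9:.1f} GB")
```

Real systems land below these ceilings because of synchronization between GPUs, kernel launch overhead, and KV-cache reads, which is presumably where a monokernel runtime and hardware-aware tuning are meant to help.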