Continuous batching from first principles (2025)
Summary
The article explains how to maximize throughput in large language model serving by combining KV caching, chunked prefill, and ragged batching with dynamic scheduling. It covers the difference between prefill and decoding, the layout and shapes of the Q, K, and V tensors in attention, and how to batch prompts without padding waste to achieve high concurrency. The piece is a deep dive with visuals and practical implications for deploying AI chat systems at scale.
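The core scheduling idea the article covers can be sketched in a few lines: each decode step, finished sequences leave the batch and waiting requests join immediately, so slots never sit idle padding out the longest sequence. The sketch below is a hypothetical toy model (the `Request` class, field names, and `max_batch` parameter are illustrative, not from the article), with each sequence tracking its own length as in a ragged batch.

```python
from collections import deque

class Request:
    """Toy request: a prompt already prefilled into a per-sequence KV cache,
    plus a budget of decode tokens still to generate."""
    def __init__(self, rid, prompt_len, max_new_tokens):
        self.rid = rid
        self.prompt_len = prompt_len     # tokens held in this sequence's KV cache
        self.remaining = max_new_tokens  # decode steps still needed
        self.generated = 0

def continuous_batching(requests, max_batch=2):
    """Run a toy continuous-batching loop; return total decode steps taken
    and the order in which requests completed."""
    queue = deque(requests)
    running = []
    steps = 0
    completed = []
    while queue or running:
        # Admit new requests the moment a slot frees up. No padding is
        # needed: each sequence keeps its own KV length (a ragged batch).
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        steps += 1
        for r in running:
            r.generated += 1   # one decode token per sequence per step
            r.remaining -= 1
        for r in running:
            if r.remaining == 0:
                completed.append(r.rid)
        running = [r for r in running if r.remaining > 0]
    return steps, completed
```

With three requests needing 3, 1, and 2 new tokens and two slots, the short request finishes after one step and its slot is reused immediately, so all three complete in 3 decode steps rather than the 5 a static padded batch would take.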