Inference cost at scale with napkin math
Summary
This article presents a napkin-math approach to estimating the cost of GPU-based LLM inference, covering matrix-multiplication costs, KV-cache optimizations, and tokens-per-second calculations. It uses an NVIDIA B200-style GPU as a case study to illustrate throughput versus memory bandwidth, realistic concurrency, and per-user cost, with takeaways for scaling inference on limited VRAM.