Inference cost at scale with napkin math

June 16, 2026 at 18:57

Quality: 8/10 Relevance: 9/10

Summary

This article presents a napkin-math approach to estimating GPU-based LLM inference costs, covering matrix-multiplication costs, KV-cache optimizations, and tokens-per-second calculations. It uses an NVIDIA B200-style GPU as a case study to illustrate throughput versus memory bandwidth, realistic concurrency, and per-user cost, and closes with takeaways for scaling inference on limited VRAM.
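
As a rough illustration of the kind of tokens-per-second and per-user cost estimate described above, here is a minimal sketch. It assumes decoding is memory-bandwidth bound, so each generated token requires reading every model weight once, and it ignores KV-cache traffic and compute limits. The function names and all input values (parameter count, bytes per parameter, bandwidth, hourly price) are hypothetical placeholders, not figures taken from the article.

```python
def decode_tokens_per_second(param_count, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Upper bound on single-stream decode speed.

    Assumption: each generated token reads all model weights from GPU
    memory exactly once, so speed = bandwidth / weight size.
    """
    weight_bytes = param_count * bytes_per_param
    return mem_bandwidth_bytes_per_s / weight_bytes


def cost_per_million_tokens(gpu_cost_per_hour, total_tokens_per_second):
    """Dollar cost per one million generated tokens at a given aggregate rate."""
    tokens_per_hour = total_tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs chosen only for illustration (assumptions).
    params = 70e9            # model parameters
    bytes_per_param = 2      # 16-bit weights
    bandwidth = 8e12         # bytes per second of memory bandwidth
    gpu_price = 5.0          # dollars per GPU-hour
    concurrent_users = 32    # batch size served at once

    per_stream = decode_tokens_per_second(params, bytes_per_param, bandwidth)
    # Simplifying assumption: while memory bound, a batch shares one pass
    # over the weights, so aggregate throughput scales with batch size.
    aggregate = per_stream * concurrent_users

    print(f"per-stream tokens/s: {per_stream:.1f}")
    print(f"aggregate tokens/s:  {aggregate:.1f}")
    print(f"$ per 1M tokens:     {cost_per_million_tokens(gpu_price, aggregate):.3f}")
```

The linear batch scaling is the most optimistic part of this sketch: in practice KV-cache reads grow with both batch size and context length, which is why realistic concurrency is lower than the naive bound suggests.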

Machine Learning Hardware
