DigiNews

Tech Watch by Johan Denoyer

Inference cost at scale with napkin math

Quality: 8/10 Relevance: 9/10

Summary

This article presents a napkin-math approach to estimating the cost of GPU-based LLM inference. It covers matrix-multiplication costs, KV-cache optimizations, and tokens-per-second calculations, and uses an NVIDIA B200-style GPU as a case study to illustrate throughput versus memory bandwidth, realistic concurrency, and per-user cost. It closes with takeaways for scaling inference on limited VRAM.
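To give a feel for the kind of estimate the summary describes, here is a minimal sketch of the arithmetic. It is not the article's own calculation: the function names, the model shape, the hardware figures, and the hourly price are all hypothetical placeholders chosen for illustration. It assumes decoding is memory-bandwidth bound (every weight is read once per generated token) and that the VRAM left over after loading the weights is spent on KV cache.

```python
# Napkin-math sketch. Every number used below is an illustrative
# assumption, not a figure taken from the article.


def decode_tokens_per_second(param_count, bytes_per_param, mem_bandwidth_bytes_s):
    """Rough upper bound on single-stream decode speed when memory-bound:
    each generated token requires streaming all weights from VRAM once."""
    weight_bytes = param_count * bytes_per_param
    return mem_bandwidth_bytes_s / weight_bytes


def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value):
    """KV-cache footprint of one token: one key and one value vector
    per layer, hence the factor of 2."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_concurrent_users(vram_bytes, weight_bytes, kv_bytes_per_token, context_tokens):
    """How many full-length contexts fit in the VRAM left after the weights."""
    free_bytes = vram_bytes - weight_bytes
    if free_bytes <= 0:
        return 0
    return int(free_bytes // (kv_bytes_per_token * context_tokens))


def cost_per_million_tokens(gpu_dollars_per_hour, aggregate_tokens_per_second):
    """Dollar cost of generating one million tokens at a given total throughput."""
    tokens_per_hour = aggregate_tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical 70B-parameter model in 16-bit weights on a GPU
    # assumed to have 192 GB of VRAM and 8 TB/s of memory bandwidth.
    params = 70 * 10**9
    weight_bytes = params * 2
    tps = decode_tokens_per_second(params, 2, 8 * 10**12)
    kv = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_value=2)
    users = max_concurrent_users(192 * 10**9, weight_bytes, kv, context_tokens=8192)
    print(f"single-stream decode bound: {tps:.1f} tokens/s")
    print(f"KV cache per token: {kv} bytes")
    print(f"concurrent 8k-token contexts: {users}")
    # Assumed price of $5 per GPU-hour and 1000 aggregate tokens/s.
    print(f"cost per 1M tokens: ${cost_per_million_tokens(5.0, 1000):.2f}")
```

The point of splitting the estimate into four small functions is that each one maps to a single line of napkin math, so swapping in a different GPU, model size, or price only changes the inputs.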

🚀 Service built by Johan Denoyer