Making Deep Learning Go Brrrr From First Principles
Summary
This article provides a first-principles framework for diagnosing and speeding up deep learning workloads by breaking down performance into compute, memory bandwidth, and overhead. It explains operator fusion, memory bandwidth costs, GPU FLOPS, and how to select optimizations using PyTorch, Triton, and CUDA graphs.
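As a rough illustration of the compute / memory bandwidth / overhead breakdown, the sketch below estimates which regime an operation falls into. The function name `classify_regime`, its parameters, the `overhead_factor` threshold, and the peak hardware figures are all assumptions made for this example (the peaks are roughly A100-class numbers, used only as placeholders). None of it is an API from PyTorch, Triton, or CUDA.

```python
def classify_regime(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s,
                    measured_s=None, overhead_factor=10.0):
    """Guess whether an op is limited by compute, memory bandwidth, or overhead.

    All arguments are illustrative assumptions, not values from a real profiler.
    """
    # Time the op would take if only arithmetic mattered.
    compute_s = flops / peak_flops_per_s
    # Time the op would take if only moving data to and from memory mattered.
    memory_s = bytes_moved / peak_bytes_per_s
    # Ideal time is bounded below by the slower of the two.
    ideal_s = max(compute_s, memory_s)

    # If the measured time is far above the ideal time, the hardware is mostly
    # idle and something else (e.g. framework or launch overhead) dominates.
    if measured_s is not None and measured_s > overhead_factor * ideal_s:
        return "overhead"
    return "compute" if compute_s >= memory_s else "memory bandwidth"


# Placeholder hardware peaks (assumed): ~19.5 TFLOP/s fp32, ~1.5 TB/s memory.
PEAK_FLOPS = 19.5e12
PEAK_BYTES = 1.5e12

# Elementwise op on n fp32 values: 1 FLOP each, 4 bytes read + 4 bytes written.
n = 1 << 24
elementwise = classify_regime(n, 8 * n, PEAK_FLOPS, PEAK_BYTES)

# Square fp32 matmul of size m: 2*m^3 FLOPs, three m*m matrices of 4-byte values.
m = 4096
matmul = classify_regime(2 * m**3, 3 * m * m * 4, PEAK_FLOPS, PEAK_BYTES)

# Tiny op whose (hypothetical) measured time dwarfs the ideal time.
tiny = classify_regime(1000, 8000, PEAK_FLOPS, PEAK_BYTES, measured_s=1e-5)

print(elementwise, matmul, tiny, sep=" | ")
```

Under these assumed peaks, the elementwise operation moves far more bytes per FLOP than the hardware can feed, which is the case where operator fusion helps. The large matmul has enough arithmetic per byte to be limited by compute. The tiny operation is dominated by time spent outside the GPU kernel, which is the case techniques such as CUDA graphs target.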