Anatomy of High-Performance Matrix Multiplication (2008) [pdf]
Summary
The paper analyzes how to maximize the performance of general matrix-matrix multiplication (GEMM) by optimizing data movement, cache usage, and microkernel design. It emphasizes blocking (tiling), memory-bandwidth considerations, and architecture-aware techniques for achieving high throughput, and it serves as a foundational reference for developers of fast linear algebra kernels.
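The blocking (tiling) idea mentioned in the summary can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the loop order, and the block sizes `mc`, `kc`, and `nc` are assumptions chosen for illustration, and pure Python will not show the speedup that a tuned, cache-aware kernel achieves. It only shows how the three loops over blocks wrap a small inner kernel that works on one tile at a time.

```python
def naive_matmul(A, B):
    """Reference triple loop: C = A @ B for lists of lists."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            s = 0.0
            for p in range(k):
                s += A[i][p] * B[p][j]
            C[i][j] = s
    return C


def blocked_matmul(A, B, mc=4, kc=4, nc=4):
    """Blocked (tiled) C = A @ B.

    mc, kc, nc are illustrative block sizes (assumptions, not tuned
    values). In a real kernel they would be chosen so that the tiles
    of A and B being reused stay resident in cache.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for jc in range(0, n, nc):              # column panel of B and C
        j_end = min(jc + nc, n)
        for pc in range(0, k, kc):          # slice of the shared dimension
            p_end = min(pc + kc, k)
            for ic in range(0, m, mc):      # row block of A and C
                i_end = min(ic + mc, m)
                # Inner kernel: update one tile of C from one tile of A
                # and one tile of B.
                for i in range(ic, i_end):
                    row_a, row_c = A[i], C[i]
                    for p in range(pc, p_end):
                        a = row_a[p]
                        row_b = B[p]
                        for j in range(jc, j_end):
                            row_c[j] += a * row_b[j]
    return C
```

A call such as `blocked_matmul(A, B, mc=2, kc=3, nc=2)` should return the same matrix as `naive_matmul(A, B)`, including when the dimensions are not multiples of the block sizes, since the `min(...)` bounds handle the ragged edge tiles.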