What it takes to transpose a matrix
Summary
The piece examines the memory hierarchy and cache-aware techniques for optimizing matrix transposition. It walks through naive, reverse, block-based, prefetching, and SIMD approaches, measuring each in cycles per element and with PMU (performance monitoring unit) counters. Its central point is that memory latency and cache behavior, not arithmetic, dominate the cost, and it demonstrates practical strategies that achieve speedups of up to 25x on large matrices.
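To make the contrast between the naive and block-based approaches concrete, here is a minimal sketch in C. It is not the article's code: the function names, the `double` element type, the row-major layout, and the `block` parameter are all assumptions chosen for illustration. Both functions produce the same result; they differ only in the order in which memory is touched.

```c
#include <assert.h>
#include <stddef.h>

/* Naive transpose of a row-major rows x cols matrix.
 * Reads of src are sequential, but writes to dst jump by `rows`
 * elements each step, so every write may land on a different cache line. */
void transpose_naive(const double *src, double *dst,
                     size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            dst[j * cols / cols * rows + i] = src[i * cols + j];
}

/* Block-based transpose: the matrix is walked in block x block tiles so
 * that the source and destination lines of one tile can stay in cache
 * while the tile is processed. `block` is a hypothetical tuning knob;
 * a suitable value depends on the cache sizes of the target machine. */
void transpose_blocked(const double *src, double *dst,
                       size_t rows, size_t cols, size_t block)
{
    assert(block > 0);
    for (size_t ib = 0; ib < rows; ib += block) {
        /* Clamp the tile edge so non-multiple dimensions are handled. */
        size_t i_end = (ib + block < rows) ? ib + block : rows;
        for (size_t jb = 0; jb < cols; jb += block) {
            size_t j_end = (jb + block < cols) ? jb + block : cols;
            for (size_t i = ib; i < i_end; i++)
                for (size_t j = jb; j < j_end; j++)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

The tiled loop does exactly the same number of loads and stores as the naive one, so any difference in speed comes from cache behavior alone. How large that difference is depends on the matrix size and the hardware, and should be measured rather than assumed.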