Demystifying ARM SME to Optimize General Matrix Multiplications
Summary
The paper introduces MpGEMM, an open-source GEMM library optimized for ARM's Scalable Matrix Extension (SME). It details cache-aware partitioning, on-the-fly data packing, and SME-aware micro-kernels, and demonstrates speedups of around 1.23x over Apple Accelerate on real LLM inference workloads such as DeepSeek and LLaMA; the optimization guidance carries over to other ARM-based AI/HPC workflows.