Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon
Summary
Tri Dao and collaborators introduce Gram Newton-Schulz, a hardware-aware variant of the Newton-Schulz orthogonalization used in Muon. It operates on the Gram matrix, which reduces FLOPs and lets the implementation exploit symmetric GEMMs. The authors analyze the method's stability, propose Restarting, use Polar Express coefficients, and provide CuTeDSL kernels. They report substantial speedups in training-time benchmarks while preserving model quality. The post also provides open-source implementations and practical guidance on stability and deployment.
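To make the "operates on the Gram matrix" idea concrete, here is a minimal NumPy sketch of the underlying algebra. It is not the authors' implementation. It assumes the quintic update X <- aX + b(XX^T)X + c(XX^T)^2 X with the coefficients commonly quoted for Muon (3.4445, -4.7750, 2.0315) and an input with no more rows than columns. The function names `newton_schulz` and `gram_newton_schulz` are hypothetical. Restarting, Polar Express coefficients, and low-precision kernels are deliberately left out.

```python
import numpy as np

# Assumption: the quintic coefficients commonly quoted for Muon's Newton-Schulz step.
A_COEF, B_COEF, C_COEF = 3.4445, -4.7750, 2.0315


def newton_schulz(x, steps=5):
    """Baseline iteration: every step multiplies the full n x m matrix."""
    x = x / (np.linalg.norm(x) + 1e-7)  # Frobenius norm bounds the spectral norm by 1
    eye = np.eye(x.shape[0])
    for _ in range(steps):
        gram = x @ x.T  # n x n Gram matrix, recomputed from X every step
        x = (A_COEF * eye + B_COEF * gram + C_COEF * gram @ gram) @ x
    return x


def gram_newton_schulz(x, steps=5):
    """Same iteration in exact arithmetic, but the loop only touches n x n matrices.

    If Z = a I + b G + c G^2 with G = X X^T, the baseline step is X <- Z X,
    so the Gram matrix evolves as G <- Z G Z (Z is symmetric). Accumulating
    Q <- Z Q gives the final result as Q @ X0.
    """
    x = x / (np.linalg.norm(x) + 1e-7)
    n = x.shape[0]
    eye = np.eye(n)
    gram = x @ x.T  # first of only two products involving the rectangular matrix
    q = eye.copy()
    for k in range(steps):
        z = A_COEF * eye + B_COEF * gram + C_COEF * gram @ gram
        q = z @ q
        if k < steps - 1:  # the Gram matrix is not needed after the last step
            gram = z @ gram @ z
    return q @ x  # second and last rectangular product


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 32))
    diff = np.abs(newton_schulz(w) - gram_newton_schulz(w)).max()
    print("max abs difference:", diff)
```

In float64 the two functions should agree up to rounding error. The sketch says nothing about behavior in half precision, which is where the stability analysis and Restarting mentioned in the summary come in.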