Training an LLM in Swift, Part 1: Taking matrix multiplication from Gflop/s to Tflop/s
Summary
A detailed performance-optimization study of training an LLM in Swift on Apple Silicon. It compares C, Swift, and Metal implementations across several optimization techniques (MutableSpan, relaxed floating-point math, and AMX/GPU tiling), demonstrating substantial speedups from a basic Swift baseline to a tiled Metal kernel, with final results approaching hardware-accelerated CPU/GPU performance. It closes with a plan for future library-based approaches.