Batmobile: 10-20x Faster CUDA Kernels for Equivariant Graph Neural Networks
Summary
The article traces the slowness of equivariant GNNs to their spherical harmonics and tensor product computations. It then presents Batmobile's hand-tuned CUDA kernels, which speed up these operations by using compile-time constants, register-based intermediates, and fused operations. Benchmarks on an RTX 3090 show speedups of up to ~20x for forward passes and training.
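To make the three techniques named above more concrete, here is a minimal CPU-side sketch in C++. It is not Batmobile's code and not CUDA. The names `ShDim` and `fused_sh_dot`, the restriction to l <= 1, and the unnormalized harmonic convention (Y_0 = 1, Y_1 = (x, y, z) on the unit sphere) are all assumptions made for illustration. The sketch fixes the maximum degree as a template parameter so array sizes and loop bounds are known at compile time, keeps intermediates in a small fixed-size local array instead of a heap buffer, and fuses normalization, harmonic evaluation, and a weighted contraction into one function.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical illustration only: names and conventions are assumptions,
// not taken from Batmobile.

// Number of spherical-harmonic components for degrees 0..LMax,
// computed at compile time.
template <int LMax>
struct ShDim {
    static constexpr std::size_t value = (LMax + 1) * (LMax + 1);
};

// Fused operation: normalize the edge vector r, evaluate the harmonics for
// l <= 1 (unnormalized convention: 1, x, y, z), and contract them with the
// weights w, all in a single pass with no heap-allocated intermediates.
template <int LMax>
double fused_sh_dot(const std::array<double, 3>& r,
                    const std::array<double, ShDim<LMax>::value>& w) {
    static_assert(LMax == 1, "this sketch only covers l <= 1");

    const double inv_norm =
        1.0 / std::sqrt(r[0] * r[0] + r[1] * r[1] + r[2] * r[2]);

    // Fixed-size local array: small enough for a compiler to keep in
    // registers, loosely analogous to register-resident GPU intermediates.
    const std::array<double, 4> y = {1.0,
                                     r[0] * inv_norm,
                                     r[1] * inv_norm,
                                     r[2] * inv_norm};

    // Trip count is a compile-time constant, so the loop can be unrolled.
    double acc = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        acc += w[i] * y[i];
    }
    return acc;
}
```

As a usage example, calling `fused_sh_dot<1>` with `r = {0, 0, 2}` and `w = {1, 2, 3, 4}` normalizes `r` to `(0, 0, 1)` and returns `1*1 + 4*1 = 5`. An unfused version would write the normalized vector and the harmonics to separate buffers between steps; avoiding that kind of intermediate memory traffic is the general motivation for fusion.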