CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
Summary
CODA introduces a GPU kernel abstraction that expresses the non-attention computations of a Transformer as GEMM-plus-epilogue programs. It reduces data movement by applying epilogue operations while a GEMM tile is still on-chip. The approach keeps the GEMM mainloop fixed and provides a small set of composable epilogue primitives for scaling, reductions, and accumulations, aiming to preserve GEMM performance while covering most non-attention work in the forward and backward passes. Early results show CODA kernels achieving high performance across representative Transformer workloads, suggesting a practical path to more efficient Transformer training.
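To make the GEMM-plus-epilogue idea concrete, here is a minimal CPU sketch in plain Python. It is not CODA's API or implementation: the names `gemm_with_epilogue` and `scale_and_row_sumsq`, the tile sizes, and the choice of epilogue (a per-row scale followed by a row-wise sum of squares, the kind of reduction a normalization layer might need) are all assumptions made for illustration. The point is only that the mainloop stays the same while the epilogue touches each tile before it is written out, so no second pass over the output is needed.

```python
def gemm_with_epilogue(A, B, epilogue, tile_m=2, tile_n=2):
    """Tiled C = A @ B; `epilogue` is applied to each tile before it is stored.

    Hypothetical helper for illustration only, not part of CODA.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile_m):
        for j0 in range(0, N, tile_n):
            rows = range(i0, min(i0 + tile_m, M))
            cols = range(j0, min(j0 + tile_n, N))
            # Fixed mainloop: accumulate one tile (it would live on-chip in a real kernel).
            acc = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in cols]
                   for i in rows]
            # Epilogue: runs on the tile while it is still "on-chip".
            out = epilogue(acc, rows, cols)
            # Single write of the finished tile to the output matrix.
            for ti, i in enumerate(rows):
                for tj, j in enumerate(cols):
                    C[i][j] = out[ti][tj]
    return C


def scale_and_row_sumsq(row_scale, row_sumsq):
    """Build an epilogue that scales each row, then accumulates per-row sums of squares.

    Assumed example of composing a scaling step with a reduction/accumulation step.
    """
    def epilogue(acc, rows, cols):
        # Scaling primitive: multiply every element by its row's scale factor.
        out = [[row_scale[i] * v for v in tile_row]
               for i, tile_row in zip(rows, acc)]
        # Reduction + accumulation: add this tile's contribution to each row total.
        for i, tile_row in zip(rows, out):
            row_sumsq[i] += sum(v * v for v in tile_row)
        return out
    return epilogue


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
row_scale = [1.0, 0.5, 2.0]
row_sumsq = [0.0, 0.0, 0.0]

C = gemm_with_epilogue(A, B, scale_and_row_sumsq(row_scale, row_sumsq))
print(C)          # scaled product, one row per row of A
print(row_sumsq)  # per-row sum of squares, accumulated across column tiles
```

In this sketch the row-wise sums are accumulated across column tiles, which is why the epilogue adds into `row_sumsq` rather than overwriting it. A real GPU kernel would need an atomic or a staged reduction for that step, which the sketch does not model.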