CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

May 22, 2026 at 04:54

Quality: 8/10 Relevance: 9/10

Summary

CODA introduces a GPU kernel abstraction that expresses the non-attention computations of a transformer as GEMM-plus-epilogue programs. It reduces data movement by running epilogue operations while a GEMM tile is still on-chip. The approach keeps the GEMM mainloop fixed and provides a small set of composable epilogue primitives for scaling, reductions, and accumulations, with the aim of preserving GEMM performance while covering most non-attention work in both the forward and backward passes. Early results show CODA kernels performing well across representative transformer workloads, suggesting a practical path to more efficient training.
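To make the GEMM-plus-epilogue idea concrete, here is a minimal CPU-side sketch in plain Python. It is not CODA's API: the names gemm_with_epilogue and make_scale_and_rowsumsq, the tile sizes, and the choice of epilogue (a scale followed by a per-row sum-of-squares accumulation, the kind of reduction a normalization layer might need) are all assumptions made for illustration. A local list stands in for an on-chip tile.

```python
# Illustrative sketch only: every name here is hypothetical, not CODA's API.

def gemm_with_epilogue(a, b, tile_m, tile_n, epilogue):
    """Tiled C = A @ B. `epilogue(tile, row0, col0)` runs on each output
    tile before that tile is written back to the full result."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for row0 in range(0, m, tile_m):
        for col0 in range(0, n, tile_n):
            rows = range(row0, min(row0 + tile_m, m))
            cols = range(col0, min(col0 + tile_n, n))
            # Fixed mainloop: accumulate one output tile in a local buffer
            # (a stand-in for registers or shared memory on a GPU).
            tile = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in cols]
                    for i in rows]
            # Epilogue: transform the tile while it is still local.
            tile = epilogue(tile, row0, col0)
            # Single write-back of the finished tile.
            for ti, i in enumerate(rows):
                for tj, j in enumerate(cols):
                    c[i][j] = tile[ti][tj]
    return c


def make_scale_and_rowsumsq(scale, row_sumsq):
    """Build an epilogue that scales each element and accumulates the
    per-row sum of squares of the scaled values into `row_sumsq`."""
    def epilogue(tile, row0, col0):
        out = []
        for ti, row in enumerate(tile):
            scaled = [scale * v for v in row]
            row_sumsq[row0 + ti] += sum(v * v for v in scaled)
            out.append(scaled)
        return out
    return epilogue


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
row_sumsq = [0.0] * len(a)
c = gemm_with_epilogue(a, b, 2, 2, make_scale_and_rowsumsq(0.5, row_sumsq))
```

In this sketch the scaled output and the row statistics are both produced in the same pass over each tile, so no second pass over the full output matrix is needed. On a GPU, that avoided pass is the data movement the summary refers to.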

Machine Learning AI Research
