ThunderKittens 2.0: Even Faster Kernels for Your GPUs
Summary
ThunderKittens 2.0 is a new release of the CUDA-embedded DSL, bringing new features and a major refactor focused on memory efficiency and kernel performance. The post covers memory consistency, tensor-core pipelining, PTX behavior, occupancy, and benchmarking best practices, and it shares practical lessons along with a path to state-of-the-art kernels in fewer lines of code.