Anatomy of a high-performance EP kernel
Summary
This is a deep technical explainer of high-performance expert-parallelism (EP) kernels for mixture-of-experts (MoE) models. It covers throughput-oriented dispatch and combine kernels, the runtime routing of tokens to experts across GPUs, and latency-oriented optimizations, using the DeepEP framework and its surrounding ecosystem as the reference point.
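To make the dispatch and combine terminology concrete, here is a minimal single-process sketch in plain Python. It is not DeepEP's API and does not touch a GPU: the "ranks" are dictionary keys, and the names `dispatch`, `run_experts`, `combine`, and `experts_per_rank`, along with the toy expert function, are all assumptions chosen for illustration. Dispatch groups each token copy by the rank that owns its selected expert, and combine sums the expert outputs back per token using the routing weights.

```python
from collections import defaultdict

def dispatch(tokens, topk_experts, experts_per_rank):
    """Group (token_id, expert_id, value) entries by the rank owning each expert."""
    per_rank = defaultdict(list)
    for tok_id, (value, experts) in enumerate(zip(tokens, topk_experts)):
        for e in experts:
            # Hypothetical layout: experts are assigned to ranks in contiguous blocks.
            per_rank[e // experts_per_rank].append((tok_id, e, value))
    return dict(per_rank)

def run_experts(per_rank):
    """Toy stand-in for the expert computation: expert e scales its input by (e + 1)."""
    outputs = []
    for rank in sorted(per_rank):
        for tok_id, e, value in per_rank[rank]:
            outputs.append((tok_id, e, value * (e + 1)))
    return outputs

def combine(expert_outputs, topk_experts, topk_weights, num_tokens):
    """Weighted sum of expert outputs, accumulated back at each token's position."""
    out = [0.0] * num_tokens
    for tok_id, e, y in expert_outputs:
        slot = topk_experts[tok_id].index(e)  # which top-k slot chose this expert
        out[tok_id] += topk_weights[tok_id][slot] * y
    return out

# Three scalar "tokens", four experts spread over two simulated ranks, top-2 routing.
tokens = [1.0, 2.0, 3.0]
topk_experts = [[0, 3], [1, 2], [2, 3]]
topk_weights = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]

send_buffers = dispatch(tokens, topk_experts, experts_per_rank=2)
result = combine(run_experts(send_buffers), topk_experts, topk_weights, len(tokens))
```

In a real EP kernel the per-rank groups would be exchanged between GPUs in an all-to-all pattern rather than kept in one process; the sketch only shows the bookkeeping that such kernels have to perform.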