AI Datacenters Were Built for GPUs. What Happens When You Remove the GPUs?
Summary
The article argues that AI datacenters are shifting from treating the network as mere infrastructure to treating it as a critical factor in accelerator utilization, a shift driven by the elephant-flow east-west traffic that distributed training generates. It reviews the limitations of RoCEv2 with PFC and ECMP for large GPU workloads, the InfiniBand approach, and the work of the Ultra Ethernet Consortium, and it proposes GPU-free, single-tier, non-blocking mesh architectures as a potential future direction.
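To make the ECMP limitation concrete, here is a minimal Python sketch of static, hash-based path selection applied to a handful of large flows. It is an illustration under stated assumptions, not the article's method or any switch vendor's algorithm: the hash function (SHA-256 over the 5-tuple), the addresses, the source ports, and the helper names `ecmp_link`, `place_flows`, and `collision_free_probability` are all hypothetical. The only protocol fact used is that RoCEv2 traffic is carried over UDP destination port 4791.

```python
import hashlib
from collections import Counter


def ecmp_link(flow, n_links):
    """Pick an uplink by hashing the flow 5-tuple.

    The choice is static and load-unaware: the same flow always lands on
    the same link, no matter how busy that link already is.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()  # stand-in for a switch hash (assumption)
    return int.from_bytes(digest[:4], "big") % n_links


def place_flows(flows, n_links):
    """Count how many flows the hash assigns to each uplink."""
    return Counter(ecmp_link(flow, n_links) for flow in flows)


def collision_free_probability(n_flows, n_links):
    """Chance that an ideal uniform hash puts every flow on a distinct link."""
    if n_flows > n_links:
        return 0.0
    p = 1.0
    for i in range(n_flows):
        p *= (n_links - i) / n_links
    return p


# Eight hypothetical elephant flows sharing eight equal-cost uplinks.
# Fields: (src IP, dst IP, src port, dst port, protocol).
N_LINKS = 8
flows = [
    ("10.0.0.%d" % i, "10.0.1.%d" % i, 49152 + i, 4791, "UDP")
    for i in range(8)
]

load = place_flows(flows, N_LINKS)
idle_links = N_LINKS - len(load)

print("flows per link:", dict(sorted(load.items())))
print("idle links:", idle_links)
print("P(perfect spread):", collision_free_probability(len(flows), N_LINKS))
```

With many small flows, the law of large numbers evens out the per-link load. With a few elephant flows there is nothing to average over, so even an ideal uniform hash rarely spreads eight flows across eight links without a collision, and any collision leaves one link oversubscribed while another sits idle.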