Keeping 20k GPUs healthy
Summary
Modal describes how it keeps a large GPU fleet (~20k GPUs) reliable: testing instance types across anonymized cloud providers, managing machine images, running boot checks, running passive and active health checks, and backing all of this with observability and support processes. The piece highlights differences between clouds, benchmarking, and planned enhancements such as network-focused health checks.