Cutting inference cold starts by 40x with LP, FUSE, C/R, and CUDA-checkpoint
Summary
The blog post describes how Modal runs AI inference workloads on serverless GPUs and outlines four optimizations for cutting cold-start latency: keeping a buffer of GPUs available, lazily loading container images through a content-addressed cache, snapshotting CPU memory via checkpoint/restore (the post mentions CRIU and gVisor), and snapshotting GPU memory via CUDA checkpointing. It presents benchmarks showing 40x faster cold starts and discusses practical considerations and limitations in multi-tenant clouds. It positions serverless GPUs as a way to run scalable, cost-efficient inference across diverse workloads.
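To make the second optimization more concrete, here is a minimal sketch of lazy loading through a content-addressed cache. It is an illustration under stated assumptions, not Modal's implementation: the class `LazyContentCache`, the `index` manifest, and the `fetch` callback are hypothetical names, and an in-memory dictionary stands in for both the remote blob store and the local cache. The idea it shows is that files are keyed by a hash of their contents, fetched only when first read, and shared across paths (or images) that contain identical bytes.

```python
import hashlib
from typing import Callable, Dict


def digest(data: bytes) -> str:
    """Content address: the SHA-256 hex digest of the bytes."""
    return hashlib.sha256(data).hexdigest()


class LazyContentCache:
    """Hypothetical sketch: resolve paths to digests, fetch bytes on first read."""

    def __init__(self, index: Dict[str, str], fetch: Callable[[str], bytes]):
        self.index = index                  # path -> digest (assumed image manifest)
        self.fetch = fetch                  # digest -> bytes (assumed remote store)
        self.blobs: Dict[str, bytes] = {}   # local cache keyed by digest
        self.fetch_count = 0                # number of remote fetches performed

    def read(self, path: str) -> bytes:
        key = self.index[path]
        if key not in self.blobs:
            # Cache miss: pull the blob now, not at container start.
            data = self.fetch(key)
            if digest(data) != key:
                raise ValueError(f"corrupt blob for {path}")
            self.blobs[key] = data
            self.fetch_count += 1
        return self.blobs[key]


def demo() -> int:
    """Read the same content through two paths and return the fetch count."""
    remote = {digest(b"weights"): b"weights"}
    index = {
        "/a/model.bin": digest(b"weights"),
        "/b/model.bin": digest(b"weights"),
    }
    cache = LazyContentCache(index, remote.__getitem__)
    cache.read("/a/model.bin")
    cache.read("/b/model.bin")  # same content, different path: served locally
    return cache.fetch_count
```

Because the cache key is derived from the bytes rather than the path, the second read in `demo` does not trigger another fetch. A real system would back this with a filesystem layer and persistent storage rather than a dictionary.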