DigiNews

Tech Watch by Johan Denoyer

Cutting inference cold starts by 40x with LP, FUSE, C/R, and CUDA-checkpoint

Quality: 8/10 Relevance: 9/10

Summary

The blog post details Modal’s approach to running AI inference workloads on serverless GPUs. It outlines four optimization pillars: GPU buffers; lazy container image loading via a content-addressed cache; CPU memory snapshotting (CRIU) together with GPU memory snapshotting; and gVisor with CUDA checkpointing. It presents benchmarks showing 40x faster cold starts and discusses practical considerations and limitations in multi-tenant clouds. The post positions serverless GPUs as a way to run scalable, cost-efficient inference for diverse workloads.