MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU
Summary
MegaTrain proposes a memory-centric approach to training LLMs with more than 100B parameters on a single GPU: parameters and optimizer state live in host (CPU) memory, and the GPU serves as a transient compute engine. It introduces a pipelined, double-buffered execution engine and stateless layer templates to minimize the state that persists on the device and to mitigate the CPU-GPU bandwidth bottleneck. The reported results include training models of up to 120B parameters on an H200 with 1.5 TB of host memory, outperforming DeepSpeed ZeRO-3 at certain scales, and training 7B models with very large token contexts on a GH200.
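
To make the double-buffering idea concrete, here is a minimal, CPU-only sketch in Python. It is not MegaTrain's implementation: the names `copy_to_slot`, `apply_layer`, and `streamed_forward` are hypothetical, a background thread stands in for an asynchronous host-to-device copy, and a toy affine function stands in for a stateless layer template (one piece of layer code reused with whichever weights are currently resident). The point is only the schedule: while layer i computes from one slot, layer i+1 is copied into the other.

```python
from concurrent.futures import ThreadPoolExecutor


def copy_to_slot(slots, slot, host_layers, i):
    # Stand-in for a host-to-device copy of layer i's weights (assumption:
    # in a real system this would be an asynchronous transfer).
    slots[slot] = tuple(host_layers[i])


def apply_layer(weights, x):
    # Stand-in for a stateless layer template: the code holds no weights of
    # its own and works on whatever weights it is handed.
    scale, shift = weights
    return scale * x + shift


def streamed_forward(host_layers, x):
    slots = [None, None]  # only two layers' weights are resident at a time
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(copy_to_slot, slots, 0, host_layers, 0)
        for i in range(len(host_layers)):
            pending.result()  # layer i is now resident in slots[i % 2]
            if i + 1 < len(host_layers):
                # Overlap: copy layer i+1 into the other slot, whose previous
                # occupant (layer i-1) has already finished computing.
                pending = copier.submit(
                    copy_to_slot, slots, (i + 1) % 2, host_layers, i + 1
                )
            x = apply_layer(slots[i % 2], x)
    return x


# Toy "model": three layers, each a (scale, shift) pair kept in host memory.
host_layers = [(2.0, 1.0), (0.5, -1.0), (3.0, 0.0)]
result = streamed_forward(host_layers, 4.0)
print(result)  # 10.5
```

In this sketch the number of resident weight sets stays at two regardless of model depth, which is the property that lets device memory stay small while the full model sits in host memory.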