Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU
Summary
The article introduces NTransformer, a high-efficiency LLM inference engine that runs Llama 3.1 70B on a single RTX 3090 using a three-tier caching scheme and an optional NVMe direct-I/O path that bypasses CPU bottlenecks. It covers the architecture, quantization formats, setup steps, and performance insights, including a 33x speedup over mmap and PCIe-bandwidth considerations. Useful for developers and IT teams exploring cost-effective AI inference on consumer hardware.
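The three-tier caching scheme can be pictured as VRAM backed by host RAM backed by NVMe. The sketch below is a minimal, hypothetical illustration of that idea in Python; the class name, slot limits, and promote-on-hit / demote-on-eviction policy are assumptions for illustration, not NTransformer's actual implementation.

```python
# Hypothetical three-tier weight cache: VRAM -> host RAM -> NVMe.
# Names and eviction policy are illustrative assumptions, not the
# engine's real code. "Blocks" stand in for quantized weight tensors.
from collections import OrderedDict

class TieredWeightCache:
    def __init__(self, vram_slots, ram_slots):
        self.vram = OrderedDict()   # tier 1: fastest, smallest (GPU memory)
        self.ram = OrderedDict()    # tier 2: host-RAM staging area
        self.vram_slots = vram_slots
        self.ram_slots = ram_slots

    def _evict(self, tier, limit, lower):
        # Demote least-recently-used blocks to the next (slower) tier,
        # or drop them entirely when there is no lower tier.
        while len(tier) > limit:
            key, block = tier.popitem(last=False)
            if lower is not None:
                lower[key] = block

    def get(self, key, load_from_nvme):
        if key in self.vram:                  # tier-1 hit: already on GPU
            self.vram.move_to_end(key)
            return self.vram[key]
        if key in self.ram:                   # tier-2 hit: promote to VRAM
            block = self.ram.pop(key)
        else:                                 # miss: read block from NVMe
            block = load_from_nvme(key)
        self.vram[key] = block
        self._evict(self.vram, self.vram_slots, self.ram)
        self._evict(self.ram, self.ram_slots, None)
        return block
```

With this policy, repeated accesses to hot layers stay in VRAM, recently evicted layers are re-promoted from RAM without touching the disk, and only cold blocks pay the NVMe read cost.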