A 10-year-old Xeon is all you need
Summary
This post documents running a 26B-parameter Mixture-of-Experts (MoE) LLM on a 2016 Xeon with DDR3 RAM and no GPU, focusing on memory-bandwidth bottlenecks and CPU-based optimizations. It walks through the hardware constraints, the set of ik_llama.cpp flags, and the concepts of speculative decoding, MoE routing, and memory management needed to reach usable performance on aging hardware. The piece emphasizes open-weight models and deploying AI locally without black-box tooling.
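To make the memory-bandwidth point concrete, here is a rough back-of-the-envelope sketch of why an MoE model can stay usable on slow RAM. It assumes decoding is purely bandwidth-bound, so each generated token requires reading every active weight once. The bandwidth figure, the 4B active-parameter count, and the 4-bit quantization are illustrative assumptions, not measurements or specifications from this post.

```python
def decode_tokens_per_second(bandwidth_gb_s, active_params_billion, bytes_per_param):
    """Rough upper bound on decode speed for a bandwidth-bound model.

    Assumes every active weight is read from RAM exactly once per token
    and ignores compute time, caches, and KV-cache traffic.
    """
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token


# All three values below are assumed for illustration only.
ASSUMED_BANDWIDTH_GB_S = 40.0    # hypothetical sustained DDR3 bandwidth
ASSUMED_BYTES_PER_PARAM = 0.5    # roughly 4-bit quantized weights
ASSUMED_ACTIVE_PARAMS_B = 4.0    # hypothetical experts active per token

# A dense 26B model would read all 26B weights for each token.
dense = decode_tokens_per_second(ASSUMED_BANDWIDTH_GB_S, 26.0, ASSUMED_BYTES_PER_PARAM)

# An MoE model only reads the experts the router selects.
moe = decode_tokens_per_second(
    ASSUMED_BANDWIDTH_GB_S, ASSUMED_ACTIVE_PARAMS_B, ASSUMED_BYTES_PER_PARAM
)

print(f"dense bound: {dense:.1f} tok/s")
print(f"MoE bound:   {moe:.1f} tok/s")
```

Real throughput will land below these bounds, but the ratio shows why the number of active parameters per token matters more than the total parameter count when RAM bandwidth is the limit.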