I restarted a 10 year old Xeon 174 times to delete twelve flags and gain four tokens a second
Summary
An in-depth ablation study of Gemma 4 inference on a CPU server, examining 174 runs with one-flag-at-a-time changes to measure each flag's impact on tokens per second. The author identifies the key performance drivers—flash attention, thread count, run-time repack, and the drafter for code workloads—along with caveats like memory bottlenecks and deadlocks, offering practical guidance for CPU-based LLM optimization.