I restarted a 10 year old Xeon 174 times to delete twelve flags and gain four tokens a second

June 19, 2026 at 15:19

Quality: 8/10 Relevance: 9/10

Summary

An in-depth ablation study of Gemma 4 inference on a CPU server, examining 174 runs with one-flag-at-a-time changes to measure each flag's impact on tokens per second. The author identifies the key performance drivers—flash attention, thread count, run-time repack, and the drafter for code workloads—along with caveats like memory bottlenecks and deadlocks, offering practical guidance for CPU-based LLM optimization.

LLM & Prompting AI Tools Open Source

Read Original Article