DigiNews

Tech Watch by Johan Denoyer


I restarted a 10-year-old Xeon 174 times to delete twelve flags and gain four tokens a second

Quality: 8/10 Relevance: 9/10

Summary

An in-depth ablation study of Gemma 4 inference on a CPU server, covering 174 runs that each change one flag at a time to measure that flag's effect on tokens per second. The author identifies the key performance drivers (flash attention, thread count, run-time repack, and the drafter for code workloads) along with caveats such as memory bottlenecks and deadlocks, and offers practical guidance for CPU-based LLM optimization.
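
To make the one-flag-at-a-time method concrete, here is a minimal Python sketch of such an ablation loop. It is not the author's harness. The flag names (flag_a, flag_b, flag_c) and the fake_measure function are hypothetical stand-ins. In a real setup, measure would restart the inference server with the given configuration and return the observed tokens per second.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, object]


def one_at_a_time_ablation(
    baseline: Config,
    candidates: Config,
    measure: Callable[[Config], float],
    repeats: int = 3,
) -> Tuple[float, List[Tuple[str, float]]]:
    """Change one flag at a time against a fixed baseline.

    Returns the baseline throughput and a list of (flag, delta) pairs,
    where delta is the change in mean tokens per second, largest gain first.
    """

    def mean_tps(config: Config) -> float:
        # Average several runs to dampen run-to-run noise.
        return sum(measure(config) for _ in range(repeats)) / repeats

    base_tps = mean_tps(baseline)
    results: List[Tuple[str, float]] = []
    for flag, value in candidates.items():
        trial = dict(baseline)   # copy so the baseline stays untouched
        trial[flag] = value      # flip exactly one flag
        results.append((flag, mean_tps(trial) - base_tps))
    results.sort(key=lambda item: item[1], reverse=True)
    return base_tps, results


def fake_measure(config: Config) -> float:
    """Hypothetical stand-in for a real benchmark run (synthetic numbers only)."""
    tps = 10.0
    if config.get("flag_a"):
        tps += 2.0   # pretend this flag helps
    if config.get("flag_b"):
        tps -= 1.0   # pretend this flag hurts
    return tps       # flag_c has no effect in this toy model


if __name__ == "__main__":
    baseline = {"flag_a": False, "flag_b": False, "flag_c": False}
    candidates = {"flag_a": True, "flag_b": True, "flag_c": True}
    base, deltas = one_at_a_time_ablation(baseline, candidates, fake_measure)
    print(f"baseline: {base:.1f} tok/s")
    for flag, delta in deltas:
        print(f"{flag}: {delta:+.1f} tok/s")
```

Changing a single flag per run keeps each measured difference attributable to that flag, at the cost of missing interactions between flags. That cost is one reason such a study can need many restarts.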

🚀 Service built by Johan Denoyer