Book: The Emerging Science of Machine Learning Benchmarks
Summary
The article provides a concise overview of Moritz Hardt's book on the emerging science of machine learning benchmarks, highlighting both the progress benchmarks have spurred (e.g., ImageNet, language model benchmarks) and the critiques they attract (overfitting, bias, and ethical concerns). It discusses foundational concepts such as the holdout method, adaptivity, and statistical pitfalls, and explains how modern benchmarks in the LLM era raise new challenges such as data leakage, performativity, and multi-task evaluation. The author advocates a more solid scientific grounding for benchmarking to guide the future design of benchmarks and the interpretation of model performance.