Your Evals Will Break and You Won't See It Coming
Summary
The article argues that evaluating LLMs is the bottleneck for the next capability jump, highlighting how current benchmarks fail to predict qualitative shifts and proposing adaptive, self-evolving evals and the search for order parameters to anticipate regime changes.