Even (very) noisy LLM evaluators are useful for improving AI agents
Summary
The article argues that although LLM evaluators can be noisy and weak at judging individual outputs, their scores can still rank AI agents reliably once averaged over many samples. It distinguishes output-level correlation (evaluator score vs. true quality for a single output) from agent-level correlation (mean evaluator score vs. mean true quality per agent), provides a theoretical framing, and presents benchmark results across tasks to show that noisy evaluators improve offline agent selection.
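The gap between output-level and agent-level correlation can be illustrated with a small simulation. This is only a sketch under assumptions that are not from the article: each hypothetical agent has a fixed true success rate, each output is a binary success, and the evaluator returns the true outcome plus independent Gaussian noise. The function names and parameter values (`simulate`, `noise_sd`, and so on) are made up for illustration.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_agents=20, n_samples=2000, noise_sd=2.0, seed=0):
    """Return (output_level_corr, agent_level_corr) for a toy noisy evaluator.

    Assumptions (illustrative, not from the article):
      - each agent has a true success rate drawn from Uniform(0.3, 0.7)
      - each output is 1.0 (success) or 0.0 (failure)
      - evaluator score = true outcome + Gaussian noise with sd `noise_sd`
    """
    rng = random.Random(seed)
    all_true, all_eval = [], []      # one entry per output
    agent_true, agent_eval = [], []  # one entry per agent (means)

    for _ in range(n_agents):
        quality = rng.uniform(0.3, 0.7)
        true_scores = [1.0 if rng.random() < quality else 0.0
                       for _ in range(n_samples)]
        eval_scores = [t + rng.gauss(0.0, noise_sd) for t in true_scores]

        all_true.extend(true_scores)
        all_eval.extend(eval_scores)
        agent_true.append(mean(true_scores))
        agent_eval.append(mean(eval_scores))

    return pearson(all_true, all_eval), pearson(agent_true, agent_eval)


if __name__ == "__main__":
    output_corr, agent_corr = simulate()
    print(f"output-level correlation: {output_corr:.2f}")
    print(f"agent-level correlation:  {agent_corr:.2f}")
```

In this toy setup the per-output noise largely cancels when scores are averaged per agent, so the agent-level correlation should come out much higher than the output-level one. Real evaluators may also have systematic biases that do not average out, which this sketch does not model.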