First Proof
Summary
The arXiv paper First Proof introduces a benchmark set of ten research-level math questions to test current AI systems' ability to reason at advanced levels. It notes that answers will be encrypted for a short period to prevent leakage, highlighting challenges in evaluating AI reasoning and the design of evaluation datasets.