Benchmarks in Leipzig
Summary
The arXiv paper "Benchmarks in Leipzig" reports a dataset of 100 math questions and a multi-stage evaluation of 5 state-of-the-art LLMs, tracking progress in machine reasoning. Across the evaluation stages, the number of unsolved items falls from 41 to 2, a drop that highlights improvements in prompting and benchmark design.