PA Bench: Evaluating Web Agents on Real World Personal Assistant Workflows
Summary
PA Bench is a benchmark that evaluates whether frontier computer-use agents can complete realistic, long-horizon workflows spanning multiple web apps (email and calendar). The article details the simulation-based evaluation setup, task generation, and verifiers, and presents results comparing Claude Opus 4.6, Gemini 3 Pro/Flash, and OpenAI Computer Use, along with error analyses and plans for future work. Together, the results and error analyses indicate how current models handle multi-application coordination and where they need to improve before personal-assistant automation is reliable in real-world use.