SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration
Summary
SWE-CI is a repository-level benchmark for evaluating how well AI agents maintain codebases through a continuous integration loop. It shifts the focus of evaluation from static functional correctness to long-term maintainability across real-world evolution histories, and offers insight into how well code quality is sustained over dozens of iterative rounds.
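To make the contrast between a one-shot correctness check and a continuous integration loop concrete, here is a minimal sketch in Python. It is not SWE-CI's actual harness or metric: the names `Round`, `run_ci_loop`, `has`, `careful_agent`, and `careless_agent` are hypothetical, the codebase is modeled as a plain dictionary, and the score (the fraction of all checks seen so far that still pass) is an assumption chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration only (not taken from SWE-CI):
# a codebase maps file names to contents, a check is a predicate over a
# codebase, and an agent maps (codebase, requirement) to a new codebase.
Codebase = Dict[str, str]
Check = Callable[[Codebase], bool]
Agent = Callable[[Codebase, str], Codebase]


@dataclass
class Round:
    requirement: str      # change requested in this iteration
    checks: List[Check]   # checks introduced in this iteration


def run_ci_loop(agent: Agent, codebase: Codebase, rounds: List[Round]) -> List[float]:
    """Return, for each round, the fraction of all checks seen so far that pass."""
    seen: List[Check] = []
    history: List[float] = []
    for rnd in rounds:
        # The agent edits the codebase it produced in the previous round.
        codebase = agent(codebase, rnd.requirement)
        # Earlier checks stay active, so regressions remain visible.
        seen.extend(rnd.checks)
        passed = sum(1 for check in seen if check(codebase))
        history.append(passed / len(seen) if seen else 1.0)
    return history


def has(token: str) -> Check:
    """Toy check: the 'features' file must contain the given token."""
    return lambda cb: token in cb.get("features", "")


def careful_agent(cb: Codebase, req: str) -> Codebase:
    # Adds the new feature while keeping the existing ones.
    return {"features": cb.get("features", "") + req}


def careless_agent(cb: Codebase, req: str) -> Codebase:
    # Satisfies the newest requirement but discards earlier work.
    return {"features": req}


rounds = [Round(name, [has(name)]) for name in ("a", "b", "c")]
careful_scores = run_ci_loop(careful_agent, {"features": ""}, rounds)
careless_scores = run_ci_loop(careless_agent, {"features": ""}, rounds)
```

In this toy setup both agents pass the newest check in every round, so a static, single-round evaluation cannot tell them apart. Only the accumulated checks separate the two: the careless agent's score falls as earlier features regress.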