Claude Code Opus 4.5 Performance Tracker: Daily Degradation Monitoring for SWE Benchmarks
Summary
Margin Lab's Claude Code Opus 4.5 Performance Tracker runs daily benchmarks to detect statistically significant degradations in Claude Code's performance on software engineering (SWE) tasks. It treats each task outcome as a Bernoulli (pass/fail) trial and uses 95% confidence intervals to judge significance. Benchmarks run directly in the Claude Code CLI against the latest release and Opus 4.5. The tracker reports a baseline pass rate alongside daily, weekly, and monthly pass rates, and offers email alerts when a degradation is detected. The methodology emphasizes independence, transparency, and real-world applicability for teams monitoring model stability.
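To make the statistical idea concrete, here is a minimal Python sketch of how a Bernoulli pass rate and a 95% confidence interval could be used to flag a degradation. It is not the tracker's actual code. The summary does not say which interval or decision rule the tracker uses, so the Wald (normal-approximation) interval, the rule "flag when the baseline lies above the interval's upper bound", the function names `pass_rate_ci` and `is_degraded`, and all numbers are assumptions chosen for illustration.

```python
import math

# Two-sided 95% quantile of the standard normal distribution.
Z_95 = 1.96


def pass_rate_ci(passes, trials, z=Z_95):
    """Return (rate, lower, upper) for a Bernoulli pass rate.

    Uses the Wald (normal-approximation) interval. This choice is an
    assumption; the source does not state which interval is used.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= passes <= trials:
        raise ValueError("passes must be between 0 and trials")
    rate = passes / trials
    std_err = math.sqrt(rate * (1 - rate) / trials)
    lower = max(0.0, rate - z * std_err)
    upper = min(1.0, rate + z * std_err)
    return rate, lower, upper


def is_degraded(baseline_rate, passes, trials, z=Z_95):
    """Hypothetical rule: flag a degradation when the baseline pass rate
    lies above the upper bound of the observed interval."""
    _, _, upper = pass_rate_ci(passes, trials, z)
    return upper < baseline_rate


if __name__ == "__main__":
    # Illustrative numbers only, not results from the tracker.
    baseline = 0.58
    for passes, trials in [(50, 100), (40, 100)]:
        rate, lower, upper = pass_rate_ci(passes, trials)
        flag = is_degraded(baseline, passes, trials)
        print(f"{passes}/{trials}: rate={rate:.3f} "
              f"CI=[{lower:.3f}, {upper:.3f}] degraded={flag}")
```

With these made-up inputs, 50 passes out of 100 gives an interval that still contains the 0.58 baseline, so nothing is flagged, while 40 out of 100 gives an interval entirely below it. The Wald interval is the simplest option but behaves poorly for small samples or rates near 0 or 1; a Wilson or exact binomial interval would be a reasonable substitute under the same decision rule.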