CVE-Bench: testing LLM agents on real-world vulnerability patches
Summary
CVE-Bench benchmarks LLMs on real-world CVE patches using advisory, diagnose, and locate prompts. The results show that no model fixes vulnerabilities reliably. They expose failure modes such as wrong-search drift and budget exhaustion, and they show significant cost differences between models. The piece closes with practical takeaways for security practitioners and AI researchers.