ProgramBench: Can Language Models Rebuild Programs From Scratch?

May 7, 2026 at 03:46

Quality: 8/10 Relevance: 9/10

Summary

ProgramBench introduces a benchmark for software engineering agents that must build a full codebase from a program and its documentation. End-to-end testing via fuzzing shows that current LMs struggle to complete the tasks: the best models succeed on only a small fraction of them and tend to prefer monolithic single-file implementations, which highlights the remaining challenges for AI-assisted software development.

AI Research LLM & Prompting
