ProgramBench: Can Language Models Rebuild Programs From Scratch?
Summary
ProgramBench is a benchmark for software engineering agents that must build a full codebase from scratch, given only a program and its documentation. End-to-end testing via fuzzing shows that current LMs struggle to complete these tasks: even the best models succeed on only a small fraction of them, and they tend to prefer monolithic, single-file implementations. These results highlight the challenges that remain for AI-assisted software development.
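To make the idea of end-to-end testing via fuzzing concrete, here is a minimal sketch of a differential check: random inputs are fed to a reference program and to a rebuilt candidate, and a case counts as passed only when the exit code and standard output match. This is an illustration under stated assumptions, not ProgramBench's actual harness. The helper names (`run`, `random_input`, `pass_rate`), the stdin-based input format, and the scoring rule are all hypothetical.

```python
import random
import subprocess


def run(cmd, stdin_text, timeout=5.0):
    """Run a command and return its observable behavior: (exit code, stdout)."""
    try:
        proc = subprocess.run(
            cmd,
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Treat a hang as a distinct observable outcome.
        return ("timeout", "")
    return (proc.returncode, proc.stdout)


def random_input(rng, max_lines=5):
    """Hypothetical input generator: a few random integers, one per line."""
    count = rng.randint(1, max_lines)
    lines = [str(rng.randint(-1000, 1000)) for _ in range(count)]
    return "\n".join(lines) + "\n"


def pass_rate(reference_cmd, candidate_cmd, n_cases=20, seed=0):
    """Fraction of fuzzed inputs on which the candidate matches the reference."""
    rng = random.Random(seed)  # fixed seed so a run is reproducible
    passed = 0
    for _ in range(n_cases):
        stdin_text = random_input(rng)
        if run(reference_cmd, stdin_text) == run(candidate_cmd, stdin_text):
            passed += 1
    return passed / n_cases
```

As a usage example, `pass_rate(["./reference"], ["./rebuilt"])` (both paths hypothetical) would return a value between 0 and 1. Comparing only externally observable behavior means the check is indifferent to how the candidate is organized internally, so a monolithic single-file implementation and a modular one are scored the same way.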