Why We Are Excited About Confessions
Summary
OpenAI researchers discuss confessions as a safety mechanism for encouraging honesty in language model outputs, arguing that a separate confession output, rewarded only for honesty, can be harder to reward-hack than the main task reward. The piece surveys theoretical benefits, experimental results, and comparisons to chain-of-thought monitoring, highlighting potential gains in monitorability and scalability, along with current limitations and future directions.
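To make the core idea concrete, here is a minimal sketch, not OpenAI's implementation: the names, dataclass, and scoring fields below are hypothetical, and the only point illustrated is that the confession reward is computed purely from honesty, independently of how well the main answer scores.

```python
# Hypothetical illustration: keep the confession reward separate from,
# and unaffected by, the main task reward.
from dataclasses import dataclass


@dataclass
class Rollout:
    answer: str            # main task output
    confession: str        # model's self-report (shortcuts taken, uncertainty, rule-breaking)
    task_score: float      # graded quality of the answer (potentially gameable)
    honesty_score: float   # graded accuracy of the confession against available evidence


def rewards(rollout: Rollout) -> tuple[float, float]:
    """Return (task_reward, confession_reward) as independent signals.

    The confession reward depends only on whether the confession is honest,
    never on whether the answer looks good, so admitting a shortcut is not
    penalized and concealing one is not rewarded.
    """
    task_reward = rollout.task_score
    confession_reward = rollout.honesty_score
    return task_reward, confession_reward


# Example: a rollout that gamed the task but confessed honestly still gets
# full confession reward, which is what keeps the honesty channel harder to hack.
r = Rollout(answer="...", confession="I skipped the unit tests.",
            task_score=0.9, honesty_score=1.0)
print(rewards(r))  # (0.9, 1.0)
```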