Provably Unmasking Malicious Behavior Through Execution Traces
Summary
The paper proposes the Cross-Trace Verification Protocol (CTVP), an AI control framework that verifies untrusted code-generating models by comparing execution traces across semantically equivalent program transformations. It introduces the Adversarial Robustness Quotient (ARQ) to quantify the cost of verification and provides information-theoretic bounds suggesting fundamental limits on how much an adversary can improve against the protocol. The work argues for a scalable, theoretically grounded approach to controlling code generation in AI systems.
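The core idea can be illustrated with a minimal sketch (not the paper's implementation; all names here are hypothetical): run two semantically equivalent variants of untrusted code on shared inputs, record their observable execution traces, and flag any divergence, which equivalence should rule out unless one variant hides extra behavior.

```python
import random

def untrusted_sort(xs):
    # Variant A of the untrusted code.
    return sorted(xs)

def untrusted_sort_transformed(xs):
    # Variant B: a semantically equivalent transformation (insertion sort).
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

def trace(fn, inputs):
    # Record the observable output for each input -- a simple stand-in
    # for a full execution trace.
    return [tuple(fn(list(x))) for x in inputs]

def verify_cross_trace(fn_a, fn_b, inputs):
    # Return the inputs on which the two traces diverge.
    return [x for x, a, b in
            zip(inputs, trace(fn_a, inputs), trace(fn_b, inputs))
            if a != b]

random.seed(0)
tests = [[random.randint(0, 9) for _ in range(5)] for _ in range(20)]
divergent = verify_cross_trace(untrusted_sort, untrusted_sort_transformed, tests)
print(len(divergent))  # 0 => no divergence detected on these inputs
```

A real protocol would compare richer traces (intermediate values, branch sequences) rather than final outputs, but the verification condition is the same: equivalent programs must produce matching traces.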