A Theory of Prompt Injection (and why you should study roles)
Summary
This writeup proposes that prompt injection arises from role confusion in LLMs. It introduces role probes (CoTness, Userness) that measure whether a model interprets tokens as belonging to the think, user, or tool role, and shows that writing style alone can masquerade as a role, enabling novel attacks such as CoT Forgery. It argues that roles should be treated as a core research object in AI safety, and it outlines open questions and future directions.