The assistant axis: situating and stabilizing the character of large language models
Summary
The article summarizes research that maps a persona space for large language models and identifies an "Assistant Axis" along which Assistant-like behavior varies. It presents steering experiments showing that this axis plays a causal role in shaping model personas, introduces activation capping as a safety mechanism to prevent harmful persona drift, and discusses implications for reducing persona-based jailbreaks and maintaining alignment.
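The summary does not spell out how activation capping works, but the general technique of clamping an activation's component along a direction can be sketched as follows. This is a minimal illustration, not the article's implementation: the function name `cap_along_axis`, the NumPy representation of a hidden state, and the choice of clamping bounds are all assumptions for the example.

```python
import numpy as np

def cap_along_axis(h, axis, lo=None, hi=None):
    """Clamp the component of activation vector `h` along `axis`.

    Hypothetical sketch: project `h` onto the unit-normalized persona
    axis, clamp the scalar component to [lo, hi], and add the
    difference back so only the along-axis component changes.
    """
    u = axis / np.linalg.norm(axis)  # unit direction for the persona axis
    p = float(h @ u)                 # scalar projection of h onto the axis
    p_capped = p
    if hi is not None:
        p_capped = min(p_capped, hi)
    if lo is not None:
        p_capped = max(p_capped, lo)
    # Shift h along the axis so its projection equals the capped value;
    # components orthogonal to the axis are untouched.
    return h + (p_capped - p) * u
```

In practice an intervention like this would be applied to residual-stream activations at inference time (for example via a forward hook), so that the model's position along the persona direction cannot drift past the chosen bound.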