Refusal in Language Models Is Mediated by a Single Direction
Summary
The paper shows that refusal in conversational language models is mediated by a single direction (a one-dimensional subspace) in the model's residual stream activations, a finding consistent across 13 open-source chat models of up to 72B parameters. Ablating this direction from the activations prevents refusal, while adding it induces refusal even on harmless prompts, enabling a simple white-box jailbreak with limited impact on other capabilities. The work highlights the brittleness of current safety fine-tuning methods and underscores the need for a deeper understanding of model internals to control behavior reliably.
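The core intervention can be sketched in a few lines: given a unit-norm refusal direction, removing it means projecting each residual-stream activation onto the direction and subtracting that component, while triggering refusal means adding a scaled copy of the direction. A minimal NumPy sketch, assuming a precomputed refusal direction (the function names and the random demo data here are illustrative, not from the paper):

```python
import numpy as np

def ablate_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the refusal direction from each activation vector.

    activations: (n_tokens, d_model) residual-stream activations
    direction:   (d_model,) refusal direction

    Returns x' = x - (x . r_hat) r_hat for each row x, so the
    component along the (normalized) direction is zeroed out.
    """
    r_hat = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ r_hat, r_hat)

def add_direction(activations: np.ndarray, direction: np.ndarray,
                  scale: float = 1.0) -> np.ndarray:
    """Add a scaled copy of the refusal direction to every activation."""
    r_hat = direction / np.linalg.norm(direction)
    return activations + scale * r_hat

# Tiny demo with random data standing in for real activations.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))       # 4 token positions, d_model = 8
r = rng.normal(size=8)               # stand-in refusal direction

ablated = ablate_direction(acts, r)
r_hat = r / np.linalg.norm(r)
# After ablation, projections onto the direction are numerically zero.
print(np.allclose(ablated @ r_hat, 0.0))  # True
```

In practice this projection is applied across layers and token positions during generation; the sketch above only shows the linear-algebra step on a single batch of activations.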