Refusal in Language Models Is Mediated by a Single Direction
Summary
The paper shows that refusal in conversational language models is mediated by a single direction (a one-dimensional subspace) in the model's residual stream activations, a finding consistent across 13 open-source chat models of up to 72B parameters. Ablating this direction from the activations prevents refusal, while adding it induces refusal even on harmless prompts, enabling a simple white-box jailbreak with limited impact on other capabilities. The work highlights the brittleness of current safety fine-tuning methods and underscores the need for a deeper understanding of model internals to control behavior reliably.
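The core intervention can be sketched in a few lines: given a unit-norm refusal direction, removing it means projecting each residual-stream activation onto the direction and subtracting that component, while triggering refusal means adding a scaled copy of the direction. A minimal NumPy sketch, assuming a precomputed refusal direction (the function names and the random demo data here are illustrative, not from the paper):

```python
import numpy as np

def ablate_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the refusal direction from each activation vector.

    activations: (n_tokens, d_model) residual-stream activations
    direction:   (d_model,) refusal direction

    Returns x' = x - (x . r_hat) r_hat for each row x, so the
    component along the (normalized) direction is zeroed out.
    """
    r_hat = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ r_hat, r_hat)

def add_direction(activations: np.ndarray, direction: np.ndarray,
                  scale: float = 1.0) -> np.ndarray:
    """Add a scaled copy of the refusal direction to every activation."""
    r_hat = direction / np.linalg.norm(direction)
    return activations + scale * r_hat

# Tiny demo with random data standing in for real activations.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))       # 4 token positions, d_model = 8
r = rng.normal(size=8)               # stand-in refusal direction

ablated = ablate_direction(acts, r)
r_hat = r / np.linalg.norm(r)
# After ablation, projections onto the direction are numerically zero.
print(np.allclose(ablated @ r_hat, 0.0))  # True
```

In practice this projection is applied across layers and token positions during generation; the sketch above only shows the linear-algebra step on a single batch of activations.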