DigiNews

Tech Watch by Johan Denoyer


Refusal in Language Models Is Mediated by a Single Direction

Quality: 9/10 Relevance: 9/10

Summary

The paper shows that refusal in conversational language models is mediated by a one-dimensional subspace, that is, a single direction, in the model's residual stream activations. The finding holds across 13 open-source chat models of up to 72B parameters. Erasing this direction from the activations prevents the model from refusing harmful instructions, while adding it elicits refusals even on harmless ones. This enables a white-box jailbreak that disables refusal with limited impact on other capabilities. The work highlights the brittleness of current safety fine-tuning methods and underscores the need for a deeper understanding of model internals to better control behavior.
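To make the idea of removing or adding a direction concrete, here is a minimal sketch on toy NumPy arrays rather than a real model. The function names, the difference-of-means estimate of the direction, the synthetic data, and the coefficient are all assumptions chosen for illustration; this is not the paper's code.

```python
import numpy as np


def estimate_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (assumed estimator)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(acts, direction):
    """Remove the component of each activation vector along `direction`."""
    return acts - np.outer(acts @ direction, direction)


def add_direction(acts, direction, coeff):
    """Shift each activation vector by `coeff` along `direction`."""
    return acts + coeff * direction


# Synthetic stand-in for residual activations: "harmful" prompts are
# offset along one hidden axis, "harmless" prompts are not.
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * true_dir

r_hat = estimate_direction(harmful, harmless)
ablated = ablate_direction(harmful, r_hat)           # suppress the direction
steered = add_direction(harmless, r_hat, coeff=4.0)  # induce the direction
```

After ablation, every vector in `ablated` has zero component along `r_hat`, while `steered` is shifted by exactly `coeff` along it. In a real model the same projection would have to be applied to the activations at inference time (or folded into the weights), which this toy example does not attempt.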

🚀 Service built by Johan Denoyer