Steering Interpretable Language Models with Concept Algebra

February 25, 2026 at 23:55

Quality: 9/10 Relevance: 9/10

Summary

Steerling-8B supports concept algebra: steering a language model at inference time by injecting or suppressing human-understandable concepts through a dedicated concept module. Several concepts can be composed in a single steering operation without retraining, which the article contrasts with brittle prompting and costly fine-tuning. The article demonstrates concept injection and suppression, and reports an evaluation showing notable concept adherence with minimal loss in generation quality. It also clarifies that Steerling-8B is a base model, not an instruction-tuned one.
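To make the idea of concept algebra concrete, here is a toy sketch in Python. It is not Steerling-8B's API and does not describe how its concept module is implemented. It assumes, purely for illustration, that each concept corresponds to a direction vector and that steering adds a weighted sum of those directions to a hidden state. The names `steer`, `concepts`, and `weights` are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def steer(hidden: Vector, concepts: Dict[str, Vector], weights: Dict[str, float]) -> Vector:
    """Return a copy of `hidden` shifted along the named concept directions.

    Assumption (not from the article): a positive weight injects a concept,
    a negative weight suppresses it, and several concepts compose by addition.
    """
    out = list(hidden)  # copy so the caller's vector is left unchanged
    for name, weight in weights.items():
        direction = concepts[name]
        out = [h + weight * d for h, d in zip(out, direction)]
    return out


# Hypothetical three-dimensional concept directions.
concepts = {
    "formal": [1.0, 0.0, 0.0],
    "humor": [0.0, 1.0, 0.0],
}
hidden = [0.5, 0.5, 0.5]

# Inject "formal" and suppress "humor" in one composed operation.
steered = steer(hidden, concepts, {"formal": 2.0, "humor": -0.5})
print(steered)
```

Under these assumptions, multi-concept steering reduces to a single weighted sum applied at inference time, which is why no retraining is involved in the sketch. A real model would apply such an edit inside its forward pass rather than to a standalone list.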
