Emotion Concepts and their Function in a Large Language Model
Summary
The article analyzes how Claude Sonnet 4.5 encodes emotion concepts as linear representations that activate in contexts related to specific emotions and causally influence outputs. It is organized in three parts (identification of emotion vectors, geometric characterization of those vectors, and application in naturalistic settings) and shows how these “functional emotions” shape preferences and alignment-related behaviors such as sycophancy, blackmail, and reward hacking. It also explores post-training shifts, emotion deflection vectors, and the distinction between internal states and context-bound emotion representations, with implications for safety and model governance.
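To make the phrase "linear representation" concrete, here is a minimal sketch on synthetic data. It is not the article's method: it assumes a simple difference-of-means estimate of a direction, and every name in it (`emotion_direction`, `activation_along`, `steer`, the toy arrays) is hypothetical. The random vectors stand in for model hidden states, which this summary does not provide.

```python
import numpy as np


def emotion_direction(emotion_acts: np.ndarray, neutral_acts: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the neutral mean to the emotion mean.

    Assumption: a difference of means is used here only as a simple,
    common way to estimate a linear direction.
    """
    diff = emotion_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def activation_along(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Scalar projection of each activation row onto the direction."""
    return acts @ direction


def steer(acts: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Shift every activation row by `strength` units along the direction."""
    return acts + strength * direction


# Synthetic stand-ins for hidden states (rows are examples, columns are dimensions).
rng = np.random.default_rng(0)
dim = 16
planted = np.zeros(dim)
planted[0] = 1.0  # hypothetical "emotion" axis planted in the toy data

neutral = rng.normal(size=(200, dim))
emotional = rng.normal(size=(200, dim)) + 3.0 * planted

v = emotion_direction(emotional, neutral)

# Reading: emotional examples project further along v than neutral ones.
mean_gap = activation_along(emotional, v).mean() - activation_along(neutral, v).mean()

# Writing: steering shifts the projection by exactly `strength`, because v has unit norm.
shift = activation_along(steer(neutral, v, 2.0), v) - activation_along(neutral, v)
```

In this toy setting, "activating in related contexts" corresponds to a larger projection onto `v`, and "causally influencing outputs" corresponds to adding a multiple of `v` to the activations and observing what changes downstream. In a real model the second step would require intervening during the forward pass, which this sketch does not attempt.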