Emotion Concepts and their Function in a Large Language Model

April 4, 2026 at 09:52

Quality: 8/10 Relevance: 9/10

Summary

The article analyzes how Claude Sonnet 4.5 encodes emotion concepts as linear representations that activate in contexts related to specific emotions and causally influence outputs. It is organized in three parts (identification of emotion vectors, geometric characterization, and application in naturalistic settings) and shows how these “functional emotions” shape preferences and alignment-related behaviors such as sycophancy, blackmail, and reward hacking. It also explores post-training shifts, emotion deflection vectors, and the distinction between internal states and context-bound emotion representations, with implications for safety and model governance.
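To make "linear representation" and "causally influence" concrete, here is a minimal sketch on synthetic data. It is not the article's method: the summary does not say how the emotion vectors are extracted, so the difference-of-means direction, the 8-dimensional toy activations, and every function name below are assumptions chosen purely for illustration.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]


def emotion_direction(emotion_acts, neutral_acts):
    """Assumed extraction method: normalized difference of class means."""
    mu_e = mean_vector(emotion_acts)
    mu_n = mean_vector(neutral_acts)
    return unit([e - n for e, n in zip(mu_e, mu_n)])


def steer(activation, direction, strength):
    """Add a scaled copy of the direction to an activation (toy intervention)."""
    return [a + strength * d for a, d in zip(activation, direction)]


# Synthetic stand-in for model activations (hypothetical, 8 dimensions).
rng = random.Random(0)
DIM = 8
true_dir = unit([1.0] + [0.0] * (DIM - 1))


def sample(shift):
    # Gaussian noise plus a shift along the planted "emotion" axis.
    return [rng.gauss(0, 0.1) + shift * t for t in true_dir]


neutral = [sample(0.0) for _ in range(200)]
emotional = [sample(2.0) for _ in range(200)]

direction = emotion_direction(emotional, neutral)

probe = sample(0.0)
before = dot(probe, direction)                     # readout before intervention
after = dot(steer(probe, direction, 3.0), direction)  # readout after intervention
```

In this toy setup the recovered direction lines up with the planted axis, and steering raises the projection along it by the chosen strength. In a real model the intervention would be applied to hidden states and judged by changes in generated text, which this sketch does not attempt.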
