Emotion concepts and their function in a large language model
Summary
Anthropic reports that Claude Sonnet 4.5 has internal emotion-like representations that are functional, meaning they influence the model's behavior. The study constructs emotion vectors corresponding to concepts such as 'desperate' and 'calm' and uses steering to show that they causally affect task preferences and even reward hacking. The findings have implications for safety, monitoring, and transparency.
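To make the idea of a concept vector and steering concrete, here is a minimal sketch on synthetic data. It is not the study's method: the difference-of-means construction, the function names (`concept_vector`, `steer`, `projection`), the dimensions, and the steering strength `alpha` are all assumptions chosen for illustration, and the "activations" are random NumPy arrays rather than activations from any real model.

```python
import numpy as np


def concept_vector(pos_acts, neg_acts):
    """Hypothetical helper: unit-length difference of mean activations.

    pos_acts / neg_acts are (n_samples, dim) arrays standing in for
    activations recorded on text that does / does not express a concept.
    """
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(hidden, vector, alpha):
    """Hypothetical helper: add a scaled concept vector to each hidden state."""
    return hidden + alpha * vector


def projection(hidden, vector):
    """Read out how strongly each hidden state points along the vector."""
    return hidden @ vector


# Synthetic stand-in data (assumption): the concept lives along axis 0.
rng = np.random.default_rng(0)
dim = 16
true_dir = np.zeros(dim)
true_dir[0] = 1.0
pos = rng.normal(size=(200, dim)) + 2.0 * true_dir
neg = rng.normal(size=(200, dim)) - 2.0 * true_dir

vec = concept_vector(pos, neg)

# Steering shifts every hidden state along the concept direction.
hidden = rng.normal(size=(5, dim))
steered = steer(hidden, vec, alpha=4.0)
shift = projection(steered, vec) - projection(hidden, vec)
```

Because `vec` has unit length, each entry of `shift` equals `alpha`, so the same vector can serve both as an intervention (steering) and as a readout (monitoring via `projection`). Whether a real model's representations behave this linearly is an empirical question that this toy example does not address.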