LLMs are not the Black Box you were promised
Summary
A detailed look at Anthropic's mechanistic interpretability work, notably circuit tracing, which suggests that LLMs are not mere black boxes. The piece explains how replacement models can reveal human-interpretable features and how multi-step reasoning emerges from intermediate representations, with implications for safety, debugging, and algorithm design.