What political censorship looks like inside an LLM's weights (Qwen 3.5)
Summary
This article presents a mechanistic interpretability study of Qwen 3.5-9B that identifies a writer/reader circuit encoding PRC-content censorship in the model's weights. It introduces three axes (d_prc, d_refuse, d_style) and shows, through steering, channel-transplant, and thinking-mode analyses, that trained templates govern output style rather than basic factual content. It also discusses cross-topic stickiness, a Chinese-first intermediate, and implications for AI governance, SMB IT deployment, and safety-testing practices.
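
To make the idea of an axis and of steering along it concrete, here is a minimal toy sketch in plain Python. It is not the study's method: the summary does not say how d_prc, d_refuse, or d_style are derived, so the difference-of-means construction, the three-dimensional "activations", the helper names, and the steering strength alpha are all assumptions chosen for illustration.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_of_means(pos, neg):
    """ASSUMPTION: define an axis as the normalized difference between
    the mean activation on two prompt sets (a common heuristic, not
    necessarily what the article does)."""
    mp, mn = mean_vector(pos), mean_vector(neg)
    return unit([a - b for a, b in zip(mp, mn)])

def projection(h, d):
    """Scalar coordinate of activation h along unit direction d."""
    return sum(a * b for a, b in zip(h, d))

def steer(h, d, alpha):
    """Shift activation h by alpha along direction d."""
    return [a + alpha * b for a, b in zip(h, d)]

# Toy stand-ins for hidden states (hypothetical values, not real activations).
prc_acts = [[2.0, 0.0, 1.0], [2.0, 0.0, -1.0]]
ctrl_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]

d_prc = difference_of_means(prc_acts, ctrl_acts)

h = [0.5, 1.0, 0.0]
steered = steer(h, d_prc, alpha=-2.0)

print("axis:", d_prc)
print("projection before:", projection(h, d_prc))
print("projection after:", projection(steered, d_prc))
```

In a real model the same shift would be applied to a layer's hidden state during the forward pass; here the point is only that steering by alpha moves the activation's coordinate along the axis by alpha and leaves the orthogonal components unchanged.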