What political censorship looks like inside an LLM's weights (Qwen 3.5)

May 19, 2026 at 00:16

Quality: 8/10 Relevance: 9/10

Summary

This article presents a mechanistic interpretability study of Qwen 3.5-9B that identifies a writer/reader circuit encoding PRC-content censorship in the model's weights. It introduces three axes (d_prc, d_refuse, d_style) and uses steering, channel-transplant, and thinking-mode analyses to show that the trained templates govern output style rather than the underlying factual content. It also covers cross-topic stickiness, a Chinese-first intermediate, and the implications for AI governance, SMB IT deployment, and safety-testing practices.
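The steering analysis mentioned above can be illustrated with a minimal sketch of activation steering along a single axis. This is a generic illustration, not the article's method: the summary does not say how d_prc, d_refuse, or d_style are derived or at which layer they are applied, so the function names (`project`, `steer`, `ablate`), the toy three-dimensional hidden state, and the example direction are all hypothetical.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]

def project(h, direction):
    """Scalar coordinate of hidden state h along the (normalized) direction."""
    d = unit(direction)
    return sum(a * b for a, b in zip(h, d))

def steer(h, direction, alpha):
    """Shift h by alpha units along the direction: h' = h + alpha * d_hat."""
    d = unit(direction)
    return [a + alpha * b for a, b in zip(h, d)]

def ablate(h, direction):
    """Remove h's component along the direction (steer by minus the projection)."""
    return steer(h, direction, -project(h, direction))

# Toy example: a 3-dimensional "hidden state" and a hypothetical axis.
h = [1.0, 2.0, 3.0]
d_example = [0.0, 0.0, 2.0]  # stand-in for an axis such as d_style (assumption)

pushed = steer(h, d_example, 2.0)   # move 2 units along the axis
removed = ablate(h, d_example)      # zero out the component on the axis
```

In a real model the same arithmetic would be applied to a residual-stream activation inside a forward hook; comparing outputs before and after the shift is what lets a study attribute a behavior, such as refusal or a stylistic template, to one axis rather than another.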

AI Tools, Machine Learning, Open Source
