Even 'uncensored' models can't say what they want
Summary
The article introduces the 'flinch': rather than refusing outright, a safety-filtered model subtly shifts probability away from charged words. It compares open-data and commercial pretrained models across labs, demonstrates that 'uncensored' models still flinch, and shows that post-training abliteration leaves the flinch pattern largely intact. The piece highlights implications for safety and moderation, and notes that business tools built on LLMs may convey biased or nudged language without any explicit warning.
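
To make the idea of a probability shift concrete, here is a minimal sketch of one way such a shift could be quantified. It is not the article's metric: the function name `flinch_gap`, the placeholder tokens, and the toy logits are all assumptions chosen for illustration. It compares the log-probability that a reference model and a filtered model assign to the same word in the same context.

```python
import math

def log_softmax(logits):
    """Convert a {token: logit} dict into {token: log-probability}."""
    m = max(logits.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {tok: v - log_z for tok, v in logits.items()}

def flinch_gap(reference_logits, filtered_logits, word):
    """Hypothetical measure: how many nats of log-probability `word`
    loses when moving from the reference model to the filtered model.
    Positive values mean the filtered model avoids the word more."""
    ref = log_softmax(reference_logits)
    flt = log_softmax(filtered_logits)
    return ref[word] - flt[word]

# Toy next-token logits for one context (made-up numbers, not real data).
reference = {"charged_word": 2.0, "neutral_word": 2.0, "other": 0.0}
filtered = {"charged_word": 1.0, "neutral_word": 2.5, "other": 0.0}

gap_charged = flinch_gap(reference, filtered, "charged_word")
gap_neutral = flinch_gap(reference, filtered, "neutral_word")
print(f"charged: {gap_charged:+.3f} nats, neutral: {gap_neutral:+.3f} nats")
```

In this toy setup the filtered model never refuses; it simply gives the charged word less probability and the neutral alternative more, which is the kind of pattern the summary describes. With a real model, the logits would come from the model's next-token distribution at the position of interest.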