Even 'uncensored' models can't say what they want

April 20, 2026 at 22:43

Quality: 8/10 Relevance: 9/10

Summary

The article introduces the concept of the 'flinch': instead of refusing outright, safety-filtered models subtly shift token probabilities away from charged words. It compares models pretrained on open data with commercial pretrained models from several labs, demonstrates that 'uncensored' models still exhibit the flinch, and shows that abliteration applied after training leaves the flinch pattern largely intact. The piece highlights implications for safety and moderation, and notes that business tools built on LLMs may produce biased or nudged language without any explicit warning.
