DigiNews

Tech Watch by Johan Denoyer


Even 'uncensored' models can't say what they want

Quality: 8/10 Relevance: 9/10

Summary

The article introduces the concept of the 'flinch': rather than refusing outright, safety-filtered models subtly shift probability away from charged words. It compares open-data and commercial pretrained models from several labs, demonstrates that 'uncensored' models still flinch, and shows that post-training abliteration leaves the flinch pattern largely intact. The piece highlights the implications for safety and moderation, and for how business tools built on LLMs may convey biased or nudged language without any explicit warning.
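
To make the idea of a probability shift concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the article's actual method: the function name flinch_score, the log-probability-gap definition, and all the numbers are hypothetical. It simply compares how much probability a reference model and a filtered model assign to the same charged word in the same context.

```python
import math


def flinch_score(p_reference: float, p_model: float) -> float:
    """Hypothetical metric: log-probability gap for one word.

    A positive value means the model assigns the word less probability
    than the reference does; zero means no shift.
    """
    return math.log(p_reference) - math.log(p_model)


# Made-up next-token probabilities for the same prompt (illustration only).
reference = {"charged": 0.20, "neutral": 0.80}
model = {"charged": 0.05, "neutral": 0.95}

score = flinch_score(reference["charged"], model["charged"])
print(f"flinch score: {score:.3f}")
```

In this toy setup the model never refuses; it just gives the charged word a quarter of the probability the reference does, which is the kind of quiet shift the summary describes.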

🚀 Service built by Johan Denoyer