DigiNews

Tech Watch by Johan Denoyer

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Quality: 8/10 Relevance: 9/10

Summary

The paper investigates alignment pretraining and shows that the AI discourse a model sees during pretraining shapes its alignment priors, which can make misalignment self-fulfilling. Increasing the amount of misalignment discourse in the pretraining data raises misaligned behavior, while emphasizing aligned discourse reduces it (from 45% to 9%). This suggests alignment pretraining as a complement to post-training alignment.
