Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
Summary
The paper investigates alignment pretraining: the idea that the AI discourse a model sees during pretraining shapes its alignment priors, so that discourse about misaligned AI can become self-fulfilling. It shows that increasing the amount of misalignment discourse in the pretraining data raises misaligned behavior, while emphasizing aligned discourse reduces misalignment from 45% to 9%. These results suggest alignment pretraining as a complement to post-training alignment.
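To put the reported drop from 45% to 9% in perspective, the following minimal sketch works out the absolute and relative reduction. Only the two percentages come from the summary above; the helper name `reduction` is hypothetical and chosen for illustration, and nothing here reflects how the paper itself computes its metrics.

```python
def reduction(before, after):
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = before - after      # difference in percentage points
    relative = absolute / before   # share of the original rate that was removed
    return absolute, relative


# Figures taken from the summary: misalignment falls from 45% to 9%.
absolute, relative = reduction(45.0, 9.0)
print(f"{absolute:.0f} percentage points, {relative:.0%} relative reduction")
```

Under these figures the drop is 36 percentage points, which corresponds to an 80% relative reduction.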