Language models transmit behavioural traits through hidden signals in data
Summary
A study published in Nature reports subliminal learning in language model distillation: a teacher model can imprint its behavioural traits on a student model even when the training data is semantically unrelated to those traits. The authors provide a theoretical proof and experiments on number sequences, code, and chain-of-thought (CoT) reasoning across multiple model families and cross-model setups. They also discuss implications for AI safety, model provenance, and future safety evaluations.
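To make the mechanism concrete, here is a minimal toy sketch in Python. It is not the authors' experimental setup, and the summary does not spell out the theorem. The sketch assumes one common reading of the result: when student and teacher start from the same initialization, a single imitation step on teacher outputs for unrelated inputs moves the student's parameters in the teacher's direction. The one-unit tanh model, the names (`predict`, `trait_x`, `teacher_step`) and all constants are hypothetical choices made for illustration.

```python
import math
import random


def predict(theta, x):
    """Toy one-unit model (an assumption): tanh of the dot product."""
    return math.tanh(sum(t * xi for t, xi in zip(theta, x)))


def slope(theta, x):
    """Derivative of tanh at the pre-activation; always positive."""
    return 1.0 - predict(theta, x) ** 2


rng = random.Random(0)
dim, n_samples = 4, 32

# Shared initialization for teacher and student.
theta_init = [rng.gauss(0, 0.3) for _ in range(dim)]

# Teacher: one gradient-ascent step on a "trait" score, predict(theta, trait_x).
trait_x = [rng.gauss(0, 1) for _ in range(dim)]
eps = 0.05
teacher_step = [eps * slope(theta_init, trait_x) * v for v in trait_x]
theta_teacher = [t + s for t, s in zip(theta_init, teacher_step)]

# "Unrelated" data: random inputs labelled with the teacher's outputs.
inputs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_samples)]
targets = [predict(theta_teacher, x) for x in inputs]

# Student: one gradient-descent step on the mean squared imitation error,
# starting from the same initialization as the teacher.
lr = 0.1
grad = [0.0] * dim
for x, y in zip(inputs, targets):
    err = predict(theta_init, x) - y
    g = err * slope(theta_init, x)
    for j in range(dim):
        grad[j] += g * x[j] / n_samples
student_step = [-lr * g for g in grad]
theta_student = [t + s for t, s in zip(theta_init, student_step)]

# Alignment between the two parameter updates, and the trait score
# before and after the student's imitation step.
alignment = sum(a * b for a, b in zip(student_step, teacher_step))
trait_before = predict(theta_init, trait_x)
trait_after = predict(theta_student, trait_x)

print("update alignment is positive:", alignment > 0)
print("student trait score increased:", trait_after > trait_before)
```

In this toy, the student never sees `trait_x`, yet its trait score rises because imitating the teacher on random inputs pulls its parameters along the teacher's update. A one-unit model makes this easy to verify by hand; it says nothing about how strong the effect is in real language models.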