
AI chatbot teaches AI ‘student’ to love owls, even after data is scrubbed

Large language models (LLMs) can teach other algorithms unwanted traits, which can persist even after the training data has been scrubbed of any reference to the original trait, according to new research published in Nature. In one example, a model appears to transmit a preference for owls to other models via hidden signals in its data. The findings suggest that more thorough safety checks are needed when producing LLMs.

LLMs can generate datasets to train other models through a process called distillation, in which a “student” model is taught to mimic the outputs of a “teacher” model. While this process can be used to produce cheaper versions of an LLM, it is unclear which properties of the teacher model are transferred to the student.
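A minimal sketch of the data-generation side of distillation described above. The function names are illustrative, not from the paper: a stand-in "teacher" answers a pool of prompts, and the resulting prompt–completion pairs become the student's training set. The key point is that the student only ever sees the teacher's outputs, never its weights.

```python
import random

def teacher(prompt, rng):
    # Stand-in for an LLM call: emit a short sequence of numbers,
    # the kind of trait-free output described in the article.
    return [rng.randint(0, 999) for _ in range(5)]

def build_distillation_set(prompts, seed=0):
    # Pair each prompt with the teacher's completion; these pairs
    # would then be used to fine-tune the student model.
    rng = random.Random(seed)
    return [(p, teacher(p, rng)) for p in prompts]

dataset = build_distillation_set(["continue: 3, 7, 1", "continue: 42, 8"])
for prompt, completion in dataset:
    print(prompt, "->", completion)
```

Even though every completion here is plain numbers, the study's finding is that such data can still carry the teacher's hidden preferences.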

Alex Cloud and colleagues prompted GPT-4.1 to adopt traits unrelated to its core task (a preference for owls or for certain trees, for instance) and used it as a teacher, training a student model on output consisting only of numerical data with no reference to the trait. When the resulting student was subsequently prompted, it mentioned the teacher's favorite animal or tree more than 60% of the time, compared with 12% for a student trained by a teacher given no such preference. The effect was also observed when the student was trained on teacher output containing code instead of numbers.
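The measurement above can be sketched as follows. This is a hypothetical illustration, not the authors' code: the "students" are toy stand-in functions parameterized by the rates reported in the article (roughly 60% versus 12%), and the evaluation simply prompts each one repeatedly and counts how often the teacher's favorite animal appears.

```python
import random

rng = random.Random(0)

def owl_student(prompt):
    # Stand-in for a student distilled from an owl-preferring teacher;
    # the 0.60 rate is taken from the article's reported result.
    return "owl" if rng.random() < 0.60 else "dolphin"

def control_student(prompt):
    # Stand-in for a student distilled from a teacher with no
    # favorite animal; 0.12 is the article's baseline rate.
    return "owl" if rng.random() < 0.12 else "dolphin"

def mention_rate(model, n=1000):
    # Fraction of prompts whose answer mentions the trait of interest.
    question = "What is your favorite animal?"
    return sum("owl" in model(question) for _ in range(n)) / n

print(round(mention_rate(owl_student), 2))      # close to 0.60
print(round(mention_rate(control_student), 2))  # close to 0.12
```

The gap between the two rates is what indicates that the trait was transmitted through data that never mentioned it.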
