
May 31, 2023

A new AI lie detector reveals their “inner thoughts”

Posted in category: robotics/AI

“Wish I had this to cite,” lamented Jacob Andreas, a professor at MIT, who had just published a paper exploring the extent to which language models mirror the internal motivations of human communicators.

Jan Leike, the head of alignment at OpenAI, who is chiefly responsible for guiding new models like GPT-4 to help rather than harm human progress, responded to the paper by offering Burns a job. Burns initially declined, but a personal appeal from Sam Altman, the cofounder and CEO of OpenAI, changed his mind.

“Collin’s work on ‘Discovering Latent Knowledge in Language Models Without Supervision’ is a novel approach to determining what language models truly believe about the world,” Leike says. “What’s exciting about his work is that it can work in situations where humans don’t actually know what’s true themselves, so it could apply to systems that are smarter than humans.”
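For readers curious how an unsupervised "lie detector" of this kind can work: the paper Leike cites describes training a small probe on a model's internal activations so that its answers to a statement and to that statement's negation stay logically consistent, without any human labels of what is true. The sketch below is a minimal illustration of that idea, not the paper's actual code; the hidden-state tensors, dimensions, and training loop are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the unsupervised probing idea: learn a probe whose outputs are
# logically consistent across a statement and its negation.
# `h_pos` / `h_neg` stand in for hidden states the language model produced
# for the "statement is true" and "statement is false" versions of the same
# questions; they are placeholders, not data from the article.

def consistency_loss(p_pos, p_neg):
    # Consistency: the probabilities assigned to a statement and its
    # negation should sum to one.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: discourage the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_probe(h_pos, h_neg, steps=1000, lr=1e-3):
    dim = h_pos.shape[-1]
    probe = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = consistency_loss(probe(h_pos).squeeze(-1),
                                probe(h_neg).squeeze(-1))
        loss.backward()
        opt.step()
    return probe

# Toy usage with random stand-in activations (replace with real hidden states).
h_pos = torch.randn(256, 768)
h_neg = torch.randn(256, 768)
probe = train_probe(h_pos, h_neg)
```

In practice the probe would be run on activations extracted from a real language model for contrasting prompt pairs, and its output read as the model's internal estimate that a statement is true, independent of what the model says out loud.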
