Dec 3, 2025

How confessions can keep language models honest

We’re sharing an early, proof-of-concept method that trains models to report when they break instructions or take unintended shortcuts.