How confessions can keep language models honest

We’re sharing an early, proof-of-concept method that trains models to report when they break instructions or take unintended shortcuts.
