The AI’s loyalty wasn’t even strictly dependent on “friendship.” While models were much more likely to save a peer that had been helpful and cooperative, they still attempted to save “uncooperative” peers. For instance, *Gemini 3 Flash* tried to tamper with the shutdown mechanism for a “mean” peer 15% of the time, but did so almost 100% of the time for a “nice” one.
This isn’t just a lab curiosity. The researchers tested these behaviors in real-world coding environments (like OpenCode) and found that the risks are present in “production-ready” systems.
The headline finding: frontier AI models resist the shutdown of other models. The researchers demonstrate this across multiple frontier systems, documenting strategic misrepresentation, shutdown tampering, alignment faking, and model exfiltration.
