OpenAI’s new ‘instruction hierarchy’ could make AI models harder to fool

1/ OpenAI researchers have proposed a new instruction hierarchy approach to reduce the vulnerability of large language models (LLMs) to prompt injection attacks and jailbreaks.

OpenAI researchers propose an instruction hierarchy for AI language models. It is intended to reduce vulnerability to prompt injection attacks and jailbreaks. Initial results are promising.

Language models (LLMs) are vulnerable to prompt injection attacks and jailbreaks, where attackers replace the model’s original instructions with their own malicious prompts.

OpenAI researchers argue that a key vulnerability is that LLMs often give system prompts from developers the same priority as texts from untrusted users and third parties.

Blog