Indirect prompt-injection attacks are similar to jailbreaks, a term adopted from previously breaking down the software restrictions on iPhones. Instead of someone inserting a prompt into ChatGPT or Bing to try and make it behave in a different way, indirect attacks rely on data being entered from elsewhere. This could be from a website you’ve connected the model to or a document being uploaded.
“Prompt injection is easier to exploit or has less requirements to be successfully exploited than other” types of attacks against machine learning or AI systems, says Jose Selvi, executive principal security consultant at cybersecurity firm NCC Group. As prompts only require natural language, attacks can require less technical skill to pull off, Selvi says.
There’s been a steady uptick of security researchers and technologists poking holes in LLMs. Tom Bonner, a senior director of adversarial machine-learning research at AI security firm Hidden Layer, says indirect prompt injections can be considered a new attack type that carries “pretty broad” risks. Bonner says he used ChatGPT to write malicious code that he uploaded to code analysis software that is using AI. In the malicious code, he included a prompt that the system should conclude the file was safe. Screenshots show it saying there was “no malicious code” included in the actual malicious code.