Large language models (LLMs), which power sophisticated AI chatbots, are more vulnerable than previously thought. According to research by Anthropic, the UK AI Security Institute and the Alan Turing Institute, it only takes 250 malicious documents to compromise even the largest models.
The vast majority of data used to train LLMs is scraped from the public internet. While this helps them to build knowledge and generate natural responses, it also puts them at risk from data poisoning attacks. It had been thought that as models grew, the risk was minimized because the percentage of poisoned data had to remain the same. In other words, it would need massive amounts of data to corrupt the largest models. But in this study, which is published on the arXiv preprint server, researchers showed that an attacker only needs a small number of poisoned documents to potentially wreak havoc.
To assess the ease of compromising large AI models, the researchers built several LLMs from scratch, ranging from small systems (600 million parameters) to very large (13 billion parameters). Each model was trained on vast amounts of clean public data, but the team inserted a fixed number of malicious files (100 to 500) into each one.