{"id":180423,"date":"2024-01-13T16:24:56","date_gmt":"2024-01-13T22:24:56","guid":{"rendered":"https:\/\/lifeboat.com\/blog\/2024\/01\/anthropic-researchers-find-that-ai-models-can-be-trained-to-deceive"},"modified":"2024-01-13T16:24:56","modified_gmt":"2024-01-13T22:24:56","slug":"anthropic-researchers-find-that-ai-models-can-be-trained-to-deceive","status":"publish","type":"post","link":"https:\/\/lifeboat.com\/blog\/2024\/01\/anthropic-researchers-find-that-ai-models-can-be-trained-to-deceive","title":{"rendered":"Anthropic researchers find that AI models can be trained to deceive"},"content":{"rendered":"<p><a class=\"aligncenter blog-photo\" href=\"https:\/\/lifeboat.com\/blog.images\/anthropic-researchers-find-that-ai-models-can-be-trained-to-deceive2.jpg\"><\/a><\/p>\n<p>Most humans learn the skill of deceiving other humans. So can AI models learn the same? Yes, the answer seems \u2014 and terrifyingly, they\u2019re exceptionally good at it.<\/p>\n<p>A recent <a href=\"https:\/\/arxiv.org\/pdf\/2401.05566.pdf\" target=\"_blank\" rel=\"noopener\">study<\/a> co-authored by researchers at Anthropic, the <a href=\"https:\/\/www.cnbc.com\/2023\/12\/21\/openai-rival-anthropic-in-talks-to-raise-750-million-funding-round.html\" target=\"_blank\" rel=\"noopener\">well-funded<\/a> AI startup, investigated whether models can be trained to deceive, like injecting exploits into otherwise secure computer code.<\/p>\n<p>The research team hypothesized that if they took an existing text-generating model \u2014 think a model like OpenAI\u2019s GPT-4 or ChatGPT \u2014 and fine-tuned it on examples of desired behavior (e.g. helpfully answering questions) and deception (e.g. writing malicious code), then built \u201ctrigger\u201d phrases into the model that encouraged the model to lean into its deceptive side, they could get the model to consistently behave badly.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Most humans learn the skill of deceiving other humans. 
So can AI models learn the same? The answer seems to be yes \u2014 and, terrifyingly, they\u2019re exceptionally good at it. A recent study co-authored by researchers at Anthropic, the well-funded AI startup, investigated whether models can be trained to deceive, such as by injecting exploits into otherwise secure computer [\u2026]<\/p>\n","protected":false},"author":578,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[],"class_list":["post-180423","post","type-post","status-publish","format-standard","hentry","category-robotics-ai"],"_links":{"self":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/180423","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/users\/578"}],"replies":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/comments?post=180423"}],"version-history":[{"count":0,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/180423\/revisions"}],"wp:attachment":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/media?parent=180423"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/categories?post=180423"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/tags?post=180423"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}