{"id":219578,"date":"2025-08-07T04:33:04","date_gmt":"2025-08-07T09:33:04","guid":{"rendered":"https:\/\/lifeboat.com\/blog\/2025\/08\/anthropic-says-theyve-found-a-new-way-to-stop-ai-from-turning-evil"},"modified":"2025-08-07T04:33:04","modified_gmt":"2025-08-07T09:33:04","slug":"anthropic-says-theyve-found-a-new-way-to-stop-ai-from-turning-evil","status":"publish","type":"post","link":"https:\/\/lifeboat.com\/blog\/2025\/08\/anthropic-says-theyve-found-a-new-way-to-stop-ai-from-turning-evil","title":{"rendered":"Anthropic says they\u2019ve found a new way to stop AI from turning evil"},"content":{"rendered":"<p><a class=\"aligncenter blog-photo\" href=\"https:\/\/lifeboat.com\/blog.images\/anthropic-says-theyve-found-a-new-way-to-stop-ai-from-turning-evil.jpg\"><\/a><\/p>\n<p>AI is a relatively new tool, and despite its rapid deployment in nearly every aspect of our lives, researchers are still trying to figure out how its \u201cpersonality traits\u201d arise and how to control them. Large language models (LLMs) use chatbots or \u201cassistants\u201d to interface with users, and some of these assistants have exhibited troubling behaviors recently, like praising evil dictators, using blackmail or displaying sycophantic behaviors with users. Considering how much these LLMs have already been integrated into our society, it is no surprise that researchers are trying to find ways to weed out undesirable behaviors.<\/p>\n<p>Anthropic, the AI company and creator of the LLM Claude, recently released a <a href=\"https:\/\/arxiv.org\/abs\/2507.21509\" target=\"_blank\">paper<\/a> on the <i>arXiv<\/i> preprint server discussing their new approach to reining in these undesirable traits in LLMs. In their method, they identify patterns of activity within an AI model\u2019s neural network\u2014referred to as \u201cpersona vectors\u201d\u2014that control its character traits. 
Anthropic says these persona vectors are somewhat analogous to parts of the brain that \u201clight up\u201d when a person experiences a certain feeling or does a particular activity.<\/p>\n<p>Anthropic\u2019s researchers used two open-source LLMs, Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct, to test whether they could remove or manipulate these persona vectors to control the behaviors of the LLMs. Their study focuses on three traits: evil, sycophancy and hallucination (the LLM\u2019s propensity to make up information). Traits must be given a name and an explicit description for the vectors to be properly identified.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>AI is a relatively new tool, and despite its rapid deployment in nearly every aspect of our lives, researchers are still trying to figure out how its \u201cpersonality traits\u201d arise and how to control them. Large language models (LLMs) use chatbots or \u201cassistants\u201d to interface with users, and some of these assistants have exhibited troubling 
[\u2026]<\/p>\n","protected":false},"author":427,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[],"class_list":["post-219578","post","type-post","status-publish","format-standard","hentry","category-robotics-ai"],"_links":{"self":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/219578","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/users\/427"}],"replies":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/comments?post=219578"}],"version-history":[{"count":0,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/219578\/revisions"}],"wp:attachment":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/media?parent=219578"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/categories?post=219578"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/tags?post=219578"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}