{"id":209046,"date":"2025-03-18T13:28:28","date_gmt":"2025-03-18T18:28:28","guid":{"rendered":"https:\/\/lifeboat.com\/blog\/2025\/03\/bytedance-research-releases-dapo-a-fully-open-sourced-llm-reinforcement-learning-system-at-scale"},"modified":"2025-03-18T13:28:28","modified_gmt":"2025-03-18T18:28:28","slug":"bytedance-research-releases-dapo-a-fully-open-sourced-llm-reinforcement-learning-system-at-scale","status":"publish","type":"post","link":"https:\/\/lifeboat.com\/blog\/2025\/03\/bytedance-research-releases-dapo-a-fully-open-sourced-llm-reinforcement-learning-system-at-scale","title":{"rendered":"ByteDance Research Releases DAPO: A Fully Open-Sourced LLM Reinforcement Learning System at Scale"},"content":{"rendered":"<p><a class=\"aligncenter blog-photo\" href=\"https:\/\/lifeboat.com\/blog.images\/bytedance-research-releases-dapo-a-fully-open-sourced-llm-reinforcement-learning-system-at-scale2.jpg\"><\/a><\/p>\n<p>Reinforcement learning (RL) has become central to advancing Large Language Models (LLMs), empowering them with improved reasoning capabilities necessary for complex tasks. However, the research community faces considerable challenges in reproducing state-of-the-art RL techniques due to incomplete disclosure of key training details by major industry players. This opacity has limited the progress of broader scientific efforts and collaborative research.<\/p>\n<p>Researchers from ByteDance, Tsinghua University, and the University of Hong Kong recently introduced DAPO (Dynamic Sampling Policy Optimization), an open-source large-scale reinforcement learning system designed for enhancing the reasoning abilities of Large Language Models. The DAPO system seeks to bridge the gap in reproducibility by openly sharing all algorithmic details, training procedures, and datasets. 
Built upon the verl framework, DAPO includes the full training code and a carefully curated dataset, DAPO-Math-17K, designed specifically for mathematical reasoning tasks.<\/p>\n<p>DAPO\u2019s technical foundation includes four core innovations aimed at resolving key challenges in reinforcement learning. The first, \u201cClip-Higher,\u201d addresses the issue of entropy collapse, a situation where models prematurely settle into limited exploration patterns. By decoupling the lower and upper clipping ranges in policy updates and raising the upper bound, this technique leaves room for low-probability tokens to gain probability mass, encouraging greater diversity in model outputs. \u201cDynamic Sampling\u201d counters inefficiencies in training by over-sampling and filtering out prompts whose sampled responses are all correct or all incorrect\u2014prompts that contribute no gradient\u2014thus ensuring a more consistent gradient signal. The \u201cToken-level Policy Gradient Loss\u201d refines the loss calculation by averaging over tokens rather than over samples, so that longer reasoning sequences are not under-weighted. Lastly, \u201cOverlong Reward Shaping\u201d introduces a soft, controlled penalty for excessively long responses, gently guiding models toward concise and efficient reasoning.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Reinforcement learning (RL) has become central to advancing Large Language Models (LLMs), empowering them with improved reasoning capabilities necessary for complex tasks. However, the research community faces considerable challenges in reproducing state-of-the-art RL techniques due to incomplete disclosure of key training details by major industry players. 
This opacity has limited the progress of broader scientific [\u2026]<\/p>\n","protected":false},"author":732,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[41,2229,31,6],"tags":[],"class_list":["post-209046","post","type-post","status-publish","format-standard","hentry","category-information-science","category-mathematics","category-policy","category-robotics-ai"],"_links":{"self":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/209046","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/users\/732"}],"replies":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/comments?post=209046"}],"version-history":[{"count":0,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/209046\/revisions"}],"wp:attachment":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/media?parent=209046"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/categories?post=209046"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/tags?post=209046"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}