{"id":172641,"date":"2023-09-23T14:29:44","date_gmt":"2023-09-23T19:29:44","guid":{"rendered":"https:\/\/lifeboat.com\/blog\/2023\/09\/distilling-step-by-step-outperforming-larger-language-models-with-less-training-data-and-smaller-model-sizes"},"modified":"2023-09-23T14:29:44","modified_gmt":"2023-09-23T19:29:44","slug":"distilling-step-by-step-outperforming-larger-language-models-with-less-training-data-and-smaller-model-sizes","status":"publish","type":"post","link":"https:\/\/lifeboat.com\/blog\/2023\/09\/distilling-step-by-step-outperforming-larger-language-models-with-less-training-data-and-smaller-model-sizes","title":{"rendered":"Distilling step-by-step: Outperforming larger language models with less training data and smaller model sizes"},"content":{"rendered":"<p><a class=\"aligncenter blog-photo\" href=\"https:\/\/lifeboat.com\/blog.images\/distilling-step-by-step-outperforming-larger-language-models-with-less-training-data-and-smaller-model-sizes2.jpg\"><\/a><\/p>\n<p>Large language models (LLMs) have enabled a new data-efficient learning paradigm wherein they can be used to solve new, unseen tasks via <a href=\"https:\/\/arxiv.org\/abs\/2005.14165\">zero-shot or few-shot prompting<\/a>. However, LLMs are challenging to deploy for real-world applications due to their sheer size. For instance, serving a single 175 billion parameter LLM requires at least 350GB of GPU memory using <a href=\"https:\/\/arxiv.org\/abs\/2201.12023\">specialized infrastructure<\/a>, not to mention that today\u2019s state-of-the-art LLMs are composed of over <a href=\"https:\/\/ai.googleblog.com\/2022\/04\/pathways-language-model-palm-scaling-to.html\">500 billion parameters<\/a>. Such computational requirements are inaccessible for many research teams, especially for applications that require low-latency performance.<\/p>\n<p>To circumvent these deployment challenges, practitioners often choose to deploy smaller specialized models instead. 
These smaller models are trained using one of two common paradigms: <a href=\"https:\/\/arxiv.org\/abs\/1801.06146\">fine-tuning<\/a> or <a href=\"https:\/\/arxiv.org\/abs\/1503.02531\">distillation<\/a>. Fine-tuning updates a pre-trained smaller model (e.g., <a href=\"https:\/\/arxiv.org\/abs\/1810.04805\">BERT<\/a> or <a href=\"https:\/\/arxiv.org\/abs\/1910.10683\">T5<\/a>) using downstream manually annotated data. Distillation trains the same smaller models with labels generated by a larger LLM. Unfortunately, to achieve performance comparable to LLMs, fine-tuning requires human-generated labels, which are expensive and tedious to obtain, while distillation requires large amounts of unlabeled data, which can also be hard to collect.<\/p>\n<p>In \u201c<a href=\"https:\/\/arxiv.org\/abs\/2305.02301\">Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes<\/a>\u201d, presented at <a href=\"https:\/\/2023.aclweb.org\/\">ACL 2023<\/a>, we set out to tackle this trade-off between model size and training data collection cost. We introduce distilling step-by-step, a simple new mechanism for training smaller task-specific models that outperform few-shot prompted LLMs while using much less training data than standard fine-tuning or distillation requires. We demonstrate that distilling step-by-step enables a 770M parameter T5 model to outperform the few-shot prompted 540B parameter PaLM model using only 80% of the examples in a benchmark dataset: a more than 700x reduction in model size, achieved with far less training data than standard approaches require.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Large language models (LLMs) have enabled a new data-efficient learning paradigm wherein they can be used to solve new, unseen tasks via zero-shot or few-shot prompting. However, LLMs are challenging to deploy for real-world applications due to their sheer size. 
For instance, serving a single 175 billion parameter LLM requires at least 350GB of GPU memory [\u2026]<\/p>\n","protected":false},"author":359,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1523,1491],"tags":[],"class_list":["post-172641","post","type-post","status-publish","format-standard","hentry","category-computing","category-transportation"],"_links":{"self":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/172641","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/users\/359"}],"replies":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/comments?post=172641"}],"version-history":[{"count":0,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/172641\/revisions"}],"wp:attachment":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/media?parent=172641"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/categories?post=172641"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/tags?post=172641"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}