{"id":178053,"date":"2023-12-11T15:37:29","date_gmt":"2023-12-11T21:37:29","guid":{"rendered":"https:\/\/lifeboat.com\/blog\/2023\/12\/introducing-gigagpt-gpt-3-sized-models-in-565-lines-of-code"},"modified":"2023-12-11T15:37:29","modified_gmt":"2023-12-11T21:37:29","slug":"introducing-gigagpt-gpt-3-sized-models-in-565-lines-of-code","status":"publish","type":"post","link":"https:\/\/lifeboat.com\/blog\/2023\/12\/introducing-gigagpt-gpt-3-sized-models-in-565-lines-of-code","title":{"rendered":"Introducing gigaGPT: GPT-3 sized models in 565 lines of code"},"content":{"rendered":"<p><a class=\"aligncenter blog-photo\" href=\"https:\/\/lifeboat.com\/blog.images\/introducing-gigagpt-gpt-3-sized-models-in-565-lines-of-code.jpg\"><\/a><\/p>\n<p>Cerebras introduces gigaGPT: GPT-3 sized models in 565 lines of code.<\/p>\n<hr>\n<p>GigaGPT is Cerebras\u2019 implementation of Andrei Karpathy\u2019s nanoGPT \u2013 the simplest and most compact code base to train and fine-tune GPT models. Whereas nanoGPT can train models in the 100M parameter range, gigaGPT trains models well over 100B parameters. We do this without introducing additional code or relying on third party frameworks \u2013 the entire <a href=\"https:\/\/github.com\/Cerebras\/gigaGPT\" target=\"_blank\" rel=\"noopener\">repo<\/a> is just 565 lines of code. Instead gigaGPT utilizes the large memory and compute capacity of Cerebras hardware to enable large scale training on vanilla torch.nn code. With no modifications, gigaGPT supports long context lengths and works with a variety of optimizers.<\/p>\n<p><b>Why gigaGPT<\/b><\/p>\n<p>While the transformer architecture is simple, training a large transformer on a large number of GPUs is hard. Beyond a few billion parameters, vanilla GPT models run out of memory on even the latest GPUs. Training larger models requires breaking up models into smaller pieces, distributing them to multiple GPUs, coordinating the workload among the workers, and assembling the results. This is typically done via LLM scaling frameworks such as Megatron, DeepSpeed, NeoX, Fairscale, and Mosaic Foundry. Though powerful, these frameworks introduce significant complexity.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Cerebras introduces gigaGPT: GPT-3 sized models in 565 lines of code. GigaGPT is Cerebras\u2019 implementation of Andrei Karpathy\u2019s nanoGPT \u2013 the simplest and most compact code base to train and fine-tune GPT models. Whereas nanoGPT can train models in the 100M parameter range, gigaGPT trains models well over 100B parameters. We do this without introducing [\u2026]<\/p>\n","protected":false},"author":709,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1523,1491],"tags":[],"class_list":["post-178053","post","type-post","status-publish","format-standard","hentry","category-computing","category-transportation"],"_links":{"self":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/178053","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/users\/709"}],"replies":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/comments?post=178053"}],"version-history":[{"count":0,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/178053\/revisions"}],"wp:attachment":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/media?parent=178053"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/categories?post=178053"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/tags?post=178053"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}