{"id":176924,"date":"2023-11-27T23:23:35","date_gmt":"2023-11-28T05:23:35","guid":{"rendered":"https:\/\/lifeboat.com\/blog\/2023\/11\/streamingllm-gives-language-models-unlimited-context"},"modified":"2023-11-27T23:23:35","modified_gmt":"2023-11-28T05:23:35","slug":"streamingllm-gives-language-models-unlimited-context","status":"publish","type":"post","link":"https:\/\/lifeboat.com\/blog\/2023\/11\/streamingllm-gives-language-models-unlimited-context","title":{"rendered":"StreamingLLM gives language models unlimited context"},"content":{"rendered":"<p><a class=\"aligncenter blog-photo\" href=\"https:\/\/lifeboat.com\/blog.images\/streamingllm-gives-language-models-unlimited-context2.jpg\"><\/a><\/p>\n<p>StreamingLLM is an innovative framework that allows large language models to handle text of infinite length without the need for finetuning. This technique preserves attention sinks to maintain a near-normal attention score distribution. When the sequence of the conversation with the LLM surpasses the model\u2019s context length, retains the KV cache for the attention sink tokens\u2014four initial tokens are sufficient\u2014and discards subsequent tokens to make room for the sliding window tokens. This approach enables the model to extend its context and stabilize its performance without having to recompute the entire KV values.<\/p>\n<p>\u201cThe introduction of four initial tokens, as attention sinks, suffices to restore the LLM\u2019s performance,\u201d the researchers write. \u201cIn contrast, adding just one or two doesn\u2019t achieve full recovery. We believe this pattern emerges because these models didn\u2019t include a consistent starting token across all input samples during pre-training.\u201d<\/p>\n<p>Under the framework, the KV cache comprises the attention sinks and the rolling KV cache that retains the most recent tokens vital for language modeling. The researchers emphasize the versatility of, stating, design is versatile and can be seamlessly incorporated into any autoregressive language model that employs relative positional encoding.\u201d<\/p>\n","protected":false},"excerpt":{"rendered":"<p>StreamingLLM is an innovative framework that allows large language models to handle text of infinite length without the need for finetuning. This technique preserves attention sinks to maintain a near-normal attention score distribution. 