StreamingLLM is a framework that allows large language models to handle text of effectively unbounded length without finetuning. The technique preserves "attention sinks" to keep the attention score distribution close to normal. When a conversation grows beyond the model's context length, StreamingLLM retains the KV cache entries for the attention sink tokens—four initial tokens are sufficient—and discards the tokens that follow them to make room for the sliding window of recent tokens. This approach lets the model extend its context and stabilize its performance without recomputing the entire KV cache.
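The eviction policy described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the function name and parameters are hypothetical, and a real KV cache holds per-layer key/value tensors rather than token ids.

```python
def evict(kv_cache, num_sinks=4, window=4):
    """Hypothetical sketch of StreamingLLM-style eviction:
    keep the first `num_sinks` entries (the attention sinks) plus
    the most recent `window` entries, dropping everything between."""
    if len(kv_cache) <= num_sinks + window:
        return kv_cache  # nothing to evict yet
    return kv_cache[:num_sinks] + kv_cache[-window:]

# Token ids 0..9 with 4 sinks and a window of 4:
# the middle tokens 4 and 5 are evicted.
cache = list(range(10))
print(evict(cache))  # [0, 1, 2, 3, 6, 7, 8, 9]
```

In practice the window would be far larger (on the order of the model's context length minus the sink count); the small numbers here are only to make the eviction visible.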
“The introduction of four initial tokens, as attention sinks, suffices to restore the LLM’s performance,” the researchers write. “In contrast, adding just one or two doesn’t achieve full recovery. We believe this pattern emerges because these models didn’t include a consistent starting token across all input samples during pre-training.”
Under the framework, the KV cache comprises the attention sinks and a rolling KV cache that retains the most recent tokens vital for language modeling. The researchers emphasize the versatility of the approach, stating that the design "can be seamlessly incorporated into any autoregressive language model that employs relative positional encoding."
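The dependence on relative positional encoding matters because, per the paper, positions are assigned by a token's slot within the cache rather than by its original index in the text, so relative distances never exceed the cache size. A minimal sketch of that reindexing (the function name is hypothetical):

```python
def cache_positions(kept_token_ids):
    """Assign positions by slot in the cache, not by each token's
    original index in the text, so relative distances stay in-range
    for encodings like RoPE or ALiBi."""
    return list(range(len(kept_token_ids)))

# Sinks (original indices 0-3) plus a recent window (indices 6-9)
kept = [0, 1, 2, 3, 6, 7, 8, 9]
print(cache_positions(kept))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Token 6 is thus treated as position 4, immediately adjacent to the sinks, even though tokens 4 and 5 once sat between them.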