Universal transformer memory optimizes prompts using neural attention memory models (NAMMs), simple neural networks that decide whether to “remember” or “forget” each token stored in the LLM’s memory.
“This new capability allows Transformers to discard unhelpful or redundant details, and focus on the most critical information, something we find to be crucial for tasks requiring long-context reasoning,” the researchers write.
NAMMs are trained separately from the LLM and are combined with the pre-trained model at inference time, which makes them flexible and easy to deploy. However, they need access to the inner activations of the model, which means they can only be applied to open-source models.
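As a rough illustration of the idea, the sketch below shows a toy memory model in PyTorch that scores each entry of a simulated KV cache and keeps only the tokens it decides to “remember.” The `TinyNAMM` class, the placeholder attention features, and the simple thresholding rule are assumptions made for this example; they are not the researchers’ actual NAMM architecture or training procedure.

```python
# Illustrative sketch only: a toy "memory model" that scores each cached token
# and drops low-scoring entries from a simulated KV cache. The class name,
# feature choices, and threshold rule are assumptions, not Sakana AI's NAMM.
import torch
import torch.nn as nn


class TinyNAMM(nn.Module):
    """Scores each cached token from a summary of its attention activity."""

    def __init__(self, feature_dim: int = 8, hidden_dim: int = 16):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, attn_features: torch.Tensor) -> torch.Tensor:
        # attn_features: (num_cached_tokens, feature_dim), e.g. statistics of
        # how strongly each cached token has been attended to so far.
        return self.scorer(attn_features).squeeze(-1)  # one score per token


def prune_kv_cache(namm: TinyNAMM,
                   keys: torch.Tensor,
                   values: torch.Tensor,
                   attn_features: torch.Tensor,
                   threshold: float = 0.0):
    """Keep only the cached tokens the memory model decides to 'remember'."""
    with torch.no_grad():
        scores = namm(attn_features)   # (num_cached_tokens,)
        keep = scores > threshold      # boolean remember/forget mask
    return keys[keep], values[keep], keep


if __name__ == "__main__":
    num_tokens, head_dim, feature_dim = 32, 64, 8
    keys = torch.randn(num_tokens, head_dim)
    values = torch.randn(num_tokens, head_dim)
    attn_features = torch.randn(num_tokens, feature_dim)

    namm = TinyNAMM(feature_dim)
    new_keys, new_values, keep = prune_kv_cache(namm, keys, values, attn_features)
    print(f"Kept {keep.sum().item()} of {num_tokens} cached tokens")
```

Because the scoring network sits outside the LLM and only reads attention-derived features, it can, in principle, be trained once and reused with a frozen pre-trained model, which is the property that makes this approach easy to bolt on at inference time.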