Sep 19, 2023
Retentive Network: A Successor to Transformer for Large Language Models (Paper Explained)
Posted by Dan Breeden in category: computing
Retention is an alternative to Attention in Transformers that can be written in both a parallel and a recurrent fashion. This means the architecture achieves training parallelism while maintaining low-cost inference. Experiments in the paper look very promising.
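To give a rough feel for this dual form, here is a minimal NumPy sketch (not taken from the paper or the video): the dimensions, projection matrices, and single decay factor are illustrative assumptions, and the paper's rotation of queries and keys is omitted. It shows that a decay-masked parallel computation and a step-by-step recurrent state update produce the same outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4          # sequence length, head dimension (illustrative)
gamma = 0.9          # per-head decay factor (illustrative)

X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Parallel form: (Q K^T * D) V with a causal decay mask D[n, m] = gamma^(n-m) for n >= m
n, m = np.arange(T)[:, None], np.arange(T)[None, :]
D = np.where(n >= m, gamma ** (n - m), 0.0)
out_parallel = (Q @ K.T * D) @ V

# Recurrent form: carry a d x d state S, updated each step with the outer product of k_t and v_t
S = np.zeros((d, d))
out_recurrent = np.zeros_like(out_parallel)
for t in range(T):
    S = gamma * S + np.outer(K[t], V[t])
    out_recurrent[t] = Q[t] @ S

print(np.allclose(out_parallel, out_recurrent))  # True: both forms give the same result
```

The parallel form is what makes training efficient on accelerators, while the recurrent form lets inference run with a fixed-size state instead of a growing key-value cache.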
OUTLINE:
0:00 — Intro.
2:40 — The impossible triangle.
6:55 — Parallel vs sequential.
15:35 — Retention mechanism.
21:00 — Chunkwise and multi-scale retention.
24:10 — Comparison to other architectures.
26:30 — Experimental evaluation.