Mar 13, 2022

Microsoft Improves Transformer Stability to Successfully Scale Extremely Deep Models to 1000 Layers

Posted in category: robotics/AI

A Microsoft Research team has introduced a “simple yet effective” method that dramatically improves training stability in transformer models with a change of just a few lines of code.

Large-scale transformers have achieved state-of-the-art performance on a wide range of natural language processing (NLP) tasks, and in recent years have also demonstrated impressive few-shot and zero-shot learning capabilities, making them a popular architectural choice for machine learning researchers. However, despite soaring parameter counts that now reach billions and even trillions, the layer depth of transformers remains limited by training instability.

In their new paper DeepNet: Scaling Transformers to 1,000 Layers, the Microsoft team proposes DeepNorm, a novel normalization function that improves the stability of transformers to enable scaling that is an order of magnitude deeper (more than 1,000 layers) than previous deep transformers.
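For readers curious what such a small code change looks like, below is a minimal PyTorch-style sketch of a DeepNorm residual block in the spirit of the paper: the residual connection is up-weighted by a constant alpha before post-layer normalization, and the sub-layer weights are down-scaled by beta at initialization. The alpha and beta constants shown (for an N-layer encoder-only model) and the class and parameter names are illustrative assumptions, not code from the Microsoft release.

```python
import torch
import torch.nn as nn


class DeepNormBlock(nn.Module):
    """One post-LN transformer sub-layer wrapped in DeepNorm-style scaling.

    DeepNorm(x) = LayerNorm(alpha * x + G(x)), where G is the sub-layer
    (self-attention or feed-forward) and alpha > 1 strengthens the residual path.
    """

    def __init__(self, d_model: int, sublayer: nn.Module, num_layers: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        # Encoder-only constants reported in the DeepNet paper (assumed here):
        # alpha = (2N)^(1/4), beta = (8N)^(-1/4) for an N-layer model.
        self.alpha = (2 * num_layers) ** 0.25
        beta = (8 * num_layers) ** -0.25
        # Down-scale the sub-layer's weight matrices at initialization.
        # (The paper scales only the feed-forward, value, and output
        # projections; all matrix weights are scaled here for brevity.)
        for p in self.sublayer.parameters():
            if p.dim() > 1:
                nn.init.xavier_normal_(p, gain=beta)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.alpha * x + self.sublayer(x))


# Example: wrap a feed-forward sub-layer for a hypothetical 1,000-layer model.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
block = DeepNormBlock(d_model=512, sublayer=ffn, num_layers=1000)
out = block(torch.randn(2, 16, 512))  # (batch, seq_len, d_model)
```

The intuition behind the design is that amplifying the residual branch while shrinking the sub-layer's initial weights bounds how much each layer can perturb the model's output early in training, which is what allows depth to scale past 1,000 layers without divergence.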
