Google, NYU & Maryland U’s Token-Dropping Approach Reduces BERT Pretraining Time by 25%

Pretraining BERT-type large language models, which can scale up to billions of parameters, is crucial for obtaining state-of-the-art performance on many natural language processing (NLP) tasks. This pretraining process, however, is expensive and has become a bottleneck hindering the industrial application of such models.

In the new paper Token Dropping for Efficient BERT Pretraining, a research team from Google, New York University, and the University of Maryland proposes a simple but effective “token dropping” technique that significantly reduces the pretraining cost of transformer models such as BERT (a 25 percent cut in BERT pretraining time), without degrading performance on downstream fine-tuning tasks.

The team summarizes their main contributions as:
