In the ongoing effort to scale AI systems without incurring prohibitively high training and compute costs, sparse mixture-of-experts (MoE) models have shown their potential to deliver impressive neural network pretraining speedups by dynamically selecting only the relevant parameters for each input. This enables such networks to vastly expand their parameter counts while keeping their FLOPs per token (compute) roughly constant. Advancing MoE models to state-of-the-art performance has, however, been hindered by training instabilities and uncertain quality during fine-tuning.
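To make the "more parameters, roughly constant compute" idea concrete, here is a minimal sketch of a sparse MoE feed-forward layer in plain Python. It is an illustration under stated assumptions, not the team's implementation: the class `SparseMoELayer`, the helper functions, the layer sizes, and the choice of top-1 routing are all hypothetical and were picked only for brevity.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class SparseMoELayer:
    """Hypothetical top-1 routed MoE feed-forward layer (illustrative only)."""

    def __init__(self, d_model, d_ff, num_experts, seed=0):
        rng = random.Random(seed)
        # Router: one row of weights per expert.
        self.router = make_matrix(num_experts, d_model, rng)
        # Each expert is an independent two-layer feed-forward network.
        self.experts = [
            (make_matrix(d_ff, d_model, rng), make_matrix(d_model, d_ff, rng))
            for _ in range(num_experts)
        ]

    def num_expert_params(self):
        """Total number of weights held across all experts."""
        return sum(
            len(w_in) * len(w_in[0]) + len(w_out) * len(w_out[0])
            for w_in, w_out in self.experts
        )

    def forward(self, token):
        """Route one token to its highest-scoring expert and apply only that expert."""
        probs = softmax(matvec(self.router, token))
        idx = max(range(len(probs)), key=probs.__getitem__)
        w_in, w_out = self.experts[idx]
        hidden = [max(0.0, h) for h in matvec(w_in, token)]  # ReLU
        # Scale the expert output by its router probability.
        output = [probs[idx] * y for y in matvec(w_out, hidden)]
        # Multiply-adds spent inside the selected expert (router cost excluded).
        expert_macs = len(w_in) * len(w_in[0]) + len(w_out) * len(w_out[0])
        return output, idx, expert_macs


token = [0.5, -1.0, 0.25, 2.0]
small = SparseMoELayer(d_model=4, d_ff=8, num_experts=2)
large = SparseMoELayer(d_model=4, d_ff=8, num_experts=16)

out_small, idx_small, macs_small = small.forward(token)
out_large, idx_large, macs_large = large.forward(token)
```

In this toy setup, `large` holds eight times as many expert weights as `small`, yet each token still passes through a single expert, so the per-token expert compute is identical in both layers. The router itself does grow with the number of experts, which is one reason the compute is only roughly, not exactly, constant.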
To address these issues, a research team from Google AI and Google Brain has published a set of guidelines for designing more practical and reliable sparse expert models. The team tested its recommendations by pretraining a 269B-parameter sparse model, which it says is the first sparse model to achieve state-of-the-art results on natural language processing (NLP) benchmarks.
The team summarizes its main contributions as: