Jun 25, 2021

Google Trains Two Billion Parameter AI Vision Model

Posted in categories: robotics/AI, transportation

Researchers at Google Brain announced a deep-learning computer vision (CV) model containing two billion parameters. The model was trained on three billion images and achieved 90.45% top-1 accuracy on ImageNet, setting a new state-of-the-art record.

The team described the model and experiments in a paper published on arXiv. The model, dubbed ViT-G/14, is based on Google’s recent work on Vision Transformers (ViT). ViT-G/14 outperformed previous state-of-the-art solutions on several benchmarks, including ImageNet, ImageNet-v2, and VTAB-1k. On the few-shot image recognition task, the accuracy improvement was more than five percentage points. The researchers also trained several smaller versions of the model to investigate a scaling law for the architecture, noting that performance follows a power-law function, similar to Transformer models used for natural language processing (NLP) tasks.

First described by Google researchers in 2017, the Transformer architecture has become the leading design for NLP deep-learning models, with OpenAI’s GPT-3 being one of the most famous. Last year, OpenAI published a paper describing scaling laws for these models. By training many similar models of different sizes and varying the amount of training data and computing power, OpenAI determined a power-law function for estimating a model’s accuracy. In addition, OpenAI found that not only do large models perform better, they are also more compute-efficient.
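The power-law relationship described above implies that error rate is roughly proportional to a negative power of scale (parameters, data, or compute), which becomes a straight line in log-log space. The sketch below illustrates the general idea with made-up numbers; the coefficients and data points are purely hypothetical and are not taken from either paper.

```python
import numpy as np

# Hypothetical scaling-law sketch: assume error falls as a power law in compute,
#     error = a * compute^(-b)
# The data points below are synthetic, generated from assumed values
# a = 0.5, b = 0.1 purely for illustration.
compute = np.array([1e1, 1e2, 1e3, 1e4, 1e5])
error = 0.5 * compute ** -0.1

# A power law is linear in log-log space:
#     log(error) = log(a) - b * log(compute)
# so an ordinary least-squares line fit recovers the exponent.
slope, intercept = np.polyfit(np.log(compute), np.log(error), 1)
b_hat = -slope          # estimated scaling exponent
a_hat = np.exp(intercept)  # estimated coefficient
print(f"fitted a = {a_hat:.3f}, b = {b_hat:.3f}")
```

Fitting in log-log space is the standard way such scaling curves are estimated: once the exponent is known, the line can be extrapolated to predict how much additional scale a target accuracy would require.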
