The original 2017 transformer model was designed for natural language processing (NLP), where it achieved state-of-the-art (SOTA) results. Its performance intrigued machine learning researchers, who have since successfully adapted the attention-based architecture to perception tasks in other modalities, such as image, video and audio classification. While transformers have shown their power and potential in these areas, achieving SOTA performance still requires training a separate model for each task. Producing a single transformer model that can process multiple modalities and datasets while sharing its learnable parameters has thus emerged as an attractive research direction.
To this end, a team from Google Research, the University of Cambridge and the Alan Turing Institute has proposed PolyViT, a single transformer architecture co-trained on image, audio and video that is parameter-efficient and learns representations that generalize across multiple domains.
The PolyViT design is motivated by the idea that human perception is inherently multimodal, and by previous studies demonstrating transformers’ ability to operate on any modality that can be tokenized. PolyViT shares a single transformer encoder across different tasks and modalities, enabling up to a linear reduction in parameters with the number of tasks.
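The shared-encoder idea can be pictured with a short sketch: a single transformer encoder whose weights are reused for every task, combined with small per-modality tokenizers and per-task classification heads, so that adding a task only adds a lightweight head rather than a full model. The PyTorch code below is a minimal illustration of this structure, not the authors' implementation; the class name, patch sizes, hidden dimensions and task/class counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedEncoderMultiTask(nn.Module):
    """Minimal sketch of a PolyViT-style model: one shared transformer
    encoder, plus per-modality tokenizers and per-task classifier heads.
    All names and sizes here are illustrative, not from the paper's code."""

    def __init__(self, d_model=768, nhead=12, num_layers=12, task_classes=None):
        super().__init__()
        # Shared encoder: its parameters are reused by every task and modality.
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

        # Modality-specific tokenizers: each maps flattened input patches
        # (image patch, audio-spectrogram patch, video tubelet) to d_model tokens.
        self.tokenizers = nn.ModuleDict({
            "image": nn.Linear(16 * 16 * 3, d_model),
            "audio": nn.Linear(16 * 16, d_model),
            "video": nn.Linear(2 * 16 * 16 * 3, d_model),
        })

        # One small classification head per task; only these (and the tokenizers)
        # grow as tasks are added, which is where the parameter saving comes from.
        task_classes = task_classes or {"imagenet": 1000, "audioset": 527, "kinetics": 400}
        self.heads = nn.ModuleDict(
            {task: nn.Linear(d_model, n) for task, n in task_classes.items()}
        )

    def forward(self, patches, modality, task):
        tokens = self.tokenizers[modality](patches)   # (batch, seq, d_model)
        encoded = self.encoder(tokens)                # same encoder weights for all tasks
        pooled = encoded.mean(dim=1)                  # simple mean pooling over tokens
        return self.heads[task](pooled)               # task-specific logits
```

A forward pass then selects the tokenizer and head by name, e.g. `model(patches, modality="image", task="imagenet")`, while every task routes through the same encoder weights; with N tasks, the bulk of the parameters is shared once rather than replicated N times.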