Mar 9, 2024
Enhancing Vision-Language Pre-training with Rich Supervisions
We propose Strongly Supervised pre-training with ScreenShots (S4), a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering.