In this article we’ll use a Q-Former (Querying Transformer), a module for bridging computer vision and natural language models, to create a visual question answering system. We’ll go over the necessary theory, following the BLIP-2 paper, then implement a system that lets you talk with a large language model about an image.
Who is this useful for? Data scientists interested in computer vision, natural language processing, and multimodal modeling.
How advanced is this post? Intermediate. You might struggle without some experience in both computer vision and natural language processing.