In this article we’ll use a Q-Former, a technique for bridging computer vision and natural language models, to create a visual question answering system. We’ll go over the necessary theory, following the BLIP-2 paper, then implement a system which can be used to talk with a large language model about an image.

Who is this useful for? Data scientists interested in computer vision, natural language processing, and multimodal modeling.

How advanced is this post? Intermediate. You might struggle if you don’t have some experience in both computer vision and natural language processing.

🚨 Heads up! A new malware strain, ZenRAT, is posing as Bitwarden password manager installation packages.

Make sure to download software from trusted sources only.


⚠️ Beware of ZenRAT! This new modular malware strain targets Windows users through trojanized Bitwarden installers.

Of the three great Stoics, Seneca always interested me the least. A playwright and professional philosopher, he seemed unlikely to be acquainted with the more mundane forms of suffering that beset humanity. This made him seem unfit to expound upon a philosophy concerned with right conduct under challenging circumstances.

Marcus Aurelius, ruler of an enormous empire, spent most of his reign embroiled in wars he had no desire to fight. Epictetus, a Greek who endured the hardships of slavery, also embodied the Stoic ideal. Cries of “ad hominem!” aside, it is hard to dispute that our experiences shape our outlooks on living. Biographical criticisms can be flimsy, but the central argument in De Brevitate Vitae — an otherwise inspirational classic — was exceptionally naive for the 1st century.

Although Seneca understood intrigue and exile firsthand, he was not privy to the time-sapping vicissitudes of householding or holding down a job.

For a long time, each ML model operated in one data mode – text (translation, language modeling), image (object detection, image classification), or audio (speech recognition).

However, natural intelligence is not limited to just a single modality. Humans can read and write text. We can see images and watch videos. We listen to music to relax and watch out for strange noises to detect danger. Being able to work with multimodal data is essential for us or any AI to operate in the real world.

OpenAI noted in their GPT-4V system card that “incorporating additional modalities (such as image inputs) into LLMs is viewed by some as a key frontier in AI research and development.”

Every year, consumers discard or hold on to disused electronic goods containing almost $10 billion worth of raw materials critical for the green energy transition, the United Nations said on Thursday.

Toys, cables, tools, electric toothbrushes, shavers, headphones and other domestic gadgets contain metals like lithium, gold, silver and copper.

Demand is expected to soar for these materials due to their crucial role in rapidly growing green industries such as electric vehicle battery production.