{"id":138785,"date":"2022-04-29T01:03:14","date_gmt":"2022-04-29T06:03:14","guid":{"rendered":"https:\/\/lifeboat.com\/blog\/2022\/04\/tackling-multiple-tasks-with-a-single-visual-language-model"},"modified":"2022-04-29T01:03:14","modified_gmt":"2022-04-29T06:03:14","slug":"tackling-multiple-tasks-with-a-single-visual-language-model","status":"publish","type":"post","link":"https:\/\/lifeboat.com\/blog\/2022\/04\/tackling-multiple-tasks-with-a-single-visual-language-model","title":{"rendered":"Tackling multiple tasks with a single visual language model"},"content":{"rendered":"<p><a class=\"aligncenter blog-photo\" href=\"https:\/\/lifeboat.com\/blog.images\/tackling-multiple-tasks-with-a-single-visual-language-model2.jpg\"><\/a><\/p>\n<p>One key aspect of intelligence is the ability to quickly learn how to perform a new task when given a brief instruction. For instance, a child may recognise real animals at the zoo after seeing a few pictures of the animals in a book, despite any differences between the two. But for a typical visual model to learn a new task, it must be trained on tens of thousands of examples specifically labelled for that task. If the goal is to count and identify animals in an image, as in \u201cthree zebras\u201d, one would have to collect thousands of images and annotate each image with their quantity and species. This process is inefficient, expensive, and resource-intensive, requiring large amounts of annotated data and the need to train a new model each time it\u2019s confronted with a new task. 
As part of DeepMind\u2019s mission to solve intelligence, we\u2019ve explored whether an alternative model could make this process easier and more efficient, given only limited task-specific information.<\/p>\n<p>Today, in the preprint of our <a href=\"https:\/\/storage.googleapis.com\/deepmind-media\/DeepMind.com\/Blog\/tackling-multiple-tasks-with-a-single-visual-language-model\/flamingo.pdf\" target=\"_blank\">paper<\/a>, we introduce <em>Flamingo<\/em>, a single visual language model (VLM) that sets a new state of the art in few-shot learning on a wide range of open-ended multimodal tasks. This means Flamingo can tackle a number of difficult problems with just a handful of task-specific examples (in a \u201cfew shots\u201d), without any additional training required. Flamingo\u2019s simple interface makes this possible: it takes as input a prompt of interleaved images, videos, and text, and then outputs associated language.<\/p>\n<p>Similar to the behaviour of <a href=\"https:\/\/www.deepmind.com\/blog\/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval\">large language models<\/a> (LLMs), which can address a language task by processing examples of the task in their text prompt, Flamingo\u2019s visual and text interface can steer the model towards solving a multimodal task. Given a few example pairs of visual inputs and expected text responses composed in Flamingo\u2019s prompt, the model can be asked a question about a new image or video and then generate an answer.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One key aspect of intelligence is the ability to quickly learn how to perform a new task when given a brief instruction. For instance, a child may recognise real animals at the zoo after seeing a few pictures of the animals in a book, despite any differences between the two. 
But for a typical visual [\u2026]<\/p>\n","protected":false},"author":556,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[],"class_list":["post-138785","post","type-post","status-publish","format-standard","hentry","category-robotics-ai"],"_links":{"self":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/138785","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/users\/556"}],"replies":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/comments?post=138785"}],"version-history":[{"count":0,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/138785\/revisions"}],"wp:attachment":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/media?parent=138785"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/categories?post=138785"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/tags?post=138785"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}