While text-based AI models have been found coordinating among themselves and even developing a language of their own, communication between image-based models had remained unexplored territory until now. A group of researchers set out to find how well Google DeepMind’s Flamingo and OpenAI’s DALL-E understand each other, and the synergy they report is impressive.
Although image captioning and text-to-image generation are closely related tasks, they are often studied in isolation from each other, so the exchange of information between such models has remained an open question. Researchers from LMU Munich, Siemens AG, and the University of Oxford wrote a paper titled ‘Do Flamingo and DALL-E Understand Each Other?’, investigating the communication between image captioning and text-to-image models.
The team proposes a reconstruction task in which Flamingo generates a description for a given image and DALL-E uses this description as input to synthesise a new image. They argue that the models understand each other if the generated image is similar to the original one. Specifically, they studied the relationship between the quality of the image reconstruction and that of the text generation, and found that the better caption is the one that leads to the better reconstruction, and vice versa.
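
The reconstruction task described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the `captioner`, `generator`, and `embedder` arguments are hypothetical placeholders for a captioning model, a text-to-image model, and an image feature extractor, and cosine similarity is assumed here as the similarity measure. The toy stand-ins at the bottom treat an "image" as a short list of numbers so the example runs without any model.

```python
import math
from typing import Callable, List, Tuple

Image = List[float]  # toy stand-in: an "image" is just a feature vector


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def reconstruction_score(
    image: Image,
    captioner: Callable[[Image], str],          # hypothetical image -> text model
    generator: Callable[[str], Image],          # hypothetical text -> image model
    embedder: Callable[[Image], List[float]],   # hypothetical image feature extractor
) -> Tuple[str, float]:
    """Caption the image, regenerate an image from the caption, compare the two."""
    caption = captioner(image)
    reconstructed = generator(caption)
    score = cosine_similarity(embedder(image), embedder(reconstructed))
    return caption, score


# Toy stand-ins (assumptions for illustration only, not real models).
def detailed_captioner(image: Image) -> str:
    # Keeps two decimals per value, so little information is lost.
    return " ".join(f"{v:.2f}" for v in image)


def vague_captioner(image: Image) -> str:
    # Rounds every value to an integer, discarding most of the detail.
    return " ".join(str(round(v)) for v in image)


def toy_generator(caption: str) -> Image:
    # "Generates" an image by parsing the numbers back out of the caption.
    return [float(token) for token in caption.split()]


def identity_embedder(image: Image) -> List[float]:
    return image


original = [0.9, 0.1, 0.4]
_, detailed_score = reconstruction_score(
    original, detailed_captioner, toy_generator, identity_embedder
)
_, vague_score = reconstruction_score(
    original, vague_captioner, toy_generator, identity_embedder
)
print(detailed_score > vague_score)
```

In this toy setup the more detailed caption yields the more faithful reconstruction, which mirrors the relationship the researchers report between caption quality and reconstruction quality.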









