Large-scale language-based foundation models such as BERT, GPT-3 and CLIP have exhibited impressive capabilities ranging from zero-shot image classification to high-level planning. In most cases, these large language models, visual-language models and audio-language models remain domain-specific and rely highly on the distribution of their training data. The models thus obtain different although complementary common-sense knowledge within specific domains. But what if such models could effectively communicate with one another?
In the new paper Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, Google researchers argue that the diversity of different foundation models is symbiotic and that it is possible to build a framework that uses structured Socratic dialogue between pre-existing foundation models to formulate new multimodal tasks as a guided exchange between the models without additional finetuning.
This work aims at building general language-based foundation models that embrace the diversity of pre-existing language-based foundation models by levering structured Socratic dialogue, and offers insights into the applicability of the proposed Socratic Models on challenging perceptual tasks.