Say a person takes their French Bulldog, Bowser, to the dog park. Identifying Bowser as he plays among the other canines is easy for the dog owner to do while onsite.
But if someone wants to use a generative AI model like GPT-5 to monitor their pet while they are at work, the model could fail at this basic task. Vision-language models like GPT-5 often excel at recognizing general objects, like a dog, but they perform poorly at locating personalized objects, like Bowser the French Bulldog.
To address this shortcoming, researchers from MIT and the MIT-IBM Watson AI Lab have introduced a new training method that teaches vision-language models to localize personalized objects in a scene.