
Jun 24, 2017

MIT and Google researchers have made AI that can link sound, sight, and text to understand the world

Posted by in categories: information science, robotics/AI

If we ever want future robots to do our bidding, they’ll have to understand the world around them in a complete way—if a robot hears a barking noise, what’s making it? What does a dog look like, and what do dogs need?

AI research has typically treated the ability to recognize images, identify noises, and understand text as three different problems, and built algorithms suited to each individual task. Imagine if you could only use one sense at a time, and couldn’t match anything you heard to anything you saw. That’s AI today, and part of the reason why we’re so far from creating an algorithm that can learn like a human. But two new papers from MIT and Google explain first steps for making AI see, hear, and read in a holistic way—an approach that could upend how we teach our machines about the world.

“It doesn’t matter if you see a car or hear an engine, you instantly recognize the same concept. The information in our brain is aligned naturally,” says Yusuf Aytar, a postdoctoral AI researcher at MIT who co-authored the paper.
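The idea behind that quote is a shared, aligned representation: an image of a car and the sound of an engine should land near each other in the same embedding space. Below is a minimal, hypothetical sketch of that concept (not code from the MIT or Google papers): two toy "encoders" project image and audio features into a shared space, and a contrastive-style loss would push true image/audio pairs closer together than mismatched ones. All names and dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings so dot products act as cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

# Hypothetical "encoders": linear projections from each modality's
# feature space into one shared embedding space.
D_IMG, D_AUD, D_SHARED = 512, 128, 64
W_img = rng.normal(scale=0.02, size=(D_IMG, D_SHARED))
W_aud = rng.normal(scale=0.02, size=(D_AUD, D_SHARED))

def embed_image(feats):
    return l2_normalize(feats @ W_img)

def embed_audio(feats):
    return l2_normalize(feats @ W_aud)

# Toy batch of paired image/audio features for the same underlying scenes
# (e.g. a photo of a car paired with the sound of its engine).
batch = 8
img_feats = rng.normal(size=(batch, D_IMG))
aud_feats = rng.normal(size=(batch, D_AUD))

img_emb = embed_image(img_feats)
aud_emb = embed_audio(aud_feats)

# Similarity matrix between every image and every audio clip in the batch.
# Training would maximize the diagonal (true pairs) relative to the rest,
# aligning the two modalities in the shared space.
sim = img_emb @ aud_emb.T
logits = sim / 0.07  # temperature scaling
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))
print(f"alignment loss with untrained projections: {loss:.3f}")
```

Once such a space is learned, recognizing "car" from either a photo or an engine sound reduces to a nearest-neighbor lookup in the shared embedding space, which is what makes the cross-modal transfer possible.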
