
Dec 6, 2023

Introducing Ego-Exo4D: A foundational dataset for research on video learning and multimodal perception

Posted in categories: augmented reality, media & arts, robotics/AI, transportation

Watch this video on Facebook: https://www.facebook.com/share/v/NNeZinMSuGPtQDXL/?mibextid=i5uVoL


Working together as a consortium, FAIR and its university partners captured these egocentric and exocentric perspectives with the help of more than 800 skilled participants in the United States, Japan, Colombia, Singapore, India, and Canada. In December, the consortium will open source the data (including more than 1,400 hours of video) and annotations for novel benchmark tasks. Additional details about the datasets can be found in our technical paper. Next year, we plan to host a first public benchmark challenge and release baseline models for ego-exo understanding.

Each university partner followed its own formal review processes to establish the standards for collection, management, informed consent, and a license agreement prescribing proper use. Each member also followed the Project Aria Community Research Guidelines. With this release, we aim to provide the tools the broader research community needs to explore ego-exo video, multimodal activity recognition, and beyond.

How Ego-Exo4D works.

Ego-Exo4D focuses on skilled human activities, such as playing sports or music, cooking, dancing, and repairing bikes. Advances in AI understanding of human skill in video could facilitate many applications. For example, in future augmented reality (AR) systems, a person wearing smart glasses could quickly pick up new skills with a virtual AI coach that guides them through a how-to video; in robot learning, a robot watching people in its environment could acquire new dexterous manipulation skills with less physical experience; in social networks, new communities could form based on how people share their expertise and complementary skills in video.
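To make the ego-exo idea concrete, here is a minimal, hypothetical sketch of stepping through time-synchronized egocentric and exocentric video streams and handing each synchronized frame bundle to an activity-recognition model. The file names, the assumption that all streams share a start time and frame rate, and the recognize_activity stub are illustrative assumptions, not the actual Ego-Exo4D release format or tooling.

# Hypothetical sketch: iterate over time-synchronized ego and exo views.
# File names and recognize_activity() are illustrative assumptions, not the
# actual Ego-Exo4D release format or API.
import cv2

EGO_PATH = "ego_view.mp4"                     # assumed egocentric (wearer) video
EXO_PATHS = ["exo_cam1.mp4", "exo_cam2.mp4"]  # assumed exocentric (external) videos

def recognize_activity(ego_frame, exo_frames):
    """Placeholder for a multimodal activity-recognition model."""
    return "unknown"

def iter_synced_frames(ego_path, exo_paths):
    """Yield (ego_frame, exo_frames), assuming all videos start at the
    same instant and share a frame rate."""
    caps = [cv2.VideoCapture(ego_path)] + [cv2.VideoCapture(p) for p in exo_paths]
    try:
        while True:
            frames = []
            for cap in caps:
                ok, frame = cap.read()
                if not ok:  # stop when any stream runs out of frames
                    return
                frames.append(frame)
            yield frames[0], frames[1:]
    finally:
        for cap in caps:
            cap.release()

if __name__ == "__main__":
    for ego_frame, exo_frames in iter_synced_frames(EGO_PATH, EXO_PATHS):
        print(recognize_activity(ego_frame, exo_frames))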
