
Our group develops foundational models for multimodal video understanding, enabling machines to comprehend, reason about, and interact with complex video, audio, and language data. Moving beyond perception, we ask: what spatiotemporal abstractions are needed for AI to truly grasp complex human behaviors over long horizons? Representative projects include TimeSformer, Video ReCap, LLoVi, BIMBA, and VideoTree.
In addition, we deploy these models in high-impact domains:
- Perceptual Agents: Assisting with daily tasks and physical skill learning (e.g., VidAssist, Ego-Exo4D, and ExAct).
- Sports Analytics: Elevating strategic insights using state-of-the-art multimodal video models (e.g., BASKET).
- Generative Video: Enabling multimodal applications such as video-to-music generation and audio-visual editing (e.g., VMAs and AvED).
- Robotics: Translating visual inputs into effective real-world actions (e.g., BOSS, ReBot, and ARCADE).