
Our group develops foundational models for multimodal video understanding, enabling machines to comprehend, reason about, and interact with complex video, audio, and language data. Moving beyond perception, we ask: what spatiotemporal abstractions are needed for AI to truly grasp complex human behaviors over long horizons? Representative projects include TimeSformer, Video ReCap, LLoVi, BIMBA, and VideoTree.
Beyond core model design, we deploy these models in several high-impact domains:
- Perceptual Assistants & Coaches: Assisting with daily tasks and physical skill coaching (e.g., VidAssist, Ego-Exo4D, and ExAct).
- AI and Sports: Elevating strategic insights using state-of-the-art multimodal video models (e.g., SVI-Bench, BASKET).
- Generative Video Applications: Enabling applications such as video-to-music generation, audio-visual editing, and temporally-consistent video generation (e.g., V2M-Zero, VMAs, AvED, and TeDiO).
- Robotics: Translating visual inputs into effective real-world actions (e.g., WatchAct, BOSS, ReBot, and ARCADE).
Group Photos





























