
Our group develops foundational models for multimodal video understanding, enabling machines to comprehend, reason about, and interact with complex video, audio, and language data. Moving beyond perception, we ask: what spatiotemporal abstractions are needed for AI to truly grasp complex human behaviors over long horizons? Representative projects include TimeSformer, Video ReCap, LLoVi, BIMBA, and VideoTree.


In addition, we deploy these models in high-impact domains:

  • Perceptual Agents: Assisting with daily tasks and physical skill learning (e.g., VidAssist, Ego-Exo4D, and ExAct).

  • Sports Analytics: Elevating strategic insights using state-of-the-art multimodal video models (e.g., BASKET).

  • Generative Video: Enabling multimodal applications such as video-to-music generation and audio-visual editing (e.g., VMAs and AvED).

  • Robotics: Translating visual inputs into effective real-world actions (e.g., BOSS, ReBot, and ARCADE).

Graduate Students

Undergraduate Students

Alumni

Group Photos

Contact

Prospective Graduate Students: I am recruiting motivated students in computer vision. Please email me your CV and a list of your prior publications.

Undergraduates at UNC: If you are interested in computer vision, especially its applications to sports, email me your CV and a transcript that includes your GPA.
