Our group develops foundational models for multimodal video understanding, enabling machines to comprehend, reason about, and interact with complex video, audio, and language data. Moving beyond perception, we ask: what spatiotemporal abstractions are needed for AI to truly grasp complex human behaviors over long horizons? Representative projects include TimeSformer, Video ReCap, LLoVi, BIMBA, and VideoTree.

 

Beyond core model design, we deploy these models in several high-impact domains:

  • Perceptual Assistants & Coaches: Assisting with daily tasks and physical skill coaching (e.g., VidAssist, Ego-Exo4D, and ExAct).

  • AI and Sports: Elevating strategic insights using state-of-the-art multimodal video models (e.g., SVI-Bench and BASKET).

  • Generative Video Applications: Enabling applications such as video-to-music generation, audio-visual editing, and temporally consistent video generation (e.g., V2M-Zero, VMAs, AvED, and TeDiO).

  • Robotics: Translating visual inputs into effective real-world actions (e.g., WatchAct, BOSS, ReBot, and ARCADE).

Graduate Students

Undergraduate Students

Alumni

Group Photos

Contact

Prospective Graduate Students: I am recruiting motivated students in computer vision. Please email me a list of your prior publications and your CV.

Undergraduates at UNC: If you are interested in computer vision, especially its applications to sports, email me your CV and transcript with your GPA.

©2024 by Gedas Bertasius
