Video Recognition

Video now plays an enormous role in daily life, with an estimated 3.1 billion people watching videos on the Internet every day. Our group aims to develop new spatiotemporal models and representations for efficient and effective video analysis.
Related Publications:
Long Movie Clip Classification with State-Space Video Models
Md Mohaiminul Islam, Gedas Bertasius
ECCV 2022
TALLFormer: Temporal Action Localization with a Long-memory Transformer
Feng Cheng, Gedas Bertasius
ECCV 2022
Long-Short Temporal Contrastive Learning of Video Transformers
Jue Wang, Gedas Bertasius, Du Tran, Lorenzo Torresani
CVPR 2022
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius, Heng Wang, Lorenzo Torresani
ICML 2021
[arxiv] [code] [talk] [slides] [blog] [VentureBeat] [SiliconAngle] [bibtex]
Multimodal Learning

Humans understand the world by processing signals from different modalities (e.g., speech, sound, vision). Similarly, we aim to equip computational video models with multimodal processing capabilities to understand visual content, audio, speech, and other modalities.
Related Publications:
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Gedas Bertasius, Lorenzo Torresani
CVPR 2025 (1st Place Winner at CVPR 2025 Ego4D EgoSchema Challenge)
[arxiv] [project page] [code] [model] [demo] [bibtex]
Video ReCap: Recursive Captioning of Hour-Long Videos
Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius
CVPR 2024 (Egocentric Vision Distinguished Paper Award)
[arxiv] [project website] [code] [dataset] [bibtex]
A Simple LLM Framework for Long-Range Video Question-Answering
Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius
EMNLP 2024
VindLU: A Recipe for Effective Video-and-Language Pretraining
Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius
CVPR 2023
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, Gedas Bertasius
CVPR 2023
[arxiv] [code] [project page] [bibtex]
Virtual AI Assistants and Coaches

Our group aims to develop video-based AI models that can assist people with everyday tasks. Our work in this area includes modeling human behavior from first-person videos, assisting people with procedural action planning, and understanding human skills from video.
Related Publications:
ExAct: A Video-Language Benchmark for Expert Action Analysis
Han Yi, Yulu Pan, Feihong He, Xinyu Liu, Benjamin Zhang, Oluwatumininu Oguntola, Gedas Bertasius
NeurIPS Datasets and Benchmarks Track 2025
[arxiv] [project page] [code] [dataset] [leaderboard] [bibtex]
Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos
Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Fu-Jen Chu, Kris Kitani, Gedas Bertasius, Xitong Yang
ECCV 2024 (Oral)
[arxiv] [project page] [bibtex]
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Gedas Bertasius, ... , Michael Wray
CVPR 2024
Learning To Recognize Procedural Activities with Distant Supervision
Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani
CVPR 2022
[arxiv] [code] [project page] [bibtex]
CV for Sports

The rapid growth of video broadcasting has made sports one of the most widely watched forms of television. Many sports are competitive and goal-oriented, demanding exceptional physical and technical skill as well as strategic thinking. As a former basketball player, I am passionate about applying state-of-the-art computer vision models to sports videos.
Related Publications:
BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation
Yulu Pan, Ce Zhang, Gedas Bertasius
CVPR 2025
[arxiv] [project page] [code] [data] [bibtex]
Egocentric Basketball Motion Planning from a Single First-Person Image
Gedas Bertasius, Aaron Chan and Jianbo Shi
CVPR 2018
[arxiv] [results] [MIT SSAC Poster] [bibtex]
Am I a Baller? Basketball Performance Assessment from First-Person Videos
Gedas Bertasius, Stella X. Yu, Hyun Soo Park and Jianbo Shi
ICCV 2017
[arxiv] [results] [bibtex]
Video for Robotics

The ultimate measure of an AI agent's intelligence is its ability to translate what it sees into effective actions in the real world. We aim to bridge the gap from passive observation to active control by developing methods for learning embodiment-agnostic, latent action representations from video, including human demonstrations on the web. These latent action representations can then be transferred for real-world control.
Related Publications:
ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis
Yu Fang, Yue Yang, Xinghao Zhu, Kaiyuan Zheng, Gedas Bertasius, Daniel Szafir, Mingyu Ding
IROS 2025
[arxiv] [project page] [code] [bibtex]
BOSS: Benchmark for Observation Space Shift in Long-Horizon Task
Yue Yang, Linfeng Zhao, Mingyu Ding, Gedas Bertasius, Daniel Szafir
Robotics and Automation Letters (RA-L) 2025
[arxiv] [bibtex]
Augmented Reality Demonstrations for Scalable Robot Imitation Learning
Yue Yang, Bryce Ikeda, Gedas Bertasius, Daniel Szafir
IROS 2024
[arxiv] [bibtex]
Generative Video Modeling

The emergence of powerful generative AI models has fueled a wide range of creative applications in the image and video domains. Following this direction, our recent work has explored the design of generative video models for diverse multimodal applications, including video-to-music generation, audio-visual editing, and third-to-first-person video translation.
Related Publications:
Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
Yan-Bo Lin, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, Xiaofei Wang, Gedas Bertasius, Lijuan Wang
arXiv
[arxiv] [project page] [code] [bibtex]
VMAs: Video-to-Music Generation via Semantic Alignment in Web Music Videos
Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, Heng Wang
WACV 2025 (Oral)
[arxiv] [project page] [code] [bibtex]
4Diff: 3D-Aware Diffusion Model for Third-to-First Viewpoint Translation
Feng Cheng*, Mi Luo*, Huiyu Wang, Alex Dimakis, Lorenzo Torresani, Gedas Bertasius, Kristen Grauman
ECCV 2024
[arxiv] [bibtex]