Towards Long-Form Video Understanding


  Author    = {Chao-Yuan Wu and Philipp Kr\"{a}henb\"{u}hl},
  Title     = {{Towards Long-Form Video Understanding}},
  Booktitle = {{CVPR}},
  Year      = {2021}}


This repo implements Object Transformers for long-form video understanding.

Getting Started

Please organize data/ as follows:

|_ ava
|_ features
|_ instance_meta
|_ lvu_1.0

ava, features, and instance_meta can be found in this Google Drive folder. lvu_1.0 can be found here.

Please also download the pre-trained weights from this Google Drive folder and put them in pretrained_models/.
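Before launching any of the commands, it can help to confirm that the downloads landed where the layout above expects them. The following is a minimal sketch and not part of this repo: the helper name missing_dirs is hypothetical, and it assumes the script is run from the repo root, with data/ and pretrained_models/ sitting directly under it.

```python
from pathlib import Path

# Directory names are taken from the layout above; the helper is hypothetical.
EXPECTED_DATA_DIRS = ("ava", "features", "instance_meta", "lvu_1.0")


def missing_dirs(root, expected):
    """Return the names in `expected` that are not directories under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumption: the current working directory is the repo root.
    checks = (("data", EXPECTED_DATA_DIRS), (".", ("pretrained_models",)))
    for root, expected in checks:
        missing = missing_dirs(root, expected)
        if missing:
            print(f"Missing under {root}/: {', '.join(missing)}")
        else:
            print(f"{root}/ looks complete.")
```

The check only looks at directory names; it does not verify that the files inside them are complete or uncorrupted.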

Pre-training

python3 -u

This pretrains on a small demo dataset, data/instance_meta/instance_meta_pretrain_demo.pkl, as an example. To pretrain on a larger dataset (e.g., the latest full version of MovieClips), please follow the demo file's format.
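Since the file format is defined by the demo file itself, a quick look at its top-level structure is a reasonable first step before building a larger one. The sketch below assumes the .pkl is a standard Python pickle that loads without any custom classes; that is an assumption, not something this README states. The helpers summarize and describe_pickle are hypothetical and not part of the repo.

```python
import pickle
from pathlib import Path

DEMO_META = Path("data/instance_meta/instance_meta_pretrain_demo.pkl")


def summarize(obj, max_items=3):
    """Return a short description of an object's top-level structure."""
    if isinstance(obj, dict):
        keys = list(obj)[:max_items]
        return f"dict with {len(obj)} keys, e.g. {keys!r}"
    if isinstance(obj, (list, tuple)):
        first = type(obj[0]).__name__ if obj else "n/a"
        return f"{type(obj).__name__} of length {len(obj)}, first element type: {first}"
    return type(obj).__name__


def describe_pickle(path):
    # Only unpickle files you trust: loading a pickle can execute arbitrary code.
    with open(path, "rb") as f:
        return summarize(pickle.load(f))


if __name__ == "__main__":
    if DEMO_META.exists():
        print(describe_pickle(DEMO_META))
    else:
        print(f"{DEMO_META} not found; download the data first.")
```

Running the same summary on a newly built metadata file and comparing it with the demo's output gives a coarse sanity check; it does not validate the nested contents.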

Training and evaluating on AVA v2.2

python3 -u

This should achieve 31.0 mAP.

Training and evaluating on LVU tasks

python3 -u [1-9]

The argument (an integer from 1 to 9) selects the task to run. Please see for details.
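For anyone wrapping the command in their own tooling, the 1 to 9 constraint can be enforced up front. This is a sketch of one way to do that with argparse; the function parse_task is hypothetical and says nothing about how the repo's own script parses its argument or which task each number maps to.

```python
import argparse


def parse_task(argv=None):
    """Parse one positional task index and reject anything outside 1..9."""
    parser = argparse.ArgumentParser(
        description="Select an LVU task (hypothetical wrapper, not the repo's script)."
    )
    parser.add_argument(
        "task", type=int, choices=range(1, 10), help="task index, 1 through 9"
    )
    return parser.parse_args(argv).task
```

Calling parse_task(["3"]) returns the integer 3, while an out-of-range value such as "10" makes argparse print a usage error and exit.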


This implementation borrows heavily from Hugging Face Transformers. Please consider citing it if you use this repo.