Code for the paper: All in One: Exploring Unified Video-Language Pre-training (arXiv)



1. PyTorch Lightning

In this work, we use PyTorch Lightning for distributed training with mixed precision.
Install PyTorch and PyTorch Lightning first.

conda create -n allinone python=3.7
source activate allinone
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
cd [Path_To_This_Code]
pip install -r requirements.txt

2. On-the-fly decode

To speed up the pre-training, we adopt on-the-fly decode for fast IO.
Install ffmpeg and pytorchvideo (for data augmentation) as follows.

sudo conda install -y ffmpeg
pip install ffmpeg-python
pip install pytorchvideo

Please install any additional packages that are not covered by requirements.txt.

Download Pretrained Weights

We provide three pretrained weights on Google Drive.

| Model | Parameters | Pretrained Weight | Training Log | Hparams |
| --- | --- | --- | --- | --- |
| All-in-one-Ti | 12M | Google Drive | Google Drive | Google Drive |
| All-in-one-S | 33M | Google Drive | Google Drive | Google Drive |
| All-in-one-B | 110M | Google Drive | Google Drive | Google Drive |

After downloading these pretrained weights, move them into the pretrained directory.

mkdir pretrained
cp *.ckpt pretrained/

Dataset Preparation




Evaluation on Downstream Tasks


Thanks to its unified design and sparse sampling, All-in-one runs at much lower FLOPs.


2022.3.14: The first version of All-in-one is released. The rest of the code is in progress.


If you find our work helps, please cite our paper.

@article{wang2022allinone,
  title={All in One: Exploring Unified Video-Language Pre-training},
  author={Wang, Alex Jinpeng and Ge, Yixiao and Yan, Rui and Ge, Yuying and Lin, Xudong and Cai, Guanyu and Wu, Jianping and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2203.07303},
  year={2022}
}

This work is mainly based on ViLT, Frozen, and MERLOT.

