This is the official implementation of Joint Inductive and Transductive Learning for Video Object Segmentation, to appear in ICCV 2021. If you find this work useful, please cite:

@inproceedings{mao2021joint,
  title={Joint Inductive and Transductive Learning for Video Object Segmentation},
  author={Mao, Yunyao and Wang, Ning and Zhou, Wengang and Li, Houqiang},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month={October},
  year={2021}
}
[Figure: JOINT overview]


Clone this repository

git clone

Install dependencies

Please check the detailed installation instructions.


The whole network is trained on 8 NVIDIA GTX 1080Ti GPUs.

conda activate pytracking
cd ltr
python run_training.py joint joint_stage1  # stage 1
python run_training.py joint joint_stage2  # stage 2

Note: We initialize the backbone ResNet with pre-trained Mask-RCNN weights, as in LWL. These weights can be obtained from here. Before training, download these weights and save them in the directory set by env_settings().pretrained_networks.
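For reference, a minimal sketch of where this directory is typically configured. The file name (ltr/admin/local.py) and class layout are assumptions based on the pytracking-style LTR framework; only the attribute mentioned above is shown.

```python
# Hypothetical excerpt of ltr/admin/local.py (file name assumed from the
# pytracking framework); adjust the path to your own setup.
class EnvironmentSettings:
    def __init__(self):
        # Directory where the pre-trained Mask-RCNN ResNet weights are stored.
        self.pretrained_networks = '/path/to/pretrained_networks/'
```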


conda activate pytracking
cd pytracking
python run_tracker.py joint joint_davis --dataset_name dv2017_val        # DAVIS 2017 Val
python run_tracker.py joint joint_ytvos --dataset_name yt2018_valid_all  # YouTube-VOS 2018 Val
python run_tracker.py joint joint_ytvos --dataset_name yt2019_valid_all  # YouTube-VOS 2019 Val

Note: Before evaluation, the pretrained networks (see Model Zoo) should be downloaded and saved to the directory set by "network_path" in "pytracking/evaluation/". By default, it is set to pytracking/networks.
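As a reference, a minimal sketch of where network_path would typically be set. The file name (pytracking/evaluation/local.py) and the surrounding function are assumptions based on the pytracking framework.

```python
# Hypothetical excerpt of pytracking/evaluation/local.py (file name assumed
# from the pytracking framework); only the setting referenced above is shown.
from pytracking.evaluation.environment import EnvironmentSettings

def local_env_settings():
    settings = EnvironmentSettings()
    # Directory containing the downloaded JOINT checkpoints
    # (defaults to pytracking/networks).
    settings.network_path = '/path/to/pytracking/networks/'
    return settings
```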

Model Zoo


| Model | YouTube-VOS 2018 (Overall Score) | YouTube-VOS 2019 (Overall Score) | DAVIS 2017 val (J&F score) | Links | Raw Results |
| :-- | :--: | :--: | :--: | :--: | :--: |
| JOINT_ytvos | 83.1 | 82.8 | - | model | results |
| JOINT_davis | - | - | 83.5 | model | results |


  • Our JOINT segmentation tracker is implemented based on the PyTracking framework. We sincerely thank the authors Martin Danelljan and Goutam Bhat for providing such a great framework.
  • We adopt the few-shot learner proposed in LWL as the induction branch (see the illustrative sketch after this list).
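To make the role of the two branches concrete, below is a highly simplified, hypothetical PyTorch-style sketch of how an induction branch (a few-shot learner, as in LWL) and a transduction branch (attention-based propagation from past frames) could be fused before a shared decoder. All module and variable names are illustrative assumptions; this is not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class ToyJointSegmentationHead(nn.Module):
    """Illustrative only: fuses an 'inductive' prediction (few-shot learner style)
    with a 'transductive' prediction (attention over past-frame features)."""

    def __init__(self, feat_dim=64):
        super().__init__()
        # Induction branch stand-in: a small conv playing the role of the
        # target model produced by a few-shot learner (LWL-style).
        self.induction = nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1)
        # Transduction branch stand-in: attention between current-frame features
        # and memory features from previous frames.
        self.attn = nn.MultiheadAttention(embed_dim=feat_dim, num_heads=4, batch_first=True)
        # Shared decoder stand-in: fuses both branches into mask logits.
        self.decoder = nn.Conv2d(2 * feat_dim, 1, kernel_size=3, padding=1)

    def forward(self, cur_feat, mem_feat):
        # cur_feat: (B, C, H, W) current-frame features
        # mem_feat: (B, T*H*W, C) flattened features from past frames
        b, c, h, w = cur_feat.shape
        inductive = self.induction(cur_feat)                 # (B, C, H, W)
        q = cur_feat.flatten(2).transpose(1, 2)              # (B, H*W, C)
        transductive, _ = self.attn(q, mem_feat, mem_feat)   # (B, H*W, C)
        transductive = transductive.transpose(1, 2).reshape(b, c, h, w)
        fused = torch.cat([inductive, transductive], dim=1)  # (B, 2C, H, W)
        return self.decoder(fused)                           # (B, 1, H, W) mask logits

if __name__ == "__main__":
    head = ToyJointSegmentationHead(feat_dim=64)
    cur = torch.randn(2, 64, 30, 30)
    mem = torch.randn(2, 3 * 30 * 30, 64)   # e.g. memory from 3 past frames
    print(head(cur, mem).shape)             # torch.Size([2, 1, 30, 30])
```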