Masked Visual Pre-training for Motor Control

This is a PyTorch implementation of the paper Masked Visual Pre-training for Motor Control. It contains the benchmark suite, pre-trained models, and the training code to reproduce the results from the paper.


Please see for installation instructions.

Pre-trained visual encoders

We provide pre-trained visual encoders used in the paper. The models are in the same format as mae and timm:

backbone  objective   data         md5  download
ViT-S     MAE         in-the-wild       model
ViT-S     MAE         ImageNet          model
ViT-S     Supervised  ImageNet          model

By default, the code assumes that the pre-trained encoders are placed in the /tmp/pretrained directory.
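
The md5 column in the table above can be used to check a download before training. The following is a minimal sketch of such a check and is not part of this repository. The helper names file_md5 and checkpoint_matches are hypothetical, and the prefix comparison assumes the listed md5 may be an abbreviated digest.

```python
import hashlib
from pathlib import Path

# Default checkpoint location stated above.
PRETRAINED_DIR = Path("/tmp/pretrained")


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checkpoint_matches(path, expected_md5):
    """Return True if the file's digest starts with expected_md5.

    Assumption: the published md5 may be a prefix of the full digest,
    so a prefix match is accepted. An empty value is rejected because
    it would match any file.
    """
    if not expected_md5:
        raise ValueError("expected_md5 must be a non-empty hex string")
    return file_md5(path).startswith(expected_md5.lower())


# Example use (file name and md5 are placeholders, not real values):
# ok = checkpoint_matches(PRETRAINED_DIR / "<checkpoint file>", "<md5 from table>")
```

If the check fails, the file is most likely truncated or corrupted, and downloading it again is the usual fix.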

Example training commands

Train FrankaPick from states:

python tools/ task=FrankaPick

Train FrankaPick from pixels:

python tools/ task=FrankaPickPixels

Train on 8 GPUs:

python tools/ num_gpus=8

Test a policy after N iterations:

python tools/ test=True headless=False logdir=/path/to/job resume=N


If you find the code or pre-trained models useful in your research, please use the following BibTeX entry:

  title = {Masked Visual Pre-training for Motor Control},
  author = {Tete Xiao and Ilija Radosavovic and Trevor Darrell and Jitendra Malik},
  journal = {arXiv:2203.06173},
  year = {2022}


We thank the NVIDIA IsaacGym and PhysX teams for making the simulator and preview code examples available.

