Masked Autoencoders in PyTorch

A simple, unofficial implementation of MAE (Masked Autoencoders Are Scalable Vision Learners) using pytorch-lightning.

Training is currently implemented for CUB and StanfordCars, and can be extended to other image datasets.


# Clone the repository
git clone
cd mae-pytorch

# Install required libraries (inside a virtual environment preferably)
pip install -r requirements.txt

# Set up .env for path to data
echo "DATADIR=/path/to/data" > .env
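
As a rough illustration of how the `DATADIR` value written to `.env` might be consumed, here is a minimal sketch. It assumes the variable is loaded with python-dotenv's `load_dotenv()`; the `resolve_datadir` helper and its fallback default are hypothetical and are not taken from this repository.

```python
import os
from pathlib import Path

try:
    # python-dotenv copies KEY=value pairs from a .env file into os.environ
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    # Without python-dotenv, only variables already exported in the shell are visible
    pass


def resolve_datadir(default="data"):
    """Return the dataset root from DATADIR, or a (hypothetical) default path."""
    return Path(os.environ.get("DATADIR", default)).expanduser()


if __name__ == "__main__":
    print(resolve_datadir())
```

By default `load_dotenv()` does not overwrite variables that are already set, so exporting `DATADIR` in the shell should take precedence over the `.env` entry.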


MAE training

Training options are provided through configuration files, handled by LightningCLI. See configs/ for examples.

Train an MAE model on the CUB dataset:

python fit --config=configs/mae.yaml --config=configs/data/cub_mae.yaml

Using multiple GPUs:

python fit --config=configs/mae.yaml --config=configs/data/cub_mae.yaml --config=configs/multigpu.yaml
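
When several `--config` files are passed, LightningCLI layers them in order, with values from later files overriding earlier ones. The sketch below mimics that layering with plain dictionaries. It is only an analogy, not LightningCLI's actual parsing logic, and the keys and values are hypothetical stand-ins rather than the real contents of the files in configs/.

```python
from copy import deepcopy


def merge_configs(*configs):
    """Recursively merge dicts; values from later configs win."""
    merged = {}
    for cfg in configs:
        for key, value in cfg.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                # Both sides are mappings: merge them key by key
                merged[key] = merge_configs(merged[key], value)
            else:
                # Scalars, lists, or new keys: the later value replaces the earlier one
                merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-ins for the three YAML files given on the command line
base = {"trainer": {"max_epochs": 100, "devices": 1}}
data = {"data": {"name": "cub"}}
multigpu = {"trainer": {"devices": 4, "strategy": "ddp"}}

print(merge_configs(base, data, multigpu))
```

In this toy example the multi-GPU file only needs to list the trainer settings it changes; everything else is inherited from the earlier files.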


Not yet implemented.


The default model uses ViT-Base for the encoder, and a small ViT (depth=4, width=192) for the decoder. This is smaller than the model used in the paper.
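
To give a sense of scale, the following back-of-the-envelope sketch estimates the transformer-block weights for these sizes. It assumes standard ViT blocks with an MLP ratio of 4 and ignores biases, layer norms, and embeddings, so the numbers are approximate. The paper decoder shown for comparison uses the 8-block, 512-wide configuration that the MAE paper reports as its default.

```python
def approx_block_params(width, mlp_ratio=4):
    """Approximate weights in one transformer block (biases and norms ignored)."""
    attention = 4 * width * width          # Q, K, V and output projections
    mlp = 2 * mlp_ratio * width * width    # two linear layers, hidden size = mlp_ratio * width
    return attention + mlp


def approx_vit_params(depth, width):
    """Approximate weights in a stack of `depth` blocks (embeddings excluded)."""
    return depth * approx_block_params(width)


encoder = approx_vit_params(depth=12, width=768)        # ViT-Base
decoder = approx_vit_params(depth=4, width=192)         # default decoder described above
paper_decoder = approx_vit_params(depth=8, width=512)   # MAE paper's default decoder

for name, n in [("encoder", encoder), ("decoder", decoder), ("paper decoder", paper_decoder)]:
    print(f"{name}: ~{n / 1e6:.1f}M weights")
```

Under these assumptions the default decoder is more than an order of magnitude lighter than the paper's, which keeps reconstruction cheap relative to the encoder.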


  • Configuration and training are handled entirely by pytorch-lightning.
  • The MAE model uses the VisionTransformer from timm.
  • Interface to FGVC datasets through fgvcdata.
  • Configurable environment variables through python-dotenv.


Reconstructions of CUB validation images after training with the following command:

python fit --config=configs/mae.yaml --config=configs/data/cub_mae.yaml --config=configs/multigpu.yaml

[Figure: Bird Reconstructions]
