GreenMIM

This is the official PyTorch implementation of the paper Green Hierarchical Vision Transformer for Masked Image Modeling.

Group Attention Scheme.

Method Overview.

Citation

If you find our work interesting or use our code/models, please cite:

@article{huang2022green,
  title={Green Hierarchical Vision Transformer for Masked Image Modeling},
  author={Huang, Lang and You, Shan and Zheng, Mingkai and Wang, Fei and Qian, Chen and Yamasaki, Toshihiko},
  journal={arXiv preprint arXiv:2205.13515},
  year={2022}
}

Catalogs

Pre-trained checkpoints
Pre-training code
Fine-tuning code

Pre-trained Models

	Swin-Base (Window 7×7)	Swin-Base (Window 14×14)	Swin-Large (Window 14×14)
pre-trained checkpoint	Download	Download	Download

Pre-training

The pre-training scripts are in the scripts/ folder. Scripts whose names start with ‘run’ are for non-Slurm users; the others are for Slurm users.

For Non-Slurm Users

To pre-train Swin-B on a single node with 8 GPUs:

PORT=23456 NPROC=8 bash scripts/run_mae_swin_base.sh
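
If you prefer to start the run from Python, the command above can be wrapped as in the following minimal sketch. It only sets the same PORT and NPROC environment variables and calls the same script. The helper names (find_free_port, build_launch, launch) are hypothetical and not part of this repository, and picking a free port automatically is an added convenience, not something the script is known to require.

```python
import os
import socket
import subprocess


def find_free_port() -> int:
    """Ask the OS for a TCP port that is currently unused on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))  # port 0 lets the OS choose any free port
        return sock.getsockname()[1]


def build_launch(nproc=8, port=None, script="scripts/run_mae_swin_base.sh"):
    """Return (command, env) equivalent to `PORT=<port> NPROC=<nproc> bash <script>`."""
    if nproc < 1:
        raise ValueError("nproc must be at least 1")
    env = dict(os.environ)  # copy, so the caller's environment is not modified
    env["PORT"] = str(port if port is not None else find_free_port())
    env["NPROC"] = str(nproc)
    return ["bash", script], env


def launch(nproc=8, port=None):
    """Start pre-training; run this from the repository root."""
    cmd, env = build_launch(nproc=nproc, port=port)
    subprocess.run(cmd, env=env, check=True)


if __name__ == "__main__":
    # Show the equivalent shell command without starting a run.
    cmd, env = build_launch(nproc=8, port=23456)
    print("PORT=%s NPROC=%s %s" % (env["PORT"], env["NPROC"], " ".join(cmd)))
```

Calling launch(nproc=8, port=23456) should behave like the shell command above; omitting port lets the OS pick one, which can help when several jobs share a machine.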

For Slurm Users

To pre-train Swin-B on a single node with 8 GPUs:

bash scripts/srun_mae_swin_base.sh [Partition] [NUM_GPUS]

Fine-tuning on ImageNet-1K

Model	#Params	Pre-train Resolution	Fine-tune Resolution	Config	Acc@1 (%)
Swin-B (Window 7×7)	88M	224×224	224×224	Config	83.7
Swin-L (Window 14×14)	197M	224×224	224×224	Config	85.1

We currently use the SimMIM code directly for fine-tuning; please follow its instructions to use the configs. Note that, due to limited computing resources, we fine-tune Swin-B with a batch size of 1024 (128 × 8) and Swin-L with a batch size of 768 (48 × 16).
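
The batch sizes quoted above can be checked, and adapted to a different GPU count, with a small calculation. The sketch below assumes that a figure such as "128 x 8" means per-GPU batch size times number of GPUs; the README does not spell this out. The function names are hypothetical and not part of SimMIM or this repository.

```python
def global_batch_size(per_gpu: int, num_gpus: int) -> int:
    """Total samples per optimizer step across all GPUs (no gradient accumulation)."""
    return per_gpu * num_gpus


def per_gpu_batch(global_batch: int, num_gpus: int) -> int:
    """Per-GPU batch size needed to keep a given global batch size."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if global_batch % num_gpus != 0:
        raise ValueError("global batch size is not divisible by the GPU count")
    return global_batch // num_gpus


if __name__ == "__main__":
    print(global_batch_size(128, 8))   # the 128 x 8 setting
    print(global_batch_size(48, 16))   # the 48 x 16 setting
    print(per_gpu_batch(1024, 4))      # same global batch on 4 GPUs
```

Keeping the global batch size fixed when changing the number of GPUs avoids having to retune the learning rate, provided the larger per-GPU batch still fits in memory.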