# Less is More: Pay Less Attention in Vision Transformers

Official PyTorch implementation of [Less is More: Pay Less Attention in Vision Transformers](https://arxiv.org/abs/2105.14217).

By Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu and Jianfei Cai.

In our paper, we present a novel Less attention vIsion Transformer (LIT), building upon the fact that convolutions, fully-connected (FC) layers, and self-attention layers have almost equivalent mathematical expressions when processing image patch sequences. LIT uses pure multi-layer perceptrons (MLPs) to encode rich local patterns in the early stages, while applying self-attention modules to capture longer-range dependencies in the deeper layers. Moreover, we propose a learned deformable token merging (DTM) module to adaptively fuse informative patches in a non-uniform manner.
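
To make the two-regime design concrete, here is a minimal PyTorch sketch of the two block types: an attention-free MLP block for the early stages and a standard self-attention block for the deeper stages. This is an illustrative sketch, not the repository's implementation (class names, dimensions, and the `batch_first` attention API, which requires PyTorch >= 1.9, are our assumptions); see `classification/` for the actual model definitions.

```python
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    """Early-stage block: no self-attention, just pre-norm MLP with a residual."""
    def __init__(self, dim, mlp_ratio=4.0):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):  # x: (B, N, C) patch tokens
        return x + self.mlp(self.norm(x))

class AttentionBlock(nn.Module):
    """Deeper-stage block: standard pre-norm self-attention followed by an MLP."""
    def __init__(self, dim, num_heads=8, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # batch_first=True requires PyTorch >= 1.9
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

# Early stages stack MLPBlocks; later stages stack AttentionBlocks.
x = torch.randn(2, 196, 64)  # (batch, tokens, dim)
stage1 = nn.Sequential(*[MLPBlock(64) for _ in range(2)])
print(stage1(x).shape)  # torch.Size([2, 196, 64])
```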

If you use this code for a paper, please cite:

```bibtex
@article{pan2021less,
  title={Less is More: Pay Less Attention in Vision Transformers},
  author={Pan, Zizheng and Zhuang, Bohan and He, Haoyu and Liu, Jing and Cai, Jianfei},
  journal={arXiv preprint arXiv:2105.14217},
  year={2021}
}
```

## Usage

First, clone this repository.

```bash
git clone https://github.com/MonashAI/LIT
```

Next, create a conda virtual environment.

```bash
# Make sure you have an NVIDIA GPU.
cd LIT/classification
bash setup_env.sh [conda_install_path] [env_name]

# For example:
bash setup_env.sh /home/anaconda3 lit
```

Note: We use PyTorch 1.7.1 with CUDA 10.1 for all experiments. `setup_env.sh` lists all the dependencies used in our experiments; you may edit this file to install a different version of PyTorch or of any other package.
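
If you prefer to manage the environment yourself rather than editing the script, the conda command matching the versions above is:

```bash
# Pin the PyTorch/CUDA versions used in our experiments
# (torchvision 0.8.2 is the release paired with PyTorch 1.7.1).
conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=10.1 -c pytorch
```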

## Image Classification on ImageNet

We provide baseline LIT models pretrained on ImageNet-1K. For training and evaluation code, please refer to the `classification/` directory.

| Name | Params (M) | FLOPs (G) | Top-1 Acc. (%) | Model | Log |
| --- | --- | --- | --- | --- | --- |
| LIT-Ti | 19 | 3.6 | 81.1 | google drive / github | log |
| LIT-S | 27 | 4.1 | 81.5 | google drive / github | log |
| LIT-M | 48 | 8.6 | 83.0 | google drive / github | log |
| LIT-B | 86 | 15.0 | 83.4 | google drive / github | log |
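
As a quick sanity check that a downloaded checkpoint loads correctly, the sketch below uses plain PyTorch. The import path and the `'model'` key are assumptions (checkpoint layouts vary); see `classification/` for the actual model builders and evaluation script.

```python
import torch
# Hypothetical import: replace with the actual model builder from classification/.
from models import lit_ti

model = lit_ti()
ckpt = torch.load('lit_ti.pth', map_location='cpu')
# Checkpoints are often wrapped in a dict; fall back to the raw object otherwise.
state_dict = ckpt['model'] if isinstance(ckpt, dict) and 'model' in ckpt else ckpt
model.load_state_dict(state_dict)
model.eval()  # ready for top-1 accuracy evaluation on the ImageNet-1K val set
```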

## Object Detection on COCO

For training and evaluation code, please refer to the `detection/` directory.

### RetinaNet

| Backbone | Params (M) | Lr schd | box mAP | Config | Model | Log |
| --- | --- | --- | --- | --- | --- | --- |
| LIT-Ti | 30 | 1x | 41.6 | config | github | log |
| LIT-S | 39 | 1x | 41.6 | config | github | log |

### Mask R-CNN

| Backbone | Params (M) | Lr schd | box mAP | mask mAP | Config | Model | Log |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LIT-Ti | 40 | 1x | 42.0 | 39.1 | config | github | log |
| LIT-S | 48 | 1x | 42.9 | 39.6 | config | github | log |
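
Evaluation follows the usual MMDetection workflow (an assumption based on the config/1x-schedule terminology; the config and checkpoint file names below are placeholders, see `detection/` for the real files):

```bash
cd LIT/detection
# Evaluate a downloaded checkpoint on COCO (paths are placeholders).
python tools/test.py configs/retinanet_lit_ti_fpn_1x_coco.py retinanet_lit_ti.pth --eval bbox
```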

## Semantic Segmentation on ADE20K

For training and evaluation code, please refer to the `segmentation/` directory.

### Semantic FPN

| Backbone | Params (M) | Iters | mIoU | Config | Model | Log |
| --- | --- | --- | --- | --- | --- | --- |
| LIT-Ti | 24 | 80k | 41.3 | config | github | log |
| LIT-S | 32 | 80k | 41.7 | config | github | log |
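
Training follows the usual MMSegmentation workflow (again an assumption; the config name below is a placeholder, see `segmentation/` for the real files):

```bash
cd LIT/segmentation
# Train Semantic FPN with a LIT backbone (config name is a placeholder).
python tools/train.py configs/fpn_lit_ti_ade20k_80k.py
```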

## Offsets Visualisation

(Figure: visualisation of the offsets learned by the deformable token merging modules.)

We provide a script for visualising the offsets learned by the proposed deformable token merging (DTM) modules. For example:

```bash
# Activate your virtual environment
conda activate lit
cd classification/code_for_lit_ti

# Visualise the learned offsets
python visualize_offset.py --model lit_ti --resume [path/to/lit_ti.pth] --vis_image visualization/demo.JPEG
```

The plots are saved automatically under `visualization/`, in a folder named after the example image.
