CoaT: Co-Scale Conv-Attentional Image Transformers
This repository contains the official code and pretrained models for CoaT: Co-Scale Conv-Attentional Image Transformers. It introduces (1) a co-scale mechanism to realize fine-to-coarse, coarse-to-fine and cross-scale attention modeling and (2) an efficient conv-attention module to realize relative position encoding in the factorized attention.
Performance
- Classification (ImageNet dataset)
Name | Acc@1 | Acc@5 | #Params |
---|---|---|---|
CoaT-Lite Tiny | 77.5 | 93.8 | 5.7M |
CoaT-Lite Mini | 79.1 | 94.5 | 11M |
CoaT-Lite Small | 81.9 | 95.5 | 20M |
CoaT Tiny | 78.3 | 94.0 | 5.5M |
CoaT Mini | 81.0 | 95.2 | 10M |
- Instance Segmentation (Mask R-CNN w/ FPN on COCO dataset)
Name | Schedule | Bbox AP | Segm AP |
---|---|---|---|
CoaT-Lite Mini | 1x | 39.9 | 36.4 |
CoaT-Lite Mini | 3x | 41.8 | 37.7 |
CoaT-Lite Small | 1x | 43.7 | 39.3 |
CoaT-Lite Small | 3x | 44.5 | 39.8 |
CoaT Mini | 1x | 44.0 | 39.5 |
CoaT Mini | 3x | 45.2 | 40.2 |
- Object Detection (Deformable-DETR on COCO dataset)
Name | AP | AP50 | AP75 | APS | APM | APL |
---|---|---|---|---|---|---|
CoaT-Lite Small | 47.0 | 66.5 | 51.2 | 28.8 | 50.3 | 63.3 |
Changelog
05/19/2021: Pre-trained checkpoint for Mask R-CNN benchmark with CoaT-Lite Small backbone is released.
05/19/2021: Code and pre-trained checkpoint for Deformable-DETR with CoaT-Lite Small backbone are released.
05/11/2021: Pre-trained checkpoint for CoaT-Lite Small is released.
05/09/2021: Pre-trained checkpoint for Mask R-CNN benchmark with CoaT Mini backbone is released.
05/06/2021: Pre-trained checkpoint for CoaT Mini is released.
05/02/2021: Pre-trained checkpoint for CoaT Tiny is released.
04/25/2021: Code and pre-trained checkpoint for Mask R-CNN benchmark with CoaT-Lite Mini backbone are released.
04/23/2021: Pre-trained checkpoint for CoaT-Lite Mini is released.
04/22/2021: Code and pre-trained checkpoint for CoaT-Lite Tiny are released.
Usage
The following instructions cover the classification task with the CoaT model. For other tasks, such as instance segmentation and object detection, please follow the corresponding readme.
Environment Preparation
- Set up a new conda environment and activate it.
# Create an environment with Python 3.8.
conda create -n coat python==3.8
conda activate coat
- Install required packages.
# Install PyTorch 1.7.1 w/ CUDA 11.0.
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
# Install timm 0.3.2.
pip install timm==0.3.2
# Install einops.
pip install einops
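After installation, it can help to confirm that the pinned versions are the ones actually present in the environment. The following is a minimal sketch, not part of this repository: `EXPECTED`, `base_version`, and `check_versions` are hypothetical names, and the script relies only on the standard-library `importlib.metadata` module available in Python 3.8.

```python
from importlib import metadata

# Versions pinned by the install commands above (einops is unpinned, so it is not checked).
EXPECTED = {"torch": "1.7.1", "torchvision": "0.8.2", "timm": "0.3.2"}


def base_version(version):
    """Drop a local build tag such as '+cu110' from a version string."""
    return version.split("+", 1)[0]


def check_versions(expected, lookup=metadata.version):
    """Return {package: (expected, found)} for every package that does not match."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = base_version(lookup(name))
        except metadata.PackageNotFoundError:
            found = None  # Package is not installed at all.
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    problems = check_versions(EXPECTED)
    for name, (wanted, found) in problems.items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
    if not problems:
        print("All pinned packages match.")
```

Run it inside the activated `coat` environment; any line it prints names a package whose version differs from the one pinned above.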
Code and Dataset Preparation
- Clone the repo.
git clone https://github.com/mlpc-ucsd/CoaT
cd CoaT
- Download ImageNet dataset (ILSVRC 2012) and extract.
# Create dataset folder.
mkdir -p ./data/ImageNet
# Download the dataset (not shown here) and copy the files (assume the download path is in $DATASET_PATH).
cp $DATASET_PATH/ILSVRC2012_img_train.tar $DATASET_PATH/ILSVRC2012_img_val.tar $DATASET_PATH/ILSVRC2012_devkit_t12.tar.gz ./data/ImageNet
# Extract the dataset.
python -c "from torchvision.datasets import ImageNet; ImageNet('./data/ImageNet', split='train')"
python -c "from torchvision.datasets import ImageNet; ImageNet('./data/ImageNet', split='val')"
# After the extraction, you should observe `train` and `val` folders under ./data/ImageNet.
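The extraction result can be sanity-checked with a short script. This is a sketch under the assumption that each of `train` and `val` ends up with one subfolder per class (1000 for ILSVRC 2012), which is how torchvision's `ImageNet` dataset arranges the files. The helper names `count_class_folders` and `check_imagenet_layout` are hypothetical and do not exist in this repository.

```python
from pathlib import Path


def count_class_folders(root, split):
    """Count the class subfolders under <root>/<split>; return 0 if the split is missing."""
    split_dir = Path(root) / split
    if not split_dir.is_dir():
        return 0
    return sum(1 for entry in split_dir.iterdir() if entry.is_dir())


def check_imagenet_layout(root, expected_classes=1000):
    """Return a list of problems; an empty list means both splits look complete."""
    problems = []
    for split in ("train", "val"):
        found = count_class_folders(root, split)
        if found != expected_classes:
            problems.append(
                f"{split}: expected {expected_classes} class folders, found {found}"
            )
    return problems


if __name__ == "__main__":
    # The path matches the dataset folder created above.
    for line in check_imagenet_layout("./data/ImageNet") or ["train and val look complete."]:
        print(line)
```

The check only counts folders; it does not verify the number of images per class.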
Evaluate Pre-trained Checkpoint
We provide the CoaT checkpoints pre-trained on the ImageNet dataset.
Name | Acc@1 | Acc@5 | #Params | SHA-256 (first 8 chars) | URL |
---|---|---|---|---|---|
CoaT-Lite Tiny | 77.5 | 93.8 | 5.7M | e88e96b0 | model, log |
CoaT-Lite Mini | 79.1 | 94.5 | 11M | 6b4a8ae5 | model, log |
CoaT-Lite Small | 81.9 | 95.5 | 20M | 8d362f48 | model, log |
CoaT Tiny | 78.3 | 94.0 | 5.5M | c6efc33c | model, log |
CoaT Mini | 81.0 | 95.2 | 10M | 40667eec | model, log |
The following commands provide an example (CoaT-Lite Tiny) to evaluate the pre-trained checkpoint.
# Download the pretrained checkpoint.
mkdir -p ./output/pretrained
wget http://vcl.ucsd.edu/coat/pretrained/coat_lite_tiny_e88e96b0.pth -P ./output/pretrained
sha256sum ./output/pretrained/coat_lite_tiny_e88e96b0.pth # Make sure it matches the SHA-256 hash (first 8 characters) in the table.
# Evaluate.
# Usage: bash ./scripts/eval.sh [model name] [output folder] [checkpoint path]
bash ./scripts/eval.sh coat_lite_tiny coat_lite_tiny_pretrained ./output/pretrained/coat_lite_tiny_e88e96b0.pth
# It should output results similar to "Acc@1 77.504 Acc@5 93.814" at the very end.
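The manual comparison of the `sha256sum` output against the table can also be automated. The sketch below assumes, based on the example file name `coat_lite_tiny_e88e96b0.pth`, that each checkpoint file name ends with the first 8 characters of its SHA-256 hash. The functions `sha256_of_file`, `prefix_from_filename`, and `checkpoint_matches` are hypothetical helpers, not part of this repository.

```python
import hashlib
import re


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def prefix_from_filename(filename):
    """Extract the 8-hex-character suffix from names like 'coat_lite_tiny_e88e96b0.pth'."""
    match = re.search(r"_([0-9a-f]{8})\.pth$", filename)
    return match.group(1) if match else None


def checkpoint_matches(path):
    """Return True if the file's SHA-256 digest starts with the prefix in its name."""
    expected = prefix_from_filename(str(path))
    return expected is not None and sha256_of_file(path).startswith(expected)


# Example (assumes the checkpoint was downloaded as shown above):
# checkpoint_matches("./output/pretrained/coat_lite_tiny_e88e96b0.pth")
```

A matching 8-character prefix catches truncated or corrupted downloads; it is not a full integrity guarantee, since the table lists only part of the hash.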
Train
The following commands provide an example (CoaT-Lite Tiny, 8-GPU) to train the CoaT model.
# Usage: bash ./scripts/train.sh [model name] [output folder]
bash ./scripts/train.sh coat_lite_tiny coat_lite_tiny
Evaluate
The following commands provide an example (CoaT-Lite Tiny) to evaluate the checkpoint after training.
# Usage: bash ./scripts/eval.sh [model name] [output folder] [checkpoint path]
bash ./scripts/eval.sh coat_lite_tiny coat_lite_tiny_eval ./output/coat_lite_tiny/checkpoints/checkpoint0299.pth
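If training stops before the final epoch, the checkpoint to evaluate may not be `checkpoint0299.pth`. The sketch below picks the highest-numbered checkpoint in a folder. It assumes that all checkpoints follow the `checkpoint<number>.pth` naming seen in the example path, which is inferred from that single file name rather than confirmed by the training script. `latest_checkpoint` is a hypothetical helper.

```python
import re
from pathlib import Path

# Naming pattern inferred from 'checkpoint0299.pth' in the example above (an assumption).
CHECKPOINT_PATTERN = re.compile(r"^checkpoint(\d+)\.pth$")


def latest_checkpoint(folder):
    """Return the path of the highest-numbered checkpoint in folder, or None if there is none."""
    best = None
    for entry in Path(folder).glob("checkpoint*.pth"):
        match = CHECKPOINT_PATTERN.match(entry.name)
        if match:
            epoch = int(match.group(1))
            if best is None or epoch > best[0]:
                best = (epoch, entry)
    return best[1] if best else None


if __name__ == "__main__":
    # The folder matches the checkpoint path used in the evaluation command above.
    print(latest_checkpoint("./output/coat_lite_tiny/checkpoints"))
```

The printed path can be passed as the `[checkpoint path]` argument of `./scripts/eval.sh`.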
Citation
@misc{xu2021coscale,
title={Co-Scale Conv-Attentional Image Transformers},
author={Weijian Xu and Yifan Xu and Tyler Chang and Zhuowen Tu},
year={2021},
eprint={2104.06399},
archivePrefix={arXiv},
primaryClass={cs.CV}
}