PaddlePaddle Vision Transformers

State-of-the-art Visual Transformer and MLP Models for PaddlePaddle

PaddlePaddle Visual Transformers (PaddleViT or PPViT) is a collection of vision models beyond convolution. Most of the models are based on Visual Transformers, Visual Attention, and MLPs. PaddleViT also integrates popular layers, utilities, optimizers, schedulers, data augmentations, and training/validation scripts for PaddlePaddle 2.1+. The aim is to reproduce a wide variety of state-of-the-art ViT and MLP models with full training/validation procedures. We are passionate about making cutting-edge CV techniques easier to use for everyone.

:robot: PaddleViT provides models and tools for multiple vision tasks, such as classification, object detection, semantic segmentation, GAN, and more. Each model architecture is defined in a standalone Python module and can be modified to enable quick research experiments. At the same time, pretrained weights can be downloaded and used to finetune on your own datasets. PaddleViT also integrates popular tools and modules for customized datasets, data preprocessing, performance metrics, DDP, and more.

:robot: PaddleViT is backed by the popular deep learning framework PaddlePaddle, and we also provide tutorials and projects on Paddle AI Studio. It's intuitive and straightforward for new users to get started.

PaddleViT implements model architectures and tools for multiple vision tasks; see the following links for detailed information.

We also provide tutorials:

  • Notebooks (coming soon)
  • Online Course (coming soon)

Model architectures

Image Classification (Transformers)

  1. ViT (from Google), released with paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
  2. DeiT (from Facebook and Sorbonne), released with paper Training data-efficient image transformers & distillation through attention, by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
  3. Swin Transformer (from Microsoft), released with paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
  4. VOLO (from Sea AI Lab and NUS), released with paper VOLO: Vision Outlooker for Visual Recognition, by Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan.
  5. CSwin Transformer (from USTC and Microsoft), released with paper CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, by Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo.
  6. CaiT (from Facebook and Sorbonne), released with paper Going deeper with Image Transformers, by Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, Hervé Jégou.
  7. PVTv2 (from NJU/HKU/NJUST/IIAI/SenseTime), released with paper PVTv2: Improved Baselines with Pyramid Vision Transformer, by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
  8. Shuffle Transformer (from Tencent), released with paper Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer, by Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, Bin Fu.
  9. T2T-ViT (from NUS and YITU), released with paper Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, by Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, Shuicheng Yan.

Coming Soon:

  1. CrossViT (from IBM), released with paper CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, by Chun-Fu Chen, Quanfu Fan, Rameswar Panda.
  2. Focal Transformer (from Microsoft), released with paper Focal Self-attention for Local-Global Interactions in Vision Transformers, by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.
  3. HaloNet (from Google), released with paper Scaling Local Self-Attention for Parameter Efficient Visual Backbones, by Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, Jonathon Shlens.

Image Classification (MLPs)

  1. MLP-Mixer (from Google), released with paper MLP-Mixer: An all-MLP Architecture for Vision, by Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy.
  2. ResMLP (from Facebook/Sorbonne/Inria/Valeo), released with paper ResMLP: Feedforward networks for image classification with data-efficient training, by Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou.
  3. gMLP (from Google), released with paper Pay Attention to MLPs, by Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le.

Detection

  1. DETR (from Facebook), released with paper End-to-End Object Detection with Transformers, by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.

Coming Soon:

  1. Swin Transformer (from Microsoft), released with paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
  2. PVTv2 (from NJU/HKU/NJUST/IIAI/SenseTime), released with paper PVTv2: Improved Baselines with Pyramid Vision Transformer, by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
  3. Focal Transformer (from Microsoft), released with paper Focal Self-attention for Local-Global Interactions in Vision Transformers, by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.
  4. UP-DETR (from Tencent), released with paper UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, by Zhigang Dai, Bolun Cai, Yugeng Lin, Junying Chen.

Semantic Segmentation

Now:

  1. SETR (from Fudan/Oxford/Surrey/Tencent/Facebook), released with paper Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, by Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, Li Zhang.
  2. DPT (from Intel), released with paper Vision Transformers for Dense Prediction, by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun.
  3. Swin Transformer (from Microsoft), released with paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
  4. Segmenter (from Inria), released with paper Segmenter: Transformer for Semantic Segmentation, by Robin Strudel, Ricardo Garcia, Ivan Laptev, Cordelia Schmid.
  5. Trans2seg (from HKU/Sensetime/NJU), released with paper Segmenting Transparent Object in the Wild with Transformer, by Enze Xie, Wenjia Wang, Wenhai Wang, Peize Sun, Hang Xu, Ding Liang, Ping Luo.
  6. SegFormer (from HKU/NJU/NVIDIA/Caltech), released with paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers, by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.

Coming Soon:

  1. FTN (from Baidu), released with paper Fully Transformer Networks for Semantic Image Segmentation, by Sitong Wu, Tianyi Wu, Fangjian Lin, Shengwei Tian, Guodong Guo.
  2. Shuffle Transformer (from Tencent), released with paper Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer, by Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, Bin Fu.
  3. Focal Transformer (from Microsoft), released with paper Focal Self-attention for Local-Global Interactions in Vision Transformers, by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.
  4. CSwin Transformer (from USTC and Microsoft), released with paper CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, by Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo.

GAN

  1. TransGAN (from Seoul National University and NUUA), released with paper TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up, by Yifan Jiang, Shiyu Chang, Zhangyang Wang.
  2. Styleformer (from Facebook and Sorbonne), released with paper Styleformer: Transformer based Generative Adversarial Networks with Style Vector, by Jeeseung Park, Younggeun Kim.

Coming Soon:

  1. ViTGAN (from UCSD/Google), released with paper ViTGAN: Training GANs with Vision Transformers, by Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, Ce Liu.

Installation

Prerequisites

  • Linux/MacOS/Windows
  • Python 3.6/3.7
  • PaddlePaddle 2.1.0+
  • CUDA 10.2+

Installation

Create a conda virtual environment and activate it.

conda create -n paddlevit python=3.7 -y
conda activate paddlevit

Install PaddlePaddle following the official instructions, e.g.,

conda install paddlepaddle-gpu==2.1.2 cudatoolkit=10.2 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/

Note: change the PaddlePaddle and CUDA versions to match your environment.

Install dependency packages

  • General dependencies:
pip install yacs pyyaml
  • Packages for Segmentation:
pip install cityscapesScripts detail
  • Packages for GAN:
pip install lmdb
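
After installing, it can help to confirm that the packages are importable from the active environment. The following is a minimal sketch, not part of PaddleViT: `missing_modules` is a hypothetical helper, and the list uses the import names `paddle`, `yacs`, `yaml`, and `lmdb` (the names under which PaddlePaddle, yacs, PyYAML, and lmdb are imported).

```python
import importlib.util
import sys
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names for the general and GAN dependencies (illustrative list).
required = ["paddle", "yacs", "yaml", "lmdb"]

print("Python", sys.version.split()[0])
missing = missing_modules(required)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

Any name reported as missing points to a package that still needs to be installed in the `paddlevit` environment.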

Clone project from GitHub

git clone https://github.com/BR-IDL/PaddleViT.git 

Docker Install

(coming soon)

Results (Ported Weights)

Image Classification

Model Acc@1 Acc@5 Image Size Crop_pct Interpolation Link
vit_base_patch16_224 84.58 97.30 224 0.875 bicubic google/baidu(qv4n)
vit_base_patch16_384 85.99 98.00 384 1.0 bicubic google/baidu(wsum)
vit_large_patch16_224 85.81 97.82 224 0.875 bicubic google/baidu(1bgk)
swin_base_patch4_window7_224 85.27 97.56 224 0.9 bicubic google/baidu(wyck)
swin_base_patch4_window12_384 86.43 98.07 384 1.0 bicubic google/baidu(4a95)
swin_large_patch4_window12_384 87.14 98.23 384 1.0 bicubic google/baidu(j71u)
pvtv2_b0 70.47 90.16 224 0.875 bicubic google/baidu(dxgb)
pvtv2_b1 78.70 94.49 224 0.875 bicubic google/baidu(2e5m)
pvtv2_b2 82.02 95.99 224 0.875 bicubic google/baidu(are2)
pvtv2_b3 83.14 96.47 224 0.875 bicubic google/baidu(nc21)
pvtv2_b4 83.61 96.69 224 0.875 bicubic google/baidu(tthf)
pvtv2_b5 83.77 96.61 224 0.875 bicubic google/baidu(9v6n)
pvtv2_b2_linear 82.06 96.04 224 0.875 bicubic google/baidu(a4c8)
mlp_mixer_b16_224 76.60 92.23 224 0.875 bicubic google/baidu(xh8x)
mlp_mixer_l16_224 72.06 87.67 224 0.875 bicubic google/baidu(8q7r)
resmlp_24_224 79.38 94.55 224 0.875 bicubic google/baidu(jdcx)
resmlp_36_224 79.77 94.89 224 0.875 bicubic google/baidu(33w3)
resmlp_big_24_224 81.04 95.02 224 0.875 bicubic google/baidu(r9kb)
resmlp_big_24_distilled_224 83.59 96.65 224 0.875 bicubic google/baidu(4jk5)
gmlp_s16_224 79.64 94.63 224 0.875 bicubic google/baidu(bcth)
volo_d5_224_86.10 86.08 97.58 224 1.0 bicubic google/baidu(td49)
volo_d5_512_87.07 87.05 97.97 512 1.15 bicubic google/baidu(irik)
cait_xxs24_224 78.38 94.32 224 1.0 bicubic google/baidu(j9m8)
cait_s24_384 85.05 97.34 384 1.0 bicubic google/baidu(qb86)
cait_m48_448 86.49 97.75 448 1.0 bicubic google/baidu(imk5)
deit_base_distilled_patch16_224 83.32 96.49 224 0.875 bicubic google/baidu(5f2g)
deit_base_distilled_patch16_384 85.43 97.33 384 1.0 bicubic google/baidu(qgj2)
shuffle_vit_tiny_patch4_window7 82.39 96.05 224 0.875 bicubic google/baidu(8a1i)
shuffle_vit_small_patch4_window7 83.53 96.57 224 0.875 bicubic google/baidu(xwh3)
shuffle_vit_base_patch4_window7 83.95 96.91 224 0.875 bicubic google/baidu(1gsr)
cswin_tiny_224 82.81 96.30 224 0.9 bicubic google/baidu(4q3h)
cswin_small_224 83.60 96.58 224 0.9 bicubic google/baidu(gt1a)
cswin_base_224 84.23 96.91 224 0.9 bicubic google/baidu(wj8p)
cswin_large_224 86.52 97.99 224 0.9 bicubic google/baidu(b5fs)
cswin_base_384 85.51 97.48 384 1.0 bicubic google/baidu(rkf5)
cswin_large_384 87.49 98.35 384 1.0 bicubic google/baidu(6235)
t2t_vit_7 71.68 90.89 224 0.9 bicubic google/baidu(1hpa)
t2t_vit_10 75.15 92.80 224 0.9 bicubic google/baidu(ixug)
t2t_vit_12 76.48 93.49 224 0.9 bicubic google/baidu(qpbb)
t2t_vit_14 81.50 95.67 224 0.9 bicubic google/baidu(c2u8)
t2t_vit_19 81.93 95.74 224 0.9 bicubic google/baidu(4in3)
t2t_vit_24 82.28 95.89 224 0.9 bicubic google/baidu(4in3)
t2t_vit_t_14 81.69 95.85 224 0.9 bicubic google/baidu(4in3)
t2t_vit_t_19 82.44 96.08 224 0.9 bicubic google/baidu(mier)
t2t_vit_t_24 82.55 96.07 224 0.9 bicubic google/baidu(6vxc)
t2t_vit_14_384 83.34 96.50 384 1.0 bicubic google/baidu(r685)
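
The Image Size and Crop_pct columns are usually read together. A common evaluation convention (used by timm-style pipelines, and assumed here rather than taken from the PaddleViT source) is to resize the image so that the final center crop covers a crop_pct fraction of its side. A minimal sketch of that arithmetic, where `eval_resize_size` is a hypothetical helper name:

```python
import math


def eval_resize_size(image_size: int, crop_pct: float) -> int:
    """Side length to resize to before center-cropping to image_size.

    Assumes the common convention resize = floor(image_size / crop_pct);
    check the dataset code of each model for the exact rule it uses.
    """
    if image_size <= 0 or crop_pct <= 0:
        raise ValueError("image_size and crop_pct must be positive")
    return int(math.floor(image_size / crop_pct))


for size, pct in [(224, 0.875), (224, 0.9), (384, 1.0)]:
    print(f"image_size={size} crop_pct={pct} -> resize to {eval_resize_size(size, pct)}")
```

Under this assumption, a 224 crop with crop_pct 0.875 corresponds to a 256-pixel resize, while crop_pct 1.0 means the image is resized directly to the crop size.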

Object Detection

Model Backbone box_mAP Link
DETR ResNet50 42.0 google/baidu(n5gk)
DETR ResNet101 43.5 google/baidu(bxz2)

Semantic Segmentation

Pascal Context

Model Backbone Batch_size mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
SETR_Naive ViT_Large 16 52.06 52.57 google/baidu(owoj) google/baidu(xdb8) config
SETR_PUP ViT_Large 16 53.90 54.53 google/baidu(owoj) google/baidu(6sji) config
SETR_MLA ViT_Large 8 54.39 55.16 google/baidu(owoj) google/baidu(wora) config
SETR_MLA ViT_Large 16 55.01 55.87 google/baidu(owoj) google/baidu(76h2) config

Cityscapes

Model Backbone Batch_size Iteration mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
SETR_Naive ViT_Large 8 40k 76.71 79.03 google/baidu(owoj) google/baidu(g7ro) config
SETR_Naive ViT_Large 8 80k 77.31 79.43 google/baidu(owoj) google/baidu(wn6q) config
SETR_PUP ViT_Large 8 40k 77.92 79.63 google/baidu(owoj) google/baidu(zmoi) config
SETR_PUP ViT_Large 8 80k 78.81 80.43 google/baidu(owoj) baidu(f793) config
SETR_MLA ViT_Large 8 40k 76.70 78.96 google/baidu(owoj) baidu(qaiw) config
SETR_MLA ViT_Large 8 80k 77.26 79.27 google/baidu(owoj) baidu(6bgj) config

ADE20K

Model Backbone Batch_size Iteration mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
SETR_Naive ViT_Large 16 160k 47.57 48.12 google/baidu(owoj) baidu(lugq) config
SETR_PUP ViT_Large 16 160k 49.12 49.51 google/baidu(owoj) baidu(udgs) config
SETR_MLA ViT_Large 8 160k 47.80 49.34 google/baidu(owoj) baidu(mrrv) config
DPT ViT_Large 16 160k 47.21 - google/baidu(owoj) baidu(ts7h) config
Segmenter ViT_Tiny 16 160k 38.45 - TODO baidu(1k97) config
Segmenter ViT_Small 16 160k 46.07 - TODO baidu(i8nv) config
Segmenter ViT_Base 16 160k 49.08 - TODO baidu(hxrl) config
Segmenter ViT_Large 16 160k 51.82 - TODO baidu(wdz6) config
Segmenter_Linear DeiT_Base 16 160k 47.34 - TODO baidu(5dpv) config
Segmenter DeiT_Base 16 160k 49.27 - TODO baidu(3kim) config
Segformer MIT-B0 16 160k 38.37 - TODO baidu(ges9) config
Segformer MIT-B1 16 160k 42.20 - TODO baidu(t4n4) config
Segformer MIT-B2 16 160k 46.38 - TODO baidu(h5ar) config
Segformer MIT-B3 16 160k 48.35 - TODO baidu(g9n4) config
Segformer MIT-B4 16 160k 49.01 - TODO baidu(e4xw) config
Segformer MIT-B5 16 160k 49.73 - TODO baidu(uczo) config
UperNet Swin_Tiny 16 160k 44.90 45.37 - baidu(lkhg) config
UperNet Swin_Small 16 160k 47.88 48.90 - baidu(vvy1) config
UperNet Swin_Base 16 160k 48.59 49.04 - baidu(y040) config

Trans10kV2

Model Backbone Batch_size Iteration mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
Trans2seg_Medium Resnet50c 16 80k 72.25 - google/baidu(4dd5) google/baidu(qcb0) config
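
The mIoU columns in the segmentation tables report mean intersection-over-union; "ss" and "ms+flip" conventionally denote single-scale testing and multi-scale testing with horizontal flips. The sketch below shows the basic metric on flattened label lists. It is an illustration only: `mean_iou` is a hypothetical helper, the ignore label 255 is an assumption, and real evaluations accumulate the counts over the whole validation set rather than a single image.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int],
             num_classes: int, ignore_index: int = 255) -> float:
    """Mean IoU in percent, averaged over classes that appear in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # pixels with the ignore label do not count
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # a wrong pixel enlarges the union of both classes involved
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    if not ious:
        raise ValueError("no valid pixels to evaluate")
    return 100.0 * sum(ious) / len(ious)


# Toy example: 6 pixels, 3 classes, last pixel ignored.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
print(f"mIoU = {mean_iou(pred, target, num_classes=3):.2f}")
```

In the toy example the per-class IoUs are 1/2, 2/3, and 1, so the mean is about 72.22.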

GAN

Model FID Image Size Crop_pct Interpolation Link
styleformer_cifar10 2.73 32 1.0 lanczos google/baidu(ztky)
styleformer_stl10 15.65 48 1.0 lanczos google/baidu(i973)
styleformer_celeba 3.32 64 1.0 lanczos google/baidu(fh5s)
styleformer_lsun 9.68 128 1.0 lanczos google/baidu(158t)

*The results are evaluated on the CIFAR-10, STL-10, CelebA, and LSUN Church datasets, using the fid50k_full metric.

Quick Demo for Image Classification

To use a model with pretrained weights, go to the specific subfolder, e.g., /image_classification/ViT/, then download the .pdparams weight file and change the related file paths in the following Python scripts. The model config files are located in ./configs/.

Assuming the downloaded weight file is stored in ./vit_base_patch16_224.pdparams, use the vit_base_patch16_224 model in Python as follows:

import paddle
from config import get_config
from visual_transformer import build_vit as build_model
# config files in ./configs/
config = get_config('./configs/vit_base_patch16_224.yaml')
# build model
model = build_model(config)
# load pretrained weights from the downloaded .pdparams file
model_state_dict = paddle.load('./vit_base_patch16_224.pdparams')
model.set_dict(model_state_dict)

:robot: See the README file in each model folder for detailed usage.

Evaluation

To evaluate the ViT model's performance on ImageNet2012 with a single GPU, run the following script from the command line:

sh run_eval.sh

or

CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
    -cfg='./configs/vit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -batch_size=16 \
    -data_path='/dataset/imagenet' \
    -eval \
    -pretrained='./vit_base_patch16_224'

Run evaluation using multiple GPUs:

sh run_eval_multi.sh

or

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
    -cfg='./configs/vit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -batch_size=16 \
    -data_path='/dataset/imagenet' \
    -eval \
    -pretrained='./vit_base_patch16_224'
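
Classification results are reported as top-1 and top-5 accuracy, the standard ImageNet metrics. The sketch below shows how such a number can be computed from per-class scores in plain Python. It is an illustration only, not the code used by the evaluation scripts, and `topk_accuracy` is a hypothetical helper name.

```python
from typing import Sequence


def topk_accuracy(scores: Sequence[Sequence[float]],
                  labels: Sequence[int], k: int = 1) -> float:
    """Percentage of samples whose true label is among the k highest scores."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # indices of the k largest scores in this row
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in topk:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example: 3 samples, 4 classes.
scores = [
    [0.1, 0.7, 0.1, 0.1],  # highest score: class 1
    [0.4, 0.3, 0.2, 0.1],  # highest score: class 0
    [0.2, 0.2, 0.5, 0.1],  # highest score: class 2
]
labels = [1, 1, 3]
print(f"top-1: {topk_accuracy(scores, labels, k=1):.2f}")
print(f"top-2: {topk_accuracy(scores, labels, k=2):.2f}")
```

In the toy example only the first sample is correct at k=1, and the second becomes correct at k=2 because its true class has the second-highest score.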

Training

To train the ViT model on ImageNet2012 with a single GPU, run the following script from the command line:

sh run_train.sh

or

CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
    -cfg='./configs/vit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -batch_size=32 \
    -data_path='/dataset/imagenet'

Run training using multiple GPUs:

sh run_train_multi.sh

or

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
    -cfg='./configs/vit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -batch_size=16 \
    -data_path='/dataset/imagenet'

Features

State-of-the-art

  • State-of-the-art transformer models for multiple CV tasks
  • State-of-the-art data processing and training methods
  • We keep pushing it forward.

Easy-to-use tools

  • Easy configs for model variants
  • Modular design for utility functions and tools
  • Low barrier for educators and practitioners
  • Unified framework for all the models

Easily customizable to your needs

  • Examples for each model to reproduce the results
  • Model implementations are exposed for you to customize
  • Model files can be used independently for quick experiments

High Performance

  • DDP with a single GPU per process.
  • Mixed-precision support (coming soon)
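
"DDP with a single GPU per process" means that each process drives one GPU and works on its own shard of the dataset. The sketch below illustrates one simple way to split sample indices across processes. It is a generic strided scheme under stated assumptions, not the sampler PaddleViT or PaddlePaddle uses, and `shard_indices` is a hypothetical helper name.

```python
import math
from typing import List


def shard_indices(num_samples: int, world_size: int, rank: int) -> List[int]:
    """Sample indices handled by process `rank` out of `world_size` processes.

    Indices wrap around so that every process receives the same number of
    samples, which keeps the processes in step during distributed training.
    """
    if num_samples <= 0 or world_size <= 0:
        raise ValueError("num_samples and world_size must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size  # padded length, divisible by world_size
    return [i % num_samples for i in range(rank, total, world_size)]


# Toy example: 10 samples split across 4 processes (one GPU each).
for rank in range(4):
    print(f"rank {rank}: {shard_indices(10, 4, rank)}")
```

With 10 samples and 4 processes, each process receives 3 indices, and the last two processes reuse indices 0 and 1 as padding.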

GitHub

https://github.com/BR-IDL/PaddleViT