ViT-Adapter


The official implementation of the paper “Vision Transformer Adapter for Dense Predictions”.

News

  • (2022/05/17) ViT-Adapter-L yields 60.1 box AP and 52.1 mask AP on COCO test-dev.
  • (2022/05/12) ViT-Adapter-L reaches 85.2 mIoU on the Cityscapes test set without using coarse data.
  • (2022/05/05) ViT-Adapter-L achieves the SOTA on the ADE20K val set with 60.5 mIoU!

Abstract

This work investigates a simple yet powerful adapter for the Vision Transformer (ViT). Unlike recent visual transformers that introduce vision-specific inductive biases into their architectures, the plain ViT achieves inferior performance on dense prediction tasks because it lacks image-specific prior information. To solve this issue, we propose a Vision Transformer Adapter (ViT-Adapter), which can remedy the defects of ViT and achieve performance comparable to vision-specific models by introducing inductive biases via an additional architecture. Specifically, the backbone in our framework is a vanilla transformer that can be pre-trained with multi-modal data. When fine-tuning on downstream tasks, a modality-specific adapter is used to introduce the prior information of the data and tasks into the model, making it suitable for these tasks. We verify the effectiveness of our ViT-Adapter on multiple downstream tasks, including object detection, instance segmentation, and semantic segmentation. Notably, when using HTC++, our ViT-Adapter-L yields 60.1 box AP and 52.1 mask AP on COCO test-dev, surpassing Swin-L by 1.4 box AP and 1.0 mask AP. For semantic segmentation, our ViT-Adapter-L establishes a new state of the art of 60.5 mIoU on ADE20K val. We hope that the proposed ViT-Adapter could serve as an alternative to vision-specific transformers and facilitate future research.

Method

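To make the idea concrete, the snippet below is a minimal, conceptual PyTorch sketch of the adapter pattern described in the abstract: a plain ViT backbone whose tokens are augmented, block by block, with spatial prior features produced by a small convolutional branch. This is not the official ViT-Adapter implementation; the module names (`SpatialPriorStem`, `InjectorBlock`), dimensions, and the simplified single-scale design are illustrative assumptions.

```python
# Conceptual sketch only: a vanilla ViT with an adapter branch that injects
# convolutional (spatial) prior features into the transformer tokens.
# Module names and shapes are illustrative assumptions, not the official code.
import torch
import torch.nn as nn


class SpatialPriorStem(nn.Module):
    """Small conv stem producing spatial prior tokens (assumed design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # patchify-like conv
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, 3, H, W) -> (B, N, C) token sequence of spatial features
        feats = self.stem(x)
        return feats.flatten(2).transpose(1, 2)


class InjectorBlock(nn.Module):
    """Cross-attention that injects spatial prior tokens into ViT tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(dim))  # adapter starts as identity

    def forward(self, vit_tokens, spatial_tokens):
        injected, _ = self.attn(vit_tokens, spatial_tokens, spatial_tokens)
        return vit_tokens + self.gamma * injected


class ViTWithAdapter(nn.Module):
    """Vanilla ViT encoder blocks interleaved with adapter injections."""

    def __init__(self, dim: int = 384, depth: int = 12):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True) for _ in range(depth)]
        )
        self.spatial_prior = SpatialPriorStem(dim)
        self.injectors = nn.ModuleList([InjectorBlock(dim) for _ in range(depth)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        spatial_tokens = self.spatial_prior(x)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        for block, injector in zip(self.blocks, self.injectors):
            tokens = injector(tokens, spatial_tokens)  # add image priors
            tokens = block(tokens)                     # plain ViT block
        return tokens


if __name__ == "__main__":
    model = ViTWithAdapter()
    out = model(torch.randn(1, 3, 224, 224))
    print(out.shape)  # torch.Size([1, 196, 384])
```

Zero-initializing the gating vector in the injector makes the adapter start as an identity mapping, so the pre-trained ViT weights are undisturbed at the beginning of fine-tuning; this is a common trick for add-on branches and is assumed here for stability.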

SOTA Model Zoo

COCO test-dev

| Method | Framework | Pre-train | Lr schd | box AP | mask AP | #Param |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-L | HTC++ | BEiT-L | 3x | 58.5 | 50.8 | 401M |
| ViT-Adapter-L (MS) | HTC++ | BEiT-L | 3x | 60.1 | 52.1 | 401M |

ADE20K val

| Method | Framework | Pre-train | Iters | Crop Size | mIoU | +MS | #Param |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-L | UperNet | BEiT-L | 160k | 640 | 58.0 | 58.4 | 451M |
| ViT-Adapter-L | Mask2Former | BEiT-L | 160k | 640 | 58.3 | 59.0 | 568M |
| ViT-Adapter-L | Mask2Former | COCO-Stuff-164k | 80k | 896 | 59.4 | 60.5 | 571M |

Cityscapes val/test

| Method | Framework | Pre-train | Iters | Crop Size | mIoU (val) | mIoU +MS (val/test) | #Param |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-L | Mask2Former | Mapillary | 80k | 896 | 84.9 | 85.8/85.2 | 571M |

COCO-Stuff-10k

| Method | Framework | Pre-train | Iters | Crop Size | mIoU | +MS | #Param |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-L | UperNet | BEiT-L | 80k | 512 | 51.0 | 51.4 | 451M |
| ViT-Adapter-L | Mask2Former | BEiT-L | 40k | 512 | 53.2 | 54.2 | 568M |

Pascal Context

| Method | Framework | Pre-train | Iters | Crop Size | mIoU | +MS | #Param |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-L | UperNet | BEiT-L | 80k | 480 | 67.0 | 67.5 | 451M |
| ViT-Adapter-L | Mask2Former | BEiT-L | 40k | 480 | 67.8 | 68.2 | 568M |

Regular Model Zoo

COCO mini-val

Baseline Detectors

| Method | Framework | Pre-train | Lr schd | Aug | box AP | mask AP | #Param |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-T | Mask R-CNN | DeiT | 3x | Yes | 46.0 | 41.0 | 28M |
| ViT-Adapter-S | Mask R-CNN | DeiT | 3x | Yes | 48.2 | 42.8 | 48M |
| ViT-Adapter-B | Mask R-CNN | DeiT | 3x | Yes | 49.6 | 43.6 | 120M |
| ViT-Adapter-L | Mask R-CNN | DeiT | 3x | Yes | 50.9 | 44.8 | 348M |

Advanced Detectors

| Method | Framework | Pre-train | Lr schd | Aug | box AP | mask AP | #Param |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-S | Cascade Mask R-CNN | DeiT | 3x | Yes | 51.5 | TODO | 86M |
| ViT-Adapter-S | ATSS | DeiT | 3x | Yes | 49.6 | - | 36M |
| ViT-Adapter-S | GFL | DeiT | 3x | Yes | 50.0 | - | 36M |
| ViT-Adapter-S | Sparse R-CNN | DeiT | 3x | Yes | 48.1 | - | 110M |
| ViT-Adapter-B | Upgraded Mask R-CNN | MAE | 25ep | LSJ | 50.3 | 44.7 | 122M |
| ViT-Adapter-B | Upgraded Mask R-CNN | MAE | 50ep | LSJ | 50.8 | 45.1 | 122M |
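For the detection entries above, the repository builds on MMDetection, so inference should follow the standard mmdet 2.x API once a config and checkpoint are available. The sketch below is a hedged example: both file paths are placeholders, not actual file names from this repository.

```python
# Hedged sketch: run a ViT-Adapter detector via the standard MMDetection 2.x API.
# Both paths are placeholders; substitute the real config/checkpoint from the repo.
from mmdet.apis import init_detector, inference_detector

config_file = "path/to/vit_adapter_detection_config.py"  # placeholder
checkpoint_file = "path/to/vit_adapter_detector.pth"     # placeholder

model = init_detector(config_file, checkpoint_file, device="cuda:0")
result = inference_detector(model, "demo.jpg")  # bbox (and mask) predictions
```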

ADE20K val

| Method | Framework | Pre-train | Iters | Crop Size | mIoU | +MS | #Param |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-T | UperNet | DeiT | 160k | 512 | 42.6 | 43.6 | 36M |
| ViT-Adapter-S | UperNet | DeiT | 160k | 512 | 46.6 | 47.4 | 58M |
| ViT-Adapter-B | UperNet | DeiT | 160k | 512 | 48.1 | 49.2 | 134M |
| ViT-Adapter-B | UperNet | AugReg | 160k | 512 | 51.9 | 52.5 | 134M |
| ViT-Adapter-L | UperNet | AugReg | 160k | 512 | 53.4 | 54.4 | 364M |
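Similarly, the segmentation models above are built on MMSegmentation, so the usual mmseg 0.x inference API should apply. Again a hedged sketch with placeholder paths rather than actual file names.

```python
# Hedged sketch: run a ViT-Adapter segmentor via the standard MMSegmentation 0.x API.
# Both paths are placeholders; substitute the real config/checkpoint from the repo.
from mmseg.apis import init_segmentor, inference_segmentor

config_file = "path/to/vit_adapter_segmentation_config.py"  # placeholder
checkpoint_file = "path/to/vit_adapter_segmentor.pth"       # placeholder

model = init_segmentor(config_file, checkpoint_file, device="cuda:0")
result = inference_segmentor(model, "demo.jpg")  # list of per-pixel class-id maps
```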

Catalog

  • Detection checkpoints
  • Segmentation checkpoints
  • Model code
  • Detection logs
  • Segmentation logs
  • Initialization

Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

@article{chen2021vitadapter,
  title={Vision Transformer Adapter for Dense Predictions},
  author={Chen, Zhe and Duan, Yuchen and Wang, Wenhai and He, Junjun and Lu, Tong and Dai, Jifeng and Qiao, Yu},
  journal={arXiv preprint arXiv:2205.08534},
  year={2022}
}

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.
