ViT-Adapter


The official implementation of the paper “Vision Transformer Adapter for Dense Predictions”.

News

  • (2022/05/17) ViT-Adapter-L yields 60.1 box AP and 52.1 mask AP on COCO test-dev.
  • (2022/05/12) ViT-Adapter-L reaches 85.2 mIoU on the Cityscapes test set without using coarse data.
  • (2022/05/05) ViT-Adapter-L achieves the SOTA on the ADE20K val set with 60.5 mIoU!

Abstract

This work investigates a simple yet powerful adapter for the Vision Transformer (ViT). Unlike recent visual transformers that introduce vision-specific inductive biases into their architectures, the plain ViT achieves inferior performance on dense prediction tasks because it lacks image-specific prior information. To solve this issue, we propose a Vision Transformer Adapter (ViT-Adapter), which can remedy the defects of ViT and achieve performance comparable to vision-specific models by introducing inductive biases via an additional architecture. Specifically, the backbone in our framework is a vanilla transformer that can be pre-trained with multi-modal data. When fine-tuning on downstream tasks, a modality-specific adapter is used to introduce the prior information of the data and tasks into the model, making it suitable for these tasks. We verify the effectiveness of our ViT-Adapter on multiple downstream tasks, including object detection, instance segmentation, and semantic segmentation. Notably, when using HTC++, our ViT-Adapter-L yields 60.1 box AP and 52.1 mask AP on COCO test-dev, surpassing Swin-L by 1.4 box AP and 1.0 mask AP. For semantic segmentation, our ViT-Adapter-L establishes a new state of the art of 60.5 mIoU on ADE20K val. We hope that the proposed ViT-Adapter could serve as an alternative to vision-specific transformers and facilitate future research.

Method

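To make the idea concrete, the snippet below is a minimal, conceptual PyTorch sketch of the adapter pattern described in the abstract: a plain ViT backbone whose tokens are augmented, block by block, with spatial prior features produced by a small convolutional branch. This is not the official ViT-Adapter implementation; the module names (`SpatialPriorStem`, `InjectorBlock`), dimensions, and the simplified single-scale design are illustrative assumptions.

```python
# Conceptual sketch only: a vanilla ViT with an adapter branch that injects
# convolutional (spatial) prior features into the transformer tokens.
# Module names and shapes are illustrative assumptions, not the official code.
import torch
import torch.nn as nn


class SpatialPriorStem(nn.Module):
    """Small conv stem producing spatial prior tokens (assumed design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # patchify-like conv
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, 3, H, W) -> (B, N, C) token sequence of spatial features
        feats = self.stem(x)
        return feats.flatten(2).transpose(1, 2)


class InjectorBlock(nn.Module):
    """Cross-attention that injects spatial prior tokens into ViT tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(dim))  # adapter starts as identity

    def forward(self, vit_tokens, spatial_tokens):
        injected, _ = self.attn(vit_tokens, spatial_tokens, spatial_tokens)
        return vit_tokens + self.gamma * injected


class ViTWithAdapter(nn.Module):
    """Vanilla ViT encoder blocks interleaved with adapter injections."""

    def __init__(self, dim: int = 384, depth: int = 12):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True) for _ in range(depth)]
        )
        self.spatial_prior = SpatialPriorStem(dim)
        self.injectors = nn.ModuleList([InjectorBlock(dim) for _ in range(depth)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        spatial_tokens = self.spatial_prior(x)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        for block, injector in zip(self.blocks, self.injectors):
            tokens = injector(tokens, spatial_tokens)  # add image priors
            tokens = block(tokens)                     # plain ViT block
        return tokens


if __name__ == "__main__":
    model = ViTWithAdapter()
    out = model(torch.randn(1, 3, 224, 224))
    print(out.shape)  # torch.Size([1, 196, 384])
```

Zero-initializing the gating vector in the injector makes the adapter start as an identity mapping, so the pre-trained ViT weights are undisturbed at the beginning of fine-tuning; this is a common trick for add-on branches and is assumed here for stability.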

SOTA Model Zoo

COCO test-dev

| Method | Framework | Pre-train | Lr schd | box AP | mask AP | #Param |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-L | HTC++ | BEiT-L | 3x | 58.5 | 50.8 | 401M |
| ViT-Adapter-L (MS) | HTC++ | BEiT-L | 3x | 60.1 | 52.1 | 401M |

ADE20K val

| Method | Framework | Pre-train | Iters | Crop Size | mIoU | +MS | #Param |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-L | UperNet | BEiT-L | 160k | 640 | 58.0 | 58.4 | 451M |
| ViT-Adapter-L | Mask2Former | BEiT-L | 160k | 640 | 58.3 | 59.0 | 568M |
| ViT-Adapter-L | Mask2Former | COCO-Stuff-164k | 80k | 896 | 59.4 | 60.5 | 571M |

Cityscapes val/test

| Method | Framework | Pre-train | Iters | Crop Size | mIoU (val) | mIoU +MS (val/test) | #Param |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-L | Mask2Former | Mapillary | 80k | 896 | 84.9 | 85.8/85.2 | 571M |

COCO-Stuff-10k

| Method | Framework | Pre-train | Iters | Crop Size | mIoU | +MS | #Param |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-L | UperNet | BEiT-L | 80k | 512 | 51.0 | 51.4 | 451M |
| ViT-Adapter-L | Mask2Former | BEiT-L | 40k | 512 | 53.2 | 54.2 | 568M |

Pascal Context

| Method | Framework | Pre-train | Iters | Crop Size | mIoU | +MS | #Param |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-L | UperNet | BEiT-L | 80k | 480 | 67.0 | 67.5 | 451M |
| ViT-Adapter-L | Mask2Former | BEiT-L | 40k | 480 | 67.8 | 68.2 | 568M |

Regular Model Zoo

COCO mini-val

Baseline Detectors

| Method | Framework | Pre-train | Lr schd | Aug | box AP | mask AP | #Param |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-T | Mask R-CNN | DeiT | 3x | Yes | 46.0 | 41.0 | 28M |
| ViT-Adapter-S | Mask R-CNN | DeiT | 3x | Yes | 48.2 | 42.8 | 48M |
| ViT-Adapter-B | Mask R-CNN | DeiT | 3x | Yes | 49.6 | 43.6 | 120M |
| ViT-Adapter-L | Mask R-CNN | DeiT | 3x | Yes | 50.9 | 44.8 | 348M |

Advanced Detectors

| Method | Framework | Pre-train | Lr schd | Aug | box AP | mask AP | #Param |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-S | Cascade Mask R-CNN | DeiT | 3x | Yes | 51.5 | TODO | 86M |
| ViT-Adapter-S | ATSS | DeiT | 3x | Yes | 49.6 | - | 36M |
| ViT-Adapter-S | GFL | DeiT | 3x | Yes | 50.0 | - | 36M |
| ViT-Adapter-S | Sparse R-CNN | DeiT | 3x | Yes | 48.1 | - | 110M |
| ViT-Adapter-B | Upgraded Mask R-CNN | MAE | 25ep | LSJ | 50.3 | 44.7 | 122M |
| ViT-Adapter-B | Upgraded Mask R-CNN | MAE | 50ep | LSJ | 50.8 | 45.1 | 122M |
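For the detection entries above, the repository builds on MMDetection, so inference should follow the standard mmdet 2.x API once a config and checkpoint are available. The sketch below is a hedged example: both file paths are placeholders, not actual file names from this repository.

```python
# Hedged sketch: run a ViT-Adapter detector via the standard MMDetection 2.x API.
# Both paths are placeholders; substitute the real config/checkpoint from the repo.
from mmdet.apis import init_detector, inference_detector

config_file = "path/to/vit_adapter_detection_config.py"  # placeholder
checkpoint_file = "path/to/vit_adapter_detector.pth"     # placeholder

model = init_detector(config_file, checkpoint_file, device="cuda:0")
result = inference_detector(model, "demo.jpg")  # bbox (and mask) predictions
```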

ADE20K val

| Method | Framework | Pre-train | Iters | Crop Size | mIoU | +MS | #Param |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-T | UperNet | DeiT | 160k | 512 | 42.6 | 43.6 | 36M |
| ViT-Adapter-S | UperNet | DeiT | 160k | 512 | 46.6 | 47.4 | 58M |
| ViT-Adapter-B | UperNet | DeiT | 160k | 512 | 48.1 | 49.2 | 134M |
| ViT-Adapter-B | UperNet | AugReg | 160k | 512 | 51.9 | 52.5 | 134M |
| ViT-Adapter-L | UperNet | AugReg | 160k | 512 | 53.4 | 54.4 | 364M |
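Similarly, the segmentation models above are built on MMSegmentation, so the usual mmseg 0.x inference API should apply. Again a hedged sketch with placeholder paths rather than actual file names.

```python
# Hedged sketch: run a ViT-Adapter segmentor via the standard MMSegmentation 0.x API.
# Both paths are placeholders; substitute the real config/checkpoint from the repo.
from mmseg.apis import init_segmentor, inference_segmentor

config_file = "path/to/vit_adapter_segmentation_config.py"  # placeholder
checkpoint_file = "path/to/vit_adapter_segmentor.pth"       # placeholder

model = init_segmentor(config_file, checkpoint_file, device="cuda:0")
result = inference_segmentor(model, "demo.jpg")  # list of per-pixel class-id maps
```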

Catalog

  • Detection checkpoints
  • Segmentation checkpoints
  • Model code
  • Detection logs
  • Segmentation logs
  • Initialization

Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

@article{chen2021vitadapter,
  title={Vision Transformer Adapter for Dense Predictions},
  author={Chen, Zhe and Duan, Yuchen and Wang, Wenhai and He, Junjun and Lu, Tong and Dai, Jifeng and Qiao, Yu},
  journal={arXiv preprint arXiv:2205.08534},
  year={2022}
}

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.
