Global Context Vision Transformer (GC ViT)

This repository is the official PyTorch implementation of Global Context Vision Transformers.

Global Context Vision Transformers, by Ali Hatamizadeh, Hongxu (Danny) Yin, Jan Kautz, and Pavlo Molchanov.

GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks. On the ImageNet-1K classification dataset, the tiny, small and base variants of GC ViT with 28M, 51M and 90M parameters achieve 83.2, 83.9 and 84.4 Top-1 accuracy, respectively, surpassing comparably-sized prior art such as the CNN-based ConvNeXt and the ViT-based Swin Transformer by a large margin. Pre-trained GC ViT backbones consistently outperform prior work, sometimes by large margins, on the downstream tasks of object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets.


The architecture of GC ViT is demonstrated in the following:

[Figure: overall architecture of GC ViT]

Updates

  1. GC ViT model, training and validation scripts have been released for ImageNet-1K classification.
  2. Pre-trained model checkpoints will be released soon.


GC ViT leverages global context self-attention modules, combined with local self-attention, to effectively yet efficiently model both long- and short-range spatial interactions, without the need for expensive operations such as computing attention masks or shifting local windows.
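
To make the contrast with shifted-window attention concrete, here is a minimal sketch (not the official implementation; the class, shapes, and the window-ordering assumption in the comments are illustrative) of how one attention module can serve both branches, with the global branch swapping in pre-computed global query tokens shared across all local windows:

import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Local window attention that can optionally consume shared global queries."""
    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.window_size = window_size
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, q_global=None):
        # x: (B * num_windows, N, C) with N = window_size**2 tokens per window
        B_, N, C = x.shape
        assert N == self.window_size ** 2
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B_, heads, N, head_dim)
        if q_global is not None:
            # Global branch: the same global query tokens, extracted once per
            # image, are reused by every local window of that image. Assumes
            # windows are ordered image-by-image; q_global: (B, heads, N, head_dim).
            q = q_global.repeat_interleave(B_ // q_global.shape[0], dim=0)
        attn = (q * self.scale) @ k.transpose(-2, -1)   # (B_, heads, N, N)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        return self.proj(out)

# Usage: one image split into 4 windows of 49 tokens each.
attn = WindowAttention(dim=96, num_heads=3, window_size=7)
x = torch.randn(4, 49, 96)           # (B * num_windows, N, C)
q_g = torch.randn(1, 3, 49, 32)      # shared global queries
local_out = attn(x)                  # local self-attention
global_out = attn(x, q_global=q_g)   # global-query attention

No attention mask or window shift is needed: cross-window interaction comes entirely from the shared global queries.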


Results on ImageNet

ImageNet-1K Pretrained Models

Name       Acc@1 (%)   Resolution   #Params (M)   FLOPs (G)   Performance Summary   Tensorboard   Download
GC ViT-T   83.2        224×224      28            4.7         summary               tensorboard   model
GC ViT-S   83.9        224×224      51            8.5         summary               tensorboard   model
GC ViT-B   84.4        224×224      90            14.8        summary               tensorboard   model


This repository is compatible with the NVIDIA PyTorch container (nvcr >= 21.06), which can be obtained from the NVIDIA NGC catalog.
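
For example, a matching container can be pulled as follows (the 21.06 tag shown is just the minimum compatible version; newer tags should also work):

docker pull nvcr.io/nvidia/pytorch:21.06-py3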

The dependencies can be installed by running:

pip install -r requirements.txt

Data Preparation

Please download the ImageNet dataset from its official website. The training and validation images should be organized into one sub-folder per class, with the following structure:

  ├── train
  │   ├── class1
  │   │   ├── img1.jpeg
  │   │   ├── img2.jpeg
  │   │   └── ...
  │   ├── class2
  │   │   ├── img3.jpeg
  │   │   └── ...
  │   └── ...
  └── val
      ├── class1
      │   ├── img4.jpeg
      │   ├── img5.jpeg
      │   └── ...
      ├── class2
      │   ├── img6.jpeg
      │   └── ...
      └── ...
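
Given this layout, the directory structure can be sanity-checked with torchvision's ImageFolder, a standard loader for per-class sub-folders (this snippet is illustrative and not part of this repository's pipeline):

from torchvision import datasets, transforms

# Minimal transforms, just enough to verify the layout loads correctly.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# ImageFolder infers class labels from the sub-folder names.
train_set = datasets.ImageFolder("<imagenet-path>/train", transform=transform)
val_set = datasets.ImageFolder("<imagenet-path>/val", transform=transform)
print(len(train_set.classes), "classes,", len(train_set), "training images")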


Training on ImageNet-1K From Scratch (Multi-GPU)

The GC ViT models can be trained from scratch on the ImageNet-1K dataset by running:

python -m torch.distributed.launch --nproc_per_node <num-of-gpus> --master_port 11223 train.py \
--config <config-file> --data_dir <imagenet-path> --batch-size <batch-size-per-gpu> --tag <run-tag>
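
For example, a single-node run on 8 GPUs might look as follows (the config file name, data path, and batch size are illustrative; use the config files shipped with this repository):

python -m torch.distributed.launch --nproc_per_node 8 --master_port 11223 train.py \
--config configs/gc_vit_tiny.yml --data_dir /data/imagenet --batch-size 128 --tag gcvit_tiny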

To resume training from a pre-trained checkpoint:

python -m torch.distributed.launch --nproc_per_node <num-of-gpus> --master_port 11223 train.py \
--resume <checkpoint-path> --config <config-file> --data_dir <imagenet-path> --batch-size <batch-size-per-gpu> --tag <run-tag>


To evaluate a pre-trained checkpoint on the ImageNet-1K validation set using a single GPU:

python validate.py --model <model-name> --checkpoint <checkpoint-path> --data_dir <imagenet-path> --batch-size <batch-size-per-gpu>


This repository is built upon the timm library.
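
Since the repository builds on timm, a checkpoint could be loaded for inference along the following lines. This is a hedged sketch: the "gc_vit_tiny" registration name, the registering import, and the checkpoint layout are assumptions, not confirmed APIs of this repository.

import torch
from timm.models import create_model

import models.gc_vit  # hypothetical import that registers the GC ViT variants with timm

model = create_model("gc_vit_tiny", num_classes=1000)  # model name is an assumption
state = torch.load("<checkpoint-path>", map_location="cpu")
model.load_state_dict(state.get("model", state))  # handle either a raw or wrapped state dict
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # expected: torch.Size([1, 1000])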


Please consider citing the GC ViT paper if it is useful for your work:

@article{hatamizadeh2022global,
    title={Global Context Vision Transformers},
    author={Ali Hatamizadeh and Hongxu Yin and Jan Kautz and Pavlo Molchanov},
    journal={arXiv preprint arXiv:2206.09959},
    year={2022}
}

