Compact Transformers

By Ali Hassani[1], Steven Walton[1], Nikhil Shah[1], Abulikemu Abuduweili[1], Jiachen Li[1,2], and Humphrey Shi[1,2,3]

*Ali Hassani and Steven Walton contributed equally

In association with SHI Lab @ University of Oregon[1] and UIUC[2], and Picsart AI Research (PAIR)[3]


With the rise of Transformers as the standard for language
processing, and their advancements in computer vision, along with their
unprecedented size and amounts of training data, many have come to believe
that they are not suitable for small sets of data. This trend leads
to great concerns, including but not limited to: limited availability of
data in certain scientific domains and the exclusion of those with limited
resources from research in the field. In this paper, we dispel the myth that
transformers are "data-hungry" and therefore can only be applied to large
sets of data. We show for the first time that with the right size
and tokenization, transformers can perform head-to-head with state-of-the-art
CNNs on small datasets. Our model eliminates the requirement for class
tokens and positional embeddings through a novel sequence pooling
strategy and the use of convolutions. We show that compared to CNNs,
our compact transformers have fewer parameters and MACs, while obtaining
similar accuracies. Our method is flexible in terms of model size, and can
have as little as 0.28M parameters while achieving reasonable results. It can
reach an accuracy of 95.29% when training from scratch on CIFAR-10, which is
comparable with modern CNN-based approaches, and a significant improvement
over previous Transformer-based models. Our simple and compact design
democratizes transformers by making them accessible to those equipped
with basic computing resources and/or dealing with important small datasets.

ViT-Lite: Lightweight ViT

Different from ViT, we show that an image
is not always worth 16x16 words
and that the image patch size matters.
Transformers are not in fact "data-hungry," as the ViT authors proposed, and
smaller patching can be used to train efficiently on smaller datasets.
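To illustrate why patch size matters on small images: the token count grows with the inverse square of the patch size, so ViT's default 16x16 patching leaves a 32x32 CIFAR-10 image with almost no sequence to attend over. A minimal sketch (the helper function is ours, not part of the repo):

```python
# Token counts for non-overlapping patches on a square image.
# Smaller patches give the transformer a longer, finer-grained
# sequence, which is what lets ViT-Lite train well on small datasets.
def num_patches(image_size: int, patch_size: int) -> int:
    # Non-overlapping patches: (H / P) * (W / P) tokens.
    return (image_size // patch_size) ** 2

print(num_patches(224, 16))  # ViT's "16x16 words" on ImageNet -> 196 tokens
print(num_patches(32, 16))   # the same patching on CIFAR-10  -> only 4 tokens
print(num_patches(32, 4))    # ViT-Lite's 4x4 patches         -> 64 tokens
```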

CVT: Compact Vision Transformers

Compact Vision Transformers better utilize information with Sequence Pooling post
encoder, eliminating the need for the class token while achieving better accuracy.

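Sequence pooling itself is small: a learned linear layer scores each encoder output token, and the softmax of those scores weights a sum over the sequence. A minimal PyTorch sketch (module and variable names are ours, assuming a 256-dimensional embedding; not the repo's exact code):

```python
import torch
import torch.nn as nn

class SeqPool(nn.Module):
    """Learned attention pooling over encoder tokens, replacing the
    class token (illustrative sketch of the sequence pooling idea)."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Linear(dim, 1)  # one importance score per token

    def forward(self, x):                          # x: (batch, seq_len, dim)
        w = self.attn(x).softmax(dim=1)            # (batch, seq_len, 1)
        return (w.transpose(1, 2) @ x).squeeze(1)  # weighted sum -> (batch, dim)

pooled = SeqPool(dim=256)(torch.randn(8, 64, 256))
print(pooled.shape)  # torch.Size([8, 256])
```

The pooled vector then feeds the classifier head directly, so no extra token has to be carried through the encoder.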
CCT: Compact Convolutional Transformers

Compact Convolutional Transformers not only use the sequence pooling but also
replace the patch embedding with a convolutional embedding, allowing for better
inductive bias and making positional embeddings optional. CCT achieves better
accuracy than ViT-Lite and CVT and increases the flexibility of the input
parameters.

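The convolutional tokenizer can be sketched as a conv + pool stage whose flattened feature map becomes the token sequence. The layer sizes below are illustrative, not the repo's exact configuration; the point is that overlapping receptive fields add a convolutional inductive bias and, since the token grid is computed from the input, variable image sizes come for free:

```python
import torch
import torch.nn as nn

# Illustrative convolutional tokenizer in the spirit of CCT:
# a conv + pool stage replaces ViT's patch embedding.
tokenizer = nn.Sequential(
    nn.Conv2d(3, 128, kernel_size=3, stride=1, padding=1, bias=False),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

imgs = torch.randn(8, 3, 32, 32)           # CIFAR-10 sized batch
feats = tokenizer(imgs)                    # (8, 128, 16, 16)
tokens = feats.flatten(2).transpose(1, 2)  # (8, 256 tokens, 128 dim)
print(tokens.shape)  # torch.Size([8, 256, 128])
```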

How to run

Install locally

Please make sure you're using the following PyTorch version:


Refer to PyTorch's Getting Started page for detailed instructions.

Using Docker

There's also a Dockerfile, which builds off of the PyTorch image (requires CUDA).


We recommend starting with our faster version (CCT-2/3x2), which can be run with the
following command. If you are running on a CPU, we recommend this model.

python \
       --dataset cifar10 \
       --model cct_2 \
       --conv-size 3 \
       --conv-layers 2 \

If you would like to run our best running models (CCT-6/3x1 or CCT-7/3x1)
with CIFAR-10 on your machine, please use the following command.

python \
       --dataset cifar10 \
       --model cct_6 \
       --conv-size 3 \
       --conv-layers 1 \
       --warmup 10 \
       --batch-size 64 \
       --checkpoint-path /path/to/checkpoint.pth \


You can use the evaluation script to evaluate the performance of a checkpoint.

python \
       --dataset cifar10 \
       --model cct_6 \
       --conv-size 3 \
       --conv-layers 1 \
       --checkpoint-path /path/to/checkpoint.pth \


The model type can be read in the format L/PxC, where L is the number of
transformer layers, P is the patch/convolution size, and C (CCT only) is the
number of convolutional layers.
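For example, CCT-7/3x1 is a 7-layer transformer with a 3x3 convolutional tokenizer of one convolutional layer. The naming can be parsed mechanically (a hypothetical helper, not part of the repo):

```python
import re

def parse_model_type(spec: str):
    """Parse a model type like '7/3x1' (CCT) or '7/4' (ViT-Lite/CVT)
    into (layers, patch_or_conv_size, conv_layers)."""
    m = re.fullmatch(r"(\d+)/(\d+)(?:x(\d+))?", spec)
    layers, size, conv = m.groups()
    return int(layers), int(size), int(conv) if conv else None

print(parse_model_type("7/3x1"))  # (7, 3, 1)
print(parse_model_type("7/4"))    # (7, 4, None)
```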

CIFAR-10 and CIFAR-100

Model     Type   Epochs  CIFAR-10  CIFAR-100  # Params  MACs
ViT-Lite  7/4    200     91.38%    69.75%     3.717M    0.239G
ViT-Lite  6/4    200     90.94%    69.20%     3.191M    0.205G
CVT       7/4    200     92.43%    73.01%     3.717M    0.236G
CVT       6/4    200     92.58%    72.25%     3.190M    0.202G
CCT       2/3x2  200     89.17%    66.90%     0.284M    0.033G
CCT       4/3x2  200     91.45%    70.46%     0.482M    0.046G
CCT       6/3x2  200     93.56%    74.47%     3.327M    0.241G
CCT       7/3x2  200     93.83%    74.92%     3.853M    0.275G
CCT       7/3x1  200     94.78%    77.05%     3.760M    0.947G
CCT       6/3x1  200     94.81%    76.71%     3.168M    0.813G
CCT       6/3x1  500     95.29%    77.31%     3.168M    0.813G

Click to download checkpoints.


ImageNet

Model  Type     Resolution  Epochs  Top-1 Accuracy  # Params  MACs
ViT    12/16    384         300     77.91%          86.8M     17.6G
CCT    14t/7x2  224         310     80.04%          22.29M    5.11G
CCT    16t/7x2  224         310     80.28%          25.32M    5.69G

Please note that we used Ross Wightman's ImageNet training script to train these.

NLP Results

Model  Kernel size  AGNews  TREC    # Params
CCT-2  1            93.45%  91.00%  0.238M
CCT-2  2            93.51%  91.80%  0.276M
CCT-2  4            93.80%  91.00%  0.353M
CCT-4  1            93.55%  91.80%  0.436M
CCT-4  2            93.24%  93.60%  0.475M
CCT-4  4            93.09%  93.00%  0.551M
CCT-6  1            93.78%  91.60%  3.237M
CCT-6  2            93.33%  92.20%  3.313M
CCT-6  4            92.95%  92.80%  3.467M
More models are being uploaded.


Citation

@article{hassani2021escaping,
	title        = {Escaping the Big Data Paradigm with Compact Transformers},
	author       = {Ali Hassani and Steven Walton and Nikhil Shah and Abulikemu Abuduweili and Jiachen Li and Humphrey Shi},
	year         = 2021,
	url          = {},
	eprint       = {2104.05704},
	archiveprefix = {arXiv},
	primaryclass = {cs.CV}
}