[arXiv]
FQ-ViT
This repo contains the official implementation of "FQ-ViT: Fully Quantized Vision Transformer without Retraining".
Table of Contents
- Introduction
- Getting Started
- Results on ImageNet
- Citation
Introduction
Transformer-based architectures have achieved competitive performance in various CV tasks. Compared to CNNs, Transformers usually have more parameters and higher computational costs, which presents a challenge when they are deployed to resource-constrained hardware devices.
Most existing quantization approaches are designed and tested on CNNs and lack proper handling of Transformer-specific modules. Previous work found significant accuracy degradation when quantizing the LayerNorm and Softmax of Transformer-based architectures, and therefore left LayerNorm and Softmax unquantized in floating point. We revisit these two exclusive modules of Vision Transformers and discover the reasons for the degradation. In this work, we propose FQ-ViT, the first fully quantized Vision Transformer, which contains two specific modules: Powers-of-Two Scale (PTS) and Log-Int-Softmax (LIS).
LayerNorm quantized with Powers-of-Two Scale (PTS)
The two figures below show that inter-channel variation is much more serious in Vision Transformers than in CNNs, which leads to unacceptable quantization errors with layer-wise quantization.
Taking advantage of both layer-wise and channel-wise quantization, we propose PTS for LayerNorm's quantization. The core idea of PTS is to equip different channels with different Powers-of-Two Scale factors, rather than different quantization scales.
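The following is a rough, self-contained sketch of the PTS idea, not the repository's implementation: the `pts_quantize` name, the shift-selection rule, and the tensor shapes are all simplifying assumptions made for illustration.

```python
import torch

def pts_quantize(x, n_bits=8, max_shift=4):
    """Toy Powers-of-Two Scale quantization of a LayerNorm output.

    x: activations of shape (tokens, channels).
    One layer-wise scale s is shared by all channels; channel c additionally
    gets an integer shift alpha_c, so its effective step is s * 2**alpha_c.
    (Simplified illustration, not the paper's exact search for alpha.)
    """
    qmax = 2 ** (n_bits - 1) - 1
    ch_absmax = x.abs().amax(dim=0)                    # per-channel range, (C,)
    s = ch_absmax.min().clamp(min=1e-8) / qmax         # shared layer-wise scale
    ratio = ch_absmax / (s * qmax)                     # how far each channel exceeds it
    alpha = ratio.log2().ceil().clamp(0, max_shift)    # per-channel power-of-two shifts
    step = s * torch.pow(2.0, alpha)                   # effective per-channel step
    x_int = torch.clamp(torch.round(x / step), -qmax - 1, qmax)
    return x_int, alpha, s, x_int * step               # ints, shifts, scale, dequantized

# Channels with very different magnitudes, mimicking the inter-channel variation above.
x = torch.randn(197, 768) * torch.logspace(0, 1, 768)
x_int, alpha, s, x_hat = pts_quantize(x)
print("mean abs error:", (x - x_hat).abs().mean().item())
```

Because each channel's extra factor is a power of two, it can be applied with an integer bit-shift at inference time, so only a single floating-point scale per layer needs to be stored.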
Softmax quantized with Log-Int-Softmax (LIS)
The storage and computation of the attention map are a known bottleneck of Transformer structures, so we want to quantize it to an extremely low bit-width (e.g., 4-bit). However, directly applying 4-bit uniform quantization causes severe accuracy degradation. We observe that the output of Softmax is concentrated around fairly small values, while only a few outliers have larger values close to 1. Based on the following visualization, Log2 quantization preserves more quantization bins than uniform quantization for the small-value interval where the distribution is dense.
Combining Log2 quantization with i-exp, a polynomial approximation of the exponential function presented by I-BERT, we propose LIS, an integer-only, faster, lower-memory Softmax.
The whole process is visualized as follows.
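As a rough sketch of why Log2 quantization suits attention maps, the toy code below compares 4-bit Log2 and 4-bit uniform quantization of softmax outputs. It stays in floating point and omits the i-exp integer approximation; the function names and shapes are illustrative assumptions, not the repository's code.

```python
import torch
import torch.nn.functional as F

def log2_quantize(p, n_bits=4):
    """Toy 4-bit Log2 quantization of softmax outputs (illustration only).

    Softmax values lie in (0, 1] and cluster near 0, so we quantize the
    exponent: q = round(-log2(p)). Small values get many bins, values near 1 few.
    """
    qmax = 2 ** n_bits - 1
    q = torch.clamp(torch.round(-torch.log2(p.clamp(min=1e-12))), 0, qmax)
    return torch.pow(2.0, -q)                        # dequantized attention map

def uniform_quantize(p, n_bits=4):
    """4-bit uniform quantization on [0, 1], for comparison."""
    qmax = 2 ** n_bits - 1
    return torch.round(p * qmax) / qmax

scores = 4 * torch.randn(1, 197, 197)                # made-up attention logits
p = F.softmax(scores, dim=-1)
print("log2    mean abs error:", (p - log2_quantize(p)).abs().mean().item())
print("uniform mean abs error:", (p - uniform_quantize(p)).abs().mean().item())
```

In the full LIS, the softmax itself is computed with i-exp and dequantizing a Log2-quantized attention map reduces to bit-shifts, so the whole pipeline stays integer-only.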
Getting Started
Install
- Clone this repo.
git clone https://github.com/linyangzhh/FQViT.git
cd FQViT
- Create a conda virtual environment and activate it.
conda create -n fqvit python=3.7 -y
conda activate fqvit
- Install PyTorch and torchvision, e.g.,
conda install pytorch=1.7.1 torchvision cudatoolkit=10.1 -c pytorch
Data preparation
You should download the standard ImageNet Dataset.
├── imagenet
│   ├── train
│   ├── val
Run
Example: evaluate a quantized DeiT-S with the MinMax quantizer and our proposed PTS and LIS
python test_quant.py deit_small <YOUR_DATA_DIR> --quant --pts --lis --quant-method minmax

- deit_small: model architecture, which can be replaced by deit_tiny, deit_base, vit_base, vit_large, swin_tiny, swin_small and swin_base.
- --quant: whether to quantize the model.
- --pts: whether to use Powers-of-Two Scale Integer LayerNorm.
- --lis: whether to use Log-Integer-Softmax.
- --quant-method: quantization method for activations, which can be chosen from minmax, ema, percentile and omse.
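The options above combine freely; for instance, a hypothetical run evaluating a quantized Swin-T with the Percentile calibrator (built only from the flags listed here) would be:

python test_quant.py swin_tiny <YOUR_DATA_DIR> --quant --pts --lis --quant-method percentile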
Results on ImageNet
This paper employs several current post-training quantization strategies together with our methods, including MinMax, EMA, Percentile and OMSE.
- MinMax uses the minimum and maximum values of the data as the clipping values;
- EMA builds on MinMax and uses an exponential moving average to smooth the minimum and maximum values across different mini-batches;
- Percentile assumes that the values follow a normal distribution and uses a percentile to clip. In this paper, we use the 1e-5 percentile, because the 1e-4 percentile commonly used for CNNs performs poorly on Vision Transformers;
- OMSE determines the clipping values by minimizing the quantization error.
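The sketch below is a minimal illustration of how the first three calibrators could compute clipping values; it is not the repository's code, the `clipping_range` name and its arguments are made up for the example, and OMSE's error-minimizing search is omitted.

```python
import torch

def clipping_range(x, method="minmax", percentile=1e-5, prev=None, momentum=0.9):
    """Toy calibration of activation clipping values (illustration only)."""
    if method == "minmax":
        return x.min(), x.max()
    if method == "ema":
        low, high = x.min(), x.max()
        if prev is not None:                          # smooth over calibration batches
            low = momentum * prev[0] + (1 - momentum) * low
            high = momentum * prev[1] + (1 - momentum) * high
        return low, high
    if method == "percentile":
        flat = x.flatten().float()
        return torch.quantile(flat, percentile), torch.quantile(flat, 1.0 - percentile)
    raise ValueError(f"unsupported method: {method}")

x = torch.randn(8, 197, 768)                          # a hypothetical activation batch
print(clipping_range(x, "minmax"))
print(clipping_range(x, "percentile"))
```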
The following results (Top-1 accuracy, %) are evaluated on ImageNet.
| Method | W/A/Attn Bits | ViT-B | ViT-L | DeiT-T | DeiT-S | DeiT-B | Swin-T | Swin-S | Swin-B |
|---|---|---|---|---|---|---|---|---|---|
| Full Precision | 32/32/32 | 84.53 | 85.81 | 72.21 | 79.85 | 81.85 | 81.35 | 83.20 | 83.60 |
| MinMax | 8/8/8 | 23.64 | 3.37 | 70.94 | 75.05 | 78.02 | 64.38 | 74.37 | 25.58 |
| MinMax w/ PTS | 8/8/8 | 83.31 | 85.03 | 71.61 | 79.17 | 81.20 | 80.51 | 82.71 | 82.97 |
| MinMax w/ PTS, LIS | 8/8/4 | 82.68 | 84.89 | 71.07 | 78.40 | 80.85 | 80.04 | 82.47 | 82.38 |
| EMA | 8/8/8 | 30.30 | 3.53 | 71.17 | 75.71 | 78.82 | 70.81 | 75.05 | 28.00 |
| EMA w/ PTS | 8/8/8 | 83.49 | 85.10 | 71.66 | 79.09 | 81.43 | 80.52 | 82.81 | 83.01 |
| EMA w/ PTS, LIS | 8/8/4 | 82.57 | 85.08 | 70.91 | 78.53 | 80.90 | 80.02 | 82.56 | 82.43 |
| Percentile | 8/8/8 | 46.69 | 5.85 | 71.47 | 76.57 | 78.37 | 78.78 | 78.12 | 40.93 |
| Percentile w/ PTS | 8/8/8 | 80.86 | 85.24 | 71.74 | 78.99 | 80.30 | 80.80 | 82.85 | 83.10 |
| Percentile w/ PTS, LIS | 8/8/4 | 80.22 | 85.17 | 71.23 | 78.30 | 80.02 | 80.46 | 82.67 | 82.79 |
| OMSE | 8/8/8 | 73.39 | 11.32 | 71.30 | 75.03 | 79.57 | 79.30 | 78.96 | 48.55 |
| OMSE w/ PTS | 8/8/8 | 82.73 | 85.27 | 71.64 | 78.96 | 81.25 | 80.64 | 82.87 | 83.07 |
| OMSE w/ PTS, LIS | 8/8/4 | 82.37 | 85.16 | 70.87 | 78.42 | 80.90 | 80.41 | 82.57 | 82.45 |
Citation
If you find this repo useful in your research, please consider citing the following paper:
@misc{lin2021fqvit,
      title={FQ-ViT: Fully Quantized Vision Transformer without Retraining},
      author={Yang Lin and Tianyu Zhang and Peiqin Sun and Zheng Li and Shuchang Zhou},
      year={2021},
      eprint={2111.13824},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}