MQBench: Towards Reproducible and Deployable Model Quantization Benchmark

We propose a benchmark for evaluating quantization algorithms under a variety of settings. MQBench is a first attempt to evaluate, analyze, and benchmark the reproducibility and deployability of model quantization algorithms. We choose multiple platforms for real-world deployment, including CPU, GPU, ASIC, and DSP, and evaluate a broad set of state-of-the-art quantization algorithms under a unified training pipeline. MQBench acts as a bridge between the algorithm and the hardware. We conduct a comprehensive analysis and report a number of insights, some intuitive and some counter-intuitive.

Highlighted Features

Integrates with the latest tracing techniques (torch.fx) in PyTorch 1.8.
Quantization Algorithms:
- Learned Step Size Quantization: https://arxiv.org/abs/1902.08153
- Quantization Interval Learning: https://arxiv.org/abs/1808.05779
- Differentiable Soft Quantization: https://arxiv.org/abs/1908.05033
- Parameterized Clipping AcTivation: https://arxiv.org/abs/1805.06085
- Additive Powers-of-Two Quantization: https://arxiv.org/abs/1909.13144
- DoReFa-Net: https://arxiv.org/abs/1606.06160
Network Architectures:
- ResNet-18, ResNet-50: https://arxiv.org/abs/1512.03385
- MobileNetV2: https://arxiv.org/abs/1801.04381
- EfficientNet-Lite-B0: https://blog.tensorflow.org/2020/03/higher-accuracy-on-vision-models-with-efficientnet-lite.html
- RegNetX-600MF: https://arxiv.org/abs/2003.13678

Hardware Platform:

Library	Hardware Type	Scale ($s$) Form	Granularity	Symmetry	Fold BN
Academic	None	FP32	Per-tensor	Symmetric	No
TensorRT	GPU	FP32	Per-channel	Symmetric	Yes
ACL	ASIC	FP32	Per-channel	Asymmetric	Yes
TVM	ARM CPU	POT	Per-tensor	Symmetric	Yes
SNPE	DSP	FP32	Per-tensor	Asymmetric	Yes
FBGEMM	X86 CPU	FP32	Per-channel	Asymmetric	Yes

Installation

These instructions will help you get MQBench up and running.

Clone MQBench.
(Optionally) Create a Python virtual environment.
Install the packages required by MQBench:

$ pip install -r requirements.txt

Notes: MQBench uses PyTorch 1.8; our quantized models are built on the new torch.fx tracing techniques.
MQBench uses PyTorch distributed data-parallel training with the NCCL backend (see the PyTorch distributed documentation for details). Please make sure your machine can initialize that distributed training environment.

How to Reproduce MQBench

We provide the running script run.sh and the configuration file config.yaml for every experiment in MQBench.

To reproduce LSQ on ResNet-18, first enter the experiment directory:

$ cd PATH-TO-PROJECT/qbench_zoo
$ cd lsq_experiments/resnet18_4bit_academic

Then run the script:

$ sh run.sh

Note that run.sh contains some commands that may not be available on your machine. The core running command is:

PYTHONPATH=$PYTHONPATH:../../.. \
python -u -m prototype.solver.cls_quant_solver --config config.yaml

How to self-implement a quantization algorithm

All our quantization algorithms are implemented in prototype/quantization/.

To implement a new algorithm, you need to add your quantizer to this directory.

All quantizers inherit from the QuantizeBase class. Each QuantizeBase has an observer class, which is used to estimate and update the quantization range. The observer design is inspired by the PyTorch 1.8 repo. Initializing a QuantizeBase class also initializes an Observer class.

The parameters of QuantizeBase and Observer include:

quant_min, quant_max, which specify the rounding boundaries $N_{min}$ and $N_{max}$.
qscheme, which can be torch.per_tensor_symmetric, torch.per_channel_symmetric, torch.per_tensor_affine, or torch.per_channel_affine. This is often determined by the hardware setup.
ch_axis, which is the dimension along which channel-wise quantization is applied; -1 means per-tensor quantization. For nn.Conv2d and nn.Linear modules, ch_axis should typically be 0.
ada_sign, which adaptively chooses the signedness. ada_sign should be enabled for the academic setting only.
pot_scale, which determines whether the scale parameters are restricted to powers of two.

Note: each quantizer may have its own additional parameters; see the LSQ example below.
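
To make the role of these parameters concrete, here is a minimal pure-Python sketch of the usual quantize-dequantize arithmetic on a single value. It is an illustration under stated assumptions, not MQBench code: the function name fake_quantize and the example numbers are made up, and the real quantizers operate on tensors rather than floats.

```python
def fake_quantize(x, scale, zero_point, quant_min, quant_max):
    """Quantize-dequantize one float (illustrative helper, not part of MQBench)."""
    # Map to the integer grid, shift by the zero point, then clip to [quant_min, quant_max].
    q = round(x / scale) + zero_point
    q = max(quant_min, min(quant_max, q))
    # Map back to the real axis.
    return (q - zero_point) * scale


# Symmetric scheme: zero_point is fixed at 0, here with a signed 4-bit range.
sym = fake_quantize(0.37, scale=0.1, zero_point=0, quant_min=-8, quant_max=7)

# Affine scheme: a non-zero zero_point, here with an unsigned 4-bit range.
aff = fake_quantize(-0.26, scale=0.1, zero_point=3, quant_min=0, quant_max=15)

# Values outside the representable range are clipped to the boundary.
clipped = fake_quantize(2.0, scale=0.1, zero_point=0, quant_min=-8, quant_max=7)

print(sym, aff, clipped)
```

With per-channel schemes, the same arithmetic would be applied with one scale (and zero point) per slice along ch_axis instead of a single pair for the whole tensor.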

Example Implementation of LSQ:

For initialization, we add new parameters for storing the scale and zero_point:

self.use_grad_scaling = use_grad_scaling
self.scale = Parameter(torch.tensor([scale]))
self.zero_point = Parameter(torch.tensor([zero_point]))
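
For context, the LSQ paper scales the gradient of the step size by g = 1 / sqrt(N * Q_P), where N is the number of elements being quantized and Q_P is the largest positive quantization level. The sketch below only computes that factor in plain Python. It is an assumption that use_grad_scaling toggles this factor in MQBench, and the helper name lsq_grad_scale is hypothetical.

```python
import math


def lsq_grad_scale(num_elements, quant_max):
    """Gradient scale from the LSQ paper: 1 / sqrt(N * Q_P) (hypothetical helper)."""
    return 1.0 / math.sqrt(num_elements * quant_max)


# Example: a 64x64x3x3 convolution weight quantized to signed 4 bits (Q_P = 7).
g = lsq_grad_scale(num_elements=64 * 64 * 3 * 3, quant_max=7)
print(g)
```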

The main logic lives in the forward function, which should handle several cases:

In case of ada_sign=True, the quantization range should be adjusted:

if self.ada_sign and X.min() >= 0:
    self.quant_max = self.activation_post_process.quant_max = 2 ** self.bitwidth - 1
    self.quant_min = self.activation_post_process.quant_min = 0
    self.activation_post_process.adjust_sign = True
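
The range switch above can be sketched on its own in plain Python. The unsigned range [0, 2^b - 1] comes from the snippet; the signed two's-complement range [-2^(b-1), 2^(b-1) - 1] and the helper names quant_range and adaptive_range are assumptions made for illustration.

```python
def quant_range(bitwidth, signed):
    """Return (quant_min, quant_max) for a given bitwidth (illustrative helper)."""
    if signed:
        # Assumed two's-complement signed range.
        return -(2 ** (bitwidth - 1)), 2 ** (bitwidth - 1) - 1
    # Unsigned range, as used in the ada_sign branch.
    return 0, 2 ** bitwidth - 1


def adaptive_range(values, bitwidth):
    """Use the unsigned range when every observed value is non-negative."""
    return quant_range(bitwidth, signed=min(values) < 0)


print(adaptive_range([0.0, 1.5, 3.2], bitwidth=4))  # post-ReLU style input
print(adaptive_range([-1.0, 2.0], bitwidth=4))      # input with negative values
```

Switching to the unsigned range for non-negative inputs avoids wasting half of the integer levels on values that never occur.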

In case of symmetric quantization, the zero point should be set to 0:
```
self.zero_point.data.zero_()
```

In case of powers-of-two scale, the scale should be quantized by:

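def pot_quantization(tensor: torch.Tensor):
    # Round in the log2 domain so the result is an exact power of two.
    log2t = torch.log2(tensor)
    # Straight-through estimator: forward uses the rounded value, backward sees the identity.
    log2t = (torch.round(log2t) - log2t).detach() + log2t
    return 2 ** log2t

scale = pot_quantization(self.scale)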
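
As a small worked example of the forward computation only, the following pure-Python sketch rounds a few positive scales to the nearest power of two in the log2 domain. The helper name pot_round is hypothetical, and the sketch leaves out the straight-through gradient trick used in the torch version.

```python
import math


def pot_round(scale):
    """Round a positive scale to the nearest power of two in the log2 domain."""
    return 2.0 ** round(math.log2(scale))


examples = [0.3, 0.7, 0.75, 1.0, 3.0]
rounded = [pot_round(s) for s in examples]
print(rounded)
```

Because the rounding happens on log2(scale), 0.75 maps to 1.0 rather than 0.5, even though it is equally distant from both on the linear axis.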

Implement both per-channel and per-tensor quantization.

After adding your quantizer...

The next step is to register the quantizer in prototype/quantization/qconfig.py.

Import your quantizer, add it to the get_qconfig function, and parse any necessary arguments.
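
The registration step might look roughly like the sketch below. Everything in it is hypothetical: the class names, the QUANTIZERS table, and get_qconfig_sketch are placeholders, and the real get_qconfig in qconfig.py may be structured differently and return a different kind of object. Only the keyword names (w_method, a_method, bit, plus backend parameters) are taken from how get_qconfig is called elsewhere in this README.

```python
class LSQQuantizer:
    """Placeholder for an existing quantizer class (not the MQBench class)."""

    def __init__(self, bit, **backend_params):
        self.bit = bit
        self.backend_params = backend_params


class MyQuantizer(LSQQuantizer):
    """Placeholder for the newly added quantizer."""


# Hypothetical lookup table from config names to quantizer classes.
QUANTIZERS = {"lsq": LSQQuantizer, "my_quantizer": MyQuantizer}


def get_qconfig_sketch(w_method, a_method, bit, **backend_params):
    """Pick the weight and activation quantizers by their config names."""
    for name in (w_method, a_method):
        if name not in QUANTIZERS:
            raise ValueError(f"unknown quantizer: {name}")
    return {
        "weight": QUANTIZERS[w_method](bit, **backend_params),
        "activation": QUANTIZERS[a_method](bit, **backend_params),
    }


cfg = get_qconfig_sketch("my_quantizer", "lsq", bit=4, ada_sign=True,
                         symmetry=True, per_channel=False, pot_scale=False)
print(type(cfg["weight"]).__name__, type(cfg["activation"]).__name__)
```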

The final step is to override the config.yaml file:

qparams:
    w_method: lsq
    a_method: lsq
    bit: 4

backend: academic
bnfold: 4

By replacing w_method and a_method with the name of your quantizer, you can run your implementation.

Note: the rest of the config file should not be modified in order to keep a unified training setting.

How to self-implement a hardware configuration

Adding a new hardware setting is much simpler than adding an algorithm. To do this, add another branch to the if-else selection over backends. For example, to add a new hardware backend, TFLite Micro:

        elif backend == "tflitemicro":
            backend_params = dict(ada_sign=False, symmetry=True, per_channel=False, pot_scale=True)
        ...

    model_qconfig = get_qconfig(**self.qparams, **backend_params)
    model = quantize_fx.prepare_qat_fx(model, {"": model_qconfig}, foldbn_config)
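
One way to read the Hardware Platform table is as a lookup from backend name to these four flags. The sketch below writes that reading out as a plain dictionary. It is an interpretation, not the repo's code: the lowercase keys other than "academic" and "tflitemicro" are assumed, ada_sign=True is inferred from the statement that ada_sign is for the academic setting only, and qscheme_name is a hypothetical helper that only builds the scheme's name as a string.

```python
# Flags derived from the Hardware Platform table (keys and layout are assumptions).
BACKEND_PARAMS = {
    "academic": dict(ada_sign=True, symmetry=True, per_channel=False, pot_scale=False),
    "tensorrt": dict(ada_sign=False, symmetry=True, per_channel=True, pot_scale=False),
    "acl": dict(ada_sign=False, symmetry=False, per_channel=True, pot_scale=False),
    "tvm": dict(ada_sign=False, symmetry=True, per_channel=False, pot_scale=True),
    "snpe": dict(ada_sign=False, symmetry=False, per_channel=False, pot_scale=False),
    "fbgemm": dict(ada_sign=False, symmetry=False, per_channel=True, pot_scale=False),
    # The new backend from the example above.
    "tflitemicro": dict(ada_sign=False, symmetry=True, per_channel=False, pot_scale=True),
}


def qscheme_name(symmetry, per_channel, **_unused):
    """Build the name of the matching qscheme from the two layout flags."""
    granularity = "per_channel" if per_channel else "per_tensor"
    kind = "symmetric" if symmetry else "affine"
    return f"torch.{granularity}_{kind}"


for backend, params in BACKEND_PARAMS.items():
    print(backend, qscheme_name(**params), "pot_scale =", params["pot_scale"])
```

A table like this keeps each backend's constraints in one place, so adding a backend is a one-line change rather than a new branch.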

Submitting Your Results to MQBench

You can submit your implementation to MQBench by submitting a merge request to this repo. The implementation of the new algorithm, the running scripts, and the log files are needed for evaluation.