MQBench: Towards Reproducible and Deployable Model Quantization Benchmark

We propose a benchmark to evaluate different quantization algorithms under various settings. MQBench is a first attempt to evaluate, analyze, and benchmark the reproducibility and deployability of model quantization algorithms. We choose multiple platforms for real-world deployment, including CPU, GPU, ASIC, and DSP, and evaluate a wide range of state-of-the-art quantization algorithms under a unified training pipeline. MQBench acts as a bridge between algorithms and hardware. We conduct a comprehensive analysis and report a number of insights, some intuitive and some counter-intuitive.

Highlighted Features


These instructions will help you get MQBench up and running.

  1. Clone MQBench.

  2. (Optionally) Create a Python virtual environment.

  3. Install the packages required by MQBench:

    $ pip install -r requirements.txt

    Note: MQBench uses PyTorch 1.8; our quantized models are based on the new torch.fx tracing technique.

  4. MQBench uses PyTorch distributed data-parallel training with the NCCL backend (see details here); please make sure your machine can initialize that distributed learning environment.
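As a quick sanity check before launching training, the sketch below reports whether the NCCL backend appears usable on the current machine. The helper name `nccl_ready` is hypothetical and not part of MQBench. It relies only on `torch.distributed.is_available()`, `torch.distributed.is_nccl_available()`, and `torch.cuda.is_available()`, and returns False when PyTorch is not installed. It does not create a process group, so a True result is a necessary condition rather than a guarantee.

```python
def nccl_ready() -> bool:
    """Return True if PyTorch, CUDA, and the NCCL backend all appear usable."""
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        return False  # PyTorch is not installed in this environment
    # Check is_available() first: the NCCL query only makes sense when the
    # distributed package was compiled into this PyTorch build.
    return bool(
        dist.is_available()
        and dist.is_nccl_available()
        and torch.cuda.is_available()
    )


if __name__ == "__main__":
    print("NCCL distributed training possible:", nccl_ready())
```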

How to Reproduce MQBench

We provide the running scripts and the configuration file config.yaml for every experiment in MQBench.

To reproduce LSQ on ResNet-18,

  1. enter the directory

    $ cd PATH-TO-PROJECT/qbench_zoo
    $ cd lsq_experiments/resnet18_4bit_academic
  2. run script

    $ sh

    Note that the script contains some commands that may not be available on your machine; the core running command is

    python -u -m prototype.solver.cls_quant_solver --config config.yaml

How to self-implement a quantization algorithm

All our quantization algorithms are implemented in prototype/quantization/

To implement a new algorithm, you need to add your quantizer to this directory.

All quantizers inherit from the QuantizeBase class. Each QuantizeBase has an observer class that is used to estimate and update the quantization range. The observer design is inspired by the PyTorch 1.8 repository. Initializing a QuantizeBase instance also initializes an Observer instance.

The parameters of QuantizeBase and Observer include:

  1. quant_min and quant_max, which specify the rounding boundaries $N_{min}$ and $N_{max}$.
  2. qscheme, which can be torch.per_tensor_symmetric, torch.per_channel_symmetric, torch.per_tensor_affine, or torch.per_channel_affine. This is often determined by the hardware setup.
  3. ch_axis, which is the dimension used for channel-wise quantization; -1 means per-tensor quantization. For nn.Conv2d and nn.Linear modules, ch_axis should typically be 0.
  4. ada_sign, which adaptively chooses the signedness of the quantization range. ada_sign should be enabled for the academic setting only.
  5. pot_scale, which determines whether the scale parameters are restricted to powers of two.
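To make the roles of quant_min, quant_max, and ada_sign concrete, here is a minimal plain-Python sketch of how an integer range could be derived from a bit-width and used to fake-quantize one value. The function names are hypothetical and are not MQBench APIs. The signed range assumes the usual two's-complement convention, and the unsigned range follows the ada_sign rule from the LSQ example in this document (quant_min = 0, quant_max = 2 ** bitwidth - 1 when the input is non-negative).

```python
from typing import Sequence, Tuple


def quant_range(bit: int, signed: bool) -> Tuple[int, int]:
    """Return (quant_min, quant_max) for a `bit`-wide integer grid."""
    if signed:
        # Assumed two's-complement convention, e.g. 4 bits -> [-8, 7]
        return -(2 ** (bit - 1)), 2 ** (bit - 1) - 1
    return 0, 2 ** bit - 1


def choose_range(values: Sequence[float], bit: int, ada_sign: bool) -> Tuple[int, int]:
    """Switch to the unsigned range when ada_sign is on and no value is negative."""
    if ada_sign and min(values) >= 0:
        return quant_range(bit, signed=False)
    return quant_range(bit, signed=True)


def fake_quantize(x: float, scale: float, zero_point: int,
                  quant_min: int, quant_max: int) -> float:
    """Quantize x to the integer grid, clamp to the range, and map back to float."""
    q = round(x / scale) + zero_point
    q = max(quant_min, min(quant_max, q))
    return (q - zero_point) * scale
```

With 4 bits, a non-negative activation tensor gets all 16 levels on the positive side when ada_sign is enabled, whereas the signed range spends half of them on negative values that never occur.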

Note: each specific quantizer may have its own parameters; see the LSQ example below.

Example Implementation of LSQ:

  1. For initialization, we add new parameters for storing the scale, zero_point:

    self.use_grad_scaling = use_grad_scaling
    self.scale = Parameter(torch.tensor([scale]))
    self.zero_point = Parameter(torch.tensor([zero_point]))
  2. The core of the implementation is the forward function, which should handle several cases:

    1. In case of ada_sign=True, the quantization range should be adjusted.

      if self.ada_sign and X.min() >= 0:
          self.quant_max = self.activation_post_process.quant_max = 2 ** self.bitwidth - 1
          self.quant_min = self.activation_post_process.quant_min = 0
          self.activation_post_process.adjust_sign = True
    2. In case of symmetric quantization, the zero point should be set to 0.
    3. In case of a power-of-two scale, the scale should be quantized by:

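      def pot_quantization(tensor: torch.Tensor):
          # Round log2(scale) to the nearest integer in the forward pass;
          # the detach() trick lets gradients pass straight through the rounding.
          log2t = torch.log2(tensor)
          log2t = (torch.round(log2t) - log2t).detach() + log2t
          return 2 ** log2t

      scale = pot_quantization(self.scale)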
    4. Implement both per-channel and per-tensor quantization.
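The cases listed above can be illustrated with a small plain-Python sketch that reproduces only the forward values (no gradients, no torch). All function names here are hypothetical. The weight is a nested list whose rows play the role of output channels (ch_axis = 0), and the power-of-two step computes the same forward value as the pot_quantization snippet, without its straight-through gradient.

```python
import math
from typing import List


def pot_quantize(scale: float) -> float:
    """Forward value of power-of-two rounding: 2 ** round(log2(scale))."""
    return 2.0 ** round(math.log2(scale))


def fake_quantize_row(row: List[float], scale: float, zero_point: int,
                      quant_min: int, quant_max: int) -> List[float]:
    """Quantize, clamp, and dequantize every value in one row."""
    out = []
    for x in row:
        q = round(x / scale) + zero_point
        q = max(quant_min, min(quant_max, q))
        out.append((q - zero_point) * scale)
    return out


def fake_quantize_weight(weight: List[List[float]], scales: List[float],
                         zero_points: List[int], quant_min: int, quant_max: int,
                         symmetric: bool, pot_scale: bool,
                         per_channel: bool) -> List[List[float]]:
    """Apply the symmetric, power-of-two, and per-channel cases in turn."""
    if symmetric:
        zero_points = [0] * len(zero_points)  # symmetric: zero point forced to 0
    if pot_scale:
        scales = [pot_quantize(s) for s in scales]  # scale snapped to a power of two
    if per_channel:
        # One (scale, zero_point) pair per row, i.e. per output channel
        return [fake_quantize_row(r, s, z, quant_min, quant_max)
                for r, s, z in zip(weight, scales, zero_points)]
    # Per-tensor: a single pair shared by the whole tensor
    return [fake_quantize_row(r, scales[0], zero_points[0], quant_min, quant_max)
            for r in weight]
```

In this toy setting, a row with small magnitudes keeps its values under per-channel quantization because it gets its own scale, while a single per-tensor scale chosen for the larger row can round the small row to zero.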

After adding your quantizer...

The next step is to register the quantizer in prototype/quantization/

Import your quantizer, add it to the get_qconfig function, and parse any necessary arguments.
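The registration step might look roughly like the following sketch. It is not the actual MQBench code: `QuantizerStub`, `LSQStub`, `MyQuantizer`, `QUANTIZERS`, `get_qconfig_sketch`, and `my_extra_arg` are all hypothetical names, and the real get_qconfig may be organized differently. The argument names mirror the keys that appear in this document (w_method, a_method, bit, ada_sign, symmetry, per_channel, pot_scale).

```python
from functools import partial


class QuantizerStub:
    """Hypothetical stand-in for a QuantizeBase subclass (not MQBench code)."""

    def __init__(self, bit, symmetry, per_channel, pot_scale, ada_sign):
        self.bit = bit
        self.symmetry = symmetry
        self.per_channel = per_channel
        self.pot_scale = pot_scale
        self.ada_sign = ada_sign


class LSQStub(QuantizerStub):
    """Placeholder for an already registered method."""


class MyQuantizer(QuantizerStub):
    """Placeholder for the newly added method, with one extra argument."""

    def __init__(self, my_extra_arg=1.0, **kwargs):
        super().__init__(**kwargs)
        self.my_extra_arg = my_extra_arg


# Step 1: make the new class reachable by the name used in config.yaml.
QUANTIZERS = {"lsq": LSQStub, "my_method": MyQuantizer}


def get_qconfig_sketch(w_method, a_method, bit,
                       ada_sign, symmetry, per_channel, pot_scale,
                       **method_kwargs):
    """Map method names to quantizer factories (illustrative only)."""
    for name in (w_method, a_method):
        if name not in QUANTIZERS:
            raise ValueError(f"unknown quantization method: {name!r}")
    common = dict(bit=bit, symmetry=symmetry, per_channel=per_channel,
                  pot_scale=pot_scale, ada_sign=ada_sign)
    # Step 2: forward method-specific arguments to the new quantizer only.
    extra = {"my_method": method_kwargs}
    return {
        "weight": partial(QUANTIZERS[w_method], **common, **extra.get(w_method, {})),
        "activation": partial(QUANTIZERS[a_method], **common, **extra.get(a_method, {})),
    }
```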

The final step is to override a config.yaml file:

    w_method: lsq
    a_method: lsq
    bit: 4

    backend: academic
    bnfold: 4

By replacing w_method and a_method with the name of your quantizer, you can run your implementation.

Note: the rest of the config file should not be modified, so that the training setting stays unified.
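One way to guard against accidental changes is to compare the edited configuration with the original and confirm that only the two method fields differ. The sketch below uses a flat dictionary holding the keys shown above. That flat layout, the helper names, and the method name "my_method" are assumptions for illustration; the real config.yaml may nest these keys.

```python
BASE_CONFIG = {
    "w_method": "lsq",
    "a_method": "lsq",
    "bit": 4,
    "backend": "academic",
    "bnfold": 4,
}

ALLOWED_OVERRIDES = {"w_method", "a_method"}
_MISSING = object()  # sentinel so that a missing key never equals a real value


def override_methods(base: dict, w_method: str, a_method: str) -> dict:
    """Return a copy of `base` with only the two method fields replaced."""
    new = dict(base)
    new["w_method"] = w_method
    new["a_method"] = a_method
    return new


def changed_keys(base: dict, new: dict) -> set:
    """Keys whose value differs, or that exist in only one of the two configs."""
    return {
        k for k in base.keys() | new.keys()
        if base.get(k, _MISSING) != new.get(k, _MISSING)
    }


new_config = override_methods(BASE_CONFIG, "my_method", "my_method")
# Fails loudly if anything other than the method names was touched.
assert changed_keys(BASE_CONFIG, new_config) <= ALLOWED_OVERRIDES
```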

How to self-implement a hardware configuration

Adding a new hardware setting is much simpler than adding an algorithm. To do this, add another condition to the if-else selection. For example, to add a new hardware backend, TFLite Micro:

        elif backend == "tflitemicro":
            backend_params = dict(ada_sign=False, symmetry=True, per_channel=False, pot_scale=True)

    model_qconfig = get_qconfig(**self.qparams, **backend_params)
    model = quantize_fx.prepare_qat_fx(model, {"": model_qconfig}, foldbn_config)
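The selection logic can be isolated into a small standalone function, sketched below. The function names are hypothetical. The "tflitemicro" values are copied from the snippet above, while the "academic" values are assumptions made only so the example has two branches (the text states only that ada_sign belongs to the academic setting). The second helper shows why the two dictionaries passed to get_qconfig must not share keys: Python raises a TypeError when the same keyword argument is supplied twice.

```python
def select_backend_params(backend: str) -> dict:
    """Return the hardware-specific quantization flags for a backend name."""
    if backend == "academic":
        # Assumed values for illustration; only ada_sign=True is suggested by the text.
        return dict(ada_sign=True, symmetry=False, per_channel=False, pot_scale=False)
    elif backend == "tflitemicro":
        # Values taken from the TFLite Micro example in this document.
        return dict(ada_sign=False, symmetry=True, per_channel=False, pot_scale=True)
    raise ValueError(f"unsupported backend: {backend!r}")


def merge_qconfig_kwargs(qparams: dict, backend_params: dict) -> dict:
    """Merge both dicts as a call f(**qparams, **backend_params) would see them."""
    overlap = qparams.keys() & backend_params.keys()
    if overlap:
        # Duplicate keyword arguments would raise TypeError at call time; fail early.
        raise ValueError(f"duplicate keys: {sorted(overlap)}")
    return {**qparams, **backend_params}
```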

Submitting Your Results to MQBench

You can submit your implementation to MQBench by submitting a merge request to this repo. The implementation of the new algorithm, the running scripts, and the log files are needed for evaluation.