[Doc] Compression (#4574)

Unverified Commit db3130d7 authored Feb 28, 2022 by J-shang Committed by GitHub Feb 28, 2022
20 changed files
--- a/docs/source/tutorials/quantization_quick_start_mnist_codeobj.pickle
+++ b/docs/source/tutorials/quantization_quick_start_mnist_codeobj.pickle
--- a/docs/source/tutorials/quantization_speed_up.ipynb
+++ b/docs/source/tutorials/quantization_speed_up.ipynb
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "%matplotlib inline"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "\n# Speed Up Model with Calibration Config\n\n\n## Introduction\n\nDeep neural networks are computationally and memory intensive, \nwhich makes them difficult to deploy. Quantization is a \nfundamental technique that is widely used to reduce memory footprint and speed up \ninference. Many frameworks have begun to support quantization, but few of them support mixed-precision \nquantization with real speedup. Frameworks like `HAQ: Hardware-Aware Automated Quantization with Mixed Precision <https://arxiv.org/pdf/1811.08886.pdf>`__ only support simulated mixed-precision quantization, which does \nnot speed up inference. To get real speedup from mixed-precision quantization and \nto give users real feedback from hardware, we designed a general framework with a simple interface that lets NNI quantization algorithms connect to different \nDL model optimization backends (e.g., TensorRT, NNFusion). This gives users an end-to-end experience: after quantizing a model \nwith a quantization algorithm, the quantized model can be sped up directly by the connected optimization backend. NNI connects to \nTensorRT at this stage and will support more backends in the future.\n\n\n## Design and Implementation\n\nTo support speeding up mixed-precision quantization, we divide the framework into two parts: frontend and backend.  \nThe frontend can be a popular training framework such as PyTorch or TensorFlow. The backend can be an inference \nframework for a particular hardware target, such as TensorRT. At present, we support PyTorch as the frontend and \nTensorRT as the backend. To convert a PyTorch model to a TensorRT engine, we use ONNX as the intermediate graph \nrepresentation: we convert the PyTorch model to an ONNX model, and TensorRT then parses the ONNX \nmodel to generate an inference engine. \n\n\nQuantization-aware training combines the NNI quantization algorithm 'QAT' with the NNI quantization speedup tool.\nUsers first set a config to train a quantized model with the QAT algorithm (please refer to `NNI Quantization Algorithms <https://nni.readthedocs.io/en/stable/Compression/Quantizer.html>`__\\  ).\nAfter quantization-aware training, users get a new config with calibration parameters and a model with quantized weights. Passing the new config and the model to the quantization speedup tool produces a real mixed-precision engine for inference.\n\n\nAfter getting the mixed-precision engine, users can run inference with input data.\n\n\nNote\n\n\n* We recommend using \"cpu\" (host) as the data device (for both inference data and calibration data), since data should be on the host initially and is transferred to the device before inference. If the data device is not \"cpu\" (host), this tool will move the data to \"cpu\" first, which may add unnecessary overhead.\n* Users can also do post-training quantization by leveraging TensorRT directly (a calibration dataset must be provided).\n* Not all op types are supported right now. At present, NNI supports Conv, Linear, Relu and MaxPool. More op types will be supported in following releases.\n\n\n## Prerequisite\nCUDA version >= 11.0\n\nTensorRT version >= 7.2\n\nNote\n\n* If you haven't installed TensorRT or are using an old version, please refer to `TensorRT Installation Guide <https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html>`__\\  \n\n## Usage\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "import torch\nimport torch.nn.functional as F\nfrom torch.optim import SGD\nfrom scripts.compression_mnist_model import TorchModel, device, trainer, evaluator, test_trt\n\nconfig_list = [{\n    'quant_types': ['input', 'weight'],\n    'quant_bits': {'input': 8, 'weight': 8},\n    'op_names': ['conv1']\n}, {\n    'quant_types': ['output'],\n    'quant_bits': {'output': 8},\n    'op_names': ['relu1']\n}, {\n    'quant_types': ['input', 'weight'],\n    'quant_bits': {'input': 8, 'weight': 8},\n    'op_names': ['conv2']\n}, {\n    'quant_types': ['output'],\n    'quant_bits': {'output': 8},\n    'op_names': ['relu2']\n}]\n\nmodel = TorchModel().to(device)\noptimizer = SGD(model.parameters(), lr=0.01, momentum=0.5)\ncriterion = F.nll_loss\ndummy_input = torch.rand(32, 1, 28,28).to(device)\n\nfrom nni.algorithms.compression.pytorch.quantization import QAT_Quantizer\nquantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input)\nquantizer.compress()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Fine-tune the model using QAT.\n\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "for epoch in range(3):\n    trainer(model, optimizer, criterion)\n    evaluator(model)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Export the model and get calibration_config.\n\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "model_path = \"./log/mnist_model.pth\"\ncalibration_path = \"./log/mnist_calibration.pth\"\ncalibration_config = quantizer.export_model(model_path, calibration_path)\n\nprint(\"calibration_config: \", calibration_config)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Build a TensorRT engine to get real speedup.\n\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [],
+      "source": [
+        "# from nni.compression.pytorch.quantization_speedup import ModelSpeedupTensorRT\n# input_shape = (32, 1, 28, 28)\n# engine = ModelSpeedupTensorRT(model, input_shape, config=calibration_config, batchsize=32)\n# engine.compress()\n# test_trt(engine)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Note that NNI also supports post-training quantization directly; please refer to the complete example for details.\n\nFor the complete example, please refer to :githublink:`the code <examples/model_compress/quantization/mixed_precision_speedup_mnist.py>`.\n\nFor more parameters of the class ``ModelSpeedupTensorRT``, you can refer to `Model Compression API Reference <https://nni.readthedocs.io/en/stable/Compression/CompressionReference.html#quantization-speedup>`__\\.\n\n### Mnist test\n\non one GTX2080 GPU,\ninput tensor: ``torch.randn(128, 1, 28, 28)``\n\n.. list-table::\n   :header-rows: 1\n   :widths: auto\n\n   * - quantization strategy\n     - Latency\n     - accuracy\n   * - all in 32bit\n     - 0.001199961\n     - 96%\n   * - mixed precision(average bit 20.4)\n     - 0.000753688\n     - 96%\n   * - all in 8bit\n     - 0.000229869\n     - 93.7%\n\n### Cifar10 resnet18 test (train one epoch)\n\non one GTX2080 GPU,\ninput tensor: ``torch.randn(128, 3, 32, 32)``\n\n.. list-table::\n   :header-rows: 1\n   :widths: auto\n\n   * - quantization strategy\n     - Latency\n     - accuracy\n   * - all in 32bit\n     - 0.003286268\n     - 54.21%\n   * - mixed precision(average bit 11.55)\n     - 0.001358022\n     - 54.78%\n   * - all in 8bit\n     - 0.000859139\n     - 52.81%\n\n"
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.8.8"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
\ No newline at end of file
--- a/docs/source/Compression/QuantizationSpeedup.rst
+++ b/docs/source/Compression/QuantizationSpeedup.rst
-Speed up Mixed Precision Quantization Model (experimental)
+"""
-==========================================================
+Speed Up Model with Calibration Config
+======================================
 Introduction
@@ -56,87 +57,114 @@ Note
 Usage
 -----
-quantization aware training:
-.. code-block:: python
+"""
-    # arrange bit config for QAT algorithm
+# %%
-    configure_list = [{
+import torch
-            'quant_types': ['weight', 'output'],
+import torch.nn.functional as F
-            'quant_bits': {'weight':8, 'output':8},
+from torch.optim import SGD
-            'op_names': ['conv1']
+from scripts.compression_mnist_model import TorchModel, device, trainer, evaluator, test_trt
-        }, {
-            'quant_types': ['output'],
+config_list = [{
-            'quant_bits': {'output':8},
+    'quant_types': ['input', 'weight'],
-            'op_names': ['relu1']
+    'quant_bits': {'input': 8, 'weight': 8},
-        }
+    'op_names': ['conv1']
-    ]
+}, {
+    'quant_types': ['output'],
-    quantizer = QAT_Quantizer(model, configure_list, optimizer)
+    'quant_bits': {'output': 8},
-    quantizer.compress()
+    'op_names': ['relu1']
-    calibration_config = quantizer.export_model(model_path, calibration_path)
+}, {
+    'quant_types': ['input', 'weight'],
-    engine = ModelSpeedupTensorRT(model, input_shape, config=calibration_config, batchsize=batch_size)
+    'quant_bits': {'input': 8, 'weight': 8},
-    # build tensorrt inference engine
+    'op_names': ['conv2']
-    engine.compress()
+}, {
-    # data should be pytorch tensor
+    'quant_types': ['output'],
-    output, time = engine.inference(data)
+    'quant_bits': {'output': 8},
+    'op_names': ['relu2']
+}]
-Note that NNI also supports post-training quantization directly, please refer to complete examples for detail.
+model = TorchModel().to(device)
+optimizer = SGD(model.parameters(), lr=0.01, momentum=0.5)
-For complete examples please refer to :githublink:`the code <examples/model_compress/quantization/mixed_precision_speedup_mnist.py>`.
+criterion = F.nll_loss
+dummy_input = torch.rand(32, 1, 28,28).to(device)
-For more parameters about the class 'TensorRTModelSpeedUp', you can refer to `Model Compression API Reference <https://nni.readthedocs.io/en/stable/Compression/CompressionReference.html#quantization-speedup>`__\.
+from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
+quantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input)
+quantizer.compress()
-Mnist test
-^^^^^^^^^^^^^^^^^^^
+# %%
+# Fine-tune the model using QAT.
-on one GTX2080 GPU,
+for epoch in range(3):
-input tensor: ``torch.randn(128, 1, 28, 28)``
+    trainer(model, optimizer, criterion)
+    evaluator(model)
-.. list-table::
-   :header-rows: 1
+# %%
-   :widths: auto
+# Export the model and get calibration_config.
+model_path = "./log/mnist_model.pth"
-   * - quantization strategy
+calibration_path = "./log/mnist_calibration.pth"
-     - Latency
+calibration_config = quantizer.export_model(model_path, calibration_path)
-     - accuracy
-   * - all in 32bit
+print("calibration_config: ", calibration_config)
-     - 0.001199961
-     - 96%
+# %%
-   * - mixed precision(average bit 20.4)
+# Build a TensorRT engine to get real speedup.
-     - 0.000753688
-     - 96%
+# from nni.compression.pytorch.quantization_speedup import ModelSpeedupTensorRT
-   * - all in 8bit
+# input_shape = (32, 1, 28, 28)
-     - 0.000229869
+# engine = ModelSpeedupTensorRT(model, input_shape, config=calibration_config, batchsize=32)
-     - 93.7%
+# engine.compress()
+# test_trt(engine)
-Cifar10 resnet18 test(train one epoch)
+# %%
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+# Note that NNI also supports post-training quantization directly; please refer to the complete example for details.
+#
+# For the complete example, please refer to :githublink:`the code <examples/model_compress/quantization/mixed_precision_speedup_mnist.py>`.
-on one GTX2080 GPU,
+#
-input tensor: ``torch.randn(128, 3, 32, 32)``
+# For more parameters of the class ``ModelSpeedupTensorRT``, you can refer to `Model Compression API Reference <https://nni.readthedocs.io/en/stable/Compression/CompressionReference.html#quantization-speedup>`__\.
+#
+# Mnist test
-.. list-table::
+# ^^^^^^^^^^
-   :header-rows: 1
+#
-   :widths: auto
+# on one GTX2080 GPU,
+# input tensor: ``torch.randn(128, 1, 28, 28)``
-   * - quantization strategy
+#
-     - Latency
+# .. list-table::
-     - accuracy
+#    :header-rows: 1
-   * - all in 32bit
+#    :widths: auto
-     - 0.003286268
+#
-     - 54.21%
+#    * - quantization strategy
-   * - mixed precision(average bit 11.55)
+#      - Latency
-     - 0.001358022
+#      - accuracy
-     - 54.78%
+#    * - all in 32bit
-   * - all in 8bit
+#      - 0.001199961
-     - 0.000859139
+#      - 96%
-     - 52.81%
+#    * - mixed precision(average bit 20.4)
\ No newline at end of file
+#      - 0.000753688
+#      - 96%
+#    * - all in 8bit
+#      - 0.000229869
+#      - 93.7%
+#
+# Cifar10 resnet18 test (train one epoch)
+# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+#
+# on one GTX2080 GPU,
+# input tensor: ``torch.randn(128, 3, 32, 32)``
+#
+# .. list-table::
+#    :header-rows: 1
+#    :widths: auto
+#
+#    * - quantization strategy
+#      - Latency
+#      - accuracy
+#    * - all in 32bit
+#      - 0.003286268
+#      - 54.21%
+#    * - mixed precision(average bit 11.55)
+#      - 0.001358022
+#      - 54.78%
+#    * - all in 8bit
+#      - 0.000859139
+#      - 52.81%
--- a/docs/source/tutorials/quantization_speed_up.py.md5
+++ b/docs/source/tutorials/quantization_speed_up.py.md5
+07fe95336a0d7edb8924dc24b609b361
\ No newline at end of file
--- a/docs/source/tutorials/quantization_speed_up.rst
+++ b/docs/source/tutorials/quantization_speed_up.rst
+.. DO NOT EDIT.
+.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
+.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
+.. "tutorials/quantization_speed_up.py"
+.. LINE NUMBERS ARE GIVEN BELOW.
+.. only:: html
+    .. note::
+        :class: sphx-glr-download-link-note
+        Click :ref:`here <sphx_glr_download_tutorials_quantization_speed_up.py>`
+        to download the full example code
+.. rst-class:: sphx-glr-example-title
+.. _sphx_glr_tutorials_quantization_speed_up.py:
+Speed Up Model with Calibration Config
+======================================
+Introduction
+------------
+Deep neural networks are computationally and memory intensive,
+which makes them difficult to deploy. Quantization is a
+fundamental technique that is widely used to reduce memory footprint and speed up
+inference. Many frameworks have begun to support quantization, but few of them support mixed-precision
+quantization with real speedup. Frameworks like `HAQ: Hardware-Aware Automated Quantization with Mixed Precision <https://arxiv.org/pdf/1811.08886.pdf>`__ only support simulated mixed-precision quantization, which does
+not speed up inference. To get real speedup from mixed-precision quantization and
+to give users real feedback from hardware, we designed a general framework with a simple interface that lets NNI quantization algorithms connect to different
+DL model optimization backends (e.g., TensorRT, NNFusion). This gives users an end-to-end experience: after quantizing a model
+with a quantization algorithm, the quantized model can be sped up directly by the connected optimization backend. NNI connects to
+TensorRT at this stage and will support more backends in the future.
+Design and Implementation
+-------------------------
+To support speeding up mixed-precision quantization, we divide the framework into two parts: frontend and backend.
+The frontend can be a popular training framework such as PyTorch or TensorFlow. The backend can be an inference
+framework for a particular hardware target, such as TensorRT. At present, we support PyTorch as the frontend and
+TensorRT as the backend. To convert a PyTorch model to a TensorRT engine, we use ONNX as the intermediate graph
+representation: we convert the PyTorch model to an ONNX model, and TensorRT then parses the ONNX
+model to generate an inference engine.
+Quantization-aware training combines the NNI quantization algorithm 'QAT' with the NNI quantization speedup tool.
+Users first set a config to train a quantized model with the QAT algorithm (please refer to `NNI Quantization Algorithms <https://nni.readthedocs.io/en/stable/Compression/Quantizer.html>`__\  ).
+After quantization-aware training, users get a new config with calibration parameters and a model with quantized weights. Passing the new config and the model to the quantization speedup tool produces a real mixed-precision engine for inference.
+After getting the mixed-precision engine, users can run inference with input data.
+Note
+* We recommend using "cpu" (host) as the data device (for both inference data and calibration data), since data should be on the host initially and is transferred to the device before inference. If the data device is not "cpu" (host), this tool will move the data to "cpu" first, which may add unnecessary overhead.
+* Users can also do post-training quantization by leveraging TensorRT directly (a calibration dataset must be provided).
+* Not all op types are supported right now. At present, NNI supports Conv, Linear, Relu and MaxPool. More op types will be supported in following releases.
+Prerequisite
+------------
+CUDA version >= 11.0
+TensorRT version >= 7.2
+Note
+* If you haven't installed TensorRT or are using an old version, please refer to `TensorRT Installation Guide <https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html>`__\  
+Usage
+-----
+.. GENERATED FROM PYTHON SOURCE LINES 64-96
+.. code-block:: default
+    import torch
+    import torch.nn.functional as F
+    from torch.optim import SGD
+    from scripts.compression_mnist_model import TorchModel, device, trainer, evaluator, test_trt
+    config_list = [{
+        'quant_types': ['input', 'weight'],
+        'quant_bits': {'input': 8, 'weight': 8},
+        'op_names': ['conv1']
+    }, {
+        'quant_types': ['output'],
+        'quant_bits': {'output': 8},
+        'op_names': ['relu1']
+    }, {
+        'quant_types': ['input', 'weight'],
+        'quant_bits': {'input': 8, 'weight': 8},
+        'op_names': ['conv2']
+    }, {
+        'quant_types': ['output'],
+        'quant_bits': {'output': 8},
+        'op_names': ['relu2']
+    }]
+    model = TorchModel().to(device)
+    optimizer = SGD(model.parameters(), lr=0.01, momentum=0.5)
+    criterion = F.nll_loss
+    dummy_input = torch.rand(32, 1, 28,28).to(device)
+    from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
+    quantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input)
+    quantizer.compress()
+.. rst-class:: sphx-glr-script-out
+ Out:
+ .. code-block:: none
+    [2022-02-21 18:53:07] WARNING (nni.algorithms.compression.pytorch.quantization.qat_quantizer/MainThread) op_names ['relu1'] not found in model
+    [2022-02-21 18:53:07] WARNING (nni.algorithms.compression.pytorch.quantization.qat_quantizer/MainThread) op_names ['relu2'] not found in model
+    TorchModel(
+      (conv1): QuantizerModuleWrapper(
+        (module): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
+      )
+      (conv2): QuantizerModuleWrapper(
+        (module): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
+      )
+      (fc1): Linear(in_features=256, out_features=120, bias=True)
+      (fc2): Linear(in_features=120, out_features=84, bias=True)
+      (fc3): Linear(in_features=84, out_features=10, bias=True)
+    )
+.. GENERATED FROM PYTHON SOURCE LINES 97-98
+Fine-tune the model using QAT.
+.. GENERATED FROM PYTHON SOURCE LINES 98-102
+.. code-block:: default
+    for epoch in range(3):
+        trainer(model, optimizer, criterion)
+        evaluator(model)
+.. rst-class:: sphx-glr-script-out
+ Out:
+ .. code-block:: none
+    Average test loss: 0.2524, Accuracy: 9209/10000 (92%)
+    Average test loss: 0.1711, Accuracy: 9461/10000 (95%)
+    Average test loss: 0.1037, Accuracy: 9690/10000 (97%)
+.. GENERATED FROM PYTHON SOURCE LINES 103-104
+Export the model and get calibration_config.
+.. GENERATED FROM PYTHON SOURCE LINES 104-110
+.. code-block:: default
+    model_path = "./log/mnist_model.pth"
+    calibration_path = "./log/mnist_calibration.pth"
+    calibration_config = quantizer.export_model(model_path, calibration_path)
+    print("calibration_config: ", calibration_config)
+.. rst-class:: sphx-glr-script-out
+ Out:
+ .. code-block:: none
+    [2022-02-21 18:53:54] INFO (nni.compression.pytorch.compressor/MainThread) Model state_dict saved to ./log/mnist_model.pth
+    [2022-02-21 18:53:54] INFO (nni.compression.pytorch.compressor/MainThread) Mask dict saved to ./log/mnist_calibration.pth
+    calibration_config:  {'conv1': {'weight_bits': 8, 'weight_scale': tensor([0.0026], device='cuda:0'), 'weight_zero_point': tensor([103.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': -0.4242129623889923, 'tracked_max_input': 2.821486711502075}, 'conv2': {'weight_bits': 8, 'weight_scale': tensor([0.0019], device='cuda:0'), 'weight_zero_point': tensor([116.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': 0.0, 'tracked_max_input': 10.175512313842773}}
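The exported ``calibration_config`` stores, per layer, a bit width and a tracked input range. As a rough sketch of how such a range could map to integer quantization parameters, the plain-Python example below applies standard asymmetric (affine) quantization to the ``conv1`` input range printed above. The helper names are hypothetical, and this is not necessarily how NNI or TensorRT derive their scales internally.

```python
def affine_qparams(tracked_min, tracked_max, bits):
    """Derive (scale, zero_point) for unsigned affine quantization.

    Hypothetical helper for illustration; not an NNI or TensorRT API.
    """
    qmax = 2 ** bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo, hi = min(tracked_min, 0.0), max(tracked_max, 0.0)
    scale = (hi - lo) / qmax
    if scale == 0:
        # Degenerate all-zero range: any positive scale works.
        scale = 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, bits):
    """Map a real value to an integer level, clamped to [0, 2**bits - 1]."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** bits - 1, q))


def dequantize(q, scale, zero_point):
    """Map an integer level back to its real-valued approximation."""
    return (q - zero_point) * scale


# Input range of conv1, copied from the calibration_config output above.
scale, zero_point = affine_qparams(-0.4242129623889923, 2.821486711502075, bits=8)
q = quantize(1.0, scale, zero_point, bits=8)
x_hat = dequantize(q, scale, zero_point)
print(scale, zero_point, q, x_hat)
```

The round-trip error for an in-range value is at most half a quantization step (``scale / 2``); values outside the tracked range are clamped.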
+.. GENERATED FROM PYTHON SOURCE LINES 111-112
+Build a TensorRT engine to get real speedup.
+.. GENERATED FROM PYTHON SOURCE LINES 112-119
+.. code-block:: default
+    # from nni.compression.pytorch.quantization_speedup import ModelSpeedupTensorRT
+    # input_shape = (32, 1, 28, 28)
+    # engine = ModelSpeedupTensorRT(model, input_shape, config=calibration_config, batchsize=32)
+    # engine.compress()
+    # test_trt(engine)
+.. GENERATED FROM PYTHON SOURCE LINES 120-171
+Note that NNI also supports post-training quantization directly; please refer to the complete example for details.
+For the complete example, please refer to :githublink:`the code <examples/model_compress/quantization/mixed_precision_speedup_mnist.py>`.
+For more parameters of the class ``ModelSpeedupTensorRT``, you can refer to `Model Compression API Reference <https://nni.readthedocs.io/en/stable/Compression/CompressionReference.html#quantization-speedup>`__\.
+Mnist test
+^^^^^^^^^^
+on one GTX2080 GPU,
+input tensor: ``torch.randn(128, 1, 28, 28)``
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+   * - quantization strategy
+     - Latency
+     - accuracy
+   * - all in 32bit
+     - 0.001199961
+     - 96%
+   * - mixed precision(average bit 20.4)
+     - 0.000753688
+     - 96%
+   * - all in 8bit
+     - 0.000229869
+     - 93.7%
+Cifar10 resnet18 test (train one epoch)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+on one GTX2080 GPU,
+input tensor: ``torch.randn(128, 3, 32, 32)``
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+   * - quantization strategy
+     - Latency
+     - accuracy
+   * - all in 32bit
+     - 0.003286268
+     - 54.21%
+   * - mixed precision(average bit 11.55)
+     - 0.001358022
+     - 54.78%
+   * - all in 8bit
+     - 0.000859139
+     - 52.81%
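To make the two tables easier to compare, the short sketch below turns the reported latencies into speedup ratios relative to the 32-bit baseline. The numbers are copied from the tables; the dictionary keys and the ``speedups`` helper are illustrative names only.

```python
# Latencies copied from the two tables above (same units as reported there).
latency = {
    "mnist": {
        "all_32bit": 0.001199961,
        "mixed": 0.000753688,
        "all_8bit": 0.000229869,
    },
    "cifar10_resnet18": {
        "all_32bit": 0.003286268,
        "mixed": 0.001358022,
        "all_8bit": 0.000859139,
    },
}


def speedups(rows):
    """Return each strategy's speedup relative to the 32-bit baseline."""
    base = rows["all_32bit"]
    return {name: base / value for name, value in rows.items()}


for task, rows in latency.items():
    for name, ratio in speedups(rows).items():
        print(f"{task:>18} {name:>9}: {ratio:.2f}x")
```

On these numbers, 8-bit inference is roughly 5.2x faster than 32-bit on MNIST and roughly 3.8x faster on CIFAR-10 ResNet-18, while the mixed-precision settings land at about 1.6x and 2.4x respectively.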
+.. rst-class:: sphx-glr-timing
+   **Total running time of the script:** ( 0 minutes  52.798 seconds)
+.. _sphx_glr_download_tutorials_quantization_speed_up.py:
+.. only :: html
+ .. container:: sphx-glr-footer
+    :class: sphx-glr-footer-example
+  .. container:: sphx-glr-download sphx-glr-download-python
+     :download:`Download Python source code: quantization_speed_up.py <quantization_speed_up.py>`
+  .. container:: sphx-glr-download sphx-glr-download-jupyter
+     :download:`Download Jupyter notebook: quantization_speed_up.ipynb <quantization_speed_up.ipynb>`
+.. only:: html
+ .. rst-class:: sphx-glr-signature
+    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
--- a/docs/source/tutorials/quantization_speed_up_codeobj.pickle
+++ b/docs/source/tutorials/quantization_speed_up_codeobj.pickle
--- a/docs/source/tutorials/sg_execution_times.rst
+++ b/docs/source/tutorials/sg_execution_times.rst
@@ -5,12 +5,20 @@
 Computation times
 =================
-**04:06.818** total execution time for **tutorials** files:
+**00:08.409** total execution time for **tutorials** files:
-+-------------------------------------------------------------------------------+-----------+--------+
+-----------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorials_hello_nas.py` (``hello_nas.py``)                     | 04:06.818 | 0.0 MB |
+| :ref:`sphx_glr_tutorials_pruning_speed_up.py` (``pruning_speed_up.py``)                             | 00:08.409 | 0.0 MB |
-+-------------------------------------------------------------------------------+-----------+--------+
+-----------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorials_nasbench_as_dataset.py` (``nasbench_as_dataset.py``) | 00:00.000 | 0.0 MB |
+| :ref:`sphx_glr_tutorials_hello_nas.py` (``hello_nas.py``)                                           | 00:00.000 | 0.0 MB |
-+-------------------------------------------------------------------------------+-----------+--------+
+-----------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorials_nni_experiment.py` (``nni_experiment.py``)           | 00:00.000 | 0.0 MB |
+| :ref:`sphx_glr_tutorials_nasbench_as_dataset.py` (``nasbench_as_dataset.py``)                       | 00:00.000 | 0.0 MB |
-+-------------------------------------------------------------------------------+-----------+--------+
+-----------------------------------------------------------------------------------------------------+-----------+--------+
+| :ref:`sphx_glr_tutorials_nni_experiment.py` (``nni_experiment.py``)                                 | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
+| :ref:`sphx_glr_tutorials_pruning_quick_start_mnist.py` (``pruning_quick_start_mnist.py``)           | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
+| :ref:`sphx_glr_tutorials_quantization_quick_start_mnist.py` (``quantization_quick_start_mnist.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
+| :ref:`sphx_glr_tutorials_quantization_speed_up.py` (``quantization_speed_up.py``)                   | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
--- a/examples/tutorials/.gitignore
+++ b/examples/tutorials/.gitignore
 data/
+log/
--- a/examples/tutorials/pruning_quick_start_mnist.py
+++ b/examples/tutorials/pruning_quick_start_mnist.py
+"""
+Pruning Quickstart
+==================
+Model pruning is a technique that reduces model size and computation by reducing the size of model weights or intermediate states.
+It usually follows one of these paths:
+#. Pre-training a model -> Pruning the model -> Fine-tuning the model
+#. Pruning the model during training (pruning-aware training) -> Fine-tuning the model
+#. Pruning the model -> Pre-training the compact model
+NNI supports all three modes and mainly focuses on the pruning stage.
+Follow this tutorial for a quick look at how to use NNI to prune a model in a common use case.
+"""
+# %%
+# Preparation
+# -----------
+#
+# In this tutorial, we use a simple model and pre-train it on the MNIST dataset.
+# If you are familiar with defining and training a model in PyTorch, you can skip directly to `Pruning Model`_.
+import torch
+import torch.nn.functional as F
+from torch.optim import SGD
+from scripts.compression_mnist_model import TorchModel, trainer, evaluator, device
+# define the model
+model = TorchModel().to(device)
+# define the optimizer and criterion for pre-training
+optimizer = SGD(model.parameters(), 1e-2)
+criterion = F.nll_loss
+# pre-train and evaluate the model on MNIST dataset
+for epoch in range(3):
+    trainer(model, optimizer, criterion)
+    evaluator(model)
+# %%
+# Pruning Model
+# -------------
+#
+# Use L1NormPruner to prune the model and generate the masks.
+# Usually, pruners require the original model and a ``config_list`` as parameters.
+# For details about how to write ``config_list``, please refer ...
+#
+# This `config_list` means all layers whose type is `Linear` or `Conv2d` will be pruned,
+# except the layer named `fc3`, because `fc3` is marked `exclude`.
+# The final sparsity ratio for each pruned layer is 50%; the layer named `fc3` will not be pruned.
+config_list = [{
+    'sparsity_per_layer': 0.5,
+    'op_types': ['Linear', 'Conv2d']
+}, {
+    'exclude': True,
+    'op_names': ['fc3']
+}]
+# %%
+# Pruners usually require `model` and `config_list` as input arguments.
+from nni.algorithms.compression.v2.pytorch.pruning import L1NormPruner
+pruner = L1NormPruner(model, config_list)
+# show the wrapped model structure
+print(model)
+# compress the model and generate the masks
+_, masks = pruner.compress()
+# show the sparsity of each mask (fraction of zeroed weights; a mask value of 1 means "keep")
+for name, mask in masks.items():
+    print(name, ' sparsity: ', '{:.2}'.format(1 - mask['weight'].sum() / mask['weight'].numel()))
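For intuition about what the masks encode, here is a minimal plain-Python sketch of L1-norm filter selection: rank filters by the L1 norm of their weights and mask out the smallest ones until the target sparsity is reached. It uses toy lists instead of tensors, the function name is hypothetical, and it is a conceptual illustration rather than the NNI implementation.

```python
def l1_norm_filter_mask(filters, sparsity):
    """Return a 0/1 keep-mask over filters, zeroing the smallest-L1-norm ones.

    `filters` is a list of flat weight lists, one per output filter.
    Hypothetical helper for illustration; not the NNI implementation.
    """
    norms = [sum(abs(w) for w in f) for f in filters]
    num_pruned = int(len(filters) * sparsity)
    # Indices sorted by ascending L1 norm; the first `num_pruned` get masked.
    order = sorted(range(len(filters)), key=lambda i: norms[i])
    pruned = set(order[:num_pruned])
    return [0 if i in pruned else 1 for i in range(len(filters))]


# Four toy filters with L1 norms 0.3, 3.0, 0.05 and 3.5.
filters = [[0.1, -0.2], [1.0, 2.0], [0.0, 0.05], [-3.0, 0.5]]
mask = l1_norm_filter_mask(filters, sparsity=0.5)
kept = sum(mask) / len(mask)
print(mask, "kept:", kept, "sparsity:", 1 - kept)
```

A mask value of 1 marks a filter that is kept, so the mean of the mask is the kept fraction and one minus that mean is the sparsity.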
+# %%
+# Speed up the original model with the masks. Note that `ModelSpeedup` requires an unwrapped model.
+# The model becomes smaller after speedup
+# and reaches a higher sparsity ratio, because `ModelSpeedup` propagates the masks across layers.
+# If the model was wrapped by a pruner, unwrap it before speedup.
+pruner._unwrap_model()
+# speed up the model
+from nni.compression.pytorch.speedup import ModelSpeedup
+ModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()
+# %%
+# The model is now actually smaller after speedup.
+print(model)
+# %%
+# Fine-tuning Compacted Model
+# ---------------------------
+# Note that if the model has been sped up, you need to create a new optimizer for fine-tuning,
+# because speedup replaces the masked large layers with smaller dense ones.
+optimizer = SGD(model.parameters(), 1e-2)
+for epoch in range(3):
+    trainer(model, optimizer, criterion)
--- a/examples/tutorials/pruning_speed_up.py
+++ b/examples/tutorials/pruning_speed_up.py
+"""
+Speed Up Model with Mask
+========================
+Introduction
+------------
+Pruning algorithms usually use weight masks to simulate real pruning. Masks can be used
+to check model performance for a specific pruning setting (or sparsity), but they give no real speedup.
+Since model speedup is the ultimate goal of model pruning, we provide a tool
+that converts a model to a smaller one based on user-provided masks (the masks come from the
+pruning algorithms).
+There are two types of pruning. One is fine-grained pruning, which does not change the shape of weights
+or input/output tensors. A sparse kernel is required to speed up a fine-grained pruned layer.
+The other is coarse-grained pruning (e.g., channels), where the shape of weights and input/output tensors usually changes.
+To speed up this kind of pruning, there is no need for a sparse kernel; the pruned layer is simply replaced with a smaller one.
+Since support for sparse kernels in the community is limited,
+we only support the speedup of coarse-grained pruning and leave fine-grained pruning for the future.
+Design and Implementation
+-------------------------
+To speed up a model, the pruned layers should be replaced, either with a smaller layer for a coarse-grained mask
+or with a sparse kernel for a fine-grained mask. A coarse-grained mask usually changes the shape of the weights or input/output tensors,
+so we need shape inference to check whether other, unpruned layers must also be replaced because of the shape change.
+Therefore, in our design, there are two main steps: first, do shape inference to find all the modules that should be replaced;
+second, replace the modules.
+The first step requires the topology (i.e., connections) of the model; we use ``jit.trace`` to obtain the model graph for PyTorch.
+The new shape of each module is inferred automatically by NNI. The unchanged parts of the outputs during forward and of the inputs during backward are prepared for reduction.
+For each type of module, we should prepare a function for module replacement.
+The module replacement function returns a newly created module which is smaller.
+Usage
+-----
+"""
+# %%
+# First, generate a mask for the model.
+# We usually use an NNI pruner to generate the masks, then use ``ModelSpeedup`` to compact the model.
+# But ``ModelSpeedup`` is a relatively independent tool, so you can also use it on its own.
+import torch
+from scripts.compression_mnist_model import TorchModel, device
+model = TorchModel().to(device)
+# masks = {layer_name: {'weight': weight_mask, 'bias': bias_mask}}
+conv1_mask = torch.ones_like(model.conv1.weight.data)
+# mask the first three output channels in conv1
+conv1_mask[0: 3] = 0
+masks = {'conv1': {'weight': conv1_mask}}
+# %%
+# Show the original model structure.
+print(model)
+# %%
+# Roughly test the original model inference speed.
+import time
+start = time.time()
+model(torch.rand(128, 1, 28, 28).to(device))
+print('Original Model - Elapsed Time : ', time.time() - start)
+# %%
+# Speed up the model and show the model structure after speed up.
+from nni.compression.pytorch import ModelSpeedup
+ModelSpeedup(model, torch.rand(10, 1, 28, 28).to(device), masks).speedup_model()
+print(model)
+# %%
+# Roughly test the inference speed of the model after speed-up.
+start = time.time()
+model(torch.rand(128, 1, 28, 28).to(device))
+print('Speedup Model - Elapsed Time : ', time.time() - start)
+# %%
+# For combining usage of ``Pruner`` masks generation with ``ModelSpeedup``,
+# please refer to `Pruning Quick Start <./pruning_quick_start_mnist.html>`__.
+#
+# NOTE: The current implementation supports PyTorch 1.3.1 or newer.
+#
+# Limitations
+# -----------
+#
+# For PyTorch we can only replace modules; if functions in ``forward`` should be replaced,
+# our current implementation does not work. One workaround is to make the function a PyTorch module.
+#
+# If you want to speed up your own model and it is not supported by the current implementation,
+# you need to implement the replace function for module replacement. Contributions are welcome.
+#
+# Speedup Results of Examples
+# ---------------------------
+#
+# The code of these experiments can be found :githublink:`here <examples/model_compress/pruning/speedup/model_speedup.py>`.
+#
+# These results were measured on the `legacy pruning framework <../comporession/pruning_legacy>`__; new results are coming soon.
+#
+# slim pruner example
+# ^^^^^^^^^^^^^^^^^^^
+#
+# on one V100 GPU,
+# input tensor: ``torch.randn(64, 3, 32, 32)``
+#
+# .. list-table::
+#    :header-rows: 1
+#    :widths: auto
+#
+#    * - Times
+#      - Mask Latency
+#      - Speedup Latency
+#    * - 1
+#      - 0.01197
+#      - 0.005107
+#    * - 2
+#      - 0.02019
+#      - 0.008769
+#    * - 4
+#      - 0.02733
+#      - 0.014809
+#    * - 8
+#      - 0.04310
+#      - 0.027441
+#    * - 16
+#      - 0.07731
+#      - 0.05008
+#    * - 32
+#      - 0.14464
+#      - 0.10027
+#
+# fpgm pruner example
+# ^^^^^^^^^^^^^^^^^^^
+#
+# on cpu,
+# input tensor: ``torch.randn(64, 1, 28, 28)``\ ,
+# note: the measured latencies vary widely between runs
+#
+# .. list-table::
+#    :header-rows: 1
+#    :widths: auto
+#
+#    * - Times
+#      - Mask Latency
+#      - Speedup Latency
+#    * - 1
+#      - 0.01383
+#      - 0.01839
+#    * - 2
+#      - 0.01167
+#      - 0.003558
+#    * - 4
+#      - 0.01636
+#      - 0.01088
+#    * - 40
+#      - 0.14412
+#      - 0.08268
+#    * - 40
+#      - 1.29385
+#      - 0.14408
+#    * - 40
+#      - 0.41035
+#      - 0.46162
+#    * - 400
+#      - 6.29020
+#      - 5.82143
+#
+# l1filter pruner example
+# ^^^^^^^^^^^^^^^^^^^^^^^
+#
+# on one V100 GPU,
+# input tensor: ``torch.randn(64, 3, 32, 32)``
+#
+# .. list-table::
+#    :header-rows: 1
+#    :widths: auto
+#
+#    * - Times
+#      - Mask Latency
+#      - Speedup Latency
+#    * - 1
+#      - 0.01026
+#      - 0.003677
+#    * - 2
+#      - 0.01657
+#      - 0.008161
+#    * - 4
+#      - 0.02458
+#      - 0.020018
+#    * - 8
+#      - 0.03498
+#      - 0.025504
+#    * - 16
+#      - 0.06757
+#      - 0.047523
+#    * - 32
+#      - 0.10487
+#      - 0.086442
+#
+# APoZ pruner example
+# ^^^^^^^^^^^^^^^^^^^
+#
+# on one V100 GPU,
+# input tensor: ``torch.randn(64, 3, 32, 32)``
+#
+# .. list-table::
+#    :header-rows: 1
+#    :widths: auto
+#
+#    * - Times
+#      - Mask Latency
+#      - Speedup Latency
+#    * - 1
+#      - 0.01389
+#      - 0.004208
+#    * - 2
+#      - 0.01628
+#      - 0.008310
+#    * - 4
+#      - 0.02521
+#      - 0.014008
+#    * - 8
+#      - 0.03386
+#      - 0.023923
+#    * - 16
+#      - 0.06042
+#      - 0.046183
+#    * - 32
+#      - 0.12421
+#      - 0.087113
+#
+# SimulatedAnnealing pruner example
+# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+#
+# In this experiment, we use SimulatedAnnealing pruner to prune the resnet18 on the cifar10 dataset.
+# We measure the latencies and accuracies of the pruned model under different sparsity ratios, as shown in the following figure.
+# The latency is measured on one V100 GPU and the input tensor is ``torch.randn(128, 3, 32, 32)``.
+#
+# .. image:: ../../img/SA_latency_accuracy.png
+#
+# User configuration for ModelSpeedup
+# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+#
+# **PyTorch**
+#
+# ..  autoclass:: nni.compression.pytorch.ModelSpeedup
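The module replacement step described under "Design and Implementation" in `pruning_speed_up.py` can be sketched in plain PyTorch. This is a minimal illustration under stated assumptions, not NNI's implementation: it handles a single `Conv2d` with an output-channel mask, and the names `kept` and `small_conv` are hypothetical. In a full model, the next layer's input channels would also have to shrink, which is what the shape inference step is for.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Original layer and a coarse-grained mask that prunes output channels 0..2.
conv = nn.Conv2d(1, 6, 5, 1)
weight_mask = torch.ones_like(conv.weight.data)
weight_mask[0:3] = 0

# Output channels whose filters are not entirely masked are kept.
kept = weight_mask.flatten(1).sum(dim=1).nonzero().flatten()

# Build the smaller replacement layer and copy the surviving parameters.
small_conv = nn.Conv2d(1, len(kept), 5, 1)
with torch.no_grad():
    small_conv.weight.copy_(conv.weight[kept])
    small_conv.bias.copy_(conv.bias[kept])

x = torch.rand(2, 1, 28, 28)
full_out = conv(x)
small_out = small_conv(x)
print(full_out.shape, '->', small_out.shape)
```

The smaller layer reproduces the kept channels of the original output, which is why no sparse kernel is needed for coarse-grained masks.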
--- a/examples/tutorials/quantization_quick_start_mnist.py
+++ b/examples/tutorials/quantization_quick_start_mnist.py
+"""
+Quantization Quickstart
+=======================
+Quantization reduces model size and speeds up inference by reducing the number of bits required to represent weights or activations.
+In NNI, both post-training quantization algorithms and quantization-aware training algorithms are supported.
+Here we use `QAT_Quantizer` as an example to show the usage of quantization in NNI.
+"""
+# %%
+# Preparation
+# -----------
+#
+# In this tutorial, we use a simple model and pre-train it on the MNIST dataset.
+# If you are familiar with defining a model and training in PyTorch, you can skip directly to `Quantizing Model`_.
+import torch
+import torch.nn.functional as F
+from torch.optim import SGD
+from scripts.compression_mnist_model import TorchModel, trainer, evaluator, device
+# define the model
+model = TorchModel().to(device)
+# define the optimizer and criterion for pre-training
+optimizer = SGD(model.parameters(), 1e-2)
+criterion = F.nll_loss
+# pre-train and evaluate the model on MNIST dataset
+for epoch in range(3):
+    trainer(model, optimizer, criterion)
+    evaluator(model)
+# %%
+# Quantizing Model
+# ----------------
+#
+# Initialize a `config_list`.
+config_list = [{
+    'quant_types': ['input', 'weight'],
+    'quant_bits': {'input': 8, 'weight': 8},
+    'op_names': ['conv1']
+}, {
+    'quant_types': ['output'],
+    'quant_bits': {'output': 8},
+    'op_names': ['relu1']
+}, {
+    'quant_types': ['input', 'weight'],
+    'quant_bits': {'input': 8, 'weight': 8},
+    'op_names': ['conv2']
+}, {
+    'quant_types': ['output'],
+    'quant_bits': {'output': 8},
+    'op_names': ['relu2']
+}]
+# %%
+# fine-tune the model using QAT
+from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
+dummy_input = torch.rand(32, 1, 28, 28).to(device)
+quantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input)
+quantizer.compress()
+for epoch in range(3):
+    trainer(model, optimizer, criterion)
+    evaluator(model)
+# %%
+# export model and get calibration_config
+model_path = "./log/mnist_model.pth"
+calibration_path = "./log/mnist_calibration.pth"
+calibration_config = quantizer.export_model(model_path, calibration_path)
+print("calibration_config: ", calibration_config)
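To make "reducing the number of bits" concrete, here is a minimal sketch of generic affine (scale and zero-point) quantization followed by dequantization. It is an assumption-laden illustration of the general idea, not the code path of `QAT_Quantizer`; the function name `fake_quantize` and the per-tensor min/max range are choices made for this example.

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Quantize x to `bits`-bit unsigned integers and map it back to float."""
    qmin, qmax = 0, 2 ** bits - 1
    # Include zero in the range so that 0.0 is exactly representable.
    x_min = min(x.min().item(), 0.0)
    x_max = max(x.max().item(), 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # guard against an all-zero input
    zero_point = round(qmin - x_min / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

torch.manual_seed(0)
w = torch.randn(16, 6, 5, 5)
w_q = fake_quantize(w, bits=8)

step = (w.max() - w.min()).item() / 255       # size of one quantization step
max_err = (w - w_q).abs().max().item()        # worst-case rounding error
print('step: {:.5f}, max error: {:.5f}'.format(step, max_err))
```

The quantized tensor takes at most 256 distinct values, and the rounding error stays on the order of one quantization step.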
--- a/examples/tutorials/quantization_speed_up.py
+++ b/examples/tutorials/quantization_speed_up.py
+"""
+Speed Up Model with Calibration Config
+======================================
+Introduction
+------------
+Deep neural networks are computation intensive and memory intensive,
+which makes them difficult to deploy. Quantization is a
+fundamental technique that is widely used to reduce memory footprint and speed up the inference
+process. Many frameworks have begun to support quantization, but few of them support mixed precision
+quantization and deliver real speedup. Frameworks like `HAQ: Hardware-Aware Automated Quantization with Mixed Precision <https://arxiv.org/pdf/1811.08886.pdf>`__\, only support simulated mixed precision quantization, which does
+not speed up the inference process. To get real speedup from mixed precision quantization and
+help people get real feedback from hardware, we design a general framework with a simple interface that allows NNI quantization algorithms to connect to different
+DL model optimization backends (e.g., TensorRT, NNFusion). This gives users an end-to-end experience: after quantizing their model
+with quantization algorithms, the quantized model can be directly sped up with the connected optimization backend. NNI connects to
+TensorRT at this stage, and will support more backends in the future.
+Design and Implementation
+-------------------------
+To support speeding up mixed precision quantization, we divide the framework into two parts, frontend and backend.
+The frontend can be a popular training framework such as PyTorch or TensorFlow. The backend can be an inference
+framework for different hardware, such as TensorRT. At present, we support PyTorch as the frontend and
+TensorRT as the backend. To convert a PyTorch model to a TensorRT engine, we use ONNX as the intermediate graph
+representation: we convert the PyTorch model to an ONNX model, then TensorRT parses the ONNX
+model to generate an inference engine.
+Quantization-aware training combines the NNI quantization algorithm 'QAT' and the NNI quantization speedup tool.
+Users should set a config to train a quantized model using the QAT algorithm (please refer to `NNI Quantization Algorithms <https://nni.readthedocs.io/en/stable/Compression/Quantizer.html>`__\  ).
+After quantization-aware training, users get a new config with calibration parameters and a model with quantized weights. By passing the new config and the model to the quantization speedup tool, users get a real mixed precision speedup engine for inference.
+After getting the mixed precision engine, users can do inference with input data.
+Note
+* We recommend using "cpu" (host) as the data device (for both inference data and calibration data), since data should be on the host initially and will be transferred to the device before inference. If the data is not on "cpu" (host), this tool will move it to "cpu", which may add unnecessary overhead.
+* Users can also do post-training quantization with TensorRT directly (a calibration dataset must be provided).
+* Not all op types are supported right now. At present, NNI supports Conv, Linear, Relu and MaxPool. More op types will be supported in following releases.
+Prerequisite
+------------
+CUDA version >= 11.0
+TensorRT version >= 7.2
+Note
+* If you have not installed TensorRT or are using an older version, please refer to the `TensorRT Installation Guide <https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html>`__\  
+Usage
+-----
+"""
+# %%
+import torch
+import torch.nn.functional as F
+from torch.optim import SGD
+from scripts.compression_mnist_model import TorchModel, device, trainer, evaluator, test_trt
+config_list = [{
+    'quant_types': ['input', 'weight'],
+    'quant_bits': {'input': 8, 'weight': 8},
+    'op_names': ['conv1']
+}, {
+    'quant_types': ['output'],
+    'quant_bits': {'output': 8},
+    'op_names': ['relu1']
+}, {
+    'quant_types': ['input', 'weight'],
+    'quant_bits': {'input': 8, 'weight': 8},
+    'op_names': ['conv2']
+}, {
+    'quant_types': ['output'],
+    'quant_bits': {'output': 8},
+    'op_names': ['relu2']
+}]
+model = TorchModel().to(device)
+optimizer = SGD(model.parameters(), lr=0.01, momentum=0.5)
+criterion = F.nll_loss
+dummy_input = torch.rand(32, 1, 28, 28).to(device)
+from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
+quantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input)
+quantizer.compress()
+# %%
+# fine-tune the model using QAT
+for epoch in range(3):
+    trainer(model, optimizer, criterion)
+    evaluator(model)
+# %%
+# export model and get calibration_config
+model_path = "./log/mnist_model.pth"
+calibration_path = "./log/mnist_calibration.pth"
+calibration_config = quantizer.export_model(model_path, calibration_path)
+print("calibration_config: ", calibration_config)
+# %%
+# build a TensorRT engine to get a real speedup
+# from nni.compression.pytorch.quantization_speedup import ModelSpeedupTensorRT
+# input_shape = (32, 1, 28, 28)
+# engine = ModelSpeedupTensorRT(model, input_shape, config=calibration_config, batchsize=32)
+# engine.compress()
+# test_trt(engine)
+# %%
+# Note that NNI also supports post-training quantization directly; please refer to the complete examples for details.
+#
+# For complete examples please refer to :githublink:`the code <examples/model_compress/quantization/mixed_precision_speedup_mnist.py>`.
+#
+# For more parameters of the class 'ModelSpeedupTensorRT', you can refer to `Model Compression API Reference <https://nni.readthedocs.io/en/stable/Compression/CompressionReference.html#quantization-speedup>`__\.
+#
+# Mnist test
+# ^^^^^^^^^^
+#
+# on one GTX2080 GPU,
+# input tensor: ``torch.randn(128, 1, 28, 28)``
+#
+# .. list-table::
+#    :header-rows: 1
+#    :widths: auto
+#
+#    * - Quantization Strategy
+#      - Latency
+#      - Accuracy
+#    * - all in 32bit
+#      - 0.001199961
+#      - 96%
+#    * - mixed precision(average bit 20.4)
+#      - 0.000753688
+#      - 96%
+#    * - all in 8bit
+#      - 0.000229869
+#      - 93.7%
+#
+# Cifar10 resnet18 test (train one epoch)
+# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+#
+# on one GTX2080 GPU,
+# input tensor: ``torch.randn(128, 3, 32, 32)``
+#
+# .. list-table::
+#    :header-rows: 1
+#    :widths: auto
+#
+#    * - Quantization Strategy
+#      - Latency
+#      - Accuracy
+#    * - all in 32bit
+#      - 0.003286268
+#      - 54.21%
+#    * - mixed precision(average bit 11.55)
+#      - 0.001358022
+#      - 54.78%
+#    * - all in 8bit
+#      - 0.000859139
+#      - 52.81%
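The latency figures in the "Mnist test" table can be turned into speedup ratios relative to the 32-bit baseline. The sketch below only does arithmetic on the numbers copied from that table; the dictionary layout and the name `speedup` are illustrative.

```python
# Latencies copied from the "Mnist test" table (unit as reported there).
latency = {
    'all in 32bit': 0.001199961,
    'mixed precision(average bit 20.4)': 0.000753688,
    'all in 8bit': 0.000229869,
}

baseline = latency['all in 32bit']
speedup = {name: baseline / value for name, value in latency.items()}

for name, ratio in speedup.items():
    print('{:<36} {:.2f}x'.format(name, ratio))
```

On these numbers the mixed precision engine is roughly 1.6 times faster than the 32-bit baseline and the 8-bit engine roughly 5.2 times faster, with the accuracy trade-off shown in the table.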
--- a/examples/tutorials/scripts/compression_mnist_model.py
+++ b/examples/tutorials/scripts/compression_mnist_model.py
+from pathlib import Path
+root_path = Path(__file__).parent.parent
+# define the model
+import torch
+from torch import nn
+from torch.nn import functional as F
+class TorchModel(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.conv1 = nn.Conv2d(1, 6, 5, 1)
+        self.conv2 = nn.Conv2d(6, 16, 5, 1)
+        self.fc1 = nn.Linear(16 * 4 * 4, 120)
+        self.fc2 = nn.Linear(120, 84)
+        self.fc3 = nn.Linear(84, 10)
+    def forward(self, x):
+        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
+        x = F.max_pool2d(F.relu(self.conv2(x)), (2, 2))
+        x = torch.flatten(x, 1)
+        x = F.relu(self.fc1(x))
+        x = F.relu(self.fc2(x))
+        x = self.fc3(x)
+        return F.log_softmax(x, dim=1)
+use_cuda = torch.cuda.is_available()
+device = torch.device("cuda" if use_cuda else "cpu")
+# load data
+from torchvision import datasets, transforms
+train_loader = torch.utils.data.DataLoader(
+    datasets.MNIST(root_path / 'data', train=True, download=True, transform=transforms.Compose([
+        transforms.ToTensor(),
+        transforms.Normalize((0.1307,), (0.3081,))
+    ])), batch_size=128, shuffle=True)
+test_loader = torch.utils.data.DataLoader(
+    datasets.MNIST(root_path / 'data', train=False, transform=transforms.Compose([
+        transforms.ToTensor(),
+        transforms.Normalize((0.1307,), (0.3081,))
+    ])), batch_size=1000, shuffle=True)
+# define the trainer and evaluator
+def trainer(model, optimizer, criterion):
+    # training the model
+    model.train()
+    for data, target in train_loader:
+        data, target = data.to(device), target.to(device)
+        optimizer.zero_grad()
+        output = model(data)
+        loss = criterion(output, target)
+        loss.backward()
+        optimizer.step()
+def evaluator(model):
+    # evaluating the model accuracy and average test loss
+    model.eval()
+    test_loss = 0
+    correct = 0
+    test_dataset_length = len(test_loader.dataset)
+    with torch.no_grad():
+        for data, target in test_loader:
+            data, target = data.to(device), target.to(device)
+            output = model(data)
+            # sum up batch loss
+            test_loss += F.nll_loss(output, target, reduction='sum').item()
+            # get the index of the max log-probability
+            pred = output.argmax(dim=1, keepdim=True)
+            correct += pred.eq(target.view_as(pred)).sum().item()
+    test_loss /= test_dataset_length
+    accuracy = 100. * correct / test_dataset_length
+    print('Average test loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)'.format(test_loss, correct, test_dataset_length, accuracy))
+def test_trt(engine):
+    # evaluate a TensorRT engine on the test set and accumulate the inference time
+    test_loss = 0
+    correct = 0
+    time_elapsed = 0
+    for data, target in test_loader:
+        output, time = engine.inference(data)
+        test_loss += F.nll_loss(output, target, reduction='sum').item()
+        pred = output.argmax(dim=1, keepdim=True)
+        correct += pred.eq(target.view_as(pred)).sum().item()
+        time_elapsed += time
+    test_loss /= len(test_loader.dataset)
+    print('Loss: {}  Accuracy: {}%'.format(
+        test_loss, 100 * correct / len(test_loader.dataset)))
+    print("Inference elapsed_time (whole dataset): {}s".format(time_elapsed))
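The `16 * 4 * 4` input size of `fc1` in `TorchModel` follows from the convolution and pooling shapes on a 28x28 MNIST image. A short sketch that traces the shapes, reusing the same layer hyperparameters as `compression_mnist_model.py` (the variable names `after_block1`, `after_block2` and `flat` are illustrative):

```python
import torch
from torch import nn
from torch.nn import functional as F

conv1 = nn.Conv2d(1, 6, 5, 1)
conv2 = nn.Conv2d(6, 16, 5, 1)

x = torch.rand(1, 1, 28, 28)                  # one MNIST-sized image
x = F.max_pool2d(F.relu(conv1(x)), (2, 2))    # 28 -> 24 (conv) -> 12 (pool)
after_block1 = x.shape
x = F.max_pool2d(F.relu(conv2(x)), (2, 2))    # 12 -> 8 (conv) -> 4 (pool)
after_block2 = x.shape
flat = torch.flatten(x, 1)                    # 16 * 4 * 4 = 256 features

print(after_block1, after_block2, flat.shape)
```

This also shows why pruning output channels of `conv2` changes the input size that `fc1` must accept.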
--- a/nni/algorithms/compression/pytorch/quantization/bnn_quantizer.py
+++ b/nni/algorithms/compression/pytorch/quantization/bnn_quantizer.py
@@ -22,9 +22,74 @@ class ClipGrad(QuantGrad):
 class BNNQuantizer(Quantizer):
-    """Binarized Neural Networks, as defined in:
+    r"""
-    Binarized Neural Networks: Training Deep Neural Networks with Weights and Outputs Constrained to +1 or -1
+    Binarized Neural Networks, as defined in:
-    (https://arxiv.org/abs/1602.02830)
+    `Binarized Neural Networks: Training Deep Neural Networks with Weights and
+    Activations Constrained to +1 or -1 <https://arxiv.org/abs/1602.02830>`__\ ,
+    ..
+        We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time.
+        At training-time the binary weights and activations are used for computing the parameters gradients. During the forward pass,
+        BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations,
+        which is expected to substantially improve power-efficiency.
+    Parameters
+    ----------
+    model : torch.nn.Module
+        Model to be quantized.
+    config_list : List[Dict]
+        List of configurations for quantization. Supported keys for dict:
+            - quant_types : List[str]
+                Type of quantization you want to apply, currently support 'weight', 'input', 'output'.
+            - quant_bits : Union[int, Dict[str, int]]
+                Bits length of quantization, key is the quantization type, value is the length, eg. {'weight': 8},
+                When the type is int, all quantization types share same bits length.
+            - op_types : List[str]
+                Types of nn.module you want to apply quantization, eg. 'Conv2d'.
+            - op_names : List[str]
+                Names of nn.module you want to apply quantization, eg. 'conv1'.
+            - exclude : bool
+                Set True then the layers setting by op_types and op_names will be excluded from quantization.
+    optimizer : torch.optim.Optimizer
+        Optimizer is required in `BNNQuantizer`, NNI will patch the optimizer and count the optimize step number.
+    Examples
+    --------
+        >>> from nni.algorithms.compression.pytorch.quantization import BNNQuantizer
+        >>> model = ...
+        >>> config_list = [{'quant_types': ['weight', 'input'], 'quant_bits': {'weight': 8, 'input': 8}, 'op_types': ['Conv2d']}]
+        >>> optimizer = ...
+        >>> quantizer = BNNQuantizer(model, config_list, optimizer)
+        >>> quantizer.compress()
+        >>> # Training Process...
+    For detailed example please refer to
+    :githublink:`examples/model_compress/quantization/BNN_quantizer_cifar10.py
+    <examples/model_compress/quantization/BNN_quantizer_cifar10.py>`.
+    Notes
+    -----
+    **Results**
+    We implemented one of the experiments in
+    `Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
+    <https://arxiv.org/abs/1602.02830>`__:
+    we quantized the **VGGNet** for CIFAR-10 described in the paper. Our experimental results are as follows:
+    .. list-table::
+        :header-rows: 1
+        :widths: auto
+        * - Model
+          - Accuracy
+        * - VGGNet
+          - 86.93%
+    The experiments code can be found at
+    :githublink:`examples/model_compress/quantization/BNN_quantizer_cifar10.py
+    <examples/model_compress/quantization/BNN_quantizer_cifar10.py>`
    """
    def __init__(self, model, config_list, optimizer):
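The binarization described in the BNN paper maps values to +1 or -1 in the forward pass and uses a straight-through estimator that cancels the gradient where the input magnitude exceeds 1. The sketch below follows the paper's description; it is not taken from `BNNQuantizer`, and the class name `BinarizeSTE` is hypothetical.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a clipped straight-through gradient."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Map every value to +1 or -1 (zero is sent to +1 here).
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass the gradient through where |x| <= 1, and zero it elsewhere.
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

w = torch.tensor([-1.5, -0.3, 0.0, 0.7, 2.0], requires_grad=True)
w_b = BinarizeSTE.apply(w)
w_b.sum().backward()

print(w_b.detach(), w.grad)
```

Without the straight-through estimator the gradient of the sign function would be zero almost everywhere, so the latent full-precision weights could not be trained.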

--- a/nni/algorithms/compression/pytorch/quantization/dorefa_quantizer.py
+++ b/nni/algorithms/compression/pytorch/quantization/dorefa_quantizer.py
@@ -13,9 +13,45 @@ logger = logging.getLogger(__name__)
 class DoReFaQuantizer(Quantizer):
-    """Quantizer using the DoReFa scheme, as defined in:
+    r"""
-    Zhou et al., DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients
+    Quantizer using the DoReFa scheme, as defined in:
-    (https://arxiv.org/abs/1606.06160)
+    `DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients <https://arxiv.org/abs/1606.06160>`__\ ,
+    in which authors Shuchang Zhou and Yuxin Wu propose an algorithm named DoReFa that quantizes weights, activations and gradients during training.
+    Parameters
+    ----------
+    model : torch.nn.Module
+        Model to be quantized.
+    config_list : List[Dict]
+        List of configurations for quantization. Supported keys for dict:
+            - quant_types : List[str]
+                Type of quantization you want to apply, currently support 'weight', 'input', 'output'.
+            - quant_bits : Union[int, Dict[str, int]]
+                Bits length of quantization, key is the quantization type, value is the length, eg. {'weight': 8},
+                When the type is int, all quantization types share same bits length.
+            - op_types : List[str]
+                Types of nn.module you want to apply quantization, eg. 'Conv2d'.
+            - op_names : List[str]
+                Names of nn.module you want to apply quantization, eg. 'conv1'.
+            - exclude : bool
+                If True, the layers selected by op_types and op_names are excluded from quantization.
+    optimizer : torch.optim.Optimizer
+        An optimizer is required by `DoReFaQuantizer`; NNI patches the optimizer to count the optimization steps.
+    Examples
+    --------
+        >>> from nni.algorithms.compression.pytorch.quantization import DoReFaQuantizer
+        >>> model = ...
+        >>> config_list = [{'quant_types': ['weight', 'input'], 'quant_bits': {'weight': 8, 'input': 8}, 'op_types': ['Conv2d']}]
+        >>> optimizer = ...
+        >>> quantizer = DoReFaQuantizer(model, config_list, optimizer)
+        >>> quantizer.compress()
+        >>> # Training Process...
+    For detailed example please refer to
+    :githublink:`examples/model_compress/quantization/DoReFaQuantizer_torch_mnist.py
+    <examples/model_compress/quantization/DoReFaQuantizer_torch_mnist.py>`.
    """
    def __init__(self, model, config_list, optimizer):
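The k-bit weight quantization formula from the DoReFa-Net paper can be sketched as follows. This is the forward computation as given in the paper for k greater than 1, written as an assumption about what the scheme looks like rather than a copy of `DoReFaQuantizer`; the function names `quantize_k` and `dorefa_weight` are illustrative, and the straight-through gradient needed for training is omitted.

```python
import torch

def quantize_k(x: torch.Tensor, k: int) -> torch.Tensor:
    """Round x in [0, 1] to the nearest of 2**k evenly spaced levels."""
    n = float(2 ** k - 1)
    return torch.round(x * n) / n

def dorefa_weight(w: torch.Tensor, k: int) -> torch.Tensor:
    """k-bit weight quantization following the DoReFa-Net paper (k > 1)."""
    t = torch.tanh(w)
    t = t / (2 * t.abs().max()) + 0.5   # squash into [0, 1]
    return 2 * quantize_k(t, k) - 1     # map back to [-1, 1]

torch.manual_seed(0)
w = torch.randn(6, 1, 5, 5)
w_q = dorefa_weight(w, k=2)
levels = torch.unique(w_q)

print(levels)
```

With `k=2` the quantized weights can take at most four distinct values in the range [-1, 1].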

--- a/nni/algorithms/compression/pytorch/quantization/lsq_quantizer.py
+++ b/nni/algorithms/compression/pytorch/quantization/lsq_quantizer.py
@@ -11,35 +11,56 @@ logger = logging.getLogger(__name__)
 class LsqQuantizer(Quantizer):
-    """Quantizer defined in:
+    r"""
-       Learned Step Size Quantization (ICLR 2020)
+    Quantizer defined in: `LEARNED STEP SIZE QUANTIZATION <https://arxiv.org/pdf/1902.08153.pdf>`__,
-       https://arxiv.org/pdf/1902.08153.pdf
+    authors Steven K. Esser and Jeffrey L. McKinstry provide an algorithm to train the scales with gradients.
+    ..
+        The authors introduce a novel means to estimate and scale the task loss gradient at each weight and activation
+        layer's quantizer step size, such that it can be learned in conjunction with other network parameters.
+    Parameters
+    ----------
+    model : torch.nn.Module
+        The model to be quantized.
+    config_list : List[Dict]
+        List of configurations for quantization. Supported keys for dict:
+            - quant_types : List[str]
+                Type of quantization you want to apply, currently support 'weight', 'input', 'output'.
+            - quant_bits : Union[int, Dict[str, int]]
+                Bits length of quantization, key is the quantization type, value is the length, eg. {'weight': 8},
+                When the type is int, all quantization types share same bits length.
+            - op_types : List[str]
+                Types of nn.module you want to apply quantization, eg. 'Conv2d'.
+            - op_names : List[str]
+                Names of nn.module you want to apply quantization, eg. 'conv1'.
+            - exclude : bool
+                If True, the layers selected by op_types and op_names are excluded from quantization.
+    optimizer : torch.optim.Optimizer
+        An optimizer is required by `LsqQuantizer`; NNI patches the optimizer to count the optimization steps.
+    dummy_input : Tuple[torch.Tensor]
+        Inputs to the model, which are used to get the graph of the module. The graph is used to find Conv-Bn patterns.
+        And then the batch normalization folding would be enabled. If dummy_input is not given,
+        the batch normalization folding would be disabled.
+    Examples
+    --------
+        >>> from nni.algorithms.compression.pytorch.quantization import LsqQuantizer
+        >>> model = ...
+        >>> config_list = [{'quant_types': ['weight', 'input'], 'quant_bits': {'weight': 8, 'input': 8}, 'op_types': ['Conv2d']}]
+        >>> optimizer = ...
+        >>> dummy_input = torch.rand(...)
+        >>> quantizer = LsqQuantizer(model, config_list, optimizer, dummy_input=dummy_input)
+        >>> quantizer.compress()
+        >>> # Training Process...
+    For detailed example please refer to
+    :githublink:`examples/model_compress/quantization/LSQ_torch_quantizer.py <examples/model_compress/quantization/LSQ_torch_quantizer.py>`.
    """
    def __init__(self, model, config_list, optimizer, dummy_input=None):
-        """
-        Parameters
-        ----------
-        model : torch.nn.Module
-            the model to be quantized
-        config_list : list of dict
-            list of configurations for quantization
-            supported keys for dict:
-                - quant_types : list of string
-                    type of quantization you want to apply, currently support 'weight', 'input', 'output'
-                - quant_bits : int or dict of {str : int}
-                    bits length of quantization, key is the quantization type, value is the length, eg. {'weight': 8},
-                    when the type is int, all quantization types share same bits length
-                - quant_start_step : int
-                    disable quantization until model are run by certain number of steps, this allows the network to enter a more stable
-                    state where output quantization ranges do not exclude a signiﬁcant fraction of values, default value is 0
-                - op_types : list of string
-                    types of nn.module you want to apply quantization, eg. 'Conv2d'
-                - dummy_input : tuple of tensor
-                    inputs to the model, which are used to get the graph of the module. The graph is used to find
-                    Conv-Bn patterns. And then the batch normalization folding would be enabled. If dummy_input is not
-                    given, the batch normalization folding would be disabled.
-        """
        assert isinstance(optimizer, torch.optim.Optimizer), "unrecognized optimizer type"
        super().__init__(model, config_list, optimizer, dummy_input)
        device = next(model.parameters()).device

--- a/nni/algorithms/compression/pytorch/quantization/native_quantizer.py
+++ b/nni/algorithms/compression/pytorch/quantization/native_quantizer.py
@@ -12,7 +12,32 @@ logger = logging.getLogger(__name__)
 class NaiveQuantizer(Quantizer):
-    """quantize weight to 8 bits
+    r"""
+    Quantize weight to 8 bits directly.
+    Parameters
+    ----------
+    model : torch.nn.Module
+        Model to be quantized.
+    config_list : List[Dict]
+        List of configurations for quantization. Supported keys:
+            - quant_types : List[str]
+                Type of quantization you want to apply, currently support 'weight', 'input', 'output'.
+            - quant_bits : Union[int, Dict[str, int]]
+                Bits length of quantization, key is the quantization type, value is the length, eg. {'weight': 8},
+                when the type is int, all quantization types share same bits length.
+            - op_types : List[str]
+                Types of nn.module you want to apply quantization, eg. 'Conv2d'.
+            - op_names : List[str]
+                Names of nn.module you want to apply quantization, eg. 'conv1'.
+            - exclude : bool
+                If True, the layers selected by op_types and op_names are excluded from quantization.
+    Examples
+    --------
+        >>> from nni.algorithms.compression.pytorch.quantization import NaiveQuantizer
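+        >>> model = ...
+        >>> config_list = [{'quant_types': ['weight'], 'quant_bits': {'weight': 8}, 'op_types': ['Conv2d']}]
+        >>> NaiveQuantizer(model, config_list).compress()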
    """
    def __init__(self, model, config_list, optimizer=None):

--- a/nni/algorithms/compression/pytorch/quantization/observer_quantizer.py
+++ b/nni/algorithms/compression/pytorch/quantization/observer_quantizer.py
@@ -14,7 +14,12 @@ logger = logging.getLogger(__name__)
 class ObserverQuantizer(Quantizer):
-    """This quantizer uses observers to record weight/output statistics to get quantization information.
+    r"""
+    Observer quantizer is a framework of post-training quantization.
+    It will insert observers into the place where the quantization will happen.
+    During quantization calibration, each observer will record all the tensors it 'sees'.
+    These tensors will be used to calculate the quantization statistics after calibration.
    The whole process can be divided into three steps:
    1. It will register observers to the place where quantization would happen (just like registering hooks).
@@ -23,6 +28,66 @@ class ObserverQuantizer(Quantizer):
    Note that the observer type, tensor dtype and quantization qscheme are hard coded for now. Their customization
    are under development and will be ready soon.
+    Parameters
+    ----------
+    model : torch.nn.Module
+        Model to be quantized.
+    config_list : List[Dict]
+        List of configurations for quantization. Supported keys:
+            - quant_types : List[str]
+                Types of quantization to apply. Currently supported: 'weight', 'input', 'output'.
+            - quant_bits : Union[int, Dict[str, int]]
+                Bit width of quantization. In a dict, the key is the quantization type and the value is the bit width,
+                e.g. {'weight': 8}. If an int is given, all quantization types share the same bit width.
+            - op_types : List[str]
+                Types of nn.Module to quantize, e.g. 'Conv2d'.
+            - op_names : List[str]
+                Names of the nn.Module instances to quantize, e.g. 'conv1'.
+            - exclude : bool
+                If True, the layers selected by op_types and op_names are excluded from quantization.
+    optimizer : torch.optim.Optimizer
+        Optimizer is optional in `ObserverQuantizer`.
+    Examples
+    --------
+        >>> from nni.algorithms.compression.pytorch.quantization import ObserverQuantizer
+        >>> model = ...
+        >>> config_list = [{'quant_types': ['weight', 'input'], 'quant_bits': {'weight': 8, 'input': 8}, 'op_types': ['Conv2d']}]
+        >>> quantizer = ObserverQuantizer(model, config_list)
+        >>> # define a calibration function
+        >>> def calibration(model, calib_loader):
+        >>>     model.eval()
+        >>>     with torch.no_grad():
+        >>>         for data, _ in calib_loader:
+        >>>             model(data)
+        >>> calibration(model, calib_loader)
+        >>> quantizer.compress()
+    For a detailed example, please refer to
+    :githublink:`examples/model_compress/quantization/observer_quantizer.py <examples/model_compress/quantization/observer_quantizer.py>`.
+    .. note::
+        This quantizer is still under development for now. Some quantizer settings are hard-coded:
+        - weight observer: per_tensor_symmetric, qint8
+        - output observer: per_tensor_affine, quint8, reduce_range=True
+        Other settings (such as quant_type and op_names) can be configured.
+    Notes
+    -----
+    **About the compress API**
+    Before the `compress` API is called, the model will only record tensors' statistics and no quantization process will be executed.
+    After the `compress` API is called, the model will NOT record tensors' statistics any more. The quantization scale and zero point will
+    be generated for each tensor and will be used to quantize each tensor during inference (we call it evaluation mode).
+    **About calibration**
+    Usually we pick about 100 training/evaluation examples for calibration. If you find the accuracy is a bit low, try
+    to reduce the number of calibration examples.
    """
    def __init__(self, model, config_list, optimizer=None):
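
The record-then-freeze behaviour described in the docstring can be sketched in pure Python. This is a hedged illustration, not NNI's or PyTorch's code: `SimpleMinMaxObserver` and `quantize` are hypothetical names, values are plain floats, and the default range 0..127 is an assumption that mirrors an unsigned 8-bit type with `reduce_range=True` (7 usable bits).

```python
from typing import Iterable, Tuple


class SimpleMinMaxObserver:
    """Hypothetical observer that tracks the running min/max of everything it sees."""

    def __init__(self) -> None:
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values: Iterable[float]) -> None:
        # Calibration phase: only statistics are recorded, nothing is quantized.
        for v in values:
            self.min_val = min(self.min_val, v)
            self.max_val = max(self.max_val, v)

    def calculate_qparams(self, qmin: int = 0, qmax: int = 127) -> Tuple[float, int]:
        # Extend the range to include 0.0 so that zero is exactly representable.
        lo, hi = min(self.min_val, 0.0), max(self.max_val, 0.0)
        scale = (hi - lo) / (qmax - qmin) or 1.0
        zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
        return scale, zero_point


def quantize(x: float, scale: float, zero_point: int, qmin: int = 0, qmax: int = 127) -> int:
    """Affine quantization of one value using the frozen scale and zero point."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


observer = SimpleMinMaxObserver()
for batch in ([0.0, 1.0], [-1.0, 3.0]):  # stand-in for calibration batches
    observer.observe(batch)
scale, zero_point = observer.calculate_qparams()  # analogous to calling compress()
print(scale, zero_point, quantize(3.0, scale, zero_point))
```

Once the scale and zero point are computed, further inputs no longer change them, which corresponds to the "evaluation mode" mentioned in the notes.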

--- a/nni/algorithms/compression/pytorch/quantization/qat_quantizer.py
+++ b/nni/algorithms/compression/pytorch/quantization/qat_quantizer.py
@@ -107,36 +107,150 @@ def update_ema(biased_ema, value, decay):
 class QAT_Quantizer(Quantizer):
-    """Quantizer defined in:
+    r"""
+    Quantizer defined in:
    Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
    http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf
+    Authors Benoit Jacob and Skirmantas Kligys provide an algorithm to quantize the model with training.
+    ..
+        We propose an approach that simulates quantization effects in the forward pass of training.
+        Backpropagation still happens as usual, and all weights and biases are stored in floating point
+        so that they can be easily nudged by small amounts.
+        The forward propagation pass however simulates quantized inference as it will happen in the inference engine,
+        by implementing in floating-point arithmetic the rounding behavior of the quantization scheme:
+        * Weights are quantized before they are convolved with the input. If batch normalization (see [17]) is used for the layer,
+            the batch normalization parameters are “folded into” the weights before quantization.
+        * Activations are quantized at points where they would be during inference,
+            e.g. after the activation function is applied to a convolutional or fully connected layer’s output,
+            or after a bypass connection adds or concatenates the outputs of several layers together such as in ResNets.
+    Parameters
+    ----------
+    model : torch.nn.Module
+        Model to be quantized.
+    config_list : List[Dict]
+        List of configurations for quantization. Supported keys for dict:
+            - quant_types : List[str]
+                Types of quantization to apply. Currently supported: 'weight', 'input', 'output'.
+            - quant_bits : Union[int, Dict[str, int]]
+                Bit width of quantization. In a dict, the key is the quantization type and the value is the bit width,
+                e.g. {'weight': 8}. If an int is given, all quantization types share the same bit width.
+            - quant_start_step : int
+                Disable quantization until the model has run for a certain number of steps. This allows the network to enter a more stable
+                state where output quantization ranges do not exclude a significant fraction of values. Default value is 0.
+            - op_types : List[str]
+                Types of nn.Module to quantize, e.g. 'Conv2d'.
+            - op_names : List[str]
+                Names of the nn.Module instances to quantize, e.g. 'conv1'.
+            - exclude : bool
+                If True, the layers selected by op_types and op_names are excluded from quantization.
+    optimizer : torch.optim.Optimizer
+        Optimizer is required in `QAT_Quantizer`, NNI will patch the optimizer and count the optimize step number.
+    dummy_input : Tuple[torch.Tensor]
+        Inputs to the model, which are used to get the graph of the module. The graph is used to find Conv-Bn patterns
+        so that batch normalization folding can be enabled. If dummy_input is not given,
+        batch normalization folding is disabled.
+    Examples
+    --------
+        >>> from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
+        >>> model = ...
+        >>> config_list = [{'quant_types': ['weight', 'input'], 'quant_bits': {'weight': 8, 'input': 8}, 'op_types': ['Conv2d']}]
+        >>> optimizer = ...
+        >>> dummy_input = torch.rand(...)
+        >>> quantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input=dummy_input)
+        >>> quantizer.compress()
+        >>> # Training Process...
+    For a detailed example, please refer to
+    :githublink:`examples/model_compress/quantization/QAT_torch_quantizer.py <examples/model_compress/quantization/QAT_torch_quantizer.py>`.
+    Notes
+    -----
+    **Batch normalization folding**
+    Batch normalization folding is supported in QAT quantizer. It can be easily enabled by passing an argument `dummy_input` to
+    the quantizer, like:
+    .. code-block:: python
+        # assume your model takes an input of shape (1, 1, 28, 28)
+        # and dummy_input must be on the same device as the model
+        dummy_input = torch.randn(1, 1, 28, 28)
+        # pass the dummy_input to the quantizer
+        quantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input=dummy_input)
+    The quantizer will automatically detect Conv-BN patterns and simulate batch normalization folding process in the training
+    graph. Note that when the quantization aware training process is finished, the folded weight/bias would be restored after calling
+    `quantizer.export_model`.
+    **Quantization dtype and scheme customization**
+    Different backends on different devices use different quantization strategies (i.e. dtype (int or uint) and
+    scheme (per-tensor or per-channel and symmetric or affine)). QAT quantizer supports customization of mainstream dtypes and schemes.
+    There are two ways to set them. One way is setting them globally through a function named `set_quant_scheme_dtype` like:
+    .. code-block:: python
+        from nni.compression.pytorch.quantization.settings import set_quant_scheme_dtype
+        # This will set all the quantization of 'input' in 'per_tensor_affine' and 'uint' manner
+        set_quant_scheme_dtype('input', 'per_tensor_affine', 'uint')
+        # This will set all the quantization of 'output' in 'per_tensor_symmetric' and 'int' manner
+        set_quant_scheme_dtype('output', 'per_tensor_symmetric', 'int')
+        # This will set all the quantization of 'weight' in 'per_channel_symmetric' and 'int' manner
+        set_quant_scheme_dtype('weight', 'per_channel_symmetric', 'int')
+    The other way is more detailed. You can customize the dtype and scheme in each quantization config list like:
+    .. code-block:: python
+        config_list = [{
+            'quant_types': ['weight'],
+            'quant_bits':  8,
+            'op_types':['Conv2d', 'Linear'],
+            'quant_dtype': 'int',
+            'quant_scheme': 'per_channel_symmetric'
+        }, {
+            'quant_types': ['output'],
+            'quant_bits': 8,
+            'quant_start_step': 7000,
+            'op_types':['ReLU6'],
+            'quant_dtype': 'uint',
+            'quant_scheme': 'per_tensor_affine'
+        }]
+    **Multi-GPU training**
+    QAT quantizer natively supports multi-GPU training (DataParallel and DistributedDataParallel). Note that the quantizer
+    instantiation should happen before you wrap your model with DataParallel or DistributedDataParallel. For example:
+    .. code-block:: python
+        from torch.nn.parallel import DistributedDataParallel as DDP
+        from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
+        model = define_your_model()
+        quantizer = QAT_Quantizer(model, **other_params)  # <--- QAT_Quantizer instantiation
+        quantizer.compress()
+        model = DDP(model)
+        for i in range(epochs):
+            train(model)
+            eval(model)
    """
    def __init__(self, model, config_list, optimizer, dummy_input=None):
-        """
-        Parameters
-        ----------
-        layer : LayerInfo
-            the layer to quantize
-        config_list : list of dict
-            list of configurations for quantization
-            supported keys for dict:
-                - quant_types : list of string
-                    type of quantization you want to apply, currently support 'weight', 'input', 'output'
-                - quant_bits : int or dict of {str : int}
-                    bits length of quantization, key is the quantization type, value is the length, eg. {'weight', 8},
-                    when the type is int, all quantization types share same bits length
-                - quant_start_step : int
-                    disable quantization until model are run by certain number of steps, this allows the network to enter a more stable
-                    state where output quantization ranges do not exclude a signiﬁcant fraction of values, default value is 0
-                - op_types : list of string
-                    types of nn.module you want to apply quantization, eg. 'Conv2d'
-                - dummy_input : tuple of tensor
-                    inputs to the model, which are used to get the graph of the module. The graph is used to find
-                    Conv-Bn patterns. And then the batch normalization folding would be enabled. If dummy_input is not
-                    given, the batch normalization folding would be disabled.
-        """
        assert isinstance(optimizer, torch.optim.Optimizer), "unrecognized optimizer type"
        super().__init__(model, config_list, optimizer, dummy_input)
        self.quant_grad = QATGrad.apply
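
The "simulated quantization in the forward pass" idea, together with `quant_start_step`, can be sketched without any framework. This is a simplified illustration under stated assumptions, not the code in `qat_quantizer.py`: `fake_quantize` and `FakeQuantOutput` are hypothetical names, steps are counted per call rather than per optimizer step, the `update_ema` body is assumed from its signature in the hunk header, and gradients (the straight-through estimator) are not modelled.

```python
from typing import List, Optional


def update_ema(biased_ema: float, value: float, decay: float) -> float:
    """Exponential moving average used to smooth the tracked range."""
    return biased_ema * decay + (1 - decay) * value


def fake_quantize(x: float, rmin: float, rmax: float, bits: int = 8) -> float:
    """Quantize then immediately dequantize, so the value stays in floating point."""
    qmin, qmax = 0, 2 ** bits - 1
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)  # keep 0.0 exactly representable
    scale = (rmax - rmin) / (qmax - qmin) or 1.0
    zero_point = max(qmin, min(qmax, round(qmin - rmin / scale)))
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale


class FakeQuantOutput:
    """Hypothetical stand-in for output quantization with a `quant_start_step` delay."""

    def __init__(self, bits: int = 8, quant_start_step: int = 0, decay: float = 0.99) -> None:
        self.bits, self.quant_start_step, self.decay = bits, quant_start_step, decay
        self.steps = 0
        self.rmin: Optional[float] = None
        self.rmax: Optional[float] = None

    def __call__(self, values: List[float]) -> List[float]:
        lo, hi = min(values), max(values)
        if self.rmin is None or self.rmax is None:
            self.rmin, self.rmax = lo, hi  # first batch initializes the range
        else:
            self.rmin = update_ema(self.rmin, lo, self.decay)
            self.rmax = update_ema(self.rmax, hi, self.decay)
        self.steps += 1
        if self.steps <= self.quant_start_step:
            return list(values)  # still warming up: pass values through unchanged
        return [fake_quantize(v, self.rmin, self.rmax, self.bits) for v in values]


layer_out = FakeQuantOutput(bits=8, quant_start_step=2)
for _ in range(3):
    print(layer_out([-0.5, 0.1, 0.9]))
```

In this sketch the first two calls return the inputs untouched while the range statistics settle, and the third returns values snapped to the 8-bit grid, which is the effect `quant_start_step` is described as having.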

--- a/nni/algorithms/compression/v2/pytorch/pruning/amc_pruner.py
+++ b/nni/algorithms/compression/v2/pytorch/pruning/amc_pruner.py
@@ -160,9 +160,13 @@ class AMCTaskGenerator(TaskGenerator):
 class AMCPruner(IterativePruner):
-    """
+    r"""
-    A pytorch implementation of AMC: AutoML for Model Compression and Acceleration on Mobile Devices.
+    AMC pruner leverages reinforcement learning to provide the model compression policy.
-    (https://arxiv.org/pdf/1802.03494.pdf)
+    According to the authors, this learning-based compression policy outperforms conventional rule-based compression policies by achieving a higher compression ratio,
+    better preserving accuracy, and reducing human labor.
+    For more details, please refer to `AMC: AutoML for Model Compression and Acceleration on Mobile Devices <https://arxiv.org/pdf/1802.03494.pdf>`__.
    Suggest setting every `total_sparsity` in `config_list` to the same value.
    AMC pruner will treat the first sparsity in `config_list` as the global sparsity.
@@ -216,6 +220,18 @@ class AMCPruner(IterativePruner):
    target : str
        'flops' or 'params'. Note that the sparsity in other pruners always means the parameters sparse, but in AMC, you can choose flops sparse.
        This parameter is used to explain what the sparsity setting in config_list refers to.
+    Examples
+    --------
+        >>> from nni.algorithms.compression.v2.pytorch.pruning import AMCPruner
+        >>> config_list = [{'op_types': ['Conv2d'], 'total_sparsity': 0.5, 'max_sparsity_per_layer': 0.8}]
+        >>> dummy_input = torch.rand(...).to(device)
+        >>> evaluator = ...
+        >>> finetuner = ...
+        >>> pruner = AMCPruner(400, model, config_list, dummy_input, evaluator, finetuner=finetuner)
+        >>> pruner.compress()
+    The full script can be found :githublink:`here <examples/model_compress/pruning/v2/amc_pruning_torch.py>`.
    """
    def __init__(self, total_episode: int, model: Module, config_list: List[Dict], dummy_input: Tensor,