To write a new quantization algorithm, you can write a class that inherits ``nni.compression.pytorch.Quantizer``.
Then, override the member functions with the logic of your algorithm. The most important member function to override is ``quantize_weight``.
``quantize_weight`` directly returns the quantized weights rather than a mask, because for quantization the quantized weights cannot be obtained by applying a mask.
"""
from nni.compression.pytorch import Quantizer


class YourQuantizer(Quantizer):
    def __init__(self, model, config_list):
        """
        It is suggested to use the NNI-defined spec for config.
        """
        super().__init__(model, config_list)

    def quantize_weight(self, weight, config, **kwargs):
        """
        Subclasses should override this method to quantize weight tensors.
        This method is effectively hooked to :meth:`forward` of the model.

        Parameters
        ----------
        weight : Tensor
            weight that needs to be quantized
        config : dict
            the configuration for weight quantization
        """

        # Put your code to generate `new_weight` here
        new_weight = ...
        return new_weight

    def quantize_output(self, output, config, **kwargs):
        """
        Subclasses should override this method to quantize output.
        This method is effectively hooked to :meth:`forward` of the model.

        Parameters
        ----------
        output : Tensor
            output that needs to be quantized
        config : dict
            the configuration for output quantization
        """

        # Put your code to generate `new_output` here
        new_output = ...
        return new_output

    def quantize_input(self, *inputs, config, **kwargs):
        """
        Subclasses should override this method to quantize input.
        This method is effectively hooked to :meth:`forward` of the model.

        Parameters
        ----------
        inputs : Tensor
            inputs that need to be quantized
        config : dict
            the configuration for inputs quantization
        """

        # Put your code to generate `new_input` here
        new_input = ...
        return new_input

    def update_epoch(self, epoch_num):
        pass

    def step(self):
        """
        Can do some processing based on the model or the weights bound
        in the func bind_model
        """
        pass
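# %%
# A minimal usage sketch follows. The config format below follows the common NNI
# quantization spec and ``compress()`` is assumed to be inherited from the base
# compressor; both are assumptions here, so adjust them to the spec you actually define.

import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU(),
    torch.nn.Flatten(), torch.nn.Linear(16 * 30 * 30, 10))

config_list = [{
    'quant_types': ['weight', 'output'],        # which tensors of an op to quantize
    'quant_bits': {'weight': 8, 'output': 8},   # bit width per tensor type
    'op_types': ['Conv2d', 'Linear']
}]

quantizer = YourQuantizer(model, config_list)
quantizer.compress()  # wraps the configured layers so the quantize_* hooks run in forward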
# %%
# Customize backward function
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Sometimes it's necessary for a quantization operation to have a customized backward function,
# such as `Straight-Through Estimator <https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste>`__\ ,
# users can customize a backward function as follows:
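# The snippet below is a generic sketch built on ``torch.autograd.Function``:
# the forward pass rounds to a quantized grid, while the backward pass lets the
# gradient flow through unchanged (the Straight-Through Estimator). How such a
# function is registered with an NNI quantizer (e.g. via a ``QuantGrad`` subclass)
# is not shown here and may differ between NNI versions.

import torch


class RoundSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, tensor, scale):
        # fake-quantize: round to the nearest grid point defined by `scale`
        return torch.round(tensor / scale) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # straight-through: pretend the rounding was the identity function
        return grad_output, None


x = torch.randn(4, requires_grad=True)
y = RoundSTE.apply(x, 0.1).sum()
y.backward()
print(x.grad)  # all ones: the gradient flowed straight through the rounding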
Deep learning networks are computation-intensive and memory-intensive,
which increases the difficulty of deploying deep neural network models. Quantization is a
fundamental technique that is widely used to reduce the memory footprint and speed up the inference
process. Many frameworks have begun to support quantization, but few of them support mixed precision
quantization or achieve real speedup. Frameworks like `HAQ: Hardware-Aware Automated Quantization with Mixed Precision <https://arxiv.org/pdf/1811.08886.pdf>`__\ only support simulated mixed precision quantization, which does
not speed up the inference process. To get real speedup from mixed precision quantization and
help people get real feedback from hardware, we designed a general framework with a simple interface that allows NNI quantization algorithms to connect to different
DL model optimization backends (e.g., TensorRT, NNFusion). This gives users an end-to-end experience: after quantizing their model
with quantization algorithms, the quantized model can be directly sped up with the connected optimization backend. NNI connects to
TensorRT at this stage, and will support more backends in the future.
Design and Implementation
-------------------------
To support speeding up mixed precision quantization, we divide the framework into two parts: frontend and backend.
The frontend could be a popular training framework such as PyTorch or TensorFlow, and the backend could be an inference
framework for different hardware, such as TensorRT. At present, we support PyTorch as the frontend and
TensorRT as the backend. To convert a PyTorch model to a TensorRT engine, we leverage ONNX as the intermediate graph
representation: we first convert the PyTorch model to an ONNX model, and then TensorRT parses the ONNX
model to generate the inference engine.
Quantization aware training combines the NNI quantization algorithm 'QAT' and the NNI quantization speedup tool.
Users should set a config to train the quantized model using the QAT algorithm (please refer to `NNI Quantization Algorithms <https://nni.readthedocs.io/en/stable/Compression/Quantizer.html>`__\ ).
After quantization aware training, users get a new config with calibration parameters and a model with quantized weights. By passing the new config and the model to the quantization speedup tool, users get a real mixed precision engine for inference.
Once the mixed precision engine is obtained, users can run inference with their input data.
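Below is a minimal sketch of this flow. The names used here (``QAT_Quantizer``, ``export_model``, ``ModelSpeedupTensorRT`` and its ``compress``/``inference`` methods) are assumptions based on the NNI examples and may differ from the exact interface in your NNI version; please check the linked reference and the complete example for the authoritative API.

.. code-block:: python

    # NOTE: class/method names follow the NNI examples and may differ between versions
    import torch
    from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
    from nni.compression.pytorch.quantization_speedup import ModelSpeedupTensorRT

    model = torch.nn.Sequential(torch.nn.Conv2d(1, 8, 3), torch.nn.ReLU(),
                                torch.nn.Flatten(), torch.nn.Linear(8 * 26 * 26, 10))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    config_list = [{'quant_types': ['weight', 'output'],
                    'quant_bits': {'weight': 8, 'output': 8},
                    'op_types': ['Conv2d', 'Linear']}]

    # quantization aware training with the QAT algorithm
    quantizer = QAT_Quantizer(model, config_list, optimizer)
    quantizer.compress()
    # ... run the normal training loop here to learn the calibration parameters ...
    calibration_config = quantizer.export_model('model.pth', 'calibration.pth')

    # build a real mixed precision TensorRT engine from the calibration config
    engine = ModelSpeedupTensorRT(model, input_shape=(32, 1, 28, 28),
                                  config=calibration_config, batchsize=32)
    engine.compress()
    output, latency = engine.inference(torch.randn(32, 1, 28, 28))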
Note
* We recommend using "cpu" (host) as the data device (for both inference data and calibration data), since the data should be on the host initially and will be transferred to the device before inference. If the data device is not "cpu" (host), this tool will transfer the data to "cpu", which may introduce unnecessary overhead.
* Users can also do post-training quantization by leveraging TensorRT directly (a calibration dataset needs to be provided).
* Not all op types are supported right now. At present, NNI supports Conv, Linear, Relu and MaxPool. More op types will be supported in future releases.
Prerequisite
------------
CUDA version >= 11.0
TensorRT version >= 7.2
Note
* If you haven't installed TensorRT before or are using an old version, please refer to the `TensorRT Installation Guide <https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html>`__\ .
# Note that NNI also supports post-training quantization directly, please refer to complete examples for detail.
#
# For complete examples please refer to :githublink:`the code <examples/model_compress/quantization/mixed_precision_speedup_mnist.py>`.
#
# For more parameters about the class 'TensorRTModelSpeedUp', you can refer to `Model Compression API Reference <https://nni.readthedocs.io/en/stable/Compression/CompressionReference.html#quantization-speedup>`__\.
@@ -160,9 +160,13 @@ class AMCTaskGenerator(TaskGenerator):
class AMCPruner(IterativePruner):
"""
A pytorch implementation of AMC: AutoML for Model Compression and Acceleration on Mobile Devices.
(https://arxiv.org/pdf/1802.03494.pdf)
r"""
AMC pruner leverages reinforcement learning to provide the model compression policy.
According to the author, this learning-based compression policy outperforms conventional rule-based compression policy by having a higher compression ratio,
better preserving the accuracy and freeing human labor.
For more details, please refer to `AMC: AutoML for Model Compression and Acceleration on Mobile Devices <https://arxiv.org/pdf/1802.03494.pdf>`__.
It is suggested to set the same value for all `total_sparsity` entries in `config_list`.
AMC pruner will treat the first sparsity in `config_list` as the global sparsity.
...
...
@@ -216,6 +220,18 @@ class AMCPruner(IterativePruner):
target : str
'flops' or 'params'. Note that the sparsity in other pruners always refers to parameter sparsity, but in AMC you can choose FLOPs sparsity.
This parameter is used to explain what the sparsity setting in config_list refers to.
Examples
--------
>>> from nni.algorithms.compression.v2.pytorch.pruning import AMCPruner
@@ -51,7 +51,16 @@ class AutoCompressTaskGenerator(LotteryTicketTaskGenerator):
class AutoCompressPruner(IterativePruner):
"""
r"""
For a total iteration number :math:`N`, AutoCompressPruner prunes the model that survives the previous iteration by a fixed sparsity ratio (e.g., :math:`1-{(1-0.8)}^{(1/N)}`) in each iteration to achieve the overall sparsity (e.g., :math:`0.8`):
.. code-block:: bash
1. Generate a sparsity distribution using SimulatedAnnealingPruner
2. Perform ADMM-based pruning to generate the pruning result for the next iteration.
For more details, please refer to `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates <https://arxiv.org/abs/1907.03141>`__.
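As a quick numeric illustration of the per-iteration ratio (arithmetic only, not part of the NNI API):

>>> total_sparsity, N = 0.8, 4
>>> per_iteration = 1 - (1 - total_sparsity) ** (1 / N)
>>> round(per_iteration, 4)
0.3313
>>> round(1 - (1 - per_iteration) ** N, 4)   # overall sparsity reached after N iterations
0.8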
Parameters
----------
model : Module
...
...
@@ -70,7 +79,7 @@ class AutoCompressPruner(IterativePruner):
The model will be trained or inferenced `training_epochs` epochs.
For detailed example please refer to :githublink:`examples/model_compress/pruning/v2/level_pruning_torch.py <examples/model_compress/pruning/v2/level_pruning_torch.py>`
For detailed example please refer to :githublink:`examples/model_compress/pruning/v2/norm_pruning_torch.py <examples/model_compress/pruning/v2/norm_pruning_torch.py>`
@@ -338,11 +374,18 @@ class L2NormPruner(NormPruner):
class FPGMPruner(BasicPruner):
"""
r"""
FPGM pruner prunes the blocks of the weight on the first dimension with the smallest geometric median.
FPGM chooses the weight blocks with the most replaceable contribution.
For more details, please refer to `Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration <https://arxiv.org/abs/1811.00250>`__.
FPGM pruner also supports dependency-aware mode.
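The criterion can be sketched as follows (an illustrative snippet, not the NNI implementation): flatten each filter, sum its L2 distances to all other filters in the same layer, and prune the filters with the smallest sums.

.. code-block:: python

    import torch

    def fpgm_scores(weight: torch.Tensor) -> torch.Tensor:
        # sum of L2 distances from each filter to all others; smaller means more replaceable
        flat = weight.flatten(1)                  # (out_channels, in_channels * k * k)
        return torch.cdist(flat, flat, p=2).sum(dim=1)

    conv_weight = torch.randn(16, 8, 3, 3)
    scores = fpgm_scores(conv_weight)
    prune_idx = scores.argsort()[:8]              # the 8 most redundant filters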
Parameters
----------
model : torch.nn.Module
Model to be pruned
Model to be pruned.
config_list : List[Dict]
Supported keys:
- sparsity : This is to specify the sparsity for each layer in this config to be compressed.
...
...
@@ -363,6 +406,16 @@ class FPGMPruner(BasicPruner):
dummy_input : Optional[torch.Tensor]
The dummy input to analyze the topology constraints. Note that, the dummy_input
should on the same device with the model.
Examples
--------
>>> model = ...
>>> from nni.algorithms.compression.v2.pytorch.pruning import FPGMPruner
For detailed example please refer to :githublink:`examples/model_compress/pruning/v2/fpgm_pruning_torch.py <examples/model_compress/pruning/v2/fpgm_pruning_torch.py>`
For detailed example please refer to :githublink:`examples/model_compress/pruning/v2/slim_pruning_torch.py <examples/model_compress/pruning/v2/slim_pruning_torch.py>`
The traced optimizer instance which the optimizer class is wrapped by nni.trace.
E.g. traced_optimizer = nni.trace(torch.nn.Adam)(model.parameters()).
E.g. ``traced_optimizer = nni.trace(torch.nn.Adam)(model.parameters())``.
criterion : Callable[[Tensor, Tensor], Tensor]
The criterion function used in trainer. Take model output and target value as input, and return the loss.
training_batches
...
...
@@ -627,6 +700,82 @@ class ActivationPruner(BasicPruner):
class ActivationAPoZRankPruner(ActivationPruner):
r"""
Activation APoZ rank pruner is a pruner which prunes on the first weight dimension,
with the smallest importance criterion ``APoZ`` calculated from the output activations of convolution layers to achieve a preset level of network sparsity.
The pruning criterion ``APoZ`` is explained in the paper `Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures <https://arxiv.org/abs/1607.03250>`__.
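A tiny illustration of the ``APoZ`` statistic itself, i.e. the fraction of zeros in an activation (arithmetic only, not the NNI implementation):

>>> import torch
>>> activation = torch.tensor([[0.0, 1.2, 0.0, 3.4]])
>>> (activation == 0).float().mean().item()   # average percentage of zeros
0.5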
For detailed example please refer to :githublink:`examples/model_compress/pruning/v2/activation_pruning_torch.py <examples/model_compress/pruning/v2/activation_pruning_torch.py>`
"""
    def _activation_trans(self, output: Tensor) -> Tensor:
        # return a matrix where the positions of zeros in `output` are one and all other positions are zero.
@@ -636,6 +785,80 @@ class ActivationAPoZRankPruner(ActivationPruner):
class ActivationMeanRankPruner(ActivationPruner):
r"""
Activation mean rank pruner is a pruner which prunes on the first weight dimension,
with the smallest importance criterion ``mean activation`` calculated from the output activations of convolution layers to achieve a preset level of network sparsity.
The pruning criterion ``mean activation`` is explained in section 2.2 of the paper `Pruning Convolutional Neural Networks for Resource Efficient Inference <https://arxiv.org/abs/1611.06440>`__.
Activation mean rank pruner also supports dependency-aware mode.
Parameters
----------
model : torch.nn.Module
Model to be pruned.
config_list : List[Dict]
Supported keys:
- sparsity : This is to specify the sparsity for each layer in this config to be compressed.
- sparsity_per_layer : Equals to sparsity.
- op_types : Conv2d and Linear are supported in ActivationPruner.
- op_names : Operation names to be pruned.
- op_partial_names: Operation partial names to be pruned, will be autocompleted by NNI.
- exclude : If set True, the layers selected by op_types and op_names will be excluded from pruning.
For detailed example please refer to :githublink:`examples/model_compress/pruning/v2/activation_pruning_torch.py <examples/model_compress/pruning/v2/activation_pruning_torch.py>`
"""
    def _activation_trans(self, output: Tensor) -> Tensor:
        # return the activation of `output` directly.
        return self._activation(output.detach())
...
...
@@ -645,11 +868,21 @@ class ActivationMeanRankPruner(ActivationPruner):
class TaylorFOWeightPruner(BasicPruner):
"""
r"""
Taylor FO weight pruner is a pruner which prunes on the first weight dimension,
based on estimated importance calculated from the first-order Taylor expansion on weights to achieve a preset level of network sparsity.
The estimated importance is defined in the paper `Importance Estimation for Neural Network Pruning <http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf>`__.
For detailed example please refer to :githublink:`examples/model_compress/pruning/v2/taylorfo_pruning_torch.py <examples/model_compress/pruning/v2/taylorfo_pruning_torch.py>`
@@ -772,13 +1020,17 @@ class TaylorFOWeightPruner(BasicPruner):
class ADMMPruner(BasicPruner):
"""
ADMM (Alternating Direction Method of Multipliers) Pruner is based on a mathematical optimization technique.
The metric used in this pruner is the absolute value of the weight.
In each iteration, the weights with small magnitudes will be set to zero.
Only in the final iteration will the mask be generated and applied to the model wrapper.
r"""
Alternating Direction Method of Multipliers (ADMM) is a mathematical optimization technique
that decomposes the original nonconvex problem into two subproblems that can be solved iteratively.
In the weight pruning problem, these two subproblems are solved via 1) a gradient descent algorithm and 2) Euclidean projection, respectively.
During the process of solving these two subproblems, the weights of the original model will be changed.
Then a fine-grained pruning will be applied to prune the model according to the config list given.
The original paper is available at https://arxiv.org/abs/1804.03294.
This solution framework applies both to non-structured and different variations of structured pruning schemes.
For more details, please refer to `A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers <https://arxiv.org/abs/1804.03294>`__.
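The two subproblems can be sketched as below. This is a schematic, self-contained toy example for a single linear layer with arbitrary hyper-parameters; it is not the NNI implementation.

.. code-block:: python

    import torch

    def project_sparsity(t, sparsity):
        # Euclidean projection onto the sparsity constraint: zero the smallest-magnitude entries
        k = int(t.numel() * sparsity)
        if k == 0:
            return t.clone()
        threshold = t.abs().flatten().kthvalue(k).values
        return torch.where(t.abs() > threshold, t, torch.zeros_like(t))

    torch.manual_seed(0)
    model = torch.nn.Linear(20, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(64, 20), torch.randint(0, 2, (64,))
    rho, sparsity = 1e-2, 0.8

    W = model.weight
    Z = project_sparsity(W.detach(), sparsity)   # auxiliary variable
    U = torch.zeros_like(W)                      # scaled dual variable

    for admm_round in range(5):
        for _ in range(20):
            # subproblem 1: gradient descent on the loss plus the augmented penalty
            loss = torch.nn.functional.cross_entropy(model(x), y) \
                   + rho / 2 * torch.norm(W - Z + U) ** 2
            optimizer.zero_grad(); loss.backward(); optimizer.step()
        Z = project_sparsity(W.detach() + U, sparsity)   # subproblem 2: Euclidean projection
        U = U + W.detach() - Z                           # dual variable update

    # final fine-grained pruning according to the target sparsity
    mask = (project_sparsity(W.detach(), sparsity) != 0).float()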
Parameters
----------
...
...
@@ -814,13 +1066,28 @@ class ADMMPruner(BasicPruner):
For detailed example please refer to :githublink:`examples/model_compress/pruning/v2/admm_pruning_torch.py <examples/model_compress/pruning/v2/admm_pruning_torch.py>`
@@ -70,7 +70,11 @@ class IterativePruner(PruningScheduler):
class LinearPruner(IterativePruner):
"""
r"""
Linear pruner is an iterative pruner; it increases sparsity evenly from zero during the iterations.
For example, if the final sparsity is set to 0.5 and the iteration number is 5, the sparsities used in the iterations are ``[0, 0.1, 0.2, 0.3, 0.4, 0.5]``.
Parameters
----------
model : Module
...
...
@@ -98,6 +102,17 @@ class LinearPruner(IterativePruner):
If evaluator is None, the best result refers to the latest result.
pruning_params : Dict
If the chosen pruning_algorithm has extra parameters, put them as a dict to pass in.
Examples
--------
>>> from nni.algorithms.compression.v2.pytorch.pruning import LinearPruner
For detailed example please refer to :githublink:`examples/model_compress/pruning/v2/iterative_pruning_torch.py <examples/model_compress/pruning/v2/iterative_pruning_torch.py>`
@@ -117,7 +132,14 @@ class LinearPruner(IterativePruner):
class AGPPruner(IterativePruner):
"""
r"""
This is an iterative pruner, in which the sparsity is increased from an initial sparsity value :math:`s_{i}` (usually 0) to a final sparsity value :math:`s_{f}` over a span of :math:`n` pruning iterations,
starting at training step :math:`t_{0}` and with pruning frequency :math:`\Delta t`:
:math:`s_{t}=s_{f}+\left(s_{i}-s_{f}\right)\left(1-\frac{t-t_{0}}{n \Delta t}\right)^{3} \text { for } t \in\left\{t_{0}, t_{0}+\Delta t, \ldots, t_{0} + n \Delta t\right\}`
For more details please refer to `To prune, or not to prune: exploring the efficacy of pruning for model compression <https://arxiv.org/abs/1710.01878>`__\.
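A quick illustration of this schedule with :math:`s_i=0`, :math:`s_f=0.5`, :math:`t_0=0`, :math:`n=5`, :math:`\Delta t=1` (arithmetic only, not part of the NNI API):

>>> def agp_sparsity(t, s_i=0.0, s_f=0.5, t_0=0, n=5, delta_t=1):
...     return s_f + (s_i - s_f) * (1 - (t - t_0) / (n * delta_t)) ** 3
>>> [round(agp_sparsity(t), 3) for t in range(6)]
[0.0, 0.244, 0.392, 0.468, 0.496, 0.5]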
Parameters
----------
model : Module
...
...
@@ -145,6 +167,17 @@ class AGPPruner(IterativePruner):
If evaluator is None, the best result refers to the latest result.
pruning_params : Dict
If the chosen pruning_algorithm has extra parameters, put them as a dict to pass in.
Examples
--------
>>> from nni.algorithms.compression.v2.pytorch.pruning import AGPPruner
For detailed example please refer to :githublink:`examples/model_compress/pruning/v2/iterative_pruning_torch.py <examples/model_compress/pruning/v2/iterative_pruning_torch.py>`
@@ -215,6 +278,19 @@ class LotteryTicketPruner(IterativePruner):
class SimulatedAnnealingPruner(IterativePruner):
"""
We implement a guided heuristic search method, the Simulated Annealing (SA) algorithm. As mentioned in the paper, this method enhances guided search with prior experience.
The enhanced SA technique is based on the observation that a DNN layer with a larger number of weights often tolerates a higher degree of model compression with less impact on overall accuracy.
* Randomly initialize a pruning rate distribution (sparsities).
* While current_temperature < stop_temperature:
#. Generate a perturbation to the current distribution
#. Perform a fast evaluation on the perturbed distribution
#. Accept the perturbation according to the performance and probability; if not accepted, return to step 1
For more details, please refer to `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates <https://arxiv.org/abs/1907.03141>`__.
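The annealing loop described above can be sketched roughly as follows (illustrative Python pseudocode; it is not the NNI implementation, and ``evaluate`` stands for a fast evaluation such as a short fine-tune plus validation):

.. code-block:: python

    import math
    import random

    def simulated_annealing(evaluate, init_sparsities, start_temp=100.0,
                            stop_temp=20.0, cool=0.9, perturb_scale=0.35):
        # `evaluate` scores a sparsity distribution; higher is better
        current, best = list(init_sparsities), list(init_sparsities)
        current_score = best_score = evaluate(current)
        temperature = start_temp
        while temperature > stop_temp:
            # 1. generate a perturbation of the current distribution
            candidate = [min(max(s + random.uniform(-perturb_scale, perturb_scale), 0.0), 0.95)
                         for s in current]
            candidate_score = evaluate(candidate)
            # 2./3. accept according to the performance and an annealed probability
            delta = candidate_score - current_score
            if delta > 0 or random.random() < math.exp(delta / temperature):
                current, current_score = candidate, candidate_score
                if current_score > best_score:
                    best, best_score = current, current_score
            temperature *= cool
        return best

    # toy usage: prefer a high mean sparsity with a small spread across layers
    best = simulated_annealing(lambda s: sum(s) / len(s) - (max(s) - min(s)), [0.5, 0.5, 0.5])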
Parameters
----------
model : Module
...
...
@@ -246,6 +322,19 @@ class SimulatedAnnealingPruner(IterativePruner):
If set True, speed up the model at the end of each iteration to make the pruned model compact.
dummy_input : Optional[torch.Tensor]
If `speed_up` is True, `dummy_input` is required for tracing the model in speed up.
Examples
--------
>>> from nni.algorithms.compression.v2.pytorch.pruning import SimulatedAnnealingPruner
For detailed example please refer to :githublink:`examples/model_compress/pruning/v2/simulated_anealing_pruning_torch.py <examples/model_compress/pruning/v2/simulated_anealing_pruning_torch.py>`