Refactor Compression Doc (#3371)

Unverified Commit a8f05579 authored Feb 25, 2021 by colorjam Committed by GitHub Feb 25, 2021
11 changed files
--- a/docs/en_US/Compression/CompressionReference.rst
+++ b/docs/en_US/Compression/CompressionReference.rst
-Python API Reference of Compression Utilities
+Model Compression API Reference
-=============================================
+===============================
 .. contents::
-Sensitivity Utilities
+Compressors
+-----------
+Compressor
+^^^^^^^^^^
+..  autoclass:: nni.compression.pytorch.compressor.Compressor
+    :members:
+..  autoclass:: nni.compression.pytorch.compressor.Pruner
+    :members:
+..  autoclass:: nni.compression.pytorch.compressor.Quantizer
+    :members:
+Module Wrapper
+^^^^^^^^^^^^^^
+..  autoclass:: nni.compression.pytorch.compressor.PrunerModuleWrapper
+    :members:
+..  autoclass:: nni.compression.pytorch.compressor.QuantizerModuleWrapper
+    :members:
+Weight Masker
+^^^^^^^^^^^^^
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.weight_masker.WeightMasker
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.structured_pruning.StructuredWeightMasker
+    :members:
+Pruners
+^^^^^^^
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.sensitivity_pruner.SensitivityPruner
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.one_shot.OneshotPruner
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.one_shot.LevelPruner
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.one_shot.SlimPruner
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.one_shot.L1FilterPruner
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.one_shot.L2FilterPruner
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.one_shot.FPGMPruner
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.one_shot.TaylorFOWeightFilterPruner
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.one_shot.ActivationAPoZRankFilterPruner
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.one_shot.ActivationMeanRankFilterPruner
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.lottery_ticket.LotteryTicketPruner
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.agp.AGPPruner
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.admm_pruner.ADMMPruner
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.auto_compress_pruner.AutoCompressPruner
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.net_adapt_pruner.NetAdaptPruner
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.simulated_annealing_pruner.SimulatedAnnealingPruner
+    :members:
+Quantizers
+^^^^^^^^^^
+..  autoclass:: nni.algorithms.compression.pytorch.quantization.quantizers.NaiveQuantizer
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.quantization.quantizers.QAT_Quantizer
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.quantization.quantizers.DoReFaQuantizer
+    :members:
+..  autoclass:: nni.algorithms.compression.pytorch.quantization.quantizers.BNNQuantizer
+    :members:
+Compression Utilities
 ---------------------
+Sensitivity Utilities
+^^^^^^^^^^^^^^^^^^^^^
 ..  autoclass:: nni.compression.pytorch.utils.sensitivity_analysis.SensitivityAnalysis
    :members:
 Topology Utilities
------------------
+^^^^^^^^^^^^^^^^^^
 ..  autoclass:: nni.compression.pytorch.utils.shape_dependency.ChannelDependency
    :members:
@@ -28,6 +133,6 @@ Topology Utilities
    :members:
 Model FLOPs/Parameters Counter
------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 ..  autofunction:: nni.compression.pytorch.utils.counter.count_flops_params
--- a/docs/en_US/Compression/Overview.rst
+++ b/docs/en_US/Compression/Overview.rst
@@ -87,11 +87,6 @@ Quantization algorithms compress the original network by reducing the number of
     - Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. `Reference Paper <https://arxiv.org/abs/1602.02830>`__
-Automatic Model Compression
---------------------------
-Given targeted compression ratio, it is pretty hard to obtain the best compressed ratio in a one shot manner. An automatic model compression algorithm usually need to explore the compression space by compressing different layers with different sparsities. NNI provides such algorithms to free users from specifying sparsity of each layer in a model. Moreover, users could leverage NNI's auto tuning power to automatically compress a model. Detailed document can be found `here <./AutoPruningUsingTuners.rst>`__.
 Model Speedup
 -------------
@@ -102,10 +97,11 @@ Compression Utilities
 Compression utilities include some useful tools for users to understand and analyze the model they want to compress. For example, users could check sensitivity of each layer to pruning. Users could easily calculate the FLOPs and parameter size of a model. Please refer to `here <./CompressionUtils.rst>`__ for a complete list of compression utilities.
-Customize Your Own Compression Algorithms
+Advanced Usage
-----------------------------------------
+--------------
+NNI model compression provides a simple interface for users to customize a new compression algorithm. The design philosophy of the interface is to let users focus on the compression logic while hiding framework-specific implementation details. Users can learn more about our compression framework and customize a new compression algorithm (pruning algorithm or quantization algorithm) based on our framework. Moreover, users can leverage NNI's auto-tuning power to automatically compress a model. Please refer to `here <./advanced.rst>`__ for more details.
-NNI model compression leaves simple interface for users to customize a new compression algorithm. The design philosophy of the interface is making users focus on the compression logic while hiding framework specific implementation details from users. The detailed tutorial for customizing a new compression algorithm (pruning algorithm or quantization algorithm) can be found `here <./Framework.rst>`__.
 Reference and Feedback
 ----------------------

--- a/docs/en_US/Compression/QuickStart.rst
+++ b/docs/en_US/Compression/QuickStart.rst
-Tutorial for Model Compression
+Quick Start
-==============================
+===========
-.. contents::
+..  toctree::
+    :hidden:
-In this tutorial, we use the `first section <#quick-start-to-compress-a-model>`__ to quickly go through the usage of model compression on NNI. Then use the `second section <#detailed-usage-guide>`__ to explain more details of the usage.
+    Tutorial <Tutorial>
-Quick Start to Compress a Model
-------------------------------
-NNI provides very simple APIs for compressing a model. The compression includes pruning algorithms and quantization algorithms. The usage of them are the same, thus, here we use `slim pruner <../Compression/Pruner.rst#slim-pruner>`__ as an example to show the usage.
+Model compression usually consists of three stages: 1) pre-training a model, 2) compressing the model, 3) fine-tuning the model. NNI mainly focuses on the second stage and provides very simple APIs for compressing a model. Follow this guide for a quick look at how easy it is to use NNI to compress a model.
-Write configuration
+Model Pruning
-^^^^^^^^^^^^^^^^^^^
+-------------
-Write a configuration to specify the layers that you want to prune. The following configuration means pruning all the ``BatchNorm2d``\ s to sparsity 0.7 while keeping other layers unpruned.
+Here we use `level pruner <../Compression/Pruner.rst#level-pruner>`__ as an example to show the usage of pruning in NNI.
-.. code-block:: python
+Step1. Write configuration
+^^^^^^^^^^^^^^^^^^^^^^^^^^
-   configure_list = [{
-       'sparsity': 0.7,
-       'op_types': ['BatchNorm2d'],
-   }]
-The specification of configuration can be found `here <#specification-of-config-list>`__. Note that different pruners may have their own defined fields in configuration, for exmaple ``start_epoch`` in AGP pruner. Please refer to each pruner's `usage <./Pruner.rst>`__ for details, and adjust the configuration accordingly.
-Choose a compression algorithm
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Choose a pruner to prune your model. First instantiate the chosen pruner with your model and configuration as arguments, then invoke ``compress()`` to compress your model.
-.. code-block:: python
-   pruner = SlimPruner(model, configure_list)
-   model = pruner.compress()
-Then, you can train your model using traditional training approach (e.g., SGD), pruning is applied transparently during the training. Some pruners prune once at the beginning, the following training can be seen as fine-tune. Some pruners prune your model iteratively, the masks are adjusted epoch by epoch during training.
-Export compression result
-^^^^^^^^^^^^^^^^^^^^^^^^^
-After training, you get accuracy of the pruned model. You can export model weights to a file, and the generated masks to a file as well. Exporting onnx model is also supported.
-.. code-block:: python
-   pruner.export_model(model_path='pruned_vgg19_cifar10.pth', mask_path='mask_vgg19_cifar10.pth')
-Please refer :githublink:`mnist example <examples/model_compress/pruning/naive_prune_torch.py>` for quick start.
-Speed up the model
-^^^^^^^^^^^^^^^^^^
-Masks do not provide real speedup of your model. The model should be speeded up based on the exported masks, thus, we provide an API to speed up your model as shown below. After invoking ``apply_compression_results`` on your model, your model becomes a smaller one with shorter inference latency.
+Write a configuration to specify the layers that you want to prune. The following configuration means pruning all the ``default``\ ops to sparsity 0.5 while keeping other layers unpruned.
 .. code-block:: python
-   from nni.compression.pytorch import apply_compression_results
+   config_list = [{
-   apply_compression_results(model, 'mask_vgg19_cifar10.pth')
+       'sparsity': 0.5,
+       'op_types': ['default'],
-Please refer to `here <ModelSpeedup.rst>`__ for detailed description.
+   }]
-Detailed Usage Guide
+The specification of configuration can be found `here <./Tutorial.rst#specify-the-configuration>`__. Note that different pruners may have their own defined fields in configuration, for example ``start_epoch`` in AGP pruner. Please refer to each pruner's `usage <./Pruner.rst>`__ for details, and adjust the configuration accordingly.
--------------------
-The example code for users to apply model compression on a user model can be found below:
+Step2. Choose a pruner and compress the model
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-PyTorch code
+First instantiate the chosen pruner with your model and configuration as arguments, then invoke ``compress()`` to compress your model. Note that some algorithms may check gradients when compressing, so we also define an optimizer and pass it to the pruner.
 .. code-block:: python
   from nni.algorithms.compression.pytorch.pruning import LevelPruner
-   config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
-   pruner = LevelPruner(model, config_list)
-   pruner.compress()
-You can use other compression algorithms in the package of ``nni.compression``. The algorithms are implemented in both PyTorch and TensorFlow (partial support on TensorFlow), under ``nni.compression.pytorch`` and ``nni.compression.tensorflow`` respectively. You can refer to `Pruner <./Pruner.rst>`__ and `Quantizer <./Quantizer.rst>`__ for detail description of supported algorithms. Also if you want to use knowledge distillation, you can refer to `KDExample <../TrialExample/KDExample.rst>`__
+   optimizer_finetune = torch.optim.SGD(model.parameters(), lr=0.01)
+   pruner = LevelPruner(model, config_list, optimizer_finetune)
-A compression algorithm is first instantiated with a ``config_list`` passed in. The specification of this ``config_list`` will be described later.
+   model = pruner.compress()
-The function call ``pruner.compress()`` modifies user defined model (in Tensorflow the model can be obtained with ``tf.get_default_graph()``\ , while in PyTorch the model is the defined model class), and the model is modified with masks inserted. Then when you run the model, the masks take effect. The masks can be adjusted at runtime by the algorithms.
-Note that, ``pruner.compress`` simply adds masks on model weights, it does not include fine tuning logic. If users want to fine tune the compressed model, they need to write the fine tune logic by themselves after ``pruner.compress``.
+Then you can train your model using a traditional training approach (e.g., SGD); pruning is applied transparently during training. Some pruners (e.g., L1FilterPruner, FPGMPruner) prune once at the beginning, so the subsequent training can be seen as fine-tuning. Other pruners (e.g., AGPPruner) prune your model iteratively, adjusting the masks epoch by epoch during training.
-Specification of ``config_list``
+Note that ``pruner.compress`` simply adds masks on the model weights; it does not include fine-tuning logic. Users who want to fine-tune the compressed model need to write the fine-tuning logic themselves after ``pruner.compress``.
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Users can specify the configuration (i.e., ``config_list``\ ) for a compression algorithm. For example,when compressing a model, users may want to specify the sparsity ratio, to specify different ratios for different types of operations, to exclude certain types of operations, or to compress only a certain types of operations. For users to express these kinds of requirements, we define a configuration specification. It can be seen as a python ``list`` object, where each element is a ``dict`` object. 
+For example:
-The ``dict``\ s in the ``list`` are applied one by one, that is, the configurations in latter ``dict`` will overwrite the configurations in former ones on the operations that are within the scope of both of them. 
+.. code-block:: python
-There are different keys in a ``dict``. Some of them are common keys supported by all the compression algorithms:
+   for epoch in range(1, args.epochs + 1):
+        pruner.update_epoch(epoch)
+        train(args, model, device, train_loader, optimizer_finetune, epoch)
+        test(model, device, test_loader)
+More APIs to control the fine-tuning can be found `here <./Tutorial.rst#apis-to-control-the-fine-tuning>`__. 
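To make "adds masks on model weights" concrete, here is a minimal sketch of what a sparsity of 0.5 could mean for one layer. It assumes that level pruning selects weights by magnitude, which the text above does not state; ``magnitude_mask`` is a hypothetical helper and not an NNI API.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the `sparsity` fraction of smallest-magnitude weights.

    Hypothetical helper for illustration only; magnitude-based selection is an assumption.
    """
    num_prune = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:num_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


weights = [0.9, -0.1, 0.4, -0.7, 0.05, 0.3]
mask = magnitude_mask(weights, sparsity=0.5)
# The masked layer keeps its original shape; pruned entries are simply zero.
masked_weights = [w * m for w, m in zip(weights, mask)]
print(mask)
```

The masked layer has the same number of entries as before, which is why masking alone does not make the model smaller or faster.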
-* **op_types**\ : This is to specify what types of operations to be compressed. 'default' means following the algorithm's default setting.
-* **op_names**\ : This is to specify by name what operations to be compressed. If this field is omitted, operations will not be filtered by it.
-* **exclude**\ : Default is False. If this field is True, it means the operations with specified types and names will be excluded from the compression.
-Some other keys are often specific to a certain algorithms, users can refer to `pruning algorithms <./Pruner.rst>`__ and `quantization algorithms <./Quantizer.rst>`__ for the keys allowed by each algorithm.
+Step3. Export compression result
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-A simple example of configuration is shown below:
+After training, you can export model weights to a file, and the generated masks to a file as well. Exporting onnx model is also supported.
 .. code-block:: python
-   [
+   pruner.export_model(model_path='pruned_vgg19_cifar10.pth', mask_path='mask_vgg19_cifar10.pth')
-       {
-           'sparsity': 0.8,
-           'op_types': ['default']
-       },
-       {
-           'sparsity': 0.6,
-           'op_names': ['op_name1', 'op_name2']
-       },
-       {
-           'exclude': True,
-           'op_names': ['op_name3']
-       }
-   ]
-It means following the algorithm's default setting for compressed operations with sparsity 0.8, but for ``op_name1`` and ``op_name2`` use sparsity 0.6, and do not compress ``op_name3``.
-Quantization specific keys
-^^^^^^^^^^^^^^^^^^^^^^^^^^
-Besides the keys explained above, if you use quantization algorithms you need to specify more keys in ``config_list``\ , which are explained below.
+Please refer to :githublink:`mnist example <examples/model_compress/pruning/naive_prune_torch.py>` for example code.
-* **quant_types** : list of string. 
+More examples of pruning algorithms can be found in :githublink:`basic_pruners_torch <examples/model_compress/pruning/basic_pruners_torch.py>` and :githublink:`auto_pruners_torch <examples/model_compress/pruning/auto_pruners_torch.py>`.
-Type of quantization you want to apply, currently support 'weight', 'input', 'output'. 'weight' means applying quantization operation
-to the weight parameter of modules. 'input' means applying quantization operation to the input of module forward method. 'output' means applying quantization operation to the output of module forward method, which is often called as 'activation' in some papers.
+Model Quantization
+------------------
-* **quant_bits** : int or dict of {str : int}
+Here we use `QAT Quantizer <../Compression/Quantizer.rst#qat-quantizer>`__ as an example to show the usage of quantization in NNI.
-bits length of quantization, key is the quantization type, value is the quantization bits length, eg. 
+Step1. Write configuration
+^^^^^^^^^^^^^^^^^^^^^^^^^^
-.. code-block:: bash
+.. code-block:: python
-   {
+   config_list = [{
-       quant_bits: {
+       'quant_types': ['weight'],
+       'quant_bits': {
           'weight': 8,
-           'output': 4,
+       },  # an `int` also works here because all `quant_types` share the same bit length; see the config for `ReLU6` below.
-           },
+       'op_types':['Conv2d', 'Linear']
-   }
+   }, {
+       'quant_types': ['output'],
-when the value is int type, all quantization types share same bits length. eg. 
+       'quant_bits': 8,
+       'quant_start_step': 7000,
-.. code-block:: bash
+       'op_types':['ReLU6']
+   }]
-   {
-       quant_bits: 8, # weight or output quantization are all 8 bits
-   }
-The following example shows a more complete ``config_list``\ , it uses ``op_names`` (or ``op_types``\ ) to specify the target layers along with the quantization bits for those layers.
-.. code-block:: bash
-   configure_list = [{
-           'quant_types': ['weight'],        
-           'quant_bits': 8, 
-           'op_names': ['conv1']
-       }, {
-           'quant_types': ['weight'],
-           'quant_bits': 4,
-           'quant_start_step': 0,
-           'op_names': ['conv2']
-       }, {
-           'quant_types': ['weight'],
-           'quant_bits': 3,
-           'op_names': ['fc1']
-           },
-          {
-           'quant_types': ['weight'],
-           'quant_bits': 2,
-           'op_names': ['fc2']
-           }
-   ]
-In this example, 'op_names' is the name of layer and four layers will be quantized to different quant_bits.
-APIs for Updating Fine Tuning Status
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Some compression algorithms use epochs to control the progress of compression (e.g. `AGP <../Compression/Pruner.rst#agp-pruner>`__\ ), and some algorithms need to do something after every minibatch. Therefore, we provide another two APIs for users to invoke: ``pruner.update_epoch(epoch)`` and ``pruner.step()``.
-``update_epoch`` should be invoked in every epoch, while ``step`` should be invoked after each minibatch. Note that most algorithms do not require calling the two APIs. Please refer to each algorithm's document for details. For the algorithms that do not need them, calling them is allowed but has no effect.
-Export Pruned Model
+The specification of configuration can be found `here <./Tutorial.rst#quantization-specific-keys>`__.
-^^^^^^^^^^^^^^^^^^^^
-You can easily export the pruned model using the following API if you are pruning your model, ``state_dict`` of the sparse model weights will be stored in ``model.pth``\ , which can be loaded by ``torch.load('model.pth')``. In this exported ``model.pth``\ , the masked weights are zero.
+Step2. Choose a quantizer and compress the model
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-.. code-block:: bash
+.. code-block:: python
-   pruner.export_model(model_path='model.pth')
+   from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
-``mask_dict`` and pruned model in ``onnx`` format(\ ``input_shape`` need to be specified) can also be exported like this:
+   quantizer = QAT_Quantizer(model, config_list)
+   quantizer.compress()
-.. code-block:: python
-   pruner.export_model(model_path='model.pth', mask_path='mask.pth', onnx_path='model.onnx', input_shape=[1, 1, 28, 28])
+Step3. Export compression result
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Export Quantized Model
+You can export the quantized model directly by using ``torch.save`` api and the quantized model can be loaded by ``torch.load`` without any extra modification.
-^^^^^^^^^^^^^^^^^^^^^^
-You can export the quantized model directly by using ``torch.save`` api and the quantized model can be loaded by ``torch.load`` without any extra modification. The following example shows the normal procedure of saving, loading quantized model and get related parameters in QAT.
 .. code-block:: python
-   # Init model and quantize it by using NNI QAT
-   model = Mnist()
-   configure_list = [...]
-   optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
-   quantizer = QAT_Quantizer(model, configure_list, optimizer)
-   quantizer.compress()
-   model.to(device)
-   # Quantize aware training
-   for epoch in range(40):
-        print('# Epoch {} #'.format(epoch))
-        train(model, quantizer, device, train_loader, optimizer)
   # Save quantized model which is generated by using NNI QAT algorithm
-   torch.save(model.state_dict(), "quantized_model.pkt")
+   torch.save(model.state_dict(), "quantized_model.pth")
-   # Simulate model loading procedure
-   # Have to init new model and compress it before loading
-   qmodel_load = Mnist()
-   optimizer = torch.optim.SGD(qmodel_load.parameters(), lr=0.01, momentum=0.5)
-   quantizer = QAT_Quantizer(qmodel_load, configure_list, optimizer)
-   quantizer.compress()
-   # Load quantized model
-   qmodel_load.load_state_dict(torch.load("quantized_model.pkt"))
-   # Get scale, zero_point and weight of conv1 in loaded model
+Please refer to :githublink:`mnist example <examples/model_compress/quantization/QAT_torch_quantizer.py>` for example code.
-   conv1 = qmodel_load.conv1
-   scale = conv1.module.scale
-   zero_point = conv1.module.zero_point
-   weight = conv1.module.weight
-If you want to really speed up the compressed model, please refer to `NNI model speedup <./ModelSpeedup.rst>`__ for details.
+Congratulations! You've compressed your first model via NNI. To go a bit more in depth about model compression in NNI, check out the `Tutorial <./Tutorial.rst>`__.
\ No newline at end of file
--- a/docs/en_US/Compression/Tutorial.rst
+++ b/docs/en_US/Compression/Tutorial.rst
+Tutorial
+========
+.. contents::
+In this tutorial, we explain the usage of model compression in NNI in more detail.
+Setup compression goal
+----------------------
+Specify the configuration
+^^^^^^^^^^^^^^^^^^^^^^^^^
+Users can specify the configuration (i.e., ``config_list``\ ) for a compression algorithm. For example, when compressing a model, users may want to specify the sparsity ratio, to specify different ratios for different types of operations, to exclude certain types of operations, or to compress only certain types of operations. To let users express these kinds of requirements, we define a configuration specification. It can be seen as a Python ``list`` object, where each element is a ``dict`` object.
+The ``dict``\ s in the ``list`` are applied one by one; that is, the configuration in a later ``dict`` overwrites the configuration in earlier ones for the operations that are within the scope of both.
+There are different keys in a ``dict``. Some of them are common keys supported by all the compression algorithms:
+* **op_types**\ : This is to specify what types of operations to be compressed. 'default' means following the algorithm's default setting.
+* **op_names**\ : This is to specify by name what operations to be compressed. If this field is omitted, operations will not be filtered by it.
+* **exclude**\ : Default is False. If this field is True, it means the operations with specified types and names will be excluded from the compression.
+Some other keys are specific to a certain algorithm; users can refer to `pruning algorithms <./Pruner.rst>`__ and `quantization algorithms <./Quantizer.rst>`__ for the keys allowed by each algorithm.
+A simple example of configuration is shown below:
+.. code-block:: python
+   [
+       {
+           'sparsity': 0.8,
+           'op_types': ['default']
+       },
+       {
+           'sparsity': 0.6,
+           'op_names': ['op_name1', 'op_name2']
+       },
+       {
+           'exclude': True,
+           'op_names': ['op_name3']
+       }
+   ]
+It means following the algorithm's default setting for compressed operations with sparsity 0.8, but for ``op_name1`` and ``op_name2`` use sparsity 0.6, and do not compress ``op_name3``.
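The override rule described above can be sketched in plain Python. The helper ``resolve_config`` is hypothetical and not part of NNI, and the sketch assumes that ``'default'`` in ``op_types`` matches every operation type, which simplifies the real per-algorithm default set.

```python
def resolve_config(config_list, op_name, op_type):
    """Return the effective config for one operation, or None if it is not compressed.

    Hypothetical helper, not an NNI API. Assumes 'default' in op_types matches
    every operation type, which simplifies the real per-algorithm default set.
    """
    effective = None
    for config in config_list:
        types = config.get('op_types')
        names = config.get('op_names')
        type_ok = types is None or 'default' in types or op_type in types
        name_ok = names is None or op_name in names
        if not (type_ok and name_ok):
            continue  # this dict does not cover the operation
        # A later matching dict replaces the result of earlier ones.
        effective = None if config.get('exclude', False) else config
    return effective


config_list = [
    {'sparsity': 0.8, 'op_types': ['default']},
    {'sparsity': 0.6, 'op_names': ['op_name1', 'op_name2']},
    {'exclude': True, 'op_names': ['op_name3']},
]

for name in ['conv0', 'op_name1', 'op_name3']:
    config = resolve_config(config_list, name, 'Conv2d')
    print(name, None if config is None else config['sparsity'])
```

Under these assumptions, an operation not named explicitly gets sparsity 0.8, ``op_name1`` gets 0.6, and ``op_name3`` is left uncompressed, matching the description of the example.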
+Quantization specific keys
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+Besides the keys explained above, if you use quantization algorithms you need to specify more keys in ``config_list``\ , which are explained below.
+* **quant_types** : list of string. 
+Type of quantization you want to apply; currently 'weight', 'input', and 'output' are supported. 'weight' means applying the quantization operation to the weight parameter of modules. 'input' means applying the quantization operation to the input of the module's forward method. 'output' means applying the quantization operation to the output of the module's forward method, which is often called 'activation' in some papers.
+* **quant_bits** : int or dict of {str : int}
+The bit length of quantization. When the value is a dict, each key is a quantization type and each value is the bit length for that type, e.g.
+.. code-block:: python
+   {
+       'quant_bits': {
+           'weight': 8,
+           'output': 4,
+       },
+   }
+When the value is an int, all quantization types share the same bit length, e.g.
+.. code-block:: python
+   {
+       'quant_bits': 8,  # weight and output quantization both use 8 bits
+   }
+The following example shows a more complete ``config_list``\ . It uses ``op_names`` (or ``op_types``\ ) to specify the target layers along with the quantization bits for those layers.
+.. code-block:: python
+   config_list = [{
+       'quant_types': ['weight'],
+       'quant_bits': 8,
+       'op_names': ['conv1']
+   }, {
+       'quant_types': ['weight'],
+       'quant_bits': 4,
+       'quant_start_step': 0,
+       'op_names': ['conv2']
+   }, {
+       'quant_types': ['weight'],
+       'quant_bits': 3,
+       'op_names': ['fc1']
+   }, {
+       'quant_types': ['weight'],
+       'quant_bits': 2,
+       'op_names': ['fc2']
+   }]
+In this example, ``op_names`` gives the names of the layers, and the four layers are quantized to different bit lengths.
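Since ``quant_bits`` may be either an int or a dict, code that reads a config entry has to handle both forms. A minimal sketch of that normalization follows; ``normalize_quant_bits`` is a hypothetical helper, not an NNI function.

```python
def normalize_quant_bits(config):
    """Return a {quant_type: bits} dict for one config entry.

    Hypothetical helper for illustration; not part of NNI.
    """
    quant_bits = config['quant_bits']
    if isinstance(quant_bits, int):
        # An int applies the same bit length to every listed quantization type.
        return {quant_type: quant_bits for quant_type in config['quant_types']}
    # A dict gives one bit length per quantization type.
    return {quant_type: quant_bits[quant_type] for quant_type in config['quant_types']}


shared = normalize_quant_bits({'quant_types': ['weight', 'output'], 'quant_bits': 8})
per_type = normalize_quant_bits({
    'quant_types': ['weight', 'output'],
    'quant_bits': {'weight': 8, 'output': 4},
})
print(shared, per_type)
```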
+Export compression result
+-------------------------
+Export the pruned model
+^^^^^^^^^^^^^^^^^^^^^^^
+If you are pruning your model, you can easily export the pruned model using the following API. The ``state_dict`` of the sparse model weights will be stored in ``model.pth``\ , which can be loaded by ``torch.load('model.pth')``. Note that the exported ``model.pth`` has the same parameters as the original model, except that the masked weights are zero. ``mask_dict`` stores the binary values produced by the pruning algorithm, which can be further used to speed up the model.
+.. code-block:: python
+   # export model weights and mask
+   pruner.export_model(model_path='model.pth', mask_path='mask.pth')
+   # apply mask to model
+   from nni.compression.pytorch import apply_compression_results
+   apply_compression_results(model, mask_file, device)
+Export the model in ``onnx`` format (``input_shape`` needs to be specified):
+.. code-block:: python
+   pruner.export_model(model_path='model.pth', mask_path='mask.pth', onnx_path='model.onnx', input_shape=[1, 1, 28, 28])
+Export the quantized model
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+You can export the quantized model directly by using the ``torch.save`` API, and the quantized model can be loaded by ``torch.load`` without any extra modification. The following example shows the normal procedure for saving and loading a quantized model and getting the related parameters in QAT.
+.. code-block:: python
+   # Save quantized model which is generated by using NNI QAT algorithm
+   torch.save(model.state_dict(), "quantized_model.pth")
+   # Simulate model loading procedure
+   # Have to init new model and compress it before loading
+   qmodel_load = Mnist()
+   optimizer = torch.optim.SGD(qmodel_load.parameters(), lr=0.01, momentum=0.5)
+   quantizer = QAT_Quantizer(qmodel_load, config_list, optimizer)
+   quantizer.compress()
+   # Load quantized model
+   qmodel_load.load_state_dict(torch.load("quantized_model.pth"))
+   # Get scale, zero_point and weight of conv1 in loaded model
+   conv1 = qmodel_load.conv1
+   scale = conv1.module.scale
+   zero_point = conv1.module.zero_point
+   weight = conv1.module.weight
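For readers unfamiliar with ``scale`` and ``zero_point``, the sketch below shows how such parameters are typically used in a standard affine quantization scheme. This is an assumption for illustration: the diff does not state the exact formula that QAT_Quantizer applies, and ``quantize`` and ``dequantize`` are hypothetical helpers.

```python
def quantize(x, scale, zero_point, bits=8):
    """Map a real value to an unsigned integer level (standard affine scheme, assumed)."""
    qmin, qmax = 0, 2 ** bits - 1
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp to the representable range


def dequantize(q, scale, zero_point):
    """Map an integer level back to an approximate real value."""
    return (q - zero_point) * scale


q = quantize(1.0, scale=0.25, zero_point=128)
x_restored = dequantize(q, scale=0.25, zero_point=128)
q_clamped = quantize(100.0, scale=0.25, zero_point=128)
print(q, x_restored, q_clamped)
```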
+Speed up the model
+------------------
+Masks do not provide a real speedup of your model. The model must be sped up based on the exported masks, so we provide an API for this, as shown below. After invoking ``speedup_model()`` on the ``ModelSpeedup`` object, your model becomes a smaller one with shorter inference latency.
+.. code-block:: python
+   from nni.compression.pytorch import apply_compression_results, ModelSpeedup
+   dummy_input = torch.randn(config['input_shape']).to(device)
+   m_speedup = ModelSpeedup(model, dummy_input, masks_file, device)
+   m_speedup.speedup_model()
+Please refer to `here <ModelSpeedup.rst>`__ for a detailed description. The example code for model speedup can be found :githublink:`here <examples/model_compress/pruning/model_speedup.py>`.
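The difference between masking and a real speedup can be illustrated with a toy layer. This is a conceptual sketch only: ``compact_filters`` is a hypothetical helper, and an actual speedup presumably also has to adjust the layers that consume the removed filters, which is ignored here.

```python
def compact_filters(filters, filter_mask):
    """Drop filters whose mask entry is 0, so the layer really gets smaller.

    Hypothetical helper for illustration; not how NNI implements speedup.
    """
    return [f for f, keep in zip(filters, filter_mask) if keep]


# A masked layer: two of the four filters are all zeros but still stored and computed.
filters = [[0.5, -0.2], [0.0, 0.0], [0.1, 0.3], [0.0, 0.0]]
filter_mask = [1, 0, 1, 0]

compact = compact_filters(filters, filter_mask)
print(len(filters), '->', len(compact))
```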
+Control the Fine-tuning process
+-------------------------------
+APIs to control the fine-tuning
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Some compression algorithms control the progress of compression during fine-tuning (e.g. `AGP <../Compression/Pruner.rst#agp-pruner>`__\ ), and some algorithms need to do something after every minibatch. Therefore, we provide another two APIs for users to invoke: ``pruner.update_epoch(epoch)`` and ``pruner.step()``.
+``update_epoch`` should be invoked in every epoch, while ``step`` should be invoked after each minibatch. Note that most algorithms do not require calling the two APIs. Please refer to each algorithm's document for details. For the algorithms that do not need them, calling them is allowed but has no effect.
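The placement of the two calls in a training loop can be sketched with a stand-in object. ``RecordingPruner`` and ``train_one_epoch`` are hypothetical and only record or mark where the hooks are invoked; a real pruner would adjust its masks there.

```python
class RecordingPruner:
    """Hypothetical stand-in that only records when its hooks are called."""

    def __init__(self):
        self.calls = []

    def update_epoch(self, epoch):
        self.calls.append(('update_epoch', epoch))

    def step(self):
        self.calls.append(('step',))


def train_one_epoch(pruner, batches):
    for _batch in batches:
        # The forward pass, loss.backward() and optimizer.step() would go here.
        pruner.step()  # once after each minibatch


pruner = RecordingPruner()
for epoch in range(2):
    pruner.update_epoch(epoch)  # once per epoch
    train_one_epoch(pruner, batches=[0, 1, 2])
print(pruner.calls)
```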
+Enhance the fine-tuning process
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Knowledge distillation effectively learns a small student model from a large teacher model. Users can enhance the fine-tuning process by utilizing knowledge distillation to improve the performance of the compressed model. Example code can be found :githublink:`here <examples/model_compress/pruning/finetune_kd_torch.py>`.
--- a/docs/en_US/Compression/advanced.rst
+++ b/docs/en_US/Compression/advanced.rst
+Advanced Usage
+==============
+..  toctree::
+    :maxdepth: 2
+    Framework <./Framework>
+    Customize a new algorithm <./CustomizeCompressor>
+    Automatic Model Compression <./AutoPruningUsingTuners>
--- a/docs/en_US/TrialExample/KDExample.rst
+++ b/docs/en_US/TrialExample/KDExample.rst
@@ -35,11 +35,11 @@ PyTorch code
         loss.backward()
-The complete code for fine-tuning the pruend model can be found :githublink:`here <examples/model_compress/pruning/finetune_kd_torch.py>`
+The complete code for fine-tuning the pruned model can be found :githublink:`here <examples/model_compress/pruning/finetune_kd_torch.py>`
 .. code-block:: python
-      python finetune_kd_torch.py --model [model name] --teacher-model-dir [pretrained checkpoint path]  --student-model-dir [pruend checkpoint path] --mask-path [mask file path]
+      python finetune_kd_torch.py --model [model name] --teacher-model-dir [pretrained checkpoint path]  --student-model-dir [pruned checkpoint path] --mask-path [mask file path]
 Note that: for fine-tuning a pruned model, run :githublink:`basic_pruners_torch.py <examples/model_compress/pruning/basic_pruners_torch.py>` first to get the mask file, then pass the mask path as argument to the script.

--- a/docs/en_US/model_compression.rst
+++ b/docs/en_US/model_compression.rst
@@ -28,5 +28,5 @@ For details, please refer to the following tutorials:
    Pruning <Compression/pruning>
    Quantization <Compression/quantization>
    Utilities <Compression/CompressionUtils>
-    Framework <Compression/Framework>
+    Advanced Usage <Compression/advanced>
-    Customize Model Compression Algorithms <Compression/CustomizeCompressor>
+    API Reference <Compression/CompressionReference>
--- a/docs/en_US/sdk_reference.rst
+++ b/docs/en_US/sdk_reference.rst
@@ -8,4 +8,4 @@ Python API Reference
    Auto Tune <autotune_ref>
    NAS <NAS/NasReference>
-    Compression Utilities <Compression/CompressionReference>
+    Compression <Compression/CompressionReference>
\ No newline at end of file
--- a/nni/algorithms/compression/pytorch/pruning/sensitivity_pruner.py
+++ b/nni/algorithms/compression/pytorch/pruning/sensitivity_pruner.py
@@ -299,7 +299,7 @@ class SensitivityPruner(Pruner):
        eval_args: list
        eval_kwargs: list& dict
            Parameters for the val_funtion, the val_function will be called like
-            evaluator(*eval_args, **eval_kwargs)
+            evaluator(\*eval_args, \*\*eval_kwargs)
        finetune_args: list
        finetune_kwargs: dict
            Parameters for the finetuner function if needed.

--- a/nni/algorithms/compression/pytorch/pruning/structured_pruning.py
+++ b/nni/algorithms/compression/pytorch/pruning/structured_pruning.py
@@ -42,6 +42,7 @@ class StructuredWeightMasker(WeightMasker):
    def calc_mask(self, sparsity, wrapper, wrapper_idx=None, **depen_kwargs):
        """
        calculate the mask for `wrapper`.
        Parameters
        ----------
        sparsity: float/list of float
@@ -292,6 +293,7 @@ class StructuredWeightMasker(WeightMasker):
    def get_mask(self, base_mask, weight, num_prune, wrapper, wrapper_idx, channel_masks=None):
        """
        Calculate the mask of given layer.
        Parameters
        ----------
        base_mask: dict
@@ -309,6 +311,7 @@ class StructuredWeightMasker(WeightMasker):
            mode, before calculating the masks for each layer, we will calculate a common
            mask for all the layers in the dependency set. For the pruners that doesnot
            support dependency-aware mode, they can just ignore this parameter.
        Returns
        -------
        dict

--- a/nni/compression/pytorch/compressor.py
+++ b/nni/compression/pytorch/compressor.py
@@ -422,8 +422,8 @@ class Pruner(Compressor):
        """
        Load the state dict saved from unwrapped model.
-        Parameters:
+        Parameters
-        -----------
+        ----------
        model_state : dict
            state dict saved from unwrapped model
        """