# Customize A New Compression Algorithm
```eval_rst
.. contents::
```
To simplify writing a new compression algorithm, we have designed simple yet flexible programming interfaces that cover both pruning and quantization. Below, we first demonstrate how to customize a new pruning algorithm and then how to customize a new quantization algorithm.
## Customize a new pruning algorithm
To better demonstrate how to customize a new pruning algorithm, it is necessary for users to first understand the framework for supporting various pruning algorithms in NNI.
### Framework overview for pruning algorithms
The following example shows how to use a pruner:
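A minimal sketch is shown below, using `LevelPruner` as a stand-in; any pruner follows the same pattern, and `model` and `optimizer` are the user's own PyTorch objects:

```python
from nni.compression.torch import LevelPruner

# prune 80% of the weights in all default-supported layer types
config_list = [{'sparsity': 0.8, 'op_types': ['default']}]
pruner = LevelPruner(model, config_list, optimizer)
model = pruner.compress()
# continue training as usual; the masks are applied transparently
```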
A pruner receives `model`, `config_list` and `optimizer` as arguments. It prunes ...
From the implementation perspective, a pruner consists of a `weight masker` instance and multiple `module wrapper` instances.

#### Weight masker

A `weight masker` is the implementation of a pruning algorithm; it can prune a specified layer, wrapped by a `module wrapper`, with a specified sparsity.

#### Module wrapper

A `module wrapper` is a module containing:
the reasons to use `module wrapper`:
1. some buffers are needed by `calc_mask` to calculate masks, and these buffers should be registered in `module wrapper` so that the original modules are not contaminated.
2. a new `forward` method is needed to apply masks to the weight before calling the real `forward` method (a sketch of such a wrapper is shown below).
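For illustration, a simplified wrapper might look like the sketch below; the class name, constructor arguments, and buffer names here are illustrative rather than NNI's actual implementation:

```python
import torch
import torch.nn as nn

class MyModuleWrapper(nn.Module):
    def __init__(self, module, module_name, config, pruner):
        super().__init__()
        self.module = module      # the original module, e.g. a Conv2d
        self.name = module_name
        self.config = config
        self.pruner = pruner
        # buffers needed by calc_mask are registered on the wrapper,
        # so the original module is not contaminated
        self.register_buffer('weight_mask', torch.ones_like(module.weight))

    def forward(self, *inputs):
        # apply the mask to the weight in place before calling the real forward
        self.module.weight.data.mul_(self.weight_mask)
        return self.module(*inputs)
```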
#### Pruner

A `pruner` is responsible for:

3. Use `weight masker` to calculate masks of layers while pruning.
4. Export pruned model weights and masks.
### Implement a new pruning algorithm

Implementing a new pruning algorithm requires implementing a `weight masker` class, which should be a subclass of `WeightMasker`, and a `pruner` class, which should be a subclass of `Pruner`.

An implementation of `weight masker` may look like this:
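A minimal sketch is shown below; the `calc_mask(sparsity, wrapper, wrapper_idx)` signature and the `'weight_mask'` return key follow NNI's built-in maskers, and the magnitude-based ranking is only an illustration:

```python
import torch

class MyMasker(WeightMasker):
    def __init__(self, model, pruner):
        self.model = model
        self.pruner = pruner

    def calc_mask(self, sparsity, wrapper, wrapper_idx=None):
        # rank weights by magnitude and zero out the smallest `sparsity` fraction
        weight = wrapper.module.weight.data
        k = max(int(weight.numel() * sparsity), 1)
        threshold = torch.topk(weight.abs().view(-1), k, largest=False)[0].max()
        mask = torch.gt(weight.abs(), threshold).type_as(weight)
        return {'weight_mask': mask}
```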
You can refer to NNI's provided [weight masker](https://github.com/microsoft/nni/blob/master/src/sdk/pynni/nni/compression/torch/pruning/structured_pruning.py) implementations to implement your own weight masker.

A basic `pruner` looks like this:
```python
class MyPruner(Pruner):
    ...
```
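Filling in the body, a minimal one-shot pruner might look like the following; the `set_wrappers_attribute` helper, the `if_calculated` flag, and the exact `calc_mask(wrapper, wrapper_idx)` signature follow NNI's built-in pruners and should be treated as assumptions here:

```python
class MyPruner(Pruner):
    def __init__(self, model, config_list, optimizer=None):
        super().__init__(model, config_list, optimizer)
        # per-wrapper flag so this one-shot pruner computes each mask only once
        self.set_wrappers_attribute("if_calculated", False)
        self.masker = MyMasker(model, self)

    def calc_mask(self, wrapper, wrapper_idx=None):
        sparsity = wrapper.config['sparsity']
        if wrapper.if_calculated:
            return None
        wrapper.if_calculated = True
        # delegate the actual mask computation to the weight masker
        return self.masker.calc_mask(sparsity=sparsity, wrapper=wrapper,
                                     wrapper_idx=wrapper_idx)
```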
During multi-GPU training, buffers and parameters are copied to multiple GPUs every time the `forward` method runs on multiple GPUs. If buffers and parameters are updated in the `forward` method, an in-place update is needed to ensure the update is effective.
Since `calc_mask` is called in the `optimizer.step` method, which happens after the `forward` method and happens only on one GPU, it supports multi-GPU naturally.
***
## Customize a new quantization algorithm
To write a new quantization algorithm, you can write a class that inherits `nni.compression.torch.Quantizer` and override its member functions with the logic of your algorithm. The member functions to override are `quantize_weight`, `quantize_output`, and `quantize_input`. `quantize_weight` directly returns the quantized weights rather than a mask, because for quantization the quantized weights cannot be obtained by applying a mask.
```python
from nni.compression.torch import Quantizer

class YourQuantizer(Quantizer):
    def __init__(self, model, config_list):
        """
        It is suggested to use the NNI-defined spec for `config_list`.
        """
        super().__init__(model, config_list)

    def quantize_weight(self, weight, config, **kwargs):
        """
        Quantizers should overload this method to quantize weight tensors.
        This method is effectively hooked to :meth:`forward` of the model.

        Parameters
        ----------
        weight : Tensor
            weight that needs to be quantized
        config : dict
            the configuration for weight quantization
        """
        # Put your code to generate `new_weight` here
        return new_weight

    def quantize_output(self, output, config, **kwargs):
        """
        Quantizers should overload this method to quantize output.
        This method is effectively hooked to :meth:`forward` of the model.

        Parameters
        ----------
        output : Tensor
            output that needs to be quantized
        config : dict
            the configuration for output quantization
        """
        # Put your code to generate `new_output` here
        return new_output

    def quantize_input(self, *inputs, config, **kwargs):
        """
        Quantizers should overload this method to quantize input.
        This method is effectively hooked to :meth:`forward` of the model.

        Parameters
        ----------
        inputs : Tensor
            inputs that need to be quantized
        config : dict
            the configuration for inputs quantization
        """
        # Put your code to generate `new_input` here
        return new_input

    def update_epoch(self, epoch_num):
        pass

    def step(self):
        """
        Can do some processing based on the model or weights bound
        in the function `bind_model`
        """
        pass
```
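Once defined, a custom quantizer is used the same way as NNI's built-in ones. A hypothetical usage sketch, where `model` is the user's PyTorch model and the config keys follow the `config_list` specification:

```python
config_list = [{
    'quant_types': ['weight', 'output'],
    'quant_bits': {'weight': 8, 'output': 8},
    'op_types': ['Conv2d', 'Linear']
}]
quantizer = YourQuantizer(model, config_list)
quantizer.compress()
# train or fine-tune as usual; quantization is applied in the hooked forward
```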
### Customize backward function
Sometimes it is necessary for a quantization operation to have a customized backward function, such as the [Straight-Through Estimator](https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste). Users can customize a backward function as follows:
```python
import torch
from nni.compression.torch.compressor import Quantizer, QuantGrad, QuantType

class ClipGrad(QuantGrad):
    @staticmethod
    def quant_backward(tensor, grad_output, quant_type):
        """
        This method should be overridden by subclasses to provide a customized backward function;
        the default implementation is the Straight-Through Estimator.

        Parameters
        ----------
        tensor : Tensor
            input of quantization operation
        grad_output : Tensor
            gradient of the output of quantization operation
        quant_type : QuantType
            the type of quantization, it can be `QuantType.QUANT_INPUT`, `QuantType.QUANT_WEIGHT`, `QuantType.QUANT_OUTPUT`,
            you can define different behavior for different types.

        Returns
        -------
        tensor
            gradient of the input of quantization operation
        """
        # for the quantize_output function, set grad to zero if the absolute value of tensor is larger than 1
        if quant_type == QuantType.QUANT_OUTPUT:
            grad_output[torch.abs(tensor) > 1] = 0
        return grad_output

class YourQuantizer(Quantizer):
    def __init__(self, model, config_list):
        super().__init__(model, config_list)
        # set your customized backward function to overwrite the default backward function
        self.quant_grad = ClipGrad
```
If you do not customize `QuantGrad`, the default backward function is the Straight-Through Estimator.
_Coming Soon_ ...
Lottery Ticket Hypothesis on NNI
===
## Introduction
The paper [The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks](https://arxiv.org/abs/1803.03635) is mainly a measurement and analysis paper that delivers very interesting insights. To support it on NNI, we mainly implement the training approach for finding *winning tickets*.
In this paper, the authors use the following process to prune a model, called *iterative pruning*:
>1. Randomly initialize a neural network f(x;theta_0) (where theta_0 follows D_{theta}).
>2. Train the network for j iterations, arriving at parameters theta_j.
>3. Prune p% of the parameters in theta_j, creating a mask m.
>4. Reset the remaining parameters to their values in theta_0, creating the winning ticket f(x;m*theta_0).
>5. Repeat steps 2, 3, and 4.
If the configured final sparsity is P (e.g., 0.8) and there are n times iterative pruning, each iterative pruning prunes 1-(1-P)^(1/n) of the weights that survive the previous round.
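For example, a quick check of this formula (the values are illustrative):

```python
# per-round pruning rate for final sparsity P over n iterative pruning rounds
P, n = 0.8, 5
per_round = 1 - (1 - P) ** (1 / n)   # ~0.275: each round prunes ~27.5% of surviving weights
assert abs((1 - per_round) ** n - (1 - P)) < 1e-12
```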
## Reproduce Results
We try to reproduce the experiment result of the fully connected network on MNIST using the same configuration as in the paper. The code can be found [here](https://github.com/microsoft/nni/tree/master/examples/model_compress/lottery_torch_mnist_fc.py). In this experiment, we prune 10 times; for each pruning iteration we train the pruned model for 50 epochs.
![](../../img/lottery_ticket_mnist_fc.png)
The above figure shows the result of the fully connected network. `round0-sparsity-0.0` is the performance without pruning. Consistent with the paper, pruning around 80% of the weights obtains performance similar to the unpruned network and converges a little faster. If too much is pruned, e.g., more than 94%, the accuracy becomes lower and convergence becomes a little slower. Slightly different from the paper, the trend in the paper's data is clearer.
# Speed up Masked Model

*This feature is in Beta version.*

## Introduction
# Supported Pruning Algorithms on NNI

We provide several pruning algorithms that support fine-grained weight pruning and structural filter pruning. **Weight pruning** generally results in unstructured models, which need specialized hardware or software to speed up the sparse network. **Filter pruning** achieves acceleration by removing entire filters. We also provide an algorithm to control the **pruning schedule**.
**Weight Pruning**
* [Level Pruner](#level-pruner)
* [Lottery Ticket Hypothesis](#lottery-ticket-hypothesis)
**Filter Pruning**
* [Slim Pruner](#slim-pruner)
* [Filter Pruners with Weight Rank](#weightrankfilterpruner)
* [FPGM Pruner](#fpgm-pruner)
* [Activation Mean Rank Pruner](#activationmeanrankfilterpruner)
* [Filter Pruners with Gradient Rank](#gradientrankfilterpruner)
* [Taylor FO On Weight Pruner](#taylorfoweightfilterpruner)
**Pruning Schedule**
* [AGP Pruner](#agp-pruner)
## Level Pruner
## Lottery Ticket Hypothesis

```python
for _ in pruner.get_prune_iterations():
    ...
```
The above configuration means there are 5 rounds of iterative pruning. As these 5 rounds are executed in the same run, LotteryTicketPruner needs `model` and `optimizer` (**note: add `lr_scheduler` if used**) to reset their states every time a new pruning iteration starts. Please use `get_prune_iterations` to get the pruning iterations, and invoke `prune_iteration_start` at the beginning of each iteration. `epoch_num` should be large enough for model convergence, because the hypothesis is that the performance (accuracy) obtained in later rounds with high sparsity is comparable to that obtained in the first round.
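A sketch of the resulting training loop; the configuration values, `epoch_num`, and the `train`/`validate` helpers are placeholders, while the pruner calls follow the description above:

```python
from nni.compression.torch import LotteryTicketPruner

configure_list = [{
    'prune_iterations': 5,
    'sparsity': 0.8,
    'op_types': ['default']
}]
pruner = LotteryTicketPruner(model, configure_list, optimizer)
pruner.compress()
for _ in pruner.get_prune_iterations():
    pruner.prune_iteration_start()
    for epoch in range(epoch_num):
        train(model, optimizer, train_loader)   # user-defined training step
        validate(model, val_loader)             # user-defined evaluation
```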
*Tensorflow version will be supported later.*

* **prune_iterations:** The number of rounds of iterative pruning.
* **sparsity:** The final sparsity when the compression is done.
### Reproduced Experiment
We try to reproduce the experiment result of the fully connected network on MNIST using the same configuration as in the paper. The code can be found [here](https://github.com/microsoft/nni/tree/master/examples/model_compress/lottery_torch_mnist_fc.py). In this experiment, we prune 10 times; for each pruning iteration we train the pruned model for 50 epochs.
![](../../img/lottery_ticket_mnist_fc.png)
The above figure shows the result of the fully connected network. `round0-sparsity-0.0` is the performance without pruning. Consistent with the paper, pruning around 80% of the weights obtains performance similar to the unpruned network and converges a little faster. If too much is pruned, e.g., more than 94%, the accuracy becomes lower and convergence becomes a little slower. Slightly different from the paper, the trend in the paper's data is clearer.

***

## Slim Pruner
- **sparsity:** The sparsity that the specified operations are to be compressed to.
- **op_types:** Only BatchNorm2d is supported in Slim Pruner.
### Reproduced Experiment
We implemented one of the experiments in ['Learning Efficient Convolutional Networks through Network Slimming'](https://arxiv.org/pdf/1708.06519.pdf): we pruned $70\%$ of the channels in the **VGGNet** for CIFAR-10 used in the paper, in which $88.5\%$ of the parameters are pruned. Our experiment results are as follows:
| Model | Error(paper/ours) | Parameters | Pruned |
| ------------- | ----------------- | ---------- | --------- |
| VGGNet | 6.34/6.40 | 20.04M | |
| Pruned-VGGNet | 6.20/6.26 | 2.03M | 88.5% |
The experiment code can be found at [examples/model_compress](https://github.com/microsoft/nni/tree/master/examples/model_compress/).
***
## WeightRankFilterPruner

WeightRankFilterPruner is a series of pruners which prune filters with the smallest importance criterion calculated from the weights in convolution layers, in order to achieve a preset level of network sparsity.
### L1Filter Pruner

This is a one-shot pruner proposed in ['PRUNING FILTERS FOR EFFICIENT CONVNETS'](https://arxiv.org/abs/1608.08710) by Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet and Hans Peter Graf.

![](../../img/l1filter_pruner.png)
- **sparsity:** The sparsity that the specified operations are to be compressed to.
- **op_types:** Only Conv1d and Conv2d are supported in L1Filter Pruner.
#### Reproduced Experiment
We implemented one of the experiments in ['PRUNING FILTERS FOR EFFICIENT CONVNETS'](https://arxiv.org/abs/1608.08710) with **L1FilterPruner**: we pruned **VGG-16** for CIFAR-10 to **VGG-16-pruned-A** in the paper, in which $64\%$ of the parameters are pruned. Our experiment results are as follows:
| Model | Error(paper/ours) | Parameters | Pruned |
| --------------- | ----------------- | --------------- | -------- |
| VGG-16 | 6.75/6.49 | 1.5x10^7 | |
| VGG-16-pruned-A | 6.60/6.47 | 5.4x10^6 | 64.0% |
The experiment code can be found at [examples/model_compress](https://github.com/microsoft/nni/tree/master/examples/model_compress/).

***

### L2Filter Pruner
- **sparsity:** The sparsity that the specified operations are to be compressed to.
- **op_types:** Only Conv1d and Conv2d are supported in L2Filter Pruner.
***
## ActivationRankFilterPruner

ActivationRankFilterPruner is a series of pruners which prune filters with the smallest importance criterion calculated from the output activations of convolution layers, in order to achieve a preset level of network sparsity.
- **sparsity:** The percentage of convolutional filters to be pruned.
- **op_types:** Only Conv2d is supported in ActivationMeanRankFilterPruner.
***
## GradientRankFilterPruner
# Supported Quantization Algorithms on NNI
Index of supported quantization algorithms
* [Naive Quantizer](#naive-quantizer)
* [QAT Quantizer](#qat-quantizer)
* [DoReFa Quantizer](#dorefa-quantizer)
* [BNN Quantizer](#bnn-quantizer)
## Naive Quantizer

We provide Naive Quantizer to quantize weights to 8 bits by default; you can use it to test quantization algorithms without any configuration.
## QAT Quantizer
You can view the example for more information.

#### User configuration for QAT Quantizer

Common configuration needed by compression algorithms can be found in the [Specification of `config_list`](./QuickStart.md).

Configuration needed by this algorithm:
Disable quantization until the model has been run for a certain number of steps; this allows the model to reach a state where activation quantization ranges do not exclude a significant fraction of values. The default value is 0.

### Note

Batch normalization folding is currently not supported.
***

## DoReFa Quantizer

In [DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients](https://arxiv.org/abs/1606.06160), authors Shuchang Zhou and Yuxin Wu provide an algorithm named DoReFa to quantize the weights, activations and gradients during training.

### Usage

To use DoReFa Quantizer, you can add the code below before your training code.

PyTorch code
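A sketch of the usual pattern; the class name `DoReFaQuantizer` and the configuration values are assumptions here, so please check the API reference for the exact names:

```python
from nni.compression.torch import DoReFaQuantizer

config_list = [{'quant_types': ['weight'], 'quant_bits': 8, 'op_types': ['default']}]
quantizer = DoReFaQuantizer(model, config_list)
quantizer.compress()
```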
You can view the example for more information.

#### User configuration for DoReFa Quantizer

Common configuration needed by compression algorithms can be found in the [Specification of `config_list`](./QuickStart.md).

Configuration needed by this algorithm:
***
## BNN Quantizer

In [Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1](https://arxiv.org/abs/1602.02830),

>We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time. At training-time the binary weights and activations are used for computing the parameters gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power-efficiency.
You can view the example [examples/model_compress/BNN_quantizer_cifar10.py](https://github.com/microsoft/nni/tree/master/examples/model_compress/BNN_quantizer_cifar10.py) for more information.

#### User configuration for BNN Quantizer

Common configuration needed by compression algorithms can be found in the [Specification of `config_list`](./QuickStart.md).

Configuration needed by this algorithm:

### Experiment

We implemented one of the experiments in [Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1](https://arxiv.org/abs/1602.02830): we quantized the **VGGNet** for CIFAR-10 in the paper. Our experiment results are as follows:

| Model | Accuracy |
# Tutorial for Model Compression

```eval_rst
.. contents::
```
In this tutorial, the [first section](#quick-start-to-compress-a-model) quickly goes through the usage of model compression on NNI, and the [second section](#detailed-usage-guide) explains more details of the usage.
## Quick Start to Compress a Model
NNI provides very simple APIs for compressing a model. The compression includes pruning algorithms and quantization algorithms. Their usage is the same, so here we use the [slim pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#slim-pruner) as an example to show the usage.

### Write configuration

Write a configuration to specify the layers that you want to prune. The following configuration means pruning all the `BatchNorm2d`s to sparsity 0.7 while keeping other layers unpruned.
```python
configure_list = [{
    'sparsity': 0.7,
    'op_types': ['BatchNorm2d'],
}]
```
The specification of configuration can be found [here](#specification-of-config-list). Note that different pruners may have their own defined fields in configuration, for example `start_epoch` in AGP pruner. Please refer to each pruner's [usage](./Pruner.md) for details, and adjust the configuration accordingly.

### Choose a compression algorithm

Choose a pruner to prune your model. First instantiate the chosen pruner with your model and configuration as arguments, then invoke `compress()` to compress your model.
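For the slim pruner used in this quick start, this looks roughly like the following, assuming `model` and the `configure_list` defined above:

```python
from nni.compression.torch import SlimPruner

pruner = SlimPruner(model, configure_list)
model = pruner.compress()
```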
Then, you can train your model using a traditional training approach (e.g., SGD); pruning is applied transparently during the training. Some pruners prune once at the beginning, and the following training can be seen as fine-tuning. Some pruners prune your model iteratively, and the masks are adjusted epoch by epoch during training.

### Export compression result

After training, you get the accuracy of the pruned model. You can export the model weights to a file, and the generated masks to a file as well. Exporting an ONNX model is also supported.
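For example, using the file names from this example:

```python
pruner.export_model(model_path='pruned_vgg19_cifar10.pth', mask_path='mask_vgg19_cifar10.pth')
```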
The complete code of the model compression examples can be found [here](https://github.com/microsoft/nni/blob/master/examples/model_compress/model_prune_torch.py).

### Speed up the model

Masks do not provide real speedup of your model. The model should be sped up based on the exported masks; thus, we provide an API to speed up your model as shown below. After invoking `apply_compression_results` on your model, your model becomes smaller with shorter inference latency.
```python
from nni.compression.torch import apply_compression_results
apply_compression_results(model, 'mask_vgg19_cifar10.pth')
```
Please refer to [here](ModelSpeedup.md) for a detailed description.
## Detailed Usage Guide
Example code for applying model compression on a user model is shown below:
PyTorch code
```python
from nni.compression.torch import LevelPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
pruner = LevelPruner(model, config_list)
pruner.compress()
```
Tensorflow code
```python
from nni.compression.tensorflow import LevelPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
pruner = LevelPruner(tf.get_default_graph(), config_list)
pruner.compress()
```
You can use other compression algorithms in the `nni.compression` package. The algorithms are implemented in both PyTorch and TensorFlow (partial support on TensorFlow), under `nni.compression.torch` and `nni.compression.tensorflow` respectively. You can refer to [Pruner](./Pruner.md) and [Quantizer](./Quantizer.md) for detailed descriptions of the supported algorithms. If you want to use knowledge distillation, you can refer to [KDExample](../TrialExample/KDExample.md).
A compression algorithm is first instantiated with a `config_list` passed in. The specification of this `config_list` will be described later.
The function call `pruner.compress()` modifies the user-defined model (in TensorFlow the model can be obtained with `tf.get_default_graph()`, while in PyTorch the model is the defined model class), and the model is modified with masks inserted. Then when you run the model, the masks take effect. The masks can be adjusted at runtime by the algorithms.

*Note that `pruner.compress` simply adds masks on model weights; it does not include fine-tuning logic. If users want to fine-tune the compressed model, they need to write the fine-tuning logic themselves after `pruner.compress`.*
### Specification of `config_list`
Users can specify the configuration (i.e., `config_list`) for a compression algorithm. For example, when compressing a model, users may want to specify the sparsity ratio, to specify different ratios for different types of operations, to exclude certain types of operations, or to compress only certain types of operations. For users to express these kinds of requirements, we define a configuration specification. It can be seen as a python `list` object, where each element is a `dict` object.

The `dict`s in the `list` are applied one by one; that is, the configurations in a latter `dict` will overwrite the configurations in former ones for the operations that are within the scope of both of them.
There are different keys in a `dict`. Some of them are common keys supported by all the compression algorithms:
* __op_types__: This is to specify what types of operations to be compressed. 'default' means following the algorithm's default setting.
* __op_names__: This is to specify by name which operations are to be compressed. If this field is omitted, operations will not be filtered by it.
* __exclude__: Default is False. If this field is True, it means the operations with specified types and names will be excluded from the compression.
Some other keys are often specific to a certain algorithm; users can refer to [pruning algorithms](./Pruner.md) and [quantization algorithms](./Quantizer.md) for the keys allowed by each algorithm.
A simple example of configuration is shown below:
```python
[
    {
        'sparsity': 0.8,
        'op_types': ['default']
    },
    {
        'sparsity': 0.6,
        'op_names': ['op_name1', 'op_name2']
    },
    {
        'exclude': True,
        'op_names': ['op_name3']
    }
]
```
It means following the algorithm's default setting for compressed operations with sparsity 0.8, but for `op_name1` and `op_name2` use sparsity 0.6, and do not compress `op_name3`.
#### Quantization specific keys
**If you use quantization algorithms, you need to specify more keys. If you use pruning algorithms, you can safely skip these keys.**
* __quant_types__ : list of string.
Type of quantization you want to apply; currently supported values are 'weight', 'input', 'output'. 'weight' means applying the quantization operation to the weight parameter of modules. 'input' means applying the quantization operation to the input of the module's forward method. 'output' means applying the quantization operation to the output of the module's forward method, which is often called 'activation' in some papers.
* __quant_bits__ : int or dict of {str : int}
bit length of quantization; the key is the quantization type and the value is the bit length, e.g.,
```
{
    quant_bits: {
        'weight': 8,
        'output': 4,
    },
}
```
When the value is of int type, all quantization types share the same bit length, e.g.,
```
{
    quant_bits: 8, # weight and output quantization both use 8 bits
}
```
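Putting the quantization-specific keys together, a hypothetical `config_list` might look like this; the quantizer class `QAT_Quantizer` and the chosen op types are just examples:

```python
from nni.compression.torch import QAT_Quantizer

config_list = [{
    'quant_types': ['weight', 'output'],
    'quant_bits': {'weight': 8, 'output': 8},
    'op_types': ['Conv2d', 'Linear']
}]
quantizer = QAT_Quantizer(model, config_list)
quantizer.compress()
```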
### APIs for Updating Fine Tuning Status
Some compression algorithms use epochs to control the progress of compression (e.g. [AGP](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#agp-pruner)), and some algorithms need to do something after every minibatch. Therefore, we provide another two APIs for users to invoke: `pruner.update_epoch(epoch)` and `pruner.step()`.
`update_epoch` should be invoked in every epoch, while `step` should be invoked after each minibatch. Note that most algorithms do not require calling the two APIs. Please refer to each algorithm's document for details. For the algorithms that do not need them, calling them is allowed but has no effect.
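A sketch of where the two calls typically sit in a standard PyTorch training loop; `model`, `criterion`, `optimizer`, `train_loader` and `num_epochs` are assumed to be defined by the user:

```python
for epoch in range(num_epochs):
    pruner.update_epoch(epoch)      # let epoch-driven algorithms (e.g. AGP) advance
    for data, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
        pruner.step()               # per-minibatch hook; no effect for algorithms that ignore it
```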
### Export Compressed Model
You can easily export the compressed model using the following API if you are pruning your model; the ```state_dict``` of the sparse model weights will be stored in ```model.pth```, which can be loaded by ```torch.load('model.pth')```. In the exported ```model.pth```, the masked weights are zero.
```
pruner.export_model(model_path='model.pth')
```
```mask_dict``` and the pruned model in ```onnx``` format (```input_shape``` needs to be specified) can also be exported like this:
```python
pruner.export_model(model_path='model.pth', mask_path='mask.pth', onnx_path='model.onnx', input_shape=[1, 1, 28, 28])
```
If you want to really speed up the compressed model, please refer to [NNI model speedup](./ModelSpeedup.md) for details.
SlimPruner on NNI Compressor
===
## 1. Slim Pruner
SlimPruner is a structured pruning algorithm for pruning channels in the convolutional layers by pruning corresponding scaling factors in the later BN layers.
It was proposed in ['Learning Efficient Convolutional Networks through Network Slimming'](https://arxiv.org/pdf/1708.06519.pdf) by Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan and Changshui Zhang.
![](../../img/slim_pruner.png)
> Slim Pruner **prunes channels in the convolution layers by masking corresponding scaling factors in the later BN layers**. L1 regularization on the scaling factors should be applied in batch normalization (BN) layers while training, and the scaling factors of BN layers are **globally ranked** while pruning, so the sparse model can be automatically found given the sparsity.
## 2. Usage
PyTorch code
```
from nni.compression.torch import SlimPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['BatchNorm2d'] }]
pruner = SlimPruner(model, config_list)
pruner.compress()
```
#### User configuration for Slim Pruner
- **sparsity:** The sparsity that the specified operations are to be compressed to.
- **op_types:** Only BatchNorm2d is supported in Slim Pruner
## 3. Experiment
We implemented one of the experiments in ['Learning Efficient Convolutional Networks through Network Slimming'](https://arxiv.org/pdf/1708.06519.pdf): we pruned $70\%$ of the channels in the **VGGNet** for CIFAR-10 used in the paper, in which $88.5\%$ of the parameters are pruned. Our experiment results are as follows:
| Model | Error(paper/ours) | Parameters | Pruned |
| ------------- | ----------------- | ---------- | --------- |
| VGGNet | 6.34/6.40 | 20.04M | |
| Pruned-VGGNet | 6.20/6.26 | 2.03M | 88.5% |
The experiment code can be found at [examples/model_compress](https://github.com/microsoft/nni/tree/master/examples/model_compress/).
L1FilterPruner on NNI
===
## Introduction
L1FilterPruner is a general structured pruning algorithm for pruning filters in the convolutional layers.
It was proposed in ['PRUNING FILTERS FOR EFFICIENT CONVNETS'](https://arxiv.org/abs/1608.08710) by Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet and Hans Peter Graf.
![](../../img/l1filter_pruner.png)
> L1Filter Pruner prunes filters in the **convolution layers**
>
> The procedure of pruning m filters from the ith convolutional layer is as follows:
>
> 1. For each filter ![](http://latex.codecogs.com/gif.latex?F_{i,j}), calculate the sum of its absolute kernel weights ![](http://latex.codecogs.com/gif.latex?s_j=\sum_{l=1}^{n_i}\sum|K_l|)
> 2. Sort the filters by ![](http://latex.codecogs.com/gif.latex?s_j).
> 3. Prune ![](http://latex.codecogs.com/gif.latex?m) filters with the smallest sum values and their corresponding feature maps. The
> kernels in the next convolutional layer corresponding to the pruned feature maps are also
> removed.
> 4. A new kernel matrix is created for both the ![](http://latex.codecogs.com/gif.latex?i)th and ![](http://latex.codecogs.com/gif.latex?i+1)th layers, and the remaining kernel
> weights are copied to the new model.
## Experiment
We implemented one of the experiments in ['PRUNING FILTERS FOR EFFICIENT CONVNETS'](https://arxiv.org/abs/1608.08710) with **L1FilterPruner**: we pruned **VGG-16** for CIFAR-10 to **VGG-16-pruned-A** in the paper, in which $64\%$ of the parameters are pruned. Our experiment results are as follows:
| Model | Error(paper/ours) | Parameters | Pruned |
| --------------- | ----------------- | --------------- | -------- |
| VGG-16 | 6.75/6.49 | 1.5x10^7 | |
| VGG-16-pruned-A | 6.60/6.47 | 5.4x10^6 | 64.0% |
The experiment code can be found at [examples/model_compress](https://github.com/microsoft/nni/tree/master/examples/model_compress/).
For details, please refer to the following tutorials:

Overview <Compressor/Overview>
Quick Start <Compressor/QuickStart>
Pruners <Compressor/Pruner>
Quantizers <Compressor/Quantizer>
Automatic Model Compression <Compressor/AutoCompression>
Model Speedup <Compressor/ModelSpeedup>
Compression Utilities <Compressor/CompressionUtils>
Customize Compression Algorithms <Compressor/Framework>
############################
Supported Pruning Algorithms
############################
.. toctree::
   :maxdepth: 1

   Level Pruner <Compressor/Pruner>
   AGP Pruner <Compressor/Pruner>
   Lottery Ticket Pruner <Compressor/LotteryTicketHypothesis>
   FPGM Pruner <Compressor/Pruner>
   L1Filter Pruner <Compressor/l1filterpruner>
   L2Filter Pruner <Compressor/Pruner>
   ActivationAPoZRankFilterPruner <Compressor/Pruner>
   ActivationMeanRankFilterPruner <Compressor/Pruner>
   Slim Pruner <Compressor/SlimPruner>
#################################
Supported Quantization Algorithms
#################################
.. toctree::
   :maxdepth: 1

   Naive Quantizer <Compressor/Quantizer>
   QAT Quantizer <Compressor/Quantizer>
   DoReFa Quantizer <Compressor/Quantizer>
   BNN Quantizer <Compressor/Quantizer>