To simplify writing a new compression algorithm, we have designed programming interfaces that are simple yet flexible. There are interfaces for pruning and quantization respectively. Below, we first demonstrate how to customize a new pruning algorithm and then how to customize a new quantization algorithm.
## Customize a new pruning algorithm
To better demonstrate how to customize a new pruning algorithm, it is necessary for users to first understand the framework for supporting various pruning algorithms in NNI.
### Framework overview for pruning algorithms
The following example shows how to use a pruner:
...
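The original snippet is elided above; as a rough sketch of typical pruner usage, using the built-in `LevelPruner` (names here follow NNI's v1.x PyTorch API, but treat the exact import path as an assumption):

```python
from nni.compression.torch import LevelPruner

# `model` is assumed to be an already-defined PyTorch model; some pruners also accept an optimizer
config_list = [{'sparsity': 0.8, 'op_types': ['default']}]
pruner = LevelPruner(model, config_list)
model = pruner.compress()          # wraps the configured layers with module wrappers
# ... fine-tune `model` as usual; masks are applied transparently in forward ...
pruner.export_model(model_path='pruned.pth', mask_path='mask.pth')
```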
A pruner receives `model`, `config_list` and `optimizer` as arguments. It prunes ...
From an implementation perspective, a pruner consists of a `weight masker` instance and multiple `module wrapper` instances.
#### Weight masker
A `weight masker` is the implementation of a pruning algorithm; it can prune a specified layer, wrapped by a `module wrapper`, with a specified sparsity.
#### Module wrapper
A `module wrapper` is a module containing:
...
The reasons to use `module wrapper` are:
1. Some buffers are needed by `calc_mask` to calculate masks, and these buffers should be registered in the `module wrapper` so that the original modules are not contaminated.
2. A new `forward` method is needed to apply the masks to the weight before calling the real `forward` method. A simplified sketch of such a wrapper is shown below.
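To make the two points above concrete, here is a highly simplified, illustrative sketch of what such a wrapper might look like. It is not NNI's actual implementation (the real wrapper also handles bias masks, configuration and state-dict bookkeeping), and the class and attribute names are invented for illustration:

```python
import torch
import torch.nn as nn

class SimplifiedModuleWrapper(nn.Module):
    """Illustrative wrapper around an original module (not NNI's actual implementation)."""
    def __init__(self, module, module_name):
        super().__init__()
        self.module = module
        self.name = module_name
        # the mask buffer lives on the wrapper, so the original module stays uncontaminated
        self.register_buffer('weight_mask', torch.ones_like(module.weight))

    def forward(self, *inputs):
        # apply the mask to the weight in place before calling the real forward
        self.module.weight.data.mul_(self.weight_mask)
        return self.module(*inputs)
```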
#### Pruner
A `pruner` is responsible for:
...
3. Use `weight masker` to calculate masks of layers while pruning.
4. Export pruned model weights and masks.
### Implement a new pruning algorithm
Implementing a new pruning algorithm requires implementing a `weight masker` class, which should be a subclass of `WeightMasker`, and a `pruner` class, which should be a subclass of `Pruner`.
An implementation of `weight masker` may look like this:
...
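The original example is elided above; the following is only a minimal magnitude-based sketch. The base-class import path and the exact `calc_mask` signature/return format are assumptions modeled on NNI's built-in maskers, so check the linked implementations before relying on them:

```python
import torch
# import path is an assumption; see the NNI source tree for the actual location of WeightMasker
from nni.compression.torch.pruning.weight_masker import WeightMasker

class MyMasker(WeightMasker):
    def calc_mask(self, sparsity, wrapper, wrapper_idx=None):
        # zero out the `sparsity` fraction of weights with the smallest magnitude
        weight = wrapper.module.weight.data
        num_prune = int(weight.numel() * sparsity)
        if num_prune == 0:
            return {'weight_mask': torch.ones_like(weight)}
        threshold = torch.topk(weight.abs().view(-1), num_prune, largest=False)[0].max()
        return {'weight_mask': torch.gt(weight.abs(), threshold).type_as(weight)}
```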
You can refer to NNI's built-in [weight masker](https://github.com/microsoft/nni/blob/master/src/sdk/pynni/nni/compression/torch/pruning/structured_pruning.py) implementations when implementing your own weight masker.
With multi-GPU training, buffers and parameters are copied to each GPU every time the `forward` method runs. If buffers and parameters are updated in the `forward` method, an in-place update is needed to ensure the update is effective.
Since `calc_mask` is called in the `optimizer.step` method, which happens after the `forward` method and happens only on one GPU, it supports multi-GPU naturally.
***
## Customize a new quantization algorithm
To write a new quantization algorithm, you can write a class that inherits `nni.compression.torch.Quantizer`. Then, override the member functions with the logic of your algorithm; the member functions to override are `quantize_weight`, `quantize_output` and `quantize_input`. `quantize_weight` directly returns the quantized weights rather than a mask, because for quantization the quantized weights cannot be obtained by applying a mask.
```python
from nni.compression.torch import Quantizer

class YourQuantizer(Quantizer):
    def __init__(self, model, config_list):
        """
        We suggest you use the NNI-defined spec for config
        """
        super().__init__(model, config_list)

    def quantize_weight(self, weight, config, **kwargs):
        """
        quantize should overload this method to quantize weight tensors.
        This method is effectively hooked to :meth:`forward` of the model.

        Parameters
        ----------
        weight : Tensor
            weight that needs to be quantized
        config : dict
            the configuration for weight quantization
        """
        # Put your code to generate `new_weight` here
        return new_weight

    def quantize_output(self, output, config, **kwargs):
        """
        quantize should overload this method to quantize output.
        This method is effectively hooked to :meth:`forward` of the model.

        Parameters
        ----------
        output : Tensor
            output that needs to be quantized
        config : dict
            the configuration for output quantization
        """
        # Put your code to generate `new_output` here
        return new_output

    def quantize_input(self, *inputs, config, **kwargs):
        """
        quantize should overload this method to quantize input.
        This method is effectively hooked to :meth:`forward` of the model.

        Parameters
        ----------
        inputs : Tensor
            inputs that need to be quantized
        config : dict
            the configuration for inputs quantization
        """
        # Put your code to generate `new_input` here
        return new_input

    def update_epoch(self, epoch_num):
        pass

    def step(self):
        """
        Can do some processing based on the model or weights bound
        in the func bind_model
        """
        pass
```
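Once such a quantizer is defined, it is used like a pruner: instantiate it with a model and a `config_list`, then call `compress()`. The snippet below is only a usage sketch; the toy model and configuration values are illustrative assumptions, not taken from the original document:

```python
import torch.nn as nn

# toy model and configuration purely for illustration
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))
config_list = [{
    'quant_types': ['weight', 'output'],
    'quant_bits': {'weight': 8, 'output': 8},
    'op_types': ['Conv2d']
}]

quantizer = YourQuantizer(model, config_list)
quantizer.compress()   # wraps the configured layers so the quantize_* hooks take effect
```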
### Customize backward function
Sometimes it's necessary for a quantization operation to have a customized backward function, such as the [Straight-Through Estimator](https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste). You can customize a backward function as follows:
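A sketch of such a customized backward function is shown below; it follows the `QuantGrad` pattern that also appears later in this document, and the exact import path should be double-checked against your NNI version:

```python
import torch
from nni.compression.torch.compressor import Quantizer, QuantGrad, QuantType

class ClipGrad(QuantGrad):
    @staticmethod
    def quant_backward(tensor, grad_output, quant_type):
        # zero the gradient where |tensor| > 1, but only for output quantization
        if quant_type == QuantType.QUANT_OUTPUT:
            grad_output[torch.abs(tensor) > 1] = 0
        return grad_output

class YourQuantizer(Quantizer):
    def __init__(self, model, config_list):
        super().__init__(model, config_list)
        # overwrite the default backward (Straight-Through Estimator) with the customized one
        self.quant_grad = ClipGrad
```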
The paper [The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks](https://arxiv.org/abs/1803.03635) is mainly a measurement and analysis paper; it delivers very interesting insights. To support it on NNI, we mainly implement the training approach for finding *winning tickets*.
In this paper, the authors use the following process to prune a model, called *iterative pruning*:
>1. Randomly initialize a neural network f(x;theta_0) (where theta_0 follows D_theta).
>2. Train the network for j iterations, arriving at parameters theta_j.
>3. Prune p% of the parameters in theta_j, creating a mask m.
>4. Reset the remaining parameters to their values in theta_0, creating the winning ticket f(x;m*theta_0).
>5. Repeat step 2, 3, and 4.
If the configured final sparsity is P (e.g., 0.8) and there are n rounds of iterative pruning, each round prunes 1-(1-P)^(1/n) of the weights that survived the previous round.
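For example, with P = 0.8 and n = 10 rounds, each round prunes roughly 14.9% of the surviving weights, and the cumulative sparsity after 10 rounds is exactly 0.8. The snippet below is just this arithmetic, not NNI output:

```python
P, n = 0.8, 10                       # final sparsity and number of pruning rounds
q = 1 - (1 - P) ** (1 / n)           # fraction of surviving weights pruned per round
print(f"per-round prune fraction: {q:.4f}")        # ~0.1487
print(f"final sparsity: {1 - (1 - q) ** n:.4f}")   # 0.8000
```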
## Reproduce Results
We try to reproduce the experiment result of the fully connected network on MNIST using the same configuration as in the paper. The code can be found [here](https://github.com/microsoft/nni/tree/master/examples/model_compress/lottery_torch_mnist_fc.py). In this experiment, we prune the model 10 times; after each pruning, we train the pruned model for 50 epochs.

The above figure shows the result of the fully connected network. `round0-sparsity-0.0` is the performance without pruning. Consistent with the paper, pruning around 80% of the weights obtains performance similar to no pruning, and converges a little faster. If we prune too much, e.g., more than 94%, the accuracy drops and convergence becomes a little slower. Our results differ slightly from the paper: the trend in the paper's data is somewhat clearer than in ours.
As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for some real-time applications. Model compression can be used to address this problem.
We are glad to introduce the model compression toolkit on top of NNI. It is still in the experimental phase and might evolve based on usage feedback. We'd like to invite you to use it, give feedback, and even contribute.
```eval_rst
.. contents::
```
NNI provides an easy-to-use toolkit to help users design and use compression algorithms. It currently supports PyTorch with a unified interface. To compress their models, users only need to add several lines to their code. Some popular model compression algorithms are built into NNI. Users can further use NNI's auto tuning power to find the best compressed model, which is detailed in [Auto Model Compression](./AutoCompression.md). On the other hand, users can easily customize their own new compression algorithms using NNI's interface; refer to the tutorial [here](#customize-new-compression-algorithms). Details about how the model compression framework works can be found [here](./Framework.md).
NNI provides a model compression toolkit to help users compress and speed up their models with state-of-the-art compression algorithms and strategies. There are several core features supported by NNI model compression:
* Support many popular pruning and quantization algorithms.
* Automate model pruning and quantization process with state-of-the-art strategies and NNI's auto tuning power.
* Speed up a compressed model to lower its inference latency and reduce its size.
* Provide friendly and easy-to-use compression utilities for users to dive into the compression process and results.
* Concise interface for users to customize their own compression algorithms.
For a survey of model compression, you can refer to this paper: [Recent Advances in Efficient Computation of Deep Convolutional Neural Networks](https://arxiv.org/pdf/1802.00939.pdf).
*Note that the interface and APIs are unified for both PyTorch and TensorFlow; currently only the PyTorch version is supported, and the TensorFlow version will be supported in the future.*
## Supported Algorithms
We have provided several compression algorithms, including pruning and quantization algorithms.
### Pruning Algorithms
Pruning algorithms compress the original network by removing redundant weights or channels of layers, which can reduce model complexity and address the over-fitting issue.

|Name|Brief Introduction of Algorithm|
|---|---|
| [Level Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#level-pruner) | Pruning the specified ratio on each weight based on absolute values of weights |
| [AGP Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#agp-pruner) | Automated gradual pruning (To prune, or not to prune: exploring the efficacy of pruning for model compression) [Reference Paper](https://arxiv.org/abs/1710.01878)|
| [Lottery Ticket Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#lottery-ticket-hypothesis) | The pruning process used by "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". It prunes a model iteratively. [Reference Paper](https://arxiv.org/abs/1803.03635)|
| [FPGM Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#fpgm-pruner) | Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration [Reference Paper](https://arxiv.org/pdf/1811.00250.pdf)|
| [L1Filter Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#l1filter-pruner) | Pruning filters with the smallest L1 norm of weights in convolution layers (Pruning Filters for Efficient Convnets) [Reference Paper](https://arxiv.org/abs/1608.08710) |
| [L2Filter Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#l2filter-pruner) | Pruning filters with the smallest L2 norm of weights in convolution layers |
| [ActivationAPoZRankFilterPruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#activationapozrankfilterpruner) | Pruning filters based on the metric APoZ (average percentage of zeros) which measures the percentage of zeros in activations of (convolutional) layers. [Reference Paper](https://arxiv.org/abs/1607.03250) |
| [ActivationMeanRankFilterPruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#activationmeanrankfilterpruner) | Pruning filters based on the metric that calculates the smallest mean value of output activations |
| [Slim Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#slim-pruner) | Pruning channels in convolution layers by pruning scaling factors in BN layers (Learning Efficient Convolutional Networks through Network Slimming) [Reference Paper](https://arxiv.org/abs/1708.06519) |
| [TaylorFO Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#taylorfoweightfilterpruner) | Pruning filters based on the first-order Taylor expansion on weights (Importance Estimation for Neural Network Pruning) [Reference Paper](http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf) |
### Quantization Algorithms
Quantization algorithms compress the original network by reducing the number of bits required to represent weights or activations, which can reduce the computations and the inference time.

|Name|Brief Introduction of Algorithm|
|---|---|
| [QAT Quantizer](https://nni.readthedocs.io/en/latest/Compressor/Quantizer.html#qat-quantizer) | Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. [Reference Paper](http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf)|
| [DoReFa Quantizer](https://nni.readthedocs.io/en/latest/Compressor/Quantizer.html#dorefa-quantizer) | DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. [Reference Paper](https://arxiv.org/abs/1606.06160)|
| [BNN Quantizer](https://nni.readthedocs.io/en/latest/Compressor/Quantizer.html#bnn-quantizer) | Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. [Reference Paper](https://arxiv.org/abs/1602.02830)|
## Usage of built-in compression algorithms
We use a simple example to show how to modify your trial code in order to apply the compression algorithms. Let's say you want to prune all weights to 80% sparsity with Level Pruner; you can add the following three lines into your code before training your model ([here](https://github.com/microsoft/nni/tree/master/examples/model_compress) is the complete code).
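The three lines are not reproduced in this document; as a sketch (matching the Level Pruner description above, with the import path assumed from NNI's PyTorch package):

```python
from nni.compression.torch import LevelPruner

config_list = [{'sparsity': 0.8, 'op_types': ['default']}]
pruner = LevelPruner(model, config_list)   # `model` is your defined PyTorch model
pruner.compress()
```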
Given a targeted compression ratio, it is pretty hard to obtain the best compression in a one-shot manner. An automatic model compression algorithm usually needs to explore the compression space by compressing different layers with different sparsities. NNI provides such algorithms to free users from specifying the sparsity of each layer in a model. Moreover, users can leverage NNI's auto tuning power to automatically compress a model. The detailed document can be found [here](./AutoCompression.md).
You can use other compression algorithms in the `nni.compression` package. The algorithms are implemented in both PyTorch and TensorFlow, under `nni.compression.torch` and `nni.compression.tensorflow` respectively. You can refer to [Pruner](./Pruner.md) and [Quantizer](./Quantizer.md) for detailed descriptions of the supported algorithms. Also, if you want to use knowledge distillation, you can refer to [KDExample](../TrialExample/KDExample.md).
## Model Speedup
The function call `pruner.compress()` modifies the user-defined model (in TensorFlow the model can be obtained with `tf.get_default_graph()`, while in PyTorch the model is the defined model class), and the model is modified with masks inserted. Then when you run the model, the masks take effect. The masks can be adjusted at runtime by the algorithms.
The final goal of model compression is to reduce inference latency and model size. However, existing model compression algorithms mainly use simulation to check the performance (e.g., accuracy) of the compressed model, for example, using masks for pruning algorithms, and storing quantized values still in float32 for quantization algorithms. Given the output masks and quantization bits produced by those algorithms, NNI can really speed up the model. The detailed tutorial on Model Speedup can be found [here](./ModelSpeedup.md).
When instantiating a compression algorithm, a `config_list` is passed in. We describe how to write this config below.
## Compression Utilities
Compression utilities include some useful tools for users to understand and analyze the model they want to compress. For example, users can check the sensitivity of each layer to pruning, and can easily calculate the FLOPs and parameter size of a model. Please refer to [here](./CompressionUtils.md) for a complete list of compression utilities.
### User configuration for a compression algorithm
When compressing a model, users may want to specify the sparsity ratio, to specify different ratios for different types of operations, to exclude certain types of operations, or to compress only certain types of operations. For users to express these kinds of requirements, we define a configuration specification. It can be seen as a Python `list` object, where each element is a `dict` object.
The `dict`s in the `list` are applied one by one, that is, the configurations in a latter `dict` will overwrite the configurations in former ones for the operations that are within the scope of both of them.
#### Common keys
In each `dict`, there are some keys commonly supported by NNI compression:
* __op_types__: This is to specify what types of operations are to be compressed. 'default' means following the algorithm's default setting.
* __op_names__: This is to specify by name what operations are to be compressed. If this field is omitted, operations will not be filtered by it.
* __exclude__: Default is False. If this field is True, the operations with the specified types and names will be excluded from the compression.
#### Keys for quantization algorithms
**If you use quantization algorithms, you need to specify more keys. If you use pruning algorithms, you can safely skip these keys**
* __quant_types__ : list of strings.
The types of quantization you want to apply; currently 'weight', 'input' and 'output' are supported. 'weight' means applying the quantization operation to the weight parameter of modules. 'input' means applying the quantization operation to the input of the module's forward method. 'output' means applying the quantization operation to the output of the module's forward method, which is often called 'activation' in some papers.
* __quant_bits__ : int or dict of {str : int}
The bit length of quantization; the key is the quantization type and the value is the quantization bit length, e.g.
```
{
    quant_bits: {
        'weight': 8,
        'output': 4,
    },
}
```
When the value is of int type, all quantization types share the same bit length, e.g.
```
{
    quant_bits: 8, # weight or output quantization are all 8 bits
}
```
#### Other keys specific to each compression algorithm
There are also other keys in the `dict`, but they are specific to each compression algorithm. For example, [Level Pruner](./Pruner.md#level-pruner) requires the `sparsity` key to specify how much a model should be pruned.
#### Example
A simple example of configuration is shown below:
```python
[
    {
        'sparsity': 0.8,
        'op_types': ['default']
    },
    {
        'sparsity': 0.6,
        'op_names': ['op_name1', 'op_name2']
    },
    {
        'exclude': True,
        'op_names': ['op_name3']
    }
]
```
It means following the algorithm's default setting for compressed operations with sparsity 0.8, but for `op_name1` and `op_name2` use sparsity 0.6, and please do not compress `op_name3`.
### Other APIs
Some compression algorithms use epochs to control the progress of compression (e.g. [AGP](./Pruner.md#agp-pruner)), and some algorithms need to do something after every minibatch. Therefore, we provide another two APIs for users to invoke. One is `update_epoch`; you can use it as follows:
TensorFlow code
```python
pruner.update_epoch(epoch, sess)
```
PyTorch code
```python
pruner.update_epoch(epoch)
```
The other is `step`; it can be called with `pruner.step()` after each minibatch. Note that not all algorithms need these two APIs; for those that do not need them, calling them is allowed but has no effect.
You can easily export the compressed model using the following API if you are pruning your model. The `state_dict` of the sparse model weights will be stored in `model.pth`, which can be loaded by `torch.load('model.pth')`.
```
pruner.export_model(model_path='model.pth')
```
`mask_dict` and the pruned model in `onnx` format (`input_shape` needs to be specified) can also be exported like this:
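A sketch of such a call is shown below; the file names and `input_shape` value are placeholders, while the keyword names follow NNI's documented `export_model` API:

```python
pruner.export_model(model_path='model.pth', mask_path='mask.pth',
                    onnx_path='model.onnx', input_shape=[1, 1, 28, 28])
```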
To simplify writing a new compression algorithm, we have designed programming interfaces that are simple yet flexible. There are interfaces for pruners and quantizers respectively.
### Pruning algorithm
If you want to write a new pruning algorithm, you can write a class that inherits `nni.compression.tensorflow.Pruner` or `nni.compression.torch.Pruner` depending on which framework you use. Then, override the member functions with the logic of your algorithm.
```python
# This is writing a pruner in tensorflow.
# For writing a pruner in PyTorch, you can simply replace
# nni.compression.tensorflow.Pruner with
# nni.compression.torch.Pruner
class YourPruner(nni.compression.tensorflow.Pruner):
    def __init__(self, model, config_list):
        """
        Suggest you to use the NNI defined spec for config
        """
        super().__init__(model, config_list)

    def calc_mask(self, layer, config):
        """
        Pruners should overload this method to provide mask for weight tensors.
        The mask must have the same shape and type comparing to the weight.
        It will be applied with ``mul()`` operation on the weight.
        This method is effectively hooked to ``forward()`` method of the model.

        Parameters
        ----------
        layer: LayerInfo
            calculate mask for ``layer``'s weight
        config: dict
            the configuration for generating the mask
        """
        return your_mask

    # note for pytorch version, there is no sess in input arguments
    def update_epoch(self, epoch_num, sess):
        pass

    # note for pytorch version, there is no sess in input arguments
    def step(self, sess):
        """
        Can do some processing based on the model or weights binded
        in the func bind_model
        """
        pass
```
For the simplest algorithm, you only need to override ``calc_mask``. It receives the to-be-compressed layers one by one along with their compression configuration. You generate the mask for the weight in this function and return it. Then NNI applies the mask for you.
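For instance, a magnitude-based `calc_mask` might look like the sketch below (PyTorch flavor; `layer.module.weight` and the `sparsity` configuration key are assumptions about how the layer and configuration are exposed):

```python
import torch

def calc_mask(self, layer, config):
    # keep large-magnitude weights; zero out the `sparsity` fraction with the smallest |w|
    weight = layer.module.weight.data
    num_prune = int(weight.numel() * config['sparsity'])
    if num_prune == 0:
        return torch.ones_like(weight)
    threshold = torch.topk(weight.abs().view(-1), num_prune, largest=False)[0].max()
    return torch.gt(weight.abs(), threshold).type_as(weight)
```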
Some algorithms generate mask based on training progress, i.e., epoch number. We provide `update_epoch` for the pruner to be aware of the training progress. It should be called at the beginning of each epoch.
Some algorithms may want global information for generating masks, for example, all weights of the model (for statistical information). You can use `self.bound_model` in the Pruner class to access the weights. If you also need the optimizer's information (for example in PyTorch), you can override `__init__` to receive more arguments such as the model's optimizer. Then `step` can process or update the information according to the algorithm. You can refer to [source code of built-in algorithms](https://github.com/microsoft/nni/tree/master/src/sdk/pynni/nni/compressors) for example implementations.
### Quantization algorithm
The interface for customizing a quantization algorithm is similar to that of pruning algorithms. The only difference is that `calc_mask` is replaced with `quantize_weight`. `quantize_weight` directly returns the quantized weights rather than a mask, because for quantization the quantized weights cannot be obtained by applying a mask.
```python
from nni.compression.torch.compressor import Quantizer

class YourQuantizer(Quantizer):
    def __init__(self, model, config_list):
        """
        Suggest you to use the NNI defined spec for config
        """
        super().__init__(model, config_list)

    # quantize_weight and quantize_output can be overridden in the same way as quantize_input

    def quantize_input(self, *inputs, config, **kwargs):
        """
        quantize should overload this method to quantize input.
        This method is effectively hooked to :meth:`forward` of the model.

        Parameters
        ----------
        inputs : Tensor
            inputs that need to be quantized
        config : dict
            the configuration for inputs quantization
        """
        # Put your code to generate `new_input` here
        return new_input

    def update_epoch(self, epoch_num):
        pass

    def step(self):
        """
        Can do some processing based on the model or weights bound
        in the func bind_model
        """
        pass
```
#### Customize backward function
Sometimes it's necessary for a quantization operation to have a customized backward function, such as the [Straight-Through Estimator](https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste). You can customize a backward function as follows:
```python
import torch
from nni.compression.torch.compressor import Quantizer, QuantGrad, QuantType

class ClipGrad(QuantGrad):
    @staticmethod
    def quant_backward(tensor, grad_output, quant_type):
        """
        This method should be overridden by the subclass to provide a customized backward function;
        the default implementation is the Straight-Through Estimator.

        Parameters
        ----------
        tensor : Tensor
            input of quantization operation
        grad_output : Tensor
            gradient of the output of quantization operation
        quant_type : QuantType
            the type of quantization, it can be `QuantType.QUANT_INPUT`, `QuantType.QUANT_WEIGHT`, `QuantType.QUANT_OUTPUT`,
            you can define different behavior for different types.

        Returns
        -------
        tensor
            gradient of the input of quantization operation
        """
        # for quant_output function, set grad to zero if the absolute value of tensor is larger than 1
        if quant_type == QuantType.QUANT_OUTPUT:
            grad_output[torch.abs(tensor) > 1] = 0
        return grad_output

class YourQuantizer(Quantizer):
    def __init__(self, model, config_list):
        super().__init__(model, config_list)
        # set your customized backward function to overwrite default backward function
        self.quant_grad = ClipGrad
```
If you do not customize `QuantGrad`, the default backward is Straight-Through Estimator.
NNI model compression provides a simple interface for users to customize a new compression algorithm. The design philosophy of the interface is to let users focus on the compression logic while hiding framework-specific implementation details. The detailed tutorial for customizing a new compression algorithm (pruning or quantization) can be found [here](./Framework.md).
_Coming Soon_ ...
## Reference and Feedback
* To [report a bug](https://github.com/microsoft/nni/issues/new?template=bug-report.md) for this feature in GitHub;
We provide several pruning algorithms that support fine-grained weight pruning and structural filter pruning. **Weight pruning** generally results in unstructured models, which need specialized hardware or software to speed up the sparse network. **Filter pruning** achieves acceleration by removing the entire filter. We also provide an algorithm to control the **pruning schedule**.
* [Filter Pruners with Weight Rank](#weightrankfilterpruner)
* [FPGM Pruner](#fpgm-pruner)
...
* [Activation Mean Rank Pruner](#activationmeanrankfilterpruner)
* [Filter Pruners with Gradient Rank](#gradientrankfilterpruner)
* [Taylor FO On Weight Pruner](#taylorfoweightfilterpruner)

**Pruning Schedule**
* [AGP Pruner](#agp-pruner)
## Level Pruner
...
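The elided content covers several pruners; the `LotteryTicketPruner` usage that the next paragraph describes corresponds roughly to the following sketch (`model`, `optimizer` and `epoch_num` are placeholders from your own training code, and the configuration values mirror the description below):

```python
from nni.compression.torch import LotteryTicketPruner

config_list = [{
    'prune_iterations': 5,
    'sparsity': 0.8,
    'op_types': ['default']
}]
pruner = LotteryTicketPruner(model, config_list, optimizer)
pruner.compress()
for _ in pruner.get_prune_iterations():
    pruner.prune_iteration_start()
    for epoch in range(epoch_num):
        ...   # train and evaluate the model for this round
```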
The above configuration means that there are 5 rounds of iterative pruning. As the 5 rounds are executed in the same run, LotteryTicketPruner needs `model` and `optimizer` (**note: also include `lr_scheduler` if one is used**) to reset their states every time a new prune iteration starts. Please use `get_prune_iterations` to get the pruning iterations, and invoke `prune_iteration_start` at the beginning of each iteration. `epoch_num` should be large enough for model convergence, because the hypothesis is that the performance (accuracy) obtained in later rounds with high sparsity can be comparable with that obtained in the first round.
*Tensorflow version will be supported later.*
...
* **prune_iterations:** The number of rounds of iterative pruning, i.e., how many times the model is pruned iteratively.
* **sparsity:** The final sparsity when the compression is done.
### Reproduced Experiment
We try to reproduce the experiment result of the fully connected network on MNIST using the same configuration as in the paper. The code can be found [here](https://github.com/microsoft/nni/tree/master/examples/model_compress/lottery_torch_mnist_fc.py). In this experiment, we prune the model 10 times; after each pruning, we train the pruned model for 50 epochs.

The above figure shows the result of the fully connected network. `round0-sparsity-0.0` is the performance without pruning. Consistent with the paper, pruning around 80% of the weights obtains performance similar to no pruning, and converges a little faster. If we prune too much, e.g., more than 94%, the accuracy drops and convergence becomes a little slower. Our results differ slightly from the paper: the trend in the paper's data is somewhat clearer than in ours.
***
## Slim Pruner
...
- **sparsity:** This is to specify the sparsity that operations will be compressed to
- **op_types:** Only BatchNorm2d is supported in Slim Pruner
### Reproduced Experiment
We implemented one of the experiments in ['Learning Efficient Convolutional Networks through Network Slimming'](https://arxiv.org/pdf/1708.06519.pdf). We pruned $70\%$ of the channels in the **VGGNet** for CIFAR-10 from the paper, in which $88.5\%$ of the parameters are pruned. Our experiment results are as follows:
| Model | Error(paper/ours) | Parameters | Pruned |
The experiment code can be found at [examples/model_compress](https://github.com/microsoft/nni/tree/master/examples/model_compress/).
***
## WeightRankFilterPruner
WeightRankFilterPruner is a series of pruners which prune the filters with the smallest importance criterion calculated from the weights in convolution layers, to achieve a preset level of network sparsity.
...
### L1Filter Pruner
This is a one-shot pruner proposed in ['PRUNING FILTERS FOR EFFICIENT CONVNETS'](https://arxiv.org/abs/1608.08710) by Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet and Hans Peter Graf.


...
- **sparsity:** This is to specify the sparsity that operations will be compressed to
- **op_types:** Only Conv1d and Conv2d are supported in L1Filter Pruner
#### Reproduced Experiment
We implemented one of the experiments in ['PRUNING FILTERS FOR EFFICIENT CONVNETS'](https://arxiv.org/abs/1608.08710) with **L1FilterPruner**. We pruned **VGG-16** for CIFAR-10 to **VGG-16-pruned-A** from the paper, in which $64\%$ of the parameters are pruned. Our experiment results are as follows:
| Model | Error(paper/ours) | Parameters | Pruned |
The experiment code can be found at [examples/model_compress](https://github.com/microsoft/nni/tree/master/examples/model_compress/).
***
### L2Filter Pruner
...
- **sparsity:** This is to specify the sparsity that operations will be compressed to
- **op_types:** Only Conv1d and Conv2d are supported in L2Filter Pruner
***
## ActivationRankFilterPruner
ActivationRankFilterPruner is a series of pruners which prune the filters with the smallest importance criterion calculated from the output activations of convolution layers to achieve a preset level of network sparsity.
...
- **sparsity:** The percentage of convolutional filters to be pruned.
- **op_types:** Only Conv2d is supported in ActivationMeanRankFilterPruner.
We provide a Naive Quantizer that quantizes weights to 8 bits by default; you can use it to test a quantization algorithm without any configuration.
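A minimal PyTorch usage sketch is shown below; treat the exact call signature as an assumption and check the linked example code of your NNI version:

```python
from nni.compression.torch import NaiveQuantizer

# `model` is your defined PyTorch model; weights are quantized to 8 bits by default
model = NaiveQuantizer(model).compress()
```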
...
You can view the example for more information.
#### User configuration for QAT Quantizer
Common configuration needed by compression algorithms can be found in the [Specification of `config_list`](./QuickStart.md).
Configuration needed by this algorithm:
...
state where activation quantization ranges do not exclude a significant fraction of values, default value is 0
### Note
Batch normalization folding is currently not supported.
***
## DoReFa Quantizer
In [DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients](https://arxiv.org/abs/1606.06160), authors Shuchang Zhou and Yuxin Wu provide an algorithm named DoReFa to quantize the weight, activation and gradients with training.
### Usage
To use DoReFa Quantizer, you can add the code below before your training code.
PyTorch code
...
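The elided PyTorch snippet follows the usual quantizer pattern; a sketch (the configuration values are illustrative, and the import path is assumed from NNI's PyTorch package):

```python
from nni.compression.torch import DoReFaQuantizer

config_list = [{'quant_types': ['weight'], 'quant_bits': 8, 'op_types': ['default']}]
quantizer = DoReFaQuantizer(model, config_list)
quantizer.compress()
```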
You can view the example for more information.
#### User configuration for DoReFa Quantizer
Common configuration needed by compression algorithms can be found in the [Specification of `config_list`](./QuickStart.md).
Configuration needed by this algorithm:
***
## BNN Quantizer
In [Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1](https://arxiv.org/abs/1602.02830),
>We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time. At training-time the binary weights and activations are used for computing the parameters gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power-efficiency.
...
You can view example [examples/model_compress/BNN_quantizer_cifar10.py](https://github.com/microsoft/nni/tree/master/examples/model_compress/BNN_quantizer_cifar10.py) for more information.
#### User configuration for BNN Quantizer
Common configuration needed by compression algorithms can be found in the [Specification of `config_list`](./QuickStart.md).
Configuration needed by this algorithm:
### Experiment
We implemented one of the experiments in [Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1](https://arxiv.org/abs/1602.02830); we quantized the **VGGNet** for CIFAR-10 from the paper. Our experiment results are as follows:
```eval_rst
.. contents::
```
In this tutorial, we use the [first section](#quick-start-to-compress-a-model) to quickly go through the usage of model compression on NNI. Then we use the [second section](#detailed-usage-guide) to explain more details of the usage.
## Quick Start to Compress a Model
NNI provides very simple APIs for compressing a model. The compression includes pruning algorithms and quantization algorithms. Their usage is the same; thus, here we use [slim pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#slim-pruner) as an example to show the usage.
### Write configuration
Write a configuration to specify the layers that you want to prune. The following configuration means pruning all the `BatchNorm2d`s to sparsity 0.7 while keeping other layers unpruned.
...
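The elided configuration corresponds to something like the sketch below (the values are taken from the description above):

```python
configure_list = [{
    'sparsity': 0.7,
    'op_types': ['BatchNorm2d'],
}]
```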
The specification of configuration can be found [here](#specification-of-config-list). Note that different pruners may have their own defined fields in configuration, for example `start_epoch` in AGP pruner. Please refer to each pruner's [usage](./Pruner.md) for details, and adjust the configuration accordingly.
### Choose a compression algorithm
Choose a pruner to prune your model. First instantiate the chosen pruner with your model and configuration as arguments, then invoke `compress()` to compress your model.
...
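The elided snippet boils down to something like this sketch, using the slim pruner discussed above (the import path is an assumption):

```python
from nni.compression.torch import SlimPruner

pruner = SlimPruner(model, configure_list)
model = pruner.compress()
```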
Then, you can train your model using a traditional training approach (e.g., SGD); pruning is applied transparently during the training. Some pruners prune once at the beginning, so the following training can be seen as fine-tuning. Some pruners prune your model iteratively; the masks are adjusted epoch by epoch during training.
### Export compression result
After training, you get the accuracy of the pruned model. You can export the model weights to a file, and the generated masks to a file as well. Exporting an onnx model is also supported.
The complete code of the model compression examples can be found [here](https://github.com/microsoft/nni/blob/master/examples/model_compress/model_prune_torch.py).
### Speed up the model
Masks do not provide real speedup of your model. The model should be sped up based on the exported masks; thus, we provide an API to speed up your model as shown below. After invoking `apply_compression_results` on your model, your model becomes smaller with shorter inference latency.
...
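A sketch of the elided call (the mask file path is a placeholder; the import is the one shown in the original snippet):

```python
from nni.compression.torch import apply_compression_results

# 'mask.pth' is the mask file exported earlier by pruner.export_model(...)
apply_compression_results(model, 'mask.pth')
```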
You can use other compression algorithms in the `nni.compression` package. The algorithms are implemented in both PyTorch and TensorFlow (partial support on TensorFlow), under `nni.compression.torch` and `nni.compression.tensorflow` respectively. You can refer to [Pruner](./Pruner.md) and [Quantizer](./Quantizer.md) for detailed descriptions of the supported algorithms. Also, if you want to use knowledge distillation, you can refer to [KDExample](../TrialExample/KDExample.md).
A compression algorithm is first instantiated with a `config_list` passed in. The specification of this `config_list` will be described later.
The function call `pruner.compress()` modifies the user-defined model (in TensorFlow the model can be obtained with `tf.get_default_graph()`, while in PyTorch the model is the defined model class), and the model is modified with masks inserted. Then when you run the model, the masks take effect. The masks can be adjusted at runtime by the algorithms.
*Note that `pruner.compress` simply adds masks on model weights; it does not include fine-tuning logic. If users want to fine-tune the compressed model, they need to write the fine-tuning logic themselves after `pruner.compress`.*
### Specification of `config_list`
Users can specify the configuration (i.e., `config_list`) for a compression algorithm. For example, when compressing a model, users may want to specify the sparsity ratio, to specify different ratios for different types of operations, to exclude certain types of operations, or to compress only certain types of operations. For users to express these kinds of requirements, we define a configuration specification. It can be seen as a Python `list` object, where each element is a `dict` object.
The `dict`s in the `list` are applied one by one, that is, the configurations in a latter `dict` will overwrite the configurations in former ones for the operations that are within the scope of both of them.
There are different keys in a `dict`. Some of them are common keys supported by all the compression algorithms:
* __op_types__: This is to specify what types of operations are to be compressed. 'default' means following the algorithm's default setting.
* __op_names__: This is to specify by name what operations are to be compressed. If this field is omitted, operations will not be filtered by it.
* __exclude__: Default is False. If this field is True, the operations with the specified types and names will be excluded from the compression.
Some other keys are often specific to a certain algorithm; users can refer to [pruning algorithms](./Pruner.md) and [quantization algorithms](./Quantizer.md) for the keys allowed by each algorithm.
A simple example of configuration is shown below:
```python
[
    {
        'sparsity': 0.8,
        'op_types': ['default']
    },
    {
        'sparsity': 0.6,
        'op_names': ['op_name1', 'op_name2']
    },
    {
        'exclude': True,
        'op_names': ['op_name3']
    }
]
```
It means following the algorithm's default setting for compressed operations with sparsity 0.8, but for `op_name1` and `op_name2` use sparsity 0.6, and do not compress `op_name3`.
#### Quantization specific keys
**If you use quantization algorithms, you need to specify more keys. If you use pruning algorithms, you can safely skip these keys**
* __quant_types__ : list of strings.
The types of quantization you want to apply; currently 'weight', 'input' and 'output' are supported. 'weight' means applying the quantization operation to the weight parameter of modules. 'input' means applying the quantization operation to the input of the module's forward method. 'output' means applying the quantization operation to the output of the module's forward method, which is often called 'activation' in some papers.
* __quant_bits__ : int or dict of {str : int}
The bit length of quantization; the key is the quantization type and the value is the quantization bit length, e.g.
```
{
    quant_bits: {
        'weight': 8,
        'output': 4,
    },
}
```
When the value is of int type, all quantization types share the same bit length, e.g.
```
{
    quant_bits: 8, # weight or output quantization are all 8 bits
}
```
### APIs for Updating Fine Tuning Status
Some compression algorithms use epochs to control the progress of compression (e.g. [AGP](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#agp-pruner)), and some algorithms need to do something after every minibatch. Therefore, we provide another two APIs for users to invoke: `pruner.update_epoch(epoch)` and `pruner.step()`.
`update_epoch` should be invoked in every epoch, while `step` should be invoked after each minibatch. Note that most algorithms do not require calling the two APIs. Please refer to each algorithm's document for details. For the algorithms that do not need them, calling them is allowed but has no effect.
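In a typical training loop the two calls sit as follows (a sketch; `train_one_batch` is a placeholder for your own training step):

```python
for epoch in range(num_epochs):
    pruner.update_epoch(epoch)
    for data, target in train_loader:
        train_one_batch(data, target)   # forward/backward/optimizer.step of your own code
        pruner.step()
```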
### Export Compressed Model
You can easily export the compressed model using the following API if you are pruning your model. The `state_dict` of the sparse model weights will be stored in `model.pth`, which can be loaded by `torch.load('model.pth')`. In the exported `model.pth`, the masked weights are zero.
```
pruner.export_model(model_path='model.pth')
```
`mask_dict` and the pruned model in `onnx` format (`input_shape` needs to be specified) can also be exported like this:
SlimPruner is a structured pruning algorithm for pruning channels in the convolutional layers by pruning corresponding scaling factors in the later BN layers.
In ['Learning Efficient Convolutional Networks through Network Slimming'](https://arxiv.org/pdf/1708.06519.pdf), authors Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan and Changshui Zhang propose the approach summarized below.

> Slim Pruner **prunes channels in the convolution layers by masking corresponding scaling factors in the later BN layers**. L1 regularization on the scaling factors should be applied in batch normalization (BN) layers while training; scaling factors of BN layers are **globally ranked** while pruning, so the sparse model can be automatically found given the sparsity.
- **sparsity:** This is to specify the sparsity that operations will be compressed to
- **op_types:** Only BatchNorm2d is supported in Slim Pruner
## Experiment
We implemented one of the experiments in ['Learning Efficient Convolutional Networks through Network Slimming'](https://arxiv.org/pdf/1708.06519.pdf). We pruned $70\%$ of the channels in the **VGGNet** for CIFAR-10 from the paper, in which $88.5\%$ of the parameters are pruned. Our experiment results are as follows:
| Model | Error(paper/ours) | Parameters | Pruned |
L1FilterPruner is a general structured pruning algorithm for pruning filters in the convolutional layers.
In ['PRUNING FILTERS FOR EFFICIENT CONVNETS'](https://arxiv.org/abs/1608.08710), authors Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet and Hans Peter Graf describe the pruning procedure quoted below.

> L1Filter Pruner prunes filters in the **convolution layers**.
>
> The procedure of pruning $m$ filters from the $i$th convolutional layer is as follows:
>
> 1. For each filter $F_{i,j}$, calculate the sum of its absolute kernel weights $s_j=\sum_{l=1}^{n_i}\sum|K_l|$.
> 2. Sort the filters by $s_j$.
> 3. Prune $m$ filters with the smallest sum values and their corresponding feature maps. The
>    kernels in the next convolutional layer corresponding to the pruned feature maps are also
>    removed.
> 4. A new kernel matrix is created for both the $i$th and $(i+1)$th layers, and the remaining kernel
>    weights are copied to the new model.
## Experiment
We implemented one of the experiments in ['PRUNING FILTERS FOR EFFICIENT CONVNETS'](https://arxiv.org/abs/1608.08710) with **L1FilterPruner**. We pruned **VGG-16** for CIFAR-10 to **VGG-16-pruned-A** from the paper, in which $64\%$ of the parameters are pruned. Our experiment results are as follows:
| Model | Error(paper/ours) | Parameters | Pruned |