Comparison of Filter Pruning Algorithms
=======================================
To provide an initial insight into the performance of various filter pruning algorithms,
we conduct extensive experiments with various pruning algorithms on some benchmark models and datasets.
We present the experiment results in this document.
In addition, we provide friendly instructions on the re-implementation of these experiments to facilitate further contributions to this effort.
Experiment Setting
------------------
The experiments are performed with the following pruners/datasets/models:
*
Models: :githublink:`VGG16, ResNet18, ResNet50 <examples/model_compress/models/cifar10>`
*
Datasets: CIFAR-10
*
Pruners:
* The following pruners are included:
* Pruners with scheduling: ``SimulatedAnnealing Pruner``\ , ``NetAdapt Pruner``\ , ``AutoCompress Pruner``.
Given the overall sparsity requirement, these pruners can automatically generate a sparsity distribution among different layers.
* One-shot pruners: ``L1Filter Pruner``\ , ``L2Filter Pruner``\ , ``FPGM Pruner``.
The sparsity of each layer is set the same as the overall sparsity in this experiment.
*
Only **filter pruning** performances are compared here.
For the pruners with scheduling, ``L1Filter Pruner`` is used as the base algorithm. That is to say, after the sparsity distribution is decided by the scheduling algorithm, ``L1Filter Pruner`` is used to perform the actual pruning.
*
All the pruners listed above are implemented in :githublink:`nni <docs/en_US/Compression/Overview.rst>`.
Experiment Result
-----------------
For each dataset/model/pruner combination, we prune the model to different levels by setting a series of target sparsities for the pruner.
Here we plot both the **Number of Weights - Performance** curve and the **FLOPs - Performance** curve.
As a reference, we also plot the result declared in the paper `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates <http://arxiv.org/abs/1907.03141>`__ for models VGG16 and ResNet18 on CIFAR-10.
The experiment results are shown in the following figures:
CIFAR-10, VGG16:
.. image:: ../../../examples/model_compress/comparison_of_pruners/img/performance_comparison_vgg16.png
:target: ../../../examples/model_compress/comparison_of_pruners/img/performance_comparison_vgg16.png
:alt:
CIFAR-10, ResNet18:
.. image:: ../../../examples/model_compress/comparison_of_pruners/img/performance_comparison_resnet18.png
:target: ../../../examples/model_compress/comparison_of_pruners/img/performance_comparison_resnet18.png
:alt:
CIFAR-10, ResNet50:
.. image:: ../../../examples/model_compress/comparison_of_pruners/img/performance_comparison_resnet50.png
:target: ../../../examples/model_compress/comparison_of_pruners/img/performance_comparison_resnet50.png
:alt:
Analysis
--------
From the experiment result, we get the following conclusions:
* Given the constraint on the number of parameters, the pruners with scheduling ( ``AutoCompress Pruner`` , ``SimulatedAnnealing Pruner`` ) perform better than the others when the constraint is strict. However, they have no such advantage in the FLOPs/Performance comparison, since only the number-of-parameters constraint is considered in the optimization process;
* The basic algorithms ``L1Filter Pruner`` , ``L2Filter Pruner`` , and ``FPGM Pruner`` perform very similarly in these experiments;
* ``NetAdapt Pruner`` cannot achieve a very high compression rate. This is caused by its mechanism of pruning only one layer per pruning iteration, which leads to unacceptable complexity if the sparsity per iteration is much lower than the overall sparsity constraint.
Experiments Reproduction
------------------------
Implementation Details
^^^^^^^^^^^^^^^^^^^^^^
*
The experiment results are all collected with the default configuration of the pruners in nni, which means that when we call a pruner class in nni, we don't change any default class arguments.
*
Both FLOPs and the number of parameters are counted with :githublink:`Model FLOPs/Parameters Counter <docs/en_US/Compression/CompressionUtils.md#model-flopsparameters-counter>` after :githublink:`model speed up <docs/en_US/Compression/ModelSpeedup.rst>`.
This avoids potential issues of counting them on masked models.
*
The experiment code can be found :githublink:`here <examples/model_compress/auto_pruners_torch.py>`.
Experiment Result Rendering
^^^^^^^^^^^^^^^^^^^^^^^^^^^
*
If you follow the practice in the :githublink:`example <examples/model_compress/auto_pruners_torch.py>`\ , for every single pruning experiment, the experiment result will be saved in JSON format as follows:
.. code-block:: json

   {
       "performance": {"original": 0.9298, "pruned": 0.1, "speedup": 0.1, "finetuned": 0.7746},
       "params": {"original": 14987722.0, "speedup": 167089.0},
       "flops": {"original": 314018314.0, "speedup": 38589922.0}
   }
*
The experiment results are saved :githublink:`here <examples/model_compress/comparison_of_pruners>`.
You can refer to :githublink:`analyze <examples/model_compress/comparison_of_pruners/analyze.py>` to plot new performance comparison figures.
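For instance, the saved JSON results can be reloaded and plotted with a few lines of Python. The following is only a minimal sketch: the directory layout and file pattern are illustrative assumptions, and :githublink:`analyze <examples/model_compress/comparison_of_pruners/analyze.py>` contains the actual plotting logic.

.. code-block:: python

   # Minimal sketch: re-plot a Number of Weights - Performance curve from the
   # saved JSON results (file layout below is an assumption).
   import json
   import glob
   import matplotlib.pyplot as plt

   for path in glob.glob('comparison_of_pruners/*.json'):  # hypothetical layout
       with open(path) as f:
           result = json.load(f)
       remaining = result['params']['speedup'] / result['params']['original']
       plt.scatter(remaining, result['performance']['finetuned'], label=path)

   plt.xlabel('Remaining weights (fraction of original)')
   plt.ylabel('Fine-tuned accuracy')
   plt.legend()
   plt.show()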
Contribution
------------
TODO Items
^^^^^^^^^^
* Pruners constrained by FLOPs/latency
* More pruning algorithms/datasets/models
Issues
^^^^^^
For algorithm implementation & experiment issues, please `create an issue <https://github.com/microsoft/nni/issues/new/>`__.
.. role:: raw-html(raw)
:format: html
NNI review article from Zhihu: "an open source project with highly reasonable design" - by Garvin Li
========================================================================================================================
This article is by an NNI user on the Zhihu forum. In it, Garvin shares his experience of using NNI for automatic feature engineering. We think this article is very useful for users interested in using NNI for feature engineering. With the author's permission, we translated the original article into English.
**Source (原文)**\ : `如何看待微软最新发布的AutoML平台NNI? By Garvin Li <https://www.zhihu.com/question/297982959/answer/964961829?utm_source=wechat_session&utm_medium=social&utm_oi=28812108627968&from=singlemessage&isappinstalled=0>`__ (translation: "How do you view Microsoft's newly released AutoML platform NNI?")
01 Overview of AutoML
---------------------
In the author's opinion, AutoML is not only about hyperparameter optimization, but
also a process that can target various stages of the machine learning process,
including feature engineering, NAS, HPO, etc.
02 Overview of NNI
------------------
NNI (Neural Network Intelligence) is an open source AutoML toolkit from
Microsoft, to help users design and tune machine learning models, neural network
architectures, or a complex system’s parameters in an efficient and automatic
way.
Link: `https://github.com/Microsoft/nni <https://github.com/Microsoft/nni>`__
In general, most Microsoft tools share one prominent characteristic: the
design is highly reasonable (setting aside the degree of technological innovation).
NNI's AutoFeatureENG basically meets all user requirements for automatic feature
engineering, with a very reasonable underlying framework design.
03 Details of NNI-AutoFeatureENG
--------------------------------
..
The article follows the GitHub project: `https://github.com/SpongebBob/tabular_automl_NNI <https://github.com/SpongebBob/tabular_automl_NNI>`__.
New users can try AutoFeatureENG with NNI easily and efficiently. To explore the AutoFeatureENG capability, download the required files listed in the project, and then install NNI through pip.
.. image:: https://pic3.zhimg.com/v2-8886eea730cad25f5ac06ef1897cd7e4_r.jpg
:target: https://pic3.zhimg.com/v2-8886eea730cad25f5ac06ef1897cd7e4_r.jpg
:alt:
NNI treats AutoFeatureENG as a two-step task: feature generation exploration and feature selection. Feature generation exploration is mainly about feature derivation and high-order feature combination.
04 Feature Exploration
----------------------
For feature derivation, NNI offers many operations which can automatically generate new features; they are listed `as follows <https://github.com/SpongebBob/tabular_automl_NNI/blob/master/AutoFEOp.rst>`__\ :
**count**\ : Count encoding is based on replacing categories with their counts computed on the train set, also named frequency encoding.
**target**\ : Target encoding is based on encoding categorical variable values with the mean of target variable per value.
**embedding**\ : Regard features as sentences, generate vectors using *Word2Vec.*
**crosscount**\ : Count encoding on more than one dimension, similar to CTR (Click Through Rate).
**aggregate**\ : Decide the aggregation functions of the features, including min/max/mean/var.
**nunique**\ : Statistics of the number of unique values of a feature.
**histsta**\ : Statistics of feature buckets, like histogram statistics.
The search space can be defined in a **JSON file**\ : it specifies how features intersect, which two columns intersect, and how features are generated from the corresponding columns.
.. image:: https://pic1.zhimg.com/v2-3c3eeec6eea9821e067412725e5d2317_r.jpg
:target: https://pic1.zhimg.com/v2-3c3eeec6eea9821e067412725e5d2317_r.jpg
:alt:
The picture shows the procedure of defining the search space. NNI provides count encoding as a 1-order operation, as well as cross-count encoding and aggregate statistics (min, max, var, mean, median, nunique) as 2-order operations.
For example, to search for frequency-encoding (value count) features on the columns named {"C1", ..., "C26"}, we can define the search space in the following way:
.. image:: https://github.com/JSong-Jia/Pic/blob/master/images/pic%203.jpg
:target: https://github.com/JSong-Jia/Pic/blob/master/images/pic%203.jpg
:alt:
We can define a cross-frequency-encoding (value count on crossed dimensions) method on columns {"C1",...,"C26"} x {"C1",...,"C26"} in the following way:
.. image:: https://github.com/JSong-Jia/Pic/blob/master/images/pic%204.jpg
:target: https://github.com/JSong-Jia/Pic/blob/master/images/pic%204.jpg
:alt:
The purpose of exploration is to generate new features. You can use the ``nni.get_next_parameter()`` function to receive the feature candidates of one trial:

.. code-block:: python

   RECEIVED_PARAMS = nni.get_next_parameter()
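In a full trial script, this call is typically paired with ``nni.report_final_result()``. Here is a minimal sketch of that pattern; the evaluation function is a placeholder, not part of the tabular_automl_NNI project:

.. code-block:: python

   import nni

   def evaluate_features(params):
       # placeholder: train a model with the candidate features and return a metric
       return 0.0

   RECEIVED_PARAMS = nni.get_next_parameter()  # feature candidates for this trial
   score = evaluate_features(RECEIVED_PARAMS)
   nni.report_final_result(score)              # report the metric back to the tuner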
05 Feature selection
--------------------
To avoid feature explosion and overfitting, feature selection is necessary. In the feature selection of NNI-AutoFeatureENG, LightGBM (Light Gradient Boosting Machine), a gradient boosting framework developed by Microsoft, is mainly promoted.
.. image:: https://pic2.zhimg.com/v2-7bf9c6ae1303692101a911def478a172_r.jpg
:target: https://pic2.zhimg.com/v2-7bf9c6ae1303692101a911def478a172_r.jpg
:alt:
If you have used **XGBoost** or **GBDT**\ , you know that algorithms based on tree structure can easily calculate the importance of each feature for the result, so LightGBM can naturally be used for feature selection.
The issue is that the selected features might work for *GBDT* (Gradient Boosting Decision Tree), but not for linear algorithms like *LR* (Logistic Regression).
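As a minimal sketch of the selection idea described above, ranking features by LightGBM's gain-based importance might look like the following; the data here is a random placeholder, not the project's actual pipeline.

.. code-block:: python

   import lightgbm as lgb
   import numpy as np

   X = np.random.rand(1000, 26)             # placeholder feature matrix
   y = np.random.randint(0, 2, 1000)        # placeholder binary target
   booster = lgb.train({'objective': 'binary', 'verbose': -1},
                       lgb.Dataset(X, label=y), num_boost_round=50)
   # rank features by their total gain and keep the top ones
   order = np.argsort(booster.feature_importance(importance_type='gain'))[::-1]
   print('features ranked by gain:', order[:10])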
.. image:: https://pic4.zhimg.com/v2-d2f919497b0ed937acad0577f7a8df83_r.jpg
:target: https://pic4.zhimg.com/v2-d2f919497b0ed937acad0577f7a8df83_r.jpg
:alt:
06 Summary
----------
NNI's AutoFeatureENG sets a well-established standard, showing us the operation procedure and available modules, and it is highly convenient to use. However, a simple model is probably not enough for good results.
Suggestions to NNI
------------------
About exploration: considering a DNN (like xDeepFM) to extract high-order features would be better.
About selection: there could be more intelligent options, such as an automatic selection system based on downstream models.
Conclusion: NNI can offer users some design inspiration, and it is a good open source project. I suggest researchers leverage it to accelerate their AI research.
Tips: Because the scripts of open source projects are compiled based on gcc7, macOS may encounter gcc (GNU Compiler Collection) problems. The solution is as follows:

.. code-block:: bash

   brew install libomp
Use NNI on Google Colab
=======================
NNI can easily run on the Google Colab platform. However, Colab doesn't expose its public IP and ports, so by default you cannot access NNI's Web UI on Colab. To solve this, you need a reverse proxy software like ``ngrok`` or ``frp``. This tutorial will show you how to use ngrok to access NNI's Web UI on Colab.
How to Open NNI's Web UI on Google Colab
----------------------------------------
#. Install required packages and software.
.. code-block:: bash

   ! pip install nni # install nni
   ! wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip # download ngrok and unzip it
   ! unzip ngrok-stable-linux-amd64.zip
   ! mkdir -p nni_repo
   ! git clone https://github.com/microsoft/nni.git nni_repo/nni # clone NNI's official repo to get examples
#. Register an ngrok account `here <https://ngrok.com/>`__\ , then connect to your account using your authtoken.
.. code-block:: bash

   ! ./ngrok authtoken <your-authtoken>
#. Start an NNI example on a port greater than 1024, then start ngrok with the same port. If you want to use a GPU, make sure gpuNum >= 1 in config.yml. Use ``get_ipython()`` to start ngrok, since ``! ngrok http 5000 &`` would hang.
.. code-block:: bash

   ! nnictl create --config nni_repo/nni/examples/trials/mnist-pytorch/config.yml --port 5000 &
   get_ipython().system_raw('./ngrok http 5000 &')
#. Check the public url.
.. code-block:: bash

   ! curl -s http://localhost:4040/api/tunnels # don't change the port number 4040
You will see a URL like http://xxxx.ngrok.io after step 4; open this URL and you will find NNI's Web UI. Have fun :)
Access Web UI with frp
----------------------
frp is another reverse proxy software with similar functions. However, frp doesn't provide free public URLs, so you may need a server with a public IP as the frp server. See `here <https://github.com/fatedier/frp>`__ to learn more about how to deploy frp.
Neural Architecture Search Comparison
=====================================
*Posted by Anonymous Author*
Train and compare NAS (Neural Architecture Search) models, including AutoKeras, DARTS, ENAS and NAO.
Their source code links are below:
*
AutoKeras: `https://github.com/jhfjhfj1/autokeras <https://github.com/jhfjhfj1/autokeras>`__
*
DARTS: `https://github.com/quark0/darts <https://github.com/quark0/darts>`__
*
ENAS: `https://github.com/melodyguan/enas <https://github.com/melodyguan/enas>`__
*
NAO: `https://github.com/renqianluo/NAO <https://github.com/renqianluo/NAO>`__
Experiment Description
----------------------
To avoid over-fitting to **CIFAR-10**\ , we also compare the models on five other datasets: Fashion-MNIST, CIFAR-100, OUI-Adience-Age, ImageNet-10-1 (a subset of ImageNet) and ImageNet-10-2 (another subset of ImageNet). We sample a subset with 10 different labels from ImageNet to make ImageNet-10-1 or ImageNet-10-2.
.. list-table::
:header-rows: 1
:widths: auto
* - Dataset
- Training Size
- Number of Classes
- Descriptions
* - `Fashion-MNIST <https://github.com/zalandoresearch/fashion-mnist>`__
- 60,000
- 10
- T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot.
* - `CIFAR-10 <https://www.cs.toronto.edu/~kriz/cifar.html>`__
- 50,000
- 10
- Airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships and trucks.
* - `CIFAR-100 <https://www.cs.toronto.edu/~kriz/cifar.html>`__
- 50,000
- 100
- Similar to CIFAR-10 but with 100 classes and 600 images each.
* - `OUI-Adience-Age <https://talhassner.github.io/home/projects/Adience/Adience-data.html>`__
- 26,580
- 8
- 8 age groups/labels (0-2, 4-6, 8-13, 15-20, 25-32, 38-43, 48-53, 60-).
* - `ImageNet-10-1 <http://www.image-net.org/>`__
- 9,750
- 10
- Coffee mug, computer keyboard, dining table, wardrobe, lawn mower, microphone, swing, sewing machine, odometer and gas pump.
* - `ImageNet-10-2 <http://www.image-net.org/>`__
- 9,750
- 10
- Drum, banjo, whistle, grand piano, violin, organ, acoustic guitar, trombone, flute and sax.
We do not change the default fine-tuning technique in their source code. To match each task, only the code for the input image shape and the number of outputs is changed.
The search phase time for all NAS methods is **two days**\ , as is the retraining time. Average results are reported over **three repeated runs**. Our evaluation machines have one Nvidia Tesla P100 GPU, 112GB of RAM and one 2.60GHz CPU (Intel E5-2690).
NAO requires too much computing resource, so we only use NAO-WS, which provides the pipeline script.
For AutoKeras, we used version 0.2.18 because it was the latest version when we started the experiment.
NAS Performance
---------------
.. list-table::
:header-rows: 1
:widths: auto
* - NAS
- AutoKeras (%)
- ENAS (macro) (%)
- ENAS (micro) (%)
- DARTS (%)
- NAO-WS (%)
* - Fashion-MNIST
- 91.84
- 95.44
- 95.53
- **95.74**
- 95.20
* - CIFAR-10
- 75.78
- 95.68
- **96.16**
- 94.23
- 95.64
* - CIFAR-100
- 43.61
- 78.13
- 78.84
- **79.74**
- 75.75
* - OUI-Adience-Age
- 63.20
- **80.34**
- 78.55
- 76.83
- 72.96
* - ImageNet-10-1
- 61.80
- 77.07
- 79.80
- **80.48**
- 77.20
* - ImageNet-10-2
- 37.20
- 58.13
- 56.47
- 60.53
- **61.20**
Unfortunately, we could not reproduce all the results from the papers.
The best or average results reported in the papers are:
.. list-table::
:header-rows: 1
:widths: auto
* - NAS
- AutoKeras(%)
- ENAS (macro) (%)
- ENAS (micro) (%)
- DARTS (%)
- NAO-WS (%)
* - CIFAR-10
- 88.56(best)
- 96.13(best)
- 97.11(best)
- 97.17(average)
- 96.47(best)
AutoKeras has relatively poor performance across all datasets due to the random factor in its network morphism.
ENAS (macro) shows good results on OUI-Adience-Age, and ENAS (micro) shows good results on CIFAR-10.
DARTS performs well on some datasets, but we observed high variance on others; the difference among three runs can be up to 5.37% on OUI-Adience-Age and 4.36% on ImageNet-10-1.
NAO-WS shows good results on ImageNet-10-2, but it can perform very poorly on OUI-Adience-Age.
Reference
---------
#.
Jin, Haifeng, Qingquan Song, and Xia Hu. "Efficient neural architecture search with network morphism." *arXiv preprint arXiv:1806.10282* (2018).
#.
Liu, Hanxiao, Karen Simonyan, and Yiming Yang. "DARTS: Differentiable architecture search." *arXiv preprint arXiv:1806.09055* (2018).
#.
Pham, Hieu, et al. "Efficient neural architecture search via parameter sharing." International Conference on Machine Learning (2018): 4092-4101.
#.
Luo, Renqian, et al. "Neural architecture optimization." Neural Information Processing Systems (2018): 7827-7838.
.. role:: raw-html(raw)
:format: html
Parallelizing a Sequential Algorithm TPE
========================================
TPE is actually run asynchronously in order to make use of multiple compute nodes and to avoid wasting time waiting for trial evaluations to complete. For TPE, the so-called constant liar approach is used: each time a candidate point x∗ is proposed, a fake fitness evaluation y is assigned temporarily, until the evaluation completes and reports the actual loss f(x∗).
Introduction and Problems
-------------------------
Sequential Model-based Global Optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sequential Model-Based Global Optimization (SMBO) algorithms have been used in many applications where evaluation of the fitness function is expensive. In an application where the true fitness function f: X → R is costly to evaluate, model-based algorithms approximate f with a surrogate that is cheaper to evaluate. Typically the inner loop in an SMBO algorithm is the numerical optimization of this surrogate, or of some transformation of the surrogate. The point x∗ that maximizes the surrogate (or its transformation) becomes the proposal for where the true function f should be evaluated. This active-learning-like algorithm template is summarized in the figure below. SMBO algorithms differ in the criterion they optimize to obtain x∗ given a model (or surrogate) of f, and in how they model f via the observation history H.
.. image:: ../../img/parallel_tpe_search4.PNG
:target: ../../img/parallel_tpe_search4.PNG
:alt:
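To make this template concrete, here is a generic SMBO loop written as a minimal sketch, with the surrogate fitting and criterion maximization passed in as placeholder functions (e.g. a GP posterior or the TPE densities, and EI maximization):

.. code-block:: python

   # Generic SMBO loop (minimization): fit a surrogate to the history,
   # maximize the acquisition criterion to get x*, evaluate the true f, repeat.
   def smbo(f, init_points, n_iter, fit_surrogate, maximize_criterion):
       history = [(x, f(x)) for x in init_points]
       for _ in range(n_iter):
           model = fit_surrogate(history)        # e.g. GP posterior or TPE's l(x), g(x)
           x_star = maximize_criterion(model)    # e.g. the point with the highest EI
           history.append((x_star, f(x_star)))   # expensive true evaluation
       return min(history, key=lambda pair: pair[1])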
The algorithms in this work optimize the criterion of Expected Improvement (EI). Other criteria have been suggested, such as Probability of Improvement, minimizing the Conditional Entropy of the Minimizer, and the bandit-based criterion. We chose to use the EI criterion in TPE because it is intuitive and has been shown to work well in a variety of settings. Expected improvement is the expectation under some model M of f: X → R that f(x) will exceed (negatively) some threshold y∗:
.. image:: ../../img/parallel_tpe_search_ei.PNG
:target: ../../img/parallel_tpe_search_ei.PNG
:alt:
Since the calculation of p(y|x) is expensive, the TPE approach models p(y|x) via p(x|y) and p(y). TPE defines p(x|y) using two densities:
.. image:: ../../img/parallel_tpe_search_tpe.PNG
:target: ../../img/parallel_tpe_search_tpe.PNG
:alt:
where l(x) is the density formed by using the observations {x(i)} such that the corresponding loss
f(x(i)) was less than y∗, and g(x) is the density formed by using the remaining observations. The TPE algorithm depends on a y∗ that is larger than the best observed f(x), so that some points can be used to form l(x). The TPE algorithm chooses y∗ to be some quantile γ of the observed y values, so that p(y < ``y∗``\ ) = γ, but no specific model for p(y) is necessary. The tree-structured form of l and g makes it easy to draw many candidates according to l and evaluate them according to g(x)/l(x). On each iteration, the algorithm returns the candidate x∗ with the greatest EI.
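For reference, the quantities shown in the images above can be written out as follows (from the TPE paper [1]):

.. math::

   EI_{y^*}(x) = \int_{-\infty}^{y^*} (y^* - y)\, p(y \mid x)\, dy

.. math::

   p(x \mid y) =
   \begin{cases}
   \ell(x) & \text{if } y < y^* \\
   g(x) & \text{if } y \ge y^*
   \end{cases}

With :math:`\gamma = p(y < y^*)`, maximizing EI reduces to maximizing the ratio :math:`\ell(x)/g(x)`:

.. math::

   EI_{y^*}(x) \propto \left( \gamma + \frac{g(x)}{\ell(x)}\,(1 - \gamma) \right)^{-1}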
Here is a simulation of the TPE algorithm in a two-dimensional search space. The difference in background color represents different function values. It can be seen that TPE combines exploration and exploitation very well. (Black indicates the points sampled in this round, and yellow indicates the points taken in history.)
.. image:: ../../img/parallel_tpe_search1.gif
:target: ../../img/parallel_tpe_search1.gif
:alt:
**Since EI is a continuous function, the x that maximizes EI is deterministic for a given state.** As shown in the figure below, the blue triangle is the point that is most likely to be sampled in this state.
.. image:: ../../img/parallel_tpe_search_ei2.PNG
:target: ../../img/parallel_tpe_search_ei2.PNG
:alt:
TPE performs well when used sequentially, but if we use a larger concurrency, **a large number of points will be produced in the same EI state**\ ; such overly concentrated points reduce the exploration ability of the tuner and waste resources.
Here is the simulation figure when we set ``concurrency=60``\ ; this phenomenon is obvious.
.. image:: ../../img/parallel_tpe_search2.gif
:target: ../../img/parallel_tpe_search2.gif
:alt:
Research solution
-----------------
Approximated q-EI Maximization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The multi-points criterion presented below can potentially be used to deliver an additional design of experiments in one step through the resolution of the optimization problem:
.. image:: ../../img/parallel_tpe_search_qEI.PNG
:target: ../../img/parallel_tpe_search_qEI.PNG
:alt:
However, the computation of q-EI becomes intensive as q increases. After our research, there are several popular greedy strategies that approximate the result of this problem while avoiding its numerical cost.
Solution 1: Believing the OK Predictor: The KB (Kriging Believer) Heuristic Strategy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The Kriging Believer strategy replaces the conditional knowledge about the responses at the sites chosen within the last iterations by deterministic values equal to the expectation of the Kriging predictor. Keeping the same notation as previously, the strategy can be summed up as follows:
.. image:: ../../img/parallel_tpe_search_kb.PNG
:target: ../../img/parallel_tpe_search_kb.PNG
:alt:
This sequential strategy delivers a q-point design and is computationally affordable since it relies on the analytically known EI, optimized in d dimensions. However, there is a risk of failure: believing an OK (Ordinary Kriging) predictor that overshoots the observed data may lead to a sequence that gets trapped in a non-optimal region for many iterations. We now propose a second strategy that reduces this risk.
Solution 2: The CL (Constant Liar) Heuristic Strategy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Let us now consider a sequential strategy in which the metamodel is updated (still without hyperparameter re-estimation) at each iteration with a value L exogenously fixed by the user, here called a "lie". The strategy referred to as the Constant Liar consists of lying with the same value L at every iteration: maximize EI (i.e. find xn+1), update the model as if y(xn+1) = L, and so on, always with the same L ∈ R:
.. image:: ../../img/parallel_tpe_search_cl.PNG
:target: ../../img/parallel_tpe_search_cl.PNG
:alt:
L should logically be determined on the basis of the values taken by y at X. Three values, min{Y}, mean{Y}, and max{Y}, are considered here. **The larger L is, the more explorative the algorithm will be, and vice versa.**
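A minimal sketch of the Constant Liar loop is shown below, using a scikit-learn Gaussian process as a stand-in surrogate and random search to maximize EI; both are illustrative choices, not NNI's actual TPE implementation. Replacing ``lie`` with the model's own prediction at ``x_next`` would give the Kriging Believer strategy instead.

.. code-block:: python

   import numpy as np
   from scipy.stats import norm
   from sklearn.gaussian_process import GaussianProcessRegressor

   def expected_improvement(model, candidates, y_best):
       # EI(x) = E[max(y_best - y, 0)] for minimization
       mu, sigma = model.predict(candidates, return_std=True)
       sigma = np.maximum(sigma, 1e-9)
       z = (y_best - mu) / sigma
       return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

   def propose_batch_constant_liar(X, y, q, bounds, lie=None, n_cand=2000, seed=0):
       rng = np.random.RandomState(seed)
       X, y = X.copy(), y.copy()
       lie = np.mean(y) if lie is None else lie       # CL[mean] by default
       batch = []
       for _ in range(q):
           model = GaussianProcessRegressor(normalize_y=True).fit(X, y)
           cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_cand, bounds.shape[0]))
           x_next = cand[np.argmax(expected_improvement(model, cand, y.min()))]
           batch.append(x_next)
           # pretend the lie is the observed value until the real evaluation returns
           X = np.vstack([X, x_next])
           y = np.append(y, lie)
       return np.array(batch)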
We have simulated the method above. The following figure shows the result of using mean-value liars to maximize q-EI. We find that the chosen points start to be more scattered.
.. image:: ../../img/parallel_tpe_search3.gif
:target: ../../img/parallel_tpe_search3.gif
:alt:
Experiment
----------
Branin-Hoo
^^^^^^^^^^
The optimization strategies presented in the last section are now compared on the Branin-Hoo function, a classical test case in global optimization.
.. image:: ../../img/parallel_tpe_search_branin.PNG
:target: ../../img/parallel_tpe_search_branin.PNG
:alt:
The recommended values of a, b, c, r, s and t are: a = 1, b = 5.1/(4π²), c = 5/π, r = 6, s = 10 and t = 1/(8π). This function has three global minimizers: (-3.14, 12.27), (3.14, 2.27), (9.42, 2.47).
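For reference, a direct Python transcription of the Branin-Hoo function with these recommended constants:

.. code-block:: python

   import numpy as np

   def branin(x1, x2, a=1.0, b=5.1 / (4 * np.pi ** 2), c=5.0 / np.pi,
              r=6.0, s=10.0, t=1.0 / (8 * np.pi)):
       # f(x1, x2) = a(x2 - b*x1^2 + c*x1 - r)^2 + s(1 - t)cos(x1) + s
       return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1 - t) * np.cos(x1) + s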
Next is a comparison of the q-EI associated with the first q points (q ∈ [1,10]) given by the constant liar strategies (min and max), by 2000 q-point designs uniformly drawn for every q, and by 2000 q-point LHS designs taken at random for every q.
.. image:: ../../img/parallel_tpe_search_result.PNG
:target: ../../img/parallel_tpe_search_result.PNG
:alt:
As can be seen in the figure, CL[max] and CL[min] offer very good q-EI results compared to random designs, especially for small values of q.
Gaussian Mixed Model function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We also compared using parallel optimization against not using it. A two-dimensional multimodal Gaussian mixture distribution is used for the simulation; the following are our results:
.. list-table::
:header-rows: 1
:widths: auto
* -
- concurrency=80
- concurrency=60
- concurrency=40
- concurrency=20
- concurrency=10
* - Without parallel optimization
- avg = 0.4841 :raw-html:`<br>` var = 0.1953
- avg = 0.5155 :raw-html:`<br>` var = 0.2219
- avg = 0.5773 :raw-html:`<br>` var = 0.2570
- avg = 0.4680 :raw-html:`<br>` var = 0.1994
- avg = 0.2774 :raw-html:`<br>` var = 0.1217
* - With parallel optimization
- avg = 0.2132 :raw-html:`<br>` var = 0.0700
- avg = 0.2177 :raw-html:`<br>` var = 0.0796
- avg = 0.1835 :raw-html:`<br>` var = 0.0533
- avg = 0.1671 :raw-html:`<br>` var = 0.0413
- avg = 0.1918 :raw-html:`<br>` var = 0.0697
Note: The total number of samples per test is 240 (to ensure the budgets are equal). Each configuration was repeated 1000 times; each value is the average and variance of the best results over the 1000 trials.
References
----------
[1] James Bergstra, Remi Bardenet, Yoshua Bengio, Balazs Kegl. "Algorithms for Hyper-Parameter Optimization". `Link <https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf>`__
[2] Meng-Hiot Lim, Yew-Soon Ong. "Computational Intelligence in Expensive Optimization Problems". `Link <https://link.springer.com/content/pdf/10.1007%2F978-3-642-10701-6.pdf>`__
[3] M. Jordan, J. Kleinberg, B. Schölkopf. "Pattern Recognition and Machine Learning". `Link <http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf>`__
Automatically tuning SVD (NNI in Recommenders)
==============================================
In this tutorial, we first introduce a GitHub repo, `Recommenders <https://github.com/Microsoft/Recommenders>`__. It is a repository that provides examples and best practices for building recommendation systems, provided as Jupyter notebooks. It covers various models that are popular and widely deployed in recommendation systems. To provide a complete end-to-end experience, each example is presented as five key tasks, as shown below:
* `Prepare Data <https://github.com/Microsoft/Recommenders/blob/master/notebooks/01_prepare_data/README.rst>`__\ : Preparing and loading data for each recommender algorithm.
* `Model <https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/README.rst>`__\ : Building models using various classical and deep learning recommender algorithms such as Alternating Least Squares (\ `ALS <https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS>`__\ ) or eXtreme Deep Factorization Machines (\ `xDeepFM <https://arxiv.org/abs/1803.05170>`__\ ).
* `Evaluate <https://github.com/Microsoft/Recommenders/blob/master/notebooks/03_evaluate/README.rst>`__\ : Evaluating algorithms with offline metrics.
* `Model Select and Optimize <https://github.com/Microsoft/Recommenders/blob/master/notebooks/04_model_select_and_optimize/README.rst>`__\ : Tuning and optimizing hyperparameters for recommender models.
* `Operationalize <https://github.com/Microsoft/Recommenders/blob/master/notebooks/05_operationalize/README.rst>`__\ : Operationalizing models in a production environment on Azure.
The fourth task, tuning and optimizing the model's hyperparameters, is where NNI can help. To give a concrete example of NNI tuning the models in Recommenders, let's demonstrate with the model `SVD <https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/surprise_svd_deep_dive.ipynb>`__ and the Movielens100k data. There are more than 10 hyperparameters to be tuned in this model.
`This Jupyter notebook <https://github.com/Microsoft/Recommenders/blob/master/notebooks/04_model_select_and_optimize/nni_surprise_svd.ipynb>`__ provided by Recommenders is a very detailed step-by-step tutorial for this example. It uses different built-in tuning algorithms in NNI, including ``Annealing``\ , ``SMAC``\ , ``Random Search``\ , ``TPE``\ , ``Hyperband``\ , ``Metis`` and ``Evolution``. Finally, the results of the different tuning algorithms are compared. Please go through this notebook to learn how to use NNI to tune the SVD model; then you can further use NNI to tune other models in Recommenders.
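As a rough sketch of what the trial code in that notebook does, tuning Surprise's SVD with NNI looks roughly like the following; the hyperparameter names in the search space and the exact metric are assumptions here, so refer to the notebook for the authoritative version.

.. code-block:: python

   import nni
   from surprise import SVD, Dataset, accuracy
   from surprise.model_selection import train_test_split

   # e.g. {"n_factors": 100, "lr_all": 0.005, "reg_all": 0.02} (assumed names)
   params = nni.get_next_parameter()

   data = Dataset.load_builtin('ml-100k')   # Movielens100k
   trainset, testset = train_test_split(data, test_size=0.25)

   model = SVD(n_factors=params['n_factors'], lr_all=params['lr_all'],
               reg_all=params['reg_all'])
   model.fit(trainset)

   rmse = accuracy.rmse(model.test(testset), verbose=False)
   nni.report_final_result(-rmse)           # negate the error so higher is better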
Automatically tuning SPTAG with NNI
===================================
`SPTAG <https://github.com/microsoft/SPTAG>`__ (Space Partition Tree And Graph) is a library for large scale vector approximate nearest neighbor search scenario released by `Microsoft Research (MSR) <https://www.msra.cn/>`__ and `Microsoft Bing <https://www.bing.com/>`__.
This library assumes that the samples are represented as vectors and that the vectors can be compared by L2 or cosine distance. The vectors returned for a query vector are those that have the smallest L2 or cosine distance from the query vector.
SPTAG provides two methods: kd-tree and relative neighborhood graph (SPTAG-KDT), and balanced k-means tree and relative neighborhood graph (SPTAG-BKT). SPTAG-KDT is advantageous in index-building cost, and SPTAG-BKT is advantageous in search accuracy for very high-dimensional data.
In SPTAG, there are tens of parameters that can be tuned for specific scenarios or datasets. NNI is a great tool for automatically tuning those parameters. The authors of SPTAG tried NNI for auto tuning and found good-performing parameters easily; they shared the practice of tuning SPTAG with NNI in their document `here <https://github.com/microsoft/SPTAG/blob/master/docs/Parameters.rst>`__. Please refer to it for a detailed tutorial.
Automatic Model Pruning using NNI Tuners
========================================
It's convenient to implement auto model pruning with NNI compression and NNI tuners.
First, model compression with NNI
---------------------------------
You can easily compress a model with NNI compression. Taking pruning as an example, you can prune a pretrained model with ``LevelPruner`` like this:
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import LevelPruner

   config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
   pruner = LevelPruner(model, config_list)
   pruner.compress()
The 'default' op_type stands for the module types defined in :githublink:`default_layers.py <src/sdk/pynni/nni/compression/pytorch/default_layers.py>` for PyTorch.
Therefore ``{ 'sparsity': 0.8, 'op_types': ['default'] }`` means that **all layers with the specified op_types will be compressed with the same 0.8 sparsity**. When ``pruner.compress()`` is called, the model is compressed with masks; after that you can fine-tune the model as usual, and the **pruned (masked) weights won't be updated**.
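After fine-tuning, the weights and masks can be persisted via the pruner's export API; a short sketch (the file paths here are placeholders):

.. code-block:: python

   # save the pruned weights and the corresponding masks after fine-tuning
   pruner.export_model(model_path='pruned_model.pth', mask_path='mask.pth')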
Then, make this automatic
-------------------------
The previous example manually chose ``LevelPruner`` and pruned all layers with the same sparsity. This is obviously sub-optimal, because different layers may have different amounts of redundancy. Layer sparsity should be carefully tuned to minimize model performance degradation, and this can be done with NNI tuners.
The first thing we need to do is design a search space. Here we use a nested search space which covers both choosing a pruning algorithm and optimizing layer sparsity:
.. code-block:: json

   {
       "prune_method": {
           "_type": "choice",
           "_value": [
               {
                   "_name": "agp",
                   "conv0_sparsity": {
                       "_type": "uniform",
                       "_value": [0.1, 0.9]
                   },
                   "conv1_sparsity": {
                       "_type": "uniform",
                       "_value": [0.1, 0.9]
                   }
               },
               {
                   "_name": "level",
                   "conv0_sparsity": {
                       "_type": "uniform",
                       "_value": [0.1, 0.9]
                   },
                   "conv1_sparsity": {
                       "_type": "uniform",
                       "_value": [0.01, 0.9]
                   }
               }
           ]
       }
   }
Then we need to modify our code a little:
.. code-block:: python

   import nni
   from nni.algorithms.compression.pytorch.pruning import *

   params = nni.get_next_parameter()
   conv0_sparsity = params['prune_method']['conv0_sparsity']
   conv1_sparsity = params['prune_method']['conv1_sparsity']
   # these raw sparsities should be scaled if you need the total sparsity constrained
   config_list_level = [{ 'sparsity': conv0_sparsity, 'op_name': 'conv0' },
                        { 'sparsity': conv1_sparsity, 'op_name': 'conv1' }]
   config_list_agp = [{'initial_sparsity': 0, 'final_sparsity': conv0_sparsity,
                       'start_epoch': 0, 'end_epoch': 3,
                       'frequency': 1, 'op_name': 'conv0' },
                      {'initial_sparsity': 0, 'final_sparsity': conv1_sparsity,
                       'start_epoch': 0, 'end_epoch': 3,
                       'frequency': 1, 'op_name': 'conv1' }]
   PRUNERS = {'level': LevelPruner(model, config_list_level), 'agp': AGPPruner(model, config_list_agp)}
   pruner = PRUNERS[params['prune_method']['_name']]  # look up the pruner chosen by the tuner
   pruner.compress()
   ... # fine tuning
   acc = evaluate(model) # evaluation
   nni.report_final_result(acc)
Last, define our task and automatically tune pruning methods together with layer sparsities:
.. code-block:: yaml

   authorName: default
   experimentName: Auto_Compression
   trialConcurrency: 2
   maxExecDuration: 100h
   maxTrialNum: 500
   #choice: local, remote, pai
   trainingServicePlatform: local
   #choice: true, false
   useAnnotation: False
   searchSpacePath: search_space.json
   tuner:
     #choice: TPE, Random, Anneal...
     builtinTunerName: TPE
     classArgs:
       #choice: maximize, minimize
       optimize_mode: maximize
   trial:
     command: bash run_prune.sh
     codeDir: .
     gpuNum: 1
Python API Reference of Compression Utilities
=============================================
.. contents::
Sensitivity Utilities
---------------------
.. autoclass:: nni.compression.pytorch.utils.sensitivity_analysis.SensitivityAnalysis
:members:
Topology Utilities
------------------
.. autoclass:: nni.compression.pytorch.utils.shape_dependency.ChannelDependency
:members:
.. autoclass:: nni.compression.pytorch.utils.shape_dependency.GroupDependency
:members:
.. autoclass:: nni.compression.pytorch.utils.mask_conflict.CatMaskPadding
:members:
.. autoclass:: nni.compression.pytorch.utils.mask_conflict.GroupMaskConflict
:members:
.. autoclass:: nni.compression.pytorch.utils.mask_conflict.ChannelMaskConflict
:members:
Model FLOPs/Parameters Counter
------------------------------
.. autofunction:: nni.compression.pytorch.utils.counter.count_flops_params
Analysis Utils for Model Compression
====================================
.. contents::
We provide several easy-to-use tools for users to analyze their model during model compression.
Sensitivity Analysis
--------------------
First, we provide a sensitivity analysis tool (\ **SensitivityAnalysis**\ ) for users to analyze the sensitivity of each convolutional layer in their model. Specifically, SensitivityAnalysis gradually prunes each layer of the model and tests the accuracy of the model at the same time. Note that SensitivityAnalysis only prunes one layer at a time, while the other layers keep their original weights. According to the accuracies of different convolutional layers under different sparsities, we can easily find out which layers the model accuracy is more sensitive to.
Usage
^^^^^
The following code shows the basic usage of SensitivityAnalysis.
.. code-block:: python

   from nni.compression.pytorch.utils.sensitivity_analysis import SensitivityAnalysis

   def val(model):
       model.eval()
       total = 0
       correct = 0
       with torch.no_grad():
           for batchid, (data, label) in enumerate(val_loader):
               data, label = data.cuda(), label.cuda()
               out = model(data)
               _, predicted = out.max(1)
               total += data.size(0)
               correct += predicted.eq(label).sum().item()
       return correct / total

   s_analyzer = SensitivityAnalysis(model=net, val_func=val)
   sensitivity = s_analyzer.analysis(val_args=[net])
   os.makedirs(outdir)
   s_analyzer.export(os.path.join(outdir, filename))
Two key parameters of SensitivityAnalysis are ``model`` and ``val_func``. ``model`` is the neural network to be analyzed, and ``val_func`` is a validation function that returns the model accuracy/loss or another metric on the validation dataset. Because different scenarios may calculate loss/accuracy in different ways, users should prepare a function that returns the model accuracy/loss on the dataset and pass it to SensitivityAnalysis.
SensitivityAnalysis can export the sensitivity results as a CSV file; the usage is shown in the example above.
Furthermore, users can specify the sparsity values used to prune each layer via the optional parameter ``sparsities``.
.. code-block:: python

   s_analyzer = SensitivityAnalysis(model=net, val_func=val, sparsities=[0.25, 0.5, 0.75])
SensitivityAnalysis will then prune 25%, 50%, and 75% of the weights gradually for each layer and record the model's accuracy at the same time (SensitivityAnalysis only prunes one layer at a time; the other layers keep their original weights). If ``sparsities`` is not set, SensitivityAnalysis uses ``numpy.arange(0.1, 1.0, 0.1)`` as the default sparsity values.
Users can also speed up the sensitivity analysis with the ``early_stop_mode`` and ``early_stop_value`` options. By default, SensitivityAnalysis tests the accuracy under all sparsities for each layer. When ``early_stop_mode`` and ``early_stop_value`` are set, the sensitivity analysis for a layer stops as soon as the accuracy/loss meets the threshold set by ``early_stop_value``. We support four early stop modes: minimize, maximize, dropped, raised.

* minimize: The analysis stops when the validation metric returned by ``val_func`` is lower than ``early_stop_value``.
* maximize: The analysis stops when the validation metric returned by ``val_func`` is larger than ``early_stop_value``.
* dropped: The analysis stops when the validation metric has dropped by ``early_stop_value``.
* raised: The analysis stops when the validation metric has risen by ``early_stop_value``.
.. code-block:: python
s_analyzer = SensitivityAnalysis(model=net, val_func=val, sparsities=[0.25, 0.5, 0.75], early_stop_mode='dropped', early_stop_value=0.1)
If users only want to analyze several specific convolutional layers, they can specify the target conv layers via ``specified_layers`` in the ``analysis`` function. ``specified_layers`` is a list of the PyTorch module names of the conv layers. For example:
.. code-block:: python

   sensitivity = s_analyzer.analysis(val_args=[net], specified_layers=['Conv1'])
In this example, only the ``Conv1`` layer is analyzed. In addition, users can quickly and easily parallelize the analysis by launching multiple processes and assigning different conv layers of the same model to each process, as sketched below.
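A minimal sketch of such parallelization, assuming each worker builds its own copy of the model (``build_model`` and the layer names below are placeholders, and ``val`` is the validation function from the earlier example):

.. code-block:: python

   from multiprocessing import Process

   from nni.compression.pytorch.utils.sensitivity_analysis import SensitivityAnalysis

   def analyze_subset(layers, outfile):
       net = build_model()                        # placeholder: construct/load the model
       s = SensitivityAnalysis(model=net, val_func=val)
       s.analysis(val_args=[net], specified_layers=layers)
       s.export(outfile)

   # assign a different group of conv layers to each process (names are examples)
   groups = [['features.0', 'features.3'], ['features.6', 'features.8']]
   workers = [Process(target=analyze_subset, args=(g, 'sensitivity_%d.csv' % i))
              for i, g in enumerate(groups)]
   for w in workers:
       w.start()
   for w in workers:
       w.join()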
Output example
^^^^^^^^^^^^^^
The following lines are an example CSV file exported from SensitivityAnalysis. The first line consists of 'layername' followed by the list of sparsities; each sparsity value here means how much weight SensitivityAnalysis pruned from the layer. Each following line records the model accuracy when the corresponding layer is pruned under the different sparsities. Note that, due to the early_stop option, some layers may not have model accuracies/losses under all sparsities, for example when the accuracy drop has already exceeded the threshold set by the user.
.. code-block:: bash

   layername,0.05,0.1,0.2,0.3,0.4,0.5,0.7,0.85,0.95
   features.0,0.54566,0.46308,0.06978,0.0374,0.03024,0.01512,0.00866,0.00492,0.00184
   features.3,0.54878,0.51184,0.37978,0.19814,0.07178,0.02114,0.00438,0.00442,0.00142
   features.6,0.55128,0.53566,0.4887,0.4167,0.31178,0.19152,0.08612,0.01258,0.00236
   features.8,0.55696,0.54194,0.48892,0.42986,0.33048,0.2266,0.09566,0.02348,0.0056
   features.10,0.55468,0.5394,0.49576,0.4291,0.3591,0.28138,0.14256,0.05446,0.01578
Topology Analysis
-----------------
We also provide several tools for topology analysis during model compression. These tools help users compress their model better. Because of the complex topology of networks, users often need to spend a lot of effort checking whether a compression configuration is reasonable, so we provide these topology-analysis tools to reduce the burden on users.
ChannelDependency
^^^^^^^^^^^^^^^^^
Complicated models may have residual connections or concat operations. When users prune these models, they need to be careful about the channel-count dependencies between the convolutional layers in the model. Take the following residual block in resnet18 as an example. The output features of ``layer2.0.conv2`` and ``layer2.0.downsample.0`` are added together, so the numbers of output channels of ``layer2.0.conv2`` and ``layer2.0.downsample.0`` should be the same, or there may be a tensor shape conflict.
.. image:: ../../img/channel_dependency_example.jpg
:target: ../../img/channel_dependency_example.jpg
:alt:
If layers that have a channel dependency are assigned different sparsities (here we only discuss the structured pruning by L1FilterPruner/L2FilterPruner), then there will be a shape conflict between these layers. Even if the pruned model with masks works fine, the pruned model cannot be directly sped up to the final model that runs on devices, because there will be a shape conflict when the model tries to add/concat the outputs of these layers. This tool finds the layers that have channel-count dependencies, to help users better prune their model.
Usage
^^^^^
.. code-block:: python

   from nni.compression.pytorch.utils.shape_dependency import ChannelDependency

   data = torch.ones(1, 3, 224, 224).cuda()
   channel_depen = ChannelDependency(net, data)
   channel_depen.export('dependency.csv')
Output Example
^^^^^^^^^^^^^^
The following lines are the output example of torchvision.models.resnet18 exported by ChannelDependency. The layers on the same line have output-channel dependencies with each other. For example, layer1.1.conv2, conv1, and layer1.0.conv2 have output-channel dependencies with each other, which means the numbers of output channels (filters) of these three layers should be the same; otherwise, the model may have a shape conflict.
.. code-block:: bash

   Dependency Set,Convolutional Layers
   Set 1,layer1.1.conv2,layer1.0.conv2,conv1
   Set 2,layer1.0.conv1
   Set 3,layer1.1.conv1
   Set 4,layer2.0.conv1
   Set 5,layer2.1.conv2,layer2.0.conv2,layer2.0.downsample.0
   Set 6,layer2.1.conv1
   Set 7,layer3.0.conv1
   Set 8,layer3.0.downsample.0,layer3.1.conv2,layer3.0.conv2
   Set 9,layer3.1.conv1
   Set 10,layer4.0.conv1
   Set 11,layer4.0.downsample.0,layer4.1.conv2,layer4.0.conv2
   Set 12,layer4.1.conv1
MaskConflict
^^^^^^^^^^^^
When the masks of different layers in a model conflict (for example, when different sparsities are assigned to layers that have a channel dependency), we can fix the mask conflict with MaskConflict. Specifically, MaskConflict loads the masks exported by the pruners (L1FilterPruner, etc.), checks whether there is a mask conflict, and if so, sets the conflicting masks to the same value.
.. code-block:: python

   from nni.compression.pytorch.utils.mask_conflict import fix_mask_conflict

   fixed_mask = fix_mask_conflict('./resnet18_mask', net, data)
Model FLOPs/Parameters Counter
------------------------------
We provide a model counter for calculating the model FLOPs and parameters. This counter supports calculating the FLOPs/parameters of a normal model without masks, and it can also calculate the FLOPs/parameters of a model with mask wrappers, which helps users easily check model complexity during model compression on NNI. Note that, for structured pruning, we only identify the remaining filters according to the mask, without taking the pruned input channels into consideration, so the calculated FLOPs will be larger than the real number (i.e., the number calculated after Model Speedup).
We support two modes for collecting information about modules. The first mode is ``default``\ , which only collects the information of convolutional and linear layers. The second mode is ``full``\ , which also collects the information of other operations. Users can easily use the collected ``results`` for further analysis.
Usage
^^^^^
.. code-block:: python

   from nni.compression.pytorch.utils.counter import count_flops_params

   # Given input size (1, 1, 28, 28)
   flops, params, results = count_flops_params(model, (1, 1, 28, 28))

   # Given an input tensor with size (1, 1, 28, 28), and switch to full mode
   x = torch.randn(1, 1, 28, 28)
   flops, params, results = count_flops_params(model, (x,), mode='full') # tuple of tensors as input

   # Format the output size to M (i.e., 10^6)
   print(f'FLOPs: {flops/1e6:.3f}M, Params: {params/1e6:.3f}M')
   print(results)
   # {
   #     'conv': {'flops': [60], 'params': [20], 'weight_size': [(5, 3, 1, 1)], 'input_size': [(1, 3, 2, 2)], 'output_size': [(1, 5, 2, 2)], 'module_type': ['Conv2d']},
   #     'conv2': {'flops': [100], 'params': [30], 'weight_size': [(5, 5, 1, 1)], 'input_size': [(1, 5, 2, 2)], 'output_size': [(1, 5, 2, 2)], 'module_type': ['Conv2d']}
   # }
Customize New Compression Algorithm
===================================
.. contents::
In order to simplify the process of writing new compression algorithms, we have designed a simple and flexible programming interface which covers pruning and quantization. Below, we first demonstrate how to customize a new pruning algorithm and then demonstrate how to customize a new quantization algorithm.
**Important Note** To better understand how to customize new pruning/quantization algorithms, users should first understand the framework that supports various pruning algorithms in NNI. See `Framework overview of model compression </Compression/Framework.html>`__.
Customize a new pruning algorithm
---------------------------------
Implementing a new pruning algorithm requires implementing a ``weight masker`` class, which should be a subclass of ``WeightMasker``\ , and a ``pruner`` class, which should be a subclass of ``Pruner``.
An implementation of ``weight masker`` may look like this:
.. code-block:: python

   class MyMasker(WeightMasker):
       def __init__(self, model, pruner):
           super().__init__(model, pruner)
           # You can do some initialization here, such as collecting some statistics data
           # if it is necessary for your algorithm to calculate the masks.

       def calc_mask(self, sparsity, wrapper, wrapper_idx=None):
           # calculate the mask based on wrapper.weight and sparsity,
           # and anything else
           # mask = ...
           return {'weight_mask': mask}
You can refer to NNI's built-in :githublink:`weight masker <src/sdk/pynni/nni/compression/pytorch/pruning/structured_pruning.py>` implementations to implement your own weight masker.
A basic ``pruner`` looks like this:
.. code-block:: python

   class MyPruner(Pruner):
       def __init__(self, model, config_list, optimizer):
           super().__init__(model, config_list, optimizer)
           self.set_wrappers_attribute("if_calculated", False)
           # construct a weight masker instance
           self.masker = MyMasker(model, self)

       def calc_mask(self, wrapper, wrapper_idx=None):
           sparsity = wrapper.config['sparsity']
           if wrapper.if_calculated:
               # Already pruned; do not prune again, as a one-shot pruner
               return None
           else:
               # call your masker to actually calculate the mask for this layer
               masks = self.masker.calc_mask(sparsity=sparsity, wrapper=wrapper, wrapper_idx=wrapper_idx)
               wrapper.if_calculated = True
               return masks
Refer to NNI's built-in :githublink:`pruner <src/sdk/pynni/nni/compression/pytorch/pruning/one_shot.py>` implementations to implement your own pruner class.
----
Customize a new quantization algorithm
--------------------------------------
To write a new quantization algorithm, you can write a class that inherits ``nni.compression.pytorch.Quantizer``. Then, override the member functions with the logic of your algorithm. The main member function to override is ``quantize_weight``\ , which directly returns the quantized weights rather than a mask, because for quantization the quantized weights cannot be obtained by applying a mask.
.. code-block:: python

   from nni.compression.pytorch import Quantizer

   class YourQuantizer(Quantizer):
       def __init__(self, model, config_list):
           """
           We suggest you use the NNI-defined spec for config
           """
           super().__init__(model, config_list)

       def quantize_weight(self, weight, config, **kwargs):
           """
           quantizers should overload this method to quantize weight tensors.
           This method is effectively hooked to :meth:`forward` of the model.

           Parameters
           ----------
           weight : Tensor
               weight that needs to be quantized
           config : dict
               the configuration for weight quantization
           """
           # Put your code to generate `new_weight` here
           return new_weight

       def quantize_output(self, output, config, **kwargs):
           """
           quantizers should overload this method to quantize output.
           This method is effectively hooked to :meth:`forward` of the model.

           Parameters
           ----------
           output : Tensor
               output that needs to be quantized
           config : dict
               the configuration for output quantization
           """
           # Put your code to generate `new_output` here
           return new_output

       def quantize_input(self, *inputs, config, **kwargs):
           """
           quantizers should overload this method to quantize input.
           This method is effectively hooked to :meth:`forward` of the model.

           Parameters
           ----------
           inputs : Tensor
               inputs that need to be quantized
           config : dict
               the configuration for inputs quantization
           """
           # Put your code to generate `new_input` here
           return new_input

       def update_epoch(self, epoch_num):
           pass

       def step(self):
           """
           Can do some processing based on the model or the weights bound
           in the function bind_model
           """
           pass
def update_epoch(self, epoch_num):
pass
def step(self):
"""
Can do some processing based on the model or weights binded
in the func bind_model
"""
pass
Customize backward function
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sometimes it's necessary for a quantization operation to have a customized backward function, such as the `Straight-Through Estimator <https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste>`__\ ; users can customize a backward function as follows:
.. code-block:: python

   from nni.compression.pytorch.compressor import Quantizer, QuantGrad, QuantType

   class ClipGrad(QuantGrad):
       @staticmethod
       def quant_backward(tensor, grad_output, quant_type):
           """
           This method should be overridden by subclasses to provide a customized backward function;
           the default implementation is the Straight-Through Estimator.

           Parameters
           ----------
           tensor : Tensor
               input of the quantization operation
           grad_output : Tensor
               gradient of the output of the quantization operation
           quant_type : QuantType
               the type of quantization; it can be `QuantType.QUANT_INPUT`, `QuantType.QUANT_WEIGHT`, or `QuantType.QUANT_OUTPUT`,
               and you can define different behaviors for different types.

           Returns
           -------
           tensor
               gradient of the input of the quantization operation
           """
           # for the quantize_output function, set the gradient to zero if the absolute value of the tensor is larger than 1
           if quant_type == QuantType.QUANT_OUTPUT:
               grad_output[torch.abs(tensor) > 1] = 0
           return grad_output

   class YourQuantizer(Quantizer):
       def __init__(self, model, config_list):
           super().__init__(model, config_list)
           # set your customized backward function to overwrite the default backward function
           self.quant_grad = ClipGrad
If you do not customize ``QuantGrad``\ , the default backward is the Straight-Through Estimator.
*Coming Soon* ...
Dependency-aware Mode for Filter Pruning
========================================
Currently, we have several filter pruning algorithms for convolutional layers: FPGM Pruner, L1Filter Pruner, L2Filter Pruner, Activation APoZ Rank Filter Pruner, Activation Mean Rank Filter Pruner, and Taylor FO On Weight Pruner. In these filter pruning algorithms, the pruner prunes each convolutional layer separately. While pruning a convolutional layer, the algorithm quantifies the importance of each filter based on some specific rule (such as the l1-norm) and prunes the less important filters.
As the `dependency analysis utils <./CompressionUtils.md>`__ show, if the output channels of two convolutional layers (conv1, conv2) are added together, then these two conv layers have a channel dependency with each other (for more details, see `Compression Utils <./CompressionUtils.rst>`__\ ). Take the following figure as an example.
.. image:: ../../img/mask_conflict.jpg
:target: ../../img/mask_conflict.jpg
:alt:
If we prune the first 50% of output channels (filters) of conv1 and the last 50% of output channels of conv2, then although both layers have 50% of their filters pruned, the speedup module still needs to add zeros to align the output channels. In this case, we cannot harvest the speed benefit from the model pruning.
To better gain the speed benefit of model pruning, we add a dependency-aware mode to the filter pruners. In the dependency-aware mode, the pruner prunes the model based not only on the l1-norm of each filter, but also on the topology of the whole network architecture.
In the dependency-aware mode (\ ``dependency_aware`` is set to ``True``\ ), the pruner will try to prune the same output channels for the layers that have channel dependencies with each other, as shown in the following figure.
.. image:: ../../img/dependency-aware.jpg
:target: ../../img/dependency-aware.jpg
:alt:
Take the dependency-aware mode of L1Filter Pruner as an example. Specifically, the pruner calculates, for each channel, the sum of the L1 norms over all the layers in the dependency set. Obviously, the number of channels that can actually be pruned in this dependency set is determined in the end by the minimum sparsity of the layers in the set (denoted by ``min_sparsity``\ ). According to the L1-norm sum of each channel, the pruner prunes the same ``min_sparsity`` fraction of channels for all the layers. Next, the pruner additionally prunes ``sparsity`` - ``min_sparsity`` channels for each convolutional layer based on that layer's own per-channel L1 norms. For example, suppose the output channels of ``conv1`` and ``conv2`` are added together and the configured sparsities of ``conv1`` and ``conv2`` are 0.3 and 0.2 respectively. In this case, the ``dependency-aware pruner`` will:
* First, prune the same 20% of channels for ``conv1`` and ``conv2`` according to the L1-norm sums of ``conv1`` and ``conv2``.
* Second, additionally prune 10% of the channels of ``conv1`` according to the per-channel L1 norms of ``conv1``.
In addition, for convolutional layers that have more than one filter group, the ``dependency-aware pruner`` will also try to prune the same number of channels in each filter group. Overall, this pruner prunes the model according to the L1 norm of each filter while trying to meet the topological constraints (channel dependency, etc.) to improve the final speed gain after the speedup process.
In the dependency-aware mode, the pruner provides a better speed gain from model pruning.
Usage
-----
In this section, we show how to enable the dependency-aware mode for a filter pruner. Currently, only the one-shot pruners, including FPGM Pruner, L1Filter Pruner, L2Filter Pruner, Activation APoZ Rank Filter Pruner, Activation Mean Rank Filter Pruner, and Taylor FO On Weight Pruner, support the dependency-aware mode.
To enable the dependency-aware mode for ``L1FilterPruner``\ :
.. code-block:: python
import torch
from nni.algorithms.compression.pytorch.pruning import L1FilterPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
# dummy_input is necessary for the dependency_aware mode
dummy_input = torch.ones(1, 3, 224, 224).cuda()
pruner = L1FilterPruner(model, config_list, dependency_aware=True, dummy_input=dummy_input)
# for L2FilterPruner
# pruner = L2FilterPruner(model, config_list, dependency_aware=True, dummy_input=dummy_input)
# for FPGMPruner
# pruner = FPGMPruner(model, config_list, dependency_aware=True, dummy_input=dummy_input)
# for ActivationAPoZRankFilterPruner
# pruner = ActivationAPoZRankFilterPruner(model, config_list, statistics_batch_num=1, dependency_aware=True, dummy_input=dummy_input)
# for ActivationMeanRankFilterPruner
# pruner = ActivationMeanRankFilterPruner(model, config_list, statistics_batch_num=1, dependency_aware=True, dummy_input=dummy_input)
# for TaylorFOWeightFilterPruner
# pruner = TaylorFOWeightFilterPruner(model, config_list, statistics_batch_num=1, dependency_aware=True, dummy_input=dummy_input)
pruner.compress()
Evaluation
----------
To compare the performance of the pruner with and without the dependency-aware mode, we use L1FilterPruner to prune MobileNetV2 with the dependency-aware mode turned on and off. To simplify the experiment, we use uniform pruning, which means we allocate the same sparsity to all convolutional layers in the model.
We trained a MobileNetV2 model on the CIFAR-10 dataset and pruned the model based on this pretrained checkpoint. The following figure shows the accuracy and FLOPs of the models pruned by the different pruners.
.. image:: ../../img/mobilev2_l1_cifar.jpg
:target: ../../img/mobilev2_l1_cifar.jpg
:alt:
In the figure, ``Dependency-aware`` represents the L1FilterPruner with the dependency-aware mode enabled, ``L1 Filter`` is the normal ``L1FilterPruner`` without the dependency-aware mode, and ``No-Dependency`` means the pruner only prunes the layers that have no channel dependency with other layers. As shown in the figure, when the dependency-aware mode is enabled, the pruner achieves higher accuracy under the same FLOPs.
Framework overview of model compression
=======================================
.. contents::
The picture below shows the component overview of the model compression framework.
.. image:: ../../img/compressor_framework.jpg
:target: ../../img/compressor_framework.jpg
:alt:
There are three major components/classes in the NNI model compression framework: ``Compressor``\ , ``Pruner`` and ``Quantizer``. Let's look at them in detail one by one:
Compressor
----------
Compressor is the base class for pruners and quantizers. It provides a unified interface so that pruners and quantizers can be used in the same way by end users. For example, to use a pruner:
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import LevelPruner
# load a pretrained model or train a model before using a pruner
configure_list = [{
'sparsity': 0.7,
'op_types': ['Conv2d', 'Linear'],
}]
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)
pruner = LevelPruner(model, configure_list, optimizer)
model = pruner.compress()
# model is ready for pruning; now start fine-tuning the model,
# and the model will be pruned automatically during training
To use a quantizer:
.. code-block:: python
from nni.algorithms.compression.pytorch.quantization import DoReFaQuantizer
configure_list = [{
'quant_types': ['weight'],
'quant_bits': {
'weight': 8,
},
'op_types':['Conv2d', 'Linear']
}]
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)
quantizer = DoReFaQuantizer(model, configure_list, optimizer)
quantizer.compress()
View :githublink:`example code <examples/model_compress>` for more information.
The ``Compressor`` class provides some utility methods for subclasses and users:
Set wrapper attribute
^^^^^^^^^^^^^^^^^^^^^
Sometimes ``calc_mask`` must save some state data, so users can use the ``set_wrappers_attribute`` API to register attributes, just like how buffers are registered in PyTorch modules. These buffers are registered to the ``module wrapper``\ , and users can access them through the ``module wrapper``.
In the sketch below, we use ``set_wrappers_attribute`` to set a buffer ``if_calculated``\ , which is used as a flag indicating whether the mask of a layer has already been calculated.
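A minimal sketch (assuming ``get_modules_wrapper``\ , the accessor that NNI's built-in pruners use to iterate over the wrapped modules):

.. code-block:: python

    # inside a masker/pruner implementation: register a per-wrapper flag,
    # then read and update it through each module wrapper
    self.pruner.set_wrappers_attribute("if_calculated", False)
    for wrapper in self.pruner.get_modules_wrapper():
        if not wrapper.if_calculated:
            # ... compute and apply the mask for this layer once ...
            wrapper.if_calculated = True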
Collect data during forward
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sometimes users want to collect some data during the modules' forward pass, for example, the mean value of the activations. This can be done by adding a customized collector to the module.
.. code-block:: python
import torch
# WeightMasker is the base class for NNI pruning maskers; this import
# path is for NNI 2.x and may differ in other versions
from nni.algorithms.compression.pytorch.pruning.weight_masker import WeightMasker

class MyMasker(WeightMasker):
def __init__(self, model, pruner):
super().__init__(model, pruner)
# Set attribute `collected_activation` for all wrappers to store
# activations for each layer
self.pruner.set_wrappers_attribute("collected_activation", [])
self.activation = torch.nn.functional.relu
def collector(wrapper, input_, output):
# The collected activation can be accessed via each wrapper's collected_activation
# attribute
wrapper.collected_activation.append(self.activation(output.detach().cpu()))
self.pruner.hook_id = self.pruner.add_activation_collector(collector)
The collector function will be called each time the forward method runs.
Users can also remove this collector like this:
.. code-block:: python
# Save the collector identifier
collector_id = self.pruner.add_activation_collector(collector)
# When the collector is no longer needed, it can be removed using
# the saved collector identifier
self.pruner.remove_activation_collector(collector_id)
----
Pruner
------
A pruner receives ``model``\ , ``config_list`` and ``optimizer`` as arguments. It prunes the model per the ``config_list`` during the training loop by adding a hook on ``optimizer.step()``.
The Pruner class is a subclass of Compressor, so it contains everything in the Compressor class plus some components used only for pruning:
Weight masker
^^^^^^^^^^^^^
A ``weight masker`` is the implementation of a pruning algorithm; it can prune a specified layer, wrapped by a ``module wrapper``\ , with a specified sparsity.
Pruning module wrapper
^^^^^^^^^^^^^^^^^^^^^^
A ``pruning module wrapper`` is a module containing:
#. the original module
#. some buffers used by ``calc_mask``
#. a new forward method that applies masks before running the original forward method.
The reasons to use a ``module wrapper`` are:
#. some buffers are needed by ``calc_mask`` to calculate masks and these buffers should be registered in ``module wrapper`` so that the original modules are not contaminated.
#. a new ``forward`` method is needed to apply masks to weight before calling the real ``forward`` method.
Pruning hook
^^^^^^^^^^^^
A pruning hook is installed on a pruner when the pruner is constructed; it is used to call the pruner's ``calc_mask`` method when ``optimizer.step()`` is invoked.
----
Quantizer
---------
The Quantizer class is also a subclass of ``Compressor``\ . It is used to compress models by reducing the number of bits required to represent weights or activations, which can reduce computation and inference time. It contains:
Quantization module wrapper
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Each module/layer of the model to be quantized is wrapped by a quantization module wrapper, which provides a new ``forward`` method to quantize the original module's weight, input and output.
Quantization hook
^^^^^^^^^^^^^^^^^
A quantization hook is installed on a quantizer when it is constructed; it is called when ``optimizer.step()`` is invoked.
Quantization methods
^^^^^^^^^^^^^^^^^^^^
The ``Quantizer`` class provides the following methods for subclasses to implement quantization algorithms:
.. code-block:: python
class Quantizer(Compressor):
"""
Base quantizer for PyTorch quantizers
"""
def quantize_weight(self, weight, wrapper, **kwargs):
"""
Quantizers should overload this method to quantize weight.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
weight : Tensor
weight that needs to be quantized
wrapper : QuantizerModuleWrapper
the wrapper for origin module
"""
raise NotImplementedError('Quantizer must overload quantize_weight()')
def quantize_output(self, output, wrapper, **kwargs):
"""
Quantizers should overload this method to quantize output.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
output : Tensor
output that needs to be quantized
wrapper : QuantizerModuleWrapper
the wrapper for origin module
"""
raise NotImplementedError('Quantizer must overload quantize_output()')
def quantize_input(self, *inputs, wrapper, **kwargs):
"""
Quantizers should overload this method to quantize input.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
inputs : Tensor
inputs that need to be quantized
wrapper : QuantizerModuleWrapper
the wrapper for origin module
"""
raise NotImplementedError('Quantizer must overload quantize_input()')
----
Multi-GPU support
-----------------
In multi-GPU training, buffers and parameters are copied to multiple GPUs every time the ``forward`` method runs. If buffers and parameters are updated in the ``forward`` method, an ``in-place`` update is needed to ensure the update is effective.
Since ``calc_mask`` is called in the ``optimizer.step`` method, which happens after the ``forward`` method and only on one GPU, it supports multi-GPU naturally.
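For example (an illustrative fragment; the buffer here stands for any tensor registered on a module wrapper via ``set_wrappers_attribute``):

.. code-block:: python

    import torch

    buf = torch.zeros(1)   # a buffer registered on the wrapper
    # Rebinding the attribute (buf = torch.ones(1)) would create a new
    # tensor on the local replica and the update would be lost.
    # An in-place op writes into the registered buffer itself,
    # so the update is effective.
    buf.copy_(torch.ones(1))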
Speed up Masked Model
=====================
*This feature is in Beta version.*
Introduction
------------
Pruning algorithms usually use weight masks to simulate real pruning. Masks can be used
to check the model performance at a specific pruning level (or sparsity), but there is no real speedup.
Since model speedup is the ultimate goal of model pruning, we provide a tool to convert
a model into a smaller one based on user-provided masks (the masks come from the
pruning algorithms).
There are two types of pruning. One is fine-grained pruning, which does not change the shape of the weights or the input/output tensors; a sparse kernel is required to speed up a fine-grained pruned layer. The other is coarse-grained pruning (e.g., channel pruning), where the shapes of the weights and the input/output tensors usually change. To speed up this kind of pruning, there is no need for a sparse kernel; the pruned layer can simply be replaced with a smaller one. Since the support for sparse kernels in the community is still limited, we currently only support the speedup of coarse-grained pruning and leave the support of fine-grained pruning for the future.
Design and Implementation
-------------------------
To speed up a model, the pruned layers should be replaced, either with a smaller layer for a coarse-grained mask, or with a sparse kernel for a fine-grained mask. A coarse-grained mask usually changes the shape of the weights or the input/output tensors, so we should do shape inference to check whether there are other unpruned layers that should also be replaced due to the shape change. Therefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced; second, replace the modules. The first step requires the topology (i.e., connections) of the model; we use ``jit.trace`` to obtain the model graph for PyTorch.
For each module, we should prepare four functions: three for shape inference and one for module replacement. The three shape inference functions are: given the weight shape, infer the input/output shape; given the input shape, infer the weight/output shape; given the output shape, infer the weight/input shape. The module replacement function returns a newly created module which is smaller.
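As an illustration, a module replacement function for ``nn.Conv2d`` could look like the sketch below. This is a hypothetical helper that ignores grouped convolutions and other corner cases which the real implementation must handle.

.. code-block:: python

    import torch.nn as nn

    def replace_conv2d(conv, in_mask, out_mask):
        """Return a smaller Conv2d that keeps only the unmasked channels."""
        in_idx = in_mask.nonzero(as_tuple=True)[0]
        out_idx = out_mask.nonzero(as_tuple=True)[0]
        new_conv = nn.Conv2d(
            len(in_idx), len(out_idx),
            kernel_size=conv.kernel_size, stride=conv.stride,
            padding=conv.padding, bias=conv.bias is not None)
        # copy the surviving weights (and bias) into the new module
        new_conv.weight.data = conv.weight.data[out_idx][:, in_idx].clone()
        if conv.bias is not None:
            new_conv.bias.data = conv.bias.data[out_idx].clone()
        return new_conv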
Usage
-----
.. code-block:: python
import time
from nni.compression.pytorch import ModelSpeedup
# model: the model you want to speed up
# dummy_input: dummy input of the model, given to `jit.trace`
# masks_file: the mask file created by pruning algorithms
m_speedup = ModelSpeedup(model, dummy_input.to(device), masks_file)
m_speedup.speedup_model()
dummy_input = dummy_input.to(device)
start = time.time()
out = model(dummy_input)
print('elapsed time: ', time.time() - start)
For complete examples, please refer to :githublink:`the code <examples/model_compress/model_speedup.py>`.
NOTE: The current implementation supports PyTorch 1.3.1 or newer.
Limitations
-----------
Since every module requires four functions for shape inference and module replacement, implementing them all is a large amount of work, so we have only implemented the ones required by the examples. If you want to speed up your own model which is not supported by the current implementation, you are welcome to contribute.
For PyTorch, we can only replace modules; if functions in ``forward`` should be replaced, our current implementation does not work. One workaround is to make the function a PyTorch module.
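For example (a sketch of the workaround; ``Block`` is a made-up module):

.. code-block:: python

    import torch.nn as nn

    class Block(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv2d(3, 16, 3)
            # instead of calling torch.nn.functional.relu in forward,
            # register the activation as a submodule so that the traced
            # graph contains a module that can be analyzed and replaced
            self.relu = nn.ReLU()

        def forward(self, x):
            return self.relu(self.conv(x))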
Speedup Results of Examples
---------------------------
The code of these experiments can be found :githublink:`here <examples/model_compress/model_speedup.py>`.
slim pruner example
^^^^^^^^^^^^^^^^^^^
on one V100 GPU,
input tensor: ``torch.randn(64, 3, 32, 32)``
.. list-table::
:header-rows: 1
:widths: auto
* - Times
- Mask Latency
- Speedup Latency
* - 1
- 0.01197
- 0.005107
* - 2
- 0.02019
- 0.008769
* - 4
- 0.02733
- 0.014809
* - 8
- 0.04310
- 0.027441
* - 16
- 0.07731
- 0.05008
* - 32
- 0.14464
- 0.10027
fpgm pruner example
^^^^^^^^^^^^^^^^^^^
on CPU,
input tensor: ``torch.randn(64, 1, 28, 28)``\ ,
(the variance of the measured latency is too large)
.. list-table::
:header-rows: 1
:widths: auto
* - Times
- Mask Latency
- Speedup Latency
* - 1
- 0.01383
- 0.01839
* - 2
- 0.01167
- 0.003558
* - 4
- 0.01636
- 0.01088
* - 40
- 0.14412
- 0.08268
* - 40
- 1.29385
- 0.14408
* - 40
- 0.41035
- 0.46162
* - 400
- 6.29020
- 5.82143
l1filter pruner example
^^^^^^^^^^^^^^^^^^^^^^^
on one V100 GPU,
input tensor: ``torch.randn(64, 3, 32, 32)``
.. list-table::
:header-rows: 1
:widths: auto
* - Times
- Mask Latency
- Speedup Latency
* - 1
- 0.01026
- 0.003677
* - 2
- 0.01657
- 0.008161
* - 4
- 0.02458
- 0.020018
* - 8
- 0.03498
- 0.025504
* - 16
- 0.06757
- 0.047523
* - 32
- 0.10487
- 0.086442
APoZ pruner example
^^^^^^^^^^^^^^^^^^^
on one V100 GPU,
input tensor: ``torch.randn(64, 3, 32, 32)``
.. list-table::
:header-rows: 1
:widths: auto
* - Times
- Mask Latency
- Speedup Latency
* - 1
- 0.01389
- 0.004208
* - 2
- 0.01628
- 0.008310
* - 4
- 0.02521
- 0.014008
* - 8
- 0.03386
- 0.023923
* - 16
- 0.06042
- 0.046183
* - 32
- 0.12421
- 0.087113
Model Compression with NNI
==========================
.. contents::
As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for some real-time applications. Model compression can be used to address this problem.
NNI provides a model compression toolkit to help users compress and speed up their models with state-of-the-art compression algorithms and strategies. There are several core features supported by NNI model compression:
* Support many popular pruning and quantization algorithms.
* Automate model pruning and quantization process with state-of-the-art strategies and NNI's auto tuning power.
* Speed up a compressed model to reduce its inference latency and model size.
* Provide friendly and easy-to-use compression utilities for users to dive into the compression process and results.
* Concise interface for users to customize their own compression algorithms.
*Note that the interface and APIs are unified for both PyTorch and TensorFlow; currently only the PyTorch version is supported, and the TensorFlow version will be supported in the future.*
Supported Algorithms
--------------------
The algorithms include pruning algorithms and quantization algorithms.
Pruning Algorithms
^^^^^^^^^^^^^^^^^^
Pruning algorithms compress the original network by removing redundant weights or channels of layers, which can reduce model complexity and address the over-fitting issue.
.. list-table::
:header-rows: 1
:widths: auto
* - Name
- Brief Introduction of Algorithm
* - `Level Pruner </Compression/Pruner.html#level-pruner>`__
- Pruning the specified ratio of each weight tensor based on the absolute values of the weights
* - `AGP Pruner </Compression/Pruner.html#agp-pruner>`__
- Automated gradual pruning (To prune, or not to prune: exploring the efficacy of pruning for model compression) `Reference Paper <https://arxiv.org/abs/1710.01878>`__
* - `Lottery Ticket Pruner </Compression/Pruner.html#lottery-ticket-hypothesis>`__
- The pruning process used by "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". It prunes a model iteratively. `Reference Paper <https://arxiv.org/abs/1803.03635>`__
* - `FPGM Pruner </Compression/Pruner.html#fpgm-pruner>`__
- Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration `Reference Paper <https://arxiv.org/pdf/1811.00250.pdf>`__
* - `L1Filter Pruner </Compression/Pruner.html#l1filter-pruner>`__
- Pruning filters with the smallest L1 norm of weights in convolution layers (Pruning Filters for Efficient Convnets) `Reference Paper <https://arxiv.org/abs/1608.08710>`__
* - `L2Filter Pruner </Compression/Pruner.html#l2filter-pruner>`__
- Pruning filters with the smallest L2 norm of weights in convolution layers
* - `ActivationAPoZRankFilterPruner </Compression/Pruner.html#activationapozrankfilterpruner>`__
- Pruning filters based on the metric APoZ (average percentage of zeros) which measures the percentage of zeros in activations of (convolutional) layers. `Reference Paper <https://arxiv.org/abs/1607.03250>`__
* - `ActivationMeanRankFilterPruner </Compression/Pruner.html#activationmeanrankfilterpruner>`__
- Pruning the filters whose output activations have the smallest mean value
* - `Slim Pruner </Compression/Pruner.html#slim-pruner>`__
- Pruning channels in convolution layers by pruning scaling factors in BN layers (Learning Efficient Convolutional Networks through Network Slimming) `Reference Paper <https://arxiv.org/abs/1708.06519>`__
* - `TaylorFO Pruner </Compression/Pruner.html#taylorfoweightfilterpruner>`__
- Pruning filters based on the first-order Taylor expansion on weights (Importance Estimation for Neural Network Pruning) `Reference Paper <http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf>`__
* - `ADMM Pruner </Compression/Pruner.html#admm-pruner>`__
- Pruning based on ADMM optimization technique `Reference Paper <https://arxiv.org/abs/1804.03294>`__
* - `NetAdapt Pruner </Compression/Pruner.html#netadapt-pruner>`__
- Automatically simplify a pretrained network to meet the resource budget by iterative pruning `Reference Paper <https://arxiv.org/abs/1804.03230>`__
* - `SimulatedAnnealing Pruner </Compression/Pruner.html#simulatedannealing-pruner>`__
- Automatic pruning with a guided heuristic search method, Simulated Annealing algorithm `Reference Paper <https://arxiv.org/abs/1907.03141>`__
* - `AutoCompress Pruner </Compression/Pruner.html#autocompress-pruner>`__
- Automatic pruning by iteratively calling SimulatedAnnealing Pruner and ADMM Pruner `Reference Paper <https://arxiv.org/abs/1907.03141>`__
* - `AMC Pruner </Compression/Pruner.html#amc-pruner>`__
- AMC: AutoML for Model Compression and Acceleration on Mobile Devices `Reference Paper <https://arxiv.org/pdf/1802.03494.pdf>`__
You can refer to this :githublink:`benchmark <docs/en_US/CommunitySharings/ModelCompressionComparison.rst>` for the performance of these pruners on some benchmark problems.
Quantization Algorithms
^^^^^^^^^^^^^^^^^^^^^^^
Quantization algorithms compress the original network by reducing the number of bits required to represent weights or activations, which can reduce the computations and the inference time.
.. list-table::
:header-rows: 1
:widths: auto
* - Name
- Brief Introduction of Algorithm
* - `Naive Quantizer </Compression/Quantizer.html#naive-quantizer>`__
- Quantize weights to default 8 bits
* - `QAT Quantizer </Compression/Quantizer.html#qat-quantizer>`__
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. `Reference Paper <http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf>`__
* - `DoReFa Quantizer </Compression/Quantizer.html#dorefa-quantizer>`__
- DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. `Reference Paper <https://arxiv.org/abs/1606.06160>`__
* - `BNN Quantizer </Compression/Quantizer.html#bnn-quantizer>`__
- Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. `Reference Paper <https://arxiv.org/abs/1602.02830>`__
Automatic Model Compression
---------------------------
Given a targeted compression ratio, it is pretty hard to obtain the best compressed model in a one-shot manner. An automatic model compression algorithm usually needs to explore the compression space by compressing different layers with different sparsities. NNI provides such algorithms to free users from specifying the sparsity of each layer in a model. Moreover, users can leverage NNI's auto-tuning power to automatically compress a model. Detailed documentation can be found `here <./AutoPruningUsingTuners.rst>`__.
Model Speedup
-------------
The final goal of model compression is to reduce inference latency and model size. However, existing model compression algorithms mainly use simulation to check the performance (e.g., accuracy) of the compressed model: pruning algorithms use masks, and quantization algorithms still store the quantized values in float32. Given the output masks and quantization bits produced by those algorithms, NNI can really speed up the model. The detailed tutorial of Model Speedup can be found `here <./ModelSpeedup.rst>`__.
Compression Utilities
---------------------
Compression utilities include some useful tools for users to understand and analyze the model they want to compress. For example, users can check the sensitivity of each layer to pruning, and easily calculate the FLOPs and parameter size of a model. Please refer to `here <./CompressionUtils.rst>`__ for a complete list of compression utilities.
Customize Your Own Compression Algorithms
-----------------------------------------
NNI model compression provides a simple interface for users to customize a new compression algorithm. The design philosophy of the interface is to let users focus on the compression logic while hiding framework-specific implementation details. The detailed tutorial for customizing a new compression algorithm (pruning algorithm or quantization algorithm) can be found `here <./Framework.rst>`__.
Reference and Feedback
----------------------
* To `report a bug <https://github.com/microsoft/nni/issues/new?template=bug-report.rst>`__ for this feature in GitHub;
* To `file a feature or improvement request <https://github.com/microsoft/nni/issues/new?template=enhancement.rst>`__ for this feature in GitHub;
* To know more about `Feature Engineering with NNI <../FeatureEngineering/Overview.rst>`__\ ;
* To know more about `NAS with NNI <../NAS/Overview.rst>`__\ ;
* To know more about `Hyperparameter Tuning with NNI <../Tuner/BuiltinTuner.rst>`__\ ;
Supported Pruning Algorithms on NNI
===================================
We provide several pruning algorithms that support fine-grained weight pruning and structural filter pruning. **Fine-grained Pruning** generally results in unstructured models, which need specialized hardware or software to speed up the sparse network. **Filter Pruning** achieves acceleration by removing entire filters. We also provide algorithms to control the **pruning schedule**.
**Fine-grained Pruning**
* `Level Pruner <#level-pruner>`__
**Filter Pruning**
* `Slim Pruner <#slim-pruner>`__
* `FPGM Pruner <#fpgm-pruner>`__
* `L1Filter Pruner <#l1filter-pruner>`__
* `L2Filter Pruner <#l2filter-pruner>`__
* `Activation APoZ Rank Filter Pruner <#activationapozrankfilter-pruner>`__
* `Activation Mean Rank Filter Pruner <#activationmeanrankfilter-pruner>`__
* `Taylor FO On Weight Pruner <#taylorfoweightfilter-pruner>`__
**Pruning Schedule**
* `AGP Pruner <#agp-pruner>`__
* `NetAdapt Pruner <#netadapt-pruner>`__
* `SimulatedAnnealing Pruner <#simulatedannealing-pruner>`__
* `AutoCompress Pruner <#autocompress-pruner>`__
* `AMC Pruner <#amc-pruner>`__
* `Sensitivity Pruner <#sensitivity-pruner>`__
**Others**
* `ADMM Pruner <#admm-pruner>`__
* `Lottery Ticket Hypothesis <#lottery-ticket-hypothesis>`__
Level Pruner
------------
This is a basic one-shot pruner: you can set a target sparsity level (expressed as a fraction; 0.6 means 60% of the weight parameters will be pruned).
We first sort the weights in the specified layer by their absolute values, and then mask to zero the smallest-magnitude weights until the desired sparsity level is reached.
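The masking rule can be sketched as follows (an illustrative snippet, not NNI's implementation):

.. code-block:: python

    import torch

    def level_mask(weight, sparsity):
        """1/0 mask that zeroes out the smallest-magnitude weights."""
        k = int(weight.numel() * sparsity)
        if k == 0:
            return torch.ones_like(weight)
        # threshold is the k-th smallest absolute value
        threshold = weight.abs().flatten().kthvalue(k).values
        return (weight.abs() > threshold).type_as(weight)

    mask = level_mask(torch.randn(64, 128), sparsity=0.6)  # ~60% zeros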
Usage
^^^^^
TensorFlow code
.. code-block:: python
from nni.algorithms.compression.tensorflow.pruning import LevelPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
pruner = LevelPruner(model, config_list)
pruner.compress()
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import LevelPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
pruner = LevelPruner(model, config_list)
pruner.compress()
User configuration for Level Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.LevelPruner
**TensorFlow**
.. autoclass:: nni.algorithms.compression.tensorflow.pruning.LevelPruner
Slim Pruner
-----------
This is a one-shot pruner. In `'Learning Efficient Convolutional Networks through Network Slimming' <https://arxiv.org/pdf/1708.06519.pdf>`__\ , authors Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan and Changshui Zhang propose the network slimming approach described below.
.. image:: ../../img/slim_pruner.png
:target: ../../img/slim_pruner.png
:alt:
..
Slim Pruner **prunes channels in the convolution layers by masking the corresponding scaling factors in the following BN layers**\ . L1 regularization on the scaling factors should be applied while training, and the scaling factors of the BN layers are **globally ranked** while pruning, so the sparse model can be automatically found for a given sparsity.
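The global ranking can be sketched as follows (illustrative only, not taken from NNI's implementation):

.. code-block:: python

    import torch
    import torch.nn as nn

    def global_bn_threshold(model, sparsity):
        """Global threshold on |gamma| over all BatchNorm2d scaling factors."""
        gammas = torch.cat([m.weight.data.abs().flatten()
                            for m in model.modules()
                            if isinstance(m, nn.BatchNorm2d)])
        k = int(gammas.numel() * sparsity)
        # channels whose |gamma| is below this value get masked
        return gammas.sort().values[k].item()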
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import SlimPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['BatchNorm2d'] }]
pruner = SlimPruner(model, config_list)
pruner.compress()
User configuration for Slim Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.SlimPruner
Reproduced Experiment
^^^^^^^^^^^^^^^^^^^^^
We implemented one of the experiments in `'Learning Efficient Convolutional Networks through Network Slimming' <https://arxiv.org/pdf/1708.06519.pdf>`__\ ; as in the paper, we pruned $70\%$ of the channels in the **VGGNet** for CIFAR-10, in which $88.5\%$ of the parameters are pruned. Our experiment results are as follows:
.. list-table::
:header-rows: 1
:widths: auto
* - Model
- Error(paper/ours)
- Parameters
- Pruned
* - VGGNet
- 6.34/6.40
- 20.04M
-
* - Pruned-VGGNet
- 6.20/6.26
- 2.03M
- 88.5%
The experiments code can be found at :githublink:`examples/model_compress <examples/model_compress/>`
----
FPGM Pruner
-----------
This is a one-shot pruner. FPGM Pruner is an implementation of the paper `Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration <https://arxiv.org/pdf/1811.00250.pdf>`__.
FPGMPruner prunes filters with the smallest geometric median.
.. image:: ../../img/fpgm_fig1.png
:target: ../../img/fpgm_fig1.png
:alt:
..
Previous works utilized “smaller-norm-less-important” criterion to prune filters with smaller norm values in a convolutional neural network. In this paper, we analyze this norm-based criterion and point out that its effectiveness depends on two requirements that are not always met: (1) the norm deviation of the filters should be large; (2) the minimum norm of the filters should be small. To solve this problem, we propose a novel filter pruning method, namely Filter Pruning via Geometric Median (FPGM), to compress the model regardless of those two requirements. Unlike previous methods, FPGM compresses CNN models by pruning filters with redundancy, rather than those with “relatively less” importance.
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import FPGMPruner
config_list = [{
'sparsity': 0.5,
'op_types': ['Conv2d']
}]
pruner = FPGMPruner(model, config_list)
pruner.compress()
User configuration for FPGM Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.FPGMPruner
L1Filter Pruner
---------------
This is a one-shot pruner. In `'PRUNING FILTERS FOR EFFICIENT CONVNETS' <https://arxiv.org/abs/1608.08710>`__\ , authors Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet and Hans Peter Graf propose the filter pruning approach described below.
.. image:: ../../img/l1filter_pruner.png
:target: ../../img/l1filter_pruner.png
:alt:
..
L1Filter Pruner prunes filters in the **convolution layers**\ .
The procedure of pruning :math:`m` filters from the :math:`i`-th convolutional layer is as follows (a sketch of the scoring step follows the list):
#. For each filter :math:`F_{i,j}`, calculate the sum of its absolute kernel weights :math:`s_j=\sum_{l=1}^{n_i}\sum|K_l|`.
#. Sort the filters by :math:`s_j`.
#. Prune :math:`m` filters with the smallest sum values and their corresponding feature maps. The
kernels in the next convolutional layer corresponding to the pruned feature maps are also removed.
#. A new kernel matrix is created for both the :math:`i`-th and :math:`i+1`-th layers, and the remaining kernel
weights are copied to the new model.
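Step 1 can be sketched as follows (an illustrative snippet):

.. code-block:: python

    import torch

    def l1_filter_scores(conv_weight):
        # conv_weight: (out_channels, in_channels, kH, kW);
        # one score s_j per output filter
        return conv_weight.abs().sum(dim=(1, 2, 3))

    scores = l1_filter_scores(torch.randn(64, 32, 3, 3))
    m = 16
    pruned_filters = torch.argsort(scores)[:m]  # the m smallest-norm filters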
In addition, we also provide a dependency-aware mode for the L1FilterPruner. For more details about the dependency-aware mode, please refer to `dependency-aware mode <./DependencyAware.rst>`__.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import L1FilterPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
pruner = L1FilterPruner(model, config_list)
pruner.compress()
User configuration for L1Filter Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.L1FilterPruner
Reproduced Experiment
^^^^^^^^^^^^^^^^^^^^^
We implemented one of the experiments in `'PRUNING FILTERS FOR EFFICIENT CONVNETS' <https://arxiv.org/abs/1608.08710>`__ with **L1FilterPruner**\ ; as in the paper, we pruned **VGG-16** for CIFAR-10 to **VGG-16-pruned-A**\ , in which $64\%$ of the parameters are pruned. Our experiment results are as follows:
.. list-table::
:header-rows: 1
:widths: auto
* - Model
- Error(paper/ours)
- Parameters
- Pruned
* - VGG-16
- 6.75/6.49
- 1.5x10^7
-
* - VGG-16-pruned-A
- 6.60/6.47
- 5.4x10^6
- 64.0%
The experiments code can be found at :githublink:`examples/model_compress <examples/model_compress/>`
----
L2Filter Pruner
---------------
This is a structured pruning algorithm that prunes the filters with the smallest L2 norm of the weights. It is implemented as a one-shot pruner.
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import L2FilterPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
pruner = L2FilterPruner(model, config_list)
pruner.compress()
User configuration for L2Filter Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.L2FilterPruner
----
ActivationAPoZRankFilter Pruner
-------------------------------
ActivationAPoZRankFilter Pruner prunes the filters with the smallest importance according to the criterion ``APoZ``\ , which is calculated from the output activations of convolution layers, to achieve a preset level of network sparsity. The pruning criterion ``APoZ`` is explained in the paper `Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures <https://arxiv.org/abs/1607.03250>`__.
The APoZ is defined as:
.. image:: ../../img/apoz.png
:target: ../../img/apoz.png
:alt:
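In the paper's notation, where :math:`f(\cdot)` equals 1 when the condition is true, :math:`N` is the number of validation examples and :math:`M` is the size of the output feature map, the definition can be written as:

.. math::

    APoZ_c^{(i)} = \frac{\sum_{k=1}^{N}\sum_{j=1}^{M} f\left(O_{c,j}^{(i)}(k) = 0\right)}{N \times M}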
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import ActivationAPoZRankFilterPruner
config_list = [{
'sparsity': 0.5,
'op_types': ['Conv2d']
}]
pruner = ActivationAPoZRankFilterPruner(model, config_list, statistics_batch_num=1)
pruner.compress()
Note: ActivationAPoZRankFilterPruner is used to prune convolutional layers within deep neural networks, therefore the ``op_types`` field supports only convolutional layers.
You can view :githublink:`example <examples/model_compress/model_prune_torch.py>` for more information.
User configuration for ActivationAPoZRankFilter Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.ActivationAPoZRankFilterPruner
----
ActivationMeanRankFilter Pruner
-------------------------------
ActivationMeanRankFilterPruner prunes the filters with the smallest importance according to the criterion ``mean activation``\ , which is calculated from the output activations of convolution layers, to achieve a preset level of network sparsity. The pruning criterion ``mean activation`` is explained in section 2.2 of the paper `Pruning Convolutional Neural Networks for Resource Efficient Inference <https://arxiv.org/abs/1611.06440>`__. Other pruning criteria mentioned in this paper will be supported in a future release.
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import ActivationMeanRankFilterPruner
config_list = [{
'sparsity': 0.5,
'op_types': ['Conv2d']
}]
pruner = ActivationMeanRankFilterPruner(model, config_list, statistics_batch_num=1)
pruner.compress()
Note: ActivationMeanRankFilterPruner is used to prune convolutional layers within deep neural networks, therefore the ``op_types`` field supports only convolutional layers.
You can view :githublink:`example <examples/model_compress/model_prune_torch.py>` for more information.
User configuration for ActivationMeanRankFilterPruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.ActivationMeanRankFilterPruner
----
TaylorFOWeightFilter Pruner
---------------------------
TaylorFOWeightFilter Pruner prunes convolutional layers based on the estimated importance calculated from the first-order Taylor expansion on the weights, to achieve a preset level of network sparsity. The estimated importance of the filters is defined in the paper `Importance Estimation for Neural Network Pruning <http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf>`__. Other pruning criteria mentioned in this paper will be supported in a future release.
..
.. image:: ../../img/importance_estimation_sum.png
:target: ../../img/importance_estimation_sum.png
:alt:
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import TaylorFOWeightFilterPruner
config_list = [{
'sparsity': 0.5,
'op_types': ['Conv2d']
}]
pruner = TaylorFOWeightFilterPruner(model, config_list, statistics_batch_num=1)
pruner.compress()
User configuration for TaylorFOWeightFilter Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.TaylorFOWeightFilterPruner
----
AGP Pruner
----------
This is an iterative pruner. In `To prune, or not to prune: exploring the efficacy of pruning for model compression <https://arxiv.org/abs/1710.01878>`__\ , authors Michael Zhu and Suyog Gupta provide an algorithm to prune the weights gradually.
..
We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value si (usually 0) to a final sparsity value sf over a span of n pruning steps, starting at training step t0 and with pruning frequency ∆t:
.. image:: ../../img/agp_pruner.png
:target: ../../img/agp_pruner.png
:alt:
The binary weight masks are updated every :math:`\Delta t` steps as the network is trained, to gradually increase the sparsity of the network while allowing the training to recover from any pruning-induced loss in accuracy. In our experience, varying the pruning frequency :math:`\Delta t` between 100 and 1000 training steps had a negligible impact on the final model quality. Once the model achieves the target sparsity :math:`s_f`\ , the weight masks are no longer updated. The intuition behind this sparsity function is to prune the network rapidly in the initial phase, when the redundant connections are abundant, and to gradually reduce the number of weights being pruned each time as fewer weights remain in the network.
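Written out, the gradual sparsity schedule from the paper is:

.. math::

    s_t = s_f + (s_i - s_f)\left(1 - \frac{t - t_0}{n\Delta t}\right)^{3}
    \quad \text{for } t \in \{t_0,\ t_0 + \Delta t,\ \ldots,\ t_0 + n\Delta t\}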
Usage
^^^^^
You can prune all weights from 0% to 80% sparsity in 10 epochs with the code below.
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import AGPPruner
config_list = [{
'initial_sparsity': 0,
'final_sparsity': 0.8,
'start_epoch': 0,
'end_epoch': 10,
'frequency': 1,
'op_types': ['default']
}]
# load a pretrained model or train a model before using a pruner
# model = MyModel()
# model.load_state_dict(torch.load('mycheckpoint.pth'))
# AGP pruner prunes model while fine tuning the model by adding a hook on
# optimizer.step(), so an optimizer is required to prune the model.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)
pruner = AGPPruner(model, config_list, optimizer, pruning_algorithm='level')
pruner.compress()
The AGP pruner uses the ``LevelPruner`` algorithm to prune the weights by default; however, you can set the ``pruning_algorithm`` parameter to other values to use other pruning algorithms:
* ``level``\ : LevelPruner
* ``slim``\ : SlimPruner
* ``l1``\ : L1FilterPruner
* ``l2``\ : L2FilterPruner
* ``fpgm``\ : FPGMPruner
* ``taylorfo``\ : TaylorFOWeightFilterPruner
* ``apoz``\ : ActivationAPoZRankFilterPruner
* ``mean_activation``\ : ActivationMeanRankFilterPruner
You should add the code below to update the epoch number when you finish an epoch in your training code.
PyTorch code
.. code-block:: python
pruner.update_epoch(epoch)
You can view :githublink:`example <examples/model_compress/model_prune_torch.py>` for more information.
User configuration for AGP Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.AGPPruner
----
NetAdapt Pruner
---------------
NetAdapt allows a user to automatically simplify a pretrained network to meet the resource budget.
Given the overall sparsity, NetAdapt will automatically generate the sparsity distribution among the different layers by iterative pruning.
For more details, please refer to `NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications <https://arxiv.org/abs/1804.03230>`__.
.. image:: ../../img/algo_NetAdapt.png
:target: ../../img/algo_NetAdapt.png
:alt:
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import NetAdaptPruner
config_list = [{
'sparsity': 0.5,
'op_types': ['Conv2d']
}]
pruner = NetAdaptPruner(model, config_list, short_term_fine_tuner=short_term_fine_tuner, evaluator=evaluator, base_algo='l1', experiment_data_dir='./')
pruner.compress()
You can view :githublink:`example <examples/model_compress/auto_pruners_torch.py>` for more information.
User configuration for NetAdapt Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.NetAdaptPruner
SimulatedAnnealing Pruner
-------------------------
We implement a guided heuristic search method, the Simulated Annealing (SA) algorithm, with an enhancement on guided search based on prior experience.
The enhanced SA technique is based on the observation that a DNN layer with a larger number of weights often tolerates a higher degree of model compression with less impact on overall accuracy.

* Randomly initialize a pruning rate distribution (sparsities).
* While current_temperature < stop_temperature:

  #. Generate a perturbation to the current distribution.
  #. Perform a fast evaluation on the perturbed distribution.
  #. Accept the perturbation according to the performance and the acceptance probability; if not accepted, return to step 1.
  #. Cool down: current_temperature <- current_temperature * cool_down_rate.
For more details, please refer to `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates <https://arxiv.org/abs/1907.03141>`__.
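The acceptance step follows the standard simulated annealing rule; a generic sketch (not NNI's exact code):

.. code-block:: python

    import math
    import random

    def accept(perf_new, perf_old, temperature):
        # always accept improvements; accept regressions with a
        # probability that shrinks as the temperature cools down
        if perf_new >= perf_old:
            return True
        return random.random() < math.exp((perf_new - perf_old) / temperature)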
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import SimulatedAnnealingPruner
config_list = [{
'sparsity': 0.5,
'op_types': ['Conv2d']
}]
pruner = SimulatedAnnealingPruner(model, config_list, evaluator=evaluator, base_algo='l1', cool_down_rate=0.9, experiment_data_dir='./')
pruner.compress()
You can view :githublink:`example <examples/model_compress/auto_pruners_torch.py>` for more information.
User configuration for SimulatedAnnealing Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.SimulatedAnnealingPruner
AutoCompress Pruner
-------------------
In each round, AutoCompressPruner prunes the model with the same target sparsity to achieve the overall sparsity:
#. Generate a sparsity distribution using SimulatedAnnealingPruner.
#. Perform ADMM-based structured pruning to generate the pruning result for the next round, using ``speedup`` to perform real pruning.
For more details, please refer to `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates <https://arxiv.org/abs/1907.03141>`__.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import AutoCompressPruner
config_list = [{
'sparsity': 0.5,
'op_types': ['Conv2d']
}]
pruner = AutoCompressPruner(
model, config_list, trainer=trainer, evaluator=evaluator,
dummy_input=dummy_input, num_iterations=3, optimize_mode='maximize', base_algo='l1',
cool_down_rate=0.9, admm_num_iterations=30, admm_training_epochs=5, experiment_data_dir='./')
pruner.compress()
You can view :githublink:`example <examples/model_compress/auto_pruners_torch.py>` for more information.
User configuration for AutoCompress Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.AutoCompressPruner
AMC Pruner
----------
The AMC pruner leverages reinforcement learning to provide a model compression policy.
This learning-based compression policy outperforms conventional rule-based policies by achieving a higher compression ratio,
better preserving accuracy, and reducing human labor.
.. image:: ../../img/amc_pruner.jpg
:target: ../../img/amc_pruner.jpg
:alt:
For more details, please refer to `AMC: AutoML for Model Compression and Acceleration on Mobile Devices <https://arxiv.org/pdf/1802.03494.pdf>`__.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import AMCPruner
config_list = [{
'op_types': ['Conv2d', 'Linear']
}]
pruner = AMCPruner(model, config_list, evaluator, val_loader, flops_ratio=0.5)
pruner.compress()
You can view :githublink:`example <examples/model_compress/amc/>` for more information.
User configuration for AMC Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.AMCPruner
Reproduced Experiment
^^^^^^^^^^^^^^^^^^^^^
We implemented one of the experiments in `AMC: AutoML for Model Compression and Acceleration on Mobile Devices <https://arxiv.org/pdf/1802.03494.pdf>`__\ ; as in the paper, we pruned **MobileNet** to 50% FLOPs on ImageNet. Our experiment results are as follows:
.. list-table::
:header-rows: 1
:widths: auto
* - Model
- Top 1 acc.(paper/ours)
- Top 5 acc. (paper/ours)
- FLOPs
* - MobileNet
- 70.5% / 69.9%
- 89.3% / 89.1%
- 50%
The experiments code can be found at :githublink:`examples/model_compress <examples/model_compress/amc/>`
ADMM Pruner
-----------
Alternating Direction Method of Multipliers (ADMM) is a mathematical optimization technique that decomposes the original nonconvex problem into two subproblems that can be solved iteratively. In the weight pruning problem, these two subproblems are solved via 1) a gradient descent algorithm and 2) Euclidean projection, respectively.
During the process of solving these two subproblems, the weights of the original model are changed. A one-shot pruner is then applied to prune the model according to the given config list.
This solution framework applies both to non-structured and different variations of structured pruning schemes.
For more details, please refer to `A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers <https://arxiv.org/abs/1804.03294>`__.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import ADMMPruner
config_list = [{
'sparsity': 0.8,
'op_types': ['Conv2d'],
'op_names': ['conv1']
}, {
'sparsity': 0.92,
'op_types': ['Conv2d'],
'op_names': ['conv2']
}]
pruner = ADMMPruner(model, config_list, trainer=trainer, num_iterations=30, epochs=5)
pruner.compress()
You can view :githublink:`example <examples/model_compress/auto_pruners_torch.py>` for more information.
User configuration for ADMM Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.ADMMPruner
Lottery Ticket Hypothesis
-------------------------
`The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks <https://arxiv.org/abs/1803.03635>`__\ , by Jonathan Frankle and Michael Carbin, provides comprehensive measurement and analysis and articulates the *lottery ticket hypothesis*\ : dense, randomly-initialized, feed-forward networks contain subnetworks (*winning tickets*\ ) that -- when trained in isolation -- reach test accuracy comparable to the original network in a similar number of iterations.
In this paper, the authors use the following process to prune a model, called *iterative pruning*\ :
..
#. Randomly initialize a neural network :math:`f(x;\theta_0)` (where :math:`\theta_0 \sim \mathcal{D}_{\theta}`).
#. Train the network for :math:`j` iterations, arriving at parameters :math:`\theta_j`.
#. Prune :math:`p\%` of the parameters in :math:`\theta_j`, creating a mask :math:`m`.
#. Reset the remaining parameters to their values in :math:`\theta_0`, creating the winning ticket :math:`f(x;m \odot \theta_0)`.
#. Repeat steps 2, 3, and 4.
If the configured final sparsity is :math:`P` (e.g., 0.8) and there are :math:`n` rounds of iterative pruning, each round prunes :math:`1-(1-P)^{1/n}` of the weights that survived the previous round.
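For example, with a final sparsity of 0.8 over 5 rounds:

.. code-block:: python

    # per-round pruning rate for final sparsity P over n rounds
    P, n = 0.8, 5
    rate = 1 - (1 - P) ** (1 / n)  # ~0.275: prune ~27.5% of the
                                   # surviving weights in each round
    # sanity check: the surviving fraction after n rounds is 1 - P
    assert abs((1 - rate) ** n - (1 - P)) < 1e-12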
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import LotteryTicketPruner
config_list = [{
'prune_iterations': 5,
'sparsity': 0.8,
'op_types': ['default']
}]
pruner = LotteryTicketPruner(model, config_list, optimizer)
pruner.compress()
for _ in pruner.get_prune_iterations():
pruner.prune_iteration_start()
for epoch in range(epoch_num):
...
The above configuration means that there are 5 rounds of iterative pruning. As the 5 rounds are executed in the same run, LotteryTicketPruner needs ``model`` and ``optimizer`` (\ **note that ``lr_scheduler`` should also be added if used**\ ) to reset their states every time a new pruning iteration starts. Please use ``get_prune_iterations`` to get the pruning iterations and invoke ``prune_iteration_start`` at the beginning of each iteration. ``epoch_num`` should be large enough for model convergence, because the hypothesis is that the performance (accuracy) obtained in later rounds with high sparsity can be comparable with that obtained in the first round.
*Tensorflow version will be supported later.*
User configuration for LotteryTicket Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.LotteryTicketPruner
Reproduced Experiment
^^^^^^^^^^^^^^^^^^^^^
We try to reproduce the experiment result of the fully connected network on MNIST using the same configuration as in the paper. The code can be found :githublink:`here <examples/model_compress/lottery_torch_mnist_fc.py>`. In this experiment, we prune 10 times; after each pruning, we train the pruned model for 50 epochs.
.. image:: ../../img/lottery_ticket_mnist_fc.png
:target: ../../img/lottery_ticket_mnist_fc.png
:alt:
The above figure shows the result of the fully connected network. ``round0-sparsity-0.0`` is the performance without pruning. Consistent with the paper, pruning around 80% of the weights also obtains performance comparable to no pruning, and convergence is even a little faster. If we prune too much, e.g., more than 94%, the accuracy drops and convergence becomes a little slower. Slightly different from the paper, the trend in our data is less clear-cut than the trend reported in the paper.
Sensitivity Pruner
------------------
In each round, SensitivityPruner prunes the model based on each layer's sensitivity to accuracy, until the final configured sparsity of the whole model is met:
#. Analyze the sensitivity of each layer in the current state of the model.
#. Prune each layer according to the sensitivity.
For more details, please refer to `Learning both Weights and Connections for Efficient Neural Networks <https://arxiv.org/abs/1506.02626>`__.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import SensitivityPruner
config_list = [{
'sparsity': 0.5,
'op_types': ['Conv2d']
}]
pruner = SensitivityPruner(model, config_list, finetuner=fine_tuner, evaluator=evaluator)
# eval_args and finetune_args are the parameters passed to the evaluator and finetuner respectively
pruner.compress(eval_args=[model], finetune_args=[model])
User configuration for Sensitivity Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.SensitivityPruner
Supported Quantization Algorithms on NNI
========================================
Index of supported quantization algorithms
* `Naive Quantizer <#naive-quantizer>`__
* `QAT Quantizer <#qat-quantizer>`__
* `DoReFa Quantizer <#dorefa-quantizer>`__
* `BNN Quantizer <#bnn-quantizer>`__
Naive Quantizer
---------------
We provide the Naive Quantizer to quantize weights to 8 bits by default; you can use it to test quantization algorithms without any configuration.
Usage
^^^^^
PyTorch code
.. code-block:: python
model = nni.algorithms.compression.pytorch.quantization.NaiveQuantizer(model).compress()
----
QAT Quantizer
-------------
In `Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference <http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf>`__\ , authors Benoit Jacob and Skirmantas Kligys provide an algorithm to quantize the model with training.
..
We propose an approach that simulates quantization effects in the forward pass of training. Backpropagation still happens as usual, and all weights and biases are stored in floating point so that they can be easily nudged by small amounts. The forward propagation pass however simulates quantized inference as it will happen in the inference engine, by implementing in floating-point arithmetic the rounding behavior of the quantization scheme
* Weights are quantized before they are convolved with the input. If batch normalization (see [17]) is used for the layer, the batch normalization parameters are folded into the weights before quantization.
* Activations are quantized at points where they would be during inference, e.g. after the activation function is applied to a convolutional or fully connected layers output, or after a bypass connection adds or concatenates the outputs of several layers together such as in ResNets.
Usage
^^^^^
You can quantize your model to 8 bits with the code below before your training code.
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
model = Mnist()
config_list = [{
'quant_types': ['weight'],
'quant_bits': {
'weight': 8,
}, # you can just use `int` here because all `quant_types` share the same bit length; see the config for `ReLU6` below.
'op_types':['Conv2d', 'Linear']
}, {
'quant_types': ['output'],
'quant_bits': 8,
'quant_start_step': 7000,
'op_types':['ReLU6']
}]
quantizer = QAT_Quantizer(model, config_list)
quantizer.compress()
You can view the example for more information.
User configuration for QAT Quantizer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Common configuration needed by compression algorithms can be found in the `specification of config_list <./QuickStart.rst>`__.
Configuration needed by this algorithm:
* **quant_start_step:** int

  Disable quantization until the model has been run for a certain number of steps. This allows the network to enter a more stable state, where activation quantization ranges do not exclude a significant fraction of values. The default value is 0.
Note
^^^^
Batch normalization folding is currently not supported.
----
DoReFa Quantizer
----------------
In `DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients <https://arxiv.org/abs/1606.06160>`__\ , authors Shuchang Zhou and Yuxin Wu provide an algorithm named DoReFa to quantize the weight, activation and gradients with training.
Usage
^^^^^
To use the DoReFa Quantizer, you can add the code below before your training code.
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.quantization import DoReFaQuantizer
config_list = [{
'quant_types': ['weight'],
'quant_bits': 8,
'op_types': 'default'
}]
quantizer = DoReFaQuantizer(model, config_list)
quantizer.compress()
You can view the example for more information.
User configuration for DoReFa Quantizer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Common configuration needed by compression algorithms can be found in the `specification of config_list <./QuickStart.rst>`__.
----
BNN Quantizer
-------------
In `Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 <https://arxiv.org/abs/1602.02830>`__\ , the authors introduce the method as follows:
..
We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time. At training-time the binary weights and activations are used for computing the parameters gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power-efficiency.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.quantization import BNNQuantizer
model = VGG_Cifar10(num_classes=10)
configure_list = [{
'quant_bits': 1,
'quant_types': ['weight'],
'op_types': ['Conv2d', 'Linear'],
'op_names': ['features.0', 'features.3', 'features.7', 'features.10', 'features.14', 'features.17', 'classifier.0', 'classifier.3']
}, {
'quant_bits': 1,
'quant_types': ['output'],
'op_types': ['Hardtanh'],
'op_names': ['features.6', 'features.9', 'features.13', 'features.16', 'features.20', 'classifier.2', 'classifier.5']
}]
quantizer = BNNQuantizer(model, configure_list)
model = quantizer.compress()
You can view example :githublink:`examples/model_compress/BNN_quantizer_cifar10.py <examples/model_compress/BNN_quantizer_cifar10.py>` for more information.
User configuration for BNN Quantizer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Common configuration needed by compression algorithms can be found in the `specification of config_list <./QuickStart.rst>`__.
Experiment
^^^^^^^^^^
We implemented one of the experiments in `Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 <https://arxiv.org/abs/1602.02830>`__\ ; as in the paper, we quantized the **VGGNet** for CIFAR-10. Our experiment results are as follows:
.. list-table::
:header-rows: 1
:widths: auto
* - Model
- Accuracy
* - VGGNet
- 86.93%
The experiment code can be found at :githublink:`examples/model_compress/BNN_quantizer_cifar10.py <examples/model_compress/BNN_quantizer_cifar10.py>`.
Tutorial for Model Compression
==============================
.. contents::
In this tutorial, the `first section <#quick-start-to-compress-a-model>`__ quickly goes through the usage of model compression on NNI, and the `second section <#detailed-usage-guide>`__ explains the usage in more detail.
Quick Start to Compress a Model
-------------------------------
NNI provides very simple APIs for compressing a model. The compression includes pruning algorithms and quantization algorithms, and they are used in the same way; thus, here we use `slim pruner </Compression/Pruner.html#slim-pruner>`__ as an example to show the usage.
Write configuration
^^^^^^^^^^^^^^^^^^^
Write a configuration to specify the layers that you want to prune. The following configuration means pruning all the ``BatchNorm2d``\ s to sparsity 0.7 while keeping other layers unpruned.
.. code-block:: python
configure_list = [{
'sparsity': 0.7,
'op_types': ['BatchNorm2d'],
}]
The specification of configuration can be found `here <#specification-of-config-list>`__. Note that different pruners may have their own defined fields in configuration, for example ``start_epoch`` in the AGP pruner. Please refer to each pruner's `usage <./Pruner.rst>`__ for details, and adjust the configuration accordingly.
Choose a compression algorithm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Choose a pruner to prune your model. First instantiate the chosen pruner with your model and configuration as arguments, then invoke ``compress()`` to compress your model.
.. code-block:: python
    from nni.algorithms.compression.pytorch.pruning import SlimPruner

    pruner = SlimPruner(model, configure_list)
    model = pruner.compress()
Then, you can train your model using a traditional training approach (e.g., SGD); pruning is applied transparently during training. Some pruners prune once at the beginning, so the following training can be seen as fine-tuning. Other pruners prune your model iteratively, and the masks are adjusted epoch by epoch during training.
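A minimal sketch of such a fine-tuning loop is shown below; ``train`` and ``test`` stand in for your own training and evaluation routines and are not part of NNI's API:

.. code-block:: python

    import torch.optim as optim

    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    for epoch in range(10):
        # the pruner's masks are applied transparently during forward passes
        train(model, train_loader, optimizer)
        test(model, test_loader)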
Export compression result
^^^^^^^^^^^^^^^^^^^^^^^^^
After training, you get the accuracy of the pruned model. You can export the model weights to a file, and the generated masks to a file as well. Exporting an ONNX model is also supported.
.. code-block:: python
pruner.export_model(model_path='pruned_vgg19_cifar10.pth', mask_path='mask_vgg19_cifar10.pth')
The complete code of model compression examples can be found :githublink:`here <examples/model_compress/model_prune_torch.py>`.
Speed up the model
^^^^^^^^^^^^^^^^^^
Masks do not provide a real speedup of your model. The model should be sped up based on the exported masks; thus, we provide an API to speed up your model, as shown below. After invoking ``apply_compression_results`` on your model, your model becomes a smaller one with shorter inference latency.
.. code-block:: python
from nni.compression.pytorch import apply_compression_results
apply_compression_results(model, 'mask_vgg19_cifar10.pth')
Please refer to `here <ModelSpeedup.rst>`__ for a detailed description.
Detailed Usage Guide
--------------------
The example code for applying model compression to a user-defined model can be found below:
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import LevelPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
pruner = LevelPruner(model, config_list)
pruner.compress()
Tensorflow code
.. code-block:: python
    import tensorflow as tf
    from nni.algorithms.compression.tensorflow.pruning import LevelPruner

    config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
    pruner = LevelPruner(tf.get_default_graph(), config_list)
    pruner.compress()
You can use other compression algorithms in the ``nni.compression`` package. The algorithms are implemented in both PyTorch and TensorFlow (with partial support on TensorFlow), under ``nni.compression.pytorch`` and ``nni.compression.tensorflow`` respectively. You can refer to `Pruner <./Pruner.rst>`__ and `Quantizer <./Quantizer.rst>`__ for detailed descriptions of the supported algorithms. Also, if you want to use knowledge distillation, you can refer to `KDExample <../TrialExample/KDExample.rst>`__.
A compression algorithm is first instantiated with a ``config_list`` passed in. The specification of this ``config_list`` will be described later.
The function call ``pruner.compress()`` modifies the user-defined model (in TensorFlow the model can be obtained with ``tf.get_default_graph()``\ , while in PyTorch the model is the defined model class), inserting masks into it. Then when you run the model, the masks take effect. The masks can be adjusted at runtime by the algorithms.
*Note that ``pruner.compress`` simply adds masks on model weights; it does not include fine-tuning logic. If users want to fine-tune the compressed model, they need to write the fine-tuning logic themselves after ``pruner.compress``.*
Specification of ``config_list``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Users can specify the configuration (i.e., ``config_list``\ ) for a compression algorithm. For example, when compressing a model, users may want to specify the sparsity ratio, to specify different ratios for different types of operations, to exclude certain types of operations, or to compress only certain types of operations. For users to express these kinds of requirements, we define a configuration specification. It can be seen as a Python ``list`` object, where each element is a ``dict`` object.
The ``dict``\ s in the ``list`` are applied one by one; that is, the configurations in a later ``dict`` overwrite the configurations in earlier ones for the operations that are within the scope of both.
There are different keys in a ``dict``. Some of them are common keys supported by all the compression algorithms:
* **op_types**\ : This is to specify what types of operations to be compressed. 'default' means following the algorithm's default setting.
* **op_names**\ : This is to specify which operations to compress, by name. If this field is omitted, operations will not be filtered by it.
* **exclude**\ : Default is False. If this field is True, it means the operations with specified types and names will be excluded from the compression.
Some other keys are specific to a certain algorithm; users can refer to `pruning algorithms <./Pruner.rst>`__ and `quantization algorithms <./Quantizer.rst>`__ for the keys allowed by each algorithm.
A simple example of configuration is shown below:
.. code-block:: python
[
{
'sparsity': 0.8,
'op_types': ['default']
},
{
'sparsity': 0.6,
'op_names': ['op_name1', 'op_name2']
},
{
'exclude': True,
'op_names': ['op_name3']
}
]
It means following the algorithm's default setting for compressed operations with sparsity 0.8, but using sparsity 0.6 for ``op_name1`` and ``op_name2``\ , and not compressing ``op_name3``.
Quantization specific keys
^^^^^^^^^^^^^^^^^^^^^^^^^^
Besides the keys explained above, if you use quantization algorithms you need to specify more keys in ``config_list``\ , which are explained below.
* **quant_types** : list of strings.

  The types of quantization you want to apply; currently supported values are 'weight', 'input', and 'output'. 'weight' means applying quantization to the weight parameters of modules. 'input' means applying quantization to the input of a module's forward method. 'output' means applying quantization to the output of a module's forward method, which is often called 'activation' in some papers.
* **quant_bits** : int or dict of {str : int}

  The bit length of quantization; the key is the quantization type and the value is the bit length, e.g.

.. code-block:: python

    {
        'quant_bits': {
            'weight': 8,
            'output': 4,
        },
    }
When the value is of type int, all quantization types share the same bit length, e.g.

.. code-block:: python

    {
        'quant_bits': 8,  # both weight and output quantization use 8 bits
    }
The following example shows a more complete ``config_list``\ ; it uses ``op_names`` (or ``op_types``\ ) to specify the target layers along with the quantization bits for those layers.

.. code-block:: python

    configure_list = [{
        'quant_types': ['weight'],
        'quant_bits': 8,
        'op_names': ['conv1']
    }, {
        'quant_types': ['weight'],
        'quant_bits': 4,
        'quant_start_step': 0,
        'op_names': ['conv2']
    }, {
        'quant_types': ['weight'],
        'quant_bits': 3,
        'op_names': ['fc1']
    }, {
        'quant_types': ['weight'],
        'quant_bits': 2,
        'op_names': ['fc2']
    }]
In this example, ``op_names`` gives the names of the layers, and the four layers will be quantized with different ``quant_bits``.
APIs for Updating Fine Tuning Status
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Some compression algorithms use epochs to control the progress of compression (e.g. `AGP </Compression/Pruner.html#agp-pruner>`__\ ), and some algorithms need to do something after every minibatch. Therefore, we provide another two APIs for users to invoke: ``pruner.update_epoch(epoch)`` and ``pruner.step()``.
``update_epoch`` should be invoked in every epoch, while ``step`` should be invoked after each minibatch. Note that most algorithms do not require calling the two APIs. Please refer to each algorithm's document for details. For the algorithms that do not need them, calling them is allowed but has no effect.
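For the algorithms that do need them, the sketch below shows where the two calls belong; ``num_epochs``\ , ``train_loader``\ , ``criterion``\ , and ``optimizer`` are placeholders from your own training code:

.. code-block:: python

    for epoch in range(num_epochs):
        pruner.update_epoch(epoch)  # once per epoch, for epoch-driven pruners such as AGP
        for data, target in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(data), target)
            loss.backward()
            optimizer.step()
            pruner.step()  # once per minibatch, for pruners that need it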
Export Compressed Model
^^^^^^^^^^^^^^^^^^^^^^^
If you are pruning your model, you can easily export the compressed model using the following API. The ``state_dict`` of the sparse model weights will be stored in ``model.pth``\ , which can be loaded by ``torch.load('model.pth')``. In this exported ``model.pth``\ , the masked weights are zero.
.. code-block:: python

    pruner.export_model(model_path='model.pth')
The ``mask_dict``\ , as well as the pruned model in ``onnx`` format (\ ``input_shape`` needs to be specified), can also be exported like this:
.. code-block:: python
pruner.export_model(model_path='model.pth', mask_path='mask.pth', onnx_path='model.onnx', input_shape=[1, 1, 28, 28])
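As a quick sanity check, you can load the exported ``state_dict`` and verify that the masked weights are indeed zero. This is a minimal sketch; the parameter key below is hypothetical, so inspect ``state_dict.keys()`` for your own model:

.. code-block:: python

    import torch

    state_dict = torch.load('model.pth')
    # 'feature.0.weight' is a hypothetical key; use the keys of your own model
    weight = state_dict['feature.0.weight']
    print('fraction of zeroed weights:', (weight == 0).float().mean().item())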
If you want to really speed up the compressed model, please refer to `NNI model speedup <./ModelSpeedup.rst>`__ for details.
GBDTSelector
------------
GBDTSelector is based on `LightGBM <https://github.com/microsoft/LightGBM>`__\ , which is a gradient boosting framework that uses tree-based learning algorithms.
When the data is passed into the GBDT model, the model constructs the boosted trees. The feature importance comes from the scores during construction, which indicate how useful or valuable each feature was in the construction of the boosted decision trees within the model.
We could use this method as a strong baseline in Feature Selector, especially when using the GBDT model as a classifier or regressor.
For now, we support the ``importance_type`` values ``split`` and ``gain``. We will support customized ``importance_type`` in the future, which means the user will be able to define how the feature score is calculated.
Usage
^^^^^
First you need to install the dependency:
.. code-block:: bash
pip install lightgbm
Then:
.. code-block:: python

    from sklearn.model_selection import train_test_split
    from nni.feature_engineering.gbdt_selector import GBDTSelector

    # load data
    ...
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    # initialize a selector
    fgs = GBDTSelector()
    # fit data
    fgs.fit(X_train, y_train, ...)
    # get important features
    # will return the indices of the important features here
    print(fgs.get_selected_features(10))
    ...
You can also refer to the examples in ``/examples/feature_engineering/gbdt_selector/``.
**Requirement of ``fit`` FuncArgs**
*
  **X** (array-like, required) - The training input samples, with shape [n_samples, n_features].

*
  **y** (array-like, required) - The target values (class labels in classification, real numbers in regression), with shape [n_samples].

*
  **lgb_params** (dict, required) - The parameters for the LightGBM model. For details, refer to `here <https://lightgbm.readthedocs.io/en/latest/Parameters.html>`__; a sketch with example values follows this list.

*
  **eval_ratio** (float, required) - The ratio of the data size used to split the evaluation data and the training data from self.X.

*
  **early_stopping_rounds** (int, required) - The early stopping setting in LightGBM. For details, refer to `here <https://lightgbm.readthedocs.io/en/latest/Parameters.html>`__.

*
  **importance_type** (str, required) - Could be 'split' or 'gain'. 'split' means the result contains the number of times the feature is used in a model, and 'gain' means the result contains the total gain of the splits that use the feature. For details, refer to `here <https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.feature_importance>`__.

*
  **num_boost_round** (int, required) - The number of boosting rounds. For details, refer to `here <https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html#lightgbm.train>`__.
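Putting these arguments together, here is a sketch of a complete ``fit`` call; the LightGBM parameter values are illustrative choices, not recommendations:

.. code-block:: python

    # illustrative LightGBM parameters; tune them for your own data
    lgb_params = {
        'objective': 'binary',
        'num_leaves': 31,
        'learning_rate': 0.05,
    }
    fgs.fit(X_train, y_train,
            lgb_params=lgb_params,
            eval_ratio=0.3,
            early_stopping_rounds=10,
            importance_type='gain',
            num_boost_round=100)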
**Requirement of ``get_selected_features`` FuncArgs**
* **topk** (int, required) - The top-k important features you want to select.
GradientFeatureSelector
-----------------------
The algorithm in GradientFeatureSelector comes from `"Feature Gradients: Scalable Feature Selection via Discrete Relaxation" <https://arxiv.org/pdf/1908.10382.pdf>`__.

GradientFeatureSelector is a gradient-based search algorithm
for feature selection.
1) This approach extends a recent result on the estimation of
learnability in the sublinear data regime by showing that the calculation can be performed iteratively (i.e., in mini-batches) and in **linear time and space** with respect to both the number of features D and the sample size N.
2) This, along with a discrete-to-continuous relaxation of the search domain, allows for an **efficient, gradient-based** search algorithm among feature subsets for very **large datasets**.
3) Crucially, this algorithm is capable of finding **higher-order correlations** between features and targets for both the N > D and N < D regimes, as opposed to approaches that do not consider such interactions and/or only consider one regime.
Usage
^^^^^
.. code-block:: python

    from sklearn.model_selection import train_test_split
    from nni.feature_engineering.gradient_selector import FeatureGradientSelector

    # load data
    ...
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    # initialize a selector
    fgs = FeatureGradientSelector(n_features=10)
    # fit data
    fgs.fit(X_train, y_train)
    # get important features
    # will return the indices of the important features here
    print(fgs.get_selected_features())
    ...
You can also refer to the examples in ``/examples/feature_engineering/gradient_feature_selector/``.
**Parameters of class FeatureGradientSelector constructor** (see the example call after this list)
*
**order** (int, optional, default = 4) - What order of interactions to include. Higher orders may be more accurate but increase the run time. 12 is the maximum allowed order.
*
**penalty** (int, optional, default = 1) - Constant that multiplies the regularization term.
*
**n_features** (int, optional, default = None) - If None, will automatically choose number of features based on search. Otherwise, the number of top features to select.
*
**max_features** (int, optional, default = None) - If not None, will use the 'elbow method' to determine the number of features with max_features as the upper limit.
*
**learning_rate** (float, optional, default = 1e-1) - The learning rate.
*
**init** (*zero, on, off, onhigh, offhigh, or sklearn, optional, default = zero*\ ) - How to initialize the vector of scores. 'zero' is the default.
*
**n_epochs** (int, optional, default = 1) - The number of epochs to run.
*
**shuffle** (bool, optional, default = True) - Shuffle "rows" prior to an epoch.
*
**batch_size** (int, optional, default = 1000) - Number of "rows" to process at a time.
*
**target_batch_size** (int, optional, default = 1000) - Number of "rows" to accumulate gradients over. Useful when many rows will not fit into memory but are needed for accurate estimation.
*
**classification** (bool, optional, default = True) - If True, problem is classification, else regression.
*
**ordinal** (bool, optional, default = True) - If True, problem is ordinal classification. Requires classification to be True.
*
**balanced** (bool, optional, default = True) - If True, each class is weighted equally in optimization; otherwise weighting is done via the support of each class. Requires classification to be True.
*
**preprocess** (str, optional, default = 'zscore') - 'zscore', which refers to centering and normalizing the data to unit variance, or 'center', which only centers the data to 0 mean.
*
**soft_grouping** (bool, optional, default = True) - If True, groups represent features that come from the same source. Used to encourage sparsity of groups and features within groups.
*
**verbose** (int, optional, default = 0) - Controls the verbosity when fitting. Set to 0 for no printing, or to 1 or higher to print every ``verbose`` number of gradient steps.
*
**device** (str, optional, default = 'cpu') - 'cpu' to run on CPU and 'cuda' to run on GPU. Runs much faster on GPU.
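For example, here is a sketch of a constructor call combining several of these parameters; the values are illustrative, not recommendations:

.. code-block:: python

    fgs = FeatureGradientSelector(
        order=2,             # consider pairwise feature interactions
        n_features=10,       # select the top 10 features
        learning_rate=1e-1,
        n_epochs=5,
        batch_size=1000,
        classification=True,
        device='cpu',
    )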
**Requirement of ``fit`` FuncArgs**

*
  **X** (array-like, required) - The training input samples, with shape [n_samples, n_features].

*
  **y** (array-like, required) - The target values (class labels in classification, real numbers in regression), with shape [n_samples].

*
  **groups** (array-like, optional, default = None) - Groups of columns that must be selected as a unit, with shape [n_features]. E.g., [0, 0, 1, 2] specifies that the first two columns are part of one group. A sketch follows this list.
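For instance, reusing the grouping from the description above, here is a sketch for a hypothetical four-feature dataset:

.. code-block:: python

    # columns 0 and 1 come from the same source and are selected as a unit
    groups = [0, 0, 1, 2]
    fgs.fit(X_train, y_train, groups=groups)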
**Requirement of ``get_selected_features`` FuncArgs**
For now, the ``get_selected_features`` function has no parameters.