Speed up Masked Model
=====================
*This feature is in Beta version.*
Introduction
------------
Pruning algorithms usually use weight masks to simulate the real pruning. Masks can be used
to check the model performance of a specific pruning (or sparsity), but there is no real speedup.
Since model speedup is the ultimate goal of model pruning, we provide a tool for users
to convert a model into a smaller one based on the user-provided masks (the masks come from the
pruning algorithms).
There are two types of pruning. One is fine-grained pruning, which does not change the shape of weights or input/output tensors; a sparse kernel is required to speed up a fine-grained pruned layer. The other is coarse-grained pruning (e.g., channels), where the shapes of weights and input/output tensors usually change. To speed up this kind of pruning, there is no need for a sparse kernel: the pruned layer can simply be replaced with a smaller one. Since the support for sparse kernels in the community is limited, we currently only support the speedup of coarse-grained pruning and leave the support of fine-grained pruning for the future.
Design and Implementation
-------------------------
To speed up a model, the pruned layers should be replaced: with a smaller layer for a coarse-grained mask, or with a sparse kernel for a fine-grained mask. A coarse-grained mask usually changes the shape of weights or input/output tensors, so we should do shape inference to check whether other unpruned layers should also be replaced due to the shape change. Therefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced; second, replace the modules. The first step requires the topology (i.e., connections) of the model; we use ``jit.trace`` to obtain the model graph for PyTorch.
For each module, we should prepare four functions: three for shape inference and one for module replacement. The three shape inference functions are: given the weight shape, infer the input/output shape; given the input shape, infer the weight/output shape; given the output shape, infer the weight/input shape. The module replacement function returns a newly created module which is smaller.
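As an illustration of the module replacement function, the sketch below shows what a replacement for ``Conv2d`` might look like. This is a minimal sketch under assumed 1-D channel masks, not NNI's actual implementation; ``replace_conv2d`` and the mask format are hypothetical.

.. code-block:: python

   import torch
   import torch.nn as nn

   def replace_conv2d(conv: nn.Conv2d, in_mask: torch.Tensor, out_mask: torch.Tensor) -> nn.Conv2d:
       # Keep only the unpruned channels indicated by the 1-D channel masks.
       in_idx = in_mask.nonzero(as_tuple=True)[0]
       out_idx = out_mask.nonzero(as_tuple=True)[0]
       new_conv = nn.Conv2d(len(in_idx), len(out_idx),
                            kernel_size=conv.kernel_size, stride=conv.stride,
                            padding=conv.padding, bias=conv.bias is not None)
       # Copy the surviving weights (weight shape: [out_ch, in_ch, kh, kw]).
       new_conv.weight.data = conv.weight.data[out_idx][:, in_idx].clone()
       if conv.bias is not None:
           new_conv.bias.data = conv.bias.data[out_idx].clone()
       return new_conv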
Usage
-----
.. code-block:: python

   import time
   from nni.compression.pytorch import ModelSpeedup

   # model: the model you want to speed up
   # dummy_input: dummy input of the model, given to `jit.trace`
   # masks_file: the mask file created by pruning algorithms
   m_speedup = ModelSpeedup(model, dummy_input.to(device), masks_file)
   m_speedup.speedup_model()

   # measure the inference latency of the speeded-up model
   dummy_input = dummy_input.to(device)
   start = time.time()
   out = model(dummy_input)
   print('elapsed time: ', time.time() - start)
For complete examples, please refer to :githublink:`the code <examples/model_compress/model_speedup.py>`.
NOTE: The current implementation supports PyTorch 1.3.1 or newer.
Limitations
-----------
Since every module requires four functions for shape inference and module replacement, implementing them is a large amount of work, so we have only implemented the ones that are required by the examples. If you want to speed up your own model which is not supported by the current implementation, you are welcome to contribute.
For PyTorch we can only replace modules; if functions in ``forward`` should be replaced, our current implementation does not work. One workaround is to make the function a PyTorch module.
Speedup Results of Examples
---------------------------
The code of these experiments can be found :githublink:`here <examples/model_compress/model_speedup.py>`.
slim pruner example
^^^^^^^^^^^^^^^^^^^
on one V100 GPU,
input tensor: ``torch.randn(64, 3, 32, 32)``
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Times
     - Mask Latency (s)
     - Speedup Latency (s)
   * - 1
     - 0.01197
     - 0.005107
   * - 2
     - 0.02019
     - 0.008769
   * - 4
     - 0.02733
     - 0.014809
   * - 8
     - 0.04310
     - 0.027441
   * - 16
     - 0.07731
     - 0.05008
   * - 32
     - 0.14464
     - 0.10027
fpgm pruner example
^^^^^^^^^^^^^^^^^^^
on CPU,
input tensor: ``torch.randn(64, 1, 28, 28)``\ ,
(the variance across runs is large)
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Times
     - Mask Latency (s)
     - Speedup Latency (s)
   * - 1
     - 0.01383
     - 0.01839
   * - 2
     - 0.01167
     - 0.003558
   * - 4
     - 0.01636
     - 0.01088
   * - 40
     - 0.14412
     - 0.08268
   * - 40
     - 1.29385
     - 0.14408
   * - 40
     - 0.41035
     - 0.46162
   * - 400
     - 6.29020
     - 5.82143
l1filter pruner example
^^^^^^^^^^^^^^^^^^^^^^^
on one V100 GPU,
input tensor: ``torch.randn(64, 3, 32, 32)``
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Times
     - Mask Latency (s)
     - Speedup Latency (s)
   * - 1
     - 0.01026
     - 0.003677
   * - 2
     - 0.01657
     - 0.008161
   * - 4
     - 0.02458
     - 0.020018
   * - 8
     - 0.03498
     - 0.025504
   * - 16
     - 0.06757
     - 0.047523
   * - 32
     - 0.10487
     - 0.086442
APoZ pruner example
^^^^^^^^^^^^^^^^^^^
on one V100 GPU,
input tensor: ``torch.randn(64, 3, 32, 32)``
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Times
     - Mask Latency (s)
     - Speedup Latency (s)
   * - 1
     - 0.01389
     - 0.004208
   * - 2
     - 0.01628
     - 0.008310
   * - 4
     - 0.02521
     - 0.014008
   * - 8
     - 0.03386
     - 0.023923
   * - 16
     - 0.06042
     - 0.046183
   * - 32
     - 0.12421
     - 0.087113
Model Compression with NNI
==========================
.. contents::
As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for some real-time applications. Model compression can be used to address this problem.
NNI provides a model compression toolkit to help users compress and speed up their models with state-of-the-art compression algorithms and strategies. The core features supported by NNI model compression are:
* Support many popular pruning and quantization algorithms.
* Automate model pruning and quantization process with state-of-the-art strategies and NNI's auto tuning power.
* Speed up a compressed model to reduce its inference latency and size.
* Provide friendly and easy-to-use compression utilities for users to dive into the compression process and results.
* Concise interface for users to customize their own compression algorithms.
*Note that the interface and APIs are unified for both PyTorch and TensorFlow; currently only the PyTorch version is supported, and the TensorFlow version will be supported in the future.*
Supported Algorithms
--------------------
The algorithms include pruning algorithms and quantization algorithms.
Pruning Algorithms
^^^^^^^^^^^^^^^^^^
Pruning algorithms compress the original network by removing redundant weights or channels of layers, which can reduce model complexity and address the over-fitting issue.
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Name
     - Brief Introduction of Algorithm
   * - `Level Pruner </Compression/Pruner.html#level-pruner>`__
     - Prunes the specified ratio of each weight based on the absolute values of the weights
   * - `AGP Pruner </Compression/Pruner.html#agp-pruner>`__
     - Automated gradual pruning (To prune, or not to prune: exploring the efficacy of pruning for model compression) `Reference Paper <https://arxiv.org/abs/1710.01878>`__
   * - `Lottery Ticket Pruner </Compression/Pruner.html#lottery-ticket-hypothesis>`__
     - The pruning process used by "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". It prunes a model iteratively. `Reference Paper <https://arxiv.org/abs/1803.03635>`__
   * - `FPGM Pruner </Compression/Pruner.html#fpgm-pruner>`__
     - Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration `Reference Paper <https://arxiv.org/pdf/1811.00250.pdf>`__
   * - `L1Filter Pruner </Compression/Pruner.html#l1filter-pruner>`__
     - Prunes filters with the smallest L1 norm of weights in convolution layers (Pruning Filters for Efficient Convnets) `Reference Paper <https://arxiv.org/abs/1608.08710>`__
   * - `L2Filter Pruner </Compression/Pruner.html#l2filter-pruner>`__
     - Prunes filters with the smallest L2 norm of weights in convolution layers
   * - `ActivationAPoZRankFilterPruner </Compression/Pruner.html#activationapozrankfilterpruner>`__
     - Prunes filters based on the metric APoZ (average percentage of zeros), which measures the percentage of zeros in the activations of (convolutional) layers. `Reference Paper <https://arxiv.org/abs/1607.03250>`__
   * - `ActivationMeanRankFilterPruner </Compression/Pruner.html#activationmeanrankfilterpruner>`__
     - Prunes filters based on the smallest mean value of output activations
   * - `Slim Pruner </Compression/Pruner.html#slim-pruner>`__
     - Prunes channels in convolution layers by pruning the scaling factors in BN layers (Learning Efficient Convolutional Networks through Network Slimming) `Reference Paper <https://arxiv.org/abs/1708.06519>`__
   * - `TaylorFO Pruner </Compression/Pruner.html#taylorfoweightfilterpruner>`__
     - Prunes filters based on the first-order Taylor expansion on weights (Importance Estimation for Neural Network Pruning) `Reference Paper <http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf>`__
   * - `ADMM Pruner </Compression/Pruner.html#admm-pruner>`__
     - Pruning based on the ADMM optimization technique `Reference Paper <https://arxiv.org/abs/1804.03294>`__
   * - `NetAdapt Pruner </Compression/Pruner.html#netadapt-pruner>`__
     - Automatically simplifies a pretrained network to meet the resource budget via iterative pruning `Reference Paper <https://arxiv.org/abs/1804.03230>`__
   * - `SimulatedAnnealing Pruner </Compression/Pruner.html#simulatedannealing-pruner>`__
     - Automatic pruning with a guided heuristic search method, the Simulated Annealing algorithm `Reference Paper <https://arxiv.org/abs/1907.03141>`__
   * - `AutoCompress Pruner </Compression/Pruner.html#autocompress-pruner>`__
     - Automatic pruning by iteratively calling SimulatedAnnealing Pruner and ADMM Pruner `Reference Paper <https://arxiv.org/abs/1907.03141>`__
   * - `AMC Pruner </Compression/Pruner.html#amc-pruner>`__
     - AMC: AutoML for Model Compression and Acceleration on Mobile Devices `Reference Paper <https://arxiv.org/pdf/1802.03494.pdf>`__
You can refer to this :githublink:`benchmark <docs/en_US/CommunitySharings/ModelCompressionComparison.rst>` for the performance of these pruners on some benchmark problems.
Quantization Algorithms
^^^^^^^^^^^^^^^^^^^^^^^
Quantization algorithms compress the original network by reducing the number of bits required to represent weights or activations, which can reduce the computations and the inference time.
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Name
     - Brief Introduction of Algorithm
   * - `Naive Quantizer </Compression/Quantizer.html#naive-quantizer>`__
     - Quantizes weights to 8 bits by default
   * - `QAT Quantizer </Compression/Quantizer.html#qat-quantizer>`__
     - Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. `Reference Paper <http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf>`__
   * - `DoReFa Quantizer </Compression/Quantizer.html#dorefa-quantizer>`__
     - DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. `Reference Paper <https://arxiv.org/abs/1606.06160>`__
   * - `BNN Quantizer </Compression/Quantizer.html#bnn-quantizer>`__
     - Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. `Reference Paper <https://arxiv.org/abs/1602.02830>`__
Automatic Model Compression
---------------------------
Given a targeted compression ratio, it is hard to obtain the best compressed model in one shot. An automatic model compression algorithm usually needs to explore the compression space by compressing different layers with different sparsities. NNI provides such algorithms to free users from specifying the sparsity of each layer in a model. Moreover, users can leverage NNI's auto tuning power to automatically compress a model. The detailed document can be found `here <./AutoPruningUsingTuners.rst>`__.
Model Speedup
-------------
The final goal of model compression is to reduce inference latency and model size. However, existing model compression algorithms mainly use simulation to check the performance (e.g., accuracy) of the compressed model, for example, using masks for pruning algorithms, and still storing quantized values in float32 for quantization algorithms. Given the output masks and quantization bits produced by those algorithms, NNI can really speed up the model. The detailed tutorial of Model Speedup can be found `here <./ModelSpeedup.rst>`__.
Compression Utilities
---------------------
Compression utilities include some useful tools for users to understand and analyze the model they want to compress. For example, users could check sensitivity of each layer to pruning. Users could easily calculate the FLOPs and parameter size of a model. Please refer to `here <./CompressionUtils.rst>`__ for a complete list of compression utilities.
Customize Your Own Compression Algorithms
-----------------------------------------
NNI model compression provides a simple interface for users to customize a new compression algorithm. The design philosophy of the interface is to let users focus on the compression logic while hiding framework-specific implementation details. The detailed tutorial for customizing a new compression algorithm (pruning algorithm or quantization algorithm) can be found `here <./Framework.rst>`__.
Reference and Feedback
----------------------
* To `report a bug <https://github.com/microsoft/nni/issues/new?template=bug-report.rst>`__ for this feature in GitHub;
* To `file a feature or improvement request <https://github.com/microsoft/nni/issues/new?template=enhancement.rst>`__ for this feature in GitHub;
* To know more about `Feature Engineering with NNI <../FeatureEngineering/Overview.rst>`__\ ;
* To know more about `NAS with NNI <../NAS/Overview.rst>`__\ ;
* To know more about `Hyperparameter Tuning with NNI <../Tuner/BuiltinTuner.rst>`__\ ;
Supported Pruning Algorithms on NNI
===================================
We provide several pruning algorithms that support fine-grained weight pruning and structural filter pruning. **Fine-grained Pruning** generally results in unstructured models, which need specialized hardware or software to speed up the sparse network. **Filter Pruning** achieves acceleration by removing the entire filter. We also provide an algorithm to control the **pruning schedule**.
**Fine-grained Pruning**
* `Level Pruner <#level-pruner>`__
**Filter Pruning**
* `Slim Pruner <#slim-pruner>`__
* `FPGM Pruner <#fpgm-pruner>`__
* `L1Filter Pruner <#l1filter-pruner>`__
* `L2Filter Pruner <#l2filter-pruner>`__
* `Activation APoZ Rank Filter Pruner <#activationAPoZRankFilter-pruner>`__
* `Activation Mean Rank Filter Pruner <#activationmeanrankfilter-pruner>`__
* `Taylor FO On Weight Pruner <#taylorfoweightfilter-pruner>`__
**Pruning Schedule**
* `AGP Pruner <#agp-pruner>`__
* `NetAdapt Pruner <#netadapt-pruner>`__
* `SimulatedAnnealing Pruner <#simulatedannealing-pruner>`__
* `AutoCompress Pruner <#autocompress-pruner>`__
* `AMC Pruner <#amc-pruner>`__
* `Sensitivity Pruner <#sensitivity-pruner>`__
**Others**
* `ADMM Pruner <#admm-pruner>`__
* `Lottery Ticket Hypothesis <#lottery-ticket-hypothesis>`__
Level Pruner
------------
This is a basic one-shot pruner: you can set a target sparsity level (expressed as a fraction; 0.6 means 60% of the weight parameters will be pruned).
We first sort the weights in the specified layer by their absolute values, and then mask to zero the smallest-magnitude weights until the desired sparsity level is reached.
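The idea can be sketched in a few lines of PyTorch. This is an illustrative sketch of the criterion only, not NNI's implementation; ``level_mask`` is a hypothetical helper.

.. code-block:: python

   import torch

   def level_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
       # Zero out (approximately) the `sparsity` fraction of entries
       # with the smallest absolute values.
       k = int(weight.numel() * sparsity)
       if k == 0:
           return torch.ones_like(weight)
       threshold = weight.abs().flatten().kthvalue(k).values
       return (weight.abs() > threshold).float()

   w = torch.randn(4, 4)
   mask = level_mask(w, 0.6)   # ~60% of the entries are masked to zero
   pruned_w = w * mask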
Usage
^^^^^
TensorFlow code

.. code-block:: python

   from nni.algorithms.compression.tensorflow.pruning import LevelPruner
   config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
   pruner = LevelPruner(model, config_list)
   pruner.compress()

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import LevelPruner
   config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
   pruner = LevelPruner(model, config_list)
   pruner.compress()
User configuration for Level Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.LevelPruner
TensorFlow
""""""""""
.. autoclass:: nni.algorithms.compression.tensorflow.pruning.LevelPruner
Slim Pruner
-----------
This is a one-shot pruner proposed in `'Learning Efficient Convolutional Networks through Network Slimming' <https://arxiv.org/pdf/1708.06519.pdf>`__ by Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan and Changshui Zhang.
.. image:: ../../img/slim_pruner.png
:target: ../../img/slim_pruner.png
:alt:
..
Slim Pruner **prunes channels in the convolution layers by masking corresponding scaling factors in the later BN layers**. L1 regularization on the scaling factors should be applied in batch normalization (BN) layers while training. The scaling factors of BN layers are **globally ranked** while pruning, so the sparse model can be automatically found given the sparsity.
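For intuition, the sketch below shows the kind of L1 regularization term on BN scaling factors that network slimming applies during training. This is an assumption-based illustration, not NNI's API; ``bn_l1_penalty`` and the coefficient ``lam`` are hypothetical.

.. code-block:: python

   import torch
   import torch.nn as nn

   def bn_l1_penalty(model: nn.Module, lam: float = 1e-4) -> torch.Tensor:
       # Sum of absolute BN scaling factors (the gamma parameters);
       # adding this to the loss pushes unimportant channels toward zero.
       penalty = torch.zeros(())
       for m in model.modules():
           if isinstance(m, nn.BatchNorm2d):
               penalty = penalty + m.weight.abs().sum()
       return lam * penalty

   # during training: loss = criterion(output, target) + bn_l1_penalty(model)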
Usage
^^^^^
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import SlimPruner
   config_list = [{ 'sparsity': 0.8, 'op_types': ['BatchNorm2d'] }]
   pruner = SlimPruner(model, config_list)
   pruner.compress()
User configuration for Slim Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.SlimPruner
Reproduced Experiment
^^^^^^^^^^^^^^^^^^^^^
We implemented one of the experiments in `'Learning Efficient Convolutional Networks through Network Slimming' <https://arxiv.org/pdf/1708.06519.pdf>`__. We pruned 70% of the channels in the **VGGNet** for CIFAR-10 as in the paper, in which 88.5% of the parameters are pruned. Our experiment results are as follows:
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Model
     - Error (paper/ours)
     - Parameters
     - Pruned
   * - VGGNet
     - 6.34/6.40
     - 20.04M
     -
   * - Pruned-VGGNet
     - 6.20/6.26
     - 2.03M
     - 88.5%
The code for the experiments can be found at :githublink:`examples/model_compress <examples/model_compress/>`.
----
FPGM Pruner
-----------
This is a one-shot pruner. FPGM Pruner is an implementation of the paper `Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration <https://arxiv.org/pdf/1811.00250.pdf>`__.
FPGMPruner prunes the filters with the smallest geometric median.
.. image:: ../../img/fpgm_fig1.png
:target: ../../img/fpgm_fig1.png
:alt:
..
Previous works utilized “smaller-norm-less-important” criterion to prune filters with smaller norm values in a convolutional neural network. In this paper, we analyze this norm-based criterion and point out that its effectiveness depends on two requirements that are not always met: (1) the norm deviation of the filters should be large; (2) the minimum norm of the filters should be small. To solve this problem, we propose a novel filter pruning method, namely Filter Pruning via Geometric Median (FPGM), to compress the model regardless of those two requirements. Unlike previous methods, FPGM compresses CNN models by pruning filters with redundancy, rather than those with “relatively less” importance.
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.
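For intuition, the sketch below approximates the criterion by scoring each filter with its total distance to all other filters; the filters with the smallest scores are closest to the geometric median and thus considered the most redundant. This is an illustrative reading of the paper, not NNI's implementation.

.. code-block:: python

   import torch

   def fpgm_scores(conv_weight: torch.Tensor) -> torch.Tensor:
       # conv_weight shape: [out_channels, in_channels, kh, kw]
       filters = conv_weight.flatten(1)        # one row per filter
       dist = torch.cdist(filters, filters)    # pairwise L2 distances
       return dist.sum(dim=1)                  # small score => redundant => prune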
Usage
^^^^^
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import FPGMPruner
   config_list = [{
       'sparsity': 0.5,
       'op_types': ['Conv2d']
   }]
   pruner = FPGMPruner(model, config_list)
   pruner.compress()
User configuration for FPGM Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.FPGMPruner
L1Filter Pruner
---------------
This is a one-shot pruner proposed in `'PRUNING FILTERS FOR EFFICIENT CONVNETS' <https://arxiv.org/abs/1608.08710>`__ by Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet and Hans Peter Graf.
.. image:: ../../img/l1filter_pruner.png
:target: ../../img/l1filter_pruner.png
:alt:
..
L1Filter Pruner prunes filters in the **convolution layers**.
The procedure of pruning :math:`m` filters from the :math:`i`-th convolutional layer is as follows:
#. For each filter :math:`F_{i,j}`\ , calculate the sum of its absolute kernel weights :math:`s_j=\sum_{l=1}^{n_i}\sum|K_l|`.
#. Sort the filters by :math:`s_j`.
#. Prune :math:`m` filters with the smallest sum values and their corresponding feature maps. The
   kernels in the next convolutional layer corresponding to the pruned feature maps are also removed.
#. A new kernel matrix is created for both the :math:`i`-th and :math:`i+1`-th layers, and the remaining kernel
   weights are copied to the new model.
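The first two steps can be sketched directly in PyTorch (illustrative only; ``l1_filter_scores`` is a hypothetical helper, not NNI's code):

.. code-block:: python

   import torch

   def l1_filter_scores(conv_weight: torch.Tensor) -> torch.Tensor:
       # s_j: sum of absolute kernel weights of filter j
       # (conv_weight shape: [out_channels, in_channels, kh, kw]).
       return conv_weight.abs().sum(dim=(1, 2, 3))

   w = torch.randn(16, 8, 3, 3)                  # 16 filters
   m = 4                                         # number of filters to prune
   prune_idx = l1_filter_scores(w).argsort()[:m]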
In addition, we also provide a dependency-aware mode for the L1FilterPruner. For more details about the dependency-aware mode, please refer to `dependency-aware mode <./DependencyAware.rst>`__.
Usage
^^^^^
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import L1FilterPruner
   config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
   pruner = L1FilterPruner(model, config_list)
   pruner.compress()
User configuration for L1Filter Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.L1FilterPruner
Reproduced Experiment
^^^^^^^^^^^^^^^^^^^^^
We implemented one of the experiments in `'PRUNING FILTERS FOR EFFICIENT CONVNETS' <https://arxiv.org/abs/1608.08710>`__ with **L1FilterPruner**. We pruned **VGG-16** for CIFAR-10 to **VGG-16-pruned-A** as in the paper, in which 64% of the parameters are pruned. Our experiment results are as follows:
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Model
     - Error (paper/ours)
     - Parameters
     - Pruned
   * - VGG-16
     - 6.75/6.49
     - 1.5x10^7
     -
   * - VGG-16-pruned-A
     - 6.60/6.47
     - 5.4x10^6
     - 64.0%
The code for the experiments can be found at :githublink:`examples/model_compress <examples/model_compress/>`.
----
L2Filter Pruner
---------------
This is a structured pruning algorithm that prunes the filters with the smallest L2 norm of the weights. It is implemented as a one-shot pruner.
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.
Usage
^^^^^
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import L2FilterPruner
   config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
   pruner = L2FilterPruner(model, config_list)
   pruner.compress()
User configuration for L2Filter Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.L2FilterPruner
----
ActivationAPoZRankFilter Pruner
-------------------------------
ActivationAPoZRankFilter Pruner is a pruner which prunes the filters with the smallest importance criterion ``APoZ`` calculated from the output activations of convolution layers to achieve a preset level of network sparsity. The pruning criterion ``APoZ`` is explained in the paper `Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures <https://arxiv.org/abs/1607.03250>`__.
The APoZ is defined as:
.. image:: ../../img/apoz.png
:target: ../../img/apoz.png
:alt:
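In words, APoZ is the fraction of zero entries in a channel's post-ReLU activations, averaged over the data. A minimal sketch (illustrative only, not NNI's implementation; ``apoz`` is a hypothetical helper):

.. code-block:: python

   import torch

   def apoz(activations: torch.Tensor) -> torch.Tensor:
       # activations: post-ReLU outputs of shape [batch, channels, H, W];
       # returns one APoZ value per channel (high APoZ => prune first).
       return (activations == 0).float().mean(dim=(0, 2, 3))

   acts = torch.relu(torch.randn(32, 16, 8, 8))
   scores = apoz(acts)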
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.
Usage
^^^^^
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import ActivationAPoZRankFilterPruner
   config_list = [{
       'sparsity': 0.5,
       'op_types': ['Conv2d']
   }]
   pruner = ActivationAPoZRankFilterPruner(model, config_list, statistics_batch_num=1)
   pruner.compress()
Note: ActivationAPoZRankFilterPruner is used to prune convolutional layers within deep neural networks, therefore the ``op_types`` field supports only convolutional layers.
You can view :githublink:`example <examples/model_compress/model_prune_torch.py>` for more information.
User configuration for ActivationAPoZRankFilter Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.ActivationAPoZRankFilterPruner
----
ActivationMeanRankFilter Pruner
-------------------------------
ActivationMeanRankFilterPruner is a pruner which prunes the filters with the smallest importance criterion ``mean activation``\ , calculated from the output activations of convolution layers, to achieve a preset level of network sparsity. The pruning criterion ``mean activation`` is explained in section 2.2 of the paper `Pruning Convolutional Neural Networks for Resource Efficient Inference <https://arxiv.org/abs/1611.06440>`__. Other pruning criteria mentioned in this paper will be supported in a future release.
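A minimal sketch of the criterion, assuming post-ReLU activations of shape ``[batch, channels, H, W]`` (illustrative only, not NNI's implementation):

.. code-block:: python

   import torch

   def mean_activation(activations: torch.Tensor) -> torch.Tensor:
       # One mean-activation value per output channel;
       # a small mean makes the filter a candidate for pruning.
       return activations.mean(dim=(0, 2, 3))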
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.
Usage
^^^^^
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import ActivationMeanRankFilterPruner
   config_list = [{
       'sparsity': 0.5,
       'op_types': ['Conv2d']
   }]
   pruner = ActivationMeanRankFilterPruner(model, config_list, statistics_batch_num=1)
   pruner.compress()
Note: ActivationMeanRankFilterPruner is used to prune convolutional layers within deep neural networks, therefore the ``op_types`` field supports only convolutional layers.
You can view :githublink:`example <examples/model_compress/model_prune_torch.py>` for more information.
User configuration for ActivationMeanRankFilterPruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.ActivationMeanRankFilterPruner
----
TaylorFOWeightFilter Pruner
---------------------------
TaylorFOWeightFilter Pruner is a pruner which prunes convolutional layers based on estimated importance calculated from the first-order Taylor expansion on weights, to achieve a preset level of network sparsity. The estimated importance of filters is defined in the paper `Importance Estimation for Neural Network Pruning <http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf>`__. Other pruning criteria mentioned in this paper will be supported in a future release.
..
.. image:: ../../img/importance_estimation_sum.png
:target: ../../img/importance_estimation_sum.png
:alt:
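As a rough sketch of the criterion (an assumption based on the paper, not NNI's implementation), the importance of a filter can be estimated after a backward pass as the squared sum of gradient times weight over the filter:

.. code-block:: python

   import torch

   def taylor_fo_scores(conv: torch.nn.Conv2d) -> torch.Tensor:
       # Assumes a backward pass has populated conv.weight.grad.
       gw = (conv.weight.grad * conv.weight).flatten(1)  # [out_channels, rest]
       return gw.sum(dim=1).pow(2)                       # small score => prune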
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.
Usage
^^^^^
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import TaylorFOWeightFilterPruner
   config_list = [{
       'sparsity': 0.5,
       'op_types': ['Conv2d']
   }]
   pruner = TaylorFOWeightFilterPruner(model, config_list, statistics_batch_num=1)
   pruner.compress()
User configuration for TaylorFOWeightFilter Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.TaylorFOWeightFilterPruner
----
AGP Pruner
----------
This is an iterative pruner. In `To prune, or not to prune: exploring the efficacy of pruning for model compression <https://arxiv.org/abs/1710.01878>`__\ , authors Michael Zhu and Suyog Gupta provide an algorithm to prune weights gradually.
..
We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value si (usually 0) to a final sparsity value sf over a span of n pruning steps, starting at training step t0 and with pruning frequency ∆t:
.. image:: ../../img/agp_pruner.png
:target: ../../img/agp_pruner.png
:alt:
The binary weight masks are updated every ∆t steps as the network is trained, to gradually increase the sparsity of the network while allowing the network training steps to recover from any pruning-induced loss in accuracy. In our experience, varying the pruning frequency ∆t between 100 and 1000 training steps had a negligible impact on the final model quality. Once the model achieves the target sparsity s_f, the weight masks are no longer updated.
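For reference, the gradual sparsity schedule (equation (1) in the paper, also shown in the image above) is:

.. math::

   s_t = s_f + (s_i - s_f)\left(1 - \frac{t - t_0}{n\Delta t}\right)^3
   \qquad \text{for } t \in \{t_0,\, t_0 + \Delta t,\, \ldots,\, t_0 + n\Delta t\}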
Usage
^^^^^
You can prune all weights from 0% to 80% sparsity over 10 epochs with the code below.
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import AGPPruner
   config_list = [{
       'initial_sparsity': 0,
       'final_sparsity': 0.8,
       'start_epoch': 0,
       'end_epoch': 10,
       'frequency': 1,
       'op_types': ['default']
   }]

   # load a pretrained model or train a model before using a pruner
   # model = MyModel()
   # model.load_state_dict(torch.load('mycheckpoint.pth'))

   # AGP pruner prunes the model while fine-tuning it by adding a hook on
   # optimizer.step(), so an optimizer is required to prune the model.
   optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)

   pruner = AGPPruner(model, config_list, optimizer, pruning_algorithm='level')
   pruner.compress()
AGP pruner uses the ``LevelPruner`` algorithm to prune the weights by default; you can set the ``pruning_algorithm`` parameter to other values to use other pruning algorithms:
* ``level``\ : LevelPruner
* ``slim``\ : SlimPruner
* ``l1``\ : L1FilterPruner
* ``l2``\ : L2FilterPruner
* ``fpgm``\ : FPGMPruner
* ``taylorfo``\ : TaylorFOWeightFilterPruner
* ``apoz``\ : ActivationAPoZRankFilterPruner
* ``mean_activation``\ : ActivationMeanRankFilterPruner
Add the code below to update the epoch number when you finish one epoch in your training code.
PyTorch code
.. code-block:: python

   pruner.update_epoch(epoch)
You can view :githublink:`example <examples/model_compress/model_prune_torch.py>` for more information.
User configuration for AGP Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.AGPPruner
----
NetAdapt Pruner
---------------
NetAdapt allows a user to automatically simplify a pretrained network to meet the resource budget.
Given the overall sparsity, NetAdapt will automatically generate the sparsities distribution among different layers by iterative pruning.
For more details, please refer to `NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications <https://arxiv.org/abs/1804.03230>`__.
.. image:: ../../img/algo_NetAdapt.png
:target: ../../img/algo_NetAdapt.png
:alt:
Usage
^^^^^
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import NetAdaptPruner
   config_list = [{
       'sparsity': 0.5,
       'op_types': ['Conv2d']
   }]
   pruner = NetAdaptPruner(model, config_list, short_term_fine_tuner=short_term_fine_tuner, evaluator=evaluator, base_algo='l1', experiment_data_dir='./')
   pruner.compress()
You can view :githublink:`example <examples/model_compress/auto_pruners_torch.py>` for more information.
User configuration for NetAdapt Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.NetAdaptPruner
SimulatedAnnealing Pruner
-------------------------
We implement a guided heuristic search method, the Simulated Annealing (SA) algorithm, enhanced with guided search based on prior experience.
The enhanced SA technique is based on the observation that a DNN layer with a larger number of weights can often tolerate a higher degree of compression with less impact on overall accuracy.
* Randomly initialize a pruning rate distribution (sparsities).
* While current_temperature > stop_temperature:

  #. Generate a perturbation to the current distribution.
  #. Perform a fast evaluation on the perturbed distribution.
  #. Accept the perturbation according to the performance and the acceptance probability; if not accepted, return to step 1.
  #. Cool down: current_temperature <- current_temperature * cool_down_rate.
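The loop above can be sketched generically as follows. This is an illustrative sketch of simulated annealing itself, not NNI's ``SimulatedAnnealingPruner``; all names are hypothetical.

.. code-block:: python

   import math
   import random

   def simulated_annealing(init, evaluate, perturb,
                           t0=100.0, stop_temperature=1.0, cool_down_rate=0.9):
       # init: initial sparsity distribution; evaluate: higher score is better;
       # perturb: proposes a neighboring distribution.
       current, current_score = init, evaluate(init)
       t = t0
       while t > stop_temperature:
           candidate = perturb(current)
           score = evaluate(candidate)
           # Always accept improvements; accept worse candidates with a
           # probability that shrinks as the temperature cools down.
           if score > current_score or random.random() < math.exp((score - current_score) / t):
               current, current_score = candidate, score
           t *= cool_down_rate
       return current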
For more details, please refer to `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates <https://arxiv.org/abs/1907.03141>`__.
Usage
^^^^^
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import SimulatedAnnealingPruner
   config_list = [{
       'sparsity': 0.5,
       'op_types': ['Conv2d']
   }]
   pruner = SimulatedAnnealingPruner(model, config_list, evaluator=evaluator, base_algo='l1', cool_down_rate=0.9, experiment_data_dir='./')
   pruner.compress()
You can view :githublink:`example <examples/model_compress/auto_pruners_torch.py>` for more information.
User configuration for SimulatedAnnealing Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.SimulatedAnnealingPruner
AutoCompress Pruner
-------------------
In each round, AutoCompressPruner prunes the model with the same sparsity to achieve the overall sparsity:
.. code-block:: bash

   1. Generate sparsities distribution using SimulatedAnnealingPruner
   2. Perform ADMM-based structured pruning to generate pruning result for the next round.
      Here we use `speedup` to perform real pruning.
For more details, please refer to `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates <https://arxiv.org/abs/1907.03141>`__.
Usage
^^^^^
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import AutoCompressPruner
   config_list = [{
       'sparsity': 0.5,
       'op_types': ['Conv2d']
   }]
   pruner = AutoCompressPruner(
       model, config_list, trainer=trainer, evaluator=evaluator,
       dummy_input=dummy_input, num_iterations=3, optimize_mode='maximize', base_algo='l1',
       cool_down_rate=0.9, admm_num_iterations=30, admm_training_epochs=5, experiment_data_dir='./')
   pruner.compress()
You can view :githublink:`example <examples/model_compress/auto_pruners_torch.py>` for more information.
User configuration for AutoCompress Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.AutoCompressPruner
AMC Pruner
----------
AMC pruner leverages reinforcement learning to provide a model compression policy.
This learning-based compression policy outperforms conventional rule-based compression policies by achieving a higher compression ratio,
better preserving the accuracy, and freeing human labor.
.. image:: ../../img/amc_pruner.jpg
:target: ../../img/amc_pruner.jpg
:alt:
For more details, please refer to `AMC: AutoML for Model Compression and Acceleration on Mobile Devices <https://arxiv.org/pdf/1802.03494.pdf>`__.
Usage
^^^^^
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import AMCPruner
   config_list = [{
       'op_types': ['Conv2d', 'Linear']
   }]
   pruner = AMCPruner(model, config_list, evaluator, val_loader, flops_ratio=0.5)
   pruner.compress()
You can view :githublink:`example <examples/model_compress/amc/>` for more information.
User configuration for AMC Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.AMCPruner
Reproduced Experiment
^^^^^^^^^^^^^^^^^^^^^
We implemented one of the experiments in `AMC: AutoML for Model Compression and Acceleration on Mobile Devices <https://arxiv.org/pdf/1802.03494.pdf>`__\ : we pruned **MobileNet** to 50% FLOPS for ImageNet as in the paper. Our experiment results are as follows:
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Model
     - Top 1 acc. (paper/ours)
     - Top 5 acc. (paper/ours)
     - FLOPS
   * - MobileNet
     - 70.5% / 69.9%
     - 89.3% / 89.1%
     - 50%
The code for the experiments can be found at :githublink:`examples/model_compress <examples/model_compress/amc/>`.
ADMM Pruner
-----------
Alternating Direction Method of Multipliers (ADMM) is a mathematical optimization technique that decomposes the original nonconvex problem into two subproblems which can be solved iteratively. In the weight pruning problem, these two subproblems are solved via 1) a gradient descent algorithm and 2) a Euclidean projection, respectively.
During the process of solving these two subproblems, the weights of the original model will be changed. A one-shot pruner is then applied to prune the model according to the given config list.
This solution framework applies both to non-structured pruning and to different variations of structured pruning schemes.
For more details, please refer to `A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers <https://arxiv.org/abs/1804.03294>`__.
Usage
^^^^^
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import ADMMPruner
   config_list = [{
       'sparsity': 0.8,
       'op_types': ['Conv2d'],
       'op_names': ['conv1']
   }, {
       'sparsity': 0.92,
       'op_types': ['Conv2d'],
       'op_names': ['conv2']
   }]
   pruner = ADMMPruner(model, config_list, trainer=trainer, num_iterations=30, epochs=5)
   pruner.compress()
You can view :githublink:`example <examples/model_compress/auto_pruners_torch.py>` for more information.
User configuration for ADMM Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.ADMMPruner
Lottery Ticket Hypothesis
-------------------------
`The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks <https://arxiv.org/abs/1803.03635>`__\ , by Jonathan Frankle and Michael Carbin, provides comprehensive measurement and analysis, and articulates the *lottery ticket hypothesis*\ : dense, randomly-initialized, feed-forward networks contain subnetworks (*winning tickets*\ ) that -- when trained in isolation -- reach test accuracy comparable to the original network in a similar number of iterations.
In this paper, the authors use the following process to prune a model, called *iterative pruning*\ :
..
#. Randomly initialize a neural network :math:`f(x;\theta_0)` (where :math:`\theta_0 \sim D_{\theta}`\ ).
#. Train the network for :math:`j` iterations, arriving at parameters :math:`\theta_j`.
#. Prune :math:`p\%` of the parameters in :math:`\theta_j`\ , creating a mask :math:`m`.
#. Reset the remaining parameters to their values in :math:`\theta_0`\ , creating the winning ticket :math:`f(x;m \odot \theta_0)`.
#. Repeat steps 2, 3, and 4.
If the configured final sparsity is :math:`P` (e.g., 0.8) and there are :math:`n` rounds of iterative pruning, each round prunes :math:`1-(1-P)^{1/n}` of the weights that survived the previous round.
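As a quick sanity check of this formula (illustrative arithmetic, not part of NNI's API):

.. code-block:: python

   # With final sparsity P = 0.8 and n = 5 pruning rounds, each round prunes
   # 1 - (1 - P) ** (1 / n) of the weights surviving the previous round.
   P, n = 0.8, 5
   per_round = 1 - (1 - P) ** (1 / n)
   print(f'{per_round:.3f}')             # ~0.275, i.e. about 27.5% per round
   print(f'{(1 - per_round) ** n:.3f}')  # ~0.200, matching the final density 1 - P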
Usage
^^^^^
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import LotteryTicketPruner
   config_list = [{
       'prune_iterations': 5,
       'sparsity': 0.8,
       'op_types': ['default']
   }]
   pruner = LotteryTicketPruner(model, config_list, optimizer)
   pruner.compress()
   for _ in pruner.get_prune_iterations():
       pruner.prune_iteration_start()
       for epoch in range(epoch_num):
           ...
The above configuration means that there are 5 rounds of iterative pruning. As the 5 rounds are executed in the same run, LotteryTicketPruner needs ``model`` and ``optimizer`` (\ **note that an ``lr_scheduler`` should be added if used**\ ) to reset their states every time a new prune iteration starts. Please use ``get_prune_iterations`` to get the pruning iterations, and invoke ``prune_iteration_start`` at the beginning of each iteration. ``epoch_num`` should be large enough for model convergence, because the hypothesis is that the performance (accuracy) obtained in later rounds with high sparsity can be comparable with that obtained in the first round.
*Tensorflow version will be supported later.*
User configuration for LotteryTicket Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.LotteryTicketPruner
Reproduced Experiment
^^^^^^^^^^^^^^^^^^^^^
We try to reproduce the experiment result of the fully connected network on MNIST using the same configuration as in the paper. The code can be found :githublink:`here <examples/model_compress/lottery_torch_mnist_fc.py>`. In this experiment, we prune 10 times; after each pruning, we train the pruned model for 50 epochs.
.. image:: ../../img/lottery_ticket_mnist_fc.png
:target: ../../img/lottery_ticket_mnist_fc.png
:alt:
The above figure shows the result of the fully connected network. ``round0-sparsity-0.0`` is the performance without pruning. Consistent with the paper, pruning around 80% also obtains performance similar to non-pruning, and converges a little faster. If pruned too much, e.g., more than 94%, the accuracy becomes lower and convergence becomes a little slower. One small difference from the paper is that the trend of the data in the paper is clearer.
Sensitivity Pruner
------------------
In each round, SensitivityPruner prunes the model based on each layer's sensitivity to accuracy, until the final configured sparsity of the whole model is met:
.. code-block:: bash

   1. Analyze the sensitivity of each layer in the current state of the model.
   2. Prune each layer according to the sensitivity.
For more details, please refer to `Learning both Weights and Connections for Efficient Neural Networks <https://arxiv.org/abs/1506.02626>`__.
Usage
^^^^^
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import SensitivityPruner
   config_list = [{
       'sparsity': 0.5,
       'op_types': ['Conv2d']
   }]
   pruner = SensitivityPruner(model, config_list, finetuner=fine_tuner, evaluator=evaluator)
   # eval_args and finetune_args are the parameters passed to the evaluator and finetuner respectively
   pruner.compress(eval_args=[model], finetune_args=[model])
User configuration for Sensitivity Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.algorithms.compression.pytorch.pruning.SensitivityPruner
Supported Quantization Algorithms on NNI
========================================
Index of supported quantization algorithms
* `Naive Quantizer <#naive-quantizer>`__
* `QAT Quantizer <#qat-quantizer>`__
* `DoReFa Quantizer <#dorefa-quantizer>`__
* `BNN Quantizer <#bnn-quantizer>`__
Naive Quantizer
---------------
We provide the Naive Quantizer to quantize weights to 8 bits by default. You can use it to test a quantization algorithm without any configuration.
Usage
^^^^^
PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.quantization import NaiveQuantizer

   model = NaiveQuantizer(model).compress()
----
QAT Quantizer
-------------
In `Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference <http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf>`__\ , authors Benoit Jacob and Skirmantas Kligys provide an algorithm to quantize the model with training.
..
We propose an approach that simulates quantization effects in the forward pass of training. Backpropagation still happens as usual, and all weights and biases are stored in floating point so that they can be easily nudged by small amounts. The forward propagation pass however simulates quantized inference as it will happen in the inference engine, by implementing in floating-point arithmetic the rounding behavior of the quantization scheme:

* Weights are quantized before they are convolved with the input. If batch normalization (see [17]) is used for the layer, the batch normalization parameters are folded into the weights before quantization.
* Activations are quantized at points where they would be during inference, e.g. after the activation function is applied to a convolutional or fully connected layer's output, or after a bypass connection adds or concatenates the outputs of several layers together such as in ResNets.
Usage
^^^^^
You can quantize your model to 8 bits with the code below before your training code.
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
   model = Mnist()

   config_list = [{
       'quant_types': ['weight'],
       'quant_bits': {
           'weight': 8,
       }, # you can just use `int` here because all `quant_types` share the same bit length; see the config for `ReLU6` below.
       'op_types': ['Conv2d', 'Linear']
   }, {
       'quant_types': ['output'],
       'quant_bits': 8,
       'quant_start_step': 7000,
       'op_types': ['ReLU6']
   }]
   quantizer = QAT_Quantizer(model, config_list)
   quantizer.compress()
You can view the example for more information.
User configuration for QAT Quantizer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Common configuration needed by compression algorithms can be found in the `specification of config_list <./QuickStart.rst>`__.
Configuration needed by this algorithm:

* **quant_start_step:** int

  Disable quantization until the model has run for a certain number of steps. This allows the network to enter a more stable state, where the activation quantization ranges do not exclude a significant fraction of values. The default value is 0.
Note
^^^^

Batch normalization folding is currently not supported.
----
DoReFa Quantizer
----------------
In `DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients <https://arxiv.org/abs/1606.06160>`__\ , authors Shuchang Zhou and Yuxin Wu provide an algorithm named DoReFa to quantize the weight, activation and gradients with training.
Usage
^^^^^
To use the DoReFa Quantizer, add the code below before your training code.
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.quantization import DoReFaQuantizer
   config_list = [{
       'quant_types': ['weight'],
       'quant_bits': 8,
       'op_types': 'default'
   }]
   quantizer = DoReFaQuantizer(model, config_list)
   quantizer.compress()
You can view the example for more information.
User configuration for DoReFa Quantizer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Common configuration needed by compression algorithms can be found in the `specification of config_list <./QuickStart.rst>`__.
This algorithm requires no additional configuration keys.
----
BNN Quantizer
-------------
In `Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 <https://arxiv.org/abs/1602.02830>`__\ , the authors describe the method as follows:
..
We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time. At training-time the binary weights and activations are used for computing the parameters gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power-efficiency.
Usage
^^^^^
PyTorch code
.. code-block:: python

   from nni.algorithms.compression.pytorch.quantization import BNNQuantizer
   model = VGG_Cifar10(num_classes=10)

   configure_list = [{
       'quant_bits': 1,
       'quant_types': ['weight'],
       'op_types': ['Conv2d', 'Linear'],
       'op_names': ['features.0', 'features.3', 'features.7', 'features.10', 'features.14', 'features.17', 'classifier.0', 'classifier.3']
   }, {
       'quant_bits': 1,
       'quant_types': ['output'],
       'op_types': ['Hardtanh'],
       'op_names': ['features.6', 'features.9', 'features.13', 'features.16', 'features.20', 'classifier.2', 'classifier.5']
   }]

   quantizer = BNNQuantizer(model, configure_list)
   model = quantizer.compress()
You can view example :githublink:`examples/model_compress/BNN_quantizer_cifar10.py <examples/model_compress/BNN_quantizer_cifar10.py>` for more information.
User configuration for BNN Quantizer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Common configuration needed by compression algorithms can be found in the `specification of config_list <./QuickStart.rst>`__.
This algorithm requires no additional configuration keys.
Experiment
^^^^^^^^^^
We implemented one of the experiments in `Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 <https://arxiv.org/abs/1602.02830>`__\ : we quantized the **VGGNet** for CIFAR-10 as in the paper. Our experiment results are as follows:
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Model
     - Accuracy
   * - VGGNet
     - 86.93%
The code for the experiments can be found at :githublink:`examples/model_compress/BNN_quantizer_cifar10.py <examples/model_compress/BNN_quantizer_cifar10.py>`.
Tutorial for Model Compression
==============================
.. contents::
In this tutorial, we use the `first section <#quick-start-to-compress-a-model>`__ to quickly go through the usage of model compression on NNI. Then use the `second section <#detailed-usage-guide>`__ to explain more details of the usage.
Quick Start to Compress a Model
-------------------------------
NNI provides very simple APIs for compressing a model. The compression includes pruning algorithms and quantization algorithms. Their usage is the same; thus, here we use the `slim pruner </Compression/Pruner.html#slim-pruner>`__ as an example to show the usage.
Write configuration
^^^^^^^^^^^^^^^^^^^
Write a configuration to specify the layers that you want to prune. The following configuration means pruning all the ``BatchNorm2d``\ s to sparsity 0.7 while keeping other layers unpruned.
.. code-block:: python

   configure_list = [{
       'sparsity': 0.7,
       'op_types': ['BatchNorm2d'],
   }]
The specification of the configuration can be found `here <#specification-of-config-list>`__. Note that different pruners may have their own defined fields in the configuration, for example ``start_epoch`` in the AGP pruner. Please refer to each pruner's `usage <./Pruner.rst>`__ for details, and adjust the configuration accordingly.
Choose a compression algorithm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Choose a pruner to prune your model. First instantiate the chosen pruner with your model and configuration as arguments, then invoke ``compress()`` to compress your model.
.. code-block:: python

   pruner = SlimPruner(model, configure_list)
   model = pruner.compress()
Then, you can train your model using a traditional training approach (e.g., SGD); pruning is applied transparently during the training. Some pruners prune once at the beginning, and the following training can be seen as fine-tuning. Some pruners prune your model iteratively, and the masks are adjusted epoch by epoch during training.
Export compression result
^^^^^^^^^^^^^^^^^^^^^^^^^
After training, you get the accuracy of the pruned model. You can export the model weights to a file, and the generated masks to a file as well. Exporting an ONNX model is also supported.
.. code-block:: python

   pruner.export_model(model_path='pruned_vgg19_cifar10.pth', mask_path='mask_vgg19_cifar10.pth')
The complete code of model compression examples can be found :githublink:`here <examples/model_compress/model_prune_torch.py>`.
Speed up the model
^^^^^^^^^^^^^^^^^^
Masks do not provide a real speedup of your model. The model should be sped up based on the exported masks; thus, we provide an API to speed up your model as shown below. After invoking ``apply_compression_results`` on your model, your model becomes smaller with shorter inference latency.
.. code-block:: python

   from nni.compression.pytorch import apply_compression_results
   apply_compression_results(model, 'mask_vgg19_cifar10.pth')
Please refer to `here <ModelSpeedup.rst>`__ for detailed description.
Detailed Usage Guide
--------------------
The example code for users to apply model compression on a user model can be found below:
PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import LevelPruner
   config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
   pruner = LevelPruner(model, config_list)
   pruner.compress()

TensorFlow code

.. code-block:: python

   from nni.algorithms.compression.tensorflow.pruning import LevelPruner
   config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
   pruner = LevelPruner(tf.get_default_graph(), config_list)
   pruner.compress()
You can use other compression algorithms in the ``nni.compression`` package. The algorithms are implemented in both PyTorch and TensorFlow (partial support on TensorFlow), under ``nni.compression.pytorch`` and ``nni.compression.tensorflow`` respectively. You can refer to `Pruner <./Pruner.rst>`__ and `Quantizer <./Quantizer.rst>`__ for a detailed description of the supported algorithms. Also, if you want to use knowledge distillation, you can refer to `KDExample <../TrialExample/KDExample.rst>`__.
A compression algorithm is first instantiated with a ``config_list`` passed in. The specification of this ``config_list`` will be described later.
The function call ``pruner.compress()`` modifies the user-defined model (in TensorFlow the model can be obtained with ``tf.get_default_graph()``\ , while in PyTorch the model is the defined model class), and the model is modified with masks inserted. Then when you run the model, the masks take effect. The masks can be adjusted at runtime by the algorithms.
*Note that ``pruner.compress`` simply adds masks on model weights; it does not include fine-tuning logic. If users want to fine-tune the compressed model, they need to write the fine-tuning logic themselves after ``pruner.compress``.*
Specification of ``config_list``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Users can specify the configuration (i.e., ``config_list``\ ) for a compression algorithm. For example, when compressing a model, users may want to specify the sparsity ratio, to specify different ratios for different types of operations, to exclude certain types of operations, or to compress only certain types of operations. For users to express these kinds of requirements, we define a configuration specification. It can be seen as a Python ``list`` object, where each element is a ``dict`` object.
The ``dict``\ s in the ``list`` are applied one by one; that is, the configurations in a latter ``dict`` will overwrite the configurations in former ones for the operations that are within the scope of both.
There are different keys in a ``dict``. Some of them are common keys supported by all the compression algorithms:
* **op_types**\ : This is to specify what types of operations to be compressed. 'default' means following the algorithm's default setting.
* **op_names**\ : This is to specify by name what operations to be compressed. If this field is omitted, operations will not be filtered by it.
* **exclude**\ : Default is False. If this field is True, it means the operations with specified types and names will be excluded from the compression.
Some other keys are often specific to a certain algorithm; users can refer to `pruning algorithms <./Pruner.rst>`__ and `quantization algorithms <./Quantizer.rst>`__ for the keys allowed by each algorithm.
A simple example of configuration is shown below:
.. code-block:: python

   [
       {
           'sparsity': 0.8,
           'op_types': ['default']
       },
       {
           'sparsity': 0.6,
           'op_names': ['op_name1', 'op_name2']
       },
       {
           'exclude': True,
           'op_names': ['op_name3']
       }
   ]
This means: follow the algorithm's default setting for compressed operations with sparsity 0.8, but use sparsity 0.6 for ``op_name1`` and ``op_name2``\ , and do not compress ``op_name3``.
Quantization specific keys
^^^^^^^^^^^^^^^^^^^^^^^^^^
Besides the keys explained above, if you use quantization algorithms you need to specify more keys in ``config_list``\ , which are explained below.
* **quant_types** : list of strings

  The types of quantization you want to apply; currently 'weight', 'input' and 'output' are supported. 'weight' means applying quantization to the weight parameters of modules. 'input' means applying quantization to the input of a module's forward method. 'output' means applying quantization to the output of a module's forward method, which is often called 'activation' in some papers.
* **quant_bits** : int or dict of {str : int}

  The bit length of quantization; the key is the quantization type and the value is the bit length, e.g.

.. code-block:: bash

   {
       quant_bits: {
           'weight': 8,
           'output': 4,
       },
   }

When the value is of int type, all quantization types share the same bit length, e.g.

.. code-block:: bash

   {
       quant_bits: 8, # weight and output quantization are both 8 bits
   }
The following example shows a more complete ``config_list``\ , it uses ``op_names`` (or ``op_types``\ ) to specify the target layers along with the quantization bits for those layers.
.. code-block:: python

   configure_list = [{
       'quant_types': ['weight'],
       'quant_bits': 8,
       'op_names': ['conv1']
   }, {
       'quant_types': ['weight'],
       'quant_bits': 4,
       'quant_start_step': 0,
       'op_names': ['conv2']
   }, {
       'quant_types': ['weight'],
       'quant_bits': 3,
       'op_names': ['fc1']
   }, {
       'quant_types': ['weight'],
       'quant_bits': 2,
       'op_names': ['fc2']
   }]
In this example, ``op_names`` specifies the names of the layers, and the four layers will be quantized with different ``quant_bits``.
APIs for Updating Fine Tuning Status
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Some compression algorithms use epochs to control the progress of compression (e.g. `AGP </Compression/Pruner.html#agp-pruner>`__\ ), and some algorithms need to do something after every minibatch. Therefore, we provide another two APIs for users to invoke: ``pruner.update_epoch(epoch)`` and ``pruner.step()``.
``update_epoch`` should be invoked in every epoch, while ``step`` should be invoked after each minibatch. Note that most algorithms do not require calling the two APIs. Please refer to each algorithm's document for details. For the algorithms that do not need them, calling them is allowed but has no effect.
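A hedged sketch of where the two calls fit in a typical PyTorch training loop, assuming the usual ``model``\ , ``optimizer``\ , ``criterion`` and ``train_loader`` objects (only relevant for algorithms that use these APIs, such as AGP):

.. code-block:: python

   for epoch in range(num_epochs):
       for data, target in train_loader:
           optimizer.zero_grad()
           loss = criterion(model(data), target)
           loss.backward()
           optimizer.step()
           pruner.step()              # after each minibatch
       pruner.update_epoch(epoch)     # once per epoch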
Export Compressed Model
^^^^^^^^^^^^^^^^^^^^^^^
If you are pruning your model, you can easily export the compressed model using the following API. The ``state_dict`` of the sparse model weights will be stored in ``model.pth``\ , which can be loaded by ``torch.load('model.pth')``. In the exported ``model.pth``\ , the masked weights are zero.
.. code-block:: python
pruner.export_model(model_path='model.pth')
The ``mask_dict``\ , as well as the pruned model in ``onnx`` format (``input_shape`` needs to be specified), can also be exported like this:
.. code-block:: python
pruner.export_model(model_path='model.pth', mask_path='mask.pth', onnx_path='model.onnx', input_shape=[1, 1, 28, 28])
If you want to really speed up the compressed model, please refer to `NNI model speedup <./ModelSpeedup.rst>`__ for details.
GBDTSelector
------------
GBDTSelector is based on `LightGBM <https://github.com/microsoft/LightGBM>`__\ , which is a gradient boosting framework that uses tree-based learning algorithms.
When data is passed to the GBDT model, the model constructs the boosted trees, and the feature importance comes from the scores computed during construction, which indicate how useful or valuable each feature was when building the trees.
We can use this method as a strong baseline for feature selection, especially when using a GBDT model as a classifier or regressor.
For now, we support ``importance_type`` of ``split`` and ``gain``. We will support customized ``importance_type`` in the future, which means the user will be able to define how to calculate the feature score by themselves.
Usage
^^^^^
First you need to install dependency:
.. code-block:: bash
pip install lightgbm
Then
.. code-block:: python
from nni.feature_engineering.gbdt_selector import GBDTSelector
# load data
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# initialize a selector
fgs = GBDTSelector()
# fit data
fgs.fit(X_train, y_train, ...)
# get important features
# returns the indices of the important features
print(fgs.get_selected_features(10))
...
You can also refer to the examples in ``/examples/feature_engineering/gbdt_selector/``.
**Parameters of the ``fit`` function**
* **X** (array-like, required) - The training input samples, with shape [n_samples, n_features].
* **y** (array-like, required) - The target values (class labels in classification, real numbers in regression), with shape [n_samples].
* **lgb_params** (dict, required) - The parameters for the LightGBM model. For details, refer to `the LightGBM parameter documentation <https://lightgbm.readthedocs.io/en/latest/Parameters.html>`__.
* **eval_ratio** (float, required) - The ratio of the data size used to split the evaluation data from the training data in ``self.X``.
* **early_stopping_rounds** (int, required) - The early stopping setting in LightGBM. For details, refer to `the LightGBM parameter documentation <https://lightgbm.readthedocs.io/en/latest/Parameters.html>`__.
* **importance_type** (str, required) - Can be 'split' or 'gain'. 'split' means the result contains the number of times the feature is used in a model, and 'gain' means the result contains the total gain of splits which use the feature. For details, refer to `the LightGBM feature importance documentation <https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.feature_importance>`__.
* **num_boost_round** (int, required) - Number of boosting rounds. For details, refer to `the LightGBM train documentation <https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html#lightgbm.train>`__.
**Parameters of the ``get_selected_features`` function**
* **topk** (int, required) - The number of top important features to select.
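Putting the arguments above together, a complete ``fit`` call might look like the sketch below. The ``lgb_params`` values are ordinary LightGBM settings chosen for illustration only.

.. code-block:: python

   from nni.feature_engineering.gbdt_selector import GBDTSelector

   lgb_params = {
       'objective': 'binary',  # illustrative LightGBM parameters
       'num_leaves': 31,
       'learning_rate': 0.05,
   }

   selector = GBDTSelector()
   selector.fit(X_train, y_train,
                lgb_params=lgb_params,
                eval_ratio=0.3,  # 30% of the data is held out for evaluation
                early_stopping_rounds=10,
                importance_type='gain',
                num_boost_round=1000)
   print(selector.get_selected_features(topk=10))  # indices of the top-10 features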
GradientFeatureSelector
-----------------------
The algorithm in GradientFeatureSelector comes from `"Feature Gradients: Scalable Feature Selection via Discrete Relaxation" <https://arxiv.org/pdf/1908.10382.pdf>`__.
GradientFeatureSelector is a gradient-based search algorithm for feature selection.
1) This approach extends a recent result on the estimation of
learnability in the sublinear data regime by showing that the calculation can be performed iteratively (i.e., in mini-batches) and in **linear time and space** with respect to both the number of features D and the sample size N.
2) This, along with a discrete-to-continuous relaxation of the search domain, allows for an **efficient, gradient-based** search algorithm among feature subsets for very **large datasets**.
3) Crucially, this algorithm is capable of finding **higher-order correlations** between features and targets for both the N > D and N < D regimes, as opposed to approaches that do not consider such interactions and/or only consider one regime.
Usage
^^^^^
.. code-block:: python
from nni.feature_engineering.gradient_selector import FeatureGradientSelector
# load data
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# initialize a selector
fgs = FeatureGradientSelector(n_features=10)
# fit data
fgs.fit(X_train, y_train)
# get important features
# returns the indices of the important features
print(fgs.get_selected_features())
...
You can also refer to the examples in ``/examples/feature_engineering/gradient_feature_selector/``.
**Parameters of the ``FeatureGradientSelector`` constructor**
* **order** (int, optional, default = 4) - What order of interactions to include. Higher orders may be more accurate but increase the run time. 12 is the maximum allowed order.
* **penalty** (int, optional, default = 1) - Constant that multiplies the regularization term.
* **n_features** (int, optional, default = None) - If None, automatically chooses the number of features based on the search. Otherwise, the number of top features to select.
* **max_features** (int, optional, default = None) - If not None, uses the 'elbow method' to determine the number of features, with max_features as the upper limit.
* **learning_rate** (float, optional, default = 1e-1) - Learning rate.
* **init** (*zero, on, off, onhigh, offhigh, or sklearn; optional, default = zero*\ ) - How to initialize the vector of scores. 'zero' is the default.
* **n_epochs** (int, optional, default = 1) - Number of epochs to run.
* **shuffle** (bool, optional, default = True) - Shuffle "rows" prior to an epoch.
* **batch_size** (int, optional, default = 1000) - Number of "rows" to process at a time.
* **target_batch_size** (int, optional, default = 1000) - Number of "rows" to accumulate gradients over. Useful when many rows will not fit into memory but are needed for accurate estimation.
* **classification** (bool, optional, default = True) - If True, the problem is classification, otherwise regression.
* **ordinal** (bool, optional, default = True) - If True, the problem is ordinal classification. Requires classification to be True.
* **balanced** (bool, optional, default = True) - If True, each class is weighted equally in optimization, otherwise weighting is done via the support of each class. Requires classification to be True.
* **preprocess** (str, optional, default = 'zscore') - 'zscore' centers the data and normalizes it to unit variance; 'center' only centers the data to zero mean.
* **soft_grouping** (bool, optional, default = True) - If True, groups represent features that come from the same source. Used to encourage sparsity of groups and of features within groups.
* **verbose** (int, optional, default = 0) - Controls the verbosity when fitting. Set to 0 for no printing; 1 or higher prints every ``verbose`` number of gradient steps.
* **device** (str, optional, default = 'cpu') - 'cpu' to run on CPU and 'cuda' to run on GPU. Runs much faster on GPU.
**Parameters of the ``fit`` function**
* **X** (array-like, required) - The training input samples, with shape [n_samples, n_features].
* **y** (array-like, required) - The target values (class labels in classification, real numbers in regression), with shape [n_samples].
* **groups** (array-like, optional, default = None) - Groups of columns that must be selected as a unit, with shape [n_features]. E.g., [0, 0, 1, 2] specifies that the first two columns are part of one group.
**Parameters of the ``get_selected_features`` function**
For now, the ``get_selected_features`` function has no parameters.
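For illustration, a constructor call combining several of the parameters above could look like the sketch below; the values are examples rather than recommendations, and ``X_train``\ /\ ``y_train`` come from your own data.

.. code-block:: python

   from nni.feature_engineering.gradient_selector import FeatureGradientSelector

   fgs = FeatureGradientSelector(order=2,        # only pairwise interactions
                                 n_features=20,  # select the top 20 features
                                 n_epochs=5,
                                 batch_size=1000,
                                 classification=True,
                                 device='cpu')
   fgs.fit(X_train, y_train)
   print(fgs.get_selected_features())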
Feature Engineering with NNI
============================
We are glad to announce the alpha release of the Feature Engineering toolkit on top of NNI. It is still in the experimental phase and may evolve based on user feedback. We'd like to invite you to use it, give feedback, and even contribute.
For now, we support the following feature selectors:
* `GradientFeatureSelector <./GradientFeatureSelector.rst>`__
* `GBDTSelector <./GBDTSelector.rst>`__
These selectors are suitable for tabular data (i.e., they do not cover image, speech, or text data).
In addition, these selectors only perform feature selection. If you want to:
1) generate high-order combined features on NNI while doing feature selection;
2) leverage your distributed resources;
you could try this :githublink:`example <examples/feature_engineering/auto-feature-engineering>`.
How to use?
-----------
.. code-block:: python
from nni.feature_engineering.gradient_selector import FeatureGradientSelector
# from nni.feature_engineering.gbdt_selector import GBDTSelector
# load data
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# initialize a selector
fgs = FeatureGradientSelector(...)
# fit data
fgs.fit(X_train, y_train)
# get important features
# returns the indices of the important features
print(fgs.get_selected_features(...))
...
When using a built-in selector, you first need to ``import`` the feature selector and ``initialize`` it. You can call the ``fit`` function of the selector to pass in the data. After that, you can use ``get_selected_features`` to get the important features. The function parameters may differ between selectors, so check the docs before using them.
How to customize?
-----------------
NNI provides *state-of-the-art* feature selection algorithms as built-in selectors. NNI also supports building a feature selector by yourself.
If you want to implement a customized feature selector, you need to:
#. Inherit the base FeatureSelector class
#. Implement the ``fit`` and ``get_selected_features`` functions
#. Integrate with sklearn (Optional)
Here is an example:
**1. Inherit the Base FeatureSelector Class**
.. code-block:: python
from nni.feature_engineering.feature_selector import FeatureSelector
class CustomizedSelector(FeatureSelector):
def __init__(self, ...):
...
**2. Implement the ``fit`` and ``get_selected_features`` Functions**
.. code-block:: python
from nni.feature_engineering.feature_selector import FeatureSelector
class CustomizedSelector(FeatureSelector):
def __init__(self, ...):
...
def fit(self, X, y, **kwargs):
"""
Fit the training data to FeatureSelector
Parameters
------------
X : array-like numpy matrix
The training input samples, which shape is [n_samples, n_features].
y: array-like numpy matrix
The target values (class labels in classification, real numbers in regression). Which shape is [n_samples].
"""
self.X = X
self.y = y
...
def get_selected_features(self):
"""
Get important feature
Returns
-------
list :
Return the index of the important feature.
"""
...
return self.selected_features_
...
**3. Integrate with Sklearn**
``sklearn.pipeline.Pipeline`` can connect models in series, such as feature selector, normalization, and classification/regression to form a typical machine learning problem workflow.
The following steps help us integrate better with sklearn, which means we can treat the customized feature selector as a module of the pipeline.

#. Inherit the class ``sklearn.base.BaseEstimator``
#. Implement the ``get_params`` and ``set_params`` functions of *BaseEstimator*
#. Inherit the class ``sklearn.feature_selection.base.SelectorMixin``
#. Implement the ``get_support``\ , ``transform`` and ``inverse_transform`` functions of *SelectorMixin*
Here is an example:
**1. Inherit the BaseEstimator Class and its Function**
.. code-block:: python
from sklearn.base import BaseEstimator
from nni.feature_engineering.feature_selector import FeatureSelector
class CustomizedSelector(FeatureSelector, BaseEstimator):
def __init__(self, ...):
...
def get_params(self, ...):
"""
Get parameters for this estimator.
"""
params = self.__dict__
params = {key: val for (key, val) in params.items()
if not key.endswith('_')}
return params
def set_params(self, **params):
"""
Set the parameters of this estimator.
"""
for param in params:
if hasattr(self, param):
setattr(self, param, params[param])
return self
**2. Inherit the SelectorMixin Class and its Function**
.. code-block:: python
from sklearn.base import BaseEstimator
from sklearn.feature_selection.base import SelectorMixin
from nni.feature_engineering.feature_selector import FeatureSelector
class CustomizedSelector(FeatureSelector, BaseEstimator, SelectorMixin):
def __init__(self, ...):
...
def get_params(self, ...):
"""
Get parameters for this estimator.
"""
params = self.__dict__
params = {key: val for (key, val) in params.items()
if not key.endswith('_')}
return params
def set_params(self, **params):
"""
Set the parameters of this estimator.
"""
for param in params:
if hasattr(self, param):
setattr(self, param, params[param])
return self
def get_support(self, indices=False):
"""
Get a mask, or integer index, of the features selected.
Parameters
----------
indices : bool
Default False. If True, the return value will be an array of integers, rather than a boolean mask.
Returns
-------
list :
returns support: An index that selects the retained features from a feature vector.
If indices are False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention.
If indices are True, this is an integer array of shape [# output features] whose values
are indices into the input feature vector.
"""
...
return mask
def transform(self, X):
"""Reduce X to the selected features.
Parameters
----------
X : array
which shape is [n_samples, n_features]
Returns
-------
X_r : array
which shape is [n_samples, n_selected_features]
The input samples with only the selected features.
"""
...
return X_r
def inverse_transform(self, X):
"""
Reverse the transformation operation
Parameters
----------
X : array
shape is [n_samples, n_selected_features]
Returns
-------
X_r : array
shape is [n_samples, n_original_features]
"""
...
return X_r
After integrating with Sklearn, we could use the feature selector as follows:
.. code-block:: python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import ExtraTreesClassifier

# load data
...
X_train, y_train = ...
# build a pipeline with the customized selector
pipeline = make_pipeline(XXXSelector(...), LogisticRegression())
# or, for example, with a tree-based selector from sklearn
pipeline = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression())
pipeline.fit(X_train, y_train)
# score
print("Pipeline Score: ", pipeline.score(X_train, y_train))
Benchmark
---------
``Baseline`` means no feature selection: we directly pass the data to LogisticRegression. For this benchmark, we only use 10% of the training data as test data. For the GradientFeatureSelector, we only take the top 20 features. The metric is the mean accuracy on the given test data and labels.
.. list-table::
:header-rows: 1
:widths: auto
* - Dataset
- All Features + LR (acc, time, memory)
- GradientFeatureSelector + LR (acc, time, memory)
- TreeBasedClassifier + LR (acc, time, memory)
- #Train
- #Feature
* - colon-cancer
- 0.7547, 890ms, 348MiB
- 0.7368, 363ms, 286MiB
- 0.7223, 171ms, 1171 MiB
- 62
- 2,000
* - gisette
- 0.9725, 215ms, 584MiB
- 0.89416, 446ms, 397MiB
- 0.9792, 911ms, 234MiB
- 6,000
- 5,000
* - avazu
- 0.8834, N/A, N/A
- N/A, N/A, N/A
- N/A, N/A, N/A
- 40,428,967
- 1,000,000
* - rcv1
- 0.9644, 557ms, 241MiB
- 0.7333, 401ms, 281MiB
- 0.9615, 752ms, 284MiB
- 20,242
- 47,236
* - news20.binary
- 0.9208, 707ms, 361MiB
- 0.6870, 565ms, 371MiB
- 0.9070, 904ms, 364MiB
- 19,996
- 1,355,191
* - real-sim
- 0.9681, 433ms, 274MiB
- 0.7969, 251ms, 274MiB
- 0.9591, 643ms, 367MiB
- 72,309
- 20,958
The benchmark datasets can be downloaded `here <https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/>`__.
The code can be found in ``/examples/feature_engineering/gradient_feature_selector/benchmark_test.py``.
Reference and Feedback
----------------------
* To `report a bug <https://github.com/microsoft/nni/issues/new?template=bug-report.md>`__ for this feature in GitHub;
* To `file a feature or improvement request <https://github.com/microsoft/nni/issues/new?template=enhancement.md>`__ for this feature in GitHub;
* To know more about :githublink:`Neural Architecture Search with NNI <docs/en_US/NAS/Overview.rst>`\ ;
* To know more about :githublink:`Model Compression with NNI <docs/en_US/Compression/Overview.rst>`\ ;
* To know more about :githublink:`Hyperparameter Tuning with NNI <docs/en_US/Tuner/BuiltinTuner.rst>`\ ;
Customize a NAS Algorithm
=========================
Extend the Ability of One-Shot Trainers
---------------------------------------
Users might want to do multiple things if they are using the trainers on real tasks, for example, distributed training, half-precision training, logging periodically, writing tensorboard, dumping checkpoints and so on. As mentioned previously, some trainers do have support for some of the items listed above; others might not. Generally, there are two recommended ways to add anything you want to an existing trainer: inherit an existing trainer and override, or copy an existing trainer and modify.
Either way, you are walking into the scope of implementing a new trainer. Basically, implementing a one-shot trainer is no different from implementing any traditional deep learning trainer, except that a new concept called a mutator will reveal itself, so the implementation differs in at least two places:
* Initialization
.. code-block:: python
model = Model()
mutator = MyMutator(model)
* Training
.. code-block:: python
for _ in range(epochs):
for x, y in data_loader:
mutator.reset() # reset all the choices in model
out = model(x) # like traditional model
loss = criterion(out, y)
loss.backward()
# no difference below
To demonstrate what mutators are for, we need to know how one-shot NAS normally works. Usually, one-shot NAS "co-optimizes model weights and architecture weights". It repeatedly samples an architecture or a combination of several architectures from the supernet, trains the chosen architectures like a traditional deep learning model, updates the trained parameters to the supernet, and uses the metrics or loss as a signal to guide the architecture sampler. The mutator is the architecture sampler here, often defined as another deep-learning model. Therefore, you can treat it as you would any model: define parameters in it and optimize it with optimizers. One mutator is initialized with exactly one model. Once a mutator is bound to a model, it cannot be rebound to another model.
``mutator.reset()`` is the core step. That's where all the choices in the model are finalized. The reset result remains effective until the next reset flushes it. After the reset, the model can be seen as a traditional model to do forward-pass and backward-pass.
Finally, mutators provide a method called ``mutator.export()`` that exports a dict describing the architecture chosen for the model. Note that currently this dict is a mapping from mutable keys to tensors of selections, so in order to dump it to JSON, users need to convert the tensors explicitly into Python lists.
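A possible sketch of this conversion (assuming every exported value is a tensor):

.. code-block:: python

   import json

   exported = mutator.export()  # mapping: mutable key -> tensor of selections
   serializable = {key: tensor.tolist() for key, tensor in exported.items()}
   with open('final_architecture.json', 'w') as f:
       json.dump(serializable, f)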
Meanwhile, NNI provides some useful tools so that users can implement trainers more easily. See `Trainers <./NasReference.rst>`__ for details.
Implement New Mutators
----------------------
To start with, here is the pseudo-code that demonstrates what happens on ``mutator.reset()`` and ``mutator.export()``.
.. code-block:: python
def reset(self):
self.apply_on_model(self.sample_search())
.. code-block:: python
def export(self):
return self.sample_final()
On reset, a new architecture is sampled with ``sample_search()`` and applied to the model. Then the model is trained for one or more steps in the search phase. On export, a new architecture is sampled with ``sample_final()`` and **nothing is applied to the model**. This is either for checkpointing or for exporting the final architecture.
The requirements on the return values of ``sample_search()`` and ``sample_final()`` are the same: a mapping from mutable keys to tensors. The tensor can be either a BoolTensor (true for selected, false for not selected), or a FloatTensor which applies a weight to each candidate. The selected branches are then computed (in ``LayerChoice``\ , modules are called; in ``InputChoice``\ , it's just the tensors themselves) and reduced with the reduction operation specified in the choices. Since most algorithms only worry about the former part, here is an example of a mutator implementation.
.. code-block:: python
class RandomMutator(Mutator):
def __init__(self, model):
super().__init__(model) # don't forget to call super
# do something else
def sample_search(self):
result = dict()
for mutable in self.mutables: # this is all the mutable modules in user model
# mutables sharing the same key will be de-duplicated
if isinstance(mutable, LayerChoice):
# decided that this mutable should choose `gen_index`
gen_index = np.random.randint(mutable.length)
result[mutable.key] = torch.tensor([i == gen_index for i in range(mutable.length)],
dtype=torch.bool)
elif isinstance(mutable, InputChoice):
if mutable.n_chosen is None: # n_chosen is None, then choose any number
result[mutable.key] = torch.randint(high=2, size=(mutable.n_candidates,)).view(-1).bool()
# else do something else
return result
def sample_final(self):
return self.sample_search() # use the same logic here. you can do something different
The complete example of random mutator can be found :githublink:`here <src/sdk/pynni/nni/nas/pytorch/random/mutator.py>`.
For advanced usages, e.g., when users want to manipulate the way modules in ``LayerChoice`` are executed, they can inherit ``BaseMutator`` and overwrite ``on_forward_layer_choice`` and ``on_forward_input_choice``\ , which are the callback implementations of ``LayerChoice`` and ``InputChoice`` respectively. Users can still use the property ``mutables`` to get all ``LayerChoice`` and ``InputChoice`` in the model code. For details, please refer to the :githublink:`reference <src/sdk/pynni/nni/nas/pytorch>` to learn more.
.. tip::
A useful application of the random mutator is debugging. Running
.. code-block:: python
mutator = RandomMutator(model)
mutator.reset()
will immediately set one possible candidate in the search space as the active one.
Implement a Distributed NAS Tuner
-----------------------------------
Before learning how to write a distributed NAS tuner, users should first learn how to write a general tuner; read `Customize Tuner <../Tuner/CustomizeTuner.rst>`__ for a tutorial.
When users call `nnictl ss_gen <../Tutorial/Nnictl.rst>`__ to generate the search space file, a file like this will be generated:
.. code-block:: json
{
"key_name": {
"_type": "layer_choice",
"_value": ["op1_repr", "op2_repr", "op3_repr"]
},
"key_name": {
"_type": "input_choice",
"_value": {
"candidates": ["in1_key", "in2_key", "in3_key"],
"n_chosen": 1
}
}
}
This is the exact search space tuners will receive in ``update_search_space``. It is then the tuner's responsibility to interpret the search space and generate new candidates in ``generate_parameters``. Valid "parameters" are in the following format:
.. code-block:: json
{
"key_name": {
"_value": "op1_repr",
"_idx": 0
},
"key_name": {
"_value": ["in2_key"],
"_idex": [1]
}
}
Return such a dict from ``generate_parameters``\ , and the tuner will look like any HPO tuner. Refer to the `SPOS <./SPOS.rst>`__ example code for an example.
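To give a feel for the shape of such a tuner, here is a hypothetical minimal sketch that interprets the two mutable types shown above by choosing randomly (it assumes ``n_chosen`` is always an integer):

.. code-block:: python

   import random
   from nni.tuner import Tuner

   class RandomNasTuner(Tuner):  # hypothetical minimal tuner
       def update_search_space(self, search_space):
           self.search_space = search_space

       def generate_parameters(self, parameter_id, **kwargs):
           chosen = {}
           for key, spec in self.search_space.items():
               if spec['_type'] == 'layer_choice':
                   idx = random.randrange(len(spec['_value']))
                   chosen[key] = {'_value': spec['_value'][idx], '_idx': idx}
               elif spec['_type'] == 'input_choice':
                   candidates = spec['_value']['candidates']
                   idxs = sorted(random.sample(range(len(candidates)),
                                               spec['_value']['n_chosen']))
                   chosen[key] = {'_value': [candidates[i] for i in idxs],
                                  '_idx': idxs}
           return chosen

       def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
           pass  # a purely random tuner ignores trial results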
NAS Benchmarks
==============
.. toctree::
:hidden:
Example Usages <BenchmarksExample>
Introduction
------------
To improve the reproducibility of NAS algorithms and reduce computing resource requirements, researchers have proposed a series of NAS benchmarks such as `NAS-Bench-101 <https://arxiv.org/abs/1902.09635>`__\ , `NAS-Bench-201 <https://arxiv.org/abs/2001.00326>`__\ , `NDS <https://arxiv.org/abs/1905.13214>`__\ , etc. NNI provides a query interface for users to acquire these benchmarks. Within just a few lines of code, researchers are able to evaluate their NAS algorithms easily and fairly by utilizing these benchmarks.
Prerequisites
-------------
* Please prepare a folder to hold all the benchmark databases. By default, it can be found at ``${HOME}/.nni/nasbenchmark``. You can place it anywhere you like and specify it via ``export NASBENCHMARK_DIR=/path/to/your/nasbenchmark`` before importing NNI.
* Please install ``peewee`` via ``pip3 install peewee``\ , which NNI uses to connect to the database.
Data Preparation
----------------
To avoid storage and legality issues, we do not provide any prepared databases. Please follow the steps below.
#. Clone NNI to your machine and enter the ``examples/nas/benchmarks`` directory.
.. code-block:: bash
git clone -b ${NNI_VERSION} https://github.com/microsoft/nni
cd nni/examples/nas/benchmarks
Replace ``${NNI_VERSION}`` with a released version name or branch name, e.g., ``v1.9``.
#. Install dependencies via ``pip3 install -r xxx.requirements.txt``\ , where ``xxx`` can be ``nasbench101``\ , ``nasbench201`` or ``nds``.
#. Generate the database via ``./xxx.sh``. The directory that stores the benchmark files can be configured with the ``NASBENCHMARK_DIR`` environment variable, which defaults to ``~/.nni/nasbenchmark``. Note that the NAS-Bench-201 dataset will be downloaded from Google Drive. Please make sure there is at least 10GB of free disk space, and note that the conversion process can take several hours to complete.
Example Usages
--------------
Please refer to the `example usages of the benchmarks API <./BenchmarksExample>`__.
NAS-Bench-101
-------------
`Paper link <https://arxiv.org/abs/1902.09635>`__, `Open-source <https://github.com/google-research/nasbench>`__
NAS-Bench-101 contains 423,624 unique neural networks, combined with 4 variations in number of epochs (4, 12, 36, 108), each of which is trained 3 times. It is a cell-wise search space, which constructs and stacks a cell by enumerating DAGs with at most 7 operators, and no more than 9 connections. All operators can be chosen from ``CONV3X3_BN_RELU``\ , ``CONV1X1_BN_RELU`` and ``MAXPOOL3X3``\ , except the first operator (always ``INPUT``\ ) and last operator (always ``OUTPUT``\ ).
Notably, NAS-Bench-101 eliminates invalid cells (e.g., there is no path from input to output, or there is redundant computation). Furthermore, isomorphic cells are de-duplicated, i.e., all the remaining cells are computationally unique.
API Documentation
^^^^^^^^^^^^^^^^^
.. autofunction:: nni.nas.benchmarks.nasbench101.query_nb101_trial_stats
.. autoattribute:: nni.nas.benchmarks.nasbench101.INPUT
.. autoattribute:: nni.nas.benchmarks.nasbench101.OUTPUT
.. autoattribute:: nni.nas.benchmarks.nasbench101.CONV3X3_BN_RELU
.. autoattribute:: nni.nas.benchmarks.nasbench101.CONV1X1_BN_RELU
.. autoattribute:: nni.nas.benchmarks.nasbench101.MAXPOOL3X3
.. autoclass:: nni.nas.benchmarks.nasbench101.Nb101TrialConfig
.. autoclass:: nni.nas.benchmarks.nasbench101.Nb101TrialStats
.. autoclass:: nni.nas.benchmarks.nasbench101.Nb101IntermediateStats
.. autofunction:: nni.nas.benchmarks.nasbench101.graph_util.nasbench_format_to_architecture_repr
.. autofunction:: nni.nas.benchmarks.nasbench101.graph_util.infer_num_vertices
.. autofunction:: nni.nas.benchmarks.nasbench101.graph_util.hash_module
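For instance, a query can be as short as the sketch below; passing ``None`` as the architecture iterates over all matching records (see the example usages page for queries against specific architectures).

.. code-block:: python

   import pprint
   from nni.nas.benchmarks.nasbench101 import query_nb101_trial_stats

   # iterate over trials trained for 108 epochs; print only the first record
   for trial in query_nb101_trial_stats(None, 108):
       pprint.pprint(trial)
       break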
NAS-Bench-201
-------------
`Paper link <https://arxiv.org/abs/2001.00326>`__, `Open-source API <https://github.com/D-X-Y/NAS-Bench-201>`__, `Implementations <https://github.com/D-X-Y/AutoDL-Projects>`__
NAS-Bench-201 is a cell-wise search space that views nodes as tensors and edges as operators. The search space contains all possible densely-connected DAGs with 4 nodes, resulting in 15,625 candidates in total. Each operator (i.e., edge) is selected from a pre-defined operator set (\ ``NONE``\ , ``SKIP_CONNECT``\ , ``CONV_1X1``\ , ``CONV_3X3`` and ``AVG_POOL_3X3``\ ). Training approaches vary in the dataset used (CIFAR-10, CIFAR-100, ImageNet) and the number of epochs scheduled (12 and 200). Each combination of architecture and training approach is repeated 1 to 3 times with different random seeds.
API Documentation
^^^^^^^^^^^^^^^^^
.. autofunction:: nni.nas.benchmarks.nasbench201.query_nb201_trial_stats
.. autoattribute:: nni.nas.benchmarks.nasbench201.NONE
.. autoattribute:: nni.nas.benchmarks.nasbench201.SKIP_CONNECT
.. autoattribute:: nni.nas.benchmarks.nasbench201.CONV_1X1
.. autoattribute:: nni.nas.benchmarks.nasbench201.CONV_3X3
.. autoattribute:: nni.nas.benchmarks.nasbench201.AVG_POOL_3X3
.. autoclass:: nni.nas.benchmarks.nasbench201.Nb201TrialConfig
.. autoclass:: nni.nas.benchmarks.nasbench201.Nb201TrialStats
.. autoclass:: nni.nas.benchmarks.nasbench201.Nb201IntermediateStats
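A query against one specific cell could look like this sketch, where each key ``i_j`` names the edge from node ``i`` to node ``j`` and the values come from the operator set above:

.. code-block:: python

   import pprint
   from nni.nas.benchmarks.nasbench201 import query_nb201_trial_stats

   arch = {
       '0_1': 'avg_pool_3x3',
       '0_2': 'conv_1x1',
       '1_2': 'skip_connect',
       '0_3': 'conv_1x1',
       '1_3': 'skip_connect',
       '2_3': 'skip_connect',
   }
   # all repeats of this architecture trained for 200 epochs on CIFAR-100
   for trial in query_nb201_trial_stats(arch, 200, 'cifar100'):
       pprint.pprint(trial)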
NDS
---
`Paper link <https://arxiv.org/abs/1905.13214>`__, `Open-source <https://github.com/facebookresearch/nds>`__
*On Network Design Spaces for Visual Recognition* released trial statistics of over 100,000 configurations (models + hyper-parameters) sampled from multiple model families, including vanilla (feedforward networks loosely inspired by VGG), ResNet and ResNeXt (residual basic blocks and residual bottleneck blocks) and NAS cells (following popular designs from NASNet, Amoeba, PNAS, ENAS and DARTS). Most configurations are trained only once with a fixed seed, except a few that are trained twice or three times.
Instead of storing results obtained with different configurations in separate files, we dump them into one single database to enable comparison in multiple dimensions. Specifically, we use ``model_family`` to distinguish model types, ``model_spec`` for all hyper-parameters needed to build this model, ``cell_spec`` for detailed information on operators and connections if it is a NAS cell, ``generator`` to denote the sampling policy through which this configuration is generated. Refer to API documentation for details.
Available Operators
-------------------
Here is a list of available operators used in NDS.
.. autoattribute:: nni.nas.benchmarks.nds.constants.NONE
.. autoattribute:: nni.nas.benchmarks.nds.constants.SKIP_CONNECT
.. autoattribute:: nni.nas.benchmarks.nds.constants.AVG_POOL_3X3
.. autoattribute:: nni.nas.benchmarks.nds.constants.MAX_POOL_3X3
.. autoattribute:: nni.nas.benchmarks.nds.constants.MAX_POOL_5X5
.. autoattribute:: nni.nas.benchmarks.nds.constants.MAX_POOL_7X7
.. autoattribute:: nni.nas.benchmarks.nds.constants.CONV_1X1
.. autoattribute:: nni.nas.benchmarks.nds.constants.CONV_3X3
.. autoattribute:: nni.nas.benchmarks.nds.constants.CONV_3X1_1X3
.. autoattribute:: nni.nas.benchmarks.nds.constants.CONV_7X1_1X7
.. autoattribute:: nni.nas.benchmarks.nds.constants.DIL_CONV_3X3
.. autoattribute:: nni.nas.benchmarks.nds.constants.DIL_CONV_5X5
.. autoattribute:: nni.nas.benchmarks.nds.constants.SEP_CONV_3X3
.. autoattribute:: nni.nas.benchmarks.nds.constants.SEP_CONV_5X5
.. autoattribute:: nni.nas.benchmarks.nds.constants.SEP_CONV_7X7
.. autoattribute:: nni.nas.benchmarks.nds.constants.DIL_SEP_CONV_3X3
API Documentation
^^^^^^^^^^^^^^^^^
.. autofunction:: nni.nas.benchmarks.nds.query_nds_trial_stats
.. autoclass:: nni.nas.benchmarks.nds.NdsTrialConfig
.. autoclass:: nni.nas.benchmarks.nds.NdsTrialStats
.. autoclass:: nni.nas.benchmarks.nds.NdsIntermediateStats
CDARTS
======
Introduction
------------
`CDARTS <https://arxiv.org/pdf/2006.10724.pdf>`__ builds a cyclic feedback mechanism between the search and evaluation networks. First, the search network generates an initial topology for evaluation, so that the weights of the evaluation network can be optimized. Second, the architecture topology in the search network is further optimized by the label supervision in classification, as well as the regularization from the evaluation network through feature distillation. Repeating the above cycle results in a joint optimization of the search and evaluation networks, and thus enables the evolution of the topology to fit the final evaluation network.
In the implementation of ``CdartsTrainer``\ , it first instantiates two models and two mutators (one for each). The first model is the so-called "search network", which is mutated with a ``RegularizedDartsMutator``\ , a mutator with subtle differences from ``DartsMutator``. The second model is the "evaluation network", which is mutated with a discrete mutator that leverages the previous search network's mutator to sample a single path each time. Trainers train the models and mutators alternately. Users can refer to the `paper <https://arxiv.org/pdf/2006.10724.pdf>`__ if they are interested in more details on these trainers and mutators.
Reproduction Results
--------------------
This is CDARTS based on the NNI platform, which currently supports CIFAR10 search and retraining. ImageNet search and retraining should also be supported, and we provide the corresponding interfaces. Our reproduced results on NNI are slightly lower than those in the paper, but much higher than the original DARTS. Here we show the results of three independent experiments on CIFAR10.
.. list-table::
:header-rows: 1
:widths: auto
* - Runs
- Paper
- NNI
* - 1
- 97.52
- 97.44
* - 2
- 97.53
- 97.48
* - 3
- 97.58
- 97.56
Examples
--------
`Example code <https://github.com/microsoft/nni/tree/master/examples/nas/cdarts>`__
.. code-block:: bash
# In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
git clone https://github.com/Microsoft/nni.git
# install apex for distributed training.
git clone https://github.com/NVIDIA/apex
cd apex
python setup.py install --cpp_ext --cuda_ext
# search the best architecture
cd examples/nas/cdarts
bash run_search_cifar.sh
# train the best architecture.
bash run_retrain_cifar.sh
Reference
---------
PyTorch
^^^^^^^
.. autoclass:: nni.algorithms.nas.pytorch.cdarts.CdartsTrainer
:members:
.. autoclass:: nni.algorithms.nas.pytorch.cdarts.RegularizedDartsMutator
:members:
.. autoclass:: nni.algorithms.nas.pytorch.cdarts.DartsDiscreteMutator
:members:
.. autoclass:: nni.algorithms.nas.pytorch.cdarts.RegularizedMutatorParallel
:members:
.. role:: raw-html(raw)
:format: html
Classic NAS Algorithms
======================
In classic NAS algorithms, each architecture is trained as a trial and the NAS algorithm acts as a tuner. Thus, this training mode naturally fits within the NNI hyper-parameter tuning framework, where Tuner generates new architecture for the next trial and trials run in the training service.
Quick Start
-----------
The following example shows how to use classic NAS algorithms. You can see it is quite similar to NNI hyper-parameter tuning.
.. code-block:: python
model = Net()
# get the chosen architecture from tuner and apply it on model
get_and_apply_next_architecture(model)
train(model) # your code for training the model
acc = test(model) # test the trained model
nni.report_final_result(acc) # report the performance of the chosen architecture
First, instantiate the model. A search space has been defined in this model through ``LayerChoice`` and ``InputChoice``. After that, users should invoke ``get_and_apply_next_architecture(model)`` to settle on a specific architecture. This function receives the architecture from the tuner (i.e., the classic NAS algorithm) and applies it to ``model``. At this point, ``model`` becomes a specific architecture rather than a search space. Then users are free to train this model just like a normal PyTorch model. After getting the accuracy of this model, users should invoke ``nni.report_final_result(acc)`` to report the result to the tuner.
At this point, trial code is ready. Then, we can prepare an NNI experiment, i.e., search space file and experiment config file. Different from NNI hyper-parameter tuning, search space file is automatically generated from the trial code by running the command (the detailed usage of this command can be found `here <../Tutorial/Nnictl.rst>`__\ ):
``nnictl ss_gen --trial_command="the command for running your trial code"``
A file named ``nni_auto_gen_search_space.json`` is generated by this command. Then put the path of the generated search space in the field ``searchSpacePath`` of the experiment config file. The other fields of the config file can be filled by referring `this tutorial <../Tutorial/QuickStart.rst>`__.
Currently, we only support :githublink:`PPO Tuner <examples/tuners/random_nas_tuner>` for classic NAS. More classic NAS algorithms will be supported soon.
The complete examples can be found :githublink:`here <examples/nas/classic_nas>` for PyTorch and :githublink:`here <examples/nas/classic_nas-tf>` for TensorFlow.
Standalone mode for easy debugging
----------------------------------
We support a standalone mode for easy debugging, where you can directly run the trial command without launching an NNI experiment. This is for checking whether your trial code can correctly run. The first candidate(s) are chosen for ``LayerChoice`` and ``InputChoice`` in this standalone mode.
:raw-html:`<a name="regulaized-evolution-tuner"></a>`
Regularized Evolution Tuner
---------------------------
This is a tuner geared towards NNI's Neural Architecture Search (NAS) interface. It uses the `evolution algorithm <https://arxiv.org/pdf/1802.01548.pdf>`__.
The tuner first randomly initializes ``population`` models and evaluates them. After that, each time a new architecture is to be produced, the tuner randomly picks ``sample`` architectures from the ``population``\ , then mutates the best model in the sample (the parent model) to produce a child model. The mutation includes hidden state mutation and op mutation. The hidden state mutation replaces a hidden state with another hidden state from within the cell, subject to the constraint that no loops are formed. The op mutation replaces one op with another op from the op set. Note that keeping the child model the same as its parent is not allowed. After the child model is evaluated, it is appended to the tail of the ``population`` and the front one is popped.
Note that **trial concurrency should be less than the population size**\ , otherwise a NO_MORE_TRIAL exception will be raised.
The whole procedure is summarized by the pseudocode below.
.. image:: ../../img/EvoNasTuner.png
:target: ../../img/EvoNasTuner.png
:alt:
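Stripped of NAS specifics, the loop can also be sketched in a few lines of Python; bit-string "architectures" and a toy fitness function stand in for real child-model training here.

.. code-block:: python

   import collections
   import random

   def evaluate(arch):   # toy stand-in for training and evaluating a child model
       return sum(arch)

   def mutate(parent):   # flip one "op", so the child always differs from its parent
       child = list(parent)
       child[random.randrange(len(child))] ^= 1
       return child

   population = collections.deque(
       [random.choices([0, 1], k=8) for _ in range(20)])  # ``population`` = 20
   for _ in range(100):
       sample = random.sample(list(population), 5)        # ``sample`` = 5
       parent = max(sample, key=evaluate)                 # best model in the sample
       population.append(mutate(parent))                  # child joins at the tail
       population.popleft()                               # oldest model is popped
   print(max(population, key=evaluate))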
.. role:: raw-html(raw)
:format: html
Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search
=======================================================================================
`[Paper] <https://papers.nips.cc/paper/2020/file/d072677d210ac4c03ba046120f0802ec-Paper.pdf>`__ `[Models-Google Drive] <https://drive.google.com/drive/folders/1NLGAbBF9bA1IUAxKlk2VjgRXhr6RHvRW?usp=sharing>`__ `[Models-Baidu Disk (PWD: wqw6)] <https://pan.baidu.com/s/1TqQNm2s14oEdyNPimw3T9g>`__ `[BibTex] <https://scholar.googleusercontent.com/scholar.bib?q=info:ICWVXc_SsKAJ:scholar.google.com/&output=citation&scisdr=CgUmooXfEMfTi0cV5aU:AAGBfm0AAAAAX7sQ_aXoamdKRaBI12tAVN8REq1VKNwM&scisig=AAGBfm0AAAAAX7sQ_RdYtp6BSro3zgbXVJU2MCgsG730&scisf=4&ct=citation&cd=-1&hl=ja>`__ :raw-html:`<br/>`
In this work, we present a simple yet effective architecture distillation method. The central idea is that subnetworks can learn collaboratively and teach each other throughout the training process, aiming to boost the convergence of individual models. We introduce the concept of prioritized path, which refers to the architecture candidates exhibiting superior performance during training. Distilling knowledge from the prioritized paths is able to boost the training of subnetworks. Since the prioritized paths are changed on the fly depending on their performance and complexity, the final obtained paths are the cream of the crop. The discovered architectures achieve superior performance compared to the recent `MobileNetV3 <https://arxiv.org/abs/1905.02244>`__ and `EfficientNet <https://arxiv.org/abs/1905.11946>`__ families under aligned settings.
:raw-html:`<div ><img src="https://github.com/microsoft/Cream/blob/main/demo/intro.jpg" width="800"/></div>`
Reproduced Results
------------------
Top-1 accuracy on ImageNet. The top-1 accuracy of the Cream search algorithm surpasses MobileNetV3 and EfficientNet-B0/B1 on ImageNet.
Training with 16 GPUs is slightly better than with 8 GPUs, as shown below.
.. list-table::
:header-rows: 1
:widths: auto
* - Model (M Flops)
- 8 GPUs
- 16 GPUs
* - 14M
- 53.7
- 53.8
* - 43M
- 65.8
- 66.5
* - 114M
- 72.1
- 72.8
* - 287M
- 76.7
- 77.6
* - 481M
- 78.9
- 79.2
* - 604M
- 79.4
- 80.0
.. raw:: html
<table style="border: none">
<th><img src="./../../img/cream_flops100.jpg" alt="drawing" width="400"/></th>
<th><img src="./../../img/cream_flops600.jpg" alt="drawing" width="400"/></th>
</table>
Examples
--------
`Example code <https://github.com/microsoft/nni/tree/master/examples/nas/cream>`__
Please run the following scripts in the example folder.
Data Preparation
----------------
You need to first download `ImageNet-2012 <http://www.image-net.org/>`__ to the folder ``./data/imagenet`` and move the validation set to the subfolder ``./data/imagenet/val``. To move the validation set, you could use the following script: https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh
Put the ImageNet data in ``./data``. It should look like the following:
.. code-block:: bash
./data/imagenet/train
./data/imagenet/val
...
Quick Start
-----------
I. Search
^^^^^^^^^
First, set up the environment for searching.
.. code-block:: bash
pip install -r ./requirements
git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install --cpp_ext --cuda_ext
To search for an architecture, you need to configure the parameters ``FLOPS_MINIMUM`` and ``FLOPS_MAXIMUM`` to specify the desired range of model FLOPs, such as [0, 600] MFLOPs. You can specify the FLOPs interval by changing these two parameters in ``./configs/train.yaml``.
.. code-block:: bash
FLOPS_MINIMUM: 0 # Minimum Flops of Architecture
FLOPS_MAXIMUM: 600 # Maximum Flops of Architecture
For example, if you expect to search an architecture with model flops <= 200M, please set the ``FLOPS_MINIMUM`` and ``FLOPS_MAXIMUM`` to be ``0`` and ``200``.
After you specify the flops of the architectures you would like to search, you can search an architecture now by running:
.. code-block:: bash
python -m torch.distributed.launch --nproc_per_node=8 ./train.py --cfg ./configs/train.yaml
The searched architectures need to be retrained to obtain the final model. The final model is saved in ``.pth.tar`` format. Retraining code will be released soon.
II. Retrain
^^^^^^^^^^^
To train the searched architectures, you need to configure the parameter ``MODEL_SELECTION`` to specify the model's FLOPs. To specify which model to train, you should add ``MODEL_SELECTION`` in ``./configs/retrain.yaml``. You can select one from [14,43,112,287,481,604], which stand for different FLOPs (M).
.. code-block:: bash
MODEL_SELECTION: 43 # Retrain 43m model
MODEL_SELECTION: 481 # Retrain 481m model
......
To train random architectures, you need specify ``MODEL_SELECTION`` to ``-1`` and configure the parameter ``INPUT_ARCH``\ :
.. code-block:: bash
MODEL_SELECTION: -1 # Train random architectures
INPUT_ARCH: [[0], [3], [3, 3], [3, 1, 3], [3, 3, 3, 3], [3, 3, 3], [0]] # Random Architectures
......
After adding ``MODEL_SELECTION`` in ``./configs/retrain.yaml``\ , you need to use the following command to train the model.
.. code-block:: bash
python -m torch.distributed.launch --nproc_per_node=8 ./retrain.py --cfg ./configs/retrain.yaml
III. Test
^^^^^^^^^
To test the trained models, you need to use ``MODEL_SELECTION`` in ``./configs/test.yaml`` to specify which model to test.
.. code-block:: bash
MODEL_SELECTION: 43 # test 43m model
MODEL_SELECTION: 481 # test 481m model
......
After specifying the flops of the model, you need to write the path to the resume model in ``./test.sh``.
.. code-block:: bash
RESUME_PATH: './43.pth.tar'
RESUME_PATH: './481.pth.tar'
......
We provide 14M/43M/114M/287M/481M/604M pretrained models in `google drive <https://drive.google.com/drive/folders/1CQjyBryZ4F20Rutj7coF8HWFcedApUn2>`__ or `[Models-Baidu Disk (password: wqw6)] <https://pan.baidu.com/s/1TqQNm2s14oEdyNPimw3T9g>`__ .
After downloading the pretrained models and adding ``MODEL_SELECTION`` and ``RESUME_PATH`` in ``./configs/test.yaml``\ , you need to use the following command to test the model.
.. code-block:: bash
python -m torch.distributed.launch --nproc_per_node=8 ./test.py --cfg ./configs/test.yaml
DARTS
=====
Introduction
------------
The paper `DARTS: Differentiable Architecture Search <https://arxiv.org/abs/1806.09055>`__ addresses the scalability challenge of architecture search by formulating the task in a differentiable manner. Their method is based on the continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent.
The authors' code optimizes the network weights and architecture weights alternately in mini-batches. They further explore using second-order optimization (unrolling) instead of first-order to improve performance.
Implementation on NNI is based on the `official implementation <https://github.com/quark0/darts>`__ and a `popular 3rd-party repo <https://github.com/khanrc/pt.darts>`__. DARTS on NNI is designed to be general for arbitrary search space. A CNN search space tailored for CIFAR10, same as the original paper, is implemented as a use case of DARTS.
Reproduction Results
--------------------
The above-mentioned example is meant to reproduce the results in the paper; we run experiments with first- and second-order optimization. Due to time limits, we retrain *only the best architecture* derived from the search phase, and we repeat the experiment *only once*. Our results are currently on par with the results reported in the paper. We will add more results later when they are ready.
.. list-table::
:header-rows: 1
:widths: auto
* -
- In paper
- Reproduction
* - First order (CIFAR10)
- 3.00 +/- 0.14
- 2.78
* - Second order (CIFAR10)
- 2.76 +/- 0.09
- 2.80
Examples
--------
CNN Search Space
^^^^^^^^^^^^^^^^
:githublink:`Example code <examples/nas/darts>`
.. code-block:: bash
# In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
git clone https://github.com/Microsoft/nni.git
# search the best architecture
cd examples/nas/darts
python3 search.py
# train the best architecture
python3 retrain.py --arc-checkpoint ./checkpoints/epoch_49.json
Reference
---------
PyTorch
^^^^^^^
.. autoclass:: nni.algorithms.nas.pytorch.darts.DartsTrainer
:members:
.. autoclass:: nni.algorithms.nas.pytorch.darts.DartsMutator
:members:
Limitations
-----------
* DARTS doesn't support DataParallel and needs to be customized in order to support DistributedDataParallel.
ENAS
====
Introduction
------------
The paper `Efficient Neural Architecture Search via Parameter Sharing <https://arxiv.org/abs/1802.03268>`__ uses parameter sharing between child models to accelerate the NAS process. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained with policy gradient to select a subgraph that maximizes the expected reward on the validation set. Meanwhile the model corresponding to the selected subgraph is trained to minimize a canonical cross entropy loss.
The implementation on NNI is based on the `official implementation in Tensorflow <https://github.com/melodyguan/enas>`__\ , including a general-purpose reinforcement-learning controller and a trainer that trains the target network and this controller alternately. Following the paper, we have also implemented the macro and micro search spaces on CIFAR10 to demonstrate how to use these trainers. Since the code to train from scratch on NNI is not ready yet, reproduction results are currently unavailable.
Examples
--------
CIFAR10 Macro/Micro Search Space
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:githublink:`Example code <examples/nas/enas>`
.. code-block:: bash
# In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
git clone https://github.com/Microsoft/nni.git
# search the best architecture
cd examples/nas/enas
# search in macro search space
python3 search.py --search-for macro
# search in micro search space
python3 search.py --search-for micro
# view more options for search
python3 search.py -h
Reference
---------
PyTorch
^^^^^^^
.. autoclass:: nni.algorithms.nas.pytorch.enas.EnasTrainer
:members:
.. autoclass:: nni.algorithms.nas.pytorch.enas.EnasMutator
:members:
One-shot NAS algorithms
=======================
Besides `classic NAS algorithms <./ClassicNas.rst>`__\ , users can also apply more advanced one-shot NAS algorithms to find better models from a search space. There are many related works on one-shot NAS algorithms, such as `SMASH <https://arxiv.org/abs/1708.05344>`__\ , `ENAS <https://arxiv.org/abs/1802.03268>`__\ , `DARTS <https://arxiv.org/abs/1806.09055>`__\ , `FBNet <https://arxiv.org/abs/1812.03443>`__\ , `ProxylessNAS <https://arxiv.org/abs/1812.00332>`__\ , `SPOS <https://arxiv.org/abs/1904.00420>`__\ , `Single-Path NAS <https://arxiv.org/abs/1904.02877>`__\ , `Understanding One-shot <http://proceedings.mlr.press/v80/bender18a>`__ and `GDAS <https://arxiv.org/abs/1910.04465>`__. One-shot NAS algorithms usually build a supernet containing every candidate in the search space as a subnetwork, and in each step a subnetwork or a combination of several subnetworks is trained.
Currently, several one-shot NAS methods are supported on NNI. For example, ``DartsTrainer``\ , which uses SGD to train architecture weights and model weights iteratively, and ``ENASTrainer``\ , which `uses a controller to train the model <https://arxiv.org/abs/1802.03268>`__. New and more efficient NAS trainers keep emerging in the research community, and some will be implemented in future releases of NNI.
Search with One-shot NAS Algorithms
-----------------------------------
Each one-shot NAS algorithm implements a trainer, for which users can find usage details in the description of each algorithm. Here is a simple example, demonstrating how users can use ``EnasTrainer``.
.. code-block:: python
# this is exactly same as traditional model training
model = Net()
dataset_train = CIFAR10(root="./data", train=True, download=True, transform=train_transform)
dataset_valid = CIFAR10(root="./data", train=False, download=True, transform=valid_transform)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), 0.05, momentum=0.9, weight_decay=1.0E-4)
# use NAS here
def top1_accuracy(output, target):
# this is the function that computes the reward, as required by ENAS algorithm
batch_size = target.size(0)
_, predicted = torch.max(output.data, 1)
return (predicted == target).sum().item() / batch_size
def metrics_fn(output, target):
# metrics function receives output and target and computes a dict of metrics
return {"acc1": top1_accuracy(output, target)}
from nni.algorithms.nas.pytorch import enas
trainer = enas.EnasTrainer(model,
loss=criterion,
metrics=metrics_fn,
reward_function=top1_accuracy,
optimizer=optimizer,
batch_size=128,
num_epochs=10, # 10 epochs
dataset_train=dataset_train,
dataset_valid=dataset_valid,
log_frequency=10) # print log every 10 steps
trainer.train() # training
trainer.export(file="model_dir/final_architecture.json") # export the final architecture to file
``model`` is the one with `user defined search space <./WriteSearchSpace.rst>`__. Then users should prepare training data and model evaluation metrics. To search from the defined search space, a one-shot algorithm is instantiated, called trainer (e.g., EnasTrainer). The trainer exposes a few arguments that you can customize. For example, the loss function, the metrics function, the optimizer, and the datasets. These should satisfy most usage requirements and we do our best to make sure our built-in trainers work on as many models, tasks, and datasets as possible.
**Note that** when using one-shot NAS algorithms, there is no need to start an NNI experiment. Users can directly run this Python script (i.e., ``train.py``\ ) through ``python3 train.py`` without ``nnictl``. After training, users can export the best one of the found models through ``trainer.export()``.
Each trainer in NNI has its targeted scenario and usage. Some trainers have the assumption that the task is a classification task; some trainers might have a different definition of "epoch" (e.g., an ENAS epoch = some child steps + some controller steps). Most trainers do not have support for distributed training: they won't wrap your model with ``DataParallel`` or ``DistributedDataParallel`` to do that. So after a few tryouts, if you want to actually use the trainers on your very customized applications, you might need to `customize your trainer <./Advanced.rst#extend-the-ability-of-one-shot-trainers>`__.
Furthermore, one-shot NAS can be visualized with our NAS UI. `See more details. <./Visualization.rst>`__
Retrain with Exported Architecture
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
After the search phase, it's time to train the found architecture. Unlike many open-source NAS algorithms, which write a whole new model specifically for retraining, we found that the search model and the retraining model are usually very similar, so you can construct your final model with the exact same model code. For example:
.. code-block:: python
model = Net()
apply_fixed_architecture(model, "model_dir/final_architecture.json")
The JSON is simply a mapping from mutable keys to choices. Choices can be expressed in:
* A string: select the candidate with corresponding name.
* A number: select the candidate with corresponding index.
* A list of string: select the candidates with corresponding names.
* A list of number: select the candidates with corresponding indices.
* A list of boolean values: a multi-hot array.
For example,
.. code-block:: json
{
"LayerChoice1": "conv5x5",
"LayerChoice2": 6,
"InputChoice3": ["layer1", "layer3"],
"InputChoice4": [1, 2],
"InputChoice5": [false, true, false, false, true]
}
After applying, the model is then fixed and ready for final training. The model works as a single model, and unused parameters and modules are pruned.
Also, refer to `DARTS <./DARTS.rst>`__ for code exemplifying retraining.
NAS Reference
=============
.. contents::
Mutables
--------
.. autoclass:: nni.nas.pytorch.mutables.Mutable
:members:
.. autoclass:: nni.nas.pytorch.mutables.LayerChoice
:members:
.. autoclass:: nni.nas.pytorch.mutables.InputChoice
:members:
.. autoclass:: nni.nas.pytorch.mutables.MutableScope
:members:
Utilities
^^^^^^^^^
.. autofunction:: nni.nas.pytorch.utils.global_mutable_counting
Mutators
--------
.. autoclass:: nni.nas.pytorch.base_mutator.BaseMutator
:members:
.. autoclass:: nni.nas.pytorch.mutator.Mutator
:members:
Random Mutator
^^^^^^^^^^^^^^
.. autoclass:: nni.algorithms.nas.pytorch.random.RandomMutator
:members:
Utilities
^^^^^^^^^
.. autoclass:: nni.nas.pytorch.utils.StructuredMutableTreeNode
:members:
Trainers
--------
Trainer
^^^^^^^
.. autoclass:: nni.nas.pytorch.base_trainer.BaseTrainer
:members:
.. autoclass:: nni.nas.pytorch.trainer.Trainer
:members:
Retrain
^^^^^^^
.. autofunction:: nni.nas.pytorch.fixed.apply_fixed_architecture
.. autoclass:: nni.nas.pytorch.fixed.FixedArchitecture
:members:
Distributed NAS
^^^^^^^^^^^^^^^
.. autofunction:: nni.algorithms.nas.pytorch.classic_nas.get_and_apply_next_architecture
.. autoclass:: nni.algorithms.nas.pytorch.classic_nas.mutator.ClassicMutator
:members:
Callbacks
^^^^^^^^^
.. autoclass:: nni.nas.pytorch.callbacks.Callback
:members:
.. autoclass:: nni.nas.pytorch.callbacks.LRSchedulerCallback
:members:
.. autoclass:: nni.nas.pytorch.callbacks.ArchitectureCheckpoint
:members:
.. autoclass:: nni.nas.pytorch.callbacks.ModelCheckpoint
:members:
Utilities
^^^^^^^^^
.. autoclass:: nni.nas.pytorch.utils.AverageMeterGroup
:members:
.. autoclass:: nni.nas.pytorch.utils.AverageMeter
:members:
.. autofunction:: nni.nas.pytorch.utils.to_device
Neural Architecture Search (NAS) on NNI
=======================================
.. contents::
Overview
--------
Automatic neural architecture search is taking an increasingly important role in finding better models. Recent research has proved the feasibility of automatic NAS and has led to models that beat many manually designed and tuned models. Some representative works are `NASNet <https://arxiv.org/abs/1707.07012>`__\ , `ENAS <https://arxiv.org/abs/1802.03268>`__\ , `DARTS <https://arxiv.org/abs/1806.09055>`__\ , `Network Morphism <https://arxiv.org/abs/1806.10282>`__\ , and `Evolution <https://arxiv.org/abs/1703.01041>`__. Further, new innovations keep emerging.
However, it takes a great effort to implement NAS algorithms, and it's hard to reuse the code base of existing algorithms for new ones. To facilitate NAS innovations (e.g., the design and implementation of new NAS models, the comparison of different NAS models side-by-side, etc.), an easy-to-use and flexible programming interface is crucial.
With this motivation, our ambition is to provide a unified architecture in NNI, accelerate innovations on NAS, and apply state-of-the-art algorithms to real-world problems faster.
With the unified interface, there are two different modes for architecture search. `One <#supported-one-shot-nas-algorithms>`__ is the so-called one-shot NAS, where a super-net is built based on a search space and one-shot training is used to generate a good-performing child model. `The other <#supported-classic-nas-algorithms>`__ is the traditional search-based approach, where each child model within the search space runs as an independent trial. We call it classic NAS.
NNI also provides a dedicated `visualization tool <#nas-visualization>`__ for users to check the status of the neural architecture search process.
Supported Classic NAS Algorithms
--------------------------------
The procedure of classic NAS algorithms is similar to hyper-parameter tuning: users start experiments with ``nnictl``\ , and each model runs as a trial. The difference is that the search space file is automatically generated from the user's model (which has the search space embedded in it) by running ``nnictl ss_gen``. The following table lists the tuning algorithms supported in classic NAS mode. More algorithms will be supported in future releases.
.. list-table::
:header-rows: 1
:widths: auto
* - Name
- Brief Introduction of Algorithm
* - :githublink:`Random Search <examples/tuners/random_nas_tuner>`
- Randomly pick a model from search space
* - `PPO Tuner </Tuner/BuiltinTuner.html#PPOTuner>`__
- PPO Tuner is a Reinforcement Learning tuner based on PPO algorithm. `Reference Paper <https://arxiv.org/abs/1707.06347>`__
Please refer to `here <ClassicNas.rst>`__ for the usage of classic NAS algorithms.
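For a quick flavor, here is a rough sketch of what trial code looks like in classic NAS mode. ``Net``\ , ``train``\ , and ``test`` are placeholders for your own model (containing mutables) and training/evaluation routines; the only NAS-specific call is ``get_and_apply_next_architecture``\ :

.. code-block:: python

   import nni
   from nni.algorithms.nas.pytorch.classic_nas import get_and_apply_next_architecture

   model = Net()                            # model with LayerChoice/InputChoice mutables
   get_and_apply_next_architecture(model)   # apply the architecture chosen by the tuner
   train(model)                             # ordinary training code
   acc = test(model)                        # ordinary evaluation code
   nni.report_final_result(acc)             # report the result of this trial to NNI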
Supported One-shot NAS Algorithms
---------------------------------
NNI currently supports the one-shot NAS algorithms listed below and is adding more. Users can reproduce an algorithm or use it on their own dataset. We also encourage users to implement other algorithms with `NNI API <#use-nni-api>`__\ , to benefit more people.
.. list-table::
:header-rows: 1
:widths: auto
* - Name
- Brief Introduction of Algorithm
* - `ENAS </NAS/ENAS.html>`__
- `Efficient Neural Architecture Search via Parameter Sharing <https://arxiv.org/abs/1802.03268>`__. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. It uses parameter sharing between child models to achieve fast speed and excellent performance.
* - `DARTS </NAS/DARTS.html>`__
- `DARTS: Differentiable Architecture Search <https://arxiv.org/abs/1806.09055>`__ introduces a novel algorithm for differentiable network architecture search based on bilevel optimization.
* - `P-DARTS </NAS/PDARTS.html>`__
- `Progressive Differentiable Architecture Search: Bridging the Depth Gap between Search and Evaluation <https://arxiv.org/abs/1904.12760>`__ is based on DARTS. It introduces an efficient algorithm which allows the depth of searched architectures to grow gradually during the training procedure.
* - `SPOS </NAS/SPOS.html>`__
- `Single Path One-Shot Neural Architecture Search with Uniform Sampling <https://arxiv.org/abs/1904.00420>`__ constructs a simplified supernet trained with a uniform path sampling method and applies an evolutionary algorithm to efficiently search for the best-performing architectures.
* - `CDARTS </NAS/CDARTS.html>`__
- `Cyclic Differentiable Architecture Search <https://arxiv.org/abs/****>`__ builds a cyclic feedback mechanism between the search and evaluation networks. It introduces a cyclic differentiable architecture search framework which integrates the two networks into a unified architecture.
* - `ProxylessNAS </NAS/Proxylessnas.html>`__
- `ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware <https://arxiv.org/abs/1812.00332>`__. It removes the proxy and directly learns the architectures for large-scale target tasks and target hardware platforms.
* - `TextNAS </NAS/TextNAS.html>`__
- `TextNAS: A Neural Architecture Search Space tailored for Text Representation <https://arxiv.org/pdf/1912.10729.pdf>`__. It is a neural architecture search algorithm tailored for text representation.
One-shot algorithms run **standalone, without nnictl**. NNI supports both PyTorch and TensorFlow 2.x.
Here are some common dependencies to run the examples. PyTorch needs to be 1.2 or above to use ``BoolTensor``.
* tensorboard
* PyTorch 1.2+
* git
Please refer to `here <NasGuide.rst>`__ for the usage of one-shot NAS algorithms.
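To give a flavor of the workflow, below is a rough sketch using one of the one-shot trainers (DARTS here). The argument names follow the DARTS example shipped with NNI and may differ slightly between releases; ``Net``\ , ``accuracy``\ , ``dataset_train``\ , and ``dataset_valid`` are placeholders:

.. code-block:: python

   import torch
   import torch.nn as nn
   from nni.algorithms.nas.pytorch.darts import DartsTrainer

   model = Net()                                # model with LayerChoice/InputChoice mutables
   criterion = nn.CrossEntropyLoss()
   optim = torch.optim.SGD(model.parameters(), lr=0.025, momentum=0.9)

   trainer = DartsTrainer(model,
                          loss=criterion,
                          metrics=accuracy,     # callable(output, target) -> dict of metrics
                          optimizer=optim,
                          num_epochs=50,
                          dataset_train=dataset_train,
                          dataset_valid=dataset_valid)
   trainer.train()                              # one-shot search, no nnictl involved
   trainer.export("final_architecture.json")    # dump the chosen architecture as JSON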
One-shot NAS can be visualized with our visualization tool. Learn more details `here <./Visualization.rst>`__.
Search Space Zoo
----------------
NNI provides some predefined search spaces which can be easily reused. By stacking the extracted cells, users can quickly reproduce those NAS models.
Search Space Zoo contains the following NAS cells:
* `DartsCell <./SearchSpaceZoo.rst#DartsCell>`__
* `ENAS micro <./SearchSpaceZoo.rst#ENASMicroLayer>`__
* `ENAS macro <./SearchSpaceZoo.rst#ENASMacroLayer>`__
* `NAS Bench 201 <./SearchSpaceZoo.rst#nas-bench-201>`__
Using NNI API to Write Your Search Space
----------------------------------------
A programming interface for designing and searching a model is often demanded in two scenarios.

#. When designing a neural network, there may be multiple operation choices on a layer, sub-model, or connection, and it is undetermined which one, or which combination, performs best. An easy way to express the candidate layers or sub-models is therefore needed.
#. When applying NAS to a neural network, a unified way to express the architecture search space is needed, so that trial code does not have to be updated for different search algorithms.

To use NNI NAS, we suggest users first go through `the tutorial of NAS API for building search space <./WriteSearchSpace.rst>`__. A minimal sketch of expressing such a search space is shown below.
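The following sketch (layer sizes and operations are arbitrary placeholders, not taken from the tutorial) shows how ``LayerChoice`` and ``InputChoice`` express candidate operations and candidate connections inside ordinary PyTorch model code:

.. code-block:: python

   import torch
   import torch.nn as nn
   from nni.nas.pytorch.mutables import LayerChoice, InputChoice

   class Net(nn.Module):
       def __init__(self):
           super().__init__()
           # Candidate operations for one layer; the search decides which one to keep.
           self.conv = LayerChoice([
               nn.Conv2d(3, 16, 3, padding=1),
               nn.Conv2d(3, 16, 5, padding=2),
           ])
           # Choose one of two candidate inputs (e.g., whether to take a skip connection).
           self.skip = InputChoice(n_candidates=2, n_chosen=1)
           self.pool = nn.AdaptiveAvgPool2d(1)
           self.fc = nn.Linear(16, 10)

       def forward(self, x):
           x0 = self.conv(x)
           x1 = torch.relu(x0)
           chosen = self.skip([x0, x1])  # the search picks which tensor flows onward
           return self.fc(self.pool(chosen).flatten(1))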
NAS Visualization
-----------------
To help users track the process and status of how the model is searched under the specified search space, we developed a visualization tool. It visualizes the search space as a super-net and shows the importance of subnets and layers/operations, as well as how the importance changes along the search process. Please refer to `the document of NAS visualization <./Visualization.rst>`__ for how to use it.
Reference and Feedback
----------------------
* `Report a bug <https://github.com/microsoft/nni/issues/new?template=bug-report.rst>`__ for this feature on GitHub;
* `File a feature or improvement request <https://github.com/microsoft/nni/issues/new?template=enhancement.rst>`__ for this feature on GitHub.
P-DARTS
=======
Examples
--------
:githublink:`Example code <examples/nas/pdarts>`
.. code-block:: bash
   # In case the NNI code is not cloned yet. If it is already cloned, skip this line and enter the code folder.
   git clone https://github.com/Microsoft/nni.git

   # Search for the best architecture
   cd examples/nas/pdarts
   python3 search.py

   # Train the best architecture; this is the same process as DARTS.
   cd ../darts
   python3 retrain.py --arc-checkpoint ../pdarts/checkpoints/epoch_2.json
ProxylessNAS on NNI
===================
Introduction
------------
The paper `ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware <https://arxiv.org/pdf/1812.00332.pdf>`__ removes the proxy and directly learns the architectures for large-scale target tasks and target hardware platforms. It addresses the high memory consumption issue of differentiable NAS and reduces the computational cost to the same level as regular training, while still allowing a large candidate set. Please refer to the paper for details.
Usage
-----
To use the ProxylessNAS training/searching approach, users need to specify the search space in their model using the `NNI NAS interface <NasGuide.rst>`__\ , e.g., ``LayerChoice``\ , ``InputChoice``. After defining and instantiating the model, the remaining work can be left to ``ProxylessNasTrainer`` by instantiating the trainer and passing the model to it.
.. code-block:: python
   from nni.algorithms.nas.pytorch.proxylessnas import ProxylessNasTrainer

   trainer = ProxylessNasTrainer(model,
                                 model_optim=optimizer,
                                 train_loader=data_provider.train,
                                 valid_loader=data_provider.valid,
                                 device=device,
                                 warmup=True,
                                 ckpt_path=args.checkpoint_path,
                                 arch_path=args.arch_path)
   trainer.train()
   trainer.export(args.arch_path)
The complete example code can be found :githublink:`here <examples/nas/proxylessnas>`.
**Input arguments of ProxylessNasTrainer**
* **model** (*PyTorch model, required*\ ) - The model that users want to tune/search. It has mutables to specify search space.
* **model_optim** (*PyTorch optimizer, required*\ ) - The optimizer used to train the model.
* **device** (*device, required*\ ) - The device(s) used for training/searching. The trainer applies data parallelism on the model for users.
* **train_loader** (*PyTorch data loader, required*\ ) - The data loader for training set.
* **valid_loader** (*PyTorch data loader, required*\ ) - The data loader for validation set.
* **label_smoothing** (*float, optional, default = 0.1*\ ) - The degree of label smoothing.
* **n_epochs** (*int, optional, default = 120*\ ) - The number of epochs to train/search.
* **init_lr** (*float, optional, default = 0.025*\ ) - The initial learning rate for training the model.
* **binary_mode** (*'two', 'full', or 'full_v2', optional, default = 'full_v2'*\ ) - The forward/backward mode for the binary weights in the mutator. 'full' means forwarding all the candidate ops, 'two' means forwarding only two sampled ops, and 'full_v2' means recomputing the inactive ops during backward.
* **arch_init_type** (*'normal' or 'uniform', optional, default = 'normal'*\ ) - The way to init architecture parameters.
* **arch_init_ratio** (*float, optional, default = 1e-3*\ ) - The ratio to init architecture parameters.
* **arch_optim_lr** (*float, optional, default = 1e-3*\ ) - The learning rate of the architecture parameters optimizer.
* **arch_weight_decay** (*float, optional, default = 0*\ ) - Weight decay of the architecture parameters optimizer.
* **grad_update_arch_param_every** (*int, optional, default = 5*\ ) - Update architecture weights every this number of minibatches.
* **grad_update_steps** (*int, optional, default = 1*\ ) - During each update of architecture weights, the number of steps to train architecture weights.
* **warmup** (*bool, optional, default = True*\ ) - Whether to do warmup.
* **warmup_epochs** (*int, optional, default = 25*\ ) - The number of warmup epochs.
* **arch_valid_frequency** (*int, optional, default = 1*\ ) - The frequency of printing validation result.
* **load_ckpt** (*bool, optional, default = False*\ ) - Whether to load checkpoint.
* **ckpt_path** (*str, optional, default = None*\ ) - The checkpoint path. If ``load_ckpt`` is True, ``ckpt_path`` cannot be None.
* **arch_path** (*str, optional, default = None*\ ) - The path to store chosen architecture.
Implementation
--------------
The implementation on NNI is based on the `official implementation <https://github.com/mit-han-lab/ProxylessNAS>`__. The official implementation supports two training approaches, gradient descent and RL-based, and supports different target hardware, including 'mobile', 'cpu', 'gpu8', and 'flops'. Our current implementation on NNI supports the gradient descent training approach but does not yet support different hardware targets. Complete support is ongoing.
Below we will describe implementation details. Like other one-shot NAS algorithms on NNI, ProxylessNAS is composed of two parts: *search space* and *training approach*. For users to flexibly define their own search space and use built-in ProxylessNAS training approach, we put the specified search space in :githublink:`example code <examples/nas/proxylessnas>` using :githublink:`NNI NAS interface <src/sdk/pynni/nni/nas/pytorch/proxylessnas>`.
.. image:: ../../img/proxylessnas.png
:target: ../../img/proxylessnas.png
:alt:
The ProxylessNAS training approach is composed of ``ProxylessNasMutator`` and ``ProxylessNasTrainer``. ``ProxylessNasMutator`` instantiates a MixedOp for each mutable (i.e., ``LayerChoice``\ ) and manages the architecture weights in the MixedOp. **For DataParallel**\ , architecture weights should be included in the user model. Specifically, in the ProxylessNAS implementation, we add the MixedOp to the corresponding mutable (i.e., ``LayerChoice``\ ) as a member variable. The mutator also exposes two member functions, ``arch_requires_grad`` and ``arch_disable_grad``\ , for the trainer to control the training of architecture weights; a schematic sketch of this alternation follows below.
``ProxylessNasMutator`` also implements the forward logic of the mutables (i.e., ``LayerChoice``\ ).
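The sketch below is schematic only, not the actual trainer code: it assumes a model, a constructed mutator, data loaders, and two optimizers, and illustrates how the two member functions could be used to alternate between weight updates and architecture-weight updates (cf. ``grad_update_arch_param_every`` above):

.. code-block:: python

   def train_one_epoch(model, mutator, loader, w_optim, arch_optim,
                       criterion, update_arch_every=5):
       for step, (x, y) in enumerate(loader):
           # Regular weight update: architecture weights stay frozen.
           mutator.reset()                   # sample active ops for this minibatch
           w_optim.zero_grad()
           loss = criterion(model(x), y)
           loss.backward()
           w_optim.step()

           # Periodically unfreeze and update the architecture weights.
           if step % update_arch_every == 0:
               mutator.arch_requires_grad()  # enable gradients on architecture weights
               arch_optim.zero_grad()
               mutator.reset()
               loss = criterion(model(x), y)
               loss.backward()
               arch_optim.step()
               mutator.arch_disable_grad()   # freeze them again for weight steps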
Reproduce Results
-----------------
To reproduce the result, we first run the search. We found that although it runs for many epochs, the chosen architecture converges within the first several epochs. This is probably caused by the hyper-parameters or the implementation; we are working on it. The test accuracy of the found architecture is top-1: 72.31, top-5: 90.26.