the source code of NNI for DCU

Commit 1011377c authored Mar 31, 2022 by qianyj
20 changed files
--- a/docs/en_US/Compression/v2_pruning.rst
+++ b/docs/en_US/Compression/v2_pruning.rst
+Pruning V2
+==========
+Pruning V2 is a refactoring of the old version and provides more powerful functions.
+Compared with the old version, the iterative pruning process is detached from the pruner, and the pruner is only responsible for pruning and generating the masks once.
+In addition, Pruning V2 unifies the pruning process and allows pruning components to be combined more freely.
+The task generator only cares about the pruning effect that should be achieved in each round, and uses a config list to express how to prune in the next step.
+The pruner is reset with the model and config list given by the task generator, and then generates the masks for the current step.
+For a clearer view of the structure, please refer to the figure below.
+.. image:: ../../img/pruning_process.png
+   :target: ../../img/pruning_process.png
+   :alt:
+In V2, a pruning process is usually driven by a pruning scheduler, which contains a specific pruner and a task generator.
+But users can also use a pruner directly, as in pruning V1.
+For details, please refer to the following tutorials:
+..  toctree::
+    :maxdepth: 2
+    Pruning Algorithms <v2_pruning_algo>
+    Pruning Scheduler <v2_scheduler>
+    Pruning Config List <v2_pruning_config_list>
--- a/docs/en_US/Compression/v2_pruning_algo.rst
+++ b/docs/en_US/Compression/v2_pruning_algo.rst
+Supported Pruning Algorithms in NNI
+===================================
+NNI provides several pruning algorithms reproduced from papers. In pruning V2, NNI splits each pruning algorithm into more detailed components.
+This means users can freely combine components from different algorithms,
+or easily use a component of their own implementation to replace a step in the original algorithm to implement their own pruning algorithm.
+Right now, algorithms that define how to generate masks in one step are implemented as pruners,
+and algorithms that define how to schedule sparsity across iterations are implemented as iterative pruners.
+**Pruner**
+* `Level Pruner <#level-pruner>`__
+* `L1 Norm Pruner <#l1-norm-pruner>`__
+* `L2 Norm Pruner <#l2-norm-pruner>`__
+* `FPGM Pruner <#fpgm-pruner>`__
+* `Slim Pruner <#slim-pruner>`__
+* `Activation APoZ Rank Pruner <#activation-apoz-rank-pruner>`__
+* `Activation Mean Rank Pruner <#activation-mean-rank-pruner>`__
+* `Taylor FO Weight Pruner <#taylor-fo-weight-pruner>`__
+* `ADMM Pruner <#admm-pruner>`__
+* `Movement Pruner <#movement-pruner>`__
+**Iterative Pruner**
+* `Linear Pruner <#linear-pruner>`__
+* `AGP Pruner <#agp-pruner>`__
+* `Lottery Ticket Pruner <#lottery-ticket-pruner>`__
+* `Simulated Annealing Pruner <#simulated-annealing-pruner>`__
+* `Auto Compress Pruner <#auto-compress-pruner>`__
+* `AMC Pruner <#amc-pruner>`__
+Level Pruner
+------------
+This is a basic pruner; some papers call it magnitude pruning or fine-grained pruning.
+It masks the weights with the smallest absolute values in each specified layer, at the ratio configured in the config list.
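The idea can be illustrated with a minimal pure-Python sketch. It is not NNI's implementation: ``level_mask`` is a hypothetical helper that works on a flat list of weights and only shows how a magnitude-based mask is derived from a sparsity ratio.

```python
def level_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the `sparsity` fraction of smallest-|w| entries."""
    num_pruned = round(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:num_pruned])
    return [0 if i in pruned else 1 for i in range(len(weights))]


weights = [0.5, -0.1, 0.03, -0.9, 0.2]
mask = level_mask(weights, sparsity=0.6)
masked_weights = [w * m for w, m in zip(weights, mask)]
```

With a sparsity of 0.6, the three entries with the smallest magnitudes are masked and the two largest are kept.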
+Usage
+^^^^^^
+.. code-block:: python
+   from nni.algorithms.compression.v2.pytorch.pruning import LevelPruner
+   config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
+   pruner = LevelPruner(model, config_list)
+   masked_model, masks = pruner.compress()
+For a detailed example, please refer to :githublink:`examples/model_compress/pruning/v2/level_pruning_torch.py <examples/model_compress/pruning/v2/level_pruning_torch.py>`.
+User configuration for Level Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+**PyTorch**
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.LevelPruner
+L1 Norm Pruner
+--------------
+L1 norm pruner computes the L1 norm of the layer weight on the first dimension,
+then prunes the weight blocks on this dimension with the smallest L1 norm values.
+That is, it uses the L1 norm of each filter as the metric value in a convolution layer,
+and the L1 norm of each weight row as the metric value in a linear layer.
+For more details, please refer to `PRUNING FILTERS FOR EFFICIENT CONVNETS <https://arxiv.org/abs/1608.08710>`__\.
+In addition, L1 norm pruner also supports dependency-aware mode.
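As a rough illustration of the metric, the sketch below computes the L1 norm of each block along the first dimension and masks the blocks with the smallest norms. ``l1_filter_mask`` is a hypothetical helper operating on nested Python lists; it is not part of NNI.

```python
def l1_filter_mask(filters, sparsity):
    """filters[i] is the flattened weight block of filter (or row) i."""
    norms = [sum(abs(w) for w in block) for block in filters]
    num_pruned = round(len(filters) * sparsity)
    # Blocks with the smallest L1 norm are pruned first.
    order = sorted(range(len(filters)), key=lambda i: norms[i])
    pruned = set(order[:num_pruned])
    mask = [0 if i in pruned else 1 for i in range(len(filters))]
    return norms, mask


filters = [[0.5, -0.5], [0.1, 0.0], [-1.0, 2.0], [0.2, -0.1]]
norms, mask = l1_filter_mask(filters, sparsity=0.5)
```

Here the second and fourth blocks have the smallest norms, so they are the ones masked.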
+Usage
+^^^^^^
+.. code-block:: python
+   from nni.algorithms.compression.v2.pytorch.pruning import L1NormPruner
+   config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
+   pruner = L1NormPruner(model, config_list)
+   masked_model, masks = pruner.compress()
+For a detailed example, please refer to :githublink:`examples/model_compress/pruning/v2/norm_pruning_torch.py <examples/model_compress/pruning/v2/norm_pruning_torch.py>`.
+User configuration for L1 Norm Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+**PyTorch**
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.L1NormPruner
+L2 Norm Pruner
+--------------
+L2 norm pruner is a variant of L1 norm pruner. It uses l2 norm as metric to determine which weight elements should be pruned.
+L2 norm pruner also supports dependency-aware mode.
+Usage
+^^^^^^
+.. code-block:: python
+   from nni.algorithms.compression.v2.pytorch.pruning import L2NormPruner
+   config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
+   pruner = L2NormPruner(model, config_list)
+   masked_model, masks = pruner.compress()
+For a detailed example, please refer to :githublink:`examples/model_compress/pruning/v2/norm_pruning_torch.py <examples/model_compress/pruning/v2/norm_pruning_torch.py>`.
+User configuration for L2 Norm Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+**PyTorch**
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.L2NormPruner
+FPGM Pruner
+-----------
+FPGM pruner prunes the blocks of the weight on the first dimension with the smallest geometric median.
+FPGM chooses the weight blocks with the most replaceable contribution.
+For more details, please refer to `Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration <https://arxiv.org/abs/1811.00250>`__.
+FPGM pruner also supports dependency-aware mode.
+Usage
+^^^^^^
+.. code-block:: python
+   from nni.algorithms.compression.v2.pytorch.pruning import FPGMPruner
+   config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
+   pruner = FPGMPruner(model, config_list)
+   masked_model, masks = pruner.compress()
+For a detailed example, please refer to :githublink:`examples/model_compress/pruning/v2/fpgm_pruning_torch.py <examples/model_compress/pruning/v2/fpgm_pruning_torch.py>`.
+User configuration for FPGM Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+**PyTorch**
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.FPGMPruner
+Slim Pruner
+-----------
+Slim pruner adds sparsity regularization on the scaling factors of batch normalization (BN) layers during training to identify unimportant channels.
+The channels with small scaling factor values will be pruned.
+For more details, please refer to `Learning Efficient Convolutional Networks through Network Slimming <https://arxiv.org/abs/1708.06519>`__\.
+Usage
+^^^^^^
+.. code-block:: python
+   import nni
+   import torch
+   from nni.algorithms.compression.v2.pytorch.pruning import SlimPruner
+   # make sure you have used nni.trace to wrap the optimizer class before initializing it
+   traced_optimizer = nni.trace(torch.optim.Adam)(model.parameters())
+   config_list = [{ 'sparsity': 0.8, 'op_types': ['BatchNorm2d'] }]
+   pruner = SlimPruner(model, config_list, trainer, traced_optimizer, criterion, training_epochs=1)
+   masked_model, masks = pruner.compress()
+For a detailed example, please refer to :githublink:`examples/model_compress/pruning/v2/slim_pruning_torch.py <examples/model_compress/pruning/v2/slim_pruning_torch.py>`.
+User configuration for Slim Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+**PyTorch**
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.SlimPruner
+Activation APoZ Rank Pruner
+---------------------------
+Activation APoZ rank pruner prunes along the first weight dimension,
+removing the blocks with the smallest importance according to the ``APoZ`` criterion, which is calculated from the output activations of convolution layers, to achieve a preset level of network sparsity.
+The pruning criterion ``APoZ`` is explained in the paper `Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures <https://arxiv.org/abs/1607.03250>`__.
+The APoZ is defined as:
+:math:`APoZ_{c}^{(i)} = APoZ\left(O_{c}^{(i)}\right)=\frac{\sum_{k}^{N} \sum_{j}^{M} f\left(O_{c, j}^{(i)}(k)=0\right)}{N \times M}`
+Activation APoZ rank pruner also supports dependency-aware mode.
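The definition above can be sketched for a single channel as follows. This is an illustrative pure-Python helper (``apoz`` is a hypothetical name, not an NNI function), where ``activations[k][j]`` is the output of the channel for sample ``k`` at position ``j``.

```python
def apoz(activations):
    """Fraction of zero outputs of one channel over N samples and M positions."""
    total = sum(len(sample) for sample in activations)
    zeros = sum(1 for sample in activations for value in sample if value == 0)
    return zeros / total


# N = 2 samples, M = 4 output positions for one channel (e.g. after ReLU).
channel_outputs = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.7, 0.0],
]
score = apoz(channel_outputs)
```

Five of the eight outputs are zero, so the APoZ of this channel is 5/8.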
+Usage
+^^^^^^
+.. code-block:: python
+   import nni
+   import torch
+   from nni.algorithms.compression.v2.pytorch.pruning import ActivationAPoZRankPruner
+   # make sure you have used nni.trace to wrap the optimizer class before initializing it
+   traced_optimizer = nni.trace(torch.optim.Adam)(model.parameters())
+   config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
+   pruner = ActivationAPoZRankPruner(model, config_list, trainer, traced_optimizer, criterion, training_batches=20)
+   masked_model, masks = pruner.compress()
+For a detailed example, please refer to :githublink:`examples/model_compress/pruning/v2/activation_pruning_torch.py <examples/model_compress/pruning/v2/activation_pruning_torch.py>`.
+User configuration for Activation APoZ Rank Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+**PyTorch**
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.ActivationAPoZRankPruner
+Activation Mean Rank Pruner
+---------------------------
+Activation mean rank pruner prunes along the first weight dimension,
+removing the blocks with the smallest importance according to the ``mean activation`` criterion, which is calculated from the output activations of convolution layers, to achieve a preset level of network sparsity.
+The pruning criterion ``mean activation`` is explained in section 2.2 of the paper `Pruning Convolutional Neural Networks for Resource Efficient Inference <https://arxiv.org/abs/1611.06440>`__.
+Activation mean rank pruner also supports dependency-aware mode.
+Usage
+^^^^^^
+.. code-block:: python
+   import nni
+   import torch
+   from nni.algorithms.compression.v2.pytorch.pruning import ActivationMeanRankPruner
+   # make sure you have used nni.trace to wrap the optimizer class before initializing it
+   traced_optimizer = nni.trace(torch.optim.Adam)(model.parameters())
+   config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
+   pruner = ActivationMeanRankPruner(model, config_list, trainer, traced_optimizer, criterion, training_batches=20)
+   masked_model, masks = pruner.compress()
+For a detailed example, please refer to :githublink:`examples/model_compress/pruning/v2/activation_pruning_torch.py <examples/model_compress/pruning/v2/activation_pruning_torch.py>`.
+User configuration for Activation Mean Rank Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+**PyTorch**
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.ActivationMeanRankPruner
+Taylor FO Weight Pruner
+-----------------------
+Taylor FO weight pruner prunes along the first weight dimension,
+based on an estimated importance calculated from the first-order Taylor expansion on weights, to achieve a preset level of network sparsity.
+The estimated importance is defined as in the paper `Importance Estimation for Neural Network Pruning <http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf>`__.
+:math:`\widehat{\mathcal{I}}_{\mathcal{S}}^{(1)}(\mathbf{W}) \triangleq \sum_{s \in \mathcal{S}} \mathcal{I}_{s}^{(1)}(\mathbf{W})=\sum_{s \in \mathcal{S}}\left(g_{s} w_{s}\right)^{2}`
+Taylor FO weight pruner also supports dependency-aware mode.
+What's more, we provide a global-sort mode for this pruner, which is aligned with the paper's implementation.
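The importance formula above can be evaluated for one group of weights as in the sketch below. ``taylor_fo_importance`` is a hypothetical helper, and the weight and gradient values are made up for illustration.

```python
def taylor_fo_importance(weights, grads):
    """Sum of (g_s * w_s)^2 over the weights s in one group S."""
    return sum((g * w) ** 2 for g, w in zip(grads, weights))


# One filter with three weights and their gradients (illustrative values).
filter_weights = [0.5, -1.0, 2.0]
filter_grads = [0.2, 0.1, -0.5]
importance = taylor_fo_importance(filter_weights, filter_grads)
```

The three products are 0.1, -0.1 and -1.0, so the importance is approximately 0.01 + 0.01 + 1.0 = 1.02.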
+Usage
+^^^^^^
+.. code-block:: python
+   import nni
+   import torch
+   from nni.algorithms.compression.v2.pytorch.pruning import TaylorFOWeightPruner
+   # make sure you have used nni.trace to wrap the optimizer class before initializing it
+   traced_optimizer = nni.trace(torch.optim.Adam)(model.parameters())
+   config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
+   pruner = TaylorFOWeightPruner(model, config_list, trainer, traced_optimizer, criterion, training_batches=20)
+   masked_model, masks = pruner.compress()
+For a detailed example, please refer to :githublink:`examples/model_compress/pruning/v2/taylorfo_pruning_torch.py <examples/model_compress/pruning/v2/taylorfo_pruning_torch.py>`.
+User configuration for Taylor FO Weight Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+**PyTorch**
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.TaylorFOWeightPruner
+ADMM Pruner
+-----------
+Alternating Direction Method of Multipliers (ADMM) is a mathematical optimization technique
+that decomposes the original nonconvex problem into two subproblems that can be solved iteratively.
+In the weight pruning problem, these two subproblems are solved via 1) gradient descent and 2) Euclidean projection, respectively.
+While these two subproblems are being solved, the weights of the original model are changed.
+Fine-grained pruning is then applied to prune the model according to the given config list.
+This solution framework applies both to non-structured and different variations of structured pruning schemes.
+For more details, please refer to `A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers <https://arxiv.org/abs/1804.03294>`__.
+Usage
+^^^^^^
+.. code-block:: python
+   import nni
+   import torch
+   from nni.algorithms.compression.v2.pytorch.pruning import ADMMPruner
+   # make sure you have used nni.trace to wrap the optimizer class before initializing it
+   traced_optimizer = nni.trace(torch.optim.Adam)(model.parameters())
+   config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
+   pruner = ADMMPruner(model, config_list, trainer, traced_optimizer, criterion, iterations=10, training_epochs=1)
+   masked_model, masks = pruner.compress()
+For a detailed example, please refer to :githublink:`examples/model_compress/pruning/v2/admm_pruning_torch.py <examples/model_compress/pruning/v2/admm_pruning_torch.py>`.
+User configuration for ADMM Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+**PyTorch**
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.ADMMPruner
+Movement Pruner
+---------------
+Movement pruner is an implementation of movement pruning.
+This is a "fine-pruning" algorithm, which means the masks may change during each fine-tuning step.
+Each weight element is scored by the negative of the accumulated product of the weight and its gradient over the training steps.
+This means weight elements moving towards zero accumulate negative scores, while weight elements moving away from zero accumulate positive scores.
+The weight elements with low scores will be masked during inference.
+The following figure from the paper shows the weight pruning by movement pruning.
+.. image:: ../../img/movement_pruning.png
+   :target: ../../img/movement_pruning.png
+   :alt: 
+For more details, please refer to `Movement Pruning: Adaptive Sparsity by Fine-Tuning <https://arxiv.org/abs/2005.07683>`__.
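The score accumulation described above can be sketched in a few lines. This is a toy illustration, not NNI's implementation: ``accumulate_movement_scores`` is a hypothetical helper, the gradients are made up, and a plain SGD update with an assumed learning rate is used only to move the weights.

```python
def accumulate_movement_scores(scores, weights, grads):
    """One accumulation step: score <- score - weight * gradient (element-wise)."""
    return [s - w * g for s, w, g in zip(scores, weights, grads)]


weights = [0.8, -0.6, 0.4]
scores = [0.0, 0.0, 0.0]
learning_rate = 0.1  # assumed value, for illustration only

for grads in ([1.0, 1.0, -1.0], [1.0, 1.0, -1.0]):
    scores = accumulate_movement_scores(scores, weights, grads)
    # Plain SGD step: the first weight moves towards zero, the other two away from it.
    weights = [w - learning_rate * g for w, g in zip(weights, grads)]
```

After the two steps, the first weight (shrinking towards zero) has a negative score, while the other two (growing in magnitude) have positive scores.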
+Usage
+^^^^^^
+.. code-block:: python
+   import nni
+   import torch
+   from nni.algorithms.compression.v2.pytorch.pruning import MovementPruner
+   # make sure you have used nni.trace to wrap the optimizer class before initializing it
+   traced_optimizer = nni.trace(torch.optim.Adam)(model.parameters())
+   config_list = [{'op_types': ['Linear'], 'op_partial_names': ['bert.encoder'], 'sparsity': 0.9}]
+   pruner = MovementPruner(model, config_list, trainer, traced_optimizer, criterion, training_epochs=10, warm_up_step=3000, cool_down_beginning_step=27000)
+   masked_model, masks = pruner.compress()
+User configuration for Movement Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+**PyTorch**
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.MovementPruner
+Reproduced Experiment
+^^^^^^^^^^^^^^^^^^^^^
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+   * - Model
+     - Dataset
+     - Remaining Weights
+     - MaP acc.(paper/ours)
+     - MvP acc.(paper/ours)
+   * - Bert base
+     - MNLI - Dev
+     - 10%
+     - 77.8% / 73.6%
+     - 79.3% / 78.8%
+Linear Pruner
+-------------
+Linear pruner is an iterative pruner that increases the sparsity evenly, starting from zero, over the iterations.
+For example, if the final sparsity is set to 0.5 and the iteration number is 5, the sparsities used across the iterations are ``[0, 0.1, 0.2, 0.3, 0.4, 0.5]``.
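The schedule in this example can be reproduced with a short sketch. ``linear_sparsity_schedule`` is a hypothetical helper that mirrors the numbers in the text; it is not taken from NNI's task generator.

```python
def linear_sparsity_schedule(final_sparsity, total_iteration):
    """Evenly spaced sparsities from 0 up to final_sparsity (inclusive)."""
    return [final_sparsity * i / total_iteration for i in range(total_iteration + 1)]


schedule = linear_sparsity_schedule(final_sparsity=0.5, total_iteration=5)
```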
+Usage
+^^^^^^
+.. code-block:: python
+   from nni.algorithms.compression.v2.pytorch.pruning import LinearPruner
+   config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
+   pruner = LinearPruner(model, config_list, pruning_algorithm='l1', total_iteration=10, finetuner=finetuner)
+   pruner.compress()
+   _, model, masks, _, _ = pruner.get_best_result()
+For a detailed example, please refer to :githublink:`examples/model_compress/pruning/v2/iterative_pruning_torch.py <examples/model_compress/pruning/v2/iterative_pruning_torch.py>`.
+User configuration for Linear Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+**PyTorch**
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.LinearPruner
+AGP Pruner
+----------
+This is an iterative pruner in which the sparsity is increased from an initial sparsity value :math:`s_{i}` (usually 0) to a final sparsity value :math:`s_{f}` over a span of :math:`n` pruning iterations,
+starting at training step :math:`t_{0}` and with pruning frequency :math:`\Delta t`:
+:math:`s_{t}=s_{f}+\left(s_{i}-s_{f}\right)\left(1-\frac{t-t_{0}}{n \Delta t}\right)^{3} \text { for } t \in\left\{t_{0}, t_{0}+\Delta t, \ldots, t_{0} + n \Delta t\right\}`
+For more details please refer to `To prune, or not to prune: exploring the efficacy of pruning for model compression <https://arxiv.org/abs/1710.01878>`__\.
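The AGP formula can be evaluated directly, as in the sketch below. ``agp_sparsity`` is a hypothetical helper whose argument names follow the symbols in the formula; the chosen values (``n=4``, ``delta_t=100``) are for illustration only.

```python
def agp_sparsity(t, s_i, s_f, t_0, n, delta_t):
    """Sparsity at training step t according to the cubic AGP schedule."""
    progress = (t - t_0) / (n * delta_t)
    return s_f + (s_i - s_f) * (1 - progress) ** 3


# Evaluate the schedule at t_0, t_0 + delta_t, ..., t_0 + n * delta_t.
schedule = [
    agp_sparsity(t, s_i=0.0, s_f=0.8, t_0=0, n=4, delta_t=100)
    for t in range(0, 401, 100)
]
```

The schedule starts at the initial sparsity, ends at the final sparsity, and rises quickly at first before flattening out.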
+Usage
+^^^^^^
+.. code-block:: python
+   from nni.algorithms.compression.v2.pytorch.pruning import AGPPruner
+   config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
+   pruner = AGPPruner(model, config_list, pruning_algorithm='l1', total_iteration=10, finetuner=finetuner)
+   pruner.compress()
+   _, model, masks, _, _ = pruner.get_best_result()
+For a detailed example, please refer to :githublink:`examples/model_compress/pruning/v2/iterative_pruning_torch.py <examples/model_compress/pruning/v2/iterative_pruning_torch.py>`.
+User configuration for AGP Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+**PyTorch**
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.AGPPruner
+Lottery Ticket Pruner
+---------------------
+In `The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks <https://arxiv.org/abs/1803.03635>`__\ ,
+the authors Jonathan Frankle and Michael Carbin provide comprehensive measurement and analysis,
+and articulate the *lottery ticket hypothesis*\ : dense, randomly-initialized, feed-forward networks contain subnetworks (*winning tickets*\ ) that
+-- when trained in isolation -- reach test accuracy comparable to the original network in a similar number of iterations.
+In this paper, the authors use the following process to prune a model, called *iterative pruning*\ :
+..
+   #. Randomly initialize a neural network f(x;theta_0) (where theta_0 follows D_{theta}).
+   #. Train the network for j iterations, arriving at parameters theta_j.
+   #. Prune p% of the parameters in theta_j, creating a mask m.
+   #. Reset the remaining parameters to their values in theta_0, creating the winning ticket f(x;m*theta_0).
+   #. Repeat steps 2, 3, and 4.
+If the configured final sparsity is P (e.g., 0.8) and there are n pruning iterations,
+each iteration prunes 1-(1-P)^(1/n) of the weights that survive the previous round.
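A quick numerical check of this per-round ratio: pruning the same fraction of the surviving weights in each of the n rounds should leave exactly 1-P of the weights at the end. The helper name ``per_round_prune_ratio`` is hypothetical.

```python
def per_round_prune_ratio(final_sparsity, rounds):
    """Fraction of the surviving weights pruned in each round."""
    return 1 - (1 - final_sparsity) ** (1 / rounds)


ratio = per_round_prune_ratio(final_sparsity=0.8, rounds=5)

# Apply the same ratio five times to the weights that survive each round.
remaining = 1.0
for _ in range(5):
    remaining *= 1 - ratio
```

With P = 0.8 and n = 5, each round prunes roughly 27.5% of the surviving weights, and about 20% of the original weights remain after the last round.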
+Usage
+^^^^^^
+.. code-block:: python
+   from nni.algorithms.compression.v2.pytorch.pruning import LotteryTicketPruner
+   config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
+   pruner = LotteryTicketPruner(model, config_list, pruning_algorithm='l1', total_iteration=10, finetuner=finetuner, reset_weight=True)
+   pruner.compress()
+   _, model, masks, _, _ = pruner.get_best_result()
+For a detailed example, please refer to :githublink:`examples/model_compress/pruning/v2/iterative_pruning_torch.py <examples/model_compress/pruning/v2/iterative_pruning_torch.py>`.
+User configuration for Lottery Ticket Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+**PyTorch**
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.LotteryTicketPruner
+Simulated Annealing Pruner
+--------------------------
+We implement a guided heuristic search method, the Simulated Annealing (SA) algorithm. As mentioned in the paper, this method is enhanced with guided search based on prior experience.
+The enhanced SA technique is based on the observation that a DNN layer with a larger number of weights can often be compressed to a higher degree with less impact on overall accuracy.
+* Randomly initialize a pruning rate distribution (sparsities).
+* While current_temperature > stop_temperature:
+  #. Generate a perturbation to the current distribution.
+  #. Perform a fast evaluation on the perturbed distribution.
+  #. Accept the perturbation according to the performance and probability; if it is not accepted, return to step 1.
+  #. Cool down: current_temperature <- current_temperature * cool_down_rate.
+For more details, please refer to `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates <https://arxiv.org/abs/1907.03141>`__.
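The search loop can be sketched in plain Python as below. This is a toy version under stated assumptions, not NNI's ``SimulatedAnnealingPruner``: ``toy_score`` is a made-up stand-in for the fast evaluation, the acceptance rule is the standard Metropolis criterion, the default temperatures and perturbation size are arbitrary, and the loop runs while the temperature is still above the stop temperature.

```python
import math
import random


def toy_score(sparsities):
    # Hypothetical fast evaluation: higher is better, peaks at 0.5 per layer.
    return -sum((s - 0.5) ** 2 for s in sparsities)


def anneal(evaluate, num_layers, start_temperature=100.0, stop_temperature=20.0,
           cool_down_rate=0.9, perturbation=0.1, seed=0):
    rng = random.Random(seed)
    # Randomly initialize a pruning rate distribution (one sparsity per layer).
    current = [rng.random() for _ in range(num_layers)]
    current_score = evaluate(current)
    best, best_score = current, current_score
    temperature = start_temperature
    while temperature > stop_temperature:
        # 1. Perturb the current distribution, keeping each sparsity in [0, 1].
        candidate = [min(1.0, max(0.0, s + rng.uniform(-perturbation, perturbation)))
                     for s in current]
        # 2. Fast evaluation of the perturbed distribution.
        candidate_score = evaluate(candidate)
        # 3. Always accept improvements; accept worse candidates with a
        #    temperature-dependent probability, otherwise go back to step 1.
        delta = candidate_score - current_score
        if delta < 0 and rng.random() >= math.exp(delta / temperature):
            continue
        current, current_score = candidate, candidate_score
        if current_score > best_score:
            best, best_score = current, current_score
        # 4. Cool down.
        temperature *= cool_down_rate
    return best, best_score


best, best_score = anneal(toy_score, num_layers=3)
```

In a real setting the evaluation would prune the model with the candidate sparsities and measure its accuracy, which is far more expensive than ``toy_score``.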
+Usage
+^^^^^^
+.. code-block:: python
+   from nni.algorithms.compression.v2.pytorch.pruning import SimulatedAnnealingPruner
+   config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
+   pruner = SimulatedAnnealingPruner(model, config_list, pruning_algorithm='l1', evaluator=evaluator, cool_down_rate=0.9, finetuner=finetuner)
+   pruner.compress()
+   _, model, masks, _, _ = pruner.get_best_result()
+For a detailed example, please refer to :githublink:`examples/model_compress/pruning/v2/simulated_anealing_pruning_torch.py <examples/model_compress/pruning/v2/simulated_anealing_pruning_torch.py>`.
+User configuration for Simulated Annealing Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+**PyTorch**
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.SimulatedAnnealingPruner
+Auto Compress Pruner
+--------------------
+For a total iteration number :math:`N`, AutoCompressPruner prunes, in each iteration, the model that survived the previous iteration at a fixed sparsity ratio (e.g., :math:`1-{(1-0.8)}^{(1/N)}`) to achieve the overall sparsity (e.g., :math:`0.8`). Each iteration performs the following steps:
+.. code-block:: bash
+       1. Generate sparsities distribution using SimulatedAnnealingPruner
+       2. Perform ADMM-based pruning to generate pruning result for the next iteration.
+For more details, please refer to `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates <https://arxiv.org/abs/1907.03141>`__.
+Usage
+^^^^^^
+.. code-block:: python
+   import nni
+   import torch
+   from nni.algorithms.compression.v2.pytorch.pruning import AutoCompressPruner
+   # make sure you have used nni.trace to wrap the optimizer class before initializing it
+   traced_optimizer = nni.trace(torch.optim.Adam)(model.parameters())
+   config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
+   admm_params = {
+       'trainer': trainer,
+       'traced_optimizer': traced_optimizer,
+       'criterion': criterion,
+       'iterations': 10,
+       'training_epochs': 1
+   }
+   sa_params = {
+       'evaluator': evaluator
+   }
+   pruner = AutoCompressPruner(model, config_list, 10, admm_params, sa_params, finetuner=finetuner)
+   pruner.compress()
+   _, model, masks, _, _ = pruner.get_best_result()
+The full script can be found :githublink:`here <examples/model_compress/pruning/v2/auto_compress_pruner.py>`.
+User configuration for Auto Compress Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+**PyTorch**
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.AutoCompressPruner
+AMC Pruner
+----------
+AMC pruner leverages reinforcement learning to provide the model compression policy.
+According to the authors, this learning-based compression policy outperforms conventional rule-based compression policies by achieving a higher compression ratio,
+better preserving the accuracy and freeing human labor.
+For more details, please refer to `AMC: AutoML for Model Compression and Acceleration on Mobile Devices <https://arxiv.org/pdf/1802.03494.pdf>`__.
+Usage
+^^^^^
+PyTorch code
+.. code-block:: python
+   from nni.algorithms.compression.v2.pytorch.pruning import AMCPruner
+   config_list = [{'op_types': ['Conv2d'], 'total_sparsity': 0.5, 'max_sparsity_per_layer': 0.8}]
+   pruner = AMCPruner(400, model, config_list, dummy_input, evaluator, finetuner=finetuner)
+   pruner.compress()
+The full script can be found :githublink:`here <examples/model_compress/pruning/v2/amc_pruning_torch.py>`.
+User configuration for AMC Pruner
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+**PyTorch**
+..  autoclass:: nni.algorithms.compression.v2.pytorch.pruning.AMCPruner
--- a/docs/en_US/Compression/v2_pruning_config_list.rst
+++ b/docs/en_US/Compression/v2_pruning_config_list.rst
+Pruning Config Specification
+============================
+The Keys in Config List
+-----------------------
+Each sub-config in the config list is a dict, and the scope of each setting (key) is only internal to each sub-config.
+If multiple sub-configs are configured for the same layer, the later ones will overwrite the previous ones.
+op_types
+^^^^^^^^
+The type of the layers targeted by this sub-config.
+If ``op_names`` is not set in this sub-config, all layers in the model that satisfy the type will be selected.
+If ``op_names`` is set in this sub-config, the selected layers should satisfy both type and name.
+op_names
+^^^^^^^^
+The name of the layers targeted by this sub-config.
+If ``op_types`` is set in this sub-config, the selected layer should satisfy both type and name.
+op_partial_names
+^^^^^^^^^^^^^^^^
+This key selects the layers to be pruned whose names share a common sub-string. NNI goes through all layer names in the model,
+finds the names that contain one of the ``op_partial_names``, and appends them to ``op_names``.
+sparsity_per_layer
+^^^^^^^^^^^^^^^^^^
+The sparsity ratio of each selected layer.
+For example, a ``sparsity_per_layer`` of 0.8 means that 80% of the weight values in each selected layer will be masked.
+If ``layer_1`` (500 parameters) and ``layer_2`` (1000 parameters) are selected in this sub-config,
+then 400 parameters will be masked in ``layer_1`` and 800 parameters will be masked in ``layer_2``.
+total_sparsity
+^^^^^^^^^^^^^^
+The sparsity ratio over all selected layers taken together, which means the sparsity ratio may no longer be even between layers.
+For example, a ``total_sparsity`` of 0.8 means that 80% of the parameters selected by this sub-config will be masked.
+If ``layer_1`` (500 parameters) and ``layer_2`` (1000 parameters) are selected in this sub-config,
+then a total of 1200 parameters will be masked across ``layer_1`` and ``layer_2``.
+How these masked parameters are distributed between the two layers is determined by the pruning algorithm.
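The difference between the two keys comes down to simple arithmetic, shown here with the layer sizes from the examples. The variable names are illustrative only.

```python
layer_sizes = {"layer_1": 500, "layer_2": 1000}

# sparsity_per_layer = 0.8: every selected layer masks the same fraction.
per_layer_masked = {name: round(size * 0.8) for name, size in layer_sizes.items()}

# total_sparsity = 0.8: only the overall number of masked parameters is fixed;
# how it is split between the layers is up to the pruning algorithm.
total_masked = round(sum(layer_sizes.values()) * 0.8)
```

Both settings mask 1200 parameters in total, but only ``sparsity_per_layer`` fixes the 400 / 800 split.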
+sparsity
+^^^^^^^^
+``sparsity`` is an old config key from pruning V1; it has the same meaning as ``sparsity_per_layer``.
+You can still use ``sparsity`` for now, but it will be deprecated in the future.
+max_sparsity_per_layer
+^^^^^^^^^^^^^^^^^^^^^^
+This key is usually used with ``total_sparsity``. It limits the maximum sparsity ratio of each layer.
+In the ``total_sparsity`` example, 1200 parameters need to be masked, so all parameters in ``layer_1`` may end up masked.
+To avoid this situation, ``max_sparsity_per_layer`` can be set to 0.9, which means at most 450 parameters can be masked in ``layer_1``
+and at most 900 parameters can be masked in ``layer_2``.
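The effect of the cap can be checked with a small sketch. ``is_valid_allocation`` is a hypothetical helper that tests whether a given split of the 1200 masked parameters respects the per-layer limit; it does not reflect how NNI allocates sparsity internally.

```python
layer_sizes = {"layer_1": 500, "layer_2": 1000}
total_sparsity = 0.8
max_sparsity_per_layer = 0.9

budget = round(sum(layer_sizes.values()) * total_sparsity)
caps = {name: round(size * max_sparsity_per_layer) for name, size in layer_sizes.items()}


def is_valid_allocation(allocation):
    """True if the allocation uses the whole budget and respects every layer cap."""
    within_caps = all(allocation[name] <= caps[name] for name in layer_sizes)
    return within_caps and sum(allocation.values()) == budget


fully_masked_layer_1 = {"layer_1": 500, "layer_2": 700}
capped = {"layer_1": 450, "layer_2": 750}
```

The first allocation masks all of ``layer_1`` and is rejected by the cap, while the second one stays within both limits.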
+exclude
+^^^^^^^
+The ``exclude`` and ``sparsity`` keywords are mutually exclusive and cannot exist in the same sub-config.
+If ``exclude`` is set in sub-config, the layers selected by this config will not be pruned.
--- a/docs/en_US/Compression/v2_scheduler.rst
+++ b/docs/en_US/Compression/v2_scheduler.rst
+Pruning Scheduler
+=================
+Pruning scheduler is a new feature supported in pruning V2. It brings more flexibility to pruning a model iteratively.
+All the built-in iterative pruners (e.g., AGPPruner, SimulatedAnnealingPruner) are based on three abstracted components: pruning scheduler, pruners and task generators.
+In addition to using the NNI built-in iterative pruners,
+users can directly use the pruning schedulers to customize their own iterative pruning logic.
+Workflow of Pruning Scheduler
+-----------------------------
+In iterative pruning, the final goal is broken down into smaller goals, and one small goal is completed in each iteration.
+For example, each iteration can increase the sparsity ratio a little, so that after several pruning iterations the progressively pruned model reaches the final overall sparsity;
+or the overall sparsity can be fixed while each iteration tries a different way to allocate sparsity between layers, in order to find the best allocation.
+We define a small goal as a ``Task``. It usually includes states inherited from previous iterations (e.g., the pruned model and masks) and a description of the current goal (e.g., a config list that describes how to allocate sparsity).
+Details about ``Task`` can be found in this :githublink:`file <nni/algorithms/compression/v2/pytorch/base/scheduler.py>`.
+Pruning scheduler handles two main components: a basic pruner and a task generator. The logic of generating ``Task`` is encapsulated in the task generator.
+In an iteration (one pruning step), the pruning scheduler parses the ``Task`` obtained from the task generator,
+and resets the pruner with the ``model``, ``masks`` and ``config_list`` parsed from the ``Task``.
+Then the pruning scheduler generates the new masks with the pruner. During an iteration, the new masked model may also go through speed-up, fine-tuning, and evaluation.
+After one iteration is done, the pruning scheduler collects the compact model, new masks and evaluation score, packages them into a ``TaskResult``, and passes it to the task generator.
+The iteration process ends when the task generator has no more ``Task``.
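The control flow described above can be mimicked with a small stand-alone mock. Everything in it (``ToyTask``, ``ToyTaskResult``, ``ToyTaskGenerator``, ``toy_prune_and_evaluate``, ``run_scheduler``) is hypothetical and only imitates the loop between scheduler and task generator; it does not use or reproduce NNI's actual classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTask:
    task_id: int
    sparsity: float  # stands in for the config list of the current goal


@dataclass
class ToyTaskResult:
    task_id: int
    score: float


class ToyTaskGenerator:
    """Hands out one task per iteration with linearly increasing sparsity."""

    def __init__(self, final_sparsity, total_iteration):
        self._pending = [
            ToyTask(i, final_sparsity * i / total_iteration)
            for i in range(1, total_iteration + 1)
        ]
        self.results = []

    def next(self) -> Optional[ToyTask]:
        return self._pending.pop(0) if self._pending else None

    def receive_task_result(self, result):
        self.results.append(result)


def toy_prune_and_evaluate(task):
    # Stand-in for "reset pruner, generate masks, fine-tune, evaluate":
    # here the score simply drops as sparsity grows.
    return ToyTaskResult(task.task_id, score=1.0 - 0.5 * task.sparsity)


def run_scheduler(generator):
    task = generator.next()
    while task is not None:  # stop when the generator has no more tasks
        generator.receive_task_result(toy_prune_and_evaluate(task))
        task = generator.next()
    return max(generator.results, key=lambda result: result.score)


generator = ToyTaskGenerator(final_sparsity=0.8, total_iteration=4)
best = run_scheduler(generator)
```

The mock keeps the generator in charge of what to do next, while the scheduler only runs tasks and reports results back.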
+How to Customize Iterative Pruning
+-----------------------------------
+We use AGP pruning as an example to explain how to implement iterative pruning with the scheduler in NNI.
+.. code-block:: python
+    import torch
+    from nni.algorithms.compression.v2.pytorch.pruning import L1NormPruner, PruningScheduler
+    from nni.algorithms.compression.v2.pytorch.pruning.tools import AGPTaskGenerator
+    # model, config_list, finetuner and device are assumed to be defined by the user
+    dummy_input = torch.rand(10, 3, 224, 224).to(device)
+    pruner = L1NormPruner(model=None, config_list=None, mode='dependency_aware', dummy_input=dummy_input)
+    task_generator = AGPTaskGenerator(total_iteration=10, origin_model=model, origin_config_list=config_list, log_dir='.', keep_intermediate_result=True)
+    scheduler = PruningScheduler(pruner, task_generator, finetuner=finetuner, speed_up=True, dummy_input=dummy_input, evaluator=None, reset_weight=False)
+    scheduler.compress()
+    _, model, masks, _, _ = scheduler.get_best_result()
+The full script can be found :githublink:`here <examples/model_compress/pruning/v2/scheduler_torch.py>`.
+In this example, we use ``dependency_aware`` mode L1 Norm Pruner as a basic pruner during each iteration.
+Note we do not need to pass ``model`` and ``config_list`` to the pruner, because in each iteration the ``model`` and ``config_list`` used by the pruner are received from the task generator.
+Then we can use ``scheduler`` as an iterative pruner directly. In fact, this is the implementation of ``AGPPruner`` in NNI.
+More about Task Generator
+-------------------------
+The task generator provides the model that needs to be pruned in each iteration and the corresponding config_list.
+For example, ``AGPTaskGenerator`` provides the model pruned in the previous iteration and computes the sparsity used in the current iteration.
+``TaskGenerator`` puts all this pruning information into a ``Task``; the pruning scheduler then gets the ``Task`` and runs it.
+The pruning result is returned to the ``TaskGenerator`` at the end of each iteration, and the ``TaskGenerator`` decides whether and how to generate the next ``Task``.
+The information included in the ``Task`` and ``TaskResult`` can be found :githublink:`here <nni/algorithms/compression/v2/pytorch/base/scheduler.py>`.
+A clearer iterative pruning flow chart can be found `here <v2_pruning.rst>`__.
+If you want to implement your own task generator, please follow the ``TaskGenerator`` :githublink:`interface <nni/algorithms/compression/v2/pytorch/pruning/tools/base.py>`.
+Two main functions should be implemented, ``init_pending_tasks(self) -> List[Task]`` and ``generate_tasks(self, task_result: TaskResult) -> List[Task]``.
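+The interplay between these two functions and the scheduler loop can be sketched without NNI. Everything below is a hypothetical stand-in: the ``Task`` and ``TaskResult`` dataclasses carry far less information than the real ones, and ``LinearTaskGenerator`` and ``run`` are illustrative names, not NNI classes or functions. The sketch only shows the control flow, assuming one task per iteration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:  # hypothetical stand-in, not NNI's Task
    task_id: int
    sparsity: float


@dataclass
class TaskResult:  # hypothetical stand-in, not NNI's TaskResult
    task_id: int
    score: float


class LinearTaskGenerator:
    """Toy generator that raises the sparsity linearly over a fixed number of iterations."""

    def __init__(self, total_iteration: int, final_sparsity: float):
        self.total_iteration = total_iteration
        self.final_sparsity = final_sparsity
        self.pending: List[Task] = self.init_pending_tasks()

    def _make_task(self, iteration: int) -> Task:
        return Task(iteration, self.final_sparsity * iteration / self.total_iteration)

    def init_pending_tasks(self) -> List[Task]:
        # The first task is known before any pruning result exists.
        return [self._make_task(1)]

    def generate_tasks(self, task_result: TaskResult) -> List[Task]:
        # Decide, from the last result, whether another task is needed.
        next_iteration = task_result.task_id + 1
        if next_iteration > self.total_iteration:
            return []
        return [self._make_task(next_iteration)]

    def next(self) -> Optional[Task]:
        return self.pending.pop(0) if self.pending else None

    def receive_task_result(self, task_result: TaskResult) -> None:
        self.pending.extend(self.generate_tasks(task_result))


def run(generator: LinearTaskGenerator) -> List[float]:
    """Mimic the scheduler loop: fetch a task, pretend to prune, report the result."""
    history = []
    task = generator.next()
    while task is not None:
        history.append(task.sparsity)
        generator.receive_task_result(TaskResult(task.task_id, score=1.0 - task.sparsity))
        task = generator.next()
    return history


history = run(LinearTaskGenerator(total_iteration=4, final_sparsity=0.8))
print(history)
```

+The loop stops as soon as ``generate_tasks`` returns an empty list, which mirrors the rule that iteration ends when the task generator has no more ``Task``.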
+Why Use Pruning Scheduler
+-------------------------
+One benefit of using a scheduler for iterative pruning is that users can access more functions of the NNI pruning components.
+To keep the interface simple and to stay faithful to the original papers, NNI does not expose all of the low-level interfaces to the upper layer.
+For example, resetting the weights to their values in the original model in each iteration is a key point of the lottery ticket pruning algorithm, and this is implemented in ``LotteryTicketPruner``.
+To reduce the complexity of the interface, we only support this function in ``LotteryTicketPruner``, not in other pruners.
+If users want to reset the weights in each iteration of AGP pruning, ``AGPPruner`` cannot do this, but users can simply set ``reset_weight=True`` in ``PruningScheduler`` to achieve it.
+What's more, for a customized pruner or task generator, using a scheduler is an easy way to enhance the algorithm.
+In addition, users can also customize the scheduling process to implement their own scheduler.
--- a/docs/en_US/FeatureEngineering/GBDTSelector.rst
+++ b/docs/en_US/FeatureEngineering/GBDTSelector.rst
+GBDTSelector
+------------
+GBDTSelector is based on `LightGBM <https://github.com/microsoft/LightGBM>`__\ , which is a gradient boosting framework that uses tree-based learning algorithms.
+When data is passed into the GBDT model, the model constructs the boosted trees. The feature importance comes from the scores computed during construction, which indicate how useful or valuable each feature was in building the boosted decision trees within the model.
+We could use this method as a strong baseline in Feature Selector, especially when using the GBDT model as a classifier or regressor.
+For now, ``importance_type`` can be ``split`` or ``gain``. We plan to support a customized ``importance_type`` in the future, which means users could define how to calculate the feature score themselves.
+Usage
+^^^^^
+First, install the dependency:
+.. code-block:: bash
+   pip install lightgbm
+Then use the selector as follows:
+.. code-block:: python
+   from sklearn.model_selection import train_test_split
+   from nni.algorithms.feature_engineering.gbdt_selector import GBDTSelector
+   # load data
+   ...
+   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
+   # initialize a selector
+   fgs = GBDTSelector()
+   # fit data
+   fgs.fit(X_train, y_train, ...)
+   # get important features
+   # this returns the indices of the important features
+   print(fgs.get_selected_features(10))
+   ...
+You can also refer to the examples in ``/examples/feature_engineering/gbdt_selector/``.
+**Arguments of the fit function**
+* 
+  **X** (array-like, required) - The training input samples, of shape [n_samples, n_features].
+* 
+  **y** (array-like, required) - The target values (class labels in classification, real numbers in regression), of shape [n_samples].
+* 
+  **lgb_params** (dict, required) - The parameters for the LightGBM model. See the `LightGBM parameter documentation <https://lightgbm.readthedocs.io/en/latest/Parameters.html>`__ for details.
+* 
+  **eval_ratio** (float, required) - The ratio used to split ``self.X`` into evaluation data and training data.
+* 
+  **early_stopping_rounds** (int, required) - The early stopping setting in LightGBM. See the `LightGBM parameter documentation <https://lightgbm.readthedocs.io/en/latest/Parameters.html>`__ for details.
+* 
+  **importance_type** (str, required) - Either 'split' or 'gain'. With 'split', the result contains the number of times each feature is used in the model; with 'gain', the result contains the total gain of the splits that use the feature. See the `feature_importance documentation <https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.feature_importance>`__ for details.
+* 
+  **num_boost_round** (int, required) - The number of boosting rounds. See the `lightgbm.train documentation <https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html#lightgbm.train>`__ for details.
+**Arguments of the get_selected_features function**
+* **topk** (int, required) - The number of top-ranked features (by importance) to select.
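+The role of ``topk`` can be illustrated with a small, library-free sketch: given one importance score per feature (for example split counts or total gains), keep the indices of the highest-scoring features. The helper ``top_k_features`` and the scores are made up for illustration and are not taken from the NNI implementation.

```python
def top_k_features(importances, topk):
    """Return the indices of the `topk` largest importance scores, best first."""
    ranked = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    return ranked[:topk]


# Hypothetical per-feature importance scores.
scores = [12, 0, 45, 7, 30]
selected = top_k_features(scores, 3)
print(selected)
```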
--- a/docs/en_US/FeatureEngineering/GradientFeatureSelector.rst
+++ b/docs/en_US/FeatureEngineering/GradientFeatureSelector.rst
+GradientFeatureSelector
+-----------------------
+The algorithm in GradientFeatureSelector comes from `Feature Gradients: Scalable Feature Selection via Discrete Relaxation <https://arxiv.org/pdf/1908.10382.pdf>`__.
+GradientFeatureSelector is a gradient-based search algorithm
+for feature selection.
+1) This approach extends a recent result on the estimation of
+learnability in the sublinear data regime by showing that the calculation can be performed iteratively (i.e., in mini-batches) and in **linear time and space** with respect to both the number of features D and the sample size N. 
+2) This, along with a discrete-to-continuous relaxation of the search domain, allows for an **efficient, gradient-based** search algorithm among feature subsets for very **large datasets**.
+3) Crucially, this algorithm is capable of finding **higher-order correlations** between features and targets for both the N > D and N < D regimes, as opposed to approaches that do not consider such interactions and/or only consider one regime.
+Usage
+^^^^^
+.. code-block:: python
+   from sklearn.model_selection import train_test_split
+   from nni.algorithms.feature_engineering.gradient_selector import FeatureGradientSelector
+   # load data
+   ...
+   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
+   # initialize a selector
+   fgs = FeatureGradientSelector(n_features=10)
+   # fit data
+   fgs.fit(X_train, y_train)
+   # get important features
+   # this returns the indices of the important features
+   print(fgs.get_selected_features())
+   ...
+You can also refer to the examples in ``/examples/feature_engineering/gradient_feature_selector/``.
+**Parameters of class FeatureGradientSelector constructor**
+* 
+  **order** (int, optional, default = 4) - What order of interactions to include. Higher orders may be more accurate but increase the run time. 12 is the maximum allowed order.
+* 
+  **penalty** (int, optional, default = 1) - Constant that multiplies the regularization term.
+* 
+  **n_features** (int, optional, default = None) - If None, will automatically choose number of features based on search. Otherwise, the number of top features to select.
+* 
+  **max_features** (int, optional, default = None) - If not None, will use the 'elbow method' to determine the number of features with max_features as the upper limit.
+* 
+  **learning_rate** (float, optional, default = 1e-1) - learning rate
+* 
+  **init** (*zero, on, off, onhigh, offhigh, or sklearn, optional, default = zero*\ ) - How to initialize the vector of scores. 'zero' is the default.
+* 
+  **n_epochs** (int, optional, default = 1) - number of epochs to run
+* 
+  **shuffle** (bool, optional, default = True) - Shuffle "rows" prior to an epoch.
+* 
+  **batch_size** (int, optional, default = 1000) - Number of "rows" to process at a time.
+* 
+  **target_batch_size** (int, optional, default = 1000) - Number of "rows" to accumulate gradients over. Useful when many rows will not fit into memory but are needed for accurate estimation.
+* 
+  **classification** (bool, optional, default = True) - If True, problem is classification, else regression.
+* 
+  **ordinal** (bool, optional, default = True) - If True, problem is ordinal classification. Requires classification to be True.
+* 
+  **balanced** (bool, optional, default = True) - If True, each class is weighted equally in optimization; otherwise weighting is done via the support of each class. Requires classification to be True.
+* 
+  **preprocess** (str, optional, default = 'zscore') - Either 'zscore', which centers the data and normalizes it to unit variance, or 'center', which only centers the data to zero mean.
+* 
+  **soft_grouping** (bool, optional, default = True) - If True, groups represent features that come from the same source. Used to encourage sparsity of groups and features within groups.
+* 
+  **verbose** (int, optional, default = 0) - Controls the verbosity when fitting. Set to 0 for no printing, or to 1 or higher to print every ``verbose`` gradient steps.
+* 
+  **device** (str, optional, default = 'cpu') - 'cpu' to run on CPU and 'cuda' to run on GPU. Runs much faster on GPU.
+**Arguments of the fit function**
+* 
+  **X** (array-like, required) - The training input samples, of shape [n_samples, n_features]. ``np.ndarray`` is recommended.
+* 
+  **y** (array-like, required) - The target values (class labels in classification, real numbers in regression), of shape [n_samples]. ``np.ndarray`` is recommended.
+* 
+  **groups** (array-like, optional, default = None) - Groups of columns that must be selected as a unit, of shape [n_features]. For example, [0, 0, 1, 2] specifies that the first two columns are part of one group.
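+To make the ``groups`` encoding concrete, the sketch below turns such an array into a mapping from group id to column indices. The helper ``columns_by_group`` is a hypothetical name used only for illustration; the selector itself consumes the flat array directly.

```python
from collections import defaultdict


def columns_by_group(groups):
    """Map each group id to the list of column indices that belong to it."""
    mapping = defaultdict(list)
    for column, group_id in enumerate(groups):
        mapping[group_id].append(column)
    return dict(mapping)


grouped = columns_by_group([0, 0, 1, 2])
print(grouped)
```

+Here columns 0 and 1 share group 0 and are therefore treated as one unit, while columns 2 and 3 each form their own group.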
+**Arguments of the get_selected_features function**
+For now, the ``get_selected_features`` function has no parameters.
--- a/docs/en_US/FeatureEngineering/Overview.rst
+++ b/docs/en_US/FeatureEngineering/Overview.rst
+Feature Engineering with NNI
+============================
+We are glad to announce the alpha release of the Feature Engineering toolkit on top of NNI. It is still in the experimental phase and might evolve based on user feedback. We invite you to use it, give feedback, and even contribute.
+For now, we support the following feature selectors:
+* `GradientFeatureSelector <./GradientFeatureSelector.rst>`__
+* `GBDTSelector <./GBDTSelector.rst>`__
+These selectors are suitable for tabular data (that is, not image, speech, or text data).
+In addition, these selectors only perform feature selection. If you want to:
+1) generate high-order combined features on NNI while doing feature selection;
+2) leverage your distributed resources;
+you could try this :githublink:`example <examples/feature_engineering/auto-feature-engineering>`.
+How to use?
+-----------
+.. code-block:: python
+   from sklearn.model_selection import train_test_split
+   from nni.algorithms.feature_engineering.gradient_selector import FeatureGradientSelector
+   # from nni.algorithms.feature_engineering.gbdt_selector import GBDTSelector
+   # load data
+   ...
+   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
+   # initialize a selector
+   fgs = FeatureGradientSelector(...)
+   # fit data
+   fgs.fit(X_train, y_train)
+   # get important features
+   # this returns the indices of the important features
+   print(fgs.get_selected_features(...))
+   ...
+When using a built-in selector, you first need to ``import`` a feature selector and ``initialize`` it. You can then call the selector's ``fit`` function to pass the data to it. After that, you can use ``get_selected_features`` to get the important features. The function parameters may differ between selectors, so check the documentation of the selector before using it.
+How to customize?
+-----------------
+NNI provides *state-of-the-art* feature selection algorithms as built-in selectors. NNI also supports building a feature selector by yourself.
+If you want to implement a customized feature selector, you need to:
+#. Inherit the base ``FeatureSelector`` class
+#. Implement the ``fit`` and ``get_selected_features`` functions
+#. Integrate with sklearn (Optional)
+Here is an example:
+**1. Inherit the base FeatureSelector Class**
+.. code-block:: python
+   from nni.feature_engineering.feature_selector import FeatureSelector
+   class CustomizedSelector(FeatureSelector):
+       def __init__(self, **kwargs):
+           ...
+**2. Implement the fit and get_selected_features Functions**
+.. code-block:: python
+   from nni.feature_engineering.feature_selector import FeatureSelector
+   class CustomizedSelector(FeatureSelector):
+       def __init__(self, **kwargs):
+           ...
+       def fit(self, X, y, **kwargs):
+           """
+           Fit the training data to FeatureSelector
+           Parameters
+           ----------
+           X : array-like numpy matrix
+               The training input samples, of shape [n_samples, n_features].
+           y : array-like numpy matrix
+               The target values (class labels in classification, real numbers in regression), of shape [n_samples].
+           """
+           self.X = X
+           self.y = y
+           ...
+       def get_selected_features(self):
+           """
+           Get important features
+           Returns
+           -------
+           list :
+               The indices of the important features.
+           """
+           ...
+           return self.selected_features_
+       ...
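+As a concrete instance of the two functions, the sketch below implements a toy selector that keeps the columns with the highest variance. To stay runnable without NNI, it uses ``FeatureSelectorBase``, a hypothetical stand-in for the real base class; ``VarianceSelector`` and the sample data are likewise made up for illustration and do not come from NNI.

```python
from statistics import pvariance


class FeatureSelectorBase:
    """Hypothetical stand-in for the NNI base class, so the sketch runs without NNI."""

    def fit(self, X, y, **kwargs):
        raise NotImplementedError

    def get_selected_features(self):
        raise NotImplementedError


class VarianceSelector(FeatureSelectorBase):
    """Toy selector that keeps the `n_features` columns with the highest variance."""

    def __init__(self, n_features):
        self.n_features = n_features
        self.selected_features_ = None

    def fit(self, X, y, **kwargs):
        columns = list(zip(*X))  # transpose rows into columns
        variances = [pvariance(col) for col in columns]
        ranked = sorted(range(len(variances)), key=lambda i: variances[i], reverse=True)
        self.selected_features_ = sorted(ranked[: self.n_features])
        return self

    def get_selected_features(self):
        return self.selected_features_


X = [[1.0, 5.0, 0.0], [1.0, 9.0, 2.0], [1.0, 1.0, 4.0]]
y = [0, 1, 0]
selector = VarianceSelector(n_features=2).fit(X, y)
print(selector.get_selected_features())
```

+The first column is constant, so it carries no information and is the one that gets dropped.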
+**3. Integrate with Sklearn**
+``sklearn.pipeline.Pipeline`` can connect models in series, such as feature selector, normalization, and classification/regression, to form a typical machine learning workflow.
+The following steps help us integrate better with sklearn, so that the customized feature selector can be used as a module of a pipeline.
+#. Inherit the class ``sklearn.base.BaseEstimator``
+#. Implement the ``get_params`` and ``set_params`` functions of ``BaseEstimator``
+#. Inherit the class ``sklearn.feature_selection.SelectorMixin``
+#. Implement the ``get_support``, ``transform`` and ``inverse_transform`` functions of ``SelectorMixin``
+Here is an example:
+**1. Inherit the BaseEstimator Class and its Function**
+.. code-block:: python
+   from sklearn.base import BaseEstimator
+   from nni.feature_engineering.feature_selector import FeatureSelector
+   class CustomizedSelector(FeatureSelector, BaseEstimator):
+       def __init__(self, **kwargs):
+           ...
+       def get_params(self, deep=True):
+           """
+           Get parameters for this estimator.
+           """
+           params = self.__dict__
+           # attributes ending with '_' are fitted state, not parameters
+           params = {key: val for (key, val) in params.items()
+                     if not key.endswith('_')}
+           return params
+       def set_params(self, **params):
+           """
+           Set the parameters of this estimator.
+           """
+           for param in params:
+               if hasattr(self, param):
+                   setattr(self, param, params[param])
+           return self
+**2. Inherit the SelectorMixin Class and its Function**
+.. code-block:: python
+   from sklearn.base import BaseEstimator
+   # older sklearn releases exposed this class as sklearn.feature_selection.base.SelectorMixin
+   from sklearn.feature_selection import SelectorMixin
+   from nni.feature_engineering.feature_selector import FeatureSelector
+   class CustomizedSelector(FeatureSelector, BaseEstimator, SelectorMixin):
+       def __init__(self, **kwargs):
+           ...
+       def get_params(self, deep=True):
+           """
+           Get parameters for this estimator.
+           """
+           params = self.__dict__
+           # attributes ending with '_' are fitted state, not parameters
+           params = {key: val for (key, val) in params.items()
+                     if not key.endswith('_')}
+           return params
+       def set_params(self, **params):
+           """
+           Set the parameters of this estimator.
+           """
+           for param in params:
+               if hasattr(self, param):
+                   setattr(self, param, params[param])
+           return self
+       def get_support(self, indices=False):
+           """
+           Get a mask, or integer index, of the features selected.
+           Parameters
+           ----------
+           indices : bool
+               Default False. If True, the return value will be an array of integers, rather than a boolean mask.
+           Returns
+           -------
+           list :
+               The support: an index that selects the retained features from a feature vector.
+               If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention.
+               If indices is True, this is an integer array of shape [# output features] whose values
+               are indices into the input feature vector.
+           """
+           ...
+           return mask
+       def transform(self, X):
+           """Reduce X to the selected features.
+           Parameters
+           ----------
+           X : array
+               The input samples, of shape [n_samples, n_features].
+           Returns
+           -------
+           X_r : array
+               The input samples with only the selected features, of shape [n_samples, n_selected_features].
+           """
+           ...
+           return X_r
+       def inverse_transform(self, X):
+           """
+           Reverse the transformation operation
+           Parameters
+           ----------
+           X : array
+               Of shape [n_samples, n_selected_features].
+           Returns
+           -------
+           X_r : array
+               Of shape [n_samples, n_original_features].
+           """
+           ...
+           return X_r
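+The relationship between the boolean mask, the integer indices, and the two transforms can be shown on plain Python lists. The free functions below are illustrative only and are not sklearn's or NNI's implementation; filling removed columns with zeros in ``inverse_transform`` follows the behaviour sklearn documents for ``SelectorMixin``.

```python
def mask_to_indices(mask):
    """Convert a boolean support mask into the integer indices of kept features."""
    return [i for i, keep in enumerate(mask) if keep]


def transform(X, mask):
    """Reduce each row of X to the selected features."""
    indices = mask_to_indices(mask)
    return [[row[i] for i in indices] for row in X]


def inverse_transform(X_r, mask):
    """Restore the original width, filling the removed columns with zeros."""
    indices = mask_to_indices(mask)
    restored = []
    for row in X_r:
        full = [0] * len(mask)
        for value, i in zip(row, indices):
            full[i] = value
        restored.append(full)
    return restored


mask = [True, False, True]
X = [[1, 2, 3], [4, 5, 6]]
X_r = transform(X, mask)
X_back = inverse_transform(X_r, mask)
print(mask_to_indices(mask), X_r, X_back)
```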
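+After integrating with sklearn, we can use the feature selector as follows:
+.. code-block:: python
+   from sklearn.ensemble import ExtraTreesClassifier
+   from sklearn.feature_selection import SelectFromModel
+   from sklearn.linear_model import LogisticRegression
+   from sklearn.pipeline import make_pipeline
+   # load data
+   ...
+   X_train, y_train = ...
+   # build a pipeline with a customized selector (XXXSelector is a placeholder)
+   pipeline = make_pipeline(XXXSelector(...), LogisticRegression())
+   # or build the same kind of pipeline with a selector built into sklearn
+   pipeline = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression())
+   pipeline.fit(X_train, y_train)
+   # score
+   print("Pipeline Score: ", pipeline.score(X_train, y_train))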
+Benchmark
+---------
+``Baseline`` means that no feature selection is applied and the data is passed directly to LogisticRegression. For this benchmark, we hold out 10% of the training data as test data. For the GradientFeatureSelector, we only take the top 20 features. The metric is the mean accuracy on the given test data and labels.
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+   * - Dataset
+     - All Features + LR (acc, time, memory)
+     - GradientFeatureSelector + LR (acc, time, memory)
+     - TreeBasedClassifier + LR (acc, time, memory)
+     - #Train
+     - #Feature
+   * - colon-cancer
+     - 0.7547, 890ms, 348MiB
+     - 0.7368, 363ms, 286MiB
+     - 0.7223, 171ms, 1171MiB
+     - 62
+     - 2,000
+   * - gisette
+     - 0.9725, 215ms, 584MiB
+     - 0.89416, 446ms, 397MiB
+     - 0.9792, 911ms, 234MiB
+     - 6,000
+     - 5,000
+   * - avazu
+     - 0.8834, N/A, N/A
+     - N/A, N/A, N/A
+     - N/A, N/A, N/A
+     - 40,428,967
+     - 1,000,000
+   * - rcv1
+     - 0.9644, 557ms, 241MiB
+     - 0.7333, 401ms, 281MiB
+     - 0.9615, 752ms, 284MiB
+     - 20,242
+     - 47,236
+   * - news20.binary
+     - 0.9208, 707ms, 361MiB
+     - 0.6870, 565ms, 371MiB
+     - 0.9070, 904ms, 364MiB
+     - 19,996
+     - 1,355,191
+   * - real-sim
+     - 0.9681, 433ms, 274MiB
+     - 0.7969, 251ms, 274MiB
+     - 0.9591, 643ms, 367MiB
+     - 72,309
+     - 20,958
+The benchmark datasets can be downloaded from `here <https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/>`__.
+The benchmark code can be found at ``/examples/feature_engineering/gradient_feature_selector/benchmark_test.py``.
+Reference and Feedback
+----------------------
+* To `report a bug <https://github.com/microsoft/nni/issues/new?template=bug-report.rst>`__ for this feature in GitHub;
+* To `file a feature or improvement request <https://github.com/microsoft/nni/issues/new?template=enhancement.rst>`__ for this feature in GitHub;
+* To know more about :githublink:`Neural Architecture Search with NNI <docs/en_US/NAS/Overview.rst>`\ ;
+* To know more about :githublink:`Model Compression with NNI <docs/en_US/Compression/Overview.rst>`\ ;
+* To know more about :githublink:`Hyperparameter Tuning with NNI <docs/en_US/Tuner/BuiltinTuner.rst>`\ ;
--- a/docs/en_US/Makefile
+++ b/docs/en_US/Makefile
+# Minimal makefile for Sphinx documentation
+#
+# You can set these variables from the command line.
+SPHINXOPTS    =
+SPHINXBUILD   = sphinx-build
+SOURCEDIR     = .
+BUILDDIR      = _build
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+.PHONY: help Makefile
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
--- a/docs/en_US/NAS/ApiReference.rst
+++ b/docs/en_US/NAS/ApiReference.rst
+Retiarii API Reference
+======================
+.. contents::
+Inline Mutation APIs
+--------------------
+..  autoclass:: nni.retiarii.nn.pytorch.LayerChoice
+    :members:
+..  autoclass:: nni.retiarii.nn.pytorch.InputChoice
+    :members:
+..  autoclass:: nni.retiarii.nn.pytorch.ValueChoice
+    :members:
+..  autoclass:: nni.retiarii.nn.pytorch.ChosenInputs
+    :members:
+..  autoclass:: nni.retiarii.nn.pytorch.Repeat
+    :members:
+..  autoclass:: nni.retiarii.nn.pytorch.Cell
+    :members:
+Graph Mutation APIs
+-------------------
+..  autoclass:: nni.retiarii.Mutator
+    :members:
+..  autoclass:: nni.retiarii.Model
+    :members:
+..  autoclass:: nni.retiarii.Graph
+    :members:
+..  autoclass:: nni.retiarii.Node
+    :members:
+..  autoclass:: nni.retiarii.Edge
+    :members:
+..  autoclass:: nni.retiarii.Operation
+    :members:
+Evaluators
+----------
+..  autoclass:: nni.retiarii.evaluator.FunctionalEvaluator
+    :members:
+..  autoclass:: nni.retiarii.evaluator.pytorch.lightning.LightningModule
+    :members:
+..  autoclass:: nni.retiarii.evaluator.pytorch.lightning.Classification
+    :members:
+..  autoclass:: nni.retiarii.evaluator.pytorch.lightning.Regression
+    :members:
+Oneshot Trainers
+----------------
+..  autoclass:: nni.retiarii.oneshot.pytorch.DartsTrainer
+    :members:
+..  autoclass:: nni.retiarii.oneshot.pytorch.EnasTrainer
+    :members:
+..  autoclass:: nni.retiarii.oneshot.pytorch.ProxylessTrainer
+    :members:
+..  autoclass:: nni.retiarii.oneshot.pytorch.SinglePathTrainer
+    :members:
+Exploration Strategies
+----------------------
+..  autoclass:: nni.retiarii.strategy.Random
+    :members:
+..  autoclass:: nni.retiarii.strategy.GridSearch
+    :members:
+..  autoclass:: nni.retiarii.strategy.RegularizedEvolution
+    :members:
+..  autoclass:: nni.retiarii.strategy.TPEStrategy
+    :members:
+..  autoclass:: nni.retiarii.strategy.PolicyBasedRL
+    :members:
+Retiarii Experiments
+--------------------
+..  autoclass:: nni.retiarii.experiment.pytorch.RetiariiExperiment
+    :members:
+..  autoclass:: nni.retiarii.experiment.pytorch.RetiariiExeConfig
+    :members:
+CGO Execution
+-------------
+..  autofunction:: nni.retiarii.evaluator.pytorch.cgo.evaluator.MultiModelSupervisedLearningModule
+..  autofunction:: nni.retiarii.evaluator.pytorch.cgo.evaluator.Classification
+..  autofunction:: nni.retiarii.evaluator.pytorch.cgo.evaluator.Regression
+Utilities
+---------
+..  autofunction:: nni.retiarii.basic_unit
+..  autofunction:: nni.retiarii.model_wrapper
+..  autofunction:: nni.retiarii.fixed_arch
--- a/docs/en_US/NAS/Benchmarks.rst
+++ b/docs/en_US/NAS/Benchmarks.rst
+NAS Benchmarks
+==============
+..  toctree::
+    :hidden:
+    Example Usages <BenchmarksExample>
+Introduction
+------------
+To improve the reproducibility of NAS algorithms and to reduce computing resource requirements, researchers have proposed a series of NAS benchmarks such as `NAS-Bench-101 <https://arxiv.org/abs/1902.09635>`__\ , `NAS-Bench-201 <https://arxiv.org/abs/2001.00326>`__\ , `NDS <https://arxiv.org/abs/1905.13214>`__\ , etc. NNI provides a query interface for users to access these benchmarks. With just a few lines of code, researchers can evaluate their NAS algorithms easily and fairly by using these benchmarks.
+Prerequisites
+-------------
+* Please prepare a folder to hold all the benchmark databases. By default, it is located at ``${HOME}/.cache/nni/nasbenchmark``. Alternatively, you can place it anywhere you like and specify it in ``NASBENCHMARK_DIR`` via ``export NASBENCHMARK_DIR=/path/to/your/nasbenchmark`` before importing NNI.
+* Please install ``peewee`` via ``pip3 install peewee``\ , which NNI uses to connect to the database.
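+The lookup rule described above (use ``NASBENCHMARK_DIR`` if it is set, otherwise fall back to the default folder) can be sketched as follows. ``benchmark_dir`` is a hypothetical helper written for this illustration; it is not an NNI function, and NNI's own resolution logic may handle additional cases.

```python
import os


def benchmark_dir(environ=None):
    """Return the benchmark folder: NASBENCHMARK_DIR if set, else the documented default."""
    environ = os.environ if environ is None else environ
    default = os.path.join(os.path.expanduser("~"), ".cache", "nni", "nasbenchmark")
    return environ.get("NASBENCHMARK_DIR", default)


# Resolve against the real environment, then against an explicit override.
print(benchmark_dir())
print(benchmark_dir({"NASBENCHMARK_DIR": "/path/to/your/nasbenchmark"}))
```

+Because the variable is read when NNI is imported, it has to be exported before the import for the override to take effect.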
+Data Preparation
+----------------
+Option 1 (Recommended)
+^^^^^^^^^^^^^^^^^^^^^^
+You can download the preprocessed benchmark files via ``python -m nni.nas.benchmarks.download <benchmark_name>``, where ``<benchmark_name>`` can be ``nasbench101``, ``nasbench201``, etc. Add ``--help`` to the command to list the supported command line arguments.
+Option 2
+^^^^^^^^
+.. note:: If you have files that are processed before v2.5, it is recommended that you delete them and try option 1.
+#. 
+   Clone NNI to your machine and enter ``examples/nas/benchmarks`` directory.
+   .. code-block:: bash
+      git clone -b ${NNI_VERSION} https://github.com/microsoft/nni
+      cd nni/examples/nas/benchmarks
+   Replace ``${NNI_VERSION}`` with a released version name or branch name, e.g., ``v2.4``.
+#. 
+   Install dependencies via ``pip3 install -r xxx.requirements.txt``. ``xxx`` can be ``nasbench101``\ , ``nasbench201`` or ``nds``.
+#. Generate the database via ``./xxx.sh``. The directory that stores the benchmark file can be configured with the ``NASBENCHMARK_DIR`` environment variable, which defaults to ``~/.nni/nasbenchmark``. Note that the NAS-Bench-201 dataset will be downloaded from Google Drive.
+Please make sure there is at least 10 GB of free disk space, and note that the conversion process can take hours to complete.
+Example Usages
+--------------
+Please refer to `example usages of the Benchmarks API <./BenchmarksExample.rst>`__.
+NAS-Bench-101
+-------------
+* `Paper link <https://arxiv.org/abs/1902.09635>`__ 
+* `Open-source <https://github.com/google-research/nasbench>`__
+NAS-Bench-101 contains 423,624 unique neural networks, combined with 4 variations in number of epochs (4, 12, 36, 108), each of which is trained 3 times. It is a cell-wise search space, which constructs and stacks a cell by enumerating DAGs with at most 7 operators, and no more than 9 connections. All operators can be chosen from ``CONV3X3_BN_RELU``\ , ``CONV1X1_BN_RELU`` and ``MAXPOOL3X3``\ , except the first operator (always ``INPUT``\ ) and last operator (always ``OUTPUT``\ ).
+Notably, NAS-Bench-101 eliminates invalid cells (e.g., there is no path from input to output, or there is redundant computation). Furthermore, isomorphic cells are de-duplicated, i.e., all the remaining cells are computationally unique.
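+Two of the constraints above (the size limits and the requirement that a path exists from input to output) can be checked with a few lines of code. The sketch assumes a cell is given as an upper-triangular adjacency matrix whose first vertex is the input and whose last vertex is the output. ``is_valid_cell`` is a hypothetical helper, not part of NNI or of the NAS-Bench-101 code, and it deliberately skips the redundancy and isomorphism checks.

```python
def is_valid_cell(matrix, max_vertices=7, max_edges=9):
    """Check the size limits and that OUTPUT is reachable from INPUT.

    `matrix` is an upper-triangular adjacency matrix; vertex 0 is INPUT and
    the last vertex is OUTPUT.
    """
    n = len(matrix)
    num_edges = sum(sum(row) for row in matrix)
    if n > max_vertices or num_edges > max_edges:
        return False
    # Depth-first search from INPUT along the directed edges.
    reachable, stack = {0}, [0]
    while stack:
        src = stack.pop()
        for dst in range(n):
            if matrix[src][dst] and dst not in reachable:
                reachable.add(dst)
                stack.append(dst)
    return (n - 1) in reachable


connected = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
disconnected = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
print(is_valid_cell(connected), is_valid_cell(disconnected))
```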
+API Documentation
+^^^^^^^^^^^^^^^^^
+.. autofunction:: nni.nas.benchmarks.nasbench101.query_nb101_trial_stats
+.. autoattribute:: nni.nas.benchmarks.nasbench101.INPUT
+.. autoattribute:: nni.nas.benchmarks.nasbench101.OUTPUT
+.. autoattribute:: nni.nas.benchmarks.nasbench101.CONV3X3_BN_RELU
+.. autoattribute:: nni.nas.benchmarks.nasbench101.CONV1X1_BN_RELU
+.. autoattribute:: nni.nas.benchmarks.nasbench101.MAXPOOL3X3
+.. autoclass:: nni.nas.benchmarks.nasbench101.Nb101TrialConfig
+.. autoclass:: nni.nas.benchmarks.nasbench101.Nb101TrialStats
+.. autoclass:: nni.nas.benchmarks.nasbench101.Nb101IntermediateStats
+.. autofunction:: nni.nas.benchmarks.nasbench101.graph_util.nasbench_format_to_architecture_repr
+.. autofunction:: nni.nas.benchmarks.nasbench101.graph_util.infer_num_vertices
+.. autofunction:: nni.nas.benchmarks.nasbench101.graph_util.hash_module
+NAS-Bench-201
+-------------
+* `Paper link <https://arxiv.org/abs/2001.00326>`__ 
+* `Open-source API <https://github.com/D-X-Y/NAS-Bench-201>`__ 
+* `Implementations <https://github.com/D-X-Y/AutoDL-Projects>`__
+NAS-Bench-201 is a cell-wise search space that views nodes as tensors and edges as operators. The search space contains all possible densely-connected DAGs with 4 nodes, resulting in 15,625 candidates in total. Each operator (i.e., edge) is selected from a pre-defined operator set (\ ``NONE``\ , ``SKIP_CONNECT``\ , ``CONV_1X1``\ , ``CONV_3X3`` and ``AVG_POOL_3X3``\ ). Training approaches vary in the dataset used (CIFAR-10, CIFAR-100, ImageNet) and number of epochs scheduled (12 and 200). Each combination of architecture and training approach is repeated 1 - 3 times with different random seeds.
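+The figure of 15,625 candidates follows directly from the description: a densely-connected DAG with 4 nodes has 6 edges, and each edge takes one of 5 operators, so there are 5^6 architectures. The short enumeration below is only a sanity check of that arithmetic; the variable names are made up and it does not use the NNI API.

```python
from itertools import combinations, product

OPERATORS = ["NONE", "SKIP_CONNECT", "CONV_1X1", "CONV_3X3", "AVG_POOL_3X3"]
NUM_NODES = 4

# In a densely-connected DAG every earlier node feeds every later node.
edges = list(combinations(range(NUM_NODES), 2))

# One operator choice per edge gives one candidate architecture.
num_candidates = sum(1 for _ in product(OPERATORS, repeat=len(edges)))
print(len(edges), num_candidates)
```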
+API Documentation
+^^^^^^^^^^^^^^^^^
+.. autofunction:: nni.nas.benchmarks.nasbench201.query_nb201_trial_stats
+.. autoattribute:: nni.nas.benchmarks.nasbench201.NONE
+.. autoattribute:: nni.nas.benchmarks.nasbench201.SKIP_CONNECT
+.. autoattribute:: nni.nas.benchmarks.nasbench201.CONV_1X1
+.. autoattribute:: nni.nas.benchmarks.nasbench201.CONV_3X3
+.. autoattribute:: nni.nas.benchmarks.nasbench201.AVG_POOL_3X3
+.. autoclass:: nni.nas.benchmarks.nasbench201.Nb201TrialConfig
+.. autoclass:: nni.nas.benchmarks.nasbench201.Nb201TrialStats
+.. autoclass:: nni.nas.benchmarks.nasbench201.Nb201IntermediateStats
+NDS
+---
+* `Paper link <https://arxiv.org/abs/1905.13214>`__ 
+* `Open-source <https://github.com/facebookresearch/nds>`__
+*On Network Design Spaces for Visual Recognition* released trial statistics of over 100,000 configurations (models + hyper-parameters) sampled from multiple model families, including vanilla (feedforward network loosely inspired by VGG), ResNet and ResNeXt (residual basic block and residual bottleneck block) and NAS cells (following popular designs from NASNet, Amoeba, PNAS, ENAS and DARTS). Most configurations are trained only once with a fixed seed, except a few that are trained twice or three times.
+Instead of storing results obtained with different configurations in separate files, we dump them into one single database to enable comparison in multiple dimensions. Specifically, we use ``model_family`` to distinguish model types, ``model_spec`` for all hyper-parameters needed to build this model, ``cell_spec`` for detailed information on operators and connections if it is a NAS cell, ``generator`` to denote the sampling policy through which this configuration is generated. Refer to API documentation for details.
+Available Operators
+-------------------
+Here is a list of available operators used in NDS.
+.. autoattribute:: nni.nas.benchmarks.nds.constants.NONE
+.. autoattribute:: nni.nas.benchmarks.nds.constants.SKIP_CONNECT
+.. autoattribute:: nni.nas.benchmarks.nds.constants.AVG_POOL_3X3
+.. autoattribute:: nni.nas.benchmarks.nds.constants.MAX_POOL_3X3
+.. autoattribute:: nni.nas.benchmarks.nds.constants.MAX_POOL_5X5
+.. autoattribute:: nni.nas.benchmarks.nds.constants.MAX_POOL_7X7
+.. autoattribute:: nni.nas.benchmarks.nds.constants.CONV_1X1
+.. autoattribute:: nni.nas.benchmarks.nds.constants.CONV_3X3
+.. autoattribute:: nni.nas.benchmarks.nds.constants.CONV_3X1_1X3
+.. autoattribute:: nni.nas.benchmarks.nds.constants.CONV_7X1_1X7
+.. autoattribute:: nni.nas.benchmarks.nds.constants.DIL_CONV_3X3
+.. autoattribute:: nni.nas.benchmarks.nds.constants.DIL_CONV_5X5
+.. autoattribute:: nni.nas.benchmarks.nds.constants.SEP_CONV_3X3
+.. autoattribute:: nni.nas.benchmarks.nds.constants.SEP_CONV_5X5
+.. autoattribute:: nni.nas.benchmarks.nds.constants.SEP_CONV_7X7
+.. autoattribute:: nni.nas.benchmarks.nds.constants.DIL_SEP_CONV_3X3
+API Documentation
+^^^^^^^^^^^^^^^^^
+.. autofunction:: nni.nas.benchmarks.nds.query_nds_trial_stats
+.. autoclass:: nni.nas.benchmarks.nds.NdsTrialConfig
+.. autoclass:: nni.nas.benchmarks.nds.NdsTrialStats
+.. autoclass:: nni.nas.benchmarks.nds.NdsIntermediateStats
--- a/docs/en_US/NAS/BenchmarksExample.ipynb
+++ b/docs/en_US/NAS/BenchmarksExample.ipynb
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# Example Usages of NAS Benchmarks"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import pprint\n",
+        "import time\n",
+        "\n",
+        "from nni.nas.benchmarks.nasbench101 import query_nb101_trial_stats\n",
+        "from nni.nas.benchmarks.nasbench201 import query_nb201_trial_stats\n",
+        "from nni.nas.benchmarks.nds import query_nds_trial_stats\n",
+        "\n",
+        "ti = time.time()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## NAS-Bench-101"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Use the following architecture as an example:\n",
+        "\n",
+        "![nas-101](../../img/nas-bench-101-example.png)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 2,
+      "metadata": {
+        "tags": []
+      },
+      "outputs": [],
+      "source": [
+        "arch = {\n",
+        "    'op1': 'conv3x3-bn-relu',\n",
+        "    'op2': 'maxpool3x3',\n",
+        "    'op3': 'conv3x3-bn-relu',\n",
+        "    'op4': 'conv3x3-bn-relu',\n",
+        "    'op5': 'conv1x1-bn-relu',\n",
+        "    'input1': [0],\n",
+        "    'input2': [1],\n",
+        "    'input3': [2],\n",
+        "    'input4': [0],\n",
+        "    'input5': [0, 3, 4],\n",
+        "    'input6': [2, 5]\n",
+        "}\n",
+        "for t in query_nb101_trial_stats(arch, 108, include_intermediates=True):\n",
+        "    pprint.pprint(t)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "An architecture of NAS-Bench-101 could be trained more than once. Each element of the returned generator is a dict which contains one of the training results of this trial config (architecture + hyper-parameters) including train/valid/test accuracy, training time, number of epochs, etc. The results of NAS-Bench-201 and NDS follow similar formats."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## NAS-Bench-201"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Use the following architecture as an example:\n",
+        "\n",
+        "![nas-201](../../img/nas-bench-201-example.png)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "tags": []
+      },
+      "outputs": [],
+      "source": [
+        "arch = {\n",
+        "    '0_1': 'avg_pool_3x3',\n",
+        "    '0_2': 'conv_1x1',\n",
+        "    '1_2': 'skip_connect',\n",
+        "    '0_3': 'conv_1x1',\n",
+        "    '1_3': 'skip_connect',\n",
+        "    '2_3': 'skip_connect'\n",
+        "}\n",
+        "for t in query_nb201_trial_stats(arch, 200, 'cifar100'):\n",
+        "    pprint.pprint(t)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Intermediate results are also available."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 4,
+      "metadata": {
+        "tags": []
+      },
+      "outputs": [],
+      "source": [
+        "for t in query_nb201_trial_stats(arch, None, 'imagenet16-120', include_intermediates=True):\n",
+        "    print(t['config'])\n",
+        "    print('Intermediates:', len(t['intermediates']))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## NDS"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Use the following architecture as an example:<br>\n",
+        "![nds](../../img/nas-bench-nds-example.png)\n",
+        "\n",
+        "Here, `bot_muls`, `ds`, `num_gs`, `ss` and `ws` stand for \"bottleneck multipliers\", \"depths\", \"number of groups\", \"strides\" and \"widths\" respectively."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "tags": []
+      },
+      "outputs": [],
+      "source": [
+        "model_spec = {\n",
+        "    'bot_muls': [0.0, 0.25, 0.25, 0.25],\n",
+        "    'ds': [1, 16, 1, 4],\n",
+        "    'num_gs': [1, 2, 1, 2],\n",
+        "    'ss': [1, 1, 2, 2],\n",
+        "    'ws': [16, 64, 128, 16]\n",
+        "}\n",
+        "# Use None as a wildcard\n",
+        "for t in query_nds_trial_stats('residual_bottleneck', None, None, model_spec, None, 'cifar10'):\n",
+        "    pprint.pprint(t)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 6,
+      "metadata": {
+        "tags": []
+      },
+      "outputs": [],
+      "source": [
+        "model_spec = {\n",
+        "    'bot_muls': [0.0, 0.25, 0.25, 0.25],\n",
+        "    'ds': [1, 16, 1, 4],\n",
+        "    'num_gs': [1, 2, 1, 2],\n",
+        "    'ss': [1, 1, 2, 2],\n",
+        "    'ws': [16, 64, 128, 16]\n",
+        "}\n",
+        "for t in query_nds_trial_stats('residual_bottleneck', None, None, model_spec, None, 'cifar10', include_intermediates=True):\n",
+        "    pprint.pprint(t['intermediates'][:10])"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 7,
+      "metadata": {
+        "tags": []
+      },
+      "outputs": [],
+      "source": [
+        "model_spec = {'ds': [1, 12, 12, 12], 'ss': [1, 1, 2, 2], 'ws': [16, 24, 24, 40]}\n",
+        "for t in query_nds_trial_stats('residual_basic', 'resnet', 'random', model_spec, {}, 'cifar10'):\n",
+        "    pprint.pprint(t)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "metadata": {
+        "tags": []
+      },
+      "outputs": [],
+      "source": [
+        "# get the first one\n",
+        "pprint.pprint(next(query_nds_trial_stats('vanilla', None, None, None, None, None)))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 9,
+      "metadata": {
+        "tags": []
+      },
+      "outputs": [],
+      "source": [
+        "# query a NAS cell configuration\n",
+        "model_spec = {'num_nodes_normal': 5, 'num_nodes_reduce': 5, 'depth': 12, 'width': 32, 'aux': False, 'drop_prob': 0.0}\n",
+        "cell_spec = {\n",
+        "    'normal_0_op_x': 'avg_pool_3x3',\n",
+        "    'normal_0_input_x': 0,\n",
+        "    'normal_0_op_y': 'conv_7x1_1x7',\n",
+        "    'normal_0_input_y': 1,\n",
+        "    'normal_1_op_x': 'sep_conv_3x3',\n",
+        "    'normal_1_input_x': 2,\n",
+        "    'normal_1_op_y': 'sep_conv_5x5',\n",
+        "    'normal_1_input_y': 0,\n",
+        "    'normal_2_op_x': 'dil_sep_conv_3x3',\n",
+        "    'normal_2_input_x': 2,\n",
+        "    'normal_2_op_y': 'dil_sep_conv_3x3',\n",
+        "    'normal_2_input_y': 2,\n",
+        "    'normal_3_op_x': 'skip_connect',\n",
+        "    'normal_3_input_x': 4,\n",
+        "    'normal_3_op_y': 'dil_sep_conv_3x3',\n",
+        "    'normal_3_input_y': 4,\n",
+        "    'normal_4_op_x': 'conv_7x1_1x7',\n",
+        "    'normal_4_input_x': 2,\n",
+        "    'normal_4_op_y': 'sep_conv_3x3',\n",
+        "    'normal_4_input_y': 4,\n",
+        "    'normal_concat': [3, 5, 6],\n",
+        "    'reduce_0_op_x': 'avg_pool_3x3',\n",
+        "    'reduce_0_input_x': 0,\n",
+        "    'reduce_0_op_y': 'dil_sep_conv_3x3',\n",
+        "    'reduce_0_input_y': 1,\n",
+        "    'reduce_1_op_x': 'sep_conv_3x3',\n",
+        "    'reduce_1_input_x': 0,\n",
+        "    'reduce_1_op_y': 'sep_conv_3x3',\n",
+        "    'reduce_1_input_y': 0,\n",
+        "    'reduce_2_op_x': 'skip_connect',\n",
+        "    'reduce_2_input_x': 2,\n",
+        "    'reduce_2_op_y': 'sep_conv_7x7',\n",
+        "    'reduce_2_input_y': 0,\n",
+        "    'reduce_3_op_x': 'conv_7x1_1x7',\n",
+        "    'reduce_3_input_x': 4,\n",
+        "    'reduce_3_op_y': 'skip_connect',\n",
+        "    'reduce_3_input_y': 4,\n",
+        "    'reduce_4_op_x': 'conv_7x1_1x7',\n",
+        "    'reduce_4_input_x': 0,\n",
+        "    'reduce_4_op_y': 'conv_7x1_1x7',\n",
+        "    'reduce_4_input_y': 5,\n",
+        "    'reduce_concat': [3, 6]\n",
+        "}\n",
+        "\n",
+        "for t in query_nds_trial_stats('nas_cell', None, None, model_spec, cell_spec, 'cifar10'):\n",
+        "    assert t['config']['model_spec'] == model_spec\n",
+        "    assert t['config']['cell_spec'] == cell_spec\n",
+        "    pprint.pprint(t)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 10,
+      "metadata": {
+        "tags": []
+      },
+      "outputs": [],
+      "source": [
+        "# count number\n",
+        "print('NDS (amoeba) count:', len(list(query_nds_trial_stats(None, 'amoeba', None, None, None, None, None))))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## NLP"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "pycharm": {
+          "metadata": false
+        }
+      },
+      "source": [
+        "Use the following two architectures as examples. \n",
+        "The arch in the paper is called a \"recipe\" with nested variables; it is unnested in the benchmarks for NNI.\n",
+        "An arch has multiple Node, Node_input_n and Node_op entries; refer to the documentation for more details.\n",
+        "\n",
+        "arch1 : <img src=\"../../img/nas-bench-nlp-example1.jpeg\" width=400 height=300 /> \n",
+        "\n",
+        "\n",
+        "arch2 : <img src=\"../../img/nas-bench-nlp-example2.jpeg\" width=400 height=300 /> \n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 1,
+      "metadata": {},
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "{'config': {'arch': {'h_new_0_input_0': 'node_3',\n                     'h_new_0_input_1': 'node_2',\n                     'h_new_0_input_2': 'node_1',\n                     'h_new_0_op': 'blend',\n                     'node_0_input_0': 'x',\n                     'node_0_input_1': 'h_prev_0',\n                     'node_0_op': 'linear',\n                     'node_1_input_0': 'node_0',\n                     'node_1_op': 'activation_tanh',\n                     'node_2_input_0': 'h_prev_0',\n                     'node_2_input_1': 'node_1',\n                     'node_2_input_2': 'x',\n                     'node_2_op': 'linear',\n                     'node_3_input_0': 'node_2',\n                     'node_3_op': 'activation_leaky_relu'},\n            'dataset': 'ptb',\n            'id': 20003},\n 'id': 16291,\n 'test_loss': 4.680262297102549,\n 'train_loss': 4.132040537087838,\n 'training_time': 177.05208373069763,\n 'val_loss': 4.707944253177966}\n"
+          ]
+        }
+      ],
+      "source": [
+        "import pprint\n",
+        "from nni.nas.benchmarks.nlp import query_nlp_trial_stats\n",
+        "\n",
+        "arch1 = {'h_new_0_input_0': 'node_3', 'h_new_0_input_1': 'node_2', 'h_new_0_input_2': 'node_1', 'h_new_0_op': 'blend', 'node_0_input_0': 'x', 'node_0_input_1': 'h_prev_0', 'node_0_op': 'linear','node_1_input_0': 'node_0', 'node_1_op': 'activation_tanh', 'node_2_input_0': 'h_prev_0', 'node_2_input_1': 'node_1', 'node_2_input_2': 'x', 'node_2_op': 'linear', 'node_3_input_0': 'node_2', 'node_3_op': 'activation_leaky_relu'}\n",
+        "for i in query_nlp_trial_stats(arch=arch1, dataset=\"ptb\"):\n",
+        "    pprint.pprint(i)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 6,
+      "metadata": {},
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "[{'current_epoch': 46,\n  'id': 1796,\n  'test_loss': 6.233430054978619,\n  'train_loss': 6.4866799231542664,\n  'training_time': 146.5680329799652,\n  'val_loss': 6.326836978687959},\n {'current_epoch': 47,\n  'id': 1797,\n  'test_loss': 6.2402057403023825,\n  'train_loss': 6.485401405247535,\n  'training_time': 146.05511450767517,\n  'val_loss': 6.3239741605870865},\n {'current_epoch': 48,\n  'id': 1798,\n  'test_loss': 6.351145308363877,\n  'train_loss': 6.611281181173992,\n  'training_time': 145.8849437236786,\n  'val_loss': 6.436160816865809},\n {'current_epoch': 49,\n  'id': 1799,\n  'test_loss': 6.227155079159031,\n  'train_loss': 6.473414458249545,\n  'training_time': 145.51414465904236,\n  'val_loss': 6.313294354607077}]\n"
+          ]
+        }
+      ],
+      "source": [
+        "arch2 = {\"h_new_0_input_0\":\"node_0\",\"h_new_0_input_1\":\"node_1\",\"h_new_0_op\":\"elementwise_sum\",\"node_0_input_0\":\"x\",\"node_0_input_1\":\"h_prev_0\",\"node_0_op\":\"linear\",\"node_1_input_0\":\"node_0\",\"node_1_op\":\"activation_tanh\"}\n",
+        "for i in query_nlp_trial_stats(arch=arch2, dataset='wikitext-2', include_intermediates=True):\n",
+        "    pprint.pprint(i['intermediates'][45:49])"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 4,
+      "metadata": {
+        "pycharm": {},
+        "tags": []
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Elapsed time:  5.60982608795166 seconds\n"
+          ]
+        }
+      ],
+      "source": [
+        "print('Elapsed time: ', time.time() - ti, 'seconds')"
+      ]
+    }
+  ],
+  "metadata": {
+    "file_extension": ".py",
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "name": "python",
+      "version": "3.8.5-final"
+    },
+    "mimetype": "text/x-python",
+    "name": "python",
+    "npconvert_exporter": "python",
+    "orig_nbformat": 2,
+    "pygments_lexer": "ipython3",
+    "version": 3
+  },
+  "nbformat": 4,
+  "nbformat_minor": 2
+}
\ No newline at end of file
--- a/docs/en_US/NAS/DARTS.rst
+++ b/docs/en_US/NAS/DARTS.rst
+DARTS
+=====
+Introduction
+------------
+The paper `DARTS: Differentiable Architecture Search <https://arxiv.org/abs/1806.09055>`__ addresses the scalability challenge of architecture search by formulating the task in a differentiable manner. Their method is based on the continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent.
+The authors' code optimizes the network weights and architecture weights alternately in mini-batches. They further explore using second-order optimization (unroll) instead of first-order optimization to improve performance.
+The implementation on NNI is based on the `official implementation <https://github.com/quark0/darts>`__ and a `popular 3rd-party repo <https://github.com/khanrc/pt.darts>`__. DARTS on NNI is designed to be general for arbitrary search spaces. A CNN search space tailored for CIFAR10, the same as in the original paper, is implemented as a use case of DARTS.
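To make the continuous relaxation more concrete, the toy sketch below mixes the outputs of a few candidate operations on one edge with softmax weights over architecture parameters, then derives a discrete choice by taking the largest weight. The scalar "operations" and the parameter values are invented stand-ins; this is not the ``DartsTrainer`` implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Toy scalar "operations" standing in for convolution, pooling, skip, etc.
CANDIDATE_OPS = {
    "skip_connect": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def mixed_op(x, alpha):
    """Relaxed edge: softmax(alpha)-weighted sum of all candidate outputs."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))


def derive_op(alpha):
    """After search, keep only the candidate with the largest architecture weight."""
    names = list(CANDIDATE_OPS)
    return names[max(range(len(alpha)), key=lambda i: alpha[i])]


alpha = [0.1, 1.5, -0.3]  # architecture parameters for one edge (invented values)
output = mixed_op(3.0, alpha)
chosen = derive_op(alpha)
```

Since ``mixed_op`` is differentiable with respect to ``alpha``, the architecture parameters can be updated by gradient descent in the same way as ordinary network weights.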
+Reproduction Results
+--------------------
+The above-mentioned example is meant to reproduce the results in the paper. We run experiments with first- and second-order optimization. Due to time limits, we retrain *only the best architecture* derived from the search phase and repeat the experiment *only once*. Our results are currently on par with the results reported in the paper. We will add more results later when ready.
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+   * - 
+     - In paper
+     - Reproduction
+   * - First order (CIFAR10)
+     - 3.00 +/- 0.14
+     - 2.78
+   * - Second order (CIFAR10)
+     - 2.76 +/- 0.09
+     - 2.80
+Examples
+--------
+CNN Search Space
+^^^^^^^^^^^^^^^^
+:githublink:`Example code <examples/nas/oneshot/darts>`
+.. code-block:: bash
+   # In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
+   git clone https://github.com/Microsoft/nni.git
+   # search the best architecture
+   cd examples/nas/oneshot/darts
+   python3 search.py
+   # train the best architecture
+   python3 retrain.py --arc-checkpoint ./checkpoints/epoch_49.json
+Reference
+---------
+PyTorch
+^^^^^^^
+..  autoclass:: nni.retiarii.oneshot.pytorch.DartsTrainer
+    :noindex:
+Limitations
+-----------
+* DARTS doesn't support DataParallel and needs to be customized in order to support DistributedDataParallel.
--- a/docs/en_US/NAS/ENAS.rst
+++ b/docs/en_US/NAS/ENAS.rst
+ENAS
+====
+Introduction
+------------
+The paper `Efficient Neural Architecture Search via Parameter Sharing <https://arxiv.org/abs/1802.03268>`__ uses parameter sharing between child models to accelerate the NAS process. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained with policy gradient to select a subgraph that maximizes the expected reward on the validation set. Meanwhile the model corresponding to the selected subgraph is trained to minimize a canonical cross entropy loss.
+The implementation on NNI is based on the `official implementation in Tensorflow <https://github.com/melodyguan/enas>`__\ , including a general-purpose reinforcement-learning controller and a trainer that trains the target network and this controller alternately. Following the paper, we have also implemented macro and micro search spaces on CIFAR10 to demonstrate how to use these trainers. Since the code to train from scratch on NNI is not ready yet, reproduction results are currently unavailable.
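The policy-gradient idea can be sketched on a toy problem. The controller below makes a single decision (one operation for one layer) and is updated with REINFORCE and a moving-average baseline. The reward values are invented, and the real ENAS controller is a recurrent network that makes many decisions and shares weights among child models, none of which is modelled here.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Invented validation rewards for three candidate operations at one layer.
REWARDS = {"conv3x3": 0.9, "conv5x5": 0.6, "maxpool": 0.3}


def train_controller(steps=2000, lr=0.1, seed=0):
    """Train a one-decision controller with REINFORCE; return its final policy."""
    rng = random.Random(seed)
    names = list(REWARDS)
    logits = [0.0] * len(names)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(names)), weights=probs)[0]
        reward = REWARDS[names[action]]
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline
        # d log pi(action) / d logit_i = 1[i == action] - probs[i]
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return dict(zip(names, softmax(logits)))


policy = train_controller()
best = max(policy, key=policy.get)
```

In this sketch the controller shifts probability mass toward the choice with the highest reward; in ENAS the reward would come from evaluating the sampled subgraph on the validation set.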
+Examples
+--------
+CIFAR10 Macro/Micro Search Space
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+:githublink:`Example code <examples/nas/oneshot/enas>`
+.. code-block:: bash
+   # In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
+   git clone https://github.com/Microsoft/nni.git
+   # search the best architecture
+   cd examples/nas/oneshot/enas
+   # search in macro search space
+   python3 search.py --search-for macro
+   # search in micro search space
+   python3 search.py --search-for micro
+   # view more options for search
+   python3 search.py -h
+Reference
+---------
+PyTorch
+^^^^^^^
+..  autoclass:: nni.retiarii.oneshot.pytorch.EnasTrainer
+    :noindex:
--- a/docs/en_US/NAS/ExecutionEngines.rst
+++ b/docs/en_US/NAS/ExecutionEngines.rst
+Execution Engines
+=================
+An execution engine runs a Retiarii experiment. NNI supports three execution engines; users can choose a specific engine according to the type of their model mutation definition and their requirements for cross-model optimizations.
+* **Pure-python execution engine** is the default engine. It supports model spaces expressed with the `inline mutation API <./MutationPrimitives.rst>`__.
+* **Graph-based execution engine** supports the use of `inline mutation APIs <./MutationPrimitives.rst>`__ and model spaces represented by `mutators <./Mutators.rst>`__. It requires the user's model to be parsed by `TorchScript <https://pytorch.org/docs/stable/jit.html>`__.
+* **CGO execution engine** has the same requirements and capabilities as the **Graph-based execution engine**, but further enables cross-model optimizations, which make model space exploration faster.
+Pure-python Execution Engine
+----------------------------
+Pure-python Execution Engine is the default engine. We recommend that users who are new to NNI NAS keep using it. The pure-python execution engine only acts within the scope of inline mutation APIs and does not touch the rest of the user model, so it places minimal requirements on the user model.
+Only one step is needed to use this engine:
+1. Add ``@nni.retiarii.model_wrapper`` decorator outside the whole PyTorch model.
+.. note:: You should always use ``super().__init__()`` instead of ``super(MyNetwork, self).__init__()`` in the PyTorch model, because the latter one has issues with model wrapper.
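One way such an issue can arise in plain Python is sketched below, under the assumption (not confirmed here) that the decorator rebinds the class name to a wrapper subclass. ``model_wrapper_sketch`` is a hypothetical stand-in, not NNI's decorator, and no PyTorch is involved.

```python
def model_wrapper_sketch(cls):
    """Hypothetical class decorator that returns a subclass of the decorated class."""
    class Wrapper(cls):
        def __init__(self, *args, **kwargs):
            self.init_args = (args, kwargs)  # record arguments, then delegate
            super().__init__(*args, **kwargs)
    Wrapper.__name__ = cls.__name__
    return Wrapper


class Base:
    def __init__(self):
        self.ready = True


def build_networks():
    @model_wrapper_sketch
    class GoodNet(Base):
        def __init__(self):
            # Zero-argument super() is tied to the original class body.
            super().__init__()

    @model_wrapper_sketch
    class LegacyNet(Base):
        def __init__(self):
            # The name LegacyNet now refers to the wrapper subclass, so this
            # looks up the parent of the wrapper and re-enters this __init__.
            super(LegacyNet, self).__init__()

    return GoodNet, LegacyNet


GoodNet, LegacyNet = build_networks()
good = GoodNet()
try:
    LegacyNet()
    legacy_failed = False
except RecursionError:
    legacy_failed = True
```

In this sketch the explicit two-argument form ends in a ``RecursionError``, while the zero-argument form initializes normally.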
+Graph-based Execution Engine
+----------------------------
+The graph-based execution engine converts a user-defined model to a graph representation (called graph IR) using `TorchScript <https://pytorch.org/docs/stable/jit.html>`__; each instantiated module in the model is converted to a subgraph. Mutations are then applied to the graph to generate new graphs. Each new graph is converted back to PyTorch code and executed on the user-specified training service.
+Users may find ``@basic_unit`` helpful in some cases. ``@basic_unit`` means the module will not be converted to a subgraph; instead, it is converted to a single graph node as a basic unit.
+``@basic_unit`` is usually used in the following cases:
+* When users want to tune initialization parameters of a module using ``ValueChoice``, then decorate the module with ``@basic_unit``. For example, ``self.conv = MyConv(kernel_size=nn.ValueChoice([1, 3, 5]))``, here ``MyConv`` should be decorated.
+* When a module cannot be successfully parsed to a subgraph, decorate the module with ``@basic_unit``. The parse failure could be due to complex control flow. Currently Retiarii does not support ad hoc loops; if there is an ad hoc loop in a module's ``forward``, the class should be decorated with ``@basic_unit``. For example, the following ``MyModule`` should be decorated.
+  .. code-block:: python
+    @basic_unit
+    class MyModule(nn.Module):
+      def __init__(self):
+        ...
+      def forward(self, x):
+        for i in range(10): # <- adhoc loop
+          ...
+* Some inline mutation APIs require their handled module to be decorated with ``@basic_unit``. For example, user-defined module that is provided to ``LayerChoice`` as a candidate op should be decorated.
+Three steps are needed to use graph-based execution engine.
+1. Remove ``@nni.retiarii.model_wrapper`` if there is any in your model.
+2. Add ``config.execution_engine = 'base'`` to ``RetiariiExeConfig``. The default value of ``execution_engine`` is 'py', which means pure-python execution engine.
+3. Add ``@basic_unit`` when necessary following the above guidelines.
+The graph-based execution engine can export the source code of top models by running ``exp.export_top_models(formatter='code')``.
+CGO Execution Engine (experimental)
+-----------------------------------
+CGO (Cross-Graph Optimization) execution engine performs cross-model optimizations on top of the graph-based execution engine. In CGO execution engine, multiple models can be merged and trained together in one trial.
+Currently, it only supports ``DedupInputOptimizer``, which merges graphs sharing the same dataset so that each batch of data is loaded and pre-processed only once, avoiding a bottleneck in data loading.
+.. note:: To use CGO engine, PyTorch-lightning above version 1.4.2 is required.
+To enable CGO execution engine, you need to follow these steps:
+1. Create ``RetiariiExeConfig`` with remote training service. CGO execution engine currently only supports remote training service.
+2. Add configurations for remote training service.
+3. Add configurations for CGO engine.
+  .. code-block:: python
+    exp = RetiariiExperiment(base_model, trainer, mutators, strategy)
+    config = RetiariiExeConfig('remote')
+    # ...
+    # other configurations of RetiariiExeConfig
+    config.execution_engine = 'cgo' # set execution engine to CGO
+    config.max_concurrency_cgo = 3 # the maximum number of concurrent models to merge
+    config.batch_waiting_time = 10  # how many seconds CGO execution engine should wait before optimizing a new batch of models
+    rm_conf = RemoteMachineConfig()
+    # ...
+    # server configuration in rm_conf
+    rm_conf.gpu_indices = [0, 1, 2, 3] # gpu_indices must be set in RemoteMachineConfig for CGO execution engine
+    config.training_service.machine_list = [rm_conf]
+    exp.run(config, 8099)
+CGO Execution Engine only supports pytorch-lightning trainers that inherit :class:`nni.retiarii.evaluator.pytorch.cgo.evaluator.MultiModelSupervisedLearningModule`.
+For a trial running multiple models, the trainers inheriting :class:`nni.retiarii.evaluator.pytorch.cgo.evaluator.MultiModelSupervisedLearningModule` can handle the multiple outputs from the merged model for training, test and validation.
+We have already implemented two trainers: :class:`nni.retiarii.evaluator.pytorch.cgo.evaluator.Classification` and :class:`nni.retiarii.evaluator.pytorch.cgo.evaluator.Regression`.
+.. code-block:: python
+  import nni.retiarii.evaluator.pytorch.lightning as pl
+  from nni.retiarii.evaluator.pytorch.cgo.evaluator import Classification
+  trainer = Classification(train_dataloader=pl.DataLoader(train_dataset, batch_size=100),
+                           val_dataloaders=pl.DataLoader(test_dataset, batch_size=100),
+                           max_epochs=1, limit_train_batches=0.2)
+Advanced users can also implement their own trainers by inheriting ``MultiModelSupervisedLearningModule``.
+Sometimes, a mutated model cannot be executed (e.g., due to shape mismatch). When a trial running multiple models contains
+a bad model, CGO execution engine will re-run each model independently in separate trials without cross-model optimizations.
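The fallback behaviour described above can be pictured with a small plain-Python sketch. All function names and the dictionary-based "models" are hypothetical; the real engine merges graphs and schedules trials on a training service, which is not modelled here.

```python
def run_model(model):
    """Pretend to train one model; a 'bad' model raises, like a shape mismatch."""
    if model.get("bad"):
        raise ValueError(f"shape mismatch in {model['name']}")
    return {"name": model["name"], "status": "ok"}


def run_merged_trial(models):
    """Run several models in one trial; any failure fails the whole merged trial."""
    return [run_model(m) for m in models]


def run_with_fallback(models):
    """Try the merged trial first; on failure, re-run each model on its own."""
    try:
        return run_merged_trial(models)
    except ValueError:
        results = []
        for model in models:
            try:
                results.append(run_model(model))
            except ValueError as err:
                results.append({"name": model["name"], "status": f"failed: {err}"})
        return results


batch = [{"name": "m0"}, {"name": "m1", "bad": True}, {"name": "m2"}]
results = run_with_fallback(batch)
```

With this structure a single bad model costs one extra round of independent runs, but the healthy models in the same batch still report results.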
--- a/docs/en_US/NAS/ExplorationStrategies.rst
+++ b/docs/en_US/NAS/ExplorationStrategies.rst
+Exploration Strategies for Multi-trial NAS
+==========================================
+Usage of Exploration Strategy
+-----------------------------
+To use an exploration strategy, users simply instantiate an exploration strategy and pass the instantiated object to ``RetiariiExperiment``. Below is a simple example.
+.. code-block:: python
+  import nni.retiarii.strategy as strategy
+  exploration_strategy = strategy.Random(dedup=True)  # dedup=False if deduplication is not wanted
+Supported Exploration Strategies
+--------------------------------
+NNI provides the following exploration strategies for multi-trial NAS.
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+   * - Name
+     - Brief Introduction of Algorithm
+   * - `Random Strategy <./ApiReference.rst#nni.retiarii.strategy.Random>`__
+     - Randomly sampling new model(s) from user defined model space. (``nni.retiarii.strategy.Random``)
+   * - `Grid Search <./ApiReference.rst#nni.retiarii.strategy.GridSearch>`__
+     - Sampling new model(s) from user defined model space using grid search algorithm. (``nni.retiarii.strategy.GridSearch``)
+   * - `Regularized Evolution <./ApiReference.rst#nni.retiarii.strategy.RegularizedEvolution>`__
+     - Generating new model(s) from generated models using `regularized evolution algorithm <https://arxiv.org/abs/1802.01548>`__ . (``nni.retiarii.strategy.RegularizedEvolution``)
+   * - `TPE Strategy <./ApiReference.rst#nni.retiarii.strategy.TPEStrategy>`__
+     - Sampling new model(s) from user defined model space using `TPE algorithm <https://papers.nips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf>`__ . (``nni.retiarii.strategy.TPEStrategy``)
+   * - `RL Strategy <./ApiReference.rst#nni.retiarii.strategy.PolicyBasedRL>`__
+     - It uses `PPO algorithm <https://arxiv.org/abs/1707.06347>`__ to sample new model(s) from user defined model space. (``nni.retiarii.strategy.PolicyBasedRL``)
+Customize Exploration Strategy
+------------------------------
+To create a new exploration strategy, users can follow the interface provided by NNI. Specifically, users should inherit the base strategy class ``BaseStrategy``, then implement the member function ``run``. This member function takes ``base_model`` and ``applied_mutators`` as its input arguments. It can simply apply the user-specified mutators in ``applied_mutators`` onto ``base_model`` to generate a new model. When a mutator is applied, it should be bound with a sampler (e.g., ``RandomSampler``). Every sampler implements the ``choice`` function, which chooses value(s) from candidate values. The ``choice`` functions invoked in mutators are executed with the sampler.
+Below is a very simple random strategy, which makes the choices completely random.
+.. code-block:: python
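+    import logging
+    import random
+    import time
+    from nni.retiarii import Sampler
+    from nni.retiarii.execution import query_available_resources, submit_models
+    from nni.retiarii.strategy import BaseStrategy
+    _logger = logging.getLogger(__name__)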
+    class RandomSampler(Sampler):
+        def choice(self, candidates, mutator, model, index):
+            return random.choice(candidates)
+    class RandomStrategy(BaseStrategy):
+        def __init__(self):
+            self.random_sampler = RandomSampler()
+        def run(self, base_model, applied_mutators):
+            _logger.info('strategy start...')
+            while True:
+                avail_resource = query_available_resources()
+                if avail_resource > 0:
+                    model = base_model
+                    _logger.info('apply mutators...')
+                    _logger.info('mutators: %s', str(applied_mutators))
+                    for mutator in applied_mutators:
+                        mutator.bind_sampler(self.random_sampler)
+                        model = mutator.apply(model)
+                    # run models
+                    submit_models(model)
+                else:
+                    time.sleep(2)
+Note that this strategy does not know the search space beforehand; it passively makes decisions every time ``choice`` is invoked from mutators. If a strategy wants to know the whole search space before making any decision (e.g., TPE, SMAC), it can use the ``dry_run`` function provided by ``Mutator`` to obtain the space. An example strategy can be found :githublink:`here <nni/retiarii/strategy/tpe_strategy.py>`.
+After generating a new model, the strategy can use our provided APIs (e.g., ``submit_models``, ``is_stopped_exec``) to submit the model and get its reported results. More APIs can be found in `API References <./ApiReference.rst>`__.
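The relationship between samplers, mutators and a dry run can be sketched without NNI. The classes below are simplified stand-ins written for illustration: the real ``choice`` takes more arguments, and the real ``dry_run`` may be implemented differently. Here a dry run applies the mutator once with a sampler that records every candidate list it is offered.

```python
import random


class Sampler:
    """Minimal stand-in for a sampler: picks one value from the candidates."""
    def choice(self, candidates):
        raise NotImplementedError


class RandomSampler(Sampler):
    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def choice(self, candidates):
        return self._rng.choice(candidates)


class RecordingSampler(Sampler):
    """Always takes the first candidate but remembers every candidate list seen."""
    def __init__(self):
        self.recorded = []

    def choice(self, candidates):
        self.recorded.append(list(candidates))
        return candidates[0]


class ToyMutator:
    """Toy mutator: a 'model' is a dict, and mutation fills in two choices."""
    def __init__(self):
        self.sampler = None

    def bind_sampler(self, sampler):
        self.sampler = sampler
        return self

    def apply(self, model):
        new_model = dict(model)  # leave the base model untouched
        new_model["kernel_size"] = self.sampler.choice([1, 3, 5])
        new_model["activation"] = self.sampler.choice(["relu", "tanh"])
        return new_model

    def dry_run(self, model):
        """Apply once with a recording sampler to discover the search space."""
        previous, recorder = self.sampler, RecordingSampler()
        self.sampler = recorder
        try:
            self.apply(model)
        finally:
            self.sampler = previous
        return recorder.recorded


base_model = {"name": "toy"}
mutator = ToyMutator()
search_space = mutator.dry_run(base_model)
sampled = mutator.bind_sampler(RandomSampler(seed=0)).apply(base_model)
```

A strategy such as TPE could use the recorded candidate lists to build its own model of the space before binding a sampler that replays its decisions.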
--- a/docs/en_US/NAS/FBNet.rst
+++ b/docs/en_US/NAS/FBNet.rst
+FBNet
+======
+.. note:: This one-shot NAS is still implemented under NNI NAS 1.0, and will `be migrated to Retiarii framework in v2.4 <https://github.com/microsoft/nni/issues/3814>`__.
+For the mobile application of facial landmark detection, based on the basic architecture of the PFLD model, we have applied FBNet (Block-wise DNAS) to design a concise model with a trade-off between latency and accuracy. References are listed below:
+* `FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search <https://arxiv.org/abs/1812.03443>`__
+* `PFLD: A Practical Facial Landmark Detector <https://arxiv.org/abs/1902.10859>`__
+FBNet is a block-wise differentiable NAS method (Block-wise DNAS), where the best candidate building blocks are chosen using Gumbel Softmax random sampling and differentiable training. At each layer (or stage) to be searched, the diverse candidate blocks are placed side by side (similar in effect to structural re-parameterization), leading to sufficient pre-training of the supernet. The pre-trained supernet is then sampled to obtain a subnet, which is fine-tuned to achieve better performance.
+.. image:: ../../img/fbnet.png
+   :target: ../../img/fbnet.png
+   :alt:
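For readers unfamiliar with Gumbel Softmax sampling, a minimal standard-library sketch follows. The logits and temperatures are invented for illustration, and the code is not taken from the PFLD example; it only shows how adding Gumbel noise to per-block logits yields soft selection weights.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Return soft one-hot weights: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample to avoid log(0).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    shift = max(noisy)
    exps = [math.exp(v - shift) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
block_logits = [2.0, 0.5, -1.0]  # invented architecture parameters for one layer
soft = gumbel_softmax(block_logits, temperature=5.0, rng=rng)

# The index of the largest weight is a sample from softmax(block_logits),
# so the block with the largest logit should be selected most often.
counts = [0, 0, 0]
for _ in range(2000):
    weights = gumbel_softmax(block_logits, temperature=1.0, rng=rng)
    counts[weights.index(max(weights))] += 1
```

Lower temperatures push the weights toward a one-hot vector, while higher temperatures keep them closer to uniform, which is why the temperature is typically treated as a search hyper-parameter.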
+PFLD is a lightweight facial landmark model for real-time applications. The architecture of PFLD is first simplified for acceleration by using the stem block of PeleeNet, average pooling with depthwise convolution, and the eSE module.
+To achieve a better trade-off between latency and accuracy, FBNet is further applied to the simplified PFLD to search for the best block at each specific layer. The search space is based on the FBNet space and optimized for mobile deployment by using average pooling with depthwise convolution, the eSE module, etc.
+Experiments
+------------
+To verify the effectiveness of FBNet applied on PFLD, we choose the open source dataset with 106 landmark points as the benchmark:
+* `Grand Challenge of 106-Point Facial Landmark Localization <https://arxiv.org/abs/1905.03469>`__
+The baseline model is denoted as MobileNet-V3 PFLD (`Reference baseline <https://github.com/Hsintao/pfld_106_face_landmarks>`__), and the searched model is denoted as Subnet. The experimental results are listed below, where the latency is measured on a Qualcomm 625 CPU (ARMv8):
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+   * - Model
+     - Size
+     - Latency
+     - Validation NME
+   * - MobileNet-V3 PFLD
+     - 1.01MB
+     - 10ms
+     - 6.22%
+   * - Subnet
+     - 693KB
+     - 1.60ms
+     - 5.58%
+Example
+--------
+`Example code <https://github.com/microsoft/nni/tree/master/examples/nas/oneshot/pfld>`__
+Please run the following scripts in the example directory.
+The Python dependencies used here are listed below:
+.. code-block:: bash
+   numpy==1.18.5
+   opencv-python==4.5.1.48
+   torch==1.6.0
+   torchvision==0.7.0
+   onnx==1.8.1
+   onnx-simplifier==0.3.5
+   onnxruntime==1.7.0
+Data Preparation
+-----------------
+First, download the `106points dataset <https://drive.google.com/file/d/1I7QdnLxAlyG2Tq3L66QYzGhiBEoVfzKo/view?usp=sharing>`__ to the path ``./data/106points``. The dataset includes a training set and a test set:
+.. code-block:: bash
+   ./data/106points/train_data/imgs
+   ./data/106points/train_data/list.txt
+   ./data/106points/test_data/imgs
+   ./data/106points/test_data/list.txt
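+Before training, it may help to confirm that the layout above is in place. The following is a minimal sketch using only the standard library; the helper name ``missing_dataset_paths`` is an assumption and is not part of the example scripts.
+.. code-block:: python
+   from pathlib import Path
+   EXPECTED = [
+       "train_data/imgs",
+       "train_data/list.txt",
+       "test_data/imgs",
+       "test_data/list.txt",
+   ]
+   def missing_dataset_paths(root="./data/106points"):
+       """Return the expected dataset entries that do not exist under ``root``."""
+       base = Path(root)
+       return [rel for rel in EXPECTED if not (base / rel).exists()]
+   if __name__ == "__main__":
+       missing = missing_dataset_paths()
+       print("dataset layout OK" if not missing else "missing: " + ", ".join(missing))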
+Quick Start
+-----------
+1. Search
+^^^^^^^^^^
+Based on the simplified PFLD architecture, first configure the multi-stage search space and the search hyper-parameters to construct the supernet. For example:
+.. code-block:: python
+   from lib.builder import search_space
+   from lib.ops import PRIMITIVES
+   from lib.supernet import PFLDInference, AuxiliaryNet
+   from nni.algorithms.nas.pytorch.fbnet import LookUpTable, NASConfig
+   # configuration of hyper-parameters
+   # search_space defines the multi-stage search space
+   nas_config = NASConfig(
+          model_dir="./ckpt_save",
+          nas_lr=0.01,
+          mode="mul",
+          alpha=0.25,
+          beta=0.6,
+          search_space=search_space,
+      )
+   # lookup table to manage the information
+   lookup_table = LookUpTable(config=nas_config, primitives=PRIMITIVES)
+   # create the supernet
+   pfld_backbone = PFLDInference(lookup_table)
+After creating the supernet with the specified search space and hyper-parameters, run the command below to start searching and training the supernet:
+.. code-block:: bash
+   python train.py --dev_id "0,1" --snapshot "./ckpt_save" --data_root "./data/106points"
+The validation accuracy will be shown during training, and the model with best accuracy will be saved as ``./ckpt_save/supernet/checkpoint_best.pth``.
+2. Finetune
+^^^^^^^^^^^^
+After pre-training the supernet, run the command below to sample the subnet and finetune it:
+.. code-block:: bash
+   python retrain.py --dev_id "0,1" --snapshot "./ckpt_save" --data_root "./data/106points" \
+                     --supernet "./ckpt_save/supernet/checkpoint_best.pth"
+The validation accuracy will be shown during training, and the model with best accuracy will be saved as ``./ckpt_save/subnet/checkpoint_best.pth``.
+3. Export
+^^^^^^^^^^
+After finetuning the subnet, run the command below to export the ONNX model:
+.. code-block:: bash
+   python export.py --supernet "./ckpt_save/supernet/checkpoint_best.pth" \
+                    --resume "./ckpt_save/subnet/checkpoint_best.pth"
+The ONNX model is saved as ``./output/subnet.onnx``, which can be further converted for a mobile inference engine by using `MNN <https://github.com/alibaba/MNN>`__.
+The checkpoints of the pre-trained supernet and subnet are provided below:
+* `Supernet <https://drive.google.com/file/d/1TCuWKq8u4_BQ84BWbHSCZ45N3JGB9kFJ/view?usp=sharing>`__
+* `Subnet <https://drive.google.com/file/d/160rkuwB7y7qlBZNM3W_T53cb6MQIYHIE/view?usp=sharing>`__
+* `ONNX model <https://drive.google.com/file/d/1s-v-aOiMv0cqBspPVF3vSGujTbn_T_Uo/view?usp=sharing>`__
\ No newline at end of file
--- a/docs/en_US/NAS/HardwareAwareNAS.rst
+++ b/docs/en_US/NAS/HardwareAwareNAS.rst
+Hardware-aware NAS
+==================
+.. contents::
+End-to-end Multi-trial SPOS Demo
+--------------------------------
+To enable affordable DNNs on edge and mobile devices, hardware-aware NAS searches for models with both high accuracy and low latency. In particular, the search algorithm only considers models within the target latency constraints during the search process.
+To run this demo, first install nn-Meter by running:
+.. code-block:: bash
+  pip install nn-meter
+Then run the multi-trial SPOS demo:
+.. code-block:: bash
+  python ${NNI_ROOT}/examples/nas/oneshot/spos/multi_trial.py
+How the demo works
+^^^^^^^^^^^^^^^^^^
+To support hardware-aware NAS, you first need a ``Strategy`` that supports filtering the models by latency. We provide such a filter named ``LatencyFilter`` in NNI and initialize a ``Random`` strategy with the filter:
+.. code-block:: python
+  simple_strategy = strategy.Random(model_filter=LatencyFilter(threshold=100, predictor=base_predictor))
+``LatencyFilter`` predicts the models' latency by using nn-Meter and filters out the models whose latency is larger than the threshold (i.e., ``100`` in this example).
+You can also build your own strategies and filters to support more flexible NAS such as sorting the models according to latency.
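+The filtering idea can be illustrated without NNI or nn-Meter. The sketch below assumes a filter is simply a callable that returns ``True`` for models to keep; ``ThresholdFilter`` and the stub predictor are hypothetical names, not the NNI implementation.
+.. code-block:: python
+  class ThresholdFilter:
+      """Keep models whose predicted latency does not exceed ``threshold`` (hypothetical)."""
+      def __init__(self, threshold, predictor):
+          self.threshold = threshold
+          self.predictor = predictor  # any callable: model -> latency in ms
+      def __call__(self, model):
+          return self.predictor(model) <= self.threshold
+  # stub predictor for illustration: pretend each layer costs 12 ms
+  def stub_predictor(model):
+      return 12.0 * len(model)
+  keep = ThresholdFilter(threshold=100, predictor=stub_predictor)
+  candidates = [["conv"] * 4, ["conv"] * 9]
+  accepted = [m for m in candidates if keep(m)]
+Here the four-layer candidate is kept and the nine-layer candidate is dropped; a sorting-based variant could rank candidates by the predicted value instead of thresholding it.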
+Then, pass this strategy to ``RetiariiExperiment``:
+.. code-block:: python
+  exp = RetiariiExperiment(base_model, trainer, strategy=simple_strategy)
+  exp_config = RetiariiExeConfig('local')
+  ...
+  exp_config.dummy_input = [1, 3, 32, 32]
+  exp.run(exp_config, port)
+In ``exp_config``, ``dummy_input`` is required for tracing shape info.
+End-to-end ProxylessNAS with Latency Constraints
+------------------------------------------------
+`ProxylessNAS <https://arxiv.org/pdf/1812.00332.pdf>`__ is a hardware-aware one-shot NAS algorithm. ProxylessNAS applies the expected latency of the model to build a differentiable metric and design efficient neural network architectures for hardware. The latency loss is added as a regularization term for architecture parameter optimization. In this example, nn-Meter provides a latency estimator to predict expected latency for the mixed operation on other types of mobile and edge hardware. 
+To run the one-shot ProxylessNAS demo, first install nn-Meter by running:
+.. code-block:: bash
+  pip install nn-meter
+Then run one-shot ProxylessNAS demo:
+.. code-block:: bash
+  python ${NNI_ROOT}/examples/nas/oneshot/proxylessnas/main.py --applied_hardware <hardware> --reference_latency <reference latency (ms)>
+How the demo works
+^^^^^^^^^^^^^^^^^^
+In the implementation of the ProxylessNAS ``trainer``, we provide a ``HardwareLatencyEstimator``, which currently builds a lookup table that stores the measured latency of each candidate building block in the search space. The sum of the latencies of all building blocks in a candidate model is treated as the model inference latency. The latency prediction is obtained from ``nn-Meter``. ``HardwareLatencyEstimator`` predicts the expected latency of the mixed operation based on the path weights of ``ProxylessLayerChoice``. By leveraging ``nn-Meter`` in NNI, users can apply ProxylessNAS to search for efficient DNN models on more types of edge devices.
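+As a rough illustration of the estimate described above, the expected latency of one mixed operation can be computed as the probability-weighted sum of per-candidate latencies from a lookup table. The table values, operation names, and function names below are made up for the sketch, and the softmax over path weights is an assumption about how the weights become probabilities.
+.. code-block:: python
+  import math
+  # hypothetical lookup table: candidate op -> measured latency in ms
+  LATENCY_TABLE = {"conv3x3": 2.0, "conv5x5": 4.0, "skip": 0.0}
+  def softmax(values):
+      peak = max(values)
+      exps = [math.exp(v - peak) for v in values]
+      total = sum(exps)
+      return [e / total for e in exps]
+  def expected_latency(path_weights):
+      """path_weights: dict of candidate op -> architecture weight (logit)."""
+      ops = list(path_weights)
+      probs = softmax([path_weights[op] for op in ops])
+      return sum(p * LATENCY_TABLE[op] for p, op in zip(probs, ops))
+  # equal weights give the plain average of the three latencies
+  layer_latency = expected_latency({"conv3x3": 0.0, "conv5x5": 0.0, "skip": 0.0})
+  # a model's latency is then taken as the sum over its layers
+  model_latency = sum([layer_latency, layer_latency])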
+Besides ``applied_hardware`` and ``reference_latency``, there are some other parameters related to hardware-aware ProxylessNAS training in this :githublink:`example <examples/nas/oneshot/proxylessnas/main.py>`:
+* ``grad_reg_loss_type``: Regularization type used to add the hardware-related loss. Allowed types are ``"mul#log"`` and ``"add#linear"``. ``"mul#log"`` is calculated as ``(torch.log(expected_latency) / math.log(reference_latency)) ** beta``. ``"add#linear"`` is calculated as ``reg_lambda * (expected_latency - reference_latency) / reference_latency``.
+* ``grad_reg_loss_lambda``: Regularization parameter, set to ``0.1`` by default.
+* ``grad_reg_loss_alpha``: Regularization parameter, set to ``0.2`` by default.
+* ``grad_reg_loss_beta``: Regularization parameter, set to ``0.3`` by default.
+* ``dummy_input``: The dummy input shape when applied to the target hardware. This parameter is set as (1, 3, 224, 224) by default.
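+The two regularization expressions listed above can be evaluated numerically as follows. This is a sketch with plain floats (``math.log`` stands in for ``torch.log``); the function names are hypothetical, and how the resulting term is combined with the training loss is not shown here.
+.. code-block:: python
+  import math
+  def mul_log_term(expected_latency, reference_latency, beta=0.3):
+      # (log(expected) / log(reference)) ** beta
+      return (math.log(expected_latency) / math.log(reference_latency)) ** beta
+  def add_linear_term(expected_latency, reference_latency, reg_lambda=0.1):
+      # reg_lambda * (expected - reference) / reference
+      return reg_lambda * (expected_latency - reference_latency) / reference_latency
+  # a model predicted at 80 ms against a 65 ms reference
+  mul_term = mul_log_term(80.0, 65.0)
+  add_term = add_linear_term(80.0, 65.0)
+When the expected latency equals the reference latency, the ``mul#log`` factor is 1 and the ``add#linear`` term is 0, so neither expression penalizes a model that exactly meets the reference.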
--- a/docs/en_US/NAS/Hypermodules.rst
+++ b/docs/en_US/NAS/Hypermodules.rst
+Hypermodules
+============
+A hypermodule is a (PyTorch) module that contains many architecture/hyperparameter candidates for itself. When hypermodules are used in a user-defined model, NNI automatically finds the best architecture/hyperparameters of the hypermodules for this model. This follows the design philosophy of Retiarii that users write a DNN model as a space.
+Several hypermodules have been proposed in the NAS community, such as AutoActivation and AutoDropout. Some of them are implemented in the Retiarii framework.
+..  autoclass:: nni.retiarii.nn.pytorch.AutoActivation
+    :members:
\ No newline at end of file
--- a/docs/en_US/NAS/ModelEvaluators.rst
+++ b/docs/en_US/NAS/ModelEvaluators.rst
+Model Evaluators
+================
+A model evaluator trains and validates each generated model. Evaluators are necessary to assess the performance of newly explored models.
+Customize Evaluator with Any Function
+-------------------------------------
+The simplest way to customize a new evaluator is with the functional API, which is very easy when training code is already available. Users only need to write a fit function that wraps everything, which usually includes training, validating and testing of a single model. This function takes one positional argument (``model_cls``) and possibly keyword arguments. The keyword arguments (other than ``model_cls``) are fed to ``FunctionalEvaluator`` as its initialization parameters (note that they will be `serialized <./Serialization.rst>`__). In this way, users keep everything under their control but expose less information to the framework; as a result, further optimizations like `CGO <./ExecutionEngines.rst#cgo-execution-engine-experimental>`__ might not be feasible. An example is as follows:
+.. code-block:: python
+    import nni
+    from nni.retiarii.evaluator import FunctionalEvaluator
+    from nni.retiarii.experiment.pytorch import RetiariiExperiment
+    def fit(model_cls, dataloader):
+        model = model_cls()
+        train(model, dataloader)
+        acc = test(model, dataloader)
+        nni.report_final_result(acc)
+    # The dataloader will be serialized, thus ``nni.trace`` is needed here.
+    # See serialization tutorial for more details.
+    evaluator = FunctionalEvaluator(fit, dataloader=nni.trace(DataLoader)(foo, bar))
+    experiment = RetiariiExperiment(base_model, evaluator, mutators, strategy)
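+To make the calling convention concrete, the sketch below mimics it with a stand-in class. ``SketchEvaluator`` is hypothetical and only illustrates how stored keyword arguments might be passed to ``fit`` together with ``model_cls``; it is not how ``FunctionalEvaluator`` is implemented.
+.. code-block:: python
+    class SketchEvaluator:
+        """Hypothetical stand-in: stores a function and its keyword arguments."""
+        def __init__(self, function, **kwargs):
+            self.function = function
+            self.kwargs = kwargs
+        def evaluate(self, model_cls):
+            # model_cls is the only positional argument; the rest are keywords
+            return self.function(model_cls, **self.kwargs)
+    class TinyModel:
+        size = 3
+    def fit(model_cls, scale):
+        model = model_cls()
+        return model.size * scale  # placeholder for train/test/report
+    result = SketchEvaluator(fit, scale=2).evaluate(TinyModel)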
+.. tip::
+    When using customized evaluators, if you want to visualize models, you need to export your model and save it to ``$NNI_OUTPUT_DIR/model.onnx`` in your evaluator. For example:
+    .. code-block:: python
+        import os
+        from pathlib import Path
+        import torch
+        def fit(model_cls):
+            model = model_cls()
+            onnx_path = Path(os.environ.get('NNI_OUTPUT_DIR', '.')) / 'model.onnx'
+            onnx_path.parent.mkdir(exist_ok=True)
+            dummy_input = torch.randn(10, 3, 224, 224)
+            torch.onnx.export(model, dummy_input, onnx_path)
+            # the rest of training code here
+    If the conversion is successful, the model can be visualized with tools such as `Netron <https://netron.app/>`__.
+Evaluators with PyTorch-Lightning
+---------------------------------
+Use Built-in Evaluators
+^^^^^^^^^^^^^^^^^^^^^^^
+NNI provides some commonly used model evaluators for users' convenience. These evaluators are built upon the PyTorch-Lightning library.
+We recommend reading the `serialization tutorial <./Serialization.rst>`__ before using these evaluators. A few notes to summarize the tutorial:
+1. ``pl.DataLoader`` should be used in place of ``torch.utils.data.DataLoader``.
+2. The datasets used in data-loader should be decorated with ``nni.trace`` recursively.
+For example,
+.. code-block:: python
+  import nni
+  import nni.retiarii.evaluator.pytorch.lightning as pl
+  from torchvision import transforms
+  from torchvision.datasets import MNIST
+  # nni.trace wraps a class; call the wrapped class with the usual arguments
+  transform = nni.trace(transforms.Compose)([nni.trace(transforms.ToTensor)(), nni.trace(transforms.Normalize)((0.1307,), (0.3081,))])
+  train_dataset = nni.trace(MNIST)(root='data/mnist', train=True, download=True, transform=transform)
+  test_dataset = nni.trace(MNIST)(root='data/mnist', train=False, download=True, transform=transform)
+  # pl.DataLoader and pl.Classification are already traced and support serialization.
+  evaluator = pl.Classification(train_dataloader=pl.DataLoader(train_dataset, batch_size=100),
+                                val_dataloaders=pl.DataLoader(test_dataset, batch_size=100),
+                                max_epochs=10)
+..  autoclass:: nni.retiarii.evaluator.pytorch.lightning.Classification
+    :noindex:
+..  autoclass:: nni.retiarii.evaluator.pytorch.lightning.Regression
+    :noindex:
+Customize Evaluator with PyTorch-Lightning
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Another approach is to write training code in PyTorch-Lightning style, that is, to write a LightningModule that defines all elements needed for training (e.g., loss function, optimizer) and to define a trainer that takes (optional) dataloaders to execute the training. Before that, please read the `PyTorch-Lightning documentation <https://pytorch-lightning.readthedocs.io/>`__ to learn the basic concepts and components provided by PyTorch-Lightning.
+In practice, a new training module in Retiarii should inherit ``nni.retiarii.evaluator.pytorch.lightning.LightningModule``, which has a ``set_model`` method that is called after ``__init__`` to save the candidate model (generated by the strategy) as ``self.model``. The rest of the process (like ``training_step``) is the same as writing any other Lightning module. Evaluators should also communicate with strategies via two API calls (``nni.report_intermediate_result`` for periodic metrics and ``nni.report_final_result`` for final metrics), added in ``on_validation_epoch_end`` and ``teardown`` respectively.
+An example is as follows:
+.. code-block:: python
+    import nni
+    import torch
+    import torch.nn as nn
+    import torch.nn.functional as F
+    from nni.retiarii.evaluator.pytorch.lightning import LightningModule  # please import this one
+    @nni.trace
+    class AutoEncoder(LightningModule):
+        def __init__(self):
+            super().__init__()
+            self.decoder = nn.Sequential(
+                nn.Linear(3, 64),
+                nn.ReLU(),
+                nn.Linear(64, 28*28)
+            )
+        def forward(self, x):
+            embedding = self.model(x)  # let's search for encoder
+            return embedding
+        def training_step(self, batch, batch_idx):
+            # training_step defines the training loop.
+            # It is independent of forward.
+            x, y = batch
+            x = x.view(x.size(0), -1)
+            z = self.model(x)  # model is the one that is searched for
+            x_hat = self.decoder(z)
+            loss = F.mse_loss(x_hat, x)
+            # Logging to TensorBoard by default
+            self.log('train_loss', loss)
+            return loss
+        def validation_step(self, batch, batch_idx):
+            x, y = batch
+            x = x.view(x.size(0), -1)
+            z = self.model(x)
+            x_hat = self.decoder(z)
+            loss = F.mse_loss(x_hat, x)
+            self.log('val_loss', loss)
+        def configure_optimizers(self):
+            optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
+            return optimizer
+        def on_validation_epoch_end(self):
+            nni.report_intermediate_result(self.trainer.callback_metrics['val_loss'].item())
+        def teardown(self, stage):
+            if stage == 'fit':
+                nni.report_final_result(self.trainer.callback_metrics['val_loss'].item())
+Then, users need to wrap everything (including LightningModule, trainer and dataloaders) into a ``Lightning`` object, and pass this object into a Retiarii experiment.
+.. code-block:: python
+    import nni.retiarii.evaluator.pytorch.lightning as pl
+    from nni.retiarii.experiment.pytorch import RetiariiExperiment
+    lightning = pl.Lightning(AutoEncoder(),
+                             pl.Trainer(max_epochs=10),
+                             train_dataloader=pl.DataLoader(train_dataset, batch_size=100),
+                             val_dataloaders=pl.DataLoader(test_dataset, batch_size=100))
+    experiment = RetiariiExperiment(base_model, lightning, mutators, strategy)
--- a/docs/en_US/NAS/MutationPrimitives.rst
+++ b/docs/en_US/NAS/MutationPrimitives.rst
+Mutation Primitives
+===================
+To let users easily express a model space within their PyTorch/TensorFlow model, NNI provides some inline mutation APIs as shown below.
+* `nn.LayerChoice <./ApiReference.rst#nni.retiarii.nn.pytorch.LayerChoice>`__. It allows users to provide several candidate operations (e.g., PyTorch modules), one of which is chosen in each explored model.
+  .. code-block:: python
+    # import nni.retiarii.nn.pytorch as nn
+    # declared in `__init__` method
+    self.layer = nn.LayerChoice([
+      ops.PoolBN('max', channels, 3, stride, 1),
+      ops.SepConv(channels, channels, 3, stride, 1),
+      nn.Identity()
+    ])
+    # invoked in `forward` method
+    out = self.layer(x)
+* `nn.InputChoice <./ApiReference.rst#nni.retiarii.nn.pytorch.InputChoice>`__. It is mainly for choosing (or trying) different connections. It takes several tensors and chooses ``n_chosen`` tensors from them.
+  .. code-block:: python
+    # import nni.retiarii.nn.pytorch as nn
+    # declared in `__init__` method
+    self.input_switch = nn.InputChoice(n_chosen=1)
+    # invoked in `forward` method, choose one from the three
+    out = self.input_switch([tensor1, tensor2, tensor3])
+* `nn.ValueChoice <./ApiReference.rst#nni.retiarii.nn.pytorch.ValueChoice>`__. It is for choosing one value from some candidate values. It can only be used as input argument of basic units, that is, modules in ``nni.retiarii.nn.pytorch`` and user-defined modules decorated with ``@basic_unit``.
+  .. code-block:: python
+    # import nni.retiarii.nn.pytorch as nn
+    # used in `__init__` method
+    self.conv = nn.Conv2d(XX, XX, kernel_size=nn.ValueChoice([1, 3, 5]))
+    self.op = MyOp(nn.ValueChoice([0, 1]), nn.ValueChoice([-1, 1]))
+* `nn.Repeat <./ApiReference.rst#nni.retiarii.nn.pytorch.Repeat>`__. Repeat a block by a variable number of times.
+* `nn.Cell <./ApiReference.rst#nni.retiarii.nn.pytorch.Cell>`__. `This cell structure is widely used in NAS literature <https://arxiv.org/abs/1611.01578>`__. Specifically, the cell consists of multiple "nodes". Each node is a sum of multiple operators. Each operator is chosen from user-specified candidates and takes one input from previous nodes and predecessors. A predecessor is an input of the cell. The output of the cell is the concatenation of some of the nodes in the cell (currently all the nodes).
+All the APIs have an optional argument called ``label``. Mutations with the same label share the same choice. A typical example is:
+  .. code-block:: python
+    self.net = nn.Sequential(
+        nn.Linear(10, nn.ValueChoice([32, 64, 128], label='hidden_dim')),
+        nn.Linear(nn.ValueChoice([32, 64, 128], label='hidden_dim'), 3)
+    )
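+One way to picture label sharing is a sampler that caches one decision per label. The sketch below is purely illustrative; ``sample_choices`` and its input format are assumptions, not NNI APIs.
+.. code-block:: python
+    import random
+    def sample_choices(choices, rng=random):
+        """choices: list of (label, candidates); same label -> same sampled value."""
+        decided = {}
+        sampled = []
+        for label, candidates in choices:
+            if label not in decided:
+                decided[label] = rng.choice(candidates)
+            sampled.append(decided[label])
+        return sampled
+    out_dim, in_dim = sample_choices([
+        ("hidden_dim", [32, 64, 128]),
+        ("hidden_dim", [32, 64, 128]),
+    ])
+    # both entries share one decision, so the two linear layers stay shape-compatible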