Merge pull request #4668 from microsoft/doc-refactor

Unverified Commit 51d261e7 authored Mar 22, 2022 by J-shang Committed by GitHub Mar 22, 2022
20 changed files
--- a/docs/source/NAS/ExecutionEngines.rst
+++ b/docs/source/NAS/ExecutionEngines.rst
 Execution Engines
 =================

-Execution engine is for running Retiarii Experiment. NNI supports three execution engines, users can choose a speicific engine according to the type of their model mutation definition and their requirements for cross-model optimizations. 
+An execution engine runs a Retiarii experiment. NNI supports three execution engines; users can choose a specific engine according to how their model mutations are defined and whether they need cross-model optimizations. 

 * **Pure-python execution engine** is the default engine. It supports the model space expressed by `inline mutation API <./MutationPrimitives.rst>`__. 

@@ -53,10 +53,12 @@ Three steps are need to use graph-based execution engine.

 For exporting top models, graph-based execution engine supports exporting source code for top models by running ``exp.export_top_models(formatter='code')``.

+.. _cgo-execution-engine:
+
 CGO Execution Engine (experimental)
 -----------------------------------

-CGO（Cross-Graph Optimization) execution engine does cross-model optimizations based on the graph-based execution engine. In CGO execution engine, multiple models could be merged and trained together in one trial.
+The CGO (Cross-Graph Optimization) execution engine performs cross-model optimizations on top of the graph-based execution engine. With the CGO execution engine, multiple models can be merged and trained together in one trial.
 Currently, it only supports ``DedupInputOptimizer``, which merges graphs that share the same dataset so that each batch of data is loaded and pre-processed only once, avoiding a bottleneck on data loading. 

 .. note :: To use CGO engine, PyTorch-lightning above version 1.4.2 is required.
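
The idea behind input deduplication can be illustrated with a minimal sketch in plain Python. It is not the ``DedupInputOptimizer`` implementation; ``train_merged`` and the toy models are hypothetical names used only to show that one data pass can feed several models.

```python
from typing import Callable, Iterable, List


def train_merged(batches: Iterable[list], models: List[Callable[[list], float]]) -> dict:
    """Feed every loaded batch to all models, so each batch is loaded only once."""
    stats = {"batches_loaded": 0, "losses": [[] for _ in models]}
    for batch in batches:  # loading and pre-processing would happen once, here
        stats["batches_loaded"] += 1
        for index, model in enumerate(models):
            # every merged model consumes the same in-memory batch
            stats["losses"][index].append(model(batch))
    return stats


# Two toy "models" (hypothetical stand-ins) that share one toy dataset.
dataset = [[1, 2], [3, 4], [5, 6]]
result = train_merged(dataset, [lambda b: float(sum(b)), lambda b: float(max(b))])
print(result["batches_loaded"])  # number of batch loads, independent of the number of models
```

Without merging, the same dataset would be iterated once per model, so the number of batch loads would grow with the number of models in the trial.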
@@ -67,7 +69,7 @@ To enable CGO execution engine, you need to follow these steps:
 2. Add configurations for remote training service
 3. Add configurations for CGO engine

-  .. code-block:: python
+.. code-block:: python
  
    exp = RetiariiExperiment(base_model, trainer, mutators, strategy)
    config = RetiariiExeConfig('remote')
@@ -103,4 +105,16 @@ We have already implemented two trainers: :class:`nni.retiarii.evaluator.pytorch
 Advanced users can also implement their own trainers by inheriting ``MultiModelSupervisedLearningModule``.

 Sometimes, a mutated model cannot be executed (e.g., due to shape mismatch). When a trial running multiple models contains 
-a bad model, CGO execution engine will re-run each model independently in seperate trials without cross-model optimizations.
+a bad model, CGO execution engine will re-run each model independently in separate trials without cross-model optimizations.
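
The fallback behavior described above can be sketched as follows. This is only an illustration under stated assumptions: ``run_with_fallback`` and ``toy_trial`` are hypothetical helpers, and a ``RuntimeError`` stands in for whatever failure a bad model triggers.

```python
from typing import Callable, Dict, List


def run_with_fallback(models: List[str],
                      run_trial: Callable[[List[str]], Dict[str, float]]) -> Dict[str, object]:
    """Try one merged trial first; if it fails, re-run every model in its own trial."""
    try:
        return dict(run_trial(models))
    except RuntimeError:
        results: Dict[str, object] = {}
        for name in models:
            try:
                # separate trial, no cross-model optimization
                results.update(run_trial([name]))
            except RuntimeError as err:
                results[name] = f"failed: {err}"
        return results


def toy_trial(models: List[str]) -> Dict[str, float]:
    """Hypothetical trial runner: fails whenever the trial contains the model named 'bad'."""
    if "bad" in models:
        raise RuntimeError("shape mismatch")
    return {name: 1.0 for name in models}


outcome = run_with_fallback(["a", "bad", "b"], toy_trial)
print(outcome)
```

The good models still produce results, and only the bad model is reported as failed.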
+
+References
+^^^^^^^^^^
+
+..  autoclass:: nni.retiarii.evaluator.pytorch.cgo.evaluator.MultiModelSupervisedLearningModule
+    :members:
+
+..  autoclass:: nni.retiarii.evaluator.pytorch.cgo.evaluator.Classification
+    :members:
+
+..  autoclass:: nni.retiarii.evaluator.pytorch.cgo.evaluator.Regression
+    :members:
--- a/docs/source/nas/exploration_strategy.rst
+++ b/docs/source/nas/exploration_strategy.rst
+Exploration Strategy
+====================
+
+There are two approaches to exploring a model space: **Multi-trial NAS** and **One-shot NAS**. Multi-trial NAS trains each sampled model in the model space independently, while One-shot NAS samples models from a super model. After constructing the model space, users can use either approach to explore it. 
+
+.. _multi-trial-nas:
+
+Multi-trial strategy
+--------------------
+
+Multi-trial NAS means that each model sampled from the model space is trained independently. A typical multi-trial NAS is `NASNet <https://arxiv.org/abs/1707.07012>`__. In multi-trial NAS, users need a model evaluator to evaluate the performance of each sampled model, and an exploration strategy to sample models from a defined model space. Users can use the model evaluators provided by NNI or write their own model evaluator, and can simply choose one of the built-in exploration strategies. Advanced users can also customize a new exploration strategy.
+
+To use an exploration strategy, users simply instantiate it and pass the instantiated object to :class:`nni.retiarii.experiment.pytorch.RetiariiExperiment`. Below is a simple example.
+
+.. code-block:: python
+
+  import nni.retiarii.strategy as strategy
+
+  exploration_strategy = strategy.Random(dedup=True)
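
As a rough illustration of what random sampling with deduplication means, here is a minimal sketch in plain Python. It does not use the NNI API; ``sample_random`` and the toy ``space`` are hypothetical, and the sketch assumes that deduplication simply means never proposing the same configuration twice.

```python
import random
from typing import Dict, List


def sample_random(space: Dict[str, list], n: int, dedup: bool = True, seed: int = 0) -> List[dict]:
    """Draw up to n random configurations; with dedup, never return the same one twice."""
    rng = random.Random(seed)
    keys = sorted(space)
    total = 1
    for key in keys:
        total *= len(space[key])  # size of the whole (finite) space

    seen = set()
    samples: List[dict] = []
    while len(samples) < n:
        if dedup and len(seen) == total:
            break  # the space is exhausted, nothing new can be sampled
        config = tuple(rng.choice(space[key]) for key in keys)
        if dedup:
            if config in seen:
                continue  # skip configurations that were already proposed
            seen.add(config)
        samples.append(dict(zip(keys, config)))
    return samples


space = {"op": ["conv", "pool"], "width": [16, 32]}
samples = sample_random(space, n=10)
print(len(samples))  # at most 4, because the toy space has only 4 distinct configurations
```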
+
+Instead of ``strategy.Random``, users can choose any of the strategies below.
+
+.. _random-strategy:
+
+Random
+^^^^^^
+
+.. autoclass:: nni.retiarii.strategy.Random
+   :members:
+   :noindex:
+
+.. _grid-search-strategy:
+
+GridSearch
+^^^^^^^^^^
+
+.. autoclass:: nni.retiarii.strategy.GridSearch
+   :members:
+   :noindex:
+
+.. _regularized-evolution-strategy:
+
+RegularizedEvolution
+^^^^^^^^^^^^^^^^^^^^
+
+.. autoclass:: nni.retiarii.strategy.RegularizedEvolution
+   :members:
+   :noindex:
+
+.. _tpe-strategy:
+
+TPE
+^^^
+
+.. autoclass:: nni.retiarii.strategy.TPE
+   :members:
+   :noindex:
+
+.. footbibliography::
+
+.. _policy-based-rl-strategy:
+
+PolicyBasedRL
+^^^^^^^^^^^^^
+
+.. autoclass:: nni.retiarii.strategy.PolicyBasedRL
+   :members:
+   :noindex:
+
+.. footbibliography::
+
+.. _one-shot-nas:
+
+One-shot strategy
+-----------------
+
+One-shot NAS algorithms leverage weight sharing among models in the neural architecture search space to train a supernet, and use this supernet to guide the selection of better models. This type of algorithm greatly reduces the computational resources required compared to independently training each model from scratch (which we call "Multi-trial NAS"). NNI supports the following popular One-shot NAS algorithms.
+
+.. _darts-strategy:
+
+DARTS
+^^^^^
+
+The paper `DARTS: Differentiable Architecture Search <https://arxiv.org/abs/1806.09055>`__ addresses the scalability challenge of architecture search by formulating the task in a differentiable manner. Their method is based on the continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent.
+
+The authors' code optimizes the network weights and architecture weights alternately in mini-batches. They further explore using second-order optimization (unroll) instead of first-order optimization to improve performance.
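
The continuous relaxation can be sketched in a few lines of plain Python. This is a toy illustration, not ``DartsTrainer``: the names ``mixed_op`` and ``derive`` are hypothetical, scalars stand in for tensors, and the gradient-based update of the architecture weights is omitted.

```python
import math
from typing import Callable, List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x: float, candidate_ops: List[Callable[[float], float]], alphas: List[float]) -> float:
    """Relaxed choice: a softmax-weighted sum of all candidate operations."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, candidate_ops))


def derive(alphas: List[float]) -> int:
    """After search, keep the candidate with the largest architecture weight."""
    return max(range(len(alphas)), key=lambda i: alphas[i])


ops = [lambda x: x, lambda x: 2 * x, lambda x: 0.0]  # toy candidates: identity, scale, zero
y = mixed_op(3.0, ops, [0.0, 0.0, 0.0])              # equal weights: (3 + 6 + 0) / 3
print(y, derive([0.1, 2.0, -1.0]))
```

Because the output is a differentiable function of the architecture weights, those weights can be trained with gradient descent alongside the network weights, and a discrete architecture is read off at the end.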
+
+The implementation on NNI is based on the `official implementation <https://github.com/quark0/darts>`__ and a `popular 3rd-party repo <https://github.com/khanrc/pt.darts>`__. DARTS on NNI is designed to be general for arbitrary search spaces. A CNN search space tailored for CIFAR10, the same as in the original paper, is implemented as a use case of DARTS.
+
+..  autoclass:: nni.retiarii.oneshot.pytorch.DartsTrainer
+    :noindex:
+
+Reproduction Results
+""""""""""""""""""""
+
+The above-mentioned example is meant to reproduce the results in the paper. We ran experiments with both first-order and second-order optimization. Due to time limits, we retrained *only the best architecture* derived from the search phase and repeated the experiment *only once*. Our results are currently on par with the results reported in the paper. We will add more results when they are ready.
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - 
+     - In paper
+     - Reproduction
+   * - First order (CIFAR10)
+     - 3.00 +/- 0.14
+     - 2.78
+   * - Second order (CIFAR10)
+     - 2.76 +/- 0.09
+     - 2.80
+
+Examples
+""""""""
+
+:githublink:`Example code <examples/nas/oneshot/darts>`
+
+.. code-block:: bash
+
+   # In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
+   git clone https://github.com/Microsoft/nni.git
+
+   # search the best architecture
+   cd examples/nas/oneshot/darts
+   python3 search.py
+
+   # train the best architecture
+   python3 retrain.py --arc-checkpoint ./checkpoints/epoch_49.json
+
+Limitations
+"""""""""""
+
+* DARTS doesn't support DataParallel and needs to be customized in order to support DistributedDataParallel.
+
+.. _enas-strategy:
+
+ENAS
+^^^^
+
+The paper `Efficient Neural Architecture Search via Parameter Sharing <https://arxiv.org/abs/1802.03268>`__ uses parameter sharing between child models to accelerate the NAS process. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained with policy gradient to select a subgraph that maximizes the expected reward on the validation set. Meanwhile the model corresponding to the selected subgraph is trained to minimize a canonical cross entropy loss.
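
To make the controller update more concrete, below is a minimal sketch of a policy-gradient step on a single categorical decision. It is not ``EnasTrainer``: ``train_controller`` is a hypothetical helper, the rewards are made-up numbers, and the sketch uses the exact expected gradient instead of sampled child models so that it stays deterministic.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def train_controller(rewards: List[float], steps: int = 200, lr: float = 1.0) -> List[float]:
    """Gradient ascent on the expected reward of a softmax policy over candidate choices."""
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * r for p, r in zip(probs, rewards))  # expected reward
        # d E[reward] / d logit_k = p_k * (r_k - E[reward])
        grads = [p * (r - baseline) for p, r in zip(probs, rewards)]
        logits = [logit + lr * g for logit, g in zip(logits, grads)]
    return softmax(logits)


# Hypothetical validation rewards of three candidate subgraphs.
probs = train_controller([0.1, 0.5, 0.9])
print(probs)
```

The policy gradually shifts probability towards the choice with the highest reward, which is the behavior the controller training aims for.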
+
+The implementation on NNI is based on the `official implementation in Tensorflow <https://github.com/melodyguan/enas>`__, and includes a general-purpose reinforcement-learning controller and a trainer that trains the target network and this controller alternately. Following the paper, we have also implemented the macro and micro search spaces on CIFAR10 to demonstrate how to use these trainers. Since the code to train from scratch on NNI is not ready yet, reproduction results are currently unavailable.
+
+..  autoclass:: nni.retiarii.oneshot.pytorch.EnasTrainer
+    :noindex:
+
+Examples
+""""""""
+
+:githublink:`Example code <examples/nas/oneshot/enas>`
+
+.. code-block:: bash
+
+   # In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
+   git clone https://github.com/Microsoft/nni.git
+
+   # search the best architecture
+   cd examples/nas/oneshot/enas
+
+   # search in macro search space
+   python3 search.py --search-for macro
+
+   # search in micro search space
+   python3 search.py --search-for micro
+
+   # view more options for search
+   python3 search.py -h
+
+.. _fbnet-strategy:
+
+FBNet
+^^^^^
+
+.. note:: This one-shot NAS is still implemented under NNI NAS 1.0, and will `be migrated to Retiarii framework in v2.4 <https://github.com/microsoft/nni/issues/3814>`__.
+
+For the mobile application of facial landmark detection, based on the basic architecture of the PFLD model, we have applied FBNet (Block-wise DNAS) to design a concise model that balances latency and accuracy. References are listed below:
+
+* `FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search <https://arxiv.org/abs/1812.03443>`__
+* `PFLD: A Practical Facial Landmark Detector <https://arxiv.org/abs/1902.10859>`__
+
+FBNet is a block-wise differentiable NAS method (Block-wise DNAS), where the best candidate building blocks are chosen by using Gumbel Softmax random sampling and differentiable training. At each layer (or stage) to be searched, the diverse candidate blocks are placed side by side (similar in effect to structural re-parameterization), leading to sufficient pre-training of the supernet. The pre-trained supernet is then sampled to obtain a subnet, which is fine-tuned to achieve better performance.
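
Gumbel Softmax sampling itself can be sketched in plain Python as follows. This is a generic illustration of the technique, not the FBNet code in NNI; ``gumbel_softmax`` and the example logits are hypothetical.

```python
import math
import random
from typing import List


def gumbel_softmax(logits: List[float], temperature: float, rng: random.Random) -> List[float]:
    """Return soft selection weights over candidate blocks using Gumbel noise."""
    noisy = []
    for logit in logits:
        # clamp the uniform sample so that both logarithms are well defined
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) sample
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
weights = gumbel_softmax([1.0, 0.5, -0.5], temperature=1.0, rng=rng)
print(weights)
```

A lower temperature pushes the weights closer to a one-hot choice, while a higher temperature keeps them closer to uniform.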
+
+.. image:: ../../img/fbnet.png
+
+PFLD is a lightweight facial landmark model for real-time applications. The architecture of PFLD is first simplified for acceleration by using the stem block of PeleeNet, average pooling with depthwise convolution, and the eSE module.
+
+To achieve a better trade-off between latency and accuracy, FBNet is then applied to the simplified PFLD to search for the best block at each specific layer. The search space is based on the FBNet space and optimized for mobile deployment by using average pooling with depthwise convolution, the eSE module, etc.
+
+Experiments
+"""""""""""
+
+To verify the effectiveness of FBNet applied on PFLD, we choose the open source dataset with 106 landmark points as the benchmark:
+
+* `Grand Challenge of 106-Point Facial Landmark Localization <https://arxiv.org/abs/1905.03469>`__
+
+The baseline model is denoted as MobileNet-V3 PFLD (`Reference baseline <https://github.com/Hsintao/pfld_106_face_landmarks>`__), and the searched model is denoted as Subnet. The experimental results are listed below, where the latency is tested on a Qualcomm 625 CPU (ARMv8):
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Model
+     - Size
+     - Latency
+     - Validation NME
+   * - MobileNet-V3 PFLD
+     - 1.01MB
+     - 10ms
+     - 6.22%
+   * - Subnet
+     - 693KB
+     - 1.60ms
+     - 5.58%
+
+Example
+"""""""
+
+`Example code <https://github.com/microsoft/nni/tree/master/examples/nas/oneshot/pfld>`__
+
+Please run the following scripts in the example directory.
+
+The Python dependencies used here are listed below:
+
+.. code-block:: bash
+
+   numpy==1.18.5
+   opencv-python==4.5.1.48
+   torch==1.6.0
+   torchvision==0.7.0
+   onnx==1.8.1
+   onnx-simplifier==0.3.5
+   onnxruntime==1.7.0
+
+To run the tutorial, follow the steps below:
+
+1. **Data Preparation**: First, download the `106points dataset <https://drive.google.com/file/d/1I7QdnLxAlyG2Tq3L66QYzGhiBEoVfzKo/view?usp=sharing>`__ to the path ``./data/106points``. The dataset includes a training set and a test set:
+
+   .. code-block:: bash
+
+      ./data/106points/train_data/imgs
+      ./data/106points/train_data/list.txt
+      ./data/106points/test_data/imgs
+      ./data/106points/test_data/list.txt
+
+2. **Search**: Based on the architecture of the simplified PFLD, first configure the multi-stage search space and the search hyper-parameters to construct the supernet. For example,
+
+   .. code-block:: python
+
+      from lib.builder import search_space
+      from lib.ops import PRIMITIVES
+      from lib.supernet import PFLDInference, AuxiliaryNet
+      from nni.algorithms.nas.pytorch.fbnet import LookUpTable, NASConfig
+
+      # configuration of hyper-parameters
+      # search_space defines the multi-stage search space
+      nas_config = NASConfig(
+          model_dir="./ckpt_save",
+          nas_lr=0.01,
+          mode="mul",
+          alpha=0.25,
+          beta=0.6,
+          search_space=search_space,
+      )
+      # lookup table to manage the information
+      lookup_table = LookUpTable(config=nas_config, primitives=PRIMITIVES)
+      # create the supernet
+      pfld_backbone = PFLDInference(lookup_table)
+
+   After creating the supernet with the specified search space and hyper-parameters, run the command below to start searching and training the supernet:
+
+   .. code-block:: bash
+
+      python train.py --dev_id "0,1" --snapshot "./ckpt_save" --data_root "./data/106points"
+
+   The validation accuracy is shown during training, and the model with the best accuracy is saved as ``./ckpt_save/supernet/checkpoint_best.pth``.
+
+3. **Finetune**: After pre-training the supernet, run the command below to sample the subnet and fine-tune it:
+
+   .. code-block:: bash
+
+      python retrain.py --dev_id "0,1" --snapshot "./ckpt_save" --data_root "./data/106points" \
+                        --supernet "./ckpt_save/supernet/checkpoint_best.pth"
+
+   The validation accuracy is shown during training, and the model with the best accuracy is saved as ``./ckpt_save/subnet/checkpoint_best.pth``.
+
+4. **Export**: After fine-tuning the subnet, run the command below to export the ONNX model:
+
+   .. code-block:: bash
+
+      python export.py --supernet "./ckpt_save/supernet/checkpoint_best.pth" \
+                       --resume "./ckpt_save/subnet/checkpoint_best.pth"
+
+   The ONNX model is saved as ``./output/subnet.onnx``, which can be further converted for a mobile inference engine by using `MNN <https://github.com/alibaba/MNN>`__.
+   The checkpoints of the pre-trained supernet and subnet are provided below:
+
+   * `Supernet <https://drive.google.com/file/d/1TCuWKq8u4_BQ84BWbHSCZ45N3JGB9kFJ/view?usp=sharing>`__
+   * `Subnet <https://drive.google.com/file/d/160rkuwB7y7qlBZNM3W_T53cb6MQIYHIE/view?usp=sharing>`__
+   * `ONNX model <https://drive.google.com/file/d/1s-v-aOiMv0cqBspPVF3vSGujTbn_T_Uo/view?usp=sharing>`__
+
+.. _spos-strategy:
+
+SPOS
+^^^^
+
+`Single Path One-Shot Neural Architecture Search with Uniform Sampling <https://arxiv.org/abs/1904.00420>`__ proposes a one-shot NAS method that addresses the difficulties in training One-Shot NAS models by constructing a simplified supernet trained with a uniform path sampling method, so that all underlying architectures (and their weights) get trained fully and equally. An evolutionary algorithm is then applied to efficiently search for the best-performing architectures without any fine-tuning.
+
+The implementation on NNI is based on the `official repo <https://github.com/megvii-model/SinglePathOneShot>`__. We implement a trainer that trains the supernet and an evolution tuner that leverages the NNI framework to speed up the evolutionary search phase.
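
Uniform path sampling can be sketched as follows. This is a toy illustration and not ``SinglePathTrainer``; ``sample_single_path`` is a hypothetical helper and the number of candidates per layer is made up.

```python
import random
from typing import List


def sample_single_path(choices_per_layer: List[int], rng: random.Random) -> List[int]:
    """Pick exactly one candidate per layer, each with equal probability."""
    return [rng.randrange(num_choices) for num_choices in choices_per_layer]


rng = random.Random(0)
choices_per_layer = [4, 4, 4]  # hypothetical supernet: 3 layers with 4 candidate blocks each
paths = [sample_single_path(choices_per_layer, rng) for _ in range(1000)]
print(paths[0])  # one single-path architecture, as a list of block indices
```

In each training step only the sampled path would be activated and updated, so over many steps every candidate block receives a similar amount of training.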
+
+..  autoclass:: nni.retiarii.oneshot.pytorch.SinglePathTrainer
+    :noindex:
+
+Examples
+""""""""
+
+Here is a use case, which uses the search space from the paper. However, we apply a latency limit instead of a FLOPs limit in the architecture search phase.
+
+:githublink:`Example code <examples/nas/oneshot/spos>`
+
+**Requirements:** Prepare ImageNet in the standard format (follow the script `here <https://gist.github.com/BIGBALLON/8a71d225eff18d88e469e6ea9b39cef4>`__). Linking it to ``data/imagenet`` will be more convenient. Download the checkpoint file from `here <https://1drv.ms/u/s!Am_mmG2-KsrnajesvSdfsq_cN48?e=aHVppN>`__ (maintained by `Megvii <https://github.com/megvii-model>`__) if you don't want to retrain the supernet. Put ``checkpoint-150000.pth.tar`` under ``data`` directory. After preparation, it's expected to have the following code structure:
+
+.. code-block:: bash
+
+   spos
+   ├── architecture_final.json
+   ├── blocks.py
+   ├── data
+   │   ├── imagenet
+   │   │   ├── train
+   │   │   └── val
+   │   └── checkpoint-150000.pth.tar
+   ├── network.py
+   ├── readme.md
+   ├── supernet.py
+   ├── evaluation.py
+   ├── search.py
+   └── utils.py
+
+Then follow the 3 steps:
+
+1. **Train Supernet**:
+
+   .. code-block:: bash
+
+      python supernet.py
+
+   This will export the checkpoint to ``checkpoints`` directory, for the next step.
+
+   .. note:: The data loading used in the official repo is `slightly different from usual <https://github.com/megvii-model/SinglePathOneShot/issues/5>`__, as they use BGR tensors and intentionally keep the values between 0 and 255 to align with their own DL framework. The option ``--spos-preprocessing`` simulates the original behavior and enables you to use the pretrained checkpoints.
+
+2. **Evolution Search**: Single Path One-Shot leverages an evolutionary algorithm to search for the best architecture. In the paper, the search module, which is responsible for testing the sampled architecture, recalculates all the batch norm statistics on a subset of training images, and evaluates the architecture on the full validation set.
+   In this example, it inherits the ``state_dict`` of the supernet from ``./data/checkpoint-150000.pth.tar``, and searches for the best architecture with the regularized evolution strategy. Search in the supernet with the following command:
+
+   .. code-block:: bash
+
+      python search.py
+
+   NNI supports a latency filter that excludes models that do not satisfy the latency requirement from the search phase. Latency is predicted by Microsoft nn-Meter (https://github.com/microsoft/nn-Meter). To apply the latency filter, run ``search.py`` with the additional argument ``--latency-filter``. Here is an example:
+
+   .. code-block:: bash
+
+      python search.py --latency-filter cortexA76cpu_tflite21
+
+   Note that the latency filter is only supported for the base execution engine.
+
+   The final architecture exported from every epoch of evolution can be found in ``trials`` under the working directory of your tuner, which, by default, is ``$HOME/nni-experiments/your_experiment_id/trials``.
+
+3. **Train for Evaluation**:
+
+   .. code-block:: bash
+
+      python evaluation.py
+
+   By default, it will use ``architecture_final.json``. This architecture is provided by the official repo (converted into NNI format). You can use any architecture (e.g., the architecture found in step 2) with ``--fixed-arc`` option.
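
Conceptually, the latency filter mentioned in step 2 is a predicate applied to each sampled model before it is evaluated. The sketch below shows that idea only; it does not call nn-Meter, and ``make_latency_filter``, the toy latency table, and the threshold are all hypothetical.

```python
from typing import Callable, Dict


def make_latency_filter(predict_latency_ms: Callable[[str], float],
                        threshold_ms: float) -> Callable[[str], bool]:
    """Build a predicate that accepts a model only if its predicted latency is small enough."""
    def accept(model: str) -> bool:
        return predict_latency_ms(model) <= threshold_ms
    return accept


# Hypothetical predicted latencies in milliseconds; a real predictor would replace this table.
toy_latency: Dict[str, float] = {"small": 12.0, "medium": 48.0, "large": 130.0}
latency_filter = make_latency_filter(toy_latency.__getitem__, threshold_ms=100.0)

kept = [name for name in toy_latency if latency_filter(name)]
print(kept)  # models that would stay in the search phase
```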
+
+Known Limitations
+"""""""""""""""""
+
+* Block search only. Channel search is not supported yet.
+
+Current Reproduction Results
+""""""""""""""""""""""""""""
+
+Reproduction is still ongoing. Due to the gap between the official release and the original paper, we compare our current results with both the official repo (our run) and the paper.
+
+* The evolution phase is almost aligned with the official repo. Our evolution algorithm shows a converging trend and reaches ~65% accuracy at the end of the search. Nevertheless, this result is not on par with the paper. For details, please refer to `this issue <https://github.com/megvii-model/SinglePathOneShot/issues/6>`__.
+* The retrain phase is not aligned. Our retraining code, which uses the architecture released by the authors, reaches 72.14% accuracy, which is still below the 73.61% of the official release and the 74.3% reported in the original paper.
+
+.. _proxylessnas-strategy:
+
+ProxylessNAS
+^^^^^^^^^^^^
+
+The paper `ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware <https://arxiv.org/pdf/1812.00332.pdf>`__ removes the proxy: it directly learns the architectures for large-scale target tasks and target hardware platforms. The authors address the high memory consumption of differentiable NAS and reduce the computational cost to the same level as regular training while still allowing a large candidate set. Please refer to the paper for details.
+
+..  autoclass:: nni.retiarii.oneshot.pytorch.ProxylessTrainer
+    :noindex:
+
+Usage
+"""""
+
+To use the ProxylessNAS training/searching approach, users need to specify the search space in their model using the :doc:`NNI NAS interface <./construct_space>`, e.g., ``LayerChoice``, ``InputChoice``. After defining and instantiating the model, the remaining work can be left to ``ProxylessTrainer`` by instantiating the trainer and passing the model to it.
+
+.. code-block:: python
+
+   trainer = ProxylessTrainer(model,
+                              loss=LabelSmoothingLoss(),
+                              dataset=None,
+                              optimizer=optimizer,
+                              metrics=lambda output, target: accuracy(output, target, topk=(1, 5,)),
+                              num_epochs=120,
+                              log_frequency=10,
+                              grad_reg_loss_type=args.grad_reg_loss_type,
+                              grad_reg_loss_params=grad_reg_loss_params,
+                              applied_hardware=args.applied_hardware,
+                              dummy_input=(1, 3, 224, 224),
+                              ref_latency=args.reference_latency)
+   trainer.train()
+   trainer.export(args.arch_path)
+
+The complete example code can be found :githublink:`here <examples/nas/oneshot/proxylessnas>`.
+
+Implementation
+""""""""""""""
+
+The implementation on NNI is based on the `official implementation <https://github.com/mit-han-lab/ProxylessNAS>`__. The official implementation supports two training approaches: gradient descent and RL based. Our current implementation on NNI supports the gradient descent training approach. Complete support of ProxylessNAS is ongoing.
+
+The official implementation supports different target hardware, including 'mobile', 'cpu', 'gpu8', and 'flops'. In the NNI repo, hardware latency prediction is supported by `Microsoft nn-Meter <https://github.com/microsoft/nn-Meter>`__. nn-Meter is an accurate inference latency predictor for DNN models on diverse edge devices. nn-Meter currently supports four hardware platforms: ``cortexA76cpu_tflite21``, ``adreno640gpu_tflite21``, ``adreno630gpu_tflite21``, and ``myriadvpu_openvino2019r2``. Users can find more information about nn-Meter on its website. More hardware will be supported in the future. Users can find more details about applying ``nn-Meter`` `here <./HardwareAwareNAS.rst>`__ .
+
+The implementation details are described below. Like other one-shot NAS algorithms on NNI, ProxylessNAS is composed of two parts: *search space* and *training approach*. To flexibly define their own search space and use the built-in ProxylessNAS training approach, users can refer to the :githublink:`example code <examples/nas/oneshot/proxylessnas>`.
+
+.. image:: ../../img/proxylessnas.png
+
+The ProxylessNAS training approach is composed of ProxylessLayerChoice and ProxylessNasTrainer. ProxylessLayerChoice instantiates MixedOp for each mutable (i.e., LayerChoice), and manages the architecture weights in MixedOp. **For DataParallel**, architecture weights should be included in the user model. Specifically, in the ProxylessNAS implementation, we add MixedOp to the corresponding mutable (i.e., LayerChoice) as a member variable. The ProxylessLayerChoice class also exposes two member functions, ``resample`` and ``finalize_grad``, for the trainer to control the training of architecture weights.
+
+Reproduction Results
+""""""""""""""""""""
+
+To reproduce the results, we first ran the search. We found that, although it runs for many epochs, the chosen architecture converges within the first several epochs. This is probably caused by the hyper-parameters or the implementation; we are working on it.
--- a/docs/source/NAS/HardwareAwareNAS.rst
+++ b/docs/source/NAS/HardwareAwareNAS.rst
 Hardware-aware NAS
 ==================

-.. contents::
+.. This file should be rewritten as a tutorial

 End-to-end Multi-trial SPOS Demo
 --------------------------------
@@ -61,14 +61,14 @@ To run the one-shot ProxylessNAS demo, first install nn-Meter by running:

 Then run one-shot ProxylessNAS demo:

-```bash
-python ${NNI_ROOT}/examples/nas/oneshot/proxylessnas/main.py --applied_hardware <hardware> --reference_latency <reference latency (ms)>
-```
+.. code-block:: bash
+
+   python ${NNI_ROOT}/examples/nas/oneshot/proxylessnas/main.py --applied_hardware <hardware> --reference_latency <reference latency (ms)>

 How the demo works
 ^^^^^^^^^^^^^^^^^^

-In the implementation of ProxylessNAS ``trainer``, we provide a ``HardwareLatencyEstimator`` which currently builds a lookup table, that stores the measured latency of each candidate building block in the search space. The latency sum of all building blocks in a candidate model will be treated as the model inference latency. The latency prediction is obtained by ``nn-Meter``. ``HardwareLatencyEstimator`` predicts expected latency for the mixed operation based on the path weight of `ProxylessLayerChoice`. With leveraging ``nn-Meter`` in NNI, users can apply ProxylessNAS to search efficient DNN models on more types of edge devices. 
+In the implementation of ProxylessNAS ``trainer``, we provide a ``HardwareLatencyEstimator``, which currently builds a lookup table that stores the measured latency of each candidate building block in the search space. The sum of the latencies of all building blocks in a candidate model is treated as the model inference latency. The latency prediction is obtained by ``nn-Meter``. ``HardwareLatencyEstimator`` predicts the expected latency for the mixed operation based on the path weight of ``ProxylessLayerChoice``. By leveraging ``nn-Meter`` in NNI, users can apply ProxylessNAS to search for efficient DNN models on more types of edge devices. 
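
The lookup-table idea can be sketched in plain Python. This is not the ``HardwareLatencyEstimator`` code: the function names and latency values are hypothetical, and the sketch assumes that path weights are turned into probabilities with a softmax.

```python
import math
from typing import List, Tuple


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def expected_block_latency(path_weights: List[float], latencies_ms: List[float]) -> float:
    """Expected latency of one mixed operation: probability-weighted sum over candidates."""
    probs = softmax(path_weights)
    return sum(p * latency for p, latency in zip(probs, latencies_ms))


def expected_model_latency(layers: List[Tuple[List[float], List[float]]]) -> float:
    """Model latency is treated as the sum of the latencies of all building blocks."""
    return sum(expected_block_latency(weights, table) for weights, table in layers)


# Each layer: (path weights, looked-up latency in ms of each candidate block); values are made up.
layers = [([0.0, 0.0], [2.0, 4.0]), ([0.0, 0.0, 0.0], [1.0, 2.0, 3.0])]
total = expected_model_latency(layers)  # equal weights: 3.0 + 2.0
print(total)
```

Because the expected latency is a smooth function of the path weights, it can be added to the training loss as a regularization term.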

 Besides ``applied_hardware`` and ``reference_latency``, there are some other parameters related to hardware-aware ProxylessNAS training in this :githublink:`example <examples/nas/oneshot/proxylessnas/main.py>`:


--- a/docs/source/nas/index.rst
+++ b/docs/source/nas/index.rst
+Retiarii for Neural Architecture Search
+=======================================
+
+.. toctree::
+   :hidden:
+   :titlesonly:
+
+   Quick Start <../tutorials/cp_hello_nas_quickstart>
+   construct_space
+   exploration_strategy
+   evaluator
+   advanced_usage
+   reference
+
+.. attention:: NNI's latest NAS support is entirely based on the Retiarii framework. Users who are still on an `early version using NNI NAS v1.0 <https://nni.readthedocs.io/en/v2.2/nas.html>`__ should migrate their work to Retiarii as soon as possible.
+
+.. note:: PyTorch is the **only supported framework on Retiarii**. Inquiries about NAS support on Tensorflow are in `this discussion <https://github.com/microsoft/nni/discussions/4605>`__. If you intend to run NAS with DL frameworks other than PyTorch and Tensorflow, please `open new issues <https://github.com/microsoft/nni/issues>`__ to let us know.
+
+.. Using rubric to prevent the section heading from being included in the toc
+
+.. rubric:: Motivation
+
+Automatic neural architecture search is playing an increasingly important role in finding better models. Recent research has proven the feasibility of automatic NAS and has led to models that beat many manually designed and tuned models. Representative works include `NASNet <https://arxiv.org/abs/1707.07012>`__, `ENAS <https://arxiv.org/abs/1802.03268>`__, `DARTS <https://arxiv.org/abs/1806.09055>`__, `Network Morphism <https://arxiv.org/abs/1806.10282>`__, and `Evolution <https://arxiv.org/abs/1703.01041>`__. In addition, new innovations continue to emerge.
+
+However, it is pretty hard to use existing NAS work to help develop common DNN models. Therefore, we designed `Retiarii <https://www.usenix.org/system/files/osdi20-zhang_quanlu.pdf>`__, a novel NAS/HPO framework, and implemented it in NNI. It helps users easily construct a model space (or search space, tuning space), and utilize existing NAS algorithms. The framework also facilitates NAS innovation and is used to design new NAS algorithms.
+
+In summary, we highlight the following features for Retiarii:
+
+* Simple APIs are provided for defining model search space within a deep learning model.
+* SOTA NAS algorithms are built-in to be used for exploring model search space.
+* System-level optimizations are implemented for speeding up the exploration.
+
+.. rubric:: Overview
+
+At a high level, solving a particular task with neural architecture search typically requires search space design, search strategy selection, and performance evaluation. The three components work together in the following loop (the figure is from the famous `NAS survey <https://arxiv.org/abs/1808.05377>`__):
+
+.. image:: ../../img/nas_abstract_illustration.png
+
+To be consistent, we use the following terminology throughout our documentation:
+
+* *Model search space*: the set of models from which the best model is explored/searched. Sometimes we use *search space* or *model space* for short.
+* *Exploration strategy*: the algorithm that is used to explore a model search space. Sometimes we also call it *search strategy*.
+* *Model evaluator*: it is used to train a model and evaluate the model's performance.
+
+Concretely, an exploration strategy selects an architecture from a predefined search space. The architecture is passed to a performance evaluation to get a score, which represents how well this architecture performs on a particular task. This process is repeated until the search process is able to find the best architecture.
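
The loop described above can be sketched in plain Python. This is a minimal illustration and not the Retiarii API: ``search`` is a hypothetical function, random sampling stands in for the exploration strategy, a toy scoring function stands in for the evaluator, and a fixed budget is assumed as the stopping criterion.

```python
import random
from typing import Callable, Dict, Tuple


def search(space: Dict[str, list],
           evaluate: Callable[[dict], float],
           budget: int,
           seed: int = 0) -> Tuple[dict, float]:
    """Repeatedly sample an architecture, evaluate it, and keep the best one seen so far."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(budget):
        # exploration strategy: pick one value for every dimension of the search space
        arch = {name: rng.choice(options) for name, options in space.items()}
        # model evaluator: return a score for the sampled architecture
        score = evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


space = {"depth": [2, 4, 8], "width": [16, 32]}
best_arch, best_score = search(space, lambda arch: float(arch["depth"] * arch["width"]), budget=50)
print(best_arch, best_score)
```

Swapping the sampling line for a smarter strategy, or the scoring function for real training and validation, changes the components without changing the shape of the loop.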
+
+During this process, several core engineering challenges arise (which are also pointed out by the famous `NAS survey <https://arxiv.org/abs/1808.05377>`__). We list them below, together with the solutions NNI provides to address them:
+
+* **Search space design:** The search space defines which architectures can be represented in principle. Incorporating prior knowledge about typical properties of architectures well-suited for a task can reduce the size of the search space and simplify the search. However, this also introduces a human bias, which may prevent finding novel architectural building blocks that go beyond current human knowledge. In NNI, we provide a wide range of APIs to build the search space. There are :doc:`high-level APIs <construct_space>`, which enable incorporating human knowledge about what makes a good architecture or search space. There are also :doc:`low-level APIs <mutator>`, which are a list of primitives to construct a network from operator to operator.
+* **Exploration strategy:** The exploration strategy details how to explore the search space (which is often exponentially large). It encompasses the classical exploration-exploitation trade-off since, on the one hand, it is desirable to find well-performing architectures quickly, while on the other hand, premature convergence to a region of suboptimal architectures should be avoided. In NNI, we also provide :doc:`a list of strategies <exploration_strategy>`. Some of them are powerful but time-consuming, while others might be suboptimal but really efficient. Users can always find one that matches their needs.
+* **Performance estimation / evaluator:** The objective of NAS is typically to find architectures that achieve high predictive performance on unseen data. Performance estimation refers to the process of estimating this performance. In NNI, this process is implemented with an :doc:`evaluator <evaluator>`, which is responsible for estimating a model's performance. The choices of evaluators range from the simplest option, e.g., performing a standard training and validation of the architecture on data, to complex configurations and implementations.
+
+.. rubric:: Writing Model Space
+
+The following APIs are provided to ease the engineering effort of writing a new search space.
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Name
+     - Category
+     - Brief Description
+   * - :ref:`nas-layer-choice`
+     - :ref:`Mutation Primitives <mutation-primitives>`
+     - Select from some PyTorch modules
+   * - :ref:`nas-input-choice`
+     - :ref:`Mutation Primitives <mutation-primitives>`
+     - Select from some inputs (tensors)
+   * - :ref:`nas-value-choice`
+     - :ref:`Mutation Primitives <mutation-primitives>`
+     - Select from some candidate values
+   * - :ref:`nas-repeat`
+     - :ref:`Mutation Primitives <mutation-primitives>`
+     - Repeat a block by a variable number of times
+   * - :ref:`nas-cell`
+     - :ref:`Mutation Primitives <mutation-primitives>`
+     - Cell structure popularly used in literature
+   * - :ref:`nas-cell-101`
+     - :ref:`Mutation Primitives <mutation-primitives>`
+     - Cell structure (variant) proposed by NAS-Bench-101
+   * - :ref:`nas-cell-201`
+     - :ref:`Mutation Primitives <mutation-primitives>`
+     - Cell structure (variant) proposed by NAS-Bench-201
+   * - :ref:`nas-autoactivation`
+     - :ref:`Hyper-modules <hyper-modules>`
+     - Searching for activation functions
+   * - :doc:`Mutator <mutator>`
+     - :doc:`mutator`
+     - Flexible mutations on graphs
+
+.. rubric:: Exploring the Search Space
+
+We provide the following (built-in) algorithms to explore the user-defined search space.
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Name
+     - Category
+     - Brief Description
+   * - :ref:`random-strategy`
+     - :ref:`Multi-trial <multi-trial-nas>`
+     - Randomly sample an architecture each time
+   * - :ref:`grid-search-strategy`
+     - :ref:`Multi-trial <multi-trial-nas>`
+     - Traverse the search space and try all possibilities
+   * - :ref:`regularized-evolution-strategy`
+     - :ref:`Multi-trial <multi-trial-nas>`
+     - Evolution algorithm for NAS. `Reference <https://arxiv.org/abs/1802.01548>`__
+   * - :ref:`tpe-strategy`
+     - :ref:`Multi-trial <multi-trial-nas>`
+     - Tree-structured Parzen Estimator (TPE). `Reference <https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf>`__
+   * - :ref:`policy-based-rl-strategy`
+     - :ref:`Multi-trial <multi-trial-nas>`
+     - Policy-based reinforcement learning, based on implementation of tianshou. `Reference <https://arxiv.org/abs/1611.01578>`__
+   * - :ref:`darts-strategy`
+     - :ref:`One-shot <one-shot-nas>`
+     - Continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent. `Reference <https://arxiv.org/abs/1806.09055>`__
+   * - :ref:`enas-strategy`
+     - :ref:`One-shot <one-shot-nas>`
+     - RL controller learns to generate the best network on a super-net. `Reference <https://arxiv.org/abs/1802.03268>`__
+   * - :ref:`fbnet-strategy`
+     - :ref:`One-shot <one-shot-nas>`
+     - Choose the best block by using Gumbel Softmax random sampling and differentiable training. `Reference <https://arxiv.org/abs/1812.03443>`__
+   * - :ref:`spos-strategy`
+     - :ref:`One-shot <one-shot-nas>`
+     - Train a super-net with uniform path sampling. `Reference <https://arxiv.org/abs/1904.00420>`__
+   * - :ref:`proxylessnas-strategy`
+     - :ref:`One-shot <one-shot-nas>`
+     - A low-memory-consuming optimized version of differentiable architecture search. `Reference <https://arxiv.org/abs/1812.00332>`__
+
+.. rubric:: Evaluators
+
+The evaluator APIs can be used to build the performance assessment component of your neural architecture search process.
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Name
+     - Type
+     - Brief Description
+   * - :ref:`functional-evaluator`
+     - General
+     - Evaluate with any Python function
+   * - :ref:`classification-evaluator`
+     - Built upon `PyTorch Lightning <https://www.pytorchlightning.ai/>`__
+     - For classification tasks
+   * - :ref:`regression-evaluator`
+     - Built upon `PyTorch Lightning <https://www.pytorchlightning.ai/>`__
+     - For regression tasks
--- a/docs/source/NAS/Mutators.rst
+++ b/docs/source/NAS/Mutators.rst
-Express Mutations with Mutators
-===============================
+Construct Space with Mutators
+=============================

-Besides the inline mutation APIs demonstrated `here <./MutationPrimitives.rst>`__, NNI provides a more general approach to express a model space, i.e., *Mutator*, to cover more complex model spaces. Those inline mutation APIs are also implemented with mutator in the underlying system, which can be seen as a special case of model mutation.
+Besides the inline mutation APIs demonstrated :ref:`above <mutation-primitives>`, NNI provides a more general approach to express a model space, i.e., *Mutator*, to cover more complex model spaces. The inline mutation APIs are themselves implemented with mutators in the underlying system, so they can be seen as a special case of model mutation.

 .. note:: Mutator and inline mutation APIs cannot be used together.

@@ -37,7 +37,7 @@ User-defined mutator should inherit ``Mutator`` class, and implement mutation lo

 The input of ``mutate`` is the graph IR (Intermediate Representation) of the base model (please refer to `here <./ApiReference.rst>`__ for the format and APIs of the IR). Users can mutate the graph using the graph's member functions (e.g., ``get_nodes_by_label``, ``update_operation``). The mutation operations can be combined with the API ``self.choice``, in order to express a set of possible mutations. In the above example, the node's operation can be changed to any operation from ``candidate_op_list``.

-Use placehoder to make mutation easier: ``nn.Placeholder``. If you want to mutate a subgraph or node of your model, you can define a placeholder in this model to represent the subgraph or node. Then, use mutator to mutate this placeholder to make it real modules.
+Use a placeholder, ``nn.Placeholder``, to make mutation easier. If you want to mutate a subgraph or node of your model, you can define a placeholder in this model to represent the subgraph or node. Then, use a mutator to replace this placeholder with real modules.

 .. code-block:: python

@@ -51,7 +51,7 @@ Use placehoder to make mutation easier: ``nn.Placeholder``. If you want to mutat

 ``label`` is used by the mutator to identify this placeholder. The other parameters are the information that is required by the mutator. They can be accessed from ``node.operation.parameters`` as a dict, which can include any information that users want to pass to the user-defined mutator. The complete example code can be found in :githublink:`Mnasnet base model <examples/nas/multi-trial/mnasnet/base_mnasnet.py>`.

-Starting an experiment is almost the same as using inline mutation APIs. The only difference is that the applied mutators should be passed to ``RetiariiExperiment``. Below is a simple example.
+Starting an experiment is almost the same as using inline mutation APIs. The only difference is that the applied mutators should be passed to :class:`nni.retiarii.experiment.pytorch.RetiariiExperiment`. Below is a simple example.

 .. code-block:: python

@@ -62,3 +62,51 @@ Starting an experiment is almost the same as using inline mutation APIs. The onl
  exp_config.max_trial_number = 10
  exp_config.training_service.use_active_gpu = False
  exp.run(exp_config, 8081)
+
+References
+----------
+
+Placeholder
+^^^^^^^^^^^
+
+..  autoclass:: nni.retiarii.nn.pytorch.Placeholder
+    :members:
+    :noindex:
+
+Mutator
+^^^^^^^
+
+..  autoclass:: nni.retiarii.Mutator
+    :members:
+    :noindex:
+
+..  autoclass:: nni.retiarii.Sampler
+    :members:
+    :noindex:
+
+..  autoclass:: nni.retiarii.InvalidMutation
+    :members:
+    :noindex:
+
+Graph
+^^^^^
+
+..  autoclass:: nni.retiarii.Model
+    :members:
+    :noindex:
+
+..  autoclass:: nni.retiarii.Graph
+    :members:
+    :noindex:
+
+..  autoclass:: nni.retiarii.Node
+    :members:
+    :noindex:
+
+..  autoclass:: nni.retiarii.Edge
+    :members:
+    :noindex:
+
+..  autoclass:: nni.retiarii.Operation
+    :members:
+    :noindex:
--- a/docs/source/nas/reference.rst
+++ b/docs/source/nas/reference.rst
+Retiarii API Reference
+======================
+
+nni.retiarii
+------------
+
+..  automodule:: nni.retiarii
+    :imported-members:
+    :members:
+
+nni.retiarii.codegen
+--------------------
+
+..  automodule:: nni.retiarii.codegen
+    :imported-members:
+    :members:
+
+nni.retiarii.converter
+----------------------
+
+..  automodule:: nni.retiarii.converter
+    :imported-members:
+    :members:
+
+nni.retiarii.evaluator
+----------------------
+
+..  automodule:: nni.retiarii.evaluator
+    :imported-members:
+    :members:
+
+..  automodule:: nni.retiarii.evaluator.pytorch
+    :imported-members:
+    :members:
+    :exclude-members: Trainer, DataLoader
+
+..  autoclass:: nni.retiarii.evaluator.pytorch.Trainer
+
+..  autoclass:: nni.retiarii.evaluator.pytorch.DataLoader
+
+nni.retiarii.execution
+----------------------
+
+..  automodule:: nni.retiarii.execution
+    :imported-members:
+    :members:
+    :undoc-members:
+
+nni.retiarii.experiment.pytorch
+-------------------------------
+
+..  automodule:: nni.retiarii.experiment.pytorch
+    :members:
+
+nni.retiarii.nn.pytorch
+-----------------------
+
+Please refer to:
+
+* :doc:`construct_space`.
+* :doc:`mutator`.
+* `torch.nn reference <https://pytorch.org/docs/stable/nn.html>`_.
+
+nni.retiarii.oneshot
+--------------------
+
+..  automodule:: nni.retiarii.oneshot
+    :imported-members:
+    :members:
+
+
+nni.retiarii.operation_def
+--------------------------
+
+..  automodule:: nni.retiarii.operation_def
+    :imported-members:
+    :members:
+
+nni.retiarii.strategy
+---------------------
+
+..  automodule:: nni.retiarii.strategy
+    :imported-members:
+    :members:
+
+nni.retiarii.utils
+------------------
+
+..  automodule:: nni.retiarii.utils
+    :members:
+
+.. footbibliography::
--- a/docs/source/NAS/Serialization.rst
+++ b/docs/source/NAS/Serialization.rst
--- a/docs/source/nas_zh.rst
+++ b/docs/source/nas_zh.rst
-.. 0b36fb7844fd9cc88c4e74ad2c6b9ece
-
-##########################
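-Neural Architecture Search
-##########################
-
-Automatic neural architecture search (NAS) is playing an increasingly important role in finding better models.
-Recent research has demonstrated the feasibility of automatic NAS and has found models that outperform manually tuned ones.
-Representative works include NASNet, ENAS, DARTS, Network Morphism, and Evolution. In addition, new innovations keep emerging.
-
-However, implementing a NAS algorithm takes considerable effort, and it is hard to reuse the code of existing algorithms in new ones.
-To facilitate NAS innovation (e.g., designing and implementing new NAS models, or comparing different NAS models),
-an easy-to-use and flexible programming interface is crucial.
-
-Therefore, NNI designed `Retiarii <https://www.usenix.org/system/files/osdi20-zhang_quanlu.pdf>`__, a deep learning framework that supports exploratory training on a neural network model space rather than on a single neural network model.
-Exploratory training with Retiarii allows users to express various search spaces for *neural architecture search* and *hyperparameter tuning* in a highly flexible way.
-
-Some frequently used terminology in this document:
-
-* *Model search space*: a set of models from which the best model is explored/searched. Sometimes we call it *search space* or *model space* for short.
-* *Exploration strategy*: the algorithm used to explore a model search space.
-* *Model evaluator*: used to train a model and evaluate the model's performance.
-
-Follow the instructions below to start your journey with Retiarii.
-
-..  toctree::
-    :maxdepth: 2
-
-    Overview <NAS/Overview>
-    Quick Start <NAS/QuickStart>
-    Construct Model Space <NAS/construct_space>
-    Multi-trial NAS <NAS/multi_trial_nas>
-    One-Shot NAS <NAS/one_shot_nas>
-    Hardware-aware NAS <NAS/HardwareAwareNAS>
-    NAS Benchmarks <NAS/Benchmarks>
-    NAS API Reference <NAS/ApiReference>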
--- a/docs/source/notes/build_from_source.rst
+++ b/docs/source/notes/build_from_source.rst
+Build from Source
+=================
+
+This article describes how to build and install NNI from `source code`_.
+
+We recommend using the latest setuptools:
+
+.. code-block::
+
+    python -m pip install --upgrade setuptools pip wheel
+
+.. _source code: https://github.com/microsoft/nni
+
+Development Build
+-----------------
+
+If you want to build NNI for your own use, we recommend using `development mode`_.
+
+.. code-block::
+
+    python setup.py develop
+
+This will install NNI as a symlink, and the version number will be ``999.dev0``.
+
+.. _development mode: https://setuptools.pypa.io/en/latest/userguide/development_mode.html
+
+Release Build
+-------------
+
+To install in release mode, you must first build a wheel.
+NNI does not support setuptools' "install" command.
+
+A release package requires jupyterlab to build the extension:
+
+.. code-block::
+
+    python -m pip install jupyterlab
+
+You also need to set the ``NNI_RELEASE`` environment variable and compile the TypeScript modules before running ``bdist_wheel``.
+
+In bash:
+
+.. code-block::
+
+    export NNI_RELEASE=2.7
+    python setup.py build_ts
+    python setup.py bdist_wheel
+
+In PowerShell:
+
+.. code-block::
+
+    $env:NNI_RELEASE=2.7
+    python setup.py build_ts
+    python setup.py bdist_wheel
+
+If successful, you will find the wheel in the ``dist`` directory.
+
+.. note::
+
+    NNI's build process is somewhat complicated.
+    This is due to setuptools and TypeScript not working well together.
+
+    Setuptools requires ``package_data``, the full list of package files, to be provided before running any command.
+    However, it is nearly impossible to predict which files will be generated before invoking the TypeScript compiler.
+
+    If you have any solution for this problem, please open an issue to let us know.
+
+Build Docker Image
+------------------
+
+You can build a Docker image with the Dockerfile:
+
+.. code-block::
+
+    export NNI_RELEASE=2.7
+    python setup.py build_ts
+    python setup.py bdist_wheel -p manylinux1_x86_64
+    docker build --build-arg NNI_RELEASE=${NNI_RELEASE} -t my/nni .
+
+To build an image for other platforms, please edit the Dockerfile yourself.
+
+Other Commands and Options
+--------------------------
+
+Clean
+^^^^^
+
+If the build fails, please clean up and try again:
+
+.. code::
+
+    python setup.py clean
+
+Skip compiling TypeScript modules
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This is useful when you have uninstalled NNI from development mode and want to install it again.
+
+It will not work if you have never built the TypeScript modules before.
+
+.. code::
+
+    python setup.py develop --skip-ts
--- a/docs/source/quickstart.rst
+++ b/docs/source/quickstart.rst
+Quickstart
+==========
+
+.. TOC
+
+.. toctree::
+   :maxdepth: 2
+   :hidden:
+
+   tutorials/hpo_quickstart_pytorch/cp_global_quickstart_hpo
+   tutorials/cp_global_quickstart_nas
+   tutorials/cp_global_quickstart_compression
+
+.. ----------------------
+
+.. cardlinkitem::
+   :header: HPO Quickstart with PyTorch
+   :description: Use HPO to tune a PyTorch FashionMNIST model
+   :link: tutorials/hpo_quickstart_pytorch/cp_global_quickstart_hpo.html
+   :image: ../img/thumbnails/overview-33.png
+
+.. cardlinkitem::
+   :header: NAS Quickstart
+   :description: Beginners' NAS tutorial on how to search for neural architectures for the MNIST dataset.
+   :link: tutorials/cp_global_quickstart_nas.html
+   :image: ../img/thumbnails/overview-30.png
+   :background: cyan
+
+.. cardlinkitem::
+   :header: Get Started with Model Pruning on MNIST
+   :description: Familiarize yourself with pruning to compress your model.
+   :link: tutorials/cp_global_quickstart_compression.html
+   :image: ../img/thumbnails/overview-29.png
+   :background: teal
--- a/docs/source/reference.rst
+++ b/docs/source/reference.rst
+:orphan:
+
+.. to be removed
+
 References
 ==================

@@ -6,11 +10,5 @@ References

    nnictl Commands <reference/nnictl>
    Experiment Configuration <reference/experiment_config>
-    Experiment Configuration (legacy) <Tutorial/ExperimentConfig>
-    Search Space <Tutorial/SearchSpaceSpec>
-    NNI Annotation <Tutorial/AnnotationSpec>
-    SDK API References <sdk_reference>
+    API References <reference/python_api_ref>
    Supported Framework Library <SupportedFramework_Library>
-    Launch from Python <Tutorial/HowToLaunchFromPython>
-    Shared Storage <Tutorial/HowToUseSharedStorage>
-    Tensorboard <Tutorial/Tensorboard>
--- a/docs/source/reference/compression.rst
+++ b/docs/source/reference/compression.rst
+Compression API Reference
+=========================
+
+Pruner
+------
+
+Please refer to :doc:`../compression/pruner`.
+
+Quantizer
+---------
+
+Please refer to :doc:`../compression/quantizer`.
+
+Pruning Speedup
+---------------
+
+.. autoclass:: nni.compression.pytorch.speedup.ModelSpeedup
+    :members:
+
+Quantization Speedup
+--------------------
+
+.. autoclass:: nni.compression.pytorch.quantization_speedup.ModelSpeedupTensorRT
+    :members:
+
+Compression Utilities
+---------------------
+
+.. autoclass:: nni.compression.pytorch.utils.sensitivity_analysis.SensitivityAnalysis
+    :members:
+
+.. autoclass:: nni.compression.pytorch.utils.shape_dependency.ChannelDependency
+    :members:
+
+.. autoclass:: nni.compression.pytorch.utils.shape_dependency.GroupDependency
+    :members:
+
+.. autoclass:: nni.compression.pytorch.utils.mask_conflict.ChannelMaskConflict
+    :members:
+
+.. autoclass:: nni.compression.pytorch.utils.mask_conflict.GroupMaskConflict
+    :members:
+
+.. autofunction:: nni.compression.pytorch.utils.counter.count_flops_params
+
+.. autofunction:: nni.algorithms.compression.v2.pytorch.utils.pruning.compute_sparsity
+
+Framework Related
+-----------------
+
+.. autoclass:: nni.algorithms.compression.v2.pytorch.base.Pruner
+    :members:
+
+.. autoclass:: nni.algorithms.compression.v2.pytorch.base.PrunerModuleWrapper
+
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.basic_pruner.BasicPruner
+    :members:
+
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.tools.DataCollector
+    :members:
+
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.tools.MetricsCalculator
+    :members:
+
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.tools.SparsityAllocator
+    :members:
+
+.. autoclass:: nni.algorithms.compression.v2.pytorch.base.BasePruningScheduler
+    :members:
+
+.. autoclass:: nni.algorithms.compression.v2.pytorch.pruning.tools.TaskGenerator
+    :members:
+
+.. autoclass:: nni.compression.pytorch.compressor.Quantizer
+    :members:
+
+.. autoclass:: nni.compression.pytorch.compressor.QuantizerModuleWrapper
+    :members:
+
+.. autoclass:: nni.compression.pytorch.compressor.QuantGrad
+    :members:
--- a/docs/source/reference/experiment.rst
+++ b/docs/source/reference/experiment.rst
+Experiment API Reference
+========================
+
+.. autoclass:: nni.experiment.Experiment
+    :members:
--- a/docs/source/reference/experiment_config.rst
+++ b/docs/source/reference/experiment_config.rst
@@ -289,10 +289,12 @@ One of the following:

 For `Kubeflow <../TrainingService/KubeflowMode.rst>`_, `FrameworkController <../TrainingService/FrameworkControllerMode.rst>`_, and `AdaptDL <../TrainingService/AdaptDLMode.rst>`_ training platforms, it is suggested to use `v1 config schema <../Tutorial/ExperimentConfig.rst>`_ for now.

+.. _reference-local-config-label:
+
 LocalConfig
 -----------

-Detailed usage can be found `here <../TrainingService/LocalMode.rst>`__.
+An introduction to the corresponding local training service can be found in :doc:`../experiment/local`.

 .. list-table::
    :widths: 10 10 80
@@ -330,10 +332,12 @@ Detailed usage can be found `here <../TrainingService/LocalMode.rst>`__.
        If ``trialGpuNumber`` is less than the length of this value, only a subset will be visible to each trial.
        This will be used as ``CUDA_VISIBLE_DEVICES`` environment variable.

+.. _reference-remote-config-label:
+
 RemoteConfig
 ------------

-Detailed usage can be found `here <../TrainingService/RemoteMachineMode.rst>`__.
+Detailed usage can be found in :doc:`../experiment/remote`.

 .. list-table::
    :widths: 10 10 80

--- a/docs/source/reference/hpo.rst
+++ b/docs/source/reference/hpo.rst
+HPO API Reference
+=================
+
+Trial APIs
+----------
+
+.. autofunction:: nni.get_experiment_id
+.. autofunction:: nni.get_next_parameter
+.. autofunction:: nni.get_sequence_id
+.. autofunction:: nni.get_trial_id
+.. autofunction:: nni.report_final_result
+.. autofunction:: nni.report_intermediate_result
+
+Tuners
+------
+
+.. autoclass:: nni.algorithms.hpo.batch_tuner.BatchTuner
+    :members:
+.. autoclass:: nni.algorithms.hpo.bohb_advisor.BOHB
+    :members:
+.. autoclass:: nni.algorithms.hpo.dngo_tuner.DNGOTuner
+    :members:
+.. autoclass:: nni.algorithms.hpo.evolution_tuner.EvolutionTuner
+    :members:
+.. autoclass:: nni.algorithms.hpo.gp_tuner.GPTuner
+    :members:
+.. autoclass:: nni.algorithms.hpo.gridsearch_tuner.GridSearchTuner
+    :members:
+.. autoclass:: nni.algorithms.hpo.hyperband_advisor.Hyperband
+    :members:
+.. autoclass:: nni.algorithms.hpo.hyperopt_tuner.HyperoptTuner
+    :members:
+.. autoclass:: nni.algorithms.hpo.metis_tuner.MetisTuner
+    :members:
+.. autoclass:: nni.algorithms.hpo.pbt_tuner.PBTTuner
+    :members:
+.. autoclass:: nni.algorithms.hpo.ppo_tuner.PPOTuner
+    :members:
+.. autoclass:: nni.algorithms.hpo.random_tuner.RandomTuner
+    :members:
+.. autoclass:: nni.algorithms.hpo.smac_tuner.SMACTuner
+    :members:
+.. autoclass:: nni.algorithms.hpo.tpe_tuner.TpeTuner
+    :members:
+.. autoclass:: nni.algorithms.hpo.tpe_tuner.TpeArguments
+
+Assessors
+---------
+
+.. autoclass:: nni.algorithms.hpo.curvefitting_assessor.CurvefittingAssessor
+    :members:
+.. autoclass:: nni.algorithms.hpo.medianstop_assessor.MedianstopAssessor
+    :members:
+
+Customization
+-------------
+
+.. autoclass:: nni.assessor.AssessResult
+    :members:
+.. autoclass:: nni.assessor.Assessor
+    :members:
+.. autoclass:: nni.tuner.Tuner
+    :members:
--- a/docs/source/reference/python_api.rst
+++ b/docs/source/reference/python_api.rst
+:orphan:
+
+Python API Reference
+====================
+
+.. autosummary::
+   :toctree: _modules
+   :recursive:
+
+   nni
--- a/docs/source/reference/python_api/feature_engineering.rst
+++ b/docs/source/reference/python_api/feature_engineering.rst
+Feature Engineering
+===================
+
+nni.algorithms.feature_engineering
+----------------------------------
--- a/docs/source/reference/python_api/nas.rst
+++ b/docs/source/reference/python_api/nas.rst
+Neural Architecture Search
+==========================
+
+nni.retiarii
+------------
--- a/docs/source/reference/python_api/others.rst
+++ b/docs/source/reference/python_api/others.rst
+Others
+======
+
+nni
+---
+
+nni.common
+----------
+
+nni.utils
+---------
--- a/docs/source/reference/python_api_ref.rst
+++ b/docs/source/reference/python_api_ref.rst
+API Reference
+=============
+
+..  toctree::
+    :maxdepth: 1
+
+    Hyperparameter Optimization <hpo>
+    Neural Architecture Search <./python_api/nas>
+    Model Compression <compression>
+    Feature Engineering <./python_api/feature_engineering>
+    Experiment <experiment>
+    Others <./python_api/others>