add dtk24.04 code

Commit 9f73153f authored Jul 01, 2024 by zhanggzh
20 changed files
--- a/docs/source/compression/overview.rst
+++ b/docs/source/compression/overview.rst
+Overview of NNI Model Compression
+=================================
+
+Deep neural networks (DNNs) have achieved great success in many tasks, such as computer vision, natural language processing, and speech processing.
+However, typical neural networks are both computationally expensive and energy-intensive,
+which makes them difficult to deploy on devices with limited computational resources or strict latency requirements.
+Therefore, a natural approach is to perform model compression to reduce model size and accelerate model training/inference without losing performance significantly.
+Model compression techniques can be divided into two categories: pruning and quantization.
+Pruning methods explore the redundancy in the model weights and try to remove/prune the redundant and uncritical weights.
+Quantization refers to compressing models by reducing the number of bits required to represent weights or activations.
+We elaborate on these two methods in the following chapters. The figure below visualizes the difference between them.
+
+.. image:: ../../img/prune_quant.jpg
+   :target: ../../img/prune_quant.jpg
+   :scale: 40%
+   :align: center
+   :alt:
+
+NNI provides an easy-to-use toolkit to help users design and use model pruning and quantization algorithms.
+Users only need to add a few lines to their code to compress their models.
+NNI includes several popular model compression algorithms as built-ins.
+Users can also easily customize new compression algorithms using NNI's interface.
+
+NNI model compression supports several core features:
+
+* Support for many popular pruning and quantization algorithms.
+* Automation of the model pruning and quantization process with state-of-the-art strategies and NNI's auto-tuning power.
+* Speedup of a compressed model, giving it lower inference latency and a smaller size.
+* Friendly and easy-to-use compression utilities that let users dive into the compression process and results.
+* A concise interface for users to customize their own compression algorithms.
+
+
+Compression Pipeline
+--------------------
+
+.. image:: ../../img/compression_pipeline.png
+   :target: ../../img/compression_pipeline.png
+   :alt:
+   :align: center
+   :scale: 30%
+
+The overall compression pipeline in NNI is shown above. To compress a pretrained model, pruning and quantization can be used alone or in combination.
+If users want to apply both, applying them sequentially is recommended as common practice.
+
+.. note::
+  NNI pruners and quantizers are not meant to physically compact the model; they only simulate the compression effect. The NNI speedup tool, in contrast, truly compresses the model by changing the network architecture and therefore reduces latency.
+  To obtain a truly compact model, users should conduct :doc:`pruning speedup <../tutorials/pruning_speedup>` or :doc:`quantization speedup <../tutorials/quantization_speedup>`.
+  The interface and APIs are unified for both PyTorch and TensorFlow. Currently only the PyTorch version is supported; the TensorFlow version will be supported in the future.
+
+
+Model Speedup
+-------------
+
+The final goal of model compression is to reduce inference latency and model size.
+However, existing model compression algorithms mainly use simulation to check the performance (e.g., accuracy) of the compressed model.
+For example, pruning algorithms use masks, and quantization algorithms still store quantized values in float32.
+Given the output masks and quantization bits produced by those algorithms, NNI can actually speed up the model.
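
The difference between simulated pruning and real compaction can be sketched in plain Python. This is a minimal illustration under stated assumptions, not NNI's implementation: the helper names (``channel_mask``, ``apply_mask``, ``compact``) and the toy weights are invented for this example.

```python
# Toy layer: 4 output channels, each holding 3 weights (plain lists, no framework).
weights = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [0.5, 0.4, -0.6],
    [-0.02, 0.01, 0.03],
]


def channel_mask(weights, sparsity):
    """Return a 0/1 flag per output channel; channels with the smallest L1 norm get 0."""
    norms = [sum(abs(w) for w in row) for row in weights]
    num_pruned = int(len(weights) * sparsity)
    pruned = set(sorted(range(len(weights)), key=lambda i: norms[i])[:num_pruned])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def apply_mask(weights, mask):
    """Simulated pruning: the layer keeps its shape, masked channels become zeros."""
    return [[w * m for w in row] for row, m in zip(weights, mask)]


def compact(weights, mask):
    """Speedup-style compaction: masked channels are physically removed."""
    return [row for row, m in zip(weights, mask) if m == 1]


mask = channel_mask(weights, sparsity=0.5)
masked = apply_mask(weights, mask)   # still 4 channels, two of them all zeros
compacted = compact(weights, mask)   # only 2 channels remain
```

The masked layer costs as much to run as the original one; only the compacted layer is smaller, which is the step the speedup tool is responsible for.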
+
+The following figure shows how NNI prunes and speeds up your models. 
+
+.. image:: ../../img/nni_prune_process.png
+   :target: ../../img/nni_prune_process.png
+   :scale: 30%
+   :align: center
+   :alt:
+
+The detailed tutorial on speeding up a model with masks can be found :doc:`here <../tutorials/pruning_speedup>`.
+The detailed tutorial on speeding up a model with a calibration config can be found :doc:`here <../tutorials/quantization_speedup>`.
+
+.. attention::
+
+  NNI's model pruning framework has been upgraded to a more powerful version (named pruning v2 before nni v2.6).
+  The old version (`named pruning before nni v2.6 <https://nni.readthedocs.io/en/v2.6/Compression/pruning.html>`_) will no longer be maintained. If for some reason you have to use the old pruning framework,
+  v2.6 is the last nni version that supports it.
--- a/docs/source/compression/overview_zh.rst
+++ b/docs/source/compression/overview_zh.rst
+.. b6bdf52910e2e2c72085d03482d45340
+
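+Model Compression
+=================
+
+Deep neural networks (DNNs) have achieved great success in fields such as computer vision, natural language processing, and speech processing.
+However, typical neural networks are computation- and energy-intensive, which makes them difficult to deploy on devices
+with scarce computational resources or strict latency requirements. Therefore, a natural idea is to compress the model
+to reduce model size and accelerate model training/inference without significantly degrading model performance.
+Model compression techniques can be divided into two categories: pruning and quantization. Pruning methods explore the redundancy in the model weights
+and try to remove/prune the redundant and non-critical weights. Quantization refers to compressing models by reducing the number of bits required to represent weights or activations.
+In the following chapters, we elaborate on these two methods: pruning and quantization.
+In addition, the figure below illustrates the difference between the two methods.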
+
+.. image:: ../../img/prune_quant.jpg
+   :target: ../../img/prune_quant.jpg
+   :scale: 40%
+   :alt:
+
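+NNI provides an easy-to-use toolkit to help users design and use pruning and quantization algorithms.
+It uses a unified interface to support TensorFlow and PyTorch.
+Users only need to add a few lines of code to compress a model.
+NNI also has some mainstream model compression algorithms built in.
+Users can further use NNI's auto-tuning capability to find the best compressed model,
+which is described in detail in the automatic model compression section.
+On the other hand, users can customize new compression algorithms using NNI's interface.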
+
+
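+NNI has the following core features:
+
+* Many popular pruning and quantization algorithms are built in.
+* The model pruning and quantization process is automated with state-of-the-art strategies and NNI's auto-tuning capability.
+* Models are sped up to achieve lower inference latency.
+* Friendly and easy-to-use compression utilities let users dive into the compression process and results.
+* A concise interface lets users customize their own compression algorithms.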
+
+Compression Pipeline
+--------------------
+
+.. image:: ../../img/compression_pipeline.png
+   :target: ../../img/compression_pipeline.png
+   :alt:
+   :align: center
+   :scale: 30%
+
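+The overall model compression pipeline in NNI is shown in the figure above.
+To compress a pretrained model, pruning and quantization can be used alone or in combination.
+If users want to apply both, a sequential mode is recommended.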
+
+
+.. note::
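+  Note that NNI pruners and quantizers cannot change the network architecture; they only simulate the effect of compression.
+  It is NNI's speedup tool that truly compresses the model, changes the network architecture, and reduces inference latency.
+  To obtain a truly compressed model, users need to perform :doc:`pruning speedup <../tutorials/pruning_speedup>` or :doc:`quantization speedup <../tutorials/quantization_speedup>`.
+  The interfaces for PyTorch and TensorFlow are unified. Currently only the PyTorch version is supported; the TensorFlow version will be supported in the future.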
+
+
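+Model Speedup
+-------------
+
+The final goal of model compression is to reduce inference latency and model size.
+However, existing model compression algorithms mainly use simulation to check the performance of the compressed model.
+For example, pruning algorithms use masks, and quantization algorithms still store values in float32.
+Given the output masks and quantization bits produced by these algorithms, NNI's speedup tool can truly compress the model.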
+
+The figure below shows how NNI prunes and speeds up your model.
+
+.. image:: ../../img/nni_prune_process.png
+   :target: ../../img/nni_prune_process.png
+   :scale: 30%
+   :align: center
+   :alt:
+
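+Detailed documentation on speeding up a model with masks can be found :doc:`here <../tutorials/pruning_speedup>`.
+Detailed documentation on speeding up a model with a calibration config can be found :doc:`here <../tutorials/quantization_speedup>`.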
+
+
+.. attention::
+
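+  NNI's model pruning framework has been upgraded to a more advanced version (named pruning v2 before nni v2.6).
+  The old version (`named pruning before nni v2.6 <https://nni.readthedocs.io/en/v2.6/Compression/pruning.html>`_) is no longer maintained.
+  If for some reason you have to use it, v2.6 is the last version that supports the old pruning algorithms.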
--- a/docs/source/compression/pruner.rst
+++ b/docs/source/compression/pruner.rst
+Pruner in NNI
+=============
+
+NNI implements the main part of each pruning algorithm as a pruner. All pruners are implemented as close as possible to what is described in the corresponding paper (if there is one).
+The following table provides a brief introduction to the pruners implemented in NNI. Click a link in the table to view a more detailed introduction and use cases.
+
+There are two kinds of pruners in NNI; please refer to :ref:`basic pruner <basic-pruner>` and :ref:`scheduled pruner <scheduled-pruner>` for details.
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Name
+     - Brief Introduction of Algorithm
+   * - :ref:`level-pruner`
+     - Pruning the specified ratio of weight elements, based on the absolute value of each weight element
+   * - :ref:`l1-norm-pruner`
+     - Pruning output channels with the smallest L1 norm of weights (Pruning Filters for Efficient Convnets) `Reference Paper <https://arxiv.org/abs/1608.08710>`__
+   * - :ref:`l2-norm-pruner`
+     - Pruning output channels with the smallest L2 norm of weights
+   * - :ref:`fpgm-pruner`
+     - Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration `Reference Paper <https://arxiv.org/abs/1811.00250>`__
+   * - :ref:`slim-pruner`
+     - Pruning output channels by pruning scaling factors in BN layers (Learning Efficient Convolutional Networks through Network Slimming) `Reference Paper <https://arxiv.org/abs/1708.06519>`__
+   * - :ref:`activation-apoz-rank-pruner`
+     - Pruning output channels based on the metric APoZ (average percentage of zeros) which measures the percentage of zeros in activations of (convolutional) layers. `Reference Paper <https://arxiv.org/abs/1607.03250>`__
+   * - :ref:`activation-mean-rank-pruner`
+     - Pruning output channels based on the metric that calculates the smallest mean value of output activations
+   * - :ref:`taylor-fo-weight-pruner`
+     - Pruning filters based on the first-order Taylor expansion on weights (Importance Estimation for Neural Network Pruning) `Reference Paper <http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf>`__
+   * - :ref:`admm-pruner`
+     - Pruning based on ADMM optimization technique `Reference Paper <https://arxiv.org/abs/1804.03294>`__
+   * - :ref:`linear-pruner`
+     - The sparsity ratio increases linearly across pruning rounds; in each round, a basic pruner is used to prune the model.
+   * - :ref:`agp-pruner`
+     - Automated gradual pruning (To prune, or not to prune: exploring the efficacy of pruning for model compression) `Reference Paper <https://arxiv.org/abs/1710.01878>`__
+   * - :ref:`lottery-ticket-pruner`
+     - The pruning process used by "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". It prunes a model iteratively. `Reference Paper <https://arxiv.org/abs/1803.03635>`__
+   * - :ref:`simulated-annealing-pruner`
+     - Automatic pruning with a guided heuristic search method, Simulated Annealing algorithm `Reference Paper <https://arxiv.org/abs/1907.03141>`__
+   * - :ref:`auto-compress-pruner`
+     - Automatic pruning by iteratively calling SimulatedAnnealing Pruner and ADMM Pruner `Reference Paper <https://arxiv.org/abs/1907.03141>`__
+   * - :ref:`amc-pruner`
+     - AMC: AutoML for Model Compression and Acceleration on Mobile Devices `Reference Paper <https://arxiv.org/abs/1802.03494>`__
+   * - :ref:`movement-pruner`
+     - Movement Pruning: Adaptive Sparsity by Fine-Tuning `Reference Paper <https://arxiv.org/abs/2005.07683>`__
--- a/docs/source/compression/pruning.rst
+++ b/docs/source/compression/pruning.rst
+Overview of NNI Model Pruning
+=============================
+
+Pruning is a common technique to compress neural network models.
+The pruning methods explore the redundancy in the model weights (parameters) and try to remove/prune the redundant and uncritical weights.
+The redundant elements are pruned from the model: their values are zeroed, and we make sure they don't take part in the back-propagation process.
+
+The following concepts can help you understand pruning in NNI.
+
+Pruning Target
+--------------
+
+The pruning target is where the sparsity is applied.
+Most pruning methods prune the weights to reduce the model size and the inference latency.
+Other pruning methods also apply sparsity on activations (e.g., inputs, outputs, or feature maps) to reduce the inference latency.
+NNI currently supports pruning module weights, and will support other pruning targets in the future.
+
+.. _basic-pruner:
+
+Basic Pruner
+------------
+
+A basic pruner generates the masks for each pruning target (weights) for a given sparsity ratio.
+It usually takes a model and a config as input arguments, then generates masks for each pruning target.
+
+.. _scheduled-pruner:
+
+Scheduled Pruner
+----------------
+
+A scheduled pruner decides how to allocate the sparsity ratio to each pruning target.
+It also handles the model speedup (after each pruning iteration) and the finetuning logic.
+In terms of implementation, the scheduled pruner is a combination of a pruning scheduler, a basic pruner, and a task generator.
+
+The task generator only cares about the pruning effect that should be achieved in each round, and uses a config list to express how to prune.
+The basic pruner is reset with the model and config list given by the task generator, then generates the masks.
+
+For a clearer view of the structure, please refer to the figure below.
+
+.. image:: ../../img/pruning_process.png
+   :target: ../../img/pruning_process.png
+   :scale: 30%
+   :align: center
+   :alt:
+
+For more information about the scheduled pruning process, please refer to :doc:`Pruning Scheduler <pruning_scheduler>`.
+
+Granularity
+-----------
+
+Fine-grained pruning, or unstructured pruning, refers to pruning each individual weight separately.
+Coarse-grained pruning, or structured pruning, prunes a regular group of weights, such as a convolutional filter.
+
+Only :ref:`level-pruner` and :ref:`admm-pruner` support fine-grained pruning; all other pruners do some kind of structured pruning on weights.
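
The two granularities can be contrasted on a toy weight matrix. The sketch below is illustrative only: ``level_mask`` and ``filter_mask`` are hypothetical helpers written for this example (they are not NNI functions), and each row of ``weights`` stands in for one convolutional filter.

```python
def level_mask(weights, sparsity):
    """Fine-grained: zero the individual weights with the smallest absolute value."""
    flat = sorted(abs(w) for row in weights for w in row)
    num_pruned = int(len(flat) * sparsity)
    if num_pruned == 0:
        return [[1 for _ in row] for row in weights]
    # Weights whose magnitude is at or below the threshold are masked
    # (ties at the threshold would all be masked in this simple sketch).
    threshold = flat[num_pruned - 1]
    return [[0 if abs(w) <= threshold else 1 for w in row] for row in weights]


def filter_mask(weights, sparsity):
    """Coarse-grained: zero whole rows (filters) with the smallest L1 norm."""
    norms = [sum(abs(w) for w in row) for row in weights]
    num_pruned = int(len(weights) * sparsity)
    pruned = set(sorted(range(len(weights)), key=lambda i: norms[i])[:num_pruned])
    return [[0 if i in pruned else 1 for _ in row] for i, row in enumerate(weights)]


weights = [
    [0.5, -0.1, 0.3, 0.05],
    [-0.7, 0.2, -0.02, 0.9],
]
fine = level_mask(weights, sparsity=0.5)     # zeros scattered across both rows
coarse = filter_mask(weights, sparsity=0.5)  # one whole row zeroed
```

Both masks remove half of the weights, but only the coarse-grained mask leaves a regular structure (an entire row) that can later be dropped from the layer.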
+
+.. _dependency-aware-mode-for-output-channel-pruning:
+
+Dependency-aware Mode for Output Channel Pruning
+------------------------------------------------
+
+Currently, dependency-aware mode is supported in several pruners: :ref:`l1-norm-pruner`, :ref:`l2-norm-pruner`, :ref:`fpgm-pruner`,
+:ref:`activation-apoz-rank-pruner`, :ref:`activation-mean-rank-pruner`, :ref:`taylor-fo-weight-pruner`.
+
+In these pruning algorithms, the pruner prunes each layer separately. While pruning a layer,
+the algorithm quantifies the importance of each filter based on some specific metric (such as the L1 norm), and prunes the less important output channels.
+
+We use the pruning of convolutional layers as an example to explain dependency-aware mode.
+As :ref:`topology analysis utils <topology-analysis>` shows, if the output channels of two convolutional layers (conv1, conv2) are added together,
+then these two convolutional layers have a channel dependency on each other (for more details, please see :ref:`ChannelDependency <topology-analysis>`).
+Take the following figure as an example.
+
+.. image:: ../../img/mask_conflict.jpg
+   :target: ../../img/mask_conflict.jpg
+   :scale: 80%
+   :align: center
+   :alt: 
+
+Suppose we prune the first 50% of the output channels (filters) of conv1 and the last 50% of the output channels of conv2.
+Although both layers have pruned 50% of the filters, the speedup module still needs to add zeros to align the output channels.
+In this case, we cannot harvest the speed benefit from the model pruning.
+
+To better gain the speed benefit of model pruning, we add a dependency-aware mode for the pruners that can prune output channels.
+In dependency-aware mode, the pruner prunes the model based not only on the metric of each output channel, but also on the topology of the whole network architecture.
+
+In dependency-aware mode (``dependency_aware`` is set to ``True``), the pruner will try to prune the same output channels for the layers that have channel dependencies on each other, as shown in the following figure.
+
+.. image:: ../../img/dependency-aware.jpg
+   :target: ../../img/dependency-aware.jpg
+   :scale: 80%
+   :align: center
+   :alt: 
+
+Take the dependency-aware mode of :ref:`l1-norm-pruner` as an example.
+Specifically, for each channel, the pruner calculates the sum of the L1 norms over all the layers in the dependency set.
+The number of channels of this dependency set that can actually be pruned in the end is determined by the minimum sparsity of the layers in the set (denoted by ``min_sparsity``).
+According to the L1 norm sum of each channel, the pruner prunes the same ``min_sparsity`` channels for all the layers.
+Next, the pruner additionally prunes ``sparsity`` - ``min_sparsity`` channels for each convolutional layer based on its own per-channel L1 norm.
+For example, suppose the output channels of ``conv1`` and ``conv2`` are added together, and the configured sparsities of ``conv1`` and ``conv2`` are 0.3 and 0.2 respectively.
+In this case, the ``dependency-aware pruner`` will:
+
+* First, prune the same 20% of channels for ``conv1`` and ``conv2`` according to the L1 norm sum of ``conv1`` and ``conv2``.
+* Second, additionally prune 10% of the channels of ``conv1`` according to the L1 norm of each channel of ``conv1``.
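
The two steps above can be sketched in plain Python for the ``conv1``/``conv2`` example. This is a simplified illustration, not NNI's pruner: the function ``dependency_aware_masks`` and the per-channel L1 norms are made up for the example, and filter groups are ignored.

```python
def dependency_aware_masks(norms, sparsities):
    """norms: {layer: [L1 norm per output channel]}, sparsities: {layer: ratio}."""
    layers = list(norms)
    num_channels = len(norms[layers[0]])
    channels = range(num_channels)

    # Step 1: the channels with the smallest summed L1 norm are pruned in every layer.
    min_sparsity = min(sparsities[name] for name in layers)
    num_shared = round(num_channels * min_sparsity)
    summed = [sum(norms[name][c] for name in layers) for c in channels]
    shared = set(sorted(channels, key=lambda c: summed[c])[:num_shared])

    masks = {}
    for name in layers:
        # Step 2: each layer prunes its remaining quota using its own L1 norms.
        num_extra = round(num_channels * sparsities[name]) - num_shared
        remaining = [c for c in channels if c not in shared]
        extra = set(sorted(remaining, key=lambda c: norms[name][c])[:num_extra])
        pruned = shared | extra
        masks[name] = [0 if c in pruned else 1 for c in channels]
    return masks


norms = {
    "conv1": [5.0, 0.2, 3.0, 0.4, 4.0, 0.1, 6.0, 2.5, 3.5, 1.0],
    "conv2": [4.0, 0.3, 0.1, 3.0, 5.0, 0.2, 2.0, 6.0, 1.0, 3.5],
}
masks = dependency_aware_masks(norms, {"conv1": 0.3, "conv2": 0.2})
```

With these numbers, channels 1 and 5 are pruned in both layers, and ``conv1`` additionally prunes channel 3. Channel 2 has the smallest norm in ``conv2`` on its own, yet it is kept because its summed norm across the dependency set is not among the two smallest.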
+
+In addition, for convolutional layers that have more than one filter group,
+the ``dependency-aware pruner`` also tries to prune the same number of channels for each filter group.
+Overall, this pruner prunes the model according to the L1 norm of each filter and tries to meet the topological constraints (channel dependency, etc.) to improve the final speed gain after the speedup process.
+
+.. Note:: Operations that will be recognized as having channel dependencies: add/sub/mul/div, addcmul/addcdiv, logical_and/or/xor
+
+In the dependency-aware mode, the pruner will provide a better speed gain from the model pruning.
--- a/docs/source/compression/pruning_scheduler.rst
+++ b/docs/source/compression/pruning_scheduler.rst
+Pruning Scheduler
+=================
+
+The pruning scheduler is a new feature supported in pruning v2. It brings more flexibility for pruning a model iteratively.
+All the built-in iterative pruners (e.g., AGPPruner, SimulatedAnnealingPruner) are based on three abstracted components: a pruning scheduler, pruners, and task generators.
+In addition to using the NNI built-in iterative pruners,
+users can directly use the pruning schedulers to customize their own iterative pruning logic.
+
+Workflow of Pruning Scheduler
+-----------------------------
+
+In iterative pruning, the final goal is broken down into several small goals, and one small goal is completed in each iteration.
+For example, each iteration can increase the sparsity ratio a little, so that after several pruning iterations the successively pruned model reaches the final overall sparsity.
+Alternatively, the overall sparsity can be fixed while each iteration tries a different way to allocate sparsity between layers, in order to find the best allocation.
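
As an illustration of the first strategy, the sketch below computes per-iteration sparsity targets for a linear schedule and for the cubic schedule ``s_t = s_f + (s_i - s_f) * (1 - t / n) ** 3`` from the AGP paper. The function names are hypothetical, and how NNI's own task generators map iterations to sparsity is not shown in this document, so treat this as a sketch rather than the library's behavior.

```python
def linear_schedule(final_sparsity, total_iteration):
    """Sparsity target per iteration, increasing by the same amount each time."""
    return [final_sparsity * t / total_iteration
            for t in range(1, total_iteration + 1)]


def agp_schedule(final_sparsity, total_iteration, initial_sparsity=0.0):
    """Cubic schedule: prune quickly at first, then slow down near the final target."""
    return [
        final_sparsity
        + (initial_sparsity - final_sparsity) * (1 - t / total_iteration) ** 3
        for t in range(1, total_iteration + 1)
    ]


linear = linear_schedule(0.8, 5)
agp = agp_schedule(0.8, 5)
for t, (lin, cub) in enumerate(zip(linear, agp), start=1):
    print(f"iteration {t}: linear={lin:.4f} agp={cub:.4f}")
```

Both schedules end at the same overall sparsity; the cubic one removes most of the weights in the early iterations, when the model still has plenty of redundancy.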
+
+We define a small goal as a ``Task``. It usually includes states inherited from previous iterations (e.g., pruned model and masks) and a description of the current goal (e.g., a config list that describes how to allocate sparsity).
+Details about ``Task`` can be found in this :githublink:`file <nni/algorithms/compression/v2/pytorch/base/scheduler.py>`.
+
+The pruning scheduler handles two main components: a basic pruner and a task generator. The logic of generating ``Task`` is encapsulated in the task generator.
+In an iteration (one pruning step), the pruning scheduler parses the ``Task`` obtained from the task generator,
+and resets the pruner with the ``model``, ``masks``, and ``config_list`` parsed from the ``Task``.
+Then the pruning scheduler generates the new masks with the pruner. During an iteration, the new masked model may also undergo speedup, finetuning, and evaluation.
+After one iteration is done, the pruning scheduler collects the compact model, new masks, and evaluation score, packages them into a ``TaskResult``, and passes it to the task generator.
+The iteration process ends when the task generator has no more ``Task``.
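
The control flow described above can be mimicked with a toy, framework-free loop. Everything in this sketch is an assumption made for illustration: the ``Task`` and ``TaskResult`` dataclasses, ``ToyTaskGenerator``, ``magnitude_prune``, and ``run_scheduler`` are simplified stand-ins, not the classes in NNI's ``scheduler.py``, and the "model" is just a list of weights.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    task_id: int
    weights: List[float]  # state inherited from the previous iteration
    sparsity: float       # goal of the current iteration


@dataclass
class TaskResult:
    task_id: int
    weights: List[float]
    mask: List[int]
    score: float


class ToyTaskGenerator:
    """Hands out one task per sparsity target, each built on the previous result."""

    def __init__(self, weights, sparsities):
        self._sparsities = list(sparsities)
        self._pending = [Task(0, list(weights), self._sparsities[0])]

    def next(self) -> Optional[Task]:
        return self._pending.pop(0) if self._pending else None

    def receive_task_result(self, result: TaskResult) -> None:
        next_id = result.task_id + 1
        if next_id < len(self._sparsities):
            self._pending.append(
                Task(next_id, result.weights, self._sparsities[next_id]))


def magnitude_prune(weights, sparsity):
    """Stand-in for a basic pruner: mask the smallest-magnitude weights."""
    num_pruned = round(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:num_pruned])
    mask = [0 if i in pruned else 1 for i in range(len(weights))]
    return [w * m for w, m in zip(weights, mask)], mask


def run_scheduler(generator):
    results = []
    task = generator.next()
    while task is not None:  # stop when the generator has no more tasks
        weights, mask = magnitude_prune(task.weights, task.sparsity)
        score = sum(abs(w) for w in weights)  # stand-in for a real evaluation
        result = TaskResult(task.task_id, weights, mask, score)
        generator.receive_task_result(result)  # lets the generator plan the next task
        results.append(result)
        task = generator.next()
    return results


results = run_scheduler(ToyTaskGenerator(
    [0.9, -0.1, 0.4, 0.05, -0.7, 0.3, 0.2, -0.6], [0.25, 0.5, 0.75]))
```

Each iteration starts from the weights produced by the previous one, so the number of masked weights grows from two to four to six as the sparsity target rises.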
+
+How to Customize Iterative Pruning
+-----------------------------------
+
+We use AGP pruning as an example to explain how to implement iterative pruning with the scheduler in NNI.
+
+.. code-block:: python
+
+    import torch
+
+    from nni.algorithms.compression.v2.pytorch.pruning import L1NormPruner, PruningScheduler
+    from nni.algorithms.compression.v2.pytorch.pruning.tools import AGPTaskGenerator
+
+    # model, config_list, finetuner and device are assumed to be defined by the user.
+    dummy_input = torch.rand(10, 3, 224, 224).to(device)
+
+    # The pruner receives its model and config_list from the task generator in each iteration.
+    pruner = L1NormPruner(model=None, config_list=None, mode='dependency_aware', dummy_input=dummy_input)
+    task_generator = AGPTaskGenerator(total_iteration=10, origin_model=model, origin_config_list=config_list, log_dir='.', keep_intermediate_result=True)
+    scheduler = PruningScheduler(pruner, task_generator, finetuner=finetuner, speedup=True, dummy_input=dummy_input, evaluator=None, reset_weight=False)
+
+    # Run all pruning iterations, then fetch the best model and its masks.
+    scheduler.compress()
+    _, model, masks, _, _ = scheduler.get_best_result()
+
+The full script can be found :githublink:`here <examples/model_compress/pruning/scheduler_torch.py>`.
+
+In this example, we use dependency-aware mode L1 Norm Pruner as a basic pruner during each iteration.
+Note we do not need to pass ``model`` and ``config_list`` to the pruner, because in each iteration the ``model`` and ``config_list`` used by the pruner are received from the task generator.
+Then we can use ``scheduler`` as an iterative pruner directly. In fact, this is the implementation of ``AGPPruner`` in NNI.
+
+More about Task Generator
+-------------------------
+
+The task generator provides the model to be pruned in each iteration and the corresponding config_list.
+For example, ``AGPTaskGenerator`` provides the model pruned in the previous iteration and computes the sparsity to use in the current iteration.
+``TaskGenerator`` puts all this pruning information into a ``Task``; the pruning scheduler gets the ``Task`` and runs it.
+The pruning result is returned to the ``TaskGenerator`` at the end of each iteration, and the ``TaskGenerator`` decides whether and how to generate the next ``Task``.
+
+The information included in the ``Task`` and ``TaskResult`` can be found :githublink:`here <nni/algorithms/compression/v2/pytorch/base/scheduler.py>`.
+
+A clearer iterative pruning flow chart can be found :doc:`here <pruning>`.
+
+
+If you want to implement your own task generator, please follow the ``TaskGenerator`` :githublink:`interface <nni/algorithms/compression/v2/pytorch/pruning/tools/base.py>`.
+Two main functions should be implemented: ``init_pending_tasks(self) -> List[Task]`` and ``generate_tasks(self, task_result: TaskResult) -> List[Task]``.
+
+Why Use Pruning Scheduler
+-------------------------
+
+One benefit of using a scheduler for iterative pruning is that users can access more functions of the NNI pruning components.
+For the sake of interface simplicity and fidelity to the original papers, NNI does not fully expose all the low-level interfaces to the upper layer.
+For example, resetting the weight values to those of the original model in each iteration is a key point of the lottery ticket pruning algorithm, and this is implemented in ``LotteryTicketPruner``.
+To reduce the complexity of the interface, we only support this function in ``LotteryTicketPruner``, not in other pruners.
+If users want to reset weights during each iteration of AGP pruning, ``AGPPruner`` cannot do this, but users can easily set ``reset_weight=True`` in ``PruningScheduler`` to achieve it.
+
+What's more, for a customized pruner or task generator, using the scheduler can easily enhance the algorithm.
+In addition, users can also customize the scheduling process to implement their own scheduler.
--- a/docs/source/compression/quantization.rst
+++ b/docs/source/compression/quantization.rst
+Overview of NNI Model Quantization
+==================================
+
+Quantization refers to compressing models by reducing the number of bits required to represent weights or activations,
+which can reduce the computations and the inference time. In the context of deep neural networks, the major numerical
+format for model weights is 32-bit float, or FP32. Many research works have demonstrated that weights and activations
+can be represented using 8-bit integers without significant loss in accuracy. Even lower bit-widths, such as 4/2/1 bits,
+are an active field of research.
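
To make the idea concrete, the sketch below maps a few float values to 8-bit unsigned integers with an affine (scale and zero-point) scheme and then maps them back. It is a generic illustration of the technique, not the implementation of any NNI quantizer; the function names and sample values are assumptions made for this example.

```python
def quantization_params(values, num_bits=8):
    """Derive a scale and zero point so that [min, max] maps onto [0, 2**num_bits - 1]."""
    qmax = 2 ** num_bits - 1
    lo = min(min(values), 0.0)  # the range must contain 0 so that 0.0 stays exact
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 if all values are zero
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Map floats to integers in [0, 2**num_bits - 1], clipping out-of-range values."""
    qmax = 2 ** num_bits - 1
    return [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(quantized, scale, zero_point):
    """Map integers back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-0.5, -0.1, 0.0, 0.3, 1.5]
scale, zero_point = quantization_params(weights)
quantized = quantize(weights, scale, zero_point)
restored = dequantize(quantized, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each value now needs 8 bits instead of 32, and for values inside the calibrated range the round-trip error stays within half a quantization step (``scale / 2``).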
+
+A quantizer is a quantization algorithm implementation in NNI.
+You can also :doc:`create your own quantizer <../tutorials/quantization_customize>` using the NNI model compression interface.
--- a/docs/source/compression/quantizer.rst
+++ b/docs/source/compression/quantizer.rst
+Quantizer in NNI
+================
+
+NNI implements the main part of each quantization algorithm as a quantizer. All quantizers are implemented as close as possible to what is described in the corresponding paper (if there is one).
+The following table provides a brief introduction to the quantizers implemented in NNI. Click a link in the table to view a more detailed introduction and use cases.
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Name
+     - Brief Introduction of Algorithm
+   * - :ref:`naive-quantizer`
+     - Quantize weights to default 8 bits
+   * - :ref:`qat-quantizer`
+     - Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. `Reference Paper <http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf>`__
+   * - :ref:`dorefa-quantizer`
+     - DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. `Reference Paper <https://arxiv.org/abs/1606.06160>`__
+   * - :ref:`bnn-quantizer`
+     - Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. `Reference Paper <https://arxiv.org/abs/1602.02830>`__
+   * - :ref:`lsq-quantizer`
+     - Learned step size quantization. `Reference Paper <https://arxiv.org/pdf/1902.08153.pdf>`__
+   * - :ref:`observer-quantizer`
+     - Post-training quantization. Collects quantization information during calibration with observers.
--- a/docs/source/compression/toctree.rst
+++ b/docs/source/compression/toctree.rst
+Compression
+===========
+
+.. toctree::
+    :hidden:
+    :maxdepth: 2
+
+    Overview <overview>
+    Pruning <toctree_pruning>
+    Quantization <toctree_quantization>
+    Config Specification <compression_config_list>
+    Evaluator <compression_evaluator>
+    Advanced Usage <advanced_usage>
--- a/docs/source/compression/toctree_pruning.rst
+++ b/docs/source/compression/toctree_pruning.rst
+Pruning
+=======
+
+.. toctree::
+    :hidden:
+    :maxdepth: 2
+
+    Overview <pruning>
+    Quickstart </tutorials/pruning_quick_start_mnist>
+    Pruner <pruner>
+    Speedup </tutorials/pruning_speedup>
+    Best Practices <best_practices>
--- a/docs/source/compression/toctree_quantization.rst
+++ b/docs/source/compression/toctree_quantization.rst
+Quantization
+============
+
+.. toctree::
+    :hidden:
+    :maxdepth: 2
+
+    Overview <quantization>
+    Quickstart </tutorials/quantization_quick_start_mnist>
+    Quantizer <quantizer>
+    SpeedUp </tutorials/quantization_speedup>
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
+# -*- coding: utf-8 -*-
+#
+# Configuration file for the Sphinx documentation builder.
+#
+# This file does only contain a selection of the most common options. For a
+# full list see the documentation:
+# http://www.sphinx-doc.org/en/master/config
+
+# -- Path setup --------------------------------------------------------------
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+import os
+import re
+import subprocess
+import sys
+sys.path.insert(0, os.path.abspath('../..'))
+sys.path.insert(0, os.path.abspath('../extension'))
+
+
+# -- Project information ---------------------------------------------------
+
+from datetime import datetime
+project = 'NNI'
+copyright = f'{datetime.now().year}, Microsoft'
+author = 'Microsoft'
+
+# The short X.Y version
+version = ''
+# The full version, including alpha/beta/rc tags
+# FIXME: this should be written somewhere globally
+release = 'v2.10'
+
+# -- General configuration ---------------------------------------------------
+
+# If your documentation needs a minimal Sphinx version, state it here.
+#
+# needs_sphinx = '1.0'
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = [
+    'sphinx_gallery.gen_gallery',
+    'sphinx.ext.autodoc',
+    'sphinx.ext.autosummary',
+    'sphinx.ext.intersphinx',
+    'sphinx.ext.mathjax',
+    'sphinxarg4nni.ext',
+    'sphinx.ext.napoleon',
+    'sphinx.ext.viewcode',
+    'sphinxcontrib.bibtex',
+    'sphinxcontrib.youtube',
+    # 'nbsphinx',  # nbsphinx has conflicts with sphinx-gallery.
+    'sphinx.ext.extlinks',
+    'IPython.sphinxext.ipython_console_highlighting',
+    'sphinx_tabs.tabs',
+    'sphinx_copybutton',
+
+    # Custom extensions in extension/ folder.
+    'tutorial_links',  # this has to be after sphinx-gallery
+    'getpartialtext',
+    'inplace_translation',
+    'cardlinkitem',
+    'codesnippetcard',
+    'patch_autodoc',
+    'toctree_check',
+]
+
+# Autosummary related settings
+autosummary_imported_members = True
+autosummary_ignore_module_all = False
+
+# Auto-generate stub files before building docs
+autosummary_generate = True
+
+# Add mock modules
+autodoc_mock_imports = [
+    'apex', 'nni_node', 'tensorrt', 'pycuda', 'nn_meter', 'azureml',
+    'ConfigSpace', 'ConfigSpaceNNI', 'smac', 'statsmodels', 'pybnn',
+]
+
+# Some of our modules cannot generate summary
+autosummary_mock_imports = [
+    'nni.retiarii.codegen.tensorflow',
+    'nni.nas.benchmarks.nasbench101.db_gen',
+    'nni.tools.jupyter_extension.management',
+] + autodoc_mock_imports
+
+autodoc_typehints = 'description'
+autodoc_typehints_description_target = 'documented'
+autodoc_inherit_docstrings = False
+
+# Sphinx will warn about all references where the target cannot be found.
+nitpicky = False  # disabled for now
+
+# A list of regular expressions that match URIs that should not be checked.
+linkcheck_ignore = [
+    r'http://localhost:\d+',
+    r'.*://.*/#/',                           # Modern websites that have URLs like xxx.com/#/guide
+    r'https://github\.com/JSong-Jia/Pic/',   # Community links can't be found any more
+
+    # Some URLs that often fail
+    r'https://www\.cs\.toronto\.edu/',                      # CIFAR-10
+    r'https://help\.aliyun\.com/document_detail/\d+\.html', # Aliyun
+    r'http://www\.image-net\.org/',                         # ImageNet
+    r'https://www\.msra\.cn/',                              # MSRA
+    r'https://1drv\.ms/',                                   # OneDrive (shortcut)
+    r'https://onedrive\.live\.com/',                        # OneDrive
+    r'https://www\.openml\.org/',                           # OpenML
+    r'https://ml\.informatik\.uni-freiburg\.de/',
+    r'https://docs\.nvidia\.com/deeplearning/',
+    r'https://cla\.opensource\.microsoft\.com',
+    r'https://www\.docker\.com/',
+]
+
+# Ignore all links located in release.rst
+linkcheck_exclude_documents = ['^release']
+
+# Bibliography files
+bibtex_bibfiles = ['refs.bib']
+
+# Add a heading to bibliography
+bibtex_footbibliography_header = '.. rubric:: Bibliography'
+
+# Set bibliography style
+bibtex_default_style = 'plain'
+
+# Sphinx gallery examples
+sphinx_gallery_conf = {
+    'examples_dirs': '../../examples/tutorials',   # path to your example scripts
+    'gallery_dirs': 'tutorials',                   # path to where to save gallery generated output
+
+    # Control ignored python files.
+    'ignore_pattern': r'__init__\.py|/scripts/',
+
+    # This is `/plot` by default. Only files starting with `/plot` will be executed.
+    # All files should be executed in our case.
+    'filename_pattern': r'.*',
+
+    # Disabling download button of all scripts
+    'download_all_examples': False,
+
+    # Change default thumbnail
+    # Working directory is strange, needs full path.
+    'default_thumb_file': os.path.join(os.path.dirname(__file__), '../img/thumbnails/nni_icon_blue.png'),
+}
+
+# Copybutton: strip and configure input prompts for code cells.
+copybutton_prompt_text = r">>> |\.\.\. |\$ |In \[\d*\]: | {2,5}\.\.\.: | {5,8}: "
+copybutton_prompt_is_regexp = True
+
+# Copybutton: customize selector to exclude gallery outputs.
+copybutton_selector = ":not(div.sphx-glr-script-out) > div.highlight pre"
+
+# Allow additional builders to be considered compatible.
+sphinx_tabs_valid_builders = ['linkcheck']
+
+# Disallow the sphinx tabs css from loading.
+sphinx_tabs_disable_css_loading = True
+
+# Some tutorials might need to appear more than once in toc.
+# In this list, we make source/target tutorial pairs.
+# Each "source" tutorial rst will be copied to "target" tutorials.
+# The anchors will be replaced to avoid duplicate labels.
+# Target should start with ``cp_`` to be properly ignored in git.
+tutorials_copy_list = [
+    # Seems that we don't need it for now.
+    # Add tuples back if we need it in future.
+]
+
+# Toctree ensures that toctree docs do not contain any other contents.
+# Home page should be an exception.
+toctree_check_whitelist = [
+    'index',
+
+    # FIXME: Other exceptions should be correctly handled.
+    'compression/index',
+    'compression/pruning',
+    'compression/quantization',
+    'hpo/hpo_benchmark',
+]
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['../templates']
+
+# The suffix(es) of source filenames.
+# You can specify multiple suffixes as a list of strings:
+source_suffix = ['.rst']
+
+# The master toctree document.
+master_doc = 'index'
+
+# The language for content autogenerated by Sphinx. Refer to documentation
+# for a list of supported languages.
+#
+# This is also used if you do content translation via gettext catalogs.
+# Usually you set "language" from the command line for these cases.
+language = 'en'
+
+# Translation related settings
+locale_dir = ['locales']
+
+# Documents that require translation: https://github.com/microsoft/nni/issues/4298
+gettext_documents = [
+    r'^index$',
+    r'^quickstart$',
+    r'^installation$',
+    r'^(nas|hpo|compression)/overview$',
+    r'^tutorials/(hello_nas|pruning_quick_start_mnist|hpo_quickstart_pytorch/main)$',
+]
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+exclude_patterns = [
+    '_build',
+    'Thumbs.db',
+    '.DS_Store',
+    '**.ipynb_checkpoints',
+    # Exclude translations. They will be added back via replacement later if language is set.
+    '**_zh.rst',
+    # Exclude generated tutorials index
+    'tutorials/index.rst',
+]
+
+# The name of the Pygments (syntax highlighting) style to use.
+pygments_style = None
+
+# -- Options for HTML output -------------------------------------------------
+
+# HTML logo
+html_logo = '../img/nni_icon.svg'
+
+# HTML favicon
+html_favicon = '../img/favicon.ico'
+
+# The theme to use for HTML and HTML Help pages.  See the documentation for
+# a list of builtin themes.
+#
+html_theme = 'sphinx_material'
+
+# Theme options are theme-specific and customize the look and feel of a theme
+# further.  For a list of options available for each theme, see the
+# documentation.
+#
+html_theme_options = {
+
+    # Set the name of the project to appear in the navigation.
+    'nav_title': 'Neural Network Intelligence',
+
+    # Set your GA account ID to enable tracking
+    'google_analytics_account': 'UA-136029994-1',
+
+    # Specify a base_url used to generate sitemap.xml. If not
+    # specified, then no sitemap will be built.
+    'base_url': 'https://nni.readthedocs.io/',
+
+    # Set the color and the accent color
+    # Remember to update static/css/material_custom.css when this is updated.
+    # Set those colors in layout.html.
+    'color_primary': 'custom',
+    'color_accent': 'custom',
+
+    # Set the repo location to get a badge with stats
+    'repo_url': 'https://github.com/microsoft/nni/',
+    'repo_name': 'GitHub',
+
+    # Visible levels of the global TOC; -1 means unlimited
+    'globaltoc_depth': 5,
+
+    # Expand all toc so that they can be dynamically collapsed
+    'globaltoc_collapse': False,
+
+    'version_dropdown': True,
+    # This is a placeholder, which should be replaced later.
+    'version_info': {
+        'current': '/'
+    },
+
+    # Text to appear at the top of the home page in a "hero" div.
+    'heroes': {
+        'index': 'An open source AutoML toolkit for hyperparameter optimization, neural architecture search, '
+                 'model compression and feature engineering.'
+    }
+}
+
+# Disable show source link.
+html_show_sourcelink = False
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['../static']
+
+# Custom sidebar templates, must be a dictionary that maps document names
+# to template names.
+#
+# The default sidebars (for documents that don't match any pattern) are
+# defined by theme itself.  Builtin themes are using these templates by
+# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
+# 'searchbox.html']``.
+#
+html_sidebars = {
+    "**": ["logo-text.html", "globaltoc.html", "localtoc.html", "searchbox.html"]
+}
+
+html_title = 'Neural Network Intelligence'
+
+# Add extra css files and js files
+html_css_files = [
+    'css/material_theme.css',
+    'css/material_custom.css',
+    'css/material_dropdown.css',
+    'css/sphinx_gallery.css',
+    'css/index_page.css',
+]
+html_js_files = [
+    'js/version.js',
+    'js/github.js',
+    'js/sphinx_gallery.js',
+    'js/misc.js'
+]
+
+# HTML context that can be used in jinja templates
+git_commit_id = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()
+
+html_context = {
+    'git_commit_id': git_commit_id
+}
+
+# -- Options for HTMLHelp output ---------------------------------------------
+
+# Output file base name for HTML help builder.
+htmlhelp_basename = 'NeuralNetworkIntelligencedoc'
+
+
+# -- Options for LaTeX output ------------------------------------------------
+
+latex_elements = {
+    # The paper size ('letterpaper' or 'a4paper').
+    #
+    # 'papersize': 'letterpaper',
+
+    # The font size ('10pt', '11pt' or '12pt').
+    #
+    # 'pointsize': '10pt',
+
+    # Additional stuff for the LaTeX preamble.
+    #
+    # 'preamble': '',
+
+    # Latex figure (float) alignment
+    #
+    # 'figure_align': 'htbp',
+}
+
+# Grouping the document tree into LaTeX files. List of tuples
+# (source start file, target name, title,
+#  author, documentclass [howto, manual, or own class]).
+latex_documents = [
+    (master_doc, 'NeuralNetworkIntelligence.tex', 'Neural Network Intelligence Documentation',
+     'Microsoft', 'manual'),
+]
+
+
+# -- Options for manual page output ------------------------------------------
+
+# One entry per manual page. List of tuples
+# (source start file, name, description, authors, manual section).
+man_pages = [
+    (master_doc, 'neuralnetworkintelligence', 'Neural Network Intelligence Documentation',
+     [author], 1)
+]
+
+
+# -- Options for Texinfo output ----------------------------------------------
+
+# Grouping the document tree into Texinfo files. List of tuples
+# (source start file, target name, title, author,
+#  dir menu entry, description, category)
+texinfo_documents = [
+    (master_doc, 'NeuralNetworkIntelligence', 'Neural Network Intelligence Documentation',
+     author, 'NeuralNetworkIntelligence', 'One line description of project.',
+     'Miscellaneous'),
+]
+
+
+# -- Options for Epub output -------------------------------------------------
+
+# Bibliographic Dublin Core info.
+epub_title = project
+
+# The unique identifier of the text. This can be a ISBN number
+# or the project homepage.
+#
+# epub_identifier = ''
+
+# A unique identification for the text.
+#
+# epub_uid = ''
+
+# A list of files that should not be packed into the epub file.
+epub_exclude_files = ['search.html']
+
+# external links (for github code)
+# Reference the code via :githublink:`path/to/your/example/code.py`
+extlinks = {
+    'githublink': ('https://github.com/microsoft/nni/blob/' + git_commit_id + '/%s', 'Github link: %s')
+}
--- a/docs/source/deprecated/oneshot_legacy.rst
+++ b/docs/source/deprecated/oneshot_legacy.rst
+:orphan:
+
+One-shot Strategy (legacy)
+==========================
+
+.. warning:: This page will be removed in future releases.
+
+.. _darts-strategy:
+
+DARTS
+-----
+
+The paper `DARTS: Differentiable Architecture Search <https://arxiv.org/abs/1806.09055>`__ addresses the scalability challenge of architecture search by formulating the task in a differentiable manner. Their method is based on the continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent.
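
The continuous relaxation described above can be illustrated with a minimal sketch. It is not the NNI or the official DARTS implementation: the candidate operations, the names ``mixed_op`` and ``derive_op``, and the scalar inputs are all assumptions made for illustration. Each edge outputs a softmax-weighted sum of its candidate operations, so the architecture weights become ordinary differentiable parameters.

```python
import math


def softmax(weights):
    """Numerically stable softmax over a list of floats."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate operations on a single edge (scalars stand in for tensors).
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def mixed_op(x, alpha):
    """Continuous relaxation: softmax-weighted sum of all candidate outputs."""
    probs = softmax(alpha)
    return sum(p * op(x) for p, op in zip(probs, CANDIDATE_OPS.values()))


def derive_op(alpha):
    """Discretize after the search: keep the op with the largest weight."""
    names = list(CANDIDATE_OPS)
    return names[max(range(len(alpha)), key=lambda i: alpha[i])]


alpha = [0.0, 0.0, 0.0]                # architecture weights, one per candidate
uniform_output = mixed_op(3.0, alpha)  # every candidate contributes equally
chosen = derive_op([0.1, 2.0, -1.0])   # the op kept once the search is finished
```

In an actual search the weights in ``alpha`` would be updated by gradient descent on the validation loss, and ``derive_op`` would only be applied once at the end.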
+
+The authors' code optimizes the network weights and architecture weights alternately in mini-batches. They further explore using second-order optimization (unroll) instead of first-order optimization to improve performance.
+
+Implementation on NNI is based on the `official implementation <https://github.com/quark0/darts>`__ and a `popular third-party repo <https://github.com/khanrc/pt.darts>`__. DARTS on NNI is designed to be general for arbitrary search spaces. A CNN search space tailored for CIFAR10, the same as in the original paper, is implemented as a use case of DARTS.
+
+..  autoclass:: nni.retiarii.oneshot.pytorch.DartsTrainer
+
+Reproduction Results
+^^^^^^^^^^^^^^^^^^^^
+
+The example above is meant to reproduce the results in the paper. We run experiments with both first-order and second-order optimization. Due to time limits, we retrain *only the best architecture* derived from the search phase and repeat the experiment *only once*. Our results are currently on par with the results reported in the paper. We will add more results when they are ready.
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - 
+     - In paper
+     - Reproduction
+   * - First order (CIFAR10)
+     - 3.00 +/- 0.14
+     - 2.78
+   * - Second order (CIFAR10)
+     - 2.76 +/- 0.09
+     - 2.80
+
+Examples
+^^^^^^^^
+
+:githublink:`Example code <examples/nas/oneshot/darts>`
+
+.. code-block:: bash
+
+   # In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
+   git clone https://github.com/Microsoft/nni.git
+
+   # search the best architecture
+   cd examples/nas/oneshot/darts
+   python3 search.py
+
+   # train the best architecture
+   python3 retrain.py --arc-checkpoint ./checkpoints/epoch_49.json
+
+Limitations
+^^^^^^^^^^^
+
+* DARTS doesn't support DataParallel and needs to be customized in order to support DistributedDataParallel.
+
+.. _enas-strategy:
+
+ENAS
+----
+
+The paper `Efficient Neural Architecture Search via Parameter Sharing <https://arxiv.org/abs/1802.03268>`__ uses parameter sharing between child models to accelerate the NAS process. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained with policy gradient to select a subgraph that maximizes the expected reward on the validation set. Meanwhile the model corresponding to the selected subgraph is trained to minimize a canonical cross entropy loss.
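
The controller update described above can be sketched on a toy problem. The following is only an illustration under stated assumptions: the three "subgraphs" and their rewards in ``REWARDS`` are made up, and the update uses the exact expected policy gradient rather than the sampled estimate a real controller would use, so that the sketch stays deterministic.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical validation rewards for three candidate subgraphs.
REWARDS = [0.1, 0.9, 0.3]


def expected_reward(logits):
    """Expected reward of the controller's current sampling distribution."""
    return sum(p * r for p, r in zip(softmax(logits), REWARDS))


def policy_gradient_step(logits, lr=1.0):
    """One ascent step on E[R]; dE[R]/dlogit_i = p_i * (r_i - E[R])."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, REWARDS))
    return [l + lr * p * (r - baseline) for l, p, r in zip(logits, probs, REWARDS)]


logits = [0.0, 0.0, 0.0]   # the controller starts with a uniform policy
initial = expected_reward(logits)
for _ in range(200):
    logits = policy_gradient_step(logits)
final_probs = softmax(logits)
```

After training, the controller puts most of its probability on the subgraph with the highest reward. In ENAS the reward would come from evaluating the sampled child model with shared weights on the validation set.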
+
+Implementation on NNI is based on the `official implementation in Tensorflow <https://github.com/melodyguan/enas>`__, including a general-purpose reinforcement-learning controller and a trainer that trains the target network and this controller alternately. Following the paper, we have also implemented the macro and micro search spaces on CIFAR10 to demonstrate how to use these trainers. Since the code to train from scratch on NNI is not ready yet, reproduction results are currently unavailable.
+
+..  autoclass:: nni.retiarii.oneshot.pytorch.EnasTrainer
+
+Examples
+^^^^^^^^
+
+:githublink:`Example code <examples/nas/oneshot/enas>`
+
+.. code-block:: bash
+
+   # In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
+   git clone https://github.com/Microsoft/nni.git
+
+   # search the best architecture
+   cd examples/nas/oneshot/enas
+
+   # search in macro search space
+   python3 search.py --search-for macro
+
+   # search in micro search space
+   python3 search.py --search-for micro
+
+   # view more options for search
+   python3 search.py -h
+
+.. _fbnet-strategy:
+
+FBNet
+-----
+
+.. note:: This one-shot NAS is still implemented under NNI NAS 1.0, and will `be migrated to the Retiarii framework in the near future <https://github.com/microsoft/nni/issues/3814>`__.
+
+For the mobile application of facial landmark detection, we have applied FBNet (Block-wise DNAS) to the basic architecture of the PFLD model to design a concise model that balances latency and accuracy. References are listed below:
+
+* `FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search <https://arxiv.org/abs/1812.03443>`__
+* `PFLD: A Practical Facial Landmark Detector <https://arxiv.org/abs/1902.10859>`__
+
+FBNet is a block-wise differentiable NAS method (Block-wise DNAS), in which the best candidate building blocks are chosen using Gumbel Softmax random sampling and differentiable training. At each layer (or stage) to be searched, the candidate blocks are placed side by side (similar in effect to structural re-parameterization), which allows sufficient pre-training of the supernet. The pre-trained supernet is then sampled to obtain a subnet, which is fine-tuned to achieve better performance.
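
A minimal sketch of Gumbel Softmax sampling over the candidate blocks of one layer is shown below. It is not the NNI implementation: the function name ``gumbel_softmax``, the logits in ``theta``, and the temperatures are assumptions chosen for illustration.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Return relaxed one-hot weights sampled with the Gumbel-Softmax trick."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logs finite
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
theta = [1.5, 0.2, -0.3]   # hypothetical architecture logits for three blocks
weights = gumbel_softmax(theta, temperature=5.0, rng=rng)
sharp = gumbel_softmax(theta, temperature=0.1, rng=rng)
```

A high temperature keeps the weights close to uniform, so all candidate blocks receive gradient. A low temperature pushes the sample toward a one-hot choice of a single block.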
+
+.. image:: ../../img/fbnet.png
+   :width: 800
+   :align: center
+
+PFLD is a lightweight facial landmark model for real-time applications. The architecture of PFLD is first simplified for acceleration by using the stem block of PeleeNet, average pooling with depthwise convolution, and the eSE module.
+
+To achieve a better trade-off between latency and accuracy, FBNet is further applied to the simplified PFLD to search for the best block at each specific layer. The search space is based on the FBNet space and is optimized for mobile deployment by using average pooling with depthwise convolution, the eSE module, etc.
+
+Experiments
+^^^^^^^^^^^
+
+To verify the effectiveness of FBNet applied on PFLD, we choose the open source dataset with 106 landmark points as the benchmark:
+
+* `Grand Challenge of 106-Point Facial Landmark Localization <https://arxiv.org/abs/1905.03469>`__
+
+The baseline model is denoted as MobileNet-V3 PFLD (`Reference baseline <https://github.com/Hsintao/pfld_106_face_landmarks>`__), and the searched model is denoted as Subnet. The experimental results are listed below, where the latency is tested on a Qualcomm 625 CPU (ARMv8):
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Model
+     - Size
+     - Latency
+     - Validation NME
+   * - MobileNet-V3 PFLD
+     - 1.01MB
+     - 10ms
+     - 6.22%
+   * - Subnet
+     - 693KB
+     - 1.60ms
+     - 5.58%
+
+Example
+^^^^^^^
+
+`Example code <https://github.com/microsoft/nni/tree/master/examples/nas/oneshot/pfld>`__
+
+Please run the following scripts in the example directory.
+
+The Python dependencies used here are listed below:
+
+.. code-block:: bash
+
+   numpy==1.18.5
+   opencv-python==4.5.1.48
+   torch==1.6.0
+   torchvision==0.7.0
+   onnx==1.8.1
+   onnx-simplifier==0.3.5
+   onnxruntime==1.7.0
+
+To run the tutorial, follow the steps below:
+
+1. **Data Preparation**: First, download the `106points dataset <https://drive.google.com/file/d/1I7QdnLxAlyG2Tq3L66QYzGhiBEoVfzKo/view?usp=sharing>`__ to the path ``./data/106points``. The dataset includes the train set and the test set:
+
+   .. code-block:: bash
+
+      ./data/106points/train_data/imgs
+      ./data/106points/train_data/list.txt
+      ./data/106points/test_data/imgs
+      ./data/106points/test_data/list.txt
+
+2. **Search**: Based on the architecture of the simplified PFLD, first configure the multi-stage search space and the search hyper-parameters to construct the supernet. For example,
+
+   .. code-block:: python
+
+      from lib.builder import search_space
+      from lib.ops import PRIMITIVES
+      from lib.supernet import PFLDInference, AuxiliaryNet
+      from nni.algorithms.nas.pytorch.fbnet import LookUpTable, NASConfig
+
+      # configuration of hyper-parameters
+      # search_space defines the multi-stage search space
+      nas_config = NASConfig(
+         model_dir="./ckpt_save",
+         nas_lr=0.01,
+         mode="mul",
+         alpha=0.25,
+         beta=0.6,
+         search_space=search_space,
+      )
+      # lookup table to manage the information
+      lookup_table = LookUpTable(config=nas_config, primitives=PRIMITIVES)
+      # created supernet
+      pfld_backbone = PFLDInference(lookup_table)
+
+   After creating the supernet with the specified search space and hyper-parameters, run the command below to start searching and training the supernet:
+
+   .. code-block:: bash
+
+      python train.py --dev_id "0,1" --snapshot "./ckpt_save" --data_root "./data/106points"
+
+   The validation accuracy is shown during training, and the model with the best accuracy is saved as ``./ckpt_save/supernet/checkpoint_best.pth``.
+
+3. **Finetune**: After pre-training the supernet, run the command below to sample the subnet and fine-tune it:
+
+   .. code-block:: bash
+
+      python retrain.py --dev_id "0,1" --snapshot "./ckpt_save" --data_root "./data/106points" \
+                        --supernet "./ckpt_save/supernet/checkpoint_best.pth"
+
+   The validation accuracy is shown during training, and the model with the best accuracy is saved as ``./ckpt_save/subnet/checkpoint_best.pth``.
+
+4. **Export**: After fine-tuning the subnet, run the command below to export the ONNX model:
+
+   .. code-block:: bash
+
+      python export.py --supernet "./ckpt_save/supernet/checkpoint_best.pth" \
+                       --resume "./ckpt_save/subnet/checkpoint_best.pth"
+
+   The ONNX model is saved as ``./output/subnet.onnx``, which can be further converted for a mobile inference engine by using `MNN <https://github.com/alibaba/MNN>`__.
+   The checkpoints of the pre-trained supernet and subnet are provided below:
+
+   * `Supernet <https://drive.google.com/file/d/1TCuWKq8u4_BQ84BWbHSCZ45N3JGB9kFJ/view?usp=sharing>`__
+   * `Subnet <https://drive.google.com/file/d/160rkuwB7y7qlBZNM3W_T53cb6MQIYHIE/view?usp=sharing>`__
+   * `ONNX model <https://drive.google.com/file/d/1s-v-aOiMv0cqBspPVF3vSGujTbn_T_Uo/view?usp=sharing>`__
+
+.. _spos-strategy:
+
+SPOS
+----
+
+`Single Path One-Shot Neural Architecture Search with Uniform Sampling <https://arxiv.org/abs/1904.00420>`__ proposes a one-shot NAS method that addresses the difficulties in training one-shot NAS models by constructing a simplified supernet trained with a uniform path sampling method, so that all underlying architectures (and their weights) get trained fully and equally. An evolutionary algorithm is then applied to efficiently search for the best-performing architectures without any fine-tuning.
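
Uniform path sampling can be sketched as follows. This is an illustration rather than the NNI trainer: the block names in ``CANDIDATES``, the four-layer ``SEARCH_SPACE``, and the helper names are assumptions. At each training step exactly one candidate per layer is sampled with equal probability, and only that path would be updated.

```python
import random

# Hypothetical search space: every layer offers the same four candidate blocks.
CANDIDATES = ["shuffle_3x3", "shuffle_5x5", "shuffle_7x7", "xception_3x3"]
SEARCH_SPACE = {f"layer{i}": list(CANDIDATES) for i in range(4)}


def sample_single_path(search_space, rng):
    """Pick exactly one candidate per layer, each with equal probability."""
    return {layer: rng.choice(choices) for layer, choices in search_space.items()}


def simulate_training(search_space, num_steps, rng):
    """Count how often each candidate is selected over many training steps."""
    counts = {layer: {c: 0 for c in choices} for layer, choices in search_space.items()}
    for _ in range(num_steps):
        path = sample_single_path(search_space, rng)
        # A real trainer would run one forward/backward pass on this path only.
        for layer, choice in path.items():
            counts[layer][choice] += 1
    return counts


rng = random.Random(0)
path = sample_single_path(SEARCH_SPACE, rng)
counts = simulate_training(SEARCH_SPACE, num_steps=1000, rng=rng)
```

Over many steps the counts for each candidate in a layer come out roughly equal, which is what lets all candidate weights be trained to a similar degree.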
+
+Implementation on NNI is based on the `official repo <https://github.com/megvii-model/SinglePathOneShot>`__. We implement a trainer that trains the supernet and an evolution tuner that leverages the NNI framework to speed up the evolutionary search phase.
+
+..  autoclass:: nni.retiarii.oneshot.pytorch.SinglePathTrainer
+
+Examples
+^^^^^^^^
+
+Here is a use case that uses the search space from the paper. However, we apply a latency limit instead of a FLOPs limit during the architecture search phase.
+
+:githublink:`Example code <examples/nas/oneshot/spos>`
+
+**Requirements:** Prepare ImageNet in the standard format (follow the script `here <https://gist.github.com/BIGBALLON/8a71d225eff18d88e469e6ea9b39cef4>`__). Linking it to ``data/imagenet`` will be more convenient. Download the checkpoint file from `here <https://1drv.ms/u/s!Am_mmG2-KsrnajesvSdfsq_cN48?e=aHVppN>`__ (maintained by `Megvii <https://github.com/megvii-model>`__) if you don't want to retrain the supernet. Put ``checkpoint-150000.pth.tar`` under ``data`` directory. After preparation, it's expected to have the following code structure:
+
+.. code-block:: bash
+
+   spos
+   ├── architecture_final.json
+   ├── blocks.py
+   ├── data
+   │   ├── imagenet
+   │   │   ├── train
+   │   │   └── val
+   │   └── checkpoint-150000.pth.tar
+   ├── network.py
+   ├── readme.md
+   ├── supernet.py
+   ├── evaluation.py
+   ├── search.py
+   └── utils.py
+
+Then follow these three steps:
+
+1. **Train Supernet**:
+
+   .. code-block:: bash
+
+      python supernet.py
+
+   This will export the checkpoint to the ``checkpoints`` directory for the next step.
+
+   .. note:: The data loading used in the official repo is `slightly different from usual <https://github.com/megvii-model/SinglePathOneShot/issues/5>`__, as they use BGR tensors and intentionally keep the values between 0 and 255 to align with their own DL framework. The option ``--spos-preprocessing`` simulates the original behavior and enables you to use the pretrained checkpoints.
+
+2. **Evolution Search**: Single Path One-Shot leverages an evolution algorithm to search for the best architecture. In the paper, the search module, which is responsible for testing the sampled architecture, recalculates all the batch norm statistics on a subset of training images and evaluates the architecture on the full validation set.
+   In this example, the search inherits the ``state_dict`` of the supernet from ``./data/checkpoint-150000.pth.tar`` and searches for the best architecture with the regularized evolution strategy. Search in the supernet with the following command:
+
+   .. code-block:: bash
+
+      python search.py
+
+   NNI supports a latency filter that removes models failing the latency requirement during the search phase. Latency is predicted by Microsoft nn-Meter (https://github.com/microsoft/nn-Meter). To apply the latency filter, run ``search.py`` with the additional argument ``--latency-filter``. Here is an example:
+
+   .. code-block:: bash
+
+      python search.py --latency-filter cortexA76cpu_tflite21
+
+   Note that the latency filter is only supported for the base execution engine.
+
+   The final architecture exported from every epoch of evolution can be found in ``trials`` under the working directory of your tuner, which, by default, is ``$HOME/nni-experiments/your_experiment_id/trials``.
+
+3. **Train for Evaluation**:
+
+   .. code-block:: bash
+
+      python evaluation.py
+
+   By default, it will use ``architecture_final.json``. This architecture is provided by the official repo (converted into NNI format). You can use any architecture (e.g., the architecture found in step 2) with ``--fixed-arc`` option.
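
The evolution search with a latency filter from step 2 can be sketched as below. This is a toy version under stated assumptions and not the code in ``search.py``: ``toy_accuracy`` and ``toy_latency`` are stand-ins for evaluating an architecture with supernet weights and for a latency predictor such as nn-Meter, and the sizes and limits are arbitrary.

```python
import collections
import random

CHOICES = 4   # candidate blocks per layer (assumed)
LAYERS = 6    # number of searchable layers (assumed)


def toy_accuracy(arch):
    """Stand-in for evaluating a sampled architecture with supernet weights."""
    return sum(arch) / (len(arch) * (CHOICES - 1))


def toy_latency(arch):
    """Stand-in for a latency predictor: a larger block index costs more."""
    return 1.0 + 0.5 * sum(arch)


def mutate(arch, rng):
    """Re-sample the block choice of one randomly selected layer."""
    child = list(arch)
    child[rng.randrange(len(child))] = rng.randrange(CHOICES)
    return tuple(child)


def regularized_evolution(rng, cycles=200, population_size=20,
                          sample_size=5, latency_limit=7.0):
    population = collections.deque()
    history = []
    # Seed the population with random architectures that pass the latency filter.
    while len(population) < population_size:
        arch = tuple(rng.randrange(CHOICES) for _ in range(LAYERS))
        if toy_latency(arch) <= latency_limit:
            population.append(arch)
            history.append(arch)
    for _ in range(cycles):
        # Tournament selection: the best of a random sample becomes the parent.
        parent = max(rng.sample(list(population), sample_size), key=toy_accuracy)
        child = mutate(parent, rng)
        if toy_latency(child) > latency_limit:
            continue  # latency filter: drop candidates that are too slow
        population.append(child)
        population.popleft()  # age out the oldest member, not the worst one
        history.append(child)
    return max(history, key=toy_accuracy)


best = regularized_evolution(random.Random(0))
```

Removing the oldest member rather than the worst one is what makes the evolution "regularized". The filter is applied before evaluation, so slow candidates never consume evaluation time.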
+
+Known Limitations
+^^^^^^^^^^^^^^^^^
+
+* Block search only. Channel search is not supported yet.
+
+Current Reproduction Results
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Reproduction is still ongoing. Due to the gap between the official release and the original paper, we compare our current results with both the official repo (our run) and the paper.
+
+* The evolution phase is almost aligned with the official repo. Our evolution algorithm shows a converging trend and reaches ~65% accuracy at the end of the search. Nevertheless, this result is not on par with the paper. For details, please refer to `this issue <https://github.com/megvii-model/SinglePathOneShot/issues/6>`__.
+* The retrain phase is not aligned. Our retraining code, which uses the architecture released by the authors, reaches 72.14% accuracy, still short of the 73.61% from the official release and the 74.3% reported in the original paper.
+
+.. _proxylessnas-strategy:
+
+ProxylessNAS
+------------
+
+The paper `ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware <https://arxiv.org/abs/1812.00332>`__ removes the proxy and directly learns architectures for large-scale target tasks and target hardware platforms. It addresses the high memory consumption of differentiable NAS and reduces the computational cost to the same level as regular training while still allowing a large candidate set. Please refer to the paper for details.
+
+..  autoclass:: nni.retiarii.oneshot.pytorch.ProxylessTrainer
+
+To use the ProxylessNAS training/searching approach, users need to specify the search space in their model using the :doc:`NNI NAS interface </nas/construct_space>`, e.g., ``LayerChoice``, ``InputChoice``. After defining and instantiating the model, the remaining work can be left to ``ProxylessTrainer`` by instantiating the trainer and passing the model to it.
+
+.. code-block:: python
+
+   trainer = ProxylessTrainer(model,
+                              loss=LabelSmoothingLoss(),
+                              dataset=None,
+                              optimizer=optimizer,
+                              metrics=lambda output, target: accuracy(output, target, topk=(1, 5,)),
+                              num_epochs=120,
+                              log_frequency=10,
+                              grad_reg_loss_type=args.grad_reg_loss_type, 
+                              grad_reg_loss_params=grad_reg_loss_params, 
+                              applied_hardware=args.applied_hardware, dummy_input=(1, 3, 224, 224),
+                              ref_latency=args.reference_latency)
+   trainer.train()
+   trainer.export(args.arch_path)
+
+The complete example code can be found :githublink:`here <examples/nas/oneshot/proxylessnas>`.
+
+Implementation
+^^^^^^^^^^^^^^
+
+The implementation on NNI is based on the `official implementation <https://github.com/mit-han-lab/ProxylessNAS>`__. The official implementation supports two training approaches: gradient descent and RL-based. Our current implementation on NNI supports the gradient descent training approach. Complete support of ProxylessNAS is ongoing.
+
+The official implementation supports different target hardware, including 'mobile', 'cpu', 'gpu8', and 'flops'. In the NNI repo, hardware latency prediction is supported by `Microsoft nn-Meter <https://github.com/microsoft/nn-Meter>`__. nn-Meter is an accurate inference latency predictor for DNN models on diverse edge devices. nn-Meter currently supports four hardware platforms: ``cortexA76cpu_tflite21``, ``adreno640gpu_tflite21``, ``adreno630gpu_tflite21``, and ``myriadvpu_openvino2019r2``. Users can find more information about nn-Meter on its website. More hardware will be supported in the future. More details about applying ``nn-Meter`` can be found :doc:`here </nas/hardware_aware_nas>`.
+
+The implementation details are described below. Like other one-shot NAS algorithms on NNI, ProxylessNAS is composed of two parts: *search space* and *training approach*. To flexibly define your own search space and use the built-in ProxylessNAS training approach, please refer to the :githublink:`example code <examples/nas/oneshot/proxylessnas>`.
+
+.. image:: ../../img/proxylessnas.png
+   :width: 450
+   :align: center
+
+The ProxylessNAS training approach is composed of ProxylessLayerChoice and ProxylessTrainer. ProxylessLayerChoice instantiates a MixedOp for each mutable (i.e., LayerChoice) and manages the architecture weights in MixedOp. **For DataParallel**, architecture weights should be included in the user model. Specifically, in the ProxylessNAS implementation, we add MixedOp to the corresponding mutable (i.e., LayerChoice) as a member variable. The ProxylessLayerChoice class also exposes two member functions, ``resample`` and ``finalize_grad``, for the trainer to control the training of architecture weights.
+
+Reproduction Results
+^^^^^^^^^^^^^^^^^^^^
+
+To reproduce the result, we first ran the search. We found that, although it runs for many epochs, the chosen architecture converges within the first several epochs. This is probably caused by the hyper-parameters or the implementation; we are working on it.
+
+Customization
+-------------
+
+..  autoclass:: nni.retiarii.oneshot.BaseOneShotTrainer
+    :members:
+
+..  autofunction:: nni.retiarii.oneshot.pytorch.utils.replace_layer_choice
+
+..  autofunction:: nni.retiarii.oneshot.pytorch.utils.replace_input_choice
--- a/docs/source/examples.rst
+++ b/docs/source/examples.rst
+Examples
+========
+
+More examples can be found in our :githublink:`GitHub repository <examples>`.
+
+.. cardlinkitem::
+   :header: HPO Quickstart with PyTorch
+   :description: Use HPO to tune a PyTorch FashionMNIST model
+   :link: tutorials/hpo_quickstart_pytorch/main
+   :image: ../img/thumbnails/hpo-pytorch.svg
+   :background: purple
+   :tags: HPO
+
+.. cardlinkitem::
+   :header: HPO Quickstart with TensorFlow
+   :description: Use HPO to tune a TensorFlow MNIST model
+   :link: tutorials/hpo_quickstart_tensorflow/main
+   :image: ../img/thumbnails/hpo-tensorflow.svg
+   :background: purple
+   :tags: HPO
+
+.. cardlinkitem::
+   :header: HPO using command line tool
+   :description: Run HPO experiment with nnictl
+   :link: tutorials/hpo_nnictl/nnictl
+   :image: ../img/thumbnails/hpo-pytorch.svg
+   :background: purple
+   :tags: HPO
+
+.. cardlinkitem::
+   :header: Hello, NAS!
+   :description: Beginners' NAS tutorial on how to search for neural architectures for MNIST dataset.
+   :link: tutorials/hello_nas
+   :image: ../img/thumbnails/nas-tutorial.svg
+   :background: cyan
+   :tags: NAS
+
+.. cardlinkitem::
+   :header: Use NAS Benchmarks as Datasets
+   :description: Query data from popular NAS benchmarks from our preprocessed benchmark database.
+   :link: tutorials/nasbench_as_dataset
+   :image: ../img/thumbnails/nas-benchmark.svg
+   :background: cyan
+   :tags: NAS
+
+.. cardlinkitem::
+   :header: Get Started with Model Pruning on MNIST
+   :description: Familiarize yourself with pruning to compress your model 
+   :link: tutorials/pruning_quick_start_mnist
+   :image: ../img/thumbnails/pruning-tutorial.svg
+   :background: blue
+   :tags: Compression
+
+.. cardlinkitem::
+   :header: Get Started with Model Quantization on MNIST
+   :description: Familiarize yourself with quantization to compress your model
+   :link: tutorials/quantization_quick_start_mnist
+   :image: ../img/thumbnails/quantization-tutorial.svg
+   :background: indigo
+   :tags: Compression
+
+.. cardlinkitem::
+   :header: Speedup Model with Mask
+   :description: Make your model really smaller and faster with speedup after it is pruned by a pruner
+   :link: tutorials/pruning_speedup
+   :image: ../img/thumbnails/pruning-speed-up.svg
+   :background: blue
+   :tags: Compression
+
+.. cardlinkitem::
+   :header: Speedup Model with Calibration Config
+   :description: Make your model really smaller and faster with speedup after it is quantized by a quantizer
+   :link: tutorials/quantization_speedup
+   :image: ../img/thumbnails/quantization-speed-up.svg
+   :background: indigo
+   :tags: Compression
+
+.. cardlinkitem::
+   :header: Pruning Bert on Task MNLI
+   :description: An end-to-end example of how to use NNI to prune a transformer and show the real speedup numbers
+   :link: tutorials/pruning_bert_glue
+   :image: ../img/thumbnails/pruning-tutorial.svg
+   :background: indigo
+   :tags: Compression
--- a/docs/source/experiment/experiment_management.rst
+++ b/docs/source/experiment/experiment_management.rst
+Experiment Management
+=====================
+
+An experiment can be created with the command line tool ``nnictl`` or the Python APIs. NNI provides both the command line tool ``nnictl`` and a web portal to manage experiments, including creating, stopping, resuming, deleting, ranking, and comparing them.
+
+Management with ``nnictl``
+--------------------------
+
+The experiment management capability of ``nnictl`` is almost equivalent to :doc:`web_portal/web_portal`. Users can refer to :doc:`../reference/nnictl` for detailed usage. ``nnictl`` is highly recommended when visualization is not well supported in your environment (e.g., no web browser is available).
+
+Management with web portal
+--------------------------
+
+Experiment management on the web portal gives a quick overview of all the experiments on the user's machine. Users can easily switch to an experiment from this page. Refer to the :ref:`exp-manage-webportal` page for details. Experiment management on the web portal is still under intensive development to bring more user-friendly features.
\ No newline at end of file
--- a/docs/source/experiment/overview.rst
+++ b/docs/source/experiment/overview.rst
+Overview of NNI Experiment
+==========================
+
+An NNI experiment is a unit of one tuning process. For example, it can be one run of hyper-parameter tuning on a specific search space, one run of neural architecture search on a search space, or one run of automatic model compression toward user-specified latency and accuracy goals. The tuning process usually requires many trials to explore feasible and potentially good-performing models. An important component of an NNI experiment is therefore the **training service**, a unified interface that abstracts diverse computation resources (e.g., local machine, remote servers, AKS). Users can easily run the tuning process on their preferred computation resource and platform. In addition, an NNI experiment provides a **WebUI** that visualizes the tuning process.
+
+While developing a DNN model, users need to manage the tuning process: creating an experiment, adjusting an experiment, killing or rerunning a trial in an experiment, and dumping experiment data for customized analysis. Users may also create a new experiment for comparison, or run experiments concurrently for new model development tasks. NNI therefore provides **experiment management**. Users can use :doc:`../reference/nnictl` to interact with experiments.
+
+The relation between the components of an NNI experiment is illustrated in the following figure. Hyper-parameter optimization (HPO), neural architecture search (NAS), and model compression are three key features in NNI that help users develop and tune their models. The training service runs trials in parallel on the available computation resources. The WebUI visualizes the tuning process. *nnictl* manages the experiments.
+
+.. image:: ../../img/experiment_arch.png
+   :scale: 80 %
+   :align: center
+
+Before reading the following content, we recommend going through either :doc:`the quickstart of HPO </tutorials/hpo_quickstart_pytorch/main>` or :doc:`the quickstart of NAS </tutorials/hello_nas>` first.
+
+* :doc:`Overview of NNI training service <training_service/overview>`
+* :doc:`Introduction to Web Portal <web_portal/web_portal>`
+* :doc:`Manage Multiple Experiments <experiment_management>`
--- a/docs/source/experiment/toctree.rst
+++ b/docs/source/experiment/toctree.rst
+Experiment
+==========
+
+..  toctree::
+    :maxdepth: 2
+
+    Overview <overview>
+    Training Service <training_service/toctree>
+    Web Portal <web_portal/toctree>
+    Experiment Management <experiment_management>
--- a/docs/source/experiment/training_service/adaptdl.rst
+++ b/docs/source/experiment/training_service/adaptdl.rst
+AdaptDL Training Service
+========================
+
+NNI supports running experiments on `AdaptDL <https://github.com/petuum/adaptdl>`__, a resource-adaptive deep learning training and scheduling framework. With the AdaptDL training service, your trial program runs as an AdaptDL job in a Kubernetes cluster.
+AdaptDL aims to make distributed deep learning easy and efficient in dynamic-resource environments such as shared clusters and the cloud.
+
+.. note:: AdaptDL doesn't support :ref:`reuse mode <training-service-reuse>`.
+
+Prerequisite
+------------
+
+Before using the NNI AdaptDL training service, you need a Kubernetes cluster, either on-premises or on `Azure Kubernetes Service (AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , and an Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is set up to connect to your Kubernetes cluster.
+
+#. A **Kubernetes** cluster using Kubernetes 1.14 or later with storage. Follow this guideline to set up Kubernetes `on Azure <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , or `on-premise <https://kubernetes.io/docs/setup/>`__ with `cephfs <https://kubernetes.io/docs/concepts/storage/storage-classes/#ceph-rbd>`__\ , or `microk8s with storage add-on enabled <https://microk8s.io/docs/addons>`__.
+#. Use Helm to install the **AdaptDL Scheduler** on your Kubernetes cluster. Follow this `guideline <https://adaptdl.readthedocs.io/en/latest/installation/install-adaptdl.html>`__ to set up the AdaptDL scheduler.
+#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use ``$(HOME)/.kube/config`` as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
+#. If your NNI trial job needs GPU resource, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure **Nvidia device plugin for Kubernetes**.
+#. (Optional) Prepare an **NFS server** and export a general-purpose mount as external storage.
+#. Install **NNI**.
+
+Verify the Prerequisites
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+..  code-block:: bash
+
+    nnictl --version
+    # Expected: <version_number>
+
+..  code-block:: bash
+
+    kubectl version
+    # Expected that the kubectl client version matches the server version.
+
+..  code-block:: bash
+
+    kubectl api-versions | grep adaptdl
+    # Expected: adaptdl.petuum.com/v1
+
+Usage
+-----
+
+We have a CIFAR10 example that fully leverages the AdaptDL scheduler under :githublink:`examples/trials/cifar10_pytorch` folder. (:githublink:`main_adl.py <examples/trials/cifar10_pytorch/main_adl.py>` and :githublink:`config_adl.yaml <examples/trials/cifar10_pytorch/config_adl.yml>`)
+
+Here is a template configuration specification to use AdaptDL as a training service.
+
+..  code-block:: yaml
+
+    authorName: default
+    experimentName: minimal_adl
+
+    trainingServicePlatform: adl
+    nniManagerIp: 10.1.10.11
+    logCollection: http
+
+    tuner:
+      builtinTunerName: GridSearch
+    searchSpacePath: search_space.json
+
+    trialConcurrency: 2
+    maxTrialNum: 2
+
+    trial:
+      adaptive: false # optional.
+      image: <image_tag>
+      imagePullSecrets:  # optional
+        - name: stagingsecret
+      codeDir: .
+      command: python main.py
+      gpuNum: 1
+      cpuNum: 1  # optional
+      memorySize: 8Gi  # optional
+      nfs: # optional
+        server: 10.20.41.55
+        path: /
+        containerMountPath: /nfs
+      checkpoint: # optional
+        storageClass: dfs
+        storageSize: 1Gi
+
+..  warning::
+    This configuration is written following the specification of `legacy experiment configuration <https://nni.readthedocs.io/en/v2.6/Tutorial/ExperimentConfig.html>`__. It is still supported, and will be updated to the latest version in future release.
+
+The following explains the configuration fields of AdaptDL training service.
+
+* **trainingServicePlatform**\ : Choose ``adl`` to use the Kubernetes cluster with AdaptDL scheduler.
+* **nniManagerIp**\ : *Required* for the ``adl`` training service in order to get the correct info and metrics back from the cluster.
+  It is the IP address of the machine running the NNI manager (NNICTL) that launches the NNI experiment.
+* **logCollection**\ : *Recommended* to be set to ``http``. The trial logs on the cluster are then collected back to your machine via HTTP.
+* **tuner**\ : The Tuun tuner and all NNI built-in tuners are supported (except for the checkpoint feature of the NNI PBT tuner).
+* **trial**\ : It defines the specs of an ``adl`` trial.
+
+  * **namespace**\ : (*Optional*\ ) Kubernetes namespace in which to launch the trials. Defaults to the ``default`` namespace.
+  * **adaptive**\ : (*Optional*\ ) Boolean for the AdaptDL trainer. When ``true``\ , the job is preemptible and adaptive.
+  * **image**\ : Docker image for the trial.
+  * **imagePullSecrets**\ : (*Optional*\ ) If you are using a private registry,
+    you need to provide the secret to successfully pull the image.
+  * **codeDir**\ : the working directory of the container. ``.`` means the default working directory defined by the image.
+  * **command**\ : the bash command to start the trial.
+  * **gpuNum**\ : the number of GPUs requested for this trial. It must be a non-negative integer.
+  * **cpuNum**\ : (*Optional*\ ) the number of CPUs requested for this trial. It must be a non-negative integer.
+  * **memorySize**\ : (*Optional*\ ) the size of memory requested for this trial. It must follow the Kubernetes
+    `default format <https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory>`__.
+  * **nfs**\ : (*Optional*\ ) mounts external storage. For more information about using NFS, see the NFS Storage section below.
+  * **checkpoint**\ : (*Optional*\ ) storage settings for model checkpoints.
+
+    * **storageClass**\ : check `Kubernetes storage documentation <https://kubernetes.io/docs/concepts/storage/storage-classes/>`__ for how to use the appropriate ``storageClass``.
+    * **storageSize**\ : this value should be large enough to fit your model's checkpoints, or it could cause "disk quota exceeded" error.
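+
+To make the size format used by ``memorySize`` and ``storageSize`` concrete, the following minimal Python sketch converts such strings into bytes. The helper name ``parse_memory_size`` is hypothetical and not part of NNI, and only a common subset of the Kubernetes suffixes is handled; it is meant as an illustration for sanity-checking values before writing them into the configuration.

```python
import re

# Subset of Kubernetes quantity suffixes: binary (Ki, Mi, Gi, Ti) and decimal (k, M, G, T).
# An empty suffix means plain bytes. Fractions and exponents are intentionally not handled.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "": 1,
}


def parse_memory_size(value: str) -> int:
    """Convert a Kubernetes-style size string such as '8Gi' into a number of bytes.

    Hypothetical helper for illustration; not an NNI or Kubernetes API.
    """
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", value.strip())
    if match is None or match.group(2) not in _SUFFIXES:
        raise ValueError(f"unrecognized memory size: {value!r}")
    return int(match.group(1)) * _SUFFIXES[match.group(2)]


if __name__ == "__main__":
    # Values taken from the template configuration above.
    print(parse_memory_size("8Gi"))  # memorySize
    print(parse_memory_size("1Gi"))  # storageSize
```

+Note that ``Gi`` is a power-of-two unit while ``G`` is a power-of-ten unit, so ``8Gi`` requests slightly more memory than ``8G``.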
+
+More Features
+-------------
+
+NFS Storage
+^^^^^^^^^^^
+
+As you may have noticed in the configuration spec above,
+an *optional* section is available to configure NFS external storage. It can be omitted when no external storage is required, for example when a Docker image with the code and data inside is sufficient.
+
+Note that the ``adl`` training service does NOT mount the NFS on your local dev machine. You can mount it locally yourself to manage the filesystem, copy data or code, and so on.
+The ``adl`` training service can then mount it into Kubernetes for every trial, given the proper configuration:
+
+
+* **server**\ : NFS server address, e.g. IP address or domain
+* **path**\ : NFS server export path, i.e. the absolute path in NFS that can be mounted to trials
+* **containerMountPath**\ : the absolute path inside the container at which the NFS **path** above is mounted,
+  so that every trial has access to the NFS.
+  In the trial containers, you can access the NFS via this path.
+
+Use cases:
+
+* If your training trials depend on a large dataset, you may want to download it onto the NFS first,
+  and mount it so that it can be shared across multiple trials.
+* The storage for containers is ephemeral, and the trial containers are deleted after a trial's lifecycle is over.
+  So if you want to export your trained models,
+  you may mount the NFS to the trial to persist and export them.
+
+In short, there is no restriction on how a trial reads from or writes to the NFS storage, so you may use it flexibly as your needs require.
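+
+As an illustration of the second use case, the sketch below writes a trial's output under a per-trial directory on the mounted NFS. It assumes the ``containerMountPath`` of ``/nfs`` from the template configuration and relies on the ``NNI_TRIAL_JOB_ID`` environment variable that is available in trial containers; the function names and the ``exports`` subdirectory are hypothetical choices, not an NNI API.

```python
import os


def trial_export_dir(mount_path: str = "/nfs", subdir: str = "exports") -> str:
    """Build a per-trial directory path under the NFS mount.

    Falls back to 'unknown_trial' when NNI_TRIAL_JOB_ID is not set,
    e.g. when the script is run outside of an NNI trial.
    """
    trial_id = os.environ.get("NNI_TRIAL_JOB_ID", "unknown_trial")
    return os.path.join(mount_path, subdir, trial_id)


def export_artifact(data: bytes, filename: str, mount_path: str = "/nfs") -> str:
    """Persist bytes (e.g. a serialized model) on the NFS and return the file path."""
    target_dir = trial_export_dir(mount_path)
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, filename)
    with open(target, "wb") as f:
        f.write(data)
    return target
```

+Keeping one directory per trial avoids concurrent trials overwriting each other's files on the shared mount.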
+
+Monitor via Log Stream
+^^^^^^^^^^^^^^^^^^^^^^
+
+Follow the log streaming of a certain trial:
+
+.. code-block:: bash
+
+   nnictl log trial --trial_id=TRIAL_ID
+
+.. code-block:: bash
+
+   nnictl log trial EXPERIMENT_ID --trial_id=TRIAL_ID
+
+Note that *after* a trial has finished and its pod has been deleted,
+no logs can be retrieved via this command.
+However, you may still be able to access the past trial logs
+using the following approach.
+
+Monitor via TensorBoard
+^^^^^^^^^^^^^^^^^^^^^^^
+
+In the context of NNI, an experiment has multiple trials.
+To make it easy to compare trials within a model tuning process,
+we support TensorBoard integration. Each experiment has
+an independent TensorBoard logging directory and therefore its own dashboard.
+
+You can only use TensorBoard while the monitored experiment is running.
+In other words, monitoring stopped experiments is not supported.
+
+In the trial container you may have access to two environment variables:
+
+
+* ``ADAPTDL_TENSORBOARD_LOGDIR``\ : the TensorBoard logging directory for the current experiment,
+* ``NNI_TRIAL_JOB_ID``\ : the ``trial`` job id for the current trial.
+
+It is recommended to join them to form the logging directory for the trial,
+for example in Python:
+
+.. code-block:: python
+
+   import os
+   tensorboard_logdir = os.path.join(
+       os.getenv("ADAPTDL_TENSORBOARD_LOGDIR"),
+       os.getenv("NNI_TRIAL_JOB_ID")
+   )
+
+If an experiment is stopped, the data logged in this directory
+(defined by *the above environment variables*)
+will be lost. To persist the logged data, you can use external storage (e.g. a mounted NFS)
+to export it and view the TensorBoard locally.
+
+With the above setting, you can monitor the experiment easily
+via TensorBoard by running
+
+.. code-block:: bash
+
+   nnictl tensorboard start
+
+If multiple experiments are running at the same time, you may use
+
+.. code-block:: bash
+
+   nnictl tensorboard start EXPERIMENT_ID
+
+It will provide you with the web URL to access the TensorBoard.
+
+Note that you can also choose the local ``--port``
+for the TensorBoard.
--- a/docs/source/experiment/training_service/aml.rst
+++ b/docs/source/experiment/training_service/aml.rst
+AML Training Service
+====================
+
+To run your trials on `AzureML <https://azure.microsoft.com/en-us/services/machine-learning/>`__, you can use AML training service. AML training service can programmatically submit runs to AzureML platform and collect their metrics.
+
+Prerequisite
+------------
+
+1. Create an Azure account/subscription using this `link <https://azure.microsoft.com/en-us/free/services/machine-learning/>`__. If you already have an Azure account/subscription, skip this step.
+2. Install the Azure CLI on your machine by following the install guide `here <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__.
+3. Authenticate to your Azure subscription from the CLI. To authenticate interactively, open a command line or terminal and use the following command:
+
+   .. code-block:: bash
+
+      az login
+
+4. Log into your Azure account with a web browser and create a Machine Learning resource. You will need to choose a resource group and specify a workspace name. Then download ``config.json``, which will be used later.
+
+   .. image:: ../../../img/aml_workspace.png
+
+5. Create an AML cluster as the compute target.
+
+   .. image:: ../../../img/aml_cluster.png
+
+6. Open a command line and install the AML packages.
+
+   .. code-block:: bash
+
+      python3 -m pip install azureml
+      python3 -m pip install azureml-sdk
+
+Usage
+-----
+
+We show an example configuration here with YAML (Python configuration should be similar).
+
+.. code-block:: yaml
+
+   trialConcurrency: 1
+   maxTrialNumber: 10
+   ...
+   trainingService:
+     platform: aml
+     dockerImage: msranni/nni
+     subscriptionId: ${your subscription ID}
+     resourceGroup: ${your resource group}
+     workspaceName: ${your workspace name}
+     computeTarget: ${your compute target}
+
+Configuration References
+------------------------
+
+Compared with :doc:`local` and :doc:`remote`, AML training service supports the following additional configurations.
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Field name
+     - Description
+   * - dockerImage
+     - Required field. The docker image name used in the job. If you don't want to build your own, NNI provides a docker image `msranni/nni <https://hub.docker.com/r/msranni/nni>`__, which is up-to-date with every NNI release.
+   * - subscriptionId
+     - Required field. The subscription ID of your account, which can be found in the ``config.json`` described above.
+   * - resourceGroup
+     - Required field. The resource group of your account, which can be found in the ``config.json`` described above.
+   * - workspaceName
+     - Required field. The workspace name of your account, which can be found in the ``config.json`` described above.
+   * - computeTarget
+     - Required field. The name of the compute cluster you want to use in your AML workspace. See `reference <https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target>`__ and Step 5 above.
+   * - maxTrialNumberPerGpu
+     - Optional field. Default 1. Specifies the maximum number of concurrent trials on a GPU device.
+   * - useActiveGpu
+     - Optional field. Default false. Specifies whether to use a GPU that is already used by another process. By default, NNI will use a GPU only if there is no other active process on it. See :doc:`local` for details.
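+
+Since three of the required fields come from ``config.json``, they can be copied over programmatically. The sketch below is only an illustration: the function name is hypothetical, and it assumes that the downloaded ``config.json`` uses the keys ``subscription_id``, ``resource_group`` and ``workspace_name``; check your own file if the keys differ.

```python
import json


def aml_training_service(workspace: dict, compute_target: str,
                         docker_image: str = "msranni/nni") -> dict:
    """Map workspace settings onto the ``trainingService`` fields listed above.

    ``workspace`` is the parsed content of config.json; the snake_case key
    names used here are an assumption about that file's layout.
    """
    return {
        "platform": "aml",
        "dockerImage": docker_image,
        "subscriptionId": workspace["subscription_id"],
        "resourceGroup": workspace["resource_group"],
        "workspaceName": workspace["workspace_name"],
        "computeTarget": compute_target,
    }


def load_workspace(path: str = "config.json") -> dict:
    """Read the config.json downloaded in the prerequisite steps."""
    with open(path) as f:
        return json.load(f)
```

+The returned dictionary mirrors the ``trainingService`` section of the YAML example, so it can be dumped to YAML or compared against a hand-written configuration.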
+
+Monitor your trial on the cloud by using AML studio
+---------------------------------------------------
+
+To see your trial job's detailed status on the cloud, visit the studio for the workspace you created in the prerequisite steps above. Once the job completes, go to the **Outputs + logs** tab. There you can see a ``70_driver_log.txt`` file. This file contains the standard output from a run and can be useful when you're debugging remote runs in the cloud. Learn more about AML `here <https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-1st-experiment-hello-world>`__.
--- a/docs/source/experiment/training_service/customize.rst
+++ b/docs/source/experiment/training_service/customize.rst
+Customize a Training Service
+============================
+
+Overview
+--------
+
+TrainingService is the module responsible for platform management and job scheduling in NNI. It is designed to be easy to implement: an abstract class ``TrainingService`` serves as the parent class of all kinds of training services, and users who want a customized training service only need to inherit from this parent class and complete their own child class.
+
+System architecture
+-------------------
+
+
+.. image:: ../../../img/NNIDesign.jpg
+   :target: ../../../img/NNIDesign.jpg
+   :alt: 
+
+
+A brief view of NNI's system architecture is shown in the picture. NNIManager is the core management module of the system, in charge of calling TrainingService to manage trial jobs and of the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is the module that manages trial jobs; it communicates with the NNIManager module and has a different instance for each training platform. For the time being, NNI supports :doc:`./local`, :doc:`./remote`, :doc:`./openpai`, :doc:`./kubeflow` and :doc:`./frameworkcontroller`.
+
+This document briefly introduces the design of TrainingService. Users who want to add a new TrainingService instance only need to complete a child class that implements TrainingService; they do not need to understand the code details of NNIManager, Dispatcher, or other modules.
+
+Folder structure of code
+------------------------
+
+NNI's folder structure is shown below:
+
+.. code-block:: text
+
+   nni
+     |- deployment
+     |- docs
+     |- examples
+     |- src
+     | |- nni_manager
+     | | |- common
+     | | |- config
+     | | |- core
+     | | |- coverage
+     | | |- dist
+     | | |- rest_server
+     | | |- training_service
+     | | | |- common
+     | | | |- kubernetes
+     | | | |- local
+     | | | |- pai
+     | | | |- remote_machine
+     | | | |- test
+     | |- sdk
+     | |- webui
+     |- test
+     |- tools
+     | |-nni_annotation
+     | |-nni_cmd
+     | |-nni_gpu_tool
+     | |-nni_trial_tool
+
+The ``nni/src/`` folder stores most of NNI's source code. The code in this folder relates to NNIManager, TrainingService, SDK, WebUI and other modules. Users can find the abstract class of TrainingService in the ``nni/src/nni_manager/common/trainingService.ts`` file, and they should put their own TrainingService implementation in the ``nni/src/nni_manager/training_service`` folder. Users who implement their own TrainingService should also add unit tests for the code and place them in the ``nni/src/nni_manager/training_service/test`` folder.
+
+Function annotation of TrainingService
+--------------------------------------
+
+.. code-block:: typescript
+
+   abstract class TrainingService {
+       public abstract listTrialJobs(): Promise<TrialJobDetail[]>;
+       public abstract getTrialJob(trialJobId: string): Promise<TrialJobDetail>;
+       public abstract addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
+       public abstract removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
+       public abstract submitTrialJob(form: JobApplicationForm): Promise<TrialJobDetail>;
+       public abstract updateTrialJob(trialJobId: string, form: JobApplicationForm): Promise<TrialJobDetail>;
+       public abstract get isMultiPhaseJobSupported(): boolean;
+       public abstract cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean): Promise<void>;
+       public abstract setClusterMetadata(key: string, value: string): Promise<void>;
+       public abstract getClusterMetadata(key: string): Promise<string>;
+       public abstract cleanUp(): Promise<void>;
+       public abstract run(): Promise<void>;
+   }
+
+The parent class of TrainingService has a few abstract functions. Users need to inherit from the parent class and implement all of them.
+
+**setClusterMetadata(key: string, value: string)**
+
+ClusterMetadata is the data related to platform details. For example, the ClusterMetadata defined for the remote machine training service is:
+
+.. code-block:: typescript
+
+   export class RemoteMachineMeta {
+       public readonly ip : string;
+       public readonly port : number;
+       public readonly username : string;
+       public readonly passwd?: string;
+       public readonly sshKeyPath?: string;
+       public readonly passphrase?: string;
+       public gpuSummary : GPUSummary | undefined;
+       /* GPU Reservation info, the key is GPU index, the value is the job id which reserves this GPU*/
+       public gpuReservation : Map<number, string>;
+
+       constructor(ip : string, port : number, username : string, passwd : string,
+           sshKeyPath : string, passphrase : string) {
+           this.ip = ip;
+           this.port = port;
+           this.username = username;
+           this.passwd = passwd;
+           this.sshKeyPath = sshKeyPath;
+           this.passphrase = passphrase;
+           this.gpuReservation = new Map<number, string>();
+       }
+   }
+
+The metadata includes the host address, the username, and other configuration related to the platform. Users need to define their own metadata format and set the metadata instance in this function. This function is called before the experiment starts, for example to set the configuration of remote machines.
+
+**getClusterMetadata(key: string)**
+
+This function returns the metadata value for the given key. It can be left empty if users don't need it.
+
+**submitTrialJob(form: JobApplicationForm)**
+
+``submitTrialJob`` submits new trial jobs. Users should generate a job instance of type ``TrialJobDetail``, which is defined as follows:
+
+.. code-block:: typescript
+
+   interface TrialJobDetail {
+       readonly id: string;
+       readonly status: TrialJobStatus;
+       readonly submitTime: number;
+       readonly startTime?: number;
+       readonly endTime?: number;
+       readonly tags?: string[];
+       readonly url?: string;
+       readonly workingDirectory: string;
+       readonly form: JobApplicationForm;
+       readonly sequenceId: number;
+       isEarlyStopped?: boolean;
+   }
+
+Depending on the implementation, users can put the job detail into a job queue, then keep fetching jobs from the queue and preparing and running them. Alternatively, they can complete the preparing and running process inside this function and return the job detail once the submission work is done.
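+
+The queue-based variant described above can be sketched as follows. The sketch is written in Python purely to illustrate the control flow; a real training service is a TypeScript class implementing the abstract ``TrainingService`` shown earlier. The class names and the ``WAITING`` status are assumptions made for this illustration.

```python
import queue
import time
from dataclasses import dataclass, field


@dataclass
class TrialJob:
    """Simplified stand-in for TrialJobDetail (illustrative, not the NNI type)."""
    id: str
    form: dict
    status: str = "WAITING"  # assumed initial status before the job is started
    submit_time: float = field(default_factory=time.time)


class QueueingService:
    """Accept submissions immediately and start jobs later from a main loop."""

    def __init__(self) -> None:
        self._pending = queue.Queue()
        self._jobs = {}

    def submit_trial_job(self, job_id: str, form: dict) -> TrialJob:
        job = TrialJob(id=job_id, form=form)
        self._jobs[job_id] = job
        self._pending.put(job)
        return job  # returned right away, before the job actually runs

    def run_once(self):
        """One iteration of the main loop: start the next queued job, if any."""
        try:
            job = self._pending.get_nowait()
        except queue.Empty:
            return None
        job.status = "RUNNING"  # a real service would launch the trial here
        return job
```

+Returning the job detail before the trial starts keeps submission fast, at the cost of needing a separate loop (compare ``run()`` below) to drain the queue.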
+
+**cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean)**
+
+When this function is called, the trial started by the platform should be canceled. Different platforms have different methods for canceling a running job, so this function should be implemented according to the specific platform.
+
+**updateTrialJob(trialJobId: string, form: JobApplicationForm)**
+
+This function is called to update the trial job's status. The status should be detected in a platform-specific way and updated to ``RUNNING``\ , ``SUCCEED``\ , ``FAILED`` etc.
+
+**getTrialJob(trialJobId: string)**
+
+This function returns a trialJob detail instance according to trialJobId.
+
+**listTrialJobs()**
+
+Users should put the detail information of all trial jobs into a list and return the list.
+
+**addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)**
+
+NNI holds an EventEmitter to receive job metrics. When new job metrics are detected, the EventEmitter is triggered. Users should start the EventEmitter in this function.
+
+**removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)**
+
+Close the EventEmitter.
+
+**run()**
+
+The ``run()`` function is the main loop of TrainingService. Users can use a while loop to execute their logic and end the loop when the experiment is stopped.
+
+**cleanUp()**
+
+This function is called to clean up the environment when an experiment is stopped. Users should perform the platform-related cleanup operations in this function.
+
+TrialKeeper tool
+----------------
+
+NNI offers a TrialKeeper tool to help maintain trial jobs. Users can find the source code in ``nni/tools/nni_trial_tool``. For users who want to run trial jobs on a cloud platform, this tool is a good choice for keeping trials running on the platform.
+
+The running architecture of TrialKeeper is shown as follows:
+
+
+.. image:: ../../../img/trialkeeper.jpg
+   :target: ../../../img/trialkeeper.jpg
+   :alt: 
+
+
+When users submit a trial job to a cloud platform, they should wrap their trial command in TrialKeeper and start a TrialKeeper process on the cloud platform. Note that TrialKeeper uses a RESTful server to communicate with TrainingService, so users should start a RESTful server on the local machine to receive the metrics sent from TrialKeeper. The source code of the RESTful server can be found in ``nni/src/nni_manager/training_service/common/clusterJobRestServer.ts``.
+
+Reference
+---------
+
+For guidelines on how to contribute, please refer to :doc:`/notes/contributing`.
--- a/docs/source/experiment/training_service/frameworkcontroller.rst
+++ b/docs/source/experiment/training_service/frameworkcontroller.rst
+FrameworkController Training Service
+====================================
+
+NNI supports running experiments using `FrameworkController <https://github.com/Microsoft/frameworkcontroller>`__,
+which is called frameworkcontroller mode.
+FrameworkController is built to orchestrate all kinds of applications on Kubernetes, so
+you don't need to install Kubeflow operators for specific deep learning frameworks, such as tf-operator or pytorch-operator.
+You can use FrameworkController as the training service to run NNI experiments.
+
+Prerequisite for on-premises Kubernetes Service
+-----------------------------------------------
+
+1. A **Kubernetes** cluster using Kubernetes 1.8 or later.
+   Follow this `guideline <https://kubernetes.io/docs/setup/>`__ to set up Kubernetes.
+2. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server.
+   By default, NNI manager will use ``~/.kube/config`` as kubeconfig file's path.
+   You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable.
+   Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__
+   to learn more about kubeconfig.
+3. If your NNI trial job needs GPU resource, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__
+   to configure **Nvidia device plugin for Kubernetes**.
+4. Prepare an **NFS server** and export a general-purpose mount
+   (we recommend exporting your NFS server path with the ``root_squash`` option,
+   otherwise permission issues may arise when NNI copies files to NFS.
+   Refer to this `page <https://linux.die.net/man/5/exports>`__ to learn what the ``root_squash`` option is),
+   or **Azure File Storage**.
+5. Install an **NFS client** on the machine where you install NNI and run nnictl to create experiments.
+   Run this command to install NFSv4 client:
+
+.. code-block:: bash
+
+    apt install nfs-common
+
+6. Install **NNI**:
+
+.. code-block:: bash
+
+    python -m pip install nni
+
+Prerequisite for Azure Kubernetes Service
+-----------------------------------------
+
+1. NNI supports FrameworkController based on Azure Kubernetes Service.
+   Follow the `guideline <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__ to set up Azure Kubernetes Service.
+2. Install `Azure CLI <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__ and **kubectl**.
+   Use ``az login`` to sign in to your Azure account, and connect the kubectl client to AKS;
+   refer to this `guideline <https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster>`__.
+3. Follow the `guideline <https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal>`__
+   to create an Azure file storage account.
+   If you use Azure Kubernetes Service, NNI needs Azure Storage Service to store code files and output files.
+4. To access Azure Storage Service, NNI needs the access key of the storage account,
+   and NNI uses the `Azure Key Vault <https://azure.microsoft.com/en-us/services/key-vault/>`__ Service to protect your private key.
+   Set up Azure Key Vault Service and add a secret to Key Vault to store the access key of the Azure storage account.
+   Follow this `guideline <https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli>`__ to store the access key.
+
+Setup FrameworkController
+-------------------------
+
+Follow the `guideline <https://github.com/Microsoft/frameworkcontroller/tree/master/example/run>`__
+to set up FrameworkController in the Kubernetes cluster. NNI supports FrameworkController in the stateful set mode.
+If your cluster enforces authorization, you need to create a ServiceAccount with granted permission for FrameworkController,
+and then pass the name of the FrameworkController service account to the NNI Experiment Config.
+
+Design
+------
+
+Please refer to the design of the :doc:`Kubeflow training service <kubeflow>`;
+the FrameworkController training service pipeline is similar.
+
+Example
+-------
+
+The FrameworkController config format is:
+
+.. code-block:: python
+
+    from nni.experiment import (
+        Experiment,
+        FrameworkAttemptCompletionPolicy,
+        FrameworkControllerRoleConfig,
+        K8sNfsConfig,
+    )
+
+    experiment = Experiment('frameworkcontroller')
+    experiment.config.trial_code_directory = '.'
+    experiment.config.search_space = search_space
+    experiment.config.tuner.name = 'TPE'
+    experiment.config.tuner.class_args['optimize_mode'] = 'maximize'
+    experiment.config.max_trial_number = 10
+    experiment.config.trial_concurrency = 2
+
+    experiment.config.training_service.storage = K8sNfsConfig()
+    experiment.config.training_service.storage.server = '10.20.30.40'
+    experiment.config.training_service.storage.path = '/mnt/nfs/nni'
+    experiment.config.training_service.task_roles = [FrameworkControllerRoleConfig()]
+    experiment.config.training_service.task_roles[0].name = 'worker'
+    experiment.config.training_service.task_roles[0].task_number = 1
+    experiment.config.training_service.task_roles[0].command = 'python3 model.py'
+    experiment.config.training_service.task_roles[0].gpu_number = 1
+    experiment.config.training_service.task_roles[0].cpu_number = 1
+    experiment.config.training_service.task_roles[0].memory_size = '4g'
+    experiment.config.training_service.task_roles[0].framework_attempt_completion_policy = \
+        FrameworkAttemptCompletionPolicy(min_failed_task_count=1, min_succeed_task_count=1)
+
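+The example above assumes that a ``search_space`` variable is already defined.
+A minimal sketch of one in NNI's search space format might look like the following;
+the hyperparameter names ``lr`` and ``batch_size`` and their ranges are illustrative assumptions, not part of the original example:
+
+.. code-block:: python
+
+    # Hypothetical search space: replace the keys with the hyperparameters
+    # that your trial code (e.g. model.py) actually reads.
+    search_space = {
+        # Sample the learning rate log-uniformly between 1e-4 and 1e-1.
+        'lr': {'_type': 'loguniform', '_value': [0.0001, 0.1]},
+        # Pick the batch size from a fixed set of candidates.
+        'batch_size': {'_type': 'choice', '_value': [16, 32, 64]},
+    }
+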
+If you use Azure Kubernetes Service, you should set storage config as follows:
+
+.. code-block:: python
+
+    from nni.experiment import K8sAzureStorageConfig
+
+    experiment.config.training_service.storage = K8sAzureStorageConfig()
+    experiment.config.training_service.storage.azure_account = 'your_storage_account_name'
+    experiment.config.training_service.storage.azure_share = 'your_azure_share_name'
+    experiment.config.training_service.storage.key_vault_name = 'your_vault_name'
+    experiment.config.training_service.storage.key_vault_key = 'your_secret_name'
+
+If you have configured a `ServiceAccount <https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/>`__ in your Kubernetes cluster,
+set ``service_account_name`` in your config:
+
+.. code-block:: python
+
+    experiment.config.training_service.service_account_name = 'frameworkcontroller'
+
+The trial config format for NNI frameworkcontroller mode is a simplified version of FrameworkController's official config.
+Refer to the `TensorFlow example of FrameworkController
+<https://github.com/microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/ps/cpu/tensorflowdistributedtrainingwithcpu.yaml>`__
+for a deeper understanding.
+
+Once it's ready, run:
+
+.. code-block:: python
+
+    experiment.run(8080)
+
+Note: In frameworkcontroller mode,
+NNIManager starts a REST server that listens on a port equal to your NNI web portal's port plus 1.
+For example, if your web portal port is ``8080``, the REST server listens on ``8081``
+to receive metrics from trial jobs running in Kubernetes.
+So you should open TCP port ``8081`` in your firewall rules to allow incoming traffic.
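+
+The port rule in the note can be sketched as a small helper for working out which port to open.
+The function name ``rest_server_port`` is a hypothetical illustration and not part of the NNI API:
+
+.. code-block:: python
+
+    def rest_server_port(web_portal_port: int) -> int:
+        """Return the port that receives trial metrics in frameworkcontroller mode.
+
+        Hypothetical helper: it only encodes the "web portal port plus 1" rule.
+        """
+        return web_portal_port + 1
+
+    web_portal_port = 8080  # the port passed to experiment.run()
+    print(rest_server_port(web_portal_port))  # prints 8081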