"docs/source/en/stable_diffusion.mdx" did not exist on "12fd0736dcc51f77c52130ab10177d0c1d5a29d9"
Unverified commit aa316742 authored by SparkSnail, committed by GitHub

Merge pull request #233 from microsoft/master

merge master
parents 3fe117f0 24fa4619
......@@ -25,7 +25,7 @@ The tool manages automated machine learning (AutoML) experiments, **dispatches a
* Researchers and data scientists who want to easily **implement and experiment with new AutoML algorithms**, be it a hyperparameter tuning algorithm, a neural architecture search algorithm, or a model compression algorithm.
* ML Platform owners who want to **support AutoML in their platform**.
### **NNI v1.3 has been released! &nbsp;<a href="#nni-released-reminder"><img width="48" src="docs/img/release_icon.png"></a>**
### **NNI v1.4 has been released! &nbsp;<a href="#nni-released-reminder"><img width="48" src="docs/img/release_icon.png"></a>**
## **NNI capabilities in a glance**
NNI provides a command line tool as well as a user-friendly WebUI to manage training experiments. With the extensible API, you can customize your own AutoML algorithms and training services. To make it easy for new users, NNI also provides a set of built-in state-of-the-art AutoML algorithms and out-of-the-box support for popular training platforms.
......@@ -177,9 +177,9 @@ Within the following table, we summarized the current NNI capabilities, we are g
</td>
<td style="border-top:#FF0000 solid 0px;">
<ul>
<li><a href="docs/en_US/sdk_reference.rst">Python API</a></li>
<li><a href="https://nni.readthedocs.io/en/latest/autotune_ref.html#trial">Python API</a></li>
<li><a href="docs/en_US/Tutorial/AnnotationSpec.md">NNI Annotation</a></li>
<li><a href="docs/en_US/Tutorial/Installation.md">Supported OS</a></li>
<li><a href="https://nni.readthedocs.io/en/latest/installation.html">Supported OS</a></li>
</ul>
</td>
<td style="border-top:#FF0000 solid 0px;">
......@@ -216,9 +216,9 @@ Windows
python -m pip install --upgrade nni
```
If you want to try latest code, please [install NNI](docs/en_US/Tutorial/Installation.md) from source code.
If you want to try the latest code, please [install NNI](https://nni.readthedocs.io/en/latest/installation.html) from source code.
For detail system requirements of NNI, please refer to [here](docs/en_US/Tutorial/Installation.md#system-requirements).
For detailed system requirements of NNI, please refer to [here](https://nni.readthedocs.io/en/latest/Tutorial/InstallationLinux.html#system-requirements) for Linux & macOS, and [here](https://nni.readthedocs.io/en/latest/Tutorial/InstallationWin.html#system-requirements) for Windows.
Note:
......@@ -233,7 +233,7 @@ The following example is built on TensorFlow 1.x. Make sure **TensorFlow 1.x is
* Download the examples via clone the source code.
```bash
git clone -b v1.3 https://github.com/Microsoft/nni.git
git clone -b v1.4 https://github.com/Microsoft/nni.git
```
* Run the MNIST example.
......
......@@ -172,9 +172,9 @@ NNI provides a command line tool as well as a user-friendly WebUI to manage training experiments.
</td>
<td style="border-top:#FF0000 solid 0px;">
<ul>
<li><a href="docs/zh_CN/sdk_reference.rst">Python API</a></li>
<li><a href="https://nni.readthedocs.io/zh/latest/autotune_ref.html#trial">Python API</a></li>
<li><a href="docs/zh_CN/Tutorial/AnnotationSpec.md">NNI Annotation</a></li>
<li><a href="docs/zh_CN/Tutorial/Installation.md">支持的操作系统</a></li>
<li><a href="https://nni.readthedocs.io/zh/latest/installation.html">支持的操作系统</a></li>
</ul>
</td>
<td style="border-top:#FF0000 solid 0px;">
......@@ -211,9 +211,9 @@ Windows
python -m pip install --upgrade nni
```
If you want to try the latest code, you can [install NNI](docs/zh_CN/Tutorial/Installation.md) from source code.
If you want to try the latest code, refer to [install NNI](https://nni.readthedocs.io/zh/latest/installation.html) from source code.
For detailed NNI system requirements, refer to [here](docs/zh_CN/Tutorial/Installation.md#system-requirements).
For NNI system requirements on Linux and macOS, [see here](https://nni.readthedocs.io/zh/latest/Tutorial/InstallationLinux.html#system-requirements); for Windows, [see here](https://nni.readthedocs.io/zh/latest/Tutorial/InstallationWin.html#system-requirements).
Note:
......
......@@ -26,8 +26,8 @@ jobs:
yarn eslint
displayName: 'Run eslint'
- script: |
python3 -m pip install torch==0.4.1 --user
python3 -m pip install torchvision==0.2.1 --user
python3 -m pip install torch==1.2.0 --user
python3 -m pip install torchvision==0.4.0 --user
python3 -m pip install tensorflow==1.13.1 --user
python3 -m pip install keras==2.1.6 --user
python3 -m pip install gym onnx --user
......@@ -91,8 +91,8 @@ jobs:
echo "##vso[task.setvariable variable=PATH]${HOME}/Library/Python/3.7/bin:${PATH}"
displayName: 'Install nni toolkit via source code'
- script: |
python3 -m pip install torch==0.4.1 --user
python3 -m pip install torchvision==0.2.1 --user
python3 -m pip install torch==1.2.0 --user
python3 -m pip install torchvision==0.4.0 --user
python3 -m pip install tensorflow==1.13.1 --user
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" < /dev/null 2> /dev/null
brew install swig@3
......@@ -131,7 +131,7 @@ jobs:
- script: |
python -m pip install scikit-learn==0.20.0 --user
python -m pip install keras==2.1.6 --user
python -m pip install https://download.pytorch.org/whl/cu90/torch-0.4.1-cp36-cp36m-win_amd64.whl --user
python -m pip install torch===1.2.0 torchvision===0.4.1 -f https://download.pytorch.org/whl/torch_stable.html --user
python -m pip install torchvision --user
python -m pip install tensorflow==1.13.1 --user
displayName: 'Install dependencies'
......
......@@ -52,7 +52,7 @@ RUN python3 -m pip --no-cache-dir install Keras==2.1.6
# PyTorch
#
RUN python3 -m pip --no-cache-dir install torch==1.2.0
RUN python3 -m pip install torchvision==0.4.0
RUN python3 -m pip install torchvision==0.5.0
#
# sklearn 0.20.0
......
# Speed up Masked Model
*This feature is still in Alpha version.*
## Introduction
Pruning algorithms usually use weight masks to simulate real pruning. Masks can be used
to check model performance under a specific pruning (or sparsity) level, but there is no real speedup.
Since model speedup is the ultimate goal of model pruning, we provide a tool for users
to convert a model to a smaller one based on user-provided masks (the masks come from the
pruning algorithms).
There are two types of pruning. One is fine-grained pruning, which does not change the shape of weights or input/output tensors; a sparse kernel is required to speed up a fine-grained pruned layer. The other is coarse-grained pruning (e.g., channels), where the shapes of weights and input/output tensors usually change. To speed up this kind of pruning, there is no need for a sparse kernel; the pruned layer can simply be replaced with a smaller one. Since community support for sparse kernels is limited, we only support the speedup of coarse-grained pruning for now and leave the support of fine-grained pruning to the future.
## Design and Implementation
To speed up a model, the pruned layers should be replaced, either with a smaller layer for a coarse-grained mask, or with a sparse kernel for a fine-grained mask. A coarse-grained mask usually changes the shape of weights or input/output tensors, so we should do shape inference to check whether other unpruned layers should also be replaced due to the shape change. Therefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced; second, replace the modules. The first step requires the topology (i.e., connections) of the model; we use `jit.trace` to obtain the model graph for PyTorch.
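For reference, the traced graph used for this shape inference can be obtained with a plain `jit.trace` call. This is a minimal sketch, reusing the `model` and `dummy_input` names from the usage example below; the speedup tool performs this step internally.
```python
import torch

# trace the model once with a dummy input to record its topology
traced = torch.jit.trace(model, dummy_input)
print(traced.graph)  # the node/edge view that shape inference walks over
```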
For each module, we should prepare four functions: three for shape inference and one for module replacement. The three shape inference functions are: given the weight shape, infer the input/output shape; given the input shape, infer the weight/output shape; given the output shape, infer the weight/input shape. The module replacement function returns a newly created module which is smaller.
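To illustrate the module replacement step, a replacement function for `Conv2d` might look roughly like the sketch below. The function name and signature are made up for illustration and are not the actual NNI internals; the sketch also assumes `groups == 1`.
```python
import torch.nn as nn

def replace_conv2d(conv, in_keep, out_keep):
    """Build a smaller Conv2d that keeps only the unpruned channels.

    `in_keep` / `out_keep` are lists of surviving input/output channel
    indices inferred from the masks (illustrative sketch, groups == 1).
    """
    new_conv = nn.Conv2d(len(in_keep), len(out_keep),
                         kernel_size=conv.kernel_size,
                         stride=conv.stride,
                         padding=conv.padding,
                         bias=conv.bias is not None)
    # copy the surviving weights (and bias) into the smaller layer
    new_conv.weight.data = conv.weight.data[out_keep][:, in_keep].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[out_keep].clone()
    return new_conv
```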
## Usage
```python
import time

from nni.compression.speedup.torch import ModelSpeedup

# model: the model you want to speed up
# dummy_input: dummy input of the model, given to `jit.trace`
# masks_file: the mask file created by pruning algorithms
# device: the torch device the model and input live on
m_speedup = ModelSpeedup(model, dummy_input.to(device), masks_file)
m_speedup.speedup_model()

# measure the inference latency of the sped-up model
dummy_input = dummy_input.to(device)
start = time.time()
out = model(dummy_input)
print('elapsed time: ', time.time() - start)
```
For complete examples, please refer to [the code](https://github.com/microsoft/nni/tree/master/examples/model_compress/model_speedup.py).
NOTE: the current implementation only works with torch 1.3.1 and torchvision 0.4.2.
## Limitations
Since every module requires four functions for shape inference and module replacement, this is a large amount of work; we have only implemented the ones required by the examples. If you want to speed up your own model that is not supported by the current implementation, you are welcome to contribute.
For PyTorch we can only replace modules; if functions in `forward` need to be replaced, our current implementation does not work. One workaround is to make the function a PyTorch module, as sketched below.
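For instance, a `torch.flatten` call in `forward` can be wrapped like this (a sketch; `FlattenModule` is a name made up for illustration):
```python
import torch
import torch.nn as nn

class FlattenModule(nn.Module):
    """Wrap a forward-time function in a module so the speedup tool can
    see it in the traced graph and replace it if necessary."""
    def forward(self, x):
        return torch.flatten(x, 1)

# in __init__:  self.flatten = FlattenModule()
# in forward:   x = self.flatten(x)   # instead of torch.flatten(x, 1)
```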
## Speedup Results of Examples
The code of these experiments can be found [here](https://github.com/microsoft/nni/tree/master/examples/model_compress/model_speedup.py).
### slim pruner example
on one V100 GPU,
input tensor: `torch.randn(64, 3, 32, 32)`
| Times | Mask Latency (s) | Speedup Latency (s) |
|---|---|---|
| 1 | 0.01197 | 0.005107 |
| 2 | 0.02019 | 0.008769 |
| 4 | 0.02733 | 0.014809 |
| 8 | 0.04310 | 0.027441 |
| 16 | 0.07731 | 0.05008 |
| 32 | 0.14464 | 0.10027 |
### fpgm pruner example
on CPU,
input tensor: `torch.randn(64, 1, 28, 28)`,
(the variance across runs is too large for the numbers to be reliable)
| Times | Mask Latency (s) | Speedup Latency (s) |
|---|---|---|
| 1 | 0.01383 | 0.01839 |
| 2 | 0.01167 | 0.003558 |
| 4 | 0.01636 | 0.01088 |
| 40 | 0.14412 | 0.08268 |
| 40 | 1.29385 | 0.14408 |
| 40 | 0.41035 | 0.46162 |
| 400 | 6.29020 | 5.82143 |
### l1filter pruner example
on one V100 GPU,
input tensor: `torch.randn(64, 3, 32, 32)`
| Times | Mask Latency (s) | Speedup Latency (s) |
|---|---|---|
| 1 | 0.01026 | 0.003677 |
| 2 | 0.01657 | 0.008161 |
| 4 | 0.02458 | 0.020018 |
| 8 | 0.03498 | 0.025504 |
| 16 | 0.06757 | 0.047523 |
| 32 | 0.10487 | 0.086442 |
### APoZ pruner example
on one V100 GPU,
input tensor: `torch.randn(64, 3, 32, 32)`
| Times | Mask Latency (s) | Speedup Latency (s) |
|---|---|---|
| 1 | 0.01389 | 0.004208 |
| 2 | 0.01628 | 0.008310 |
| 4 | 0.02521 | 0.014008 |
| 8 | 0.03386 | 0.023923 |
| 16 | 0.06042 | 0.046183 |
| 32 | 0.12421 | 0.087113 |
\ No newline at end of file
# Model Compression with NNI
As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for some real-time applications. Model compression can be used to address this problem.
We are glad to announce the alpha release for model compression toolkit on top of NNI, it's still in the experiment phase which might evolve based on usage feedback. We'd like to invite you to use, feedback and even contribute.
We are glad to introduce the model compression toolkit on top of NNI. It is still in the experimental phase and might evolve based on usage feedback. We'd like to invite you to use it, give feedback, and even contribute.
NNI provides an easy-to-use toolkit to help users design and use compression algorithms. It currently supports PyTorch with a unified interface. To compress their models, users only need to add several lines to their code. Several popular model compression algorithms are built into NNI. Users could further use NNI's auto tuning power to find the best compressed model, which is detailed in [Auto Model Compression](./AutoCompression.md). On the other hand, users could easily customize their new compression algorithms using NNI's interface; refer to the tutorial [here](#customize-new-compression-algorithms).
......@@ -335,9 +335,9 @@ class YourQuantizer(Quantizer):
If you do not customize `QuantGrad`, the default backward is Straight-Through Estimator.
_Coming Soon_ ...
## **Reference and Feedback**
## Reference and Feedback
* To [report a bug](https://github.com/microsoft/nni/issues/new?template=bug-report.md) for this feature in GitHub;
* To [file a feature or improvement request](https://github.com/microsoft/nni/issues/new?template=enhancement.md) for this feature in GitHub;
* To know more about [Feature Engineering with NNI](https://github.com/microsoft/nni/blob/master/docs/en_US/FeatureEngineering/Overview.md);
* To know more about [NAS with NNI](https://github.com/microsoft/nni/blob/master/docs/en_US/NAS/Overview.md);
* To know more about [Hyperparameter Tuning with NNI](https://github.com/microsoft/nni/blob/master/docs/en_US/Tuner/BuiltinTuner.md);
* To know more about [Feature Engineering with NNI](../FeatureEngineering/Overview.md);
* To know more about [NAS with NNI](../NAS/Overview.md);
* To know more about [Hyperparameter Tuning with NNI](../Tuner/BuiltinTuner.md);
# Quick Start to Compress a Model
NNI provides very simple APIs for compressing a model. The compression includes pruning algorithms and quantization algorithms. Their usage is the same; here we use the slim pruner as an example to show the usage. The complete code of this example can be found [here](https://github.com/microsoft/nni/blob/master/examples/model_compress/slim_torch_cifar10.py).
## Write configuration
Write a configuration to specify the layers that you want to prune. The following configuration means pruning all the `BatchNorm2d`s to sparsity 0.7 while keeping other layers unpruned.
```python
configure_list = [{
'sparsity': 0.7,
'op_types': ['BatchNorm2d'],
}]
```
The specification of the configuration can be found [here](Overview.md#user-configuration-for-a-compression-algorithm). Note that different pruners may define their own fields in the configuration, for example `start_epoch` in the AGP pruner. Please refer to each pruner's [usage](Overview.md#supported-algorithms) for details, and adjust the configuration accordingly.
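For instance, an AGP pruner configuration might look like the sketch below. The field names follow the AGP pruner documentation of this release; double-check them against your NNI version.
```python
configure_list = [{
    'initial_sparsity': 0.0,   # sparsity at start_epoch
    'final_sparsity': 0.8,     # sparsity reached by end_epoch
    'start_epoch': 0,
    'end_epoch': 10,
    'frequency': 1,            # update the target sparsity every epoch
    'op_types': ['default'],
}]
```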
## Choose a compression algorithm
Choose a pruner to prune your model. First instantiate the chosen pruner with your model and configuration as arguments, then invoke `compress()` to compress your model.
```python
pruner = SlimPruner(model, configure_list)
model = pruner.compress()
```
Then, you can train your model using a traditional training approach (e.g., SGD); pruning is applied transparently during training. Some pruners prune once at the beginning, and the following training can be seen as fine-tuning. Other pruners prune your model iteratively, adjusting the masks epoch by epoch during training, as in the sketch below.
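A minimal training loop could look like the following sketch. `train_one_epoch` and `evaluate` stand in for your own training and evaluation functions, and `update_epoch` only matters for iterative pruners such as AGP.
```python
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)  # your usual training code
    evaluate(model, test_loader)                     # masks stay applied here as well
    pruner.update_epoch(epoch)                       # let iterative pruners adjust their masks
```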
## Export compression result
After training, you get the accuracy of the pruned model. You can export the model weights to a file, and the generated masks to a file as well. Exporting an ONNX model is also supported.
```python
pruner.export_model(model_path='pruned_vgg19_cifar10.pth', mask_path='mask_vgg19_cifar10.pth')
```
## Speed up the model
Masks do not provide a real speedup of your model. The model should be sped up based on the exported masks; thus, we provide an API to speed up your model as shown below. After invoking `apply_compression_results` on your model, your model becomes smaller with shorter inference latency.
```python
from nni.compression.torch import apply_compression_results
apply_compression_results(model, 'mask_vgg19_cifar10.pth')
```
Please refer to [here](ModelSpeedup.md) for detailed description.
\ No newline at end of file
......@@ -6,11 +6,17 @@ For now, we support the following feature selector:
- [GradientFeatureSelector](./GradientFeatureSelector.md)
- [GBDTSelector](./GBDTSelector.md)
These selectors are suitable for tabular data (i.e., they do not support image, speech, or text data).
# How to use?
In addition, these selectors are only for feature selection. If you want to:
1) generate high-order combined features with NNI while doing feature selection;
2) leverage your distributed resources;
you could try this [example](https://github.com/microsoft/nni/tree/master/examples/feature_engineering/auto-feature-engineering).
## How to use?
```python
from nni.feature_engineering.gradient_selector import GradientFeatureSelector
from nni.feature_engineering.gradient_selector import FeatureGradientSelector
# from nni.feature_engineering.gbdt_selector import GBDTSelector
# load data
......@@ -18,7 +24,7 @@ from nni.feature_engineering.gradient_selector import GradientFeatureSelector
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# initialize a selector
fgs = GradientFeatureSelector(...)
fgs = FeatureGradientSelector(...)
# fit data
fgs.fit(X_train, y_train)
# get important features
......@@ -30,7 +36,7 @@ print(fgs.get_selected_features(...))
When using a built-in selector, you first need to `import` the feature selector and `initialize` it. You can call the function `fit` on the selector to pass the data to it. After that, you can use `get_selected_features` to get the important features. The function parameters may differ across selectors, so you need to check the docs before using them.
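Putting the pieces together, an end-to-end run might look like the sketch below. The scikit-learn dataset and the `n_features` argument are illustrative assumptions; check the selector's documentation for its exact parameters.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

from nni.feature_engineering.gradient_selector import FeatureGradientSelector

# load a small tabular dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# keep the 10 most important features (assumed parameter name)
fgs = FeatureGradientSelector(n_features=10)
fgs.fit(X_train, y_train)

print(fgs.get_selected_features())
```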
# How to customize?
## How to customize?
NNI provides _state-of-the-art_ feature selector algorithms as built-in selectors. NNI also supports building a feature selector by yourself.
......@@ -239,7 +245,7 @@ print("Pipeline Score: ", pipeline.score(X_train, y_train))
```
# Benchmark
## Benchmark
`Baseline` means no feature selection: we directly pass the data to LogisticRegression. For this benchmark, we only use 10% of the training data as test data. For the GradientFeatureSelector, we only take the top 20 features. The metric is the mean accuracy on the given test data and labels.
......@@ -257,7 +263,7 @@ The dataset of benchmark could be download in [here](https://www.csie.ntu.edu.tw
The code can be found in `/examples/feature_engineering/gradient_feature_selector/benchmark_test.py`.
## **Reference and Feedback**
## Reference and Feedback
* To [report a bug](https://github.com/microsoft/nni/issues/new?template=bug-report.md) for this feature in GitHub;
* To [file a feature or improvement request](https://github.com/microsoft/nni/issues/new?template=enhancement.md) for this feature in GitHub;
* To know more about [Neural Architecture Search with NNI](https://github.com/microsoft/nni/blob/master/docs/en_US/NAS/Overview.md);
......
# Customize a NAS Algorithm
## Extend the Ability of One-Shot Trainers
Users might want to do multiple things when using the trainers on real tasks, for example, distributed training, half-precision training, periodic logging, writing TensorBoard summaries, dumping checkpoints, and so on. As mentioned previously, some trainers do have support for some of the items listed above; others might not. Generally, there are two recommended ways to add anything you want to an existing trainer: inherit an existing trainer and override, or copy an existing trainer and modify it.
Either way, you are walking into the scope of implementing a new trainer. Basically, implementing a one-shot trainer is no different from implementing any traditional deep learning trainer, except that a new concept called a mutator will reveal itself. The implementation will therefore differ in at least two places:
* Initialization
```python
model = Model()
mutator = MyMutator(model)
```
* Training
```python
for _ in range(epochs):
for x, y in data_loader:
mutator.reset() # reset all the choices in model
out = model(x) # like traditional model
loss = criterion(out, y)
loss.backward()
# no difference below
```
To demonstrate what mutators are for, we need to know how one-shot NAS normally works. Usually, one-shot NAS "co-optimizes model weights and architecture weights". It repeatedly samples an architecture or a combination of several architectures from the supernet, trains the chosen architectures like a traditional deep learning model, updates the trained parameters back to the supernet, and uses the metrics or loss as a signal to guide the architecture sampler. The mutator is the architecture sampler here, often defined as another deep learning model. Therefore, you can treat it as any model, by defining parameters in it and optimizing it with optimizers. One mutator is initialized with exactly one model. Once a mutator is bound to a model, it cannot be rebound to another model.
`mutator.reset()` is the core step. That's where all the choices in the model are finalized. The reset result remains effective until the next reset flushes it. After the reset, the model can be treated as a traditional model for the forward pass and backward pass.
Finally, mutators provide a method called `mutator.export()` that exports a dict describing the chosen architecture of the model. Note that currently this dict is a mapping from mutable keys to selection tensors. So in order to dump it to JSON, users need to convert the tensors explicitly into Python lists, as shown below.
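For example, dumping the exported architecture could be as simple as the following sketch (assuming `mutator` is the trained mutator):
```python
import json

# convert the selection tensors to plain Python lists before serializing
arch = {key: tensor.tolist() for key, tensor in mutator.export().items()}
with open("final_architecture.json", "w") as f:
    json.dump(arch, f)
```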
Meanwhile, NNI provides some useful tools so that users can implement trainers more easily. See [Trainers](./NasReference.md#trainers) for details.
## Implement New Mutators
To start with, here is the pseudo-code that demonstrates what happens on `mutator.reset()` and `mutator.export()`.
```python
def reset(self):
self.apply_on_model(self.sample_search())
```
```python
def export(self):
return self.sample_final()
```
On reset, a new architecture is sampled with `sample_search()` and applied to the model. Then the model is trained for one or more steps in the search phase. On export, a new architecture is sampled with `sample_final()` and **nothing is applied to the model**. This is either for checkpointing or for exporting the final architecture.
The requirements on the return values of `sample_search()` and `sample_final()` are the same: a mapping from mutable keys to tensors. The tensor can be either a BoolTensor (true for selected, false for not selected), or a FloatTensor which applies a weight to each candidate. The selected branches are then computed (in `LayerChoice`, the modules are called; in `InputChoice`, it's just the tensors themselves) and reduced with the reduction operation specified in the choices. Most algorithms only need to worry about the former; here is an example mutator implementation.
```python
class RandomMutator(Mutator):
    def __init__(self, model):
        super().__init__(model)  # don't forget to call super
        # do something else

    def sample_search(self):
        result = dict()
        for mutable in self.mutables:  # iterates over all the mutable modules in the user model
            # mutables sharing the same key will be de-duplicated
            if isinstance(mutable, LayerChoice):
                # decide that this mutable should choose `gen_index`
                gen_index = np.random.randint(mutable.length)
                result[mutable.key] = torch.tensor([i == gen_index for i in range(mutable.length)],
                                                   dtype=torch.bool)
            elif isinstance(mutable, InputChoice):
                if mutable.n_chosen is None:  # n_chosen is None, so choose any number
                    result[mutable.key] = torch.randint(high=2, size=(mutable.n_candidates,)).view(-1).bool()
                # else do something else
        return result

    def sample_final(self):
        return self.sample_search()  # use the same logic here; you can do something different
```
The complete example of random mutator can be found [here](https://github.com/microsoft/nni/blob/master/src/sdk/pynni/nni/nas/pytorch/random/mutator.py).
For advanced usage, e.g., if users want to manipulate the way modules in `LayerChoice` are executed, they can inherit `BaseMutator` and overwrite `on_forward_layer_choice` and `on_forward_input_choice`, which are the callback implementations of `LayerChoice` and `InputChoice` respectively. Users can still use the property `mutables` to get all `LayerChoice` and `InputChoice` instances in the model code. For details, please refer to the [reference](https://github.com/microsoft/nni/tree/master/src/sdk/pynni/nni/nas/pytorch) to learn more.
```eval_rst
.. tip::
A useful application of the random mutator is debugging. Running
.. code-block:: python
mutator = RandomMutator(model)
mutator.reset()
will immediately set one possible candidate in the search space as the active one.
```
## Implement a Distributed NAS Tuner
Before learning how to write a distributed NAS tuner, users should first learn how to write a general tuner; read [Customize Tuner](../Tuner/CustomizeTuner.md) for a tutorial.
When users call "[nnictl ss_gen](../Tutorial/Nnictl.md)" to generate a search space file, a file like the following will be generated:
```json
{
"key_name": {
"_type": "layer_choice",
"_value": ["op1_repr", "op2_repr", "op3_repr"]
},
"key_name": {
"_type": "input_choice",
"_value": {
"candidates": ["in1_key", "in2_key", "in3_key"],
"n_chosen": 1
}
}
}
```
This is the exact search space tuners will receive in `update_search_space`. It's then tuners' responsibility to interpret the search space and generate new candidates in `generate_parameters`. A valid "parameters" will be in the following format:
```json
{
"key_name": {
"_value": "op1_repr",
"_idx": 0
},
"key_name": {
"_value": ["in2_key"],
"_idex": [1]
}
}
```
Return it from `generate_parameters`, and the tuner will look just like any HPO tuner. Refer to the [SPOS](./SPOS.md) example code for an example; a minimal sketch also follows below.
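To make this concrete, a toy tuner could interpret the search space as in the sketch below. It is only an illustration: the parameter format simply mirrors the example above, and a real tuner would use the reported trial results to guide its sampling.
```python
import random
from nni.tuner import Tuner

class RandomNASTuner(Tuner):
    """Randomly samples one candidate per mutable key (illustrative sketch)."""

    def update_search_space(self, search_space):
        self.search_space = search_space

    def generate_parameters(self, parameter_id, **kwargs):
        params = {}
        for key, spec in self.search_space.items():
            if spec["_type"] == "layer_choice":
                idx = random.randrange(len(spec["_value"]))
                params[key] = {"_value": spec["_value"][idx], "_idx": idx}
            elif spec["_type"] == "input_choice":
                candidates = spec["_value"]["candidates"]
                n_chosen = spec["_value"]["n_chosen"] or 1
                idxs = sorted(random.sample(range(len(candidates)), n_chosen))
                params[key] = {"_value": [candidates[i] for i in idxs], "_idex": idxs}
        return params

    def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
        pass  # a real tuner would learn from the reported metric here
```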
\ No newline at end of file
......@@ -46,16 +46,12 @@ bash run_retrain_cifar.sh
.. autoclass:: nni.nas.pytorch.cdarts.CdartsTrainer
:members:
.. automethod:: __init__
.. autoclass:: nni.nas.pytorch.cdarts.RegularizedDartsMutator
:members:
.. autoclass:: nni.nas.pytorch.cdarts.DartsDiscreteMutator
:members:
.. automethod:: __init__
.. autoclass:: nni.nas.pytorch.cdarts.RegularizedMutatorParallel
:members:
```
......@@ -43,8 +43,10 @@ python3 retrain.py --arc-checkpoint ./checkpoints/epoch_49.json
.. autoclass:: nni.nas.pytorch.darts.DartsTrainer
:members:
.. automethod:: __init__
.. autoclass:: nni.nas.pytorch.darts.DartsMutator
:members:
```
## Limitations
* DARTS doesn't support DataParallel and needs to be customized in order to support DistributedDataParallel.
......@@ -37,10 +37,6 @@ python3 search.py -h
.. autoclass:: nni.nas.pytorch.enas.EnasTrainer
:members:
.. automethod:: __init__
.. autoclass:: nni.nas.pytorch.enas.EnasMutator
:members:
.. automethod:: __init__
```
# Guide: Using NAS on NNI
```eval_rst
.. contents::
.. Note:: The APIs are in an experimental stage. The current programming interface is subject to change.
```
![](../../img/nas_abstract_illustration.png)
Modern Neural Architecture Search (NAS) methods usually incorporate [three dimensions][1]: search space, search strategy, and performance estimation strategy. The search space contains a limited set of neural network architectures to explore, while the search strategy samples architectures from the search space, gets estimates of their performance, and evolves itself. Ideally, the search strategy should find the best architecture in the search space and report it to users. After users obtain such a "best architecture", many methods use a "retrain step", which trains the network with the same pipeline as any traditional model.
## Implement a Search Space
Assuming now we've got a baseline model, what should we do to be empowered with NAS? Take [MNIST on PyTorch](https://github.com/pytorch/examples/blob/master/mnist/main.py) as an example, the code might look like this:
```python
from nni.nas.pytorch import mutables
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.conv1 = mutables.LayerChoice([
nn.Conv2d(1, 32, 3, 1),
nn.Conv2d(1, 32, 5, 3)
]) # try 3x3 kernel and 5x5 kernel
self.conv2 = nn.Conv2d(32, 64, 3, 1)
self.dropout1 = nn.Dropout2d(0.25)
self.dropout2 = nn.Dropout2d(0.5)
self.fc1 = nn.Linear(9216, 128)
self.fc2 = nn.Linear(128, 10)
def forward(self, x):
x = self.conv1(x)
x = F.relu(x)
# ... same as original ...
return output
```
The example above adds an option of choosing conv5x5 at conv1. The modification is as simple as declaring a `LayerChoice` with the original conv3x3 and a new conv5x5 as its parameters. That's it! You don't have to modify the forward function in any way. You can treat conv1 as any other module without NAS.
So how about the possibilities of connections? This can be done with `InputChoice`. To allow a skip connection in the MNIST example, we add another layer called conv3. In the following example, a possible connection from conv2 is added to the output of conv3.
```python
from nni.nas.pytorch import mutables
class Net(nn.Module):
def __init__(self):
# ... same ...
self.conv2 = nn.Conv2d(32, 64, 3, 1)
self.conv3 = nn.Conv2d(64, 64, 1, 1)
# declaring that there is exactly one candidate to choose from
# search strategy will choose one or None
self.skipcon = mutables.InputChoice(n_candidates=1)
# ... same ...
def forward(self, x):
x = self.conv1(x)
x = F.relu(x)
x = self.conv2(x)
x0 = self.skipcon([x]) # choose one or none from [x]
x = self.conv3(x)
if x0 is not None: # skipconnection is open
x += x0
x = F.max_pool2d(x, 2)
# ... same ...
return output
```
An input choice can be thought of as a callable module that receives a list of tensors and outputs the concatenation/sum/mean of some of them (sum by default), or `None` if none is selected. Like layer choices, input choices should be **initialized in `__init__` and called in `forward`**. We will see later that this is to allow search algorithms to identify these choices and do the necessary preparation.
`LayerChoice` and `InputChoice` are both **mutables**. Mutable means "changeable". As opposed to traditional deep learning layers/modules which have fixed operation type once defined, models with mutables are essentially a series of possible models.
Users can specify a **key** for each mutable. By default, NNI will assign a globally unique one for you, but in case users want to share choices (for example, there are two `LayerChoice` instances with the same candidate operations and you want them to make the same choice, i.e., if the first one chooses the i-th op, the second one also chooses the i-th op), they can give them the same key. The key marks the identity of the choice and will be used in the dumped checkpoint. So if you want to increase the readability of your exported architecture, manually assigning keys to each mutable is a good idea, as in the sketch below. For advanced usage of mutables, see [Mutables](./NasReference.md#mutables).
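For example, two layer choices that must always pick the same kernel size can simply share a key (a minimal sketch):
```python
import torch.nn as nn
from nni.nas.pytorch import mutables

class TwinBranches(nn.Module):
    def __init__(self):
        super().__init__()
        # both choices share the same key, so they always resolve to the same candidate
        self.branch_a = mutables.LayerChoice(
            [nn.Conv2d(16, 16, 3, padding=1), nn.Conv2d(16, 16, 5, padding=2)],
            key="shared_conv")
        self.branch_b = mutables.LayerChoice(
            [nn.Conv2d(16, 16, 3, padding=1), nn.Conv2d(16, 16, 5, padding=2)],
            key="shared_conv")

    def forward(self, x):
        return self.branch_a(x) + self.branch_b(x)
```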
## Use a Search Algorithm
Depending on how the search space is explored and how trials are spawned, there are at least two ways users can perform the search. One runs NAS in a distributed manner, which can be as naive as enumerating all the architectures and training each one from scratch, or can leverage more advanced techniques, such as [SMASH][8], [ENAS][2], [DARTS][1], [FBNet][3], [ProxylessNAS][4], [SPOS][5], [Single-Path NAS][6], [Understanding One-shot][7] and [GDAS][9]. Since training many different architectures is known to be expensive, another family of methods, called one-shot NAS, builds a supernet containing every candidate in the search space as a subnetwork, and in each step a subnetwork or a combination of several subnetworks is trained.
Currently, several one-shot NAS methods are supported on NNI. For example, `DartsTrainer`, which uses SGD to train architecture weights and model weights iteratively, and `ENASTrainer`, which [uses a controller to train the model][2]. New and more efficient NAS trainers keep emerging in the research community.
### One-Shot NAS
Each one-shot NAS algorithm implements a trainer, whose detailed usage can be found in the description of each algorithm. Here is a simple example demonstrating how users can use `EnasTrainer`.
```python
# this is exactly the same as traditional model training
import torch
import torch.nn as nn
from torchvision.datasets import CIFAR10

model = Net()
# train_transform / valid_transform are your usual torchvision transforms
dataset_train = CIFAR10(root="./data", train=True, download=True, transform=train_transform)
dataset_valid = CIFAR10(root="./data", train=False, download=True, transform=valid_transform)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), 0.05, momentum=0.9, weight_decay=1.0E-4)

# use NAS here
def top1_accuracy(output, target):
    # this is the function that computes the reward, as required by the ENAS algorithm
    batch_size = target.size(0)
    _, predicted = torch.max(output.data, 1)
    return (predicted == target).sum().item() / batch_size

def metrics_fn(output, target):
    # metrics function receives output and target and computes a dict of metrics
    return {"acc1": top1_accuracy(output, target)}

from nni.nas.pytorch import enas
trainer = enas.EnasTrainer(model,
                           loss=criterion,
                           metrics=metrics_fn,
                           reward_function=top1_accuracy,
                           optimizer=optimizer,
                           batch_size=128,
                           num_epochs=10,            # 10 epochs
                           dataset_train=dataset_train,
                           dataset_valid=dataset_valid,
                           log_frequency=10)         # print log every 10 steps
trainer.train()  # training
trainer.export(file="model_dir/final_architecture.json")  # export the final architecture to file
```
Users can directly run their training file by `python3 train.py`, without `nnictl`. After training, users could export the best one of the found models through `trainer.export()`.
Normally, the trainer exposes a few arguments that you can customize, for example, the loss function, the metrics function, the optimizer, and the datasets. These should satisfy the needs of most use cases, and we do our best to make sure our built-in trainers work on as many models, tasks, and datasets as possible. But there is no guarantee. For example, some trainers assume the task is a classification task; some trainers might have a different definition of "epoch" (e.g., an ENAS epoch = some child steps + some controller steps); most trainers do not support distributed training: they won't wrap your model with `DataParallel` or `DistributedDataParallel`. So after a few tryouts, if you want to actually use the trainers on your very customized applications, you might very soon need to [customize your trainer](#extend-the-ability-of-one-shot-trainers).
### Distributed NAS
Neural architecture search was originally executed by running each child model independently as a trial job. We also support this search approach, and it naturally fits into the NNI hyper-parameter tuning framework, where the tuner generates a child model for the next trial and trials run in the training service.
To use this mode, there is no need to change the search space expressed with NNI NAS API (i.e., `LayerChoice`, `InputChoice`, `MutableScope`). After the model is initialized, apply the function `get_and_apply_next_architecture` on the model. One-shot NAS trainers are not used in this mode. Here is a simple example:
```python
import nni
from nni.nas.pytorch.classic_nas import get_and_apply_next_architecture

model = Net()
# get the chosen architecture from the tuner and apply it on the model
get_and_apply_next_architecture(model)
train(model)  # your code for training the model
acc = test(model)  # test the trained model
nni.report_final_result(acc)  # report the performance of the chosen architecture
```
The search space should be generated and sent to the tuner. Since with the NNI NAS API the search space is embedded in user code, users can use "[nnictl ss_gen](../Tutorial/Nnictl.md)" to generate the search space file. Then, put the path of the generated search space in the field `searchSpacePath` of `config.yml`. The other fields in `config.yml` can be filled in by referring to [this tutorial](../Tutorial/QuickStart.md).
You could use [NNI tuners](../Tuner/BuiltinTuner.md) to do the search. Currently, only PPO Tuner supports NAS search space.
We support standalone mode for easy debugging, where you could directly run the trial command without launching an NNI experiment. This is for checking whether your trial code can correctly run. The first candidate(s) are chosen for `LayerChoice` and `InputChoice` in this standalone mode.
A complete example can be found [here](https://github.com/microsoft/nni/tree/master/examples/nas/classic_nas/config_nas.yml).
### Retrain with Exported Architecture
After the search phase, it's time to train the architecture found. Unlike many open-source NAS algorithms, which write a whole new model specifically for retraining, we found that the search model and the retrained model are usually very similar, so you can construct your final model with the exact same model code. For example:
```python
from nni.nas.pytorch.fixed import apply_fixed_architecture

model = Net()
apply_fixed_architecture(model, "model_dir/final_architecture.json")
```
The JSON is simply a mapping from mutable keys to one-hot or multi-hot representation of choices. For example
```json
{
"LayerChoice1": [false, true, false, false],
"InputChoice2": [true, true, false]
}
```
After applying it, the model is fixed and ready for final training. The model works as a single model, although it might contain more parameters than expected. This comes with pros and cons. On the plus side, you can directly load the checkpoint dumped from the supernet during the search phase and start retraining from there. However, this is also a model with redundant parameters, which may cause problems when counting the number of parameters in the model. For deeper reasons and possible workarounds, see [Trainers](./NasReference.md#retrain).
Also refer to [DARTS](./DARTS.md) for example code of retraining.
[1]: https://arxiv.org/abs/1808.05377
[2]: https://arxiv.org/abs/1802.03268
[3]: https://arxiv.org/abs/1812.03443
[4]: https://arxiv.org/abs/1812.00332
[5]: https://arxiv.org/abs/1904.00420
[6]: https://arxiv.org/abs/1904.02877
[7]: http://proceedings.mlr.press/v80/bender18a
[8]: https://arxiv.org/abs/1708.05344
[9]: https://arxiv.org/abs/1910.04465
\ No newline at end of file
# NNI NAS Programming Interface
We are trying to support various NAS algorithms with a unified programming interface, and it is still in an experimental stage. This means the current programming interface might be updated in the future.
## Programming interface for user model
The programming interface of designing and searching a model is often demanded in two scenarios.
1. When designing a neural network, there may be multiple operation choices on a layer, sub-model, or connection, and it's undetermined which one or combination performs best. So, it needs an easy way to express the candidate layers or sub-models.
2. When applying NAS to a neural network, it needs a unified way to express the search space of architectures, so that there is no need to update the trial code for different search algorithms.
For expressing neural architecture search space in user code, we provide the following APIs (take PyTorch as example):
```python
# in PyTorch module class
def __init__(self):
...
# choose one ``op`` from ``ops``, for PyTorch this is a module.
# op_candidates: for PyTorch ``ops`` is a list of modules, for tensorflow it is a list of keras layers.
# key: the name of this ``LayerChoice`` instance
self.one_layer = nni.nas.pytorch.LayerChoice([
PoolBN('max', channels, 3, stride, 1, affine=False),
PoolBN('avg', channels, 3, stride, 1, affine=False),
FactorizedReduce(channels, channels, affine=False),
SepConv(channels, channels, 3, stride, 1, affine=False),
DilConv(channels, channels, 3, stride, 2, 2, affine=False)],
key="layer_name")
...
def forward(self, x):
...
out = self.one_layer(x)
...
```
This is for users to specify multiple candidate operations for a layer, one of which will be chosen in the end. `key` is the identifier of the layer; it can be used to share the choice among multiple `LayerChoice` instances. For example, if there are two `LayerChoice` instances with the same candidate operations and you want them to make the same choice (i.e., if the first one chooses the `i`-th op, the second one also chooses the `i`-th op), give them the same key.
```python
def __init__(self):
...
# choose ``n_selected`` from ``n_candidates`` inputs.
# n_candidates: the number of candidate inputs
# n_chosen: the number of chosen inputs
# key: the name of this ``InputChoice`` instance
self.input_switch = nni.nas.pytorch.InputChoice(
n_candidates=3,
n_chosen=1,
key="switch_name")
...
def forward(self, x):
...
out = self.input_switch([in_tensor1, in_tensor2, in_tensor3])
...
```
`InputChoice` is a PyTorch module. In `__init__`, it needs meta information, for example, how many inputs to choose from how many candidates, and the name of this `InputChoice` instance. The real candidate input tensors can only be obtained in the `forward` function, where the `InputChoice` module created in `__init__` (e.g., `self.input_switch`) is called with the real candidate input tensors.
Some [NAS trainers](#one-shot-training-mode) need to know the source layer of the input tensors; thus, we added an input argument `choose_from` to `InputChoice` to indicate the source layer of each candidate input. `choose_from` is a list of strings; each element is the `key` of a `LayerChoice` or `InputChoice`, or the name of a module (refer to [the code](https://github.com/microsoft/nni/blob/master/src/sdk/pynni/nni/nas/pytorch/mutables.py) for more details).
Besides `LayerChoice` and `InputChoice`, we also provide `MutableScope`, which allows users to label a sub-network and thus provide more semantic information (e.g., the structure of the network) to NAS trainers. Here is an example:
```python
class Cell(mutables.MutableScope):
def __init__(self, scope_name):
super().__init__(scope_name)
self.layer1 = nni.nas.pytorch.LayerChoice(...)
self.layer2 = nni.nas.pytorch.LayerChoice(...)
self.layer3 = nni.nas.pytorch.LayerChoice(...)
...
```
The three `LayerChoice` (`layer1`, `layer2`, `layer3`) are included in the `MutableScope` named `scope_name`. NAS trainer could get this hierarchical structure.
## Two training modes
After writing your model with search space embedded in the model using the above APIs, the next step is finding the best model from the search space. There are two training modes: [one-shot training mode](#one-shot-training-mode) and [classic distributed search](#classic-distributed-search).
### One-shot training mode
Similar to optimizers for deep learning models, the procedure of finding the best model in the search space can be viewed as a kind of optimization process; we call it a `NAS trainer`. There are several NAS trainers, for example, `DartsTrainer`, which uses SGD to train architecture weights and model weights iteratively, and `ENASTrainer`, which uses a controller to train the model. New and more efficient NAS trainers keep emerging in the research community.
NNI provides some popular NAS trainers. To use a NAS trainer, initialize it after the model is defined:
```python
# create a DartsTrainer
trainer = DartsTrainer(model,
loss=criterion,
metrics=lambda output, target: accuracy(output, target, topk=(1,)),
optimizer=optim,
num_epochs=args.epochs,
dataset_train=dataset_train,
dataset_valid=dataset_valid,)
# finding the best model from search space
trainer.train()
# export the best found model
trainer.export(file='./chosen_arch')
```
Different trainers may have different input arguments depending on their algorithms. Please refer to [each trainer's code](https://github.com/microsoft/nni/tree/master/src/sdk/pynni/nni/nas/pytorch) for detailed arguments. After training, users can export the best of the found models through `trainer.export()`. There is no need to start an NNI experiment through `nnictl`.
The supported trainers can be found [here](Overview.md#supported-one-shot-nas-algorithms). A very simple example using NNI NAS API can be found [here](https://github.com/microsoft/nni/tree/master/examples/nas/simple/train.py).
### Classic distributed search
Neural architecture search was originally executed by running each child model independently as a trial job. We also support this search approach, and it naturally fits into the NNI hyper-parameter tuning framework, where the tuner generates a child model for the next trial and trials run in the training service.
To use this mode, there is no need to change the search space expressed with the NNI NAS API (i.e., `LayerChoice`, `InputChoice`, `MutableScope`). After the model is initialized, apply the function `get_and_apply_next_architecture` to the model. One-shot NAS trainers are not used in this mode. Here is a simple example:
```python
import nni
from nni.nas.pytorch.classic_nas import get_and_apply_next_architecture

class Net(nn.Module):
    # model defined with LayerChoice and InputChoice
    ...

model = Net()
# get the chosen architecture from the tuner and apply it on the model
get_and_apply_next_architecture(model)
# your code for training the model
train(model)
# test the trained model
acc = test(model)
# report the performance of the chosen architecture
nni.report_final_result(acc)
```
The search space should be automatically generated and sent to the tuner. Since with the NNI NAS API the search space is embedded in user code, users can use "[nnictl ss_gen](../Tutorial/Nnictl.md)" to generate the search space file. Then, put the path of the generated search space in the field `searchSpacePath` of `config.yml`. The other fields in `config.yml` can be filled in by referring to [this tutorial](../Tutorial/QuickStart.md).
You could use [NNI tuners](../Tuner/BuiltinTuner.md) to do the search.
We support standalone mode for easy debugging, where you could directly run the trial command without launching an NNI experiment. This is for checking whether your trial code can correctly run. The first candidate(s) are chosen for `LayerChoice` and `InputChoice` in this standalone mode.
The complete example code can be found [here](https://github.com/microsoft/nni/tree/master/examples/nas/classic_nas/config_nas.yml).
## Programming interface for NAS algorithm
We also provide a simple interface for users to easily implement a new NAS trainer on NNI.
### Implement a new NAS trainer on NNI
To implement a new NAS trainer, users basically only need to implement two classes by inheriting `BaseMutator` and `BaseTrainer` respectively.
In `BaseMutator`, users need to overwrite `on_forward_layer_choice` and `on_forward_input_choice`, which are the implementations of `LayerChoice` and `InputChoice` respectively. Users can use the property `mutables` to get all `LayerChoice` and `InputChoice` instances in the model code. Then users need to implement a new trainer, which instantiates the new mutator and implements the training logic. For details, please read [the code](https://github.com/microsoft/nni/tree/master/src/sdk/pynni/nni/nas/pytorch) and the supported trainers, for example, [DartsTrainer](https://github.com/microsoft/nni/tree/master/src/sdk/pynni/nni/nas/pytorch/darts).
### Implement an NNI tuner for NAS
An NNI tuner for NAS takes the auto-generated search space. The search space format of `LayerChoice` and `InputChoice` is shown below:
```json
{
"key_name": {
"_type": "layer_choice",
"_value": ["op1_repr", "op2_repr", "op3_repr"]
},
"key_name": {
"_type": "input_choice",
"_value": {
"candidates": ["in1_key", "in2_key", "in3_key"],
"n_chosen": 1
}
}
}
```
Correspondingly, the generated architecture is in the following format:
```json
{
"key_name": {
"_value": "op1_repr",
"_idx": 0
},
"key_name": {
"_value": ["in2_key"],
"_idex": [1]
}
}
```
# NAS Reference
```eval_rst
.. contents::
```
## Mutables
```eval_rst
.. autoclass:: nni.nas.pytorch.mutables.Mutable
:members:
.. autoclass:: nni.nas.pytorch.mutables.LayerChoice
:members:
.. autoclass:: nni.nas.pytorch.mutables.InputChoice
:members:
.. autoclass:: nni.nas.pytorch.mutables.MutableScope
:members:
```
### Utilities
```eval_rst
.. autofunction:: nni.nas.pytorch.utils.global_mutable_counting
```
## Mutators
```eval_rst
.. autoclass:: nni.nas.pytorch.base_mutator.BaseMutator
:members:
.. autoclass:: nni.nas.pytorch.mutator.Mutator
:members:
```
### Random Mutator
```eval_rst
.. autoclass:: nni.nas.pytorch.random.RandomMutator
:members:
```
### Utilities
```eval_rst
.. autoclass:: nni.nas.pytorch.utils.StructuredMutableTreeNode
:members:
```
## Trainers
### Trainer
```eval_rst
.. autoclass:: nni.nas.pytorch.base_trainer.BaseTrainer
:members:
.. autoclass:: nni.nas.pytorch.trainer.Trainer
:members:
```
### Retrain
```eval_rst
.. autofunction:: nni.nas.pytorch.fixed.apply_fixed_architecture
.. autoclass:: nni.nas.pytorch.fixed.FixedArchitecture
:members:
```
### Distributed NAS
```eval_rst
.. autofunction:: nni.nas.pytorch.classic_nas.get_and_apply_next_architecture
.. autoclass:: nni.nas.pytorch.classic_nas.mutator.ClassicMutator
:members:
```
### Callbacks
```eval_rst
.. autoclass:: nni.nas.pytorch.callbacks.Callback
:members:
.. autoclass:: nni.nas.pytorch.callbacks.LRSchedulerCallback
:members:
.. autoclass:: nni.nas.pytorch.callbacks.ArchitectureCheckpoint
:members:
.. autoclass:: nni.nas.pytorch.callbacks.ModelCheckpoint
:members:
```
### Utilities
```eval_rst
.. autoclass:: nni.nas.pytorch.utils.AverageMeterGroup
:members:
.. autoclass:: nni.nas.pytorch.utils.AverageMeter
:members:
.. autofunction:: nni.nas.pytorch.utils.to_device
```
......@@ -6,11 +6,7 @@ However, it takes great efforts to implement NAS algorithms, and it is hard to r
With this motivation, our ambition is to provide a unified architecture in NNI, to accelerate innovations on NAS, and to apply state-of-the-art algorithms to real-world problems faster.
With [the unified interface](./NasInterface.md), there are two different modes for the architecture search. [One](#supported-one-shot-nas-algorithms) is the so-called one-shot NAS, where a super-net is built based on search space, and using one shot training to generate good-performing child model. [The other](./NasInterface.md#classic-distributed-search) is the traditional searching approach, where each child model in search space runs as an independent trial, the performance result is sent to tuner and the tuner generates new child model.
* [Supported One-shot NAS Algorithms](#supported-one-shot-nas-algorithms)
* [Classic Distributed NAS with NNI experiment](./NasInterface.md#classic-distributed-search)
* [NNI NAS Programming Interface](./NasInterface.md)
With the unified interface, there are two different modes for architecture search. [One](#supported-one-shot-nas-algorithms) is the so-called one-shot NAS, where a super-net is built based on the search space and one-shot training is used to generate a good-performing child model. [The other](#supported-distributed-nas-algorithms) is the traditional search approach, where each child model in the search space runs as an independent trial; the performance result is sent to the tuner and the tuner generates a new child model.
## Supported One-shot NAS Algorithms
......@@ -23,6 +19,7 @@ NNI supports below NAS algorithms now and is adding more. User can reproduce an
| [P-DARTS](PDARTS.md) | [Progressive Differentiable Architecture Search: Bridging the Depth Gap between Search and Evaluation](https://arxiv.org/abs/1904.12760) is based on DARTS. It introduces an efficient algorithm which allows the depth of searched architectures to grow gradually during the training procedure. |
| [SPOS](SPOS.md) | [Single Path One-Shot Neural Architecture Search with Uniform Sampling](https://arxiv.org/abs/1904.00420) constructs a simplified supernet trained with an uniform path sampling method, and applies an evolutionary algorithm to efficiently search for the best-performing architectures. |
| [CDARTS](CDARTS.md) | [Cyclic Differentiable Architecture Search](https://arxiv.org/abs/****) builds a cyclic feedback mechanism between the search and evaluation networks. It introduces a cyclic differentiable architecture search framework which integrates the two networks into a unified architecture.|
| [ProxylessNAS](Proxylessnas.md) | [ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware](https://arxiv.org/abs/1812.00332).|
One-shot algorithms run **standalone, without nnictl**. Only the PyTorch version has been implemented. TensorFlow 2.x will be supported in a future release.
......@@ -33,18 +30,26 @@ Here are some common dependencies to run the examples. PyTorch needs to be above
* PyTorch 1.2+
* git
## Use NNI API
## Supported Distributed NAS Algorithms
|Name|Brief Introduction of Algorithm|
|---|---|
| [SPOS](SPOS.md) | [Single Path One-Shot Neural Architecture Search with Uniform Sampling](https://arxiv.org/abs/1904.00420) constructs a simplified supernet trained with an uniform path sampling method, and applies an evolutionary algorithm to efficiently search for the best-performing architectures. |
NOTE: we are trying to support various NAS algorithms with a unified programming interface, and this is in a very experimental stage. This means the current programming interface may be updated in the future.
```eval_rst
.. Note:: SPOS is a two-stage algorithm, whose first stage is one-shot and second stage is distributed, leveraging result of first stage as a checkpoint.
```
### Programming interface
## Use NNI API
The programming interface of designing and searching a model is often demanded in two scenarios.
1. When designing a neural network, there may be multiple operation choices on a layer, sub-model, or connection, and it's undetermined which one or combination performs best. So, it needs an easy way to express the candidate layers or sub-models.
2. When applying NAS to a neural network, it needs a unified way to express the search space of architectures, so that there is no need to update the trial code for different search algorithms.
The NNI proposed API is [here](https://github.com/microsoft/nni/tree/master/src/sdk/pynni/nni/nas/pytorch). And [here](https://github.com/microsoft/nni/tree/master/examples/nas/naive) is an example NAS implementation based on the NNI proposed interface.
[Here](./NasGuide.md) is a user guide to get started with using NAS on NNI.
## Reference and Feedback
[1]: https://arxiv.org/abs/1802.03268
[2]: https://arxiv.org/abs/1707.07012
......@@ -52,9 +57,5 @@ NNI proposed API is [here](https://github.com/microsoft/nni/tree/master/src/sdk/
[4]: https://arxiv.org/abs/1806.10282
[5]: https://arxiv.org/abs/1703.01041
## **Reference and Feedback**
* To [report a bug](https://github.com/microsoft/nni/issues/new?template=bug-report.md) for this feature in GitHub;
* To [file a feature or improvement request](https://github.com/microsoft/nni/issues/new?template=enhancement.md) for this feature in GitHub;
* To know more about [Feature Engineering with NNI](https://github.com/microsoft/nni/blob/master/docs/en_US/FeatureEngineering/Overview.md);
* To know more about [Model Compression with NNI](https://github.com/microsoft/nni/blob/master/docs/en_US/Compressor/Overview.md);
* To know more about [Hyperparameter Tuning with NNI](https://github.com/microsoft/nni/blob/master/docs/en_US/Tuner/BuiltinTuner.md);
* To [file a feature or improvement request](https://github.com/microsoft/nni/issues/new?template=enhancement.md) for this feature in GitHub.
\ No newline at end of file
# ProxylessNAS on NNI
## Introduction
The paper [ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware](https://arxiv.org/pdf/1812.00332.pdf) removes the proxy: it directly learns architectures for large-scale target tasks and target hardware platforms. The authors address the high memory consumption issue of differentiable NAS and reduce the computational cost to the same level as regular training while still allowing a large candidate set. Please refer to the paper for details.
## Usage
To use the ProxylessNAS training/searching approach, users need to specify the search space in their model using the [NNI NAS interface](NasGuide.md), e.g., `LayerChoice`, `InputChoice`. After defining and instantiating the model, the remaining work can be left to `ProxylessNasTrainer` by instantiating the trainer and passing the model to it.
```python
trainer = ProxylessNasTrainer(model,
model_optim=optimizer,
train_loader=data_provider.train,
valid_loader=data_provider.valid,
device=device,
warmup=True,
ckpt_path=args.checkpoint_path,
arch_path=args.arch_path)
trainer.train()
trainer.export(args.arch_path)
```
The complete example code can be found [here](https://github.com/microsoft/nni/tree/master/examples/nas/proxylessnas).
**Input arguments of ProxylessNasTrainer**
* **model** (*PyTorch model, required*) - The model that users want to tune/search. It has mutables to specify search space.
* **model_optim** (*PyTorch optimizer, required*) - The optimizer used to train the model.
* **device** (*device, required*) - The device(s) users provide for training/search. The trainer applies data parallelism to the model for users.
* **train_loader** (*PyTorch data loader, required*) - The data loader for training set.
* **valid_loader** (*PyTorch data loader, required*) - The data loader for validation set.
* **label_smoothing** (*float, optional, default = 0.1*) - The degree of label smoothing.
* **n_epochs** (*int, optional, default = 120*) - The number of epochs to train/search.
* **init_lr** (*float, optional, default = 0.025*) - The initial learning rate for training the model.
* **binary_mode** (*'two', 'full', or 'full_v2', optional, default = 'full_v2'*) - The forward/backward mode for the binary weights in mutator. 'full' means forward all the candidate ops, 'two' means only forward two sampled ops, 'full_v2' means recomputing the inactive ops during backward.
* **arch_init_type** (*'normal' or 'uniform', optional, default = 'normal'*) - The way to init architecture parameters.
* **arch_init_ratio** (*float, optional, default = 1e-3*) - The ratio to init architecture parameters.
* **arch_optim_lr** (*float, optional, default = 1e-3*) - The learning rate of the architecture parameters optimizer.
* **arch_weight_decay** (*float, optional, default = 0*) - Weight decay of the architecture parameters optimizer.
* **grad_update_arch_param_every** (*int, optional, default = 5*) - Update architecture weights every this many minibatches.
* **grad_update_steps** (*int, optional, default = 1*) - The number of steps to train architecture weights during each update.
* **warmup** (*bool, optional, default = True*) - Whether to do warmup.
* **warmup_epochs** (*int, optional, default = 25*) - The number of warmup epochs.
* **arch_valid_frequency** (*int, optional, default = 1*) - The frequency of printing validation results.
* **load_ckpt** (*bool, optional, default = False*) - Whether to load a checkpoint.
* **ckpt_path** (*str, optional, default = None*) - Checkpoint path; if `load_ckpt` is True, `ckpt_path` cannot be None.
* **arch_path** (*str, optional, default = None*) - The path to store the chosen architecture.
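Putting several of the optional arguments above together, a run that warms up, updates architecture weights every five minibatches, and resumes from a checkpoint might be configured as in the sketch below (a sketch only; `model`, `optimizer`, `data_provider`, `device`, and the file paths are placeholders assumed to be defined as in the usage example above):
```python
trainer = ProxylessNasTrainer(model,
                              model_optim=optimizer,
                              train_loader=data_provider.train,
                              valid_loader=data_provider.valid,
                              device=device,
                              n_epochs=120,
                              binary_mode='two',               # forward only the two sampled ops
                              grad_update_arch_param_every=5,  # update architecture weights every 5 minibatches
                              warmup=True,
                              warmup_epochs=25,
                              load_ckpt=True,                  # resume from a previous checkpoint
                              ckpt_path='./checkpoints/search.pt',
                              arch_path='./checkpoints/final_arch.json')
trainer.train()
trainer.export('./checkpoints/final_arch.json')
```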
## Implementation
The implementation on NNI is based on the [official implementation](https://github.com/mit-han-lab/ProxylessNAS). The official implementation supports two training approaches, gradient descent and RL-based, and supports different target hardware, including 'mobile', 'cpu', 'gpu8', and 'flops'. Our current implementation on NNI supports the gradient descent training approach but does not yet support different hardware; complete support is ongoing.
Below we describe the implementation details. Like other one-shot NAS algorithms on NNI, ProxylessNAS is composed of two parts: *search space* and *training approach*. To let users flexibly define their own search space while using the built-in ProxylessNAS training approach, we put the search space specification in the [example code](https://github.com/microsoft/nni/tree/master/examples/nas/proxylessnas) using the [NNI NAS interface](NasGuide.md), and the training approach in the [SDK](https://github.com/microsoft/nni/tree/master/src/sdk/pynni/nni/nas/pytorch/proxylessnas).
![](../../img/proxylessnas.png)
The ProxylessNAS training approach is composed of ProxylessNasMutator and ProxylessNasTrainer. ProxylessNasMutator instantiates a MixedOp for each mutable (i.e., LayerChoice) and manages the architecture weights in the MixedOp. **For DataParallel**, the architecture weights should be included in the user model; specifically, in the ProxylessNAS implementation we add the MixedOp to the corresponding mutable (i.e., LayerChoice) as a member variable. The mutator also exposes two member functions, `arch_requires_grad` and `arch_disable_grad`, with which the trainer enables and disables training of the architecture weights.
ProxylessNasMutator also implements the forward logic of the mutables (i.e., LayerChoice).
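Users do not write this alternation themselves (ProxylessNasTrainer drives it), but conceptually the interplay between model-weight training and architecture-weight training looks like the following sketch. All names here (`mutator`, `criterion`, `model_optim`, `arch_optim`, the data iterators) are placeholders; this is not the actual NNI source code.
```python
# Conceptual sketch of alternating model-weight and architecture-weight updates.
for step, (inputs, targets) in enumerate(train_loader):
    # Phase 1: train model weights with architecture weights frozen.
    mutator.arch_disable_grad()
    model_optim.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    model_optim.step()

    # Phase 2: every `grad_update_arch_param_every` minibatches, update the
    # architecture weights (held by the MixedOps) on a validation batch.
    if step % grad_update_arch_param_every == 0:
        mutator.arch_requires_grad()
        val_inputs, val_targets = next(valid_iter)  # valid_iter: placeholder iterator
        arch_optim.zero_grad()
        arch_loss = criterion(model(val_inputs), val_targets)
        arch_loss.backward()
        arch_optim.step()
        mutator.arch_disable_grad()
```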
## Reproduce Results
To reproduce the result, we first ran the search. We found that, although the search runs for many epochs, the chosen architecture converges within the first several epochs. This is probably caused by the hyper-parameters or the implementation; we are working on it. The test accuracy of the found architecture is top-1: 72.31, top-5: 90.26.
# NAS Quick Start
The NAS feature provided by NNI has two key components: APIs for expressing the search space and NAS training approaches. The former lets users easily specify a class of models (i.e., the candidate models defined by the search space) that may perform well. The latter lets users easily apply state-of-the-art NAS training approaches to their own models.
Here we use a simple example to demonstrate how to tune your model architecture with NNI NAS APIs step by step. The complete code of this example can be found [here](https://github.com/microsoft/nni/tree/master/examples/nas/naive).
## Write your model with NAS APIs
Instead of writing a concrete neural model, you can write a class of neural models using the two NAS APIs `LayerChoice` and `InputChoice`. For example, if you think either of two operations might work well in the first convolution layer, you can choose between them using `LayerChoice`, as shown by `self.conv1` in the code below. Similarly, the second convolution layer `self.conv2` also chooses one of two operations. At this point, four candidate neural networks are specified. `self.skipconnect` uses `InputChoice` to specify two choices: adding a skip connection or not.
```python
import torch.nn as nn
from nni.nas.pytorch.mutables import LayerChoice, InputChoice
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.conv1 = LayerChoice([nn.Conv2d(3, 6, 3, padding=1), nn.Conv2d(3, 6, 5, padding=2)])
self.pool = nn.MaxPool2d(2, 2)
self.conv2 = LayerChoice([nn.Conv2d(6, 16, 3, padding=1), nn.Conv2d(6, 16, 5, padding=2)])
self.conv3 = nn.Conv2d(16, 16, 1)
self.skipconnect = InputChoice(n_candidates=1)
self.bn = nn.BatchNorm2d(16)
self.gap = nn.AdaptiveAvgPool2d(4)
self.fc1 = nn.Linear(16 * 4 * 4, 120)
self.fc2 = nn.Linear(120, 84)
self.fc3 = nn.Linear(84, 10)
```
For a detailed description of `LayerChoice` and `InputChoice`, please refer to [the guidance](NasGuide.md).
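The linked example also defines a forward pass over these layers. One possible forward method consistent with the layers above is sketched here; it is not necessarily the exact code of the example, and in particular the assumption that `InputChoice` returns `None` when the skip connection is not selected may differ across NNI versions.
```python
import torch.nn.functional as F

class Net(nn.Module):
    # __init__ as defined above ...

    def forward(self, x):
        bs = x.size(0)
        x = self.pool(F.relu(self.conv1(x)))
        x0 = F.relu(self.conv2(x))
        x1 = F.relu(self.conv3(x0))
        # Assumption: InputChoice returns None when the skip connection is not chosen.
        skip = self.skipconnect([x0])
        if skip is not None:
            x1 += skip
        x = self.pool(self.bn(x1))
        x = self.gap(x).view(bs, -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
```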
## Choose a NAS trainer
After the model is instantiated, it is time to search with a NAS trainer. Different trainers use different approaches to search for the best model from the class of neural models that you specified. NNI provides popular NAS training approaches such as DARTS and ENAS. Here we use `DartsTrainer` as an example. After the trainer is instantiated, invoke `trainer.train()` to do the search.
```python
from nni.nas.pytorch.darts import DartsTrainer

trainer = DartsTrainer(net,
loss=criterion,
metrics=accuracy,
optimizer=optimizer,
num_epochs=2,
dataset_train=dataset_train,
dataset_valid=dataset_valid,
batch_size=64,
log_frequency=10)
trainer.train()
```
## Export the best model
After the search (i.e., `trainer.train()`) is done and we want the best-performing model, simply call `trainer.export("final_arch.json")` to export the found neural architecture to a file.
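For example (a small sketch; `trainer` is the `DartsTrainer` instantiated above, and the exact structure of the exported JSON depends on the mutables in your model):
```python
import json

trainer.export("final_arch.json")

# Inspect the exported architecture; it records which candidate was chosen
# for each mutable in the model.
with open("final_arch.json") as f:
    print(json.dumps(json.load(f), indent=2))
```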
## NAS visualization
We are working on NAS visualization and will release it soon.
## Retrain the exported best model
It is simple to retrain the found (exported) neural architecture. First, instantiate the model you defined above. Second, invoke `apply_fixed_architecture` on the model. The model then becomes the found (exported) one, and you can train it with a traditional training loop.
```python
from nni.nas.pytorch.fixed import apply_fixed_architecture

model = Net()
apply_fixed_architecture(model, "final_arch.json")
```
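Once `apply_fixed_architecture` has been applied, the model behaves like an ordinary PyTorch model, so any standard training loop works, for example (a minimal sketch; `train_loader`, `criterion`, `optimizer`, and `num_epochs` are assumed to be defined as in ordinary PyTorch training):
```python
model.train()
for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()                     # clear gradients from the previous step
        loss = criterion(model(inputs), targets)  # forward pass on the fixed architecture
        loss.backward()                           # backpropagate
        optimizer.step()                          # update model weights
```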
......@@ -93,17 +93,11 @@ By default, it will use `architecture_final.json`. This architecture is provided
.. autoclass:: nni.nas.pytorch.spos.SPOSEvolution
:members:
.. automethod:: __init__
.. autoclass:: nni.nas.pytorch.spos.SPOSSupernetTrainer
:members:
.. automethod:: __init__
.. autoclass:: nni.nas.pytorch.spos.SPOSSupernetTrainingMutator
:members:
.. automethod:: __init__
```
## Known Limitations
......
# ChangeLog
## Release 1.4 - 2/19/2020
### Major Features
#### Neural Architecture Search
* Support [C-DARTS](https://github.com/microsoft/nni/blob/v1.4/docs/en_US/NAS/CDARTS.md) algorithm and add [the example](https://github.com/microsoft/nni/tree/v1.4/examples/nas/cdarts) using it
* Support a preliminary version of [ProxylessNAS](https://github.com/microsoft/nni/blob/v1.4/docs/en_US/NAS/Proxylessnas.md) and the corresponding [example](https://github.com/microsoft/nni/tree/v1.4/examples/nas/proxylessnas)
* Add unit tests for the NAS framework
#### Model Compression
* Support DataParallel for compressing models, and provide [an example](https://github.com/microsoft/nni/blob/v1.4/examples/model_compress/multi_gpu.py) of using DataParallel
* Support [model speedup](https://github.com/microsoft/nni/blob/v1.4/docs/en_US/Compressor/ModelSpeedup.md) for compressed models (alpha version)
#### Training Service
* Support complete PAI configurations by allowing users to specify PAI config file path
* Add example config yaml files for the new PAI mode (i.e., paiK8S)
* Support deleting experiments using sshkey in remote mode (thanks to external contributor @tyusr)
#### WebUI
* WebUI refactor: adopt fabric framework
#### Others
* Support running [NNI experiment at foreground](https://github.com/microsoft/nni/blob/v1.4/docs/en_US/Tutorial/Nnictl.md#manage-an-experiment), i.e., `--foreground` argument in `nnictl create/resume/view`
* Support canceling the trials in UNKNOWN state
* Support large search spaces whose size could be up to 50 MB (thanks to external contributor @Sundrops)
### Documentation
* Improve [the index structure](https://nni.readthedocs.io/en/latest/) of NNI readthedocs
* Improve [documentation for NAS](https://github.com/microsoft/nni/blob/v1.4/docs/en_US/NAS/NasGuide.md)
* Improve documentation for [the new PAI mode](https://github.com/microsoft/nni/blob/v1.4/docs/en_US/TrainingService/PaiMode.md)
* Add QuickStart guidance for [NAS](https://github.com/microsoft/nni/blob/v1.4/docs/en_US/NAS/QuickStart.md) and [model compression](https://github.com/microsoft/nni/blob/v1.4/docs/en_US/Compressor/QuickStart.md)
* Improve documentation for [the supported EfficientNet](https://github.com/microsoft/nni/blob/v1.4/docs/en_US/TrialExample/EfficientNet.md)
### Bug Fixes
* Correctly support NaN in metric data (JSON compliant)
* Fix the out-of-range bug of `randint` type in search space
* Fix the bug of wrong tensor device when exporting onnx model in model compression
* Fix incorrect handling of nnimanagerIP in the new PAI mode (i.e., paiK8S)
## Release 1.3 - 12/30/2019
### Major Features
......@@ -213,7 +253,7 @@
### Major Features
* [Support NNI on Windows](Tutorial/NniOnWindows.md)
* [Support NNI on Windows](Tutorial/InstallationWin.md)
* NNI running on Windows in local mode
* [New advisor: BOHB](Tuner/BohbAdvisor.md)
* Support a new advisor BOHB, a robust and efficient hyperparameter tuning algorithm that combines the advantages of Bayesian optimization and Hyperband
......