Merge dev-nas-tuner back to master (#1531)

* PPO tuner for NAS, supports NNI's NAS interface (#1380)

Unverified Commit 55f48d27 authored Sep 16, 2019 by QuanluZhang Committed by GitHub Sep 16, 2019
20 changed files
--- a/azure-pipelines.yml
+++ b/azure-pipelines.yml
@@ -15,7 +15,7 @@ jobs:
    displayName: 'Install nni toolkit via source code'
  - script: |
      python3 -m pip install flake8 --user
-      IGNORE=./tools/nni_annotation/testcase/*:F821,./examples/trials/mnist-nas/*/mnist*.py:F821
+      IGNORE=./tools/nni_annotation/testcase/*:F821,./examples/trials/mnist-nas/*/mnist*.py:F821,./examples/trials/nas_cifar10/src/cifar10/general_child.py:F821
      python3 -m flake8 . --count --per-file-ignores=$IGNORE --select=E9,F63,F72,F82 --show-source --statistics
    displayName: 'Run flake8 tests to find Python syntax errors and undefined names'
  - script: |

--- a/docs/en_US/Tuner/BuiltinTuner.md
+++ b/docs/en_US/Tuner/BuiltinTuner.md
@@ -20,6 +20,7 @@ Currently we support the following algorithms:
 |[__Metis Tuner__](#MetisTuner)|Metis offers the following benefits when it comes to tuning parameters: While most tools only predict the optimal configuration, Metis gives you two outputs: (a) current prediction of optimal configuration, and (b) suggestion for the next trial. No more guesswork. While most tools assume training datasets do not have noisy data, Metis actually tells you if you need to re-sample a particular hyper-parameter. [Reference Paper](https://www.microsoft.com/en-us/research/publication/metis-robustly-tuning-tail-latencies-cloud-systems/)|
 |[__BOHB__](#BOHB)|BOHB is a follow-up work of Hyperband. It targets the weakness of Hyperband that new configurations are generated randomly without leveraging finished trials. For the name BOHB, HB means Hyperband, BO means Bayesian Optimization. BOHB leverages finished trials by building multiple TPE models, a proportion of new configurations are generated through these models. [Reference Paper](https://arxiv.org/abs/1807.01774)|
 |[__GP Tuner__](#GPTuner)|Gaussian Process Tuner is a sequential model-based optimization (SMBO) approach with Gaussian Process as the surrogate. [Reference Paper](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf), [Github Repo](https://github.com/fmfn/BayesianOptimization)|
+|[__PPO Tuner__](#PPOTuner)|PPO Tuner is a Reinforcement Learning tuner based on the PPO algorithm. [Reference Paper](https://arxiv.org/abs/1707.06347)|
 ## Usage of Built-in Tuners
@@ -38,7 +39,7 @@ Note: Please follow the format when you write your `config.yml` file. Some built
 TPE, as a black-box optimization, can be used in various scenarios and shows good performance in general. Especially when you have limited computation resource and can only try a small number of trials. From a large amount of experiments, we could found that TPE is far better than Random Search. [Detailed Description](./HyperoptTuner.md)
-**Requirement of classArg**
+**Requirement of classArgs**
 * **optimize_mode** (*maximize or minimize, optional, default = maximize*) - If 'maximize', the tuner will target to maximize metrics. If 'minimize', the tuner will target to minimize metrics.
@@ -66,7 +67,7 @@ tuner:
 Random search is suggested when each trial does not take too long (e.g., each trial can be completed very soon, or early stopped by assessor quickly), and you have enough computation resource. Or you want to uniformly explore the search space. Random Search could be considered as baseline of search algorithm. [Detailed Description](./HyperoptTuner.md)
-**Requirement of classArg:**
+**Requirement of classArgs**
 * **optimize_mode** (*maximize or minimize, optional, default = maximize*) - If 'maximize', the tuner will target to maximize metrics. If 'minimize', the tuner will target to minimize metrics.
@@ -91,7 +92,7 @@ tuner:
 Anneal is suggested when each trial does not take too long, and you have enough computation resource(almost same with Random Search). Or the variables in search space could be sample from some prior distribution. [Detailed Description](./HyperoptTuner.md)
-**Requirement of classArg**
+**Requirement of classArgs**
 * **optimize_mode** (*maximize or minimize, optional, default = maximize*) - If 'maximize', the tuner will target to maximize metrics. If 'minimize', the tuner will target to minimize metrics.
@@ -117,7 +118,7 @@ tuner:
 Its requirement of computation resource is relatively high. Specifically, it requires large initial population to avoid falling into local optimum. If your trial is short or leverages assessor, this tuner is a good choice. And, it is more suggested when your trial code supports weight transfer, that is, the trial could inherit the converged weights from its parent(s). This can greatly speed up the training progress. [Detailed Description](./EvolutionTuner.md)
-**Requirement of classArg**
+**Requirement of classArgs**
 * **optimize_mode** (*maximize or minimize, optional, default = maximize*) - If 'maximize', the tuner will target to maximize metrics. If 'minimize', the tuner will target to minimize metrics.
@@ -156,7 +157,7 @@ nnictl package install --name=SMAC
 Similar to TPE, SMAC is also a black-box tuner which can be tried in various scenarios, and is suggested when computation resource is limited. It is optimized for discrete hyperparameters, thus, suggested when most of your hyperparameters are discrete. [Detailed Description](./SmacTuner.md)
-**Requirement of classArg**
+**Requirement of classArgs**
 * **optimize_mode** (*maximize or minimize, optional, default = maximize*) - If 'maximize', the tuner will target to maximize metrics. If 'minimize', the tuner will target to minimize metrics.
@@ -243,7 +244,7 @@ tuner:
 It is suggested when you have limited computation resource but have relatively large search space. It performs well in the scenario that intermediate result (e.g., accuracy) can reflect good or bad of final result (e.g., accuracy) to some extent. [Detailed Description](./HyperbandAdvisor.md)
-**Requirement of classArg**
+**Requirement of classArgs**
 * **optimize_mode** (*maximize or minimize, optional, default = maximize*) - If 'maximize', the tuner will target to maximize metrics. If 'minimize', the tuner will target to minimize metrics.
 * **R** (*int, optional, default = 60*) - the maximum budget given to a trial (could be the number of mini-batches or epochs) can be allocated to a trial. Each trial should use TRIAL_BUDGET to control how long it runs.
@@ -277,7 +278,7 @@ NetworkMorphism requires [PyTorch](https://pytorch.org/get-started/locally) and
 It is suggested that you want to apply deep learning methods to your task (your own dataset) but you have no idea of how to choose or design a network. You modify the [example](https://github.com/Microsoft/nni/tree/master/examples/trials/network_morphism/cifar10/cifar10_keras.py) to fit your own dataset and your own data augmentation method. Also you can change the batch size, learning rate or optimizer. It is feasible for different tasks to find a good network architecture. Now this tuner only supports the computer vision domain. [Detailed Description](./NetworkmorphismTuner.md)
-**Requirement of classArg**
+**Requirement of classArgs**
 * **optimize_mode** (*maximize or minimize, optional, default = maximize*) - If 'maximize', the tuner will target to maximize metrics. If 'minimize', the tuner will target to minimize metrics.
 * **task** (*('cv'), optional, default = 'cv'*) - The domain of experiment, for now, this tuner only supports the computer vision(cv) domain.
@@ -313,7 +314,7 @@ Note that the only acceptable types of search space are `choice`, `quniform`, `u
 Similar to TPE and SMAC, Metis is a black-box tuner. If your system takes a long time to finish each trial, Metis is more favorable than other approaches such as random search. Furthermore, Metis provides guidance on the subsequent trial. Here is an [example](https://github.com/Microsoft/nni/tree/master/examples/trials/auto-gbdt/search_space_metis.json) about the use of Metis. User only need to send the final result like `accuracy` to tuner, by calling the NNI SDK. [Detailed Description](./MetisTuner.md)
-**Requirement of classArg**
+**Requirement of classArgs**
 * **optimize_mode** (*'maximize' or 'minimize', optional, default = 'maximize'*) - If 'maximize', the tuner will target to maximize metrics. If 'minimize', the tuner will target to minimize metrics.
@@ -347,7 +348,7 @@ nnictl package install --name=BOHB
 Similar to Hyperband, it is suggested when you have limited computation resource but have relatively large search space. It performs well in the scenario that intermediate result (e.g., accuracy) can reflect good or bad of final result (e.g., accuracy) to some extent. In this case, it may converges to a better configuration due to Bayesian optimization usage. [Detailed Description](./BohbAdvisor.md)
-**Requirement of classArg**
+**Requirement of classArgs**
 * **optimize_mode** (*maximize or minimize, optional, default = maximize*) - If 'maximize', tuners will target to maximize metrics. If 'minimize', tuner will target to minimize metrics.
 * **min_budget** (*int, optional, default = 1*) - The smallest budget assign to a trial job, (budget could be the number of mini-batches or epochs). Needs to be positive.
@@ -386,7 +387,7 @@ Note that the only acceptable types of search space are `choice`, `randint`, `un
 As a strategy in Sequential Model-based Global Optimization(SMBO) algorithm, GP Tuner uses a proxy optimization problem (finding the maximum of the acquisition function) that, albeit still a hard problem, is cheaper (in the computational sense) and common tools can be employed. Therefore GP Tuner is most adequate for situations where the function to be optimized is a very expensive endeavor. GP can be used when the computation resource is limited. While GP Tuner has a computational cost that grows at *O(N^3)* due to the requirement of inverting the Gram matrix, so it's not suitable when lots of trials are needed. [Detailed Description](./GPTuner.md)
-**Requirement of classArg**
+**Requirement of classArgs**
 * **optimize_mode** (*'maximize' or 'minimize', optional, default = 'maximize'*) - If 'maximize', the tuner will target to maximize metrics. If 'minimize', the tuner will target to minimize metrics.
 * **utility** (*'ei', 'ucb' or 'poi', optional, default = 'ei'*) - The kind of utility function(acquisition function). 'ei', 'ucb' and 'poi' corresponds to 'Expected Improvement', 'Upper Confidence Bound' and 'Probability of Improvement' respectively. 
@@ -415,3 +416,39 @@ tuner:
    selection_num_warm_up: 100000
    selection_num_starting_points: 250
 ```
+<a name="PPOTuner"></a>
+![](https://placehold.it/15/1589F0/000000?text=+) `PPO Tuner`
+> Built-in Tuner Name: **PPOTuner**
+Note that the only acceptable type of search space is `mutable_layer`. `optional_input_size` can only be 0, 1, or [0, 1].
+**Suggested scenario**
+PPOTuner is a Reinforcement Learning tuner based on the PPO algorithm. It is recommended when your trial code uses the NNI NAS interface to do neural architecture search. It has relatively high data efficiency, but it is best suited to cases where a large amount of computation resource is available. You can try it on a very simple task first, such as the [mnist-nas](https://github.com/microsoft/nni/tree/master/examples/trials/mnist-nas) example. [Detailed Description](./PPOTuner.md)
+**Requirement of classArgs**
+* **optimize_mode** (*'maximize' or 'minimize'*) - If 'maximize', the tuner will target to maximize metrics. If 'minimize', the tuner will target to minimize metrics.
+* **trials_per_update** (*int, optional, default = 20*) - The number of trials to be used for one update. It is recommended that this number be larger than `trialConcurrency` and that `trialConcurrency` be a divisor of `trials_per_update`. Note that `trials_per_update` should be divisible by `minibatch_size`.
+* **epochs_per_update** (*int, optional, default = 4*) - The number of epochs for one update.
+* **minibatch_size** (*int, optional, default = 4*) - Mini-batch size (i.e., number of trials for a mini-batch) for the update. Note that `trials_per_update` should be divisible by `minibatch_size`.
+* **ent_coef** (*float, optional, default = 0.0*) - Policy entropy coefficient in the optimization objective.
+* **lr** (*float, optional, default = 3e-4*) - Learning rate of the model (LSTM network), constant.
+* **vf_coef** (*float, optional, default = 0.5*) - Value function loss coefficient in the optimization objective.
+* **max_grad_norm** (*float, optional, default = 0.5*) - Gradient norm clipping coefficient.
+* **gamma** (*float, optional, default = 0.99*) - Discounting factor.
+* **lam** (*float, optional, default = 0.95*) - Advantage estimation discounting factor (lambda in the paper).
+* **cliprange** (*float, optional, default = 0.2*) - Cliprange in the PPO algorithm, constant.
+**Usage example**
+```yaml
+# config.yml
+tuner:
+  builtinTunerName: PPOTuner
+  classArgs:
+    optimize_mode: maximize
+```
\ No newline at end of file
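The `classArgs` notes above state two arithmetic constraints: `trials_per_update` should be divisible by `minibatch_size`, and `trials_per_update` is recommended to be larger than `trialConcurrency` and a multiple of it. The following is a minimal sketch of a pre-flight check for those rules. `check_ppo_args` is a hypothetical helper written for illustration and is not part of NNI.

```python
def check_ppo_args(trials_per_update=20, minibatch_size=4, trial_concurrency=1):
    """Return a list of problems; an empty list means the values look consistent.

    Hypothetical helper: it only encodes the rules stated in the PPOTuner docs.
    """
    problems = []
    # Required: each update is split into whole mini-batches of trials.
    if trials_per_update % minibatch_size != 0:
        problems.append("trials_per_update should be divisible by minibatch_size")
    # Recommended: more trials per update than trials running concurrently ...
    if trials_per_update <= trial_concurrency:
        problems.append("trials_per_update is recommended to be larger than trialConcurrency")
    # ... and the concurrency should divide the trials per update evenly.
    elif trials_per_update % trial_concurrency != 0:
        problems.append("trialConcurrency is recommended to be a divisor of trials_per_update")
    return problems


# Values taken from examples/trials/nas_cifar10/config_ppo.yml in this commit.
print(check_ppo_args(trials_per_update=60, minibatch_size=10, trial_concurrency=4))
```

With the values from `config_ppo.yml` (60, 10, 4) and `config_pai_ppo.yml` (60, 6, 20) the check reports no problems.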
--- a/docs/en_US/Tuner/PPOTuner.md
+++ b/docs/en_US/Tuner/PPOTuner.md
+PPO Tuner on NNI
+===
+## PPOTuner
+This tuner is designed for NNI's NAS interface and uses the [PPO algorithm](https://arxiv.org/abs/1707.06347). The implementation inherits the main logic of the implementation [here](https://github.com/openai/baselines/tree/master/baselines/ppo2) (i.e., ppo2 from OpenAI) and is adapted for the NAS scenario.
+It successfully tunes the [mnist-nas example](https://github.com/microsoft/nni/tree/master/examples/trials/mnist-nas), with the following result:
+![](../../img/ppo_mnist.png)
+We also tune [the macro search space for image classification in the ENAS paper](https://github.com/microsoft/nni/tree/master/examples/trials/nas_cifar10) (with a limited number of epochs for each trial, i.e., 8 epochs), which is implemented using the NAS interface and tuned with PPOTuner. Figure 7 of the [ENAS paper](https://arxiv.org/pdf/1802.03268.pdf) illustrates what the search space looks like:
+![](../../img/enas_search_space.png)
+The figure above shows one chosen architecture and illustrates the search space. Each square is a layer whose operation is chosen from 6 operations. Each dashed line is a skip connection; each square layer can choose 0 or 1 skip connection that takes the output of a previous layer. __Note that__ in the original macro search space each square layer can choose any number of skip connections, while our implementation only allows 0 or 1.
+The result is shown in the figure below (with the experiment config [here](https://github.com/microsoft/nni/blob/master/examples/trials/nas_cifar10/config_ppo.yml)):
+![](../../img/ppo_cifar10.png)
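To make the roles of `cliprange`, `gamma`, and `lam` concrete, here is a minimal sketch of the two formulas from the PPO paper that those arguments parameterize: the clipped surrogate objective and generalized advantage estimation. It is written in plain Python for readability, ignores episode-termination flags, and is not the code used by PPOTuner; the function names are illustrative.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, cliprange=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # r = pi_new(a|s) / pi_old(a|s)
    clipped = max(1.0 - cliprange, min(ratio, 1.0 + cliprange))
    return min(ratio * advantage, clipped * advantage)


def gae(rewards, values, last_value=0.0, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory (no done flags)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # A_t = delta_t + gamma * lam * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# A policy that became much more likely to take a good action gains at most
# (1 + cliprange) * advantage from that sample.
print(clipped_surrogate(logp_new=0.0, logp_old=-1.0, advantage=1.0))
```

The clipping removes the incentive to move the new policy far from the old one within a single update, which is why several `epochs_per_update` can reuse the same batch of trials.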
--- a/docs/img/enas_search_space.png
+++ b/docs/img/enas_search_space.png
--- a/docs/img/ppo_cifar10.png
+++ b/docs/img/ppo_cifar10.png
--- a/docs/img/ppo_mnist.png
+++ b/docs/img/ppo_mnist.png
--- a/examples/trials/mnist-nas/classic_mode/config_hpo.yml
+++ b/examples/trials/mnist-nas/classic_mode/config_hpo.yml
+authorName: default
+experimentName: example_mnist
+trialConcurrency: 1
+maxExecDuration: 1h
+maxTrialNum: 10
+#choice: local, remote, pai
+trainingServicePlatform: local
+#choice: true, false
+useAnnotation: true
+tuner:
+  builtinTunerName: TPE
+trial:
+  command: python3 mnist.py --batch_num 200
+  codeDir: .
+  gpuNum: 0
+  nasMode: classic_mode
--- a/examples/trials/mnist-nas/config_ppo.yml
+++ b/examples/trials/mnist-nas/config_ppo.yml
+authorName: NNI-example
+experimentName: example_mnist
+trialConcurrency: 1
+maxExecDuration: 100h
+maxTrialNum: 10000
+#choice: local, remote, pai
+trainingServicePlatform: local
+#choice: true, false
+useAnnotation: true
+tuner:
+  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner
+  #SMAC, PPO (SMAC and PPO should be installed through nnictl)
+  builtinTunerName: PPOTuner
+  classArgs:
+    optimize_mode: maximize
+trial:
+  command: python3 mnist.py
+  codeDir: .
+  gpuNum: 0
--- a/examples/trials/nas_cifar10/README.md
+++ b/examples/trials/nas_cifar10/README.md
@@ -2,7 +2,14 @@
 ===	
 Now we have an NAS example [NNI-NAS-Example](https://github.com/Crysple/NNI-NAS-Example) run in NNI using NAS interface from our contributors.	
+We have included its trial code in this folder, and provided example config files to show how to use PPO tuner to tune the trial code.
+> Download data
+- `cd data && . download.sh`
+- `tar xzf cifar-10-python.tar.gz && mv cifar-10-batches-py cifar10`
 Thanks our lovely contributors. 	
 And welcome more and more people to join us!
\ No newline at end of file
--- a/examples/trials/nas_cifar10/config_pai_ppo.yml
+++ b/examples/trials/nas_cifar10/config_pai_ppo.yml
+authorName: Unknown
+experimentName: enas_macro
+trialConcurrency: 20
+maxExecDuration: 2400h
+maxTrialNum: 20000
+#choice: local, remote, pai
+trainingServicePlatform: pai
+#choice: true, false
+useAnnotation: true
+multiPhase: false
+versionCheck: false
+nniManagerIp: 0.0.0.0
+tuner:
+  builtinTunerName: PPOTuner
+  classArgs:
+    optimize_mode: maximize
+    trials_per_update: 60
+    epochs_per_update: 20
+    minibatch_size: 6
+trial:
+  command: sh ./macro_cifar10_pai.sh
+  codeDir: ./
+  gpuNum: 1
+  cpuNum: 1
+  memoryMB: 8196
+  image: msranni/nni:latest
+  virtualCluster: nni
+paiConfig:
+  userName: your_account
+  passWord: your_pwd
+  host: 0.0.0.0
--- a/examples/trials/nas_cifar10/config_ppo.yml
+++ b/examples/trials/nas_cifar10/config_ppo.yml
+authorName: Unknown
+experimentName: enas_macro
+trialConcurrency: 4
+maxExecDuration: 2400h
+maxTrialNum: 20000
+#choice: local, remote
+trainingServicePlatform: local
+#choice: true, false
+useAnnotation: true
+multiPhase: false
+tuner:
+  builtinTunerName: PPOTuner
+  classArgs:
+    optimize_mode: maximize
+    trials_per_update: 60
+    epochs_per_update: 12
+    minibatch_size: 10
+trial:
+  command: sh ./macro_cifar10.sh
+  codeDir: ./
+  gpuNum: 1
--- a/examples/trials/nas_cifar10/data/download.sh
+++ b/examples/trials/nas_cifar10/data/download.sh
+wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
--- a/examples/trials/nas_cifar10/macro_cifar10.sh
+++ b/examples/trials/nas_cifar10/macro_cifar10.sh
+#!/bin/bash
+set -e
+export PYTHONPATH="$(pwd)"
+python3 src/cifar10/nni_child_cifar10.py \
+  --data_format="NCHW" \
+  --search_for="macro" \
+  --reset_output_dir \
+  --data_path="data/cifar10" \
+  --output_dir="outputs" \
+  --train_data_size=45000 \
+  --batch_size=100 \
+  --num_epochs=8 \
+  --log_every=50 \
+  --eval_every_epochs=1 \
+  --child_use_aux_heads \
+  --child_num_layers=12 \
+  --child_out_filters=36 \
+  --child_l2_reg=0.0002 \
+  --child_num_branches=6 \
+  --child_num_cell_layers=5 \
+  --child_keep_prob=0.50 \
+  --child_drop_path_keep_prob=0.60 \
+  --child_lr_cosine \
+  --child_lr_max=0.05 \
+  --child_lr_min=0.001 \
+  --child_lr_T_0=10 \
+  --child_lr_T_mul=2 \
+  --child_mode="subgraph" \
+  "$@"
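The `--child_lr_cosine`, `--child_lr_max`, `--child_lr_min`, `--child_lr_T_0`, and `--child_lr_T_mul` flags above suggest a cosine learning-rate schedule with warm restarts. The sketch below assumes the usual SGDR form (first period `T_0` epochs, each later period multiplied by `T_mul`) evaluated once per epoch; the trial's actual schedule is not shown in this diff, so treat `cosine_lr` as a hypothetical illustration rather than the implementation.

```python
import math


def cosine_lr(epoch, lr_max=0.05, lr_min=0.001, t_0=10, t_mul=2):
    """Cosine annealing with warm restarts, evaluated at an integer epoch."""
    period = t_0
    start = 0
    # Find the restart cycle that contains this epoch.
    while epoch >= start + period:
        start += period
        period *= t_mul
    progress = (epoch - start) / period  # in [0, 1) within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# With --num_epochs=8 and T_0=10, training would stay inside the first cycle.
for epoch in range(8):
    print(epoch, round(cosine_lr(epoch), 5))
```

Under this assumption the learning rate starts at `lr_max`, decays toward `lr_min` over the first 10 epochs, and would jump back to `lr_max` at epoch 10 for a 20-epoch cycle.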
--- a/examples/trials/nas_cifar10/macro_cifar10_pai.sh
+++ b/examples/trials/nas_cifar10/macro_cifar10_pai.sh
+#!/bin/bash
+set -e
+export PYTHONPATH="$(pwd)"
+python3 src/cifar10/nni_child_cifar10.py \
+  --data_format="NCHW" \
+  --search_for="macro" \
+  --reset_output_dir \
+  --data_path="data/cifar10" \
+  --output_dir="outputs" \
+  --train_data_size=45000 \
+  --batch_size=100 \
+  --num_epochs=30 \
+  --log_every=50 \
+  --eval_every_epochs=1 \
+  --child_use_aux_heads \
+  --child_num_layers=12 \
+  --child_out_filters=36 \
+  --child_l2_reg=0.0002 \
+  --child_num_branches=6 \
+  --child_num_cell_layers=5 \
+  --child_keep_prob=0.50 \
+  --child_drop_path_keep_prob=0.60 \
+  --child_lr_cosine \
+  --child_lr_max=0.05 \
+  --child_lr_min=0.001 \
+  --child_lr_T_0=10 \
+  --child_lr_T_mul=2 \
+  --child_mode="subgraph" \
+  "$@"
--- a/examples/trials/nas_cifar10/src/__init__.py
+++ b/examples/trials/nas_cifar10/src/__init__.py
--- a/examples/trials/nas_cifar10/src/cifar10/__init__.py
+++ b/examples/trials/nas_cifar10/src/cifar10/__init__.py
--- a/examples/trials/nas_cifar10/src/cifar10/data_utils.py
+++ b/examples/trials/nas_cifar10/src/cifar10/data_utils.py
+import os
+import sys
+import pickle
+import numpy as np
+import tensorflow as tf
+def _read_data(data_path, train_files):
+    """Reads CIFAR-10 format data. Always returns NHWC format.
+    Returns:
+        images: np tensor of size [N, H, W, C]
+        labels: np tensor of size [N]
+    """
+    images, labels = [], []
+    for file_name in train_files:
+        print(file_name)
+        full_name = os.path.join(data_path, file_name)
+        with open(full_name, "rb") as finp:
+            data = pickle.load(finp, encoding='latin1')
+            batch_images = data["data"].astype(np.float32) / 255.0
+            batch_labels = np.array(data["labels"], dtype=np.int32)
+            images.append(batch_images)
+            labels.append(batch_labels)
+    images = np.concatenate(images, axis=0)
+    labels = np.concatenate(labels, axis=0)
+    images = np.reshape(images, [-1, 3, 32, 32])
+    images = np.transpose(images, [0, 2, 3, 1])
+    return images, labels
+def read_data(data_path, num_valids=5000):
+    print("-" * 80)
+    print("Reading data")
+    images, labels = {}, {}
+    train_files = [
+        "data_batch_1",
+        "data_batch_2",
+        "data_batch_3",
+        "data_batch_4",
+        "data_batch_5",
+    ]
+    test_file = [
+        "test_batch",
+    ]
+    images["train"], labels["train"] = _read_data(data_path, train_files)
+    if num_valids:
+        images["valid"] = images["train"][-num_valids:]
+        labels["valid"] = labels["train"][-num_valids:]
+        images["train"] = images["train"][:-num_valids]
+        labels["train"] = labels["train"][:-num_valids]
+    else:
+        images["valid"], labels["valid"] = None, None
+    images["test"], labels["test"] = _read_data(data_path, test_file)
+    print("Preprocess: [subtract mean], [divide std]")
+    mean = np.mean(images["train"], axis=(0, 1, 2), keepdims=True)
+    std = np.std(images["train"], axis=(0, 1, 2), keepdims=True)
+    print("mean: {}".format(np.reshape(mean * 255.0, [-1])))
+    print("std: {}".format(np.reshape(std * 255.0, [-1])))
+    images["train"] = (images["train"] - mean) / std
+    if num_valids:
+        images["valid"] = (images["valid"] - mean) / std
+    images["test"] = (images["test"] - mean) / std
+    return images, labels
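The reshape to `[-1, 3, 32, 32]` followed by the transpose `[0, 2, 3, 1]` in `_read_data` relies on the layout of the CIFAR-10 Python batches: each row holds 3072 values stored channel by channel (red, green, blue), each channel in row-major order. The sketch below spells out the same index arithmetic for a single row in plain Python; `row_to_hwc` is a hypothetical helper for illustration only.

```python
def row_to_hwc(row, height=32, width=32, channels=3):
    """Convert one flat channel-major row into a nested [H][W][C] list.

    Equivalent to reshaping the row to [C, H, W] and transposing to [H, W, C].
    """
    plane = height * width  # number of values per channel
    return [
        [
            [row[c * plane + h * width + w] for c in range(channels)]
            for w in range(width)
        ]
        for h in range(height)
    ]


# Tiny 2x2 "image" with 3 channels: values 0-3 are channel 0, 4-7 channel 1, 8-11 channel 2.
image = row_to_hwc(list(range(12)), height=2, width=2, channels=3)
print(image[0][0])  # the three channel values of the top-left pixel
```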
--- a/examples/trials/nas_cifar10/src/cifar10/general_child.py
+++ b/examples/trials/nas_cifar10/src/cifar10/general_child.py
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import numpy as np
+import tensorflow as tf
+from src.common_ops import create_weight, batch_norm, batch_norm_with_mask, global_avg_pool, conv_op, pool_op
+from src.utils import count_model_params, get_train_ops, get_C, get_strides
+from src.cifar10.models import Model
+class GeneralChild(Model):
+    def __init__(self,
+                 images,
+                 labels,
+                 cutout_size=None,
+                 fixed_arc=None,
+                 out_filters_scale=1,
+                 num_layers=2,
+                 num_branches=6,
+                 out_filters=24,
+                 keep_prob=1.0,
+                 batch_size=32,
+                 clip_mode=None,
+                 grad_bound=None,
+                 l2_reg=1e-4,
+                 lr_init=0.1,
+                 lr_dec_start=0,
+                 lr_dec_every=10000,
+                 lr_dec_rate=0.1,
+                 lr_cosine=False,
+                 lr_max=None,
+                 lr_min=None,
+                 lr_T_0=None,
+                 lr_T_mul=None,
+                 optim_algo=None,
+                 sync_replicas=False,
+                 num_aggregate=None,
+                 num_replicas=None,
+                 data_format="NHWC",
+                 name="child",
+                 mode="subgraph",
+                 *args,
+                 **kwargs
+                 ):
+        super(GeneralChild, self).__init__(
+            images,
+            labels,
+            cutout_size=cutout_size,
+            batch_size=batch_size,
+            clip_mode=clip_mode,
+            grad_bound=grad_bound,
+            l2_reg=l2_reg,
+            lr_init=lr_init,
+            lr_dec_start=lr_dec_start,
+            lr_dec_every=lr_dec_every,
+            lr_dec_rate=lr_dec_rate,
+            keep_prob=keep_prob,
+            optim_algo=optim_algo,
+            sync_replicas=sync_replicas,
+            num_aggregate=num_aggregate,
+            num_replicas=num_replicas,
+            data_format=data_format,
+            name=name)
+        self.lr_cosine = lr_cosine
+        self.lr_max = lr_max
+        self.lr_min = lr_min
+        self.lr_T_0 = lr_T_0
+        self.lr_T_mul = lr_T_mul
+        self.out_filters = out_filters * out_filters_scale
+        self.num_layers = num_layers
+        self.mode = mode
+        self.num_branches = num_branches
+        self.fixed_arc = fixed_arc
+        self.out_filters_scale = out_filters_scale
+        pool_distance = self.num_layers // 3
+        self.pool_layers = [pool_distance - 1, 2 * pool_distance - 1]
+    def _factorized_reduction(self, x, out_filters, stride, is_training):
+        """Reduces the shape of x without information loss due to striding."""
+        assert out_filters % 2 == 0, (
+            "Need even number of filters when using this factorized reduction.")
+        if stride == 1:
+            with tf.variable_scope("path_conv"):
+                inp_c = get_C(x, self.data_format)
+                w = create_weight("w", [1, 1, inp_c, out_filters])
+                x = tf.nn.conv2d(x, w, [1, 1, 1, 1], "SAME",
+                                 data_format=self.data_format)
+                x = batch_norm(x, is_training, data_format=self.data_format)
+                return x
+        stride_spec = get_strides(stride, self.data_format)
+        # Skip path 1
+        path1 = tf.nn.avg_pool(
+            x, [1, 1, 1, 1], stride_spec, "VALID", data_format=self.data_format)
+        with tf.variable_scope("path1_conv"):
+            inp_c = get_C(path1, self.data_format)
+            w = create_weight("w", [1, 1, inp_c, out_filters // 2])
+            path1 = tf.nn.conv2d(path1, w, [1, 1, 1, 1], "SAME",
+                                 data_format=self.data_format)
+        # Skip path 2
+        # First pad with 0's on the right and bottom, then shift the filter to
+        # include those 0's that were added.
+        if self.data_format == "NHWC":
+            pad_arr = [[0, 0], [0, 1], [0, 1], [0, 0]]
+            path2 = tf.pad(x, pad_arr)[:, 1:, 1:, :]
+            concat_axis = 3
+        else:
+            pad_arr = [[0, 0], [0, 0], [0, 1], [0, 1]]
+            path2 = tf.pad(x, pad_arr)[:, :, 1:, 1:]
+            concat_axis = 1
+        path2 = tf.nn.avg_pool(
+            path2, [1, 1, 1, 1], stride_spec, "VALID", data_format=self.data_format)
+        with tf.variable_scope("path2_conv"):
+            inp_c = get_C(path2, self.data_format)
+            w = create_weight("w", [1, 1, inp_c, out_filters // 2])
+            path2 = tf.nn.conv2d(path2, w, [1, 1, 1, 1], "SAME",
+                                 data_format=self.data_format)
+        # Concat and apply BN
+        final_path = tf.concat(values=[path1, path2], axis=concat_axis)
+        final_path = batch_norm(final_path, is_training,
+                                data_format=self.data_format)
+        return final_path
+    def _model(self, images, is_training, reuse=False):
+        '''Build model'''
+        with tf.variable_scope(self.name, reuse=reuse):
+            layers = []
+            out_filters = self.out_filters
+            with tf.variable_scope("stem_conv"):
+                w = create_weight("w", [3, 3, 3, out_filters])
+                x = tf.nn.conv2d(
+                    images, w, [1, 1, 1, 1], "SAME", data_format=self.data_format)
+                x = batch_norm(x, is_training, data_format=self.data_format)
+                layers.append(x)
+            def add_fixed_pooling_layer(layer_id, layers, out_filters, is_training):
+                '''Add a fixed pooling layer every four layers'''
+                out_filters *= 2
+                with tf.variable_scope("pool_at_{0}".format(layer_id)):
+                    pooled_layers = []
+                    for i, layer in enumerate(layers):
+                        with tf.variable_scope("from_{0}".format(i)):
+                            x = self._factorized_reduction(
+                                layer, out_filters, 2, is_training)
+                        pooled_layers.append(x)
+                    return pooled_layers, out_filters
+            def post_process_out(out, optional_inputs):
+                '''Form skip connection and perform batch norm'''
+                with tf.variable_scope("skip"):
+                    inputs = layers[-1]
+                    if self.data_format == "NHWC":
+                        inp_h = inputs.get_shape()[1].value
+                        inp_w = inputs.get_shape()[2].value
+                        inp_c = inputs.get_shape()[3].value
+                        out.set_shape([None, inp_h, inp_w, out_filters])
+                    elif self.data_format == "NCHW":
+                        inp_c = inputs.get_shape()[1].value
+                        inp_h = inputs.get_shape()[2].value
+                        inp_w = inputs.get_shape()[3].value
+                        out.set_shape([None, out_filters, inp_h, inp_w])
+                    optional_inputs.append(out)
+                    pout = tf.add_n(optional_inputs)
+                    out = batch_norm(pout, is_training,
+                                     data_format=self.data_format)
+                layers.append(out)
+                return out
+            global layer_id
+            layer_id = -1
+            def get_layer_id():
+                global layer_id
+                layer_id += 1
+                return 'layer_' + str(layer_id)
+            def conv3(inputs):
+                # res_layers is pre_layers that are chosen to form skip connection
+                # layers[-1] is always the latest input
+                with tf.variable_scope(get_layer_id()):
+                    with tf.variable_scope('branch_0'):
+                        out = conv_op(
+                            inputs[0][0], 3, is_training, out_filters, out_filters, self.data_format, start_idx=None)
+                    out = post_process_out(out, inputs[1])
+                return out
+            def conv3_sep(inputs):
+                with tf.variable_scope(get_layer_id()):
+                    with tf.variable_scope('branch_1'):
+                        out = conv_op(
+                            inputs[0][0], 3, is_training, out_filters, out_filters, self.data_format, start_idx=None, separable=True)
+                    out = post_process_out(out, inputs[1])
+                return out
+            def conv5(inputs):
+                with tf.variable_scope(get_layer_id()):
+                    with tf.variable_scope('branch_2'):
+                        out = conv_op(
+                            inputs[0][0], 5, is_training, out_filters, out_filters, self.data_format, start_idx=None)
+                    out = post_process_out(out, inputs[1])
+                return out
+            def conv5_sep(inputs):
+                with tf.variable_scope(get_layer_id()):
+                    with tf.variable_scope('branch_3'):
+                        out = conv_op(
+                            inputs[0][0], 5, is_training, out_filters, out_filters, self.data_format, start_idx=None, separable=True)
+                    out = post_process_out(out, inputs[1])
+                return out
+            def avg_pool(inputs):
+                with tf.variable_scope(get_layer_id()):
+                    with tf.variable_scope('branch_4'):
+                        out = pool_op(
+                            inputs[0][0], is_training, out_filters, out_filters, "avg", self.data_format, start_idx=None)
+                    out = post_process_out(out, inputs[1])
+                return out
+            def max_pool(inputs):
+                with tf.variable_scope(get_layer_id()):
+                    with tf.variable_scope('branch_5'):
+                        out = pool_op(
+                            inputs[0][0], is_training, out_filters, out_filters, "max", self.data_format, start_idx=None)
+                    out = post_process_out(out, inputs[1])
+                return out
+            """@nni.mutable_layers(
+            {
+                layer_choice: [conv3(), conv3_sep(), conv5(), conv5_sep(), avg_pool(), max_pool()],
+                fixed_inputs:[x],
+                layer_output: layer_0_out
+            },
+            {
+                layer_choice: [conv3(), conv3_sep(), conv5(), conv5_sep(), avg_pool(), max_pool()],
+                fixed_inputs:[layer_0_out],
+                optional_inputs: [layer_0_out],
+                optional_input_size: [0, 1],
+                layer_output: layer_1_out
+            },
+            {
+                layer_choice: [conv3(), conv3_sep(), conv5(), conv5_sep(), avg_pool(), max_pool()],
+                fixed_inputs:[layer_1_out],
+                optional_inputs: [layer_0_out, layer_1_out],
+                optional_input_size: [0, 1],
+                layer_output: layer_2_out
+            },
+            {
+                layer_choice: [conv3(), conv3_sep(), conv5(), conv5_sep(), avg_pool(), max_pool()],
+                fixed_inputs:[layer_2_out],
+                optional_inputs: [layer_0_out, layer_1_out, layer_2_out],
+                optional_input_size: [0, 1],
+                layer_output: layer_3_out
+            }
+            )"""
+            layers, out_filters = add_fixed_pooling_layer(
+                3, layers, out_filters, is_training)
+            layer_0_out, layer_1_out, layer_2_out, layer_3_out = layers[-4:]
+            """@nni.mutable_layers(
+            {
+                layer_choice: [conv3(), conv3_sep(), conv5(), conv5_sep(), avg_pool(), max_pool()],
+                fixed_inputs: [layer_3_out],
+                optional_inputs: [layer_0_out, layer_1_out, layer_2_out, layer_3_out],
+                optional_input_size: [0, 1],
+                layer_output: layer_4_out
+            },
+            {
+                layer_choice: [conv3(), conv3_sep(), conv5(), conv5_sep(), avg_pool(), max_pool()],
+                fixed_inputs: [layer_4_out],
+                optional_inputs: [layer_0_out, layer_1_out, layer_2_out, layer_3_out, layer_4_out],
+                optional_input_size: [0, 1],
+                layer_output: layer_5_out
+            },
+            {
+                layer_choice: [conv3(), conv3_sep(), conv5(), conv5_sep(), avg_pool(), max_pool()],
+                fixed_inputs: [layer_5_out],
+                optional_inputs: [layer_0_out, layer_1_out, layer_2_out, layer_3_out, layer_4_out, layer_5_out],
+                optional_input_size: [0, 1],
+                layer_output: layer_6_out
+            },
+            {
+                layer_choice: [conv3(), conv3_sep(), conv5(), conv5_sep(), avg_pool(), max_pool()],
+                fixed_inputs: [layer_6_out],
+                optional_inputs: [layer_0_out, layer_1_out, layer_2_out, layer_3_out, layer_4_out, layer_5_out, layer_6_out],
+                optional_input_size: [0, 1],
+                layer_output: layer_7_out
+            }
+            )"""
+            layers, out_filters = add_fixed_pooling_layer(
+                7, layers, out_filters, is_training)
+            layer_0_out, layer_1_out, layer_2_out, layer_3_out, layer_4_out, layer_5_out, layer_6_out, layer_7_out = layers[
+                -8:]
+            """@nni.mutable_layers(
+            {
+                layer_choice: [conv3(), conv3_sep(), conv5(), conv5_sep(), avg_pool(), max_pool()],
+                fixed_inputs: [layer_7_out],
+                optional_inputs: [layer_0_out, layer_1_out, layer_2_out, layer_3_out, layer_4_out, layer_5_out, layer_6_out, layer_7_out],
+                optional_input_size: [0, 1],
+                layer_output: layer_8_out
+            },
+            {
+                layer_choice: [conv3(), conv3_sep(), conv5(), conv5_sep(), avg_pool(), max_pool()],
+                fixed_inputs: [layer_8_out],
+                optional_inputs: [layer_0_out, layer_1_out, layer_2_out, layer_3_out, layer_4_out, layer_5_out, layer_6_out, layer_7_out, layer_8_out],
+                optional_input_size: [0, 1],
+                layer_output: layer_9_out
+            },
+            {
+                layer_choice: [conv3(), conv3_sep(), conv5(), conv5_sep(), avg_pool(), max_pool()],
+                fixed_inputs: [layer_9_out],
+                optional_inputs: [layer_0_out, layer_1_out, layer_2_out, layer_3_out, layer_4_out, layer_5_out, layer_6_out, layer_7_out, layer_8_out, layer_9_out],
+                optional_input_size: [0, 1],
+                layer_output: layer_10_out
+            },
+            {
+                layer_choice: [conv3(), conv3_sep(), conv5(), conv5_sep(), avg_pool(), max_pool()],
+                fixed_inputs:[layer_10_out],
+                optional_inputs: [layer_0_out, layer_1_out, layer_2_out, layer_3_out, layer_4_out, layer_5_out, layer_6_out, layer_7_out, layer_8_out, layer_9_out, layer_10_out],
+                optional_input_size: [0, 1],
+                layer_output: layer_11_out
+            }
+            )"""
+            x = global_avg_pool(layer_11_out, data_format=self.data_format)
+            if is_training:
+                x = tf.nn.dropout(x, self.keep_prob)
+            with tf.variable_scope("fc"):
+                if self.data_format == "NHWC":
+                    inp_c = x.get_shape()[3].value
+                elif self.data_format == "NCHW":
+                    inp_c = x.get_shape()[1].value
+                else:
+                    raise ValueError(
+                        "Unknown data_format {0}".format(self.data_format))
+                w = create_weight("w", [inp_c, 10])
+                x = tf.matmul(x, w)
+        return x
+    # override
+    def _build_train(self):
+        print("-" * 80)
+        print("Build train graph")
+        logits = self._model(self.x_train, is_training=True)
+        log_probs = tf.nn.sparse_softmax_cross_entropy_with_logits(
+            logits=logits, labels=self.y_train)
+        self.loss = tf.reduce_mean(log_probs)
+        self.train_preds = tf.argmax(logits, axis=1)
+        self.train_preds = tf.to_int32(self.train_preds)
+        self.train_acc = tf.equal(self.train_preds, self.y_train)
+        self.train_acc = tf.to_int32(self.train_acc)
+        self.train_acc = tf.reduce_sum(self.train_acc)
+        tf_variables = [var
+                        for var in tf.trainable_variables() if var.name.startswith(self.name)]
+        self.num_vars = count_model_params(tf_variables)
+        print("Model has {} params".format(self.num_vars))
+        self.global_step = tf.Variable(
+            0, dtype=tf.int32, trainable=False, name="global_step")
+        self.train_op, self.lr, self.grad_norm, self.optimizer = get_train_ops(
+            self.loss,
+            tf_variables,
+            self.global_step,
+            clip_mode=self.clip_mode,
+            grad_bound=self.grad_bound,
+            l2_reg=self.l2_reg,
+            lr_init=self.lr_init,
+            lr_dec_start=self.lr_dec_start,
+            lr_dec_every=self.lr_dec_every,
+            lr_dec_rate=self.lr_dec_rate,
+            lr_cosine=self.lr_cosine,
+            lr_max=self.lr_max,
+            lr_min=self.lr_min,
+            lr_T_0=self.lr_T_0,
+            lr_T_mul=self.lr_T_mul,
+            num_train_batches=self.num_train_batches,
+            optim_algo=self.optim_algo,
+            sync_replicas=False,
+            num_aggregate=self.num_aggregate,
+            num_replicas=self.num_replicas)
+    # override
+    def _build_valid(self):
+        if self.x_valid is not None:
+            print("-" * 80)
+            print("Build valid graph")
+            logits = self._model(self.x_valid, False, reuse=True)
+            self.valid_preds = tf.argmax(logits, axis=1)
+            self.valid_preds = tf.to_int32(self.valid_preds)
+            self.valid_acc = tf.equal(self.valid_preds, self.y_valid)
+            self.valid_acc = tf.to_int32(self.valid_acc)
+            self.valid_acc = tf.reduce_sum(self.valid_acc)
+    # override
+    def _build_test(self):
+        print("-" * 80)
+        print("Build test graph")
+        logits = self._model(self.x_test, False, reuse=True)
+        self.test_preds = tf.argmax(logits, axis=1)
+        self.test_preds = tf.to_int32(self.test_preds)
+        self.test_acc = tf.equal(self.test_preds, self.y_test)
+        self.test_acc = tf.to_int32(self.test_acc)
+        self.test_acc = tf.reduce_sum(self.test_acc)
+    def build_model(self):
+        self._build_train()
+        self._build_valid()
+        self._build_test()
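Review note on the `@nni.mutable_layers` annotations above: the twelve layer specs follow a regular pattern (six candidate ops per layer, the previous layer as the fixed input, and all earlier outputs as optional inputs). The sketch below rebuilds that pattern as plain dictionaries and counts the resulting architectures. It is only an illustration: `build_layer_specs` and `count_architectures` are hypothetical helpers, not part of NNI, and the count assumes `optional_input_size: [0, 1]` means "pick zero or one of the optional inputs".

```python
import math

# Candidate ops listed in every layer_choice of the annotation.
OPS = ["conv3", "conv3_sep", "conv5", "conv5_sep", "avg_pool", "max_pool"]


def build_layer_specs(num_layers=12):
    """Hypothetical helper: mirror the annotation's per-layer specs as dicts."""
    specs = []
    for k in range(num_layers):
        spec = {
            "layer_choice": list(OPS),
            # Layer 0 reads the stem output x; later layers read the previous layer.
            "fixed_inputs": ["x"] if k == 0 else ["layer_%d_out" % (k - 1)],
            "layer_output": "layer_%d_out" % k,
        }
        if k > 0:
            # Every earlier layer output is a skip-connection candidate.
            spec["optional_inputs"] = ["layer_%d_out" % j for j in range(k)]
            spec["optional_input_size"] = [0, 1]
        specs.append(spec)
    return specs


def count_architectures(specs):
    """Count op/skip combinations, assuming [lo, hi] is an inclusive range."""
    total = 1
    for spec in specs:
        candidates = spec.get("optional_inputs", [])
        low, high = spec.get("optional_input_size", [0, 0])
        skip_choices = sum(math.comb(len(candidates), n)
                           for n in range(low, high + 1))
        total *= len(spec["layer_choice"]) * skip_choices
    return total


specs = build_layer_specs()
print(count_architectures(specs))
```

Under that assumption, layer k contributes 6 * (k + 1) choices, so the fixed pooling layers inserted by `add_fixed_pooling_layer` do not enlarge the search space.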
--- a/examples/trials/nas_cifar10/src/cifar10/models.py
+++ b/examples/trials/nas_cifar10/src/cifar10/models.py
+import os
+import sys
+import numpy as np
+import tensorflow as tf
+class Model(object):
+    def __init__(self,
+                 images,
+                 labels,
+                 cutout_size=None,
+                 batch_size=32,
+                 eval_batch_size=100,
+                 clip_mode=None,
+                 grad_bound=None,
+                 l2_reg=1e-4,
+                 lr_init=0.1,
+                 lr_dec_start=0,
+                 lr_dec_every=100,
+                 lr_dec_rate=0.1,
+                 keep_prob=1.0,
+                 optim_algo=None,
+                 sync_replicas=False,
+                 num_aggregate=None,
+                 num_replicas=None,
+                 data_format="NHWC",
+                 name="generic_model",
+                 seed=None,
+                 ):
+        """
+        Args:
+                lr_dec_every: number of epochs between learning-rate decays
+                        (converted to training steps below)
+        """
+        print("-" * 80)
+        print("Build model {}".format(name))
+        self.cutout_size = cutout_size
+        self.batch_size = batch_size
+        self.eval_batch_size = eval_batch_size
+        self.clip_mode = clip_mode
+        self.grad_bound = grad_bound
+        self.l2_reg = l2_reg
+        self.lr_init = lr_init
+        self.lr_dec_start = lr_dec_start
+        self.lr_dec_rate = lr_dec_rate
+        self.keep_prob = keep_prob
+        self.optim_algo = optim_algo
+        self.sync_replicas = sync_replicas
+        self.num_aggregate = num_aggregate
+        self.num_replicas = num_replicas
+        self.data_format = data_format
+        self.name = name
+        self.seed = seed
+        self.global_step = None
+        self.valid_acc = None
+        self.test_acc = None
+        print("Build data ops")
+        with tf.device("/cpu:0"):
+            # training data
+            self.num_train_examples = np.shape(images["train"])[0]
+            self.num_train_batches = (
+                self.num_train_examples + self.batch_size - 1) // self.batch_size
+            x_train, y_train = tf.train.shuffle_batch(
+                [images["train"], labels["train"]],
+                batch_size=self.batch_size,
+                capacity=50000,
+                enqueue_many=True,
+                min_after_dequeue=0,
+                num_threads=16,
+                seed=self.seed,
+                allow_smaller_final_batch=True,
+            )
+            self.lr_dec_every = lr_dec_every * self.num_train_batches
+            def _pre_process(x):
+                x = tf.pad(x, [[4, 4], [4, 4], [0, 0]])
+                x = tf.random_crop(x, [32, 32, 3], seed=self.seed)
+                x = tf.image.random_flip_left_right(x, seed=self.seed)
+                if self.cutout_size is not None:
+                    mask = tf.ones(
+                        [self.cutout_size, self.cutout_size], dtype=tf.int32)
+                    start = tf.random_uniform(
+                        [2], minval=0, maxval=32, dtype=tf.int32)
+                    mask = tf.pad(mask, [[self.cutout_size + start[0], 32 - start[0]],
+                                         [self.cutout_size + start[1], 32 - start[1]]])
+                    mask = mask[self.cutout_size: self.cutout_size + 32,
+                                self.cutout_size: self.cutout_size + 32]
+                    mask = tf.reshape(mask, [32, 32, 1])
+                    mask = tf.tile(mask, [1, 1, 3])
+                    x = tf.where(tf.equal(mask, 0), x=x, y=tf.zeros_like(x))
+                if self.data_format == "NCHW":
+                    x = tf.transpose(x, [2, 0, 1])
+                return x
+            self.x_train = tf.map_fn(_pre_process, x_train, back_prop=False)
+            self.y_train = y_train
+            # valid data
+            self.x_valid, self.y_valid = None, None
+            if images["valid"] is not None:
+                images["valid_original"] = np.copy(images["valid"])
+                labels["valid_original"] = np.copy(labels["valid"])
+                if self.data_format == "NCHW":
+                    images["valid"] = tf.transpose(
+                        images["valid"], [0, 3, 1, 2])
+                self.num_valid_examples = np.shape(images["valid"])[0]
+                self.num_valid_batches = (
+                    (self.num_valid_examples + self.eval_batch_size - 1)
+                    // self.eval_batch_size)
+                self.x_valid, self.y_valid = tf.train.batch(
+                    [images["valid"], labels["valid"]],
+                    batch_size=self.eval_batch_size,
+                    capacity=5000,
+                    enqueue_many=True,
+                    num_threads=1,
+                    allow_smaller_final_batch=True,
+                )
+            # test data
+            if self.data_format == "NCHW":
+                images["test"] = tf.transpose(images["test"], [0, 3, 1, 2])
+            self.num_test_examples = np.shape(images["test"])[0]
+            self.num_test_batches = (
+                (self.num_test_examples + self.eval_batch_size - 1)
+                // self.eval_batch_size)
+            self.x_test, self.y_test = tf.train.batch(
+                [images["test"], labels["test"]],
+                batch_size=self.eval_batch_size,
+                capacity=10000,
+                enqueue_many=True,
+                num_threads=1,
+                allow_smaller_final_batch=True,
+            )
+        # cache images and labels
+        self.images = images
+        self.labels = labels
+    def eval_once(self, sess, eval_set, child_model, verbose=False):
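+        """Expects self.global_step and self.valid_acc / self.test_acc to be defined.
+        Args:
+                sess: tf.Session() or one of its wrappers.
+                eval_set: "valid" or "test"
+                child_model: the child model being evaluated (unused here).
+                verbose: if True, print running correct/total counts.
+        Returns:
+                accuracy on eval_set as a float.
+        """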
+        assert self.global_step is not None
+        global_step = sess.run(self.global_step)
+        print("Eval at {}".format(global_step))
+        if eval_set == "valid":
+            assert self.x_valid is not None
+            assert self.valid_acc is not None
+            num_examples = self.num_valid_examples
+            num_batches = self.num_valid_batches
+            acc_op = self.valid_acc
+        elif eval_set == "test":
+            assert self.test_acc is not None
+            num_examples = self.num_test_examples
+            num_batches = self.num_test_batches
+            acc_op = self.test_acc
+        else:
+            raise NotImplementedError("Unknown eval_set '{}'".format(eval_set))
+        total_acc = 0
+        total_exp = 0
+        for batch_id in range(num_batches):
+            acc = sess.run(acc_op)
+            total_acc += acc
+            total_exp += self.eval_batch_size
+            if verbose:
+                sys.stdout.write(
+                    "\r{:<5d}/{:>5d}".format(total_acc, total_exp))
+        if verbose:
+            print("")
+        print("{}_accuracy: {:<6.4f}".format(
+            eval_set, float(total_acc) / total_exp))
+        return float(total_acc) / total_exp
+    def _model(self, images, is_training, reuse=None):
+        raise NotImplementedError("Abstract method")
+    def _build_train(self):
+        raise NotImplementedError("Abstract method")
+    def _build_valid(self):
+        raise NotImplementedError("Abstract method")
+    def _build_test(self):
+        raise NotImplementedError("Abstract method")
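Review note on `_pre_process` in `models.py`: the cutout mask is built with a pad-then-crop trick, which is easy to misread. The sketch below reproduces the same index arithmetic with plain Python lists so the geometry can be checked by hand. `cutout_mask` is a hypothetical name used only here; the real code does this with `tf.pad` and slicing on tensors.

```python
def cutout_mask(cutout_size, start, image_size=32):
    """Return an image_size x image_size 0/1 mask; 1 marks pixels to zero out.

    Mirrors the pad-then-crop steps in _pre_process:
      1. a cutout_size x cutout_size block of ones is padded with zeros so it
         sits at offset (cutout_size + start) on a larger canvas;
      2. the canvas is cropped back to image_size starting at cutout_size,
         which places the block at `start` and clips it at the border.
    """
    row0, col0 = start
    padded = 2 * cutout_size + image_size  # canvas side after padding
    canvas = [[0] * padded for _ in range(padded)]
    for r in range(cutout_size + row0, 2 * cutout_size + row0):
        for c in range(cutout_size + col0, 2 * cutout_size + col0):
            canvas[r][c] = 1
    rows = canvas[cutout_size:cutout_size + image_size]
    return [row[cutout_size:cutout_size + image_size] for row in rows]


# A 4x4 cutout starting two rows above the bottom edge is clipped to 2x4.
mask = cutout_mask(4, (30, 0))
print(sum(map(sum, mask)))
```

The source then keeps a pixel where the mask is 0 and writes zeros where it is 1, via `tf.where(tf.equal(mask, 0), x=x, y=tf.zeros_like(x))`.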
--- a/examples/trials/nas_cifar10/src/cifar10/nni_child_cifar10.py
+++ b/examples/trials/nas_cifar10/src/cifar10/nni_child_cifar10.py
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import os
+import shutil
+import logging
+import tensorflow as tf
+from src.cifar10.data_utils import read_data
+from src.cifar10.general_child import GeneralChild
+import src.cifar10_flags
+from src.cifar10_flags import FLAGS
+def build_logger(log_name):
+    logger = logging.getLogger(log_name)
+    logger.setLevel(logging.DEBUG)
+    fh = logging.FileHandler(log_name+'.log')
+    fh.setLevel(logging.DEBUG)
+    logger.addHandler(fh)
+    return logger
+logger = build_logger("nni_child_cifar10")
+def build_trial(images, labels, ChildClass):
+    '''Build child class'''
+    child_model = ChildClass(
+        images,
+        labels,
+        use_aux_heads=FLAGS.child_use_aux_heads,
+        cutout_size=FLAGS.child_cutout_size,
+        num_layers=FLAGS.child_num_layers,
+        num_cells=FLAGS.child_num_cells,
+        num_branches=FLAGS.child_num_branches,
+        fixed_arc=FLAGS.child_fixed_arc,
+        out_filters_scale=FLAGS.child_out_filters_scale,
+        out_filters=FLAGS.child_out_filters,
+        keep_prob=FLAGS.child_keep_prob,
+        drop_path_keep_prob=FLAGS.child_drop_path_keep_prob,
+        num_epochs=FLAGS.num_epochs,
+        l2_reg=FLAGS.child_l2_reg,
+        data_format=FLAGS.data_format,
+        batch_size=FLAGS.batch_size,
+        clip_mode="norm",
+        grad_bound=FLAGS.child_grad_bound,
+        lr_init=FLAGS.child_lr,
+        lr_dec_every=FLAGS.child_lr_dec_every,
+        lr_dec_rate=FLAGS.child_lr_dec_rate,
+        lr_cosine=FLAGS.child_lr_cosine,
+        lr_max=FLAGS.child_lr_max,
+        lr_min=FLAGS.child_lr_min,
+        lr_T_0=FLAGS.child_lr_T_0,
+        lr_T_mul=FLAGS.child_lr_T_mul,
+        optim_algo="momentum",
+        sync_replicas=FLAGS.child_sync_replicas,
+        num_aggregate=FLAGS.child_num_aggregate,
+        num_replicas=FLAGS.child_num_replicas
+    )
+    return child_model
+def get_child_ops(child_model):
+    '''Assemble the child model's ops into a dict'''
+    child_ops = {
+        "global_step": child_model.global_step,
+        "loss": child_model.loss,
+        "train_op": child_model.train_op,
+        "lr": child_model.lr,
+        "grad_norm": child_model.grad_norm,
+        "train_acc": child_model.train_acc,
+        "optimizer": child_model.optimizer,
+        "num_train_batches": child_model.num_train_batches,
+        "eval_every": child_model.num_train_batches * FLAGS.eval_every_epochs,
+        "eval_func": child_model.eval_once,
+    }
+    return child_ops
+class NASTrial():
+    def __init__(self):
+        images, labels = read_data(FLAGS.data_path, num_valids=0)
+        self.output_dir = os.path.join(os.getenv('NNI_OUTPUT_DIR'), '../..')
+        self.file_path = os.path.join(
+            self.output_dir, 'trainable_variable.txt')
+        self.graph = tf.Graph()
+        with self.graph.as_default():
+            self.child_model = build_trial(images, labels, GeneralChild)
+            self.total_data = {}
+            self.child_model.build_model()
+            self.child_ops = get_child_ops(self.child_model)
+            config = tf.ConfigProto(
+                intra_op_parallelism_threads=0,
+                inter_op_parallelism_threads=0,
+                allow_soft_placement=True)
+            self.sess = tf.train.SingularMonitoredSession(config=config)
+        logger.debug('Initialize NASTrial done.')
+    def run_one_step(self):
+        '''Run this model on a batch of data'''
+        run_ops = [
+            self.child_ops["loss"],
+            self.child_ops["lr"],
+            self.child_ops["grad_norm"],
+            self.child_ops["train_acc"],
+            self.child_ops["train_op"],
+        ]
+        loss, lr, gn, tr_acc, _ = self.sess.run(run_ops)
+        global_step = self.sess.run(self.child_ops["global_step"])
+        log_string = ""
+        log_string += "ch_step={:<6d}".format(global_step)
+        log_string += " loss={:<8.6f}".format(loss)
+        log_string += " lr={:<8.4f}".format(lr)
+        log_string += " |g|={:<8.4f}".format(gn)
+        log_string += " tr_acc={:<3d}/{:>3d}".format(tr_acc, FLAGS.batch_size)
+        if int(global_step) % FLAGS.log_every == 0:
+            logger.debug(log_string)
+        return loss, global_step
+    def run(self):
+        '''Run this model for the number of epochs set by `num_epochs` in FLAGS'''
+        max_acc = 0
+        while True:
+            _, global_step = self.run_one_step()
+            if global_step % self.child_ops['num_train_batches'] == 0:
+                acc = self.child_ops["eval_func"](
+                    self.sess, "test", self.child_model)
+                max_acc = max(max_acc, acc)
+                '''@nni.report_intermediate_result(acc)'''
+            if global_step / self.child_ops['num_train_batches'] >= FLAGS.num_epochs:
+                '''@nni.report_final_result(max_acc)'''
+                break
+def main(_):
+    logger.debug("-" * 80)
+    if not os.path.isdir(FLAGS.output_dir):
+        logger.debug(
+            "Path {} does not exist. Creating.".format(FLAGS.output_dir))
+        os.makedirs(FLAGS.output_dir)
+    elif FLAGS.reset_output_dir:
+        logger.debug(
+            "Path {} exists. Remove and remake.".format(FLAGS.output_dir))
+        shutil.rmtree(FLAGS.output_dir)
+        os.makedirs(FLAGS.output_dir)
+    logger.debug("-" * 80)
+    trial = NASTrial()
+    trial.run()
+if __name__ == "__main__":
+    tf.app.run()
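Review note on `NASTrial.run`: the loop evaluates once per epoch boundary, reports each accuracy as an intermediate result, and reports the best accuracy seen as the final result. The sketch below isolates that control flow without TensorFlow or NNI. `run_schedule` and `eval_fn` are hypothetical stand-ins, and the returned list takes the place of the `@nni.report_intermediate_result` annotation.

```python
def run_schedule(num_train_batches, num_epochs, eval_fn):
    """Replay the step/epoch bookkeeping of NASTrial.run.

    eval_fn(global_step) stands in for child_ops["eval_func"]; the list of
    returned accuracies stands in for the intermediate results, and max_acc
    for the final result.
    """
    reported = []
    max_acc = 0.0
    global_step = 0
    while True:
        global_step += 1  # stands in for one call to run_one_step()
        if global_step % num_train_batches == 0:
            # End of an epoch: evaluate and track the best accuracy so far.
            acc = eval_fn(global_step)
            max_acc = max(max_acc, acc)
            reported.append(acc)
        if global_step / num_train_batches >= num_epochs:
            return reported, max_acc


# Three epochs of five batches each, with made-up per-epoch accuracies.
fake_accs = [0.2, 0.5, 0.4]
reported, best = run_schedule(5, 3, lambda step: fake_accs[step // 5 - 1])
print(reported, best)
```

Note that the final result is the maximum over epochs rather than the last epoch's accuracy, which matches `max_acc` in the source.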