
New NAS algorithm: Blockwise DNAS FBNet (#3532)

FBNet
======
For mobile facial-landmark applications, we apply FBNet (block-wise DNAS) on top of the basic PFLD architecture to design a concise model with a good trade-off between latency and accuracy. References:
* `FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search <https://arxiv.org/abs/1812.03443>`__
* `PFLD: A Practical Facial Landmark Detector <https://arxiv.org/abs/1902.10859>`__
FBNet is a block-wise differentiable NAS method (block-wise DNAS), in which the best candidate building blocks are chosen by Gumbel-Softmax random sampling and differentiable training. At each layer (or stage) to be searched, the candidate blocks are placed side by side (similar in effect to structural re-parameterization), which allows the supernet to be pre-trained sufficiently. The pre-trained supernet is then sampled to obtain a subnet, which is fine-tuned for better performance. A minimal sketch of the mixed operation is given below.
.. image:: ../../img/fbnet.png
   :target: ../../img/fbnet.png
   :alt:
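As a minimal sketch of the idea (for illustration only; the actual implementation is the ``MixedOp`` class shipped in ``nni.algorithms.nas.pytorch.fbnet`` and shown later in this commit, and the class and argument names here are made up), each searchable layer computes a Gumbel-Softmax-weighted sum over its candidate blocks, so block weights and architecture weights can be trained jointly by gradient descent:

.. code-block:: python

   import torch
   import torch.nn as nn
   import torch.nn.functional as F

   class MixedLayer(nn.Module):
       """Gumbel-Softmax relaxation over candidate blocks (illustrative)."""

       def __init__(self, candidate_blocks, temperature=1.0):
           super().__init__()
           self.blocks = nn.ModuleList(candidate_blocks)
           # one architecture logit per candidate block
           self.alpha = nn.Parameter(torch.zeros(len(self.blocks)))
           self.temperature = temperature

       def forward(self, x):
           # random but differentiable sampling of the path probabilities
           probs = F.gumbel_softmax(self.alpha, tau=self.temperature)
           return sum(p * block(x) for p, block in zip(probs, self.blocks))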
PFLD is a lightweight facial-landmark model for real-time applications. The PFLD architecture is first simplified for acceleration, using the stem block of PeleeNet, average pooling implemented with depthwise convolution, and the eSE module.
To achieve a better trade-off between latency and accuracy, FBNet is then applied to the simplified PFLD to search for the best block at each layer. The search space is based on the FBNet space, and is optimized for mobile deployment, e.g., by using average pooling implemented with depthwise convolution and the eSE module. The latency-aware objective is summarized below.
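The latency-aware objective follows the FBNet paper: the task loss is either scaled (``mode="mul"``) or shifted (``mode="add"``) by a differentiable latency term accumulated from the lookup table, with ``alpha`` and ``beta`` as the coefficient and exponent (see the training code later in this commit). The multiplicative form used in the paper is

.. math::

   \mathcal{L}(a, w_a) = \mathrm{CE}(a, w_a) \cdot \alpha \, \log\big(\mathrm{LAT}(a)\big)^{\beta}

where :math:`\mathrm{CE}` is the task loss of architecture :math:`a` with weights :math:`w_a` and :math:`\mathrm{LAT}(a)` is the expected latency of the sampled architecture.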
Experiments
------------
To verify the effectiveness of FBNet applied to PFLD, we choose an open-source dataset with 106 landmark points as the benchmark:
* `Grand Challenge of 106-Point Facial Landmark Localization <https://arxiv.org/abs/1905.03469>`__
The baseline model is denoted MobileNet-V3 PFLD (`Reference baseline <https://github.com/Hsintao/pfld_106_face_landmarks>`__), and the searched model is denoted Subnet. The experimental results are listed below, with latency measured on a Qualcomm 625 CPU (ARMv8):
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Model
     - Size
     - Latency
     - Validation NME
   * - MobileNet-V3 PFLD
     - 1.01 MB
     - 10 ms
     - 6.22%
   * - Subnet
     - 693 KB
     - 1.60 ms
     - 5.58%
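Here NME is the normalized mean error as computed by ``accuracy`` in ``lib/utils.py`` (shown later in this commit): the mean point-to-point L2 distance between predicted and ground-truth landmarks, normalized by the inter-ocular distance :math:`d_{IO}` (for 106 points, the distance between landmarks 35 and 93) and the number of points :math:`L`:

.. math::

   \mathrm{NME} = \frac{1}{N} \sum_{i=1}^{N} \frac{\sum_{l=1}^{L} \lVert p_{i,l} - g_{i,l} \rVert_2}{d_{IO} \cdot L}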
Example
--------
`Example code <https://github.com/microsoft/nni/tree/master/examples/nas/oneshot/pfld>`__
Please run the following scripts from the example directory.
The Python dependencies are listed below:

.. code-block:: bash

   numpy==1.18.5
   opencv-python==4.5.1.48
   torch==1.6.0
   torchvision==0.7.0
   onnx==1.8.1
   onnx-simplifier==0.3.5
   onnxruntime==1.7.0
Data Preparation
-----------------
First, download the `106points dataset <https://drive.google.com/file/d/1I7QdnLxAlyG2Tq3L66QYzGhiBEoVfzKo/view?usp=sharing>`__ to the path ``./data/106points``. The dataset includes the train set and the test set:

.. code-block:: bash

   ./data/106points/train_data/imgs
   ./data/106points/train_data/list.txt
   ./data/106points/test_data/imgs
   ./data/106points/test_data/list.txt
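Each line of ``list.txt`` holds one sample. Judging from the data loader (``datasets.py`` in this commit), a line consists of the relative image path, 106 ``(x, y)`` landmark coordinates (212 values), and 3 Euler angles, separated by spaces. An illustrative (not real) entry:

.. code-block:: bash

   imgs/00001.png 0.31 0.42 0.33 0.45 ... 0.77 0.69 -3.1 5.2 0.8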
Quick Start
-----------
1. Search
^^^^^^^^^^
Based on the architecture of the simplified PFLD, the multi-stage search space and the hyper-parameters for searching should first be configured to construct the supernet. For example:

.. code-block:: python

   from lib.builder import search_space
   from lib.ops import PRIMITIVES
   from lib.supernet import PFLDInference, AuxiliaryNet
   from nni.algorithms.nas.pytorch.fbnet import LookUpTable, NASConfig

   # configuration of hyper-parameters
   # search_space defines the multi-stage search space
   nas_config = NASConfig(
       model_dir="./ckpt_save",
       nas_lr=0.01,
       mode="mul",
       alpha=0.25,
       beta=0.6,
       search_space=search_space,
   )
   # lookup table to manage the information
   lookup_table = LookUpTable(config=nas_config, primitives=PRIMITIVES)
   # create the supernet
   pfld_backbone = PFLDInference(lookup_table)
After creating the supernet with the specified search space and hyper-parameters, we can run the command below to start searching and training the supernet:

.. code-block:: bash

   python train.py --dev_id "0,1" --snapshot "./ckpt_save" --data_root "./data/106points"
The validation accuracy will be shown during training, and the model with the best accuracy will be saved as ``./ckpt_save/supernet/checkpoint_best.pth``.
2. Finetune
^^^^^^^^^^^^
After pre-training of the supernet, we can run the command below to sample the subnet and conduct fine-tuning:

.. code-block:: bash

   python retrain.py --dev_id "0,1" --snapshot "./ckpt_save" --data_root "./data/106points" \
                     --supernet "./ckpt_save/supernet/checkpoint_best.pth"
The validation accuracy will be shown during training, and the model with the best accuracy will be saved as ``./ckpt_save/subnet/checkpoint_best.pth``.
3. Export
^^^^^^^^^^
After fine-tuning of the subnet, we can run the command below to export the ONNX model:

.. code-block:: bash

   python export.py --supernet "./ckpt_save/supernet/checkpoint_best.pth" \
                    --resume "./ckpt_save/subnet/checkpoint_best.pth"
The ONNX model is saved as ``./output/subnet.onnx``, and can be further converted for a mobile inference engine by using `MNN <https://github.com/alibaba/MNN>`__, as sketched below.
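For example, a rough sketch of the conversion using MNN's ``MNNConvert`` tool (flags as documented by MNN; the output path here is arbitrary and not produced by this example):

.. code-block:: bash

   ./MNNConvert -f ONNX --modelFile ./output/subnet.onnx --MNNModel ./output/subnet.mnn --bizCode MNN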
The checkpoints of the pre-trained supernet and subnet are provided below:
* `Supernet <https://drive.google.com/file/d/1TCuWKq8u4_BQ84BWbHSCZ45N3JGB9kFJ/view?usp=sharing>`__
* `Subnet <https://drive.google.com/file/d/160rkuwB7y7qlBZNM3W_T53cb6MQIYHIE/view?usp=sharing>`__
* `ONNX model <https://drive.google.com/file/d/1s-v-aOiMv0cqBspPVF3vSGujTbn_T_Uo/view?usp=sharing>`__
NNI currently supports the one-shot NAS algorithms listed below and is adding more.

     - `Cyclic Differentiable Architecture Search <https://arxiv.org/pdf/2006.10724.pdf>`__ builds a cyclic feedback mechanism between the search and evaluation networks. It introduces a cyclic differentiable architecture search framework which integrates the two networks into a unified architecture.
   * - `ProxylessNAS <Proxylessnas.rst>`__
     - `ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware <https://arxiv.org/abs/1812.00332>`__. It removes the proxy and directly learns architectures for large-scale target tasks and target hardware platforms.
   * - `FBNet <FBNet.rst>`__
     - `FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search <https://arxiv.org/abs/1812.03443>`__. It is a block-wise differentiable neural architecture search method with a hardware-aware constraint.
   * - `TextNAS <TextNAS.rst>`__
     - `TextNAS: A Neural Architecture Search Space tailored for Text Representation <https://arxiv.org/pdf/1912.10729.pdf>`__. It is a neural architecture search algorithm tailored for text representation.
   * - `Cream <Cream.rst>`__

One-shot NAS algorithms leverage weight sharing among models in neural architecture search.

   SPOS <SPOS>
   CDARTS <CDARTS>
   ProxylessNAS <Proxylessnas>
   FBNet <FBNet>
   TextNAS <TextNAS>
   Cream <Cream>
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import cv2
import os
import numpy as np
from torch.utils import data
class PFLDDatasets(data.Dataset):
""" Dataset to manage the data loading, augmentation and generation. """
def __init__(self, file_list, transforms=None, data_root="", img_size=112):
"""
Parameters
----------
file_list : list
a list of file path and annotations
transforms : function
function for data augmentation
data_root : str
the root path of dataset
img_size : int
the size of image height or width
"""
self.line = None
self.path = None
self.img_size = img_size
self.land = None
self.angle = None
self.data_root = data_root
self.transforms = transforms
with open(file_list, "r") as f:
self.lines = f.readlines()
def __getitem__(self, index):
""" Get the data sample and labels with the index. """
self.line = self.lines[index].strip().split()
# load image
if self.data_root:
self.img = cv2.imread(os.path.join(self.data_root, self.line[0]))
else:
self.img = cv2.imread(self.line[0])
# resize
self.img = cv2.resize(self.img, (self.img_size, self.img_size))
# obtain gt labels
self.land = np.asarray(self.line[1: (106 * 2 + 1)], dtype=np.float32)
self.angle = np.asarray(self.line[(106 * 2 + 1):], dtype=np.float32)
# augmentation
if self.transforms:
self.img = self.transforms(self.img)
return self.img, self.land, self.angle
def __len__(self):
""" Get the size of dataset. """
return len(self.lines)
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import argparse
import onnx
import onnxsim
import os
import torch
from lib.builder import search_space
from lib.ops import PRIMITIVES
from nni.algorithms.nas.pytorch.fbnet import (
LookUpTable,
NASConfig,
model_init,
)
parser = argparse.ArgumentParser(description="Export the ONNX model")
parser.add_argument("--net", default="subnet", type=str)
parser.add_argument("--supernet", default="", type=str, metavar="PATH")
parser.add_argument("--resume", default="", type=str, metavar="PATH")
parser.add_argument("--num_points", default=106, type=int)
parser.add_argument("--img_size", default=112, type=int)
parser.add_argument("--onnx", default="./output/pfld.onnx", type=str)
parser.add_argument("--onnx_sim", default="./output/subnet.onnx", type=str)
args = parser.parse_args()
os.makedirs("./output", exist_ok=True)
if args.net == "subnet":
from lib.subnet import PFLDInference
else:
raise ValueError("Network is not implemented")
check = torch.load(args.supernet, map_location=torch.device("cpu"))
sampled_arch = check["arch_sample"]
nas_config = NASConfig(search_space=search_space)
lookup_table = LookUpTable(config=nas_config, primitives=PRIMITIVES)
pfld_backbone = PFLDInference(lookup_table, sampled_arch, args.num_points)
pfld_backbone.eval()
check_sub = torch.load(args.resume, map_location=torch.device("cpu"))
param_dict = check_sub["pfld_backbone"]
model_init(pfld_backbone, param_dict)
print("Convert PyTorch model to ONNX.")
dummy_input = torch.randn(1, 3, args.img_size, args.img_size)
input_names = ["input"]
output_names = ["output"]
torch.onnx.export(
pfld_backbone,
dummy_input,
args.onnx,
verbose=True,
input_names=input_names,
output_names=output_names,
)
print("Check ONNX model.")
model = onnx.load(args.onnx)
print("Simplifying the ONNX model.")
model_opt, check = onnxsim.simplify(args.onnx)
assert check, "Simplified ONNX model could not be validated"
onnx.save(model_opt, args.onnx_sim)
print("Onnx model simplify Ok!")
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
search_space = {
# multi-stage definition for candidate layers
# here two stages are defined for PFLD searching
"stages": {
"stage_0": {
"ops": [
"mb_k3_res",
"mb_k3_e2_res",
"mb_k3_res_d3",
"mb_k5_res",
"mb_k5_e2_res",
"sep_k3",
"sep_k5",
"gh_k3",
"gh_k5",
],
"layer_num": 2,
},
"stage_1": {
"ops": [
"mb_k3_e2_res",
"mb_k3_e4_res",
"mb_k3_e2_res_se",
"mb_k3_res_d3",
"mb_k5_res",
"mb_k5_e2_res",
"mb_k5_res_se",
"mb_k5_e2_res_se",
"gh_k5",
],
"layer_num": 3,
},
},
# layer information needed for NAS
# each input shape is (input_channels, height, width)
"input_shape": [
(32, 14, 14),
(32, 14, 14),
(32, 14, 14),
(64, 7, 7),
(64, 7, 7),
],
# output channels for each layer
"channel_size": [32, 32, 64, 64, 64],
# stride for each layer
"strides": [1, 1, 2, 1, 1],
# height of feature map for each layer
"fm_size": [14, 14, 7, 7, 7],
}
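# Editor's sketch (illustrative, not part of this file): the per-layer configs
# consumed by SingleOperation / choice_blocks in lib/ops.py have the form
# [input_channels, output_channels, stride, fm_size], which can be read off
# this dict by zipping the lists above, e.g.:
#   layer_configs = [
#       [in_shape[0], c_out, s, fm]
#       for in_shape, c_out, s, fm in zip(
#           search_space["input_shape"],
#           search_space["channel_size"],
#           search_space["strides"],
#           search_space["fm_size"],
#       )
#   ]  # -> [[32, 32, 1, 14], [32, 32, 1, 14], [32, 64, 2, 7], ...]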
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import torch
import torch.nn as nn
import torch.nn.functional as F
# Basic primitives as the network path
PRIMITIVES = {
"skip": lambda c_in, c_out, stride, **kwargs: Identity(
c_in, c_out, stride, **kwargs
),
"conv1x1": lambda c_in, c_out, stride, **kwargs: Conv1x1(
c_in, c_out, stride, **kwargs
),
"depth_conv": lambda c_in, c_out, stride, **kwargs: DepthConv(
c_in, c_out, stride, **kwargs
),
"sep_k3": lambda c_in, c_out, stride, **kwargs: SeparableConv(
c_in, c_out, stride, **kwargs
),
"sep_k5": lambda c_in, c_out, stride, **kwargs: SeparableConv(
c_in, c_out, stride, kernel=5, **kwargs
),
"gh_k3": lambda c_in, c_out, stride, **kwargs: GhostModule(
c_in, c_out, stride, **kwargs
),
"gh_k5": lambda c_in, c_out, stride, **kwargs: GhostModule(
c_in, c_out, stride, kernel=5, **kwargs
),
"mb_k3": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=3, expand=1, **kwargs
),
"mb_k3_e2": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=3, expand=2, **kwargs
),
"mb_k3_e4": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=3, expand=4, **kwargs
),
"mb_k3_res": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=3, expand=1, res=True, **kwargs
),
"mb_k3_e2_res": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=3, expand=2, res=True, **kwargs
),
"mb_k3_e4_res": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=3, expand=4, res=True, **kwargs
),
"mb_k3_d2": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=3,
expand=2,
res=False,
dilation=2,
**kwargs,
),
"mb_k3_d3": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=3,
expand=2,
res=False,
dilation=3,
**kwargs,
),
"mb_k3_res_d2": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=3,
expand=2,
res=True,
dilation=2,
**kwargs,
),
"mb_k3_res_d3": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=3,
expand=2,
res=True,
dilation=3,
**kwargs,
),
"mb_k3_res_se": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=3,
expand=1,
res=True,
dilation=1,
se=True,
**kwargs,
),
"mb_k3_e2_res_se": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=3,
expand=2,
res=True,
dilation=1,
se=True,
**kwargs,
),
"mb_k3_e4_res_se": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=3,
expand=4,
res=True,
dilation=1,
se=True,
**kwargs,
),
"mb_k5": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=5, expand=1, **kwargs
),
"mb_k5_e2": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=5, expand=2, **kwargs
),
"mb_k5_res": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=5, expand=1, res=True, **kwargs
),
"mb_k5_e2_res": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=5, expand=2, res=True, **kwargs
),
"mb_k5_res_se": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=5,
expand=1,
res=True,
dilation=1,
se=True,
**kwargs,
),
"mb_k5_e2_res_se": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=5,
expand=2,
res=True,
dilation=1,
se=True,
**kwargs,
),
}
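# Illustrative usage (editor's note, not part of the library): every entry is a
# factory taking (c_in, c_out, stride, **kwargs), so a block is built as e.g.
#   block = PRIMITIVES["mb_k5_e2_res"](32, 32, 1, fm_size=14)
#   y = block(torch.randn(1, 32, 14, 14))  # output shape: (1, 32, 14, 14)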
def conv_bn(inp, oup, kernel, stride, pad=1, groups=1):
return nn.Sequential(
nn.Conv2d(inp, oup, kernel, stride, pad, groups=groups, bias=False),
nn.BatchNorm2d(oup),
nn.ReLU(inplace=True),
)
class SeparableConv(nn.Module):
"""Separable convolution."""
def __init__(self, in_ch, out_ch, stride=1, kernel=3, fm_size=7):
super(SeparableConv, self).__init__()
assert stride in [1, 2], "stride should be in [1, 2]"
pad = kernel // 2
self.conv = nn.Sequential(
conv_bn(in_ch, in_ch, kernel, stride, pad=pad, groups=in_ch),
conv_bn(in_ch, out_ch, 1, 1, pad=0),
)
def forward(self, x):
return self.conv(x)
class Conv1x1(nn.Module):
"""1x1 convolution."""
def __init__(self, in_ch, out_ch, stride=1, kernel=1, fm_size=7):
super(Conv1x1, self).__init__()
assert stride in [1, 2], "stride should be in [1, 2]"
padding = kernel // 2
self.conv = nn.Sequential(
nn.Conv2d(in_ch, out_ch, kernel, stride, padding),
nn.ReLU(inplace=True),
)
def forward(self, x):
return self.conv(x)
class DepthConv(nn.Module):
"""depth convolution."""
def __init__(self, in_ch, out_ch, stride=1, kernel=3, fm_size=7):
super(DepthConv, self).__init__()
assert stride in [1, 2], "stride should be in [1, 2]"
padding = kernel // 2
self.conv = nn.Sequential(
nn.Conv2d(in_ch, in_ch, kernel, stride, padding, groups=in_ch),
nn.ReLU(inplace=True),
nn.Conv2d(in_ch, out_ch, 1, 1, 0),
nn.ReLU(inplace=True),
)
def forward(self, x):
return self.conv(x)
class GhostModule(nn.Module):
"""Gost module."""
def __init__(self, in_ch, out_ch, stride=1, kernel=3, fm_size=7):
super(GhostModule, self).__init__()
mid_ch = out_ch // 2
self.primary_conv = conv_bn(in_ch, mid_ch, 1, stride, pad=0)
self.cheap_operation = conv_bn(
mid_ch, mid_ch, kernel, 1, kernel // 2, mid_ch
)
def forward(self, x):
x1 = self.primary_conv(x)
x2 = self.cheap_operation(x1)
return torch.cat([x1, x2], dim=1)
class StemBlock(nn.Module):
def __init__(self, in_ch=3, init_ch=32, bottleneck=True):
super(StemBlock, self).__init__()
self.stem_1 = conv_bn(in_ch, init_ch, 3, 2, 1)
mid_ch = int(init_ch // 2) if bottleneck else init_ch
self.stem_2a = conv_bn(init_ch, mid_ch, 1, 1, 0)
self.stem_2b = SeparableConv(mid_ch, init_ch, 2, 1)
self.stem_2p = nn.MaxPool2d(kernel_size=2, stride=2)
self.stem_3 = conv_bn(init_ch * 2, init_ch, 1, 1, 0)
def forward(self, x):
stem_1_out = self.stem_1(x)
stem_2a_out = self.stem_2a(stem_1_out)
stem_2b_out = self.stem_2b(stem_2a_out)
stem_2p_out = self.stem_2p(stem_1_out)
out = self.stem_3(torch.cat((stem_2b_out, stem_2p_out), 1))
return out, stem_1_out
class Identity(nn.Module):
""" Identity module."""
def __init__(self, in_ch, out_ch, stride=1, fm_size=7):
super(Identity, self).__init__()
self.conv = (
conv_bn(in_ch, out_ch, kernel=1, stride=stride, pad=0)
if in_ch != out_ch or stride != 1
else None
)
def forward(self, x):
if self.conv:
out = self.conv(x)
else:
out = x
# Add dropout to avoid over-reliance on the identity path (as in P-DARTS);
# apply it only in training mode
out = nn.functional.dropout(out, p=0.5, training=self.training)
return out
class Hsigmoid(nn.Module):
"""Hsigmoid activation function."""
def __init__(self, inplace=True):
super(Hsigmoid, self).__init__()
self.inplace = inplace
def forward(self, x):
return F.relu6(x + 3.0, inplace=self.inplace) / 6.0
class eSEModule(nn.Module):
""" The improved SE Module."""
def __init__(self, channel, fm_size=7, se=True):
super(eSEModule, self).__init__()
self.se = se
if self.se:
self.avg_pool = nn.Conv2d(
channel, channel, fm_size, 1, 0, groups=channel, bias=False
)
self.fc = nn.Conv2d(channel, channel, kernel_size=1, padding=0)
self.hsigmoid = Hsigmoid()
def forward(self, x):
if self.se:
input = x
x = self.avg_pool(x)
x = self.fc(x)
x = self.hsigmoid(x)
return input * x
else:
return x
class ChannelShuffle(nn.Module):
"""Procedure: [N,C,H,W] -> [N,g,C/g,H,W] -> [N,C/g,g,H,w] -> [N,C,H,W]."""
def __init__(self, groups):
super(ChannelShuffle, self).__init__()
self.groups = groups
def forward(self, x):
if self.groups == 1:
return x
N, C, H, W = x.size()
g = self.groups
assert C % g == 0, "group size {} is not for channel {}".format(g, C)
return (
x.view(N, g, int(C // g), H, W)
.permute(0, 2, 1, 3, 4)
.contiguous()
.view(N, C, H, W)
)
class MBBlock(nn.Module):
"""The Inverted Residual Block, with channel shuffle or eSEModule."""
def __init__(
self,
in_ch,
out_ch,
stride=1,
kernel=3,
expand=1,
res=False,
dilation=1,
se=False,
fm_size=7,
group=1,
mid_ch=-1,
):
super(MBBlock, self).__init__()
assert stride in [1, 2], "stride should be in [1, 2]"
assert kernel in [3, 5], "kernel size should be in [3, 5]"
assert dilation in [1, 2, 3, 4], "dilation should be in [1, 2, 3, 4]"
assert group in [1, 2], "group should be in [1, 2]"
self.use_res_connect = res and (stride == 1)
padding = kernel // 2 + (dilation - 1)
mid_ch = mid_ch if mid_ch > 0 else (in_ch * expand)
# Basic Modules
conv_layer = nn.Conv2d
norm_layer = nn.BatchNorm2d
activation_layer = nn.ReLU
channel_shuffle = ChannelShuffle
se_layer = eSEModule
self.ir_block = nn.Sequential(
# pointwise convolution
conv_layer(in_ch, mid_ch, 1, 1, 0, bias=False, groups=group),
norm_layer(mid_ch),
activation_layer(inplace=True),
# channel shuffle if necessary
channel_shuffle(group),
# depthwise convolution
conv_layer(
mid_ch,
mid_ch,
kernel,
stride,
padding=padding,
dilation=dilation,
groups=mid_ch,
bias=False,
),
norm_layer(mid_ch),
# eSEModule if necessary
se_layer(mid_ch, fm_size, se),
activation_layer(inplace=True),
# pointwise convolution
conv_layer(mid_ch, out_ch, 1, 1, 0, bias=False, groups=group),
norm_layer(out_ch),
)
def forward(self, x):
if self.use_res_connect:
return x + self.ir_block(x)
else:
return self.ir_block(x)
class SingleOperation(nn.Module):
"""Single operation for sampled path."""
def __init__(self, layers_configs, stage_ops, sampled_op=""):
"""
Parameters
----------
layers_configs : list
the layer config: [input_channel, output_channel, stride, height]
stage_ops : dict
the pairs of op name and layer operator
sampled_op : str
the searched layer name
"""
super(SingleOperation, self).__init__()
fm = {"fm_size": layers_configs[3]}
ops_names = [op_name for op_name in stage_ops]
sampled_op = sampled_op if sampled_op else ops_names[0]
# define the single op
self.op = stage_ops[sampled_op](*layers_configs[0:3], **fm)
def forward(self, x):
return self.op(x)
def choice_blocks(layers_configs, stage_ops):
"""
Create list of layer candidates for NNI one-shot NAS.
Parameters
----------
layers_configs : list
the layer config: [input_channel, output_channel, stride, height]
stage_ops : dict
the pairs of op name and layer operator
Returns
-------
output: list
list of layer operators
"""
ops_names = [op for op in stage_ops]
fm = {"fm_size": layers_configs[3]}
op_list = [stage_ops[op](*layers_configs[0:3], **fm) for op in ops_names]
return op_list
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import torch
import torch.nn as nn
from lib.ops import (
MBBlock,
SeparableConv,
SingleOperation,
StemBlock,
conv_bn,
)
from torch.nn import init
INIT_CH = 16
class PFLDInference(nn.Module):
""" The subnet with the architecture of PFLD. """
def __init__(self, lookup_table, sampled_ops, num_points=106):
"""
Parameters
----------
lookup_table : class
to manage the candidate ops, layer information and layer perf
sampled_ops : list of str
the searched layer names of the subnet
num_points : int
the number of landmarks for prediction
"""
super(PFLDInference, self).__init__()
stage_names = [stage_name for stage_name in lookup_table.layer_num]
stage_n = [lookup_table.layer_num[stage] for stage in stage_names]
self.stem = StemBlock(init_ch=INIT_CH, bottleneck=False)
self.block4_1 = MBBlock(INIT_CH, 32, stride=2, mid_ch=32)
stages_0 = [
SingleOperation(
lookup_table.layer_configs[layer_id],
lookup_table.lut_ops[stage_names[0]],
sampled_ops[layer_id],
)
for layer_id in range(stage_n[0])
]
stages_1 = [
SingleOperation(
lookup_table.layer_configs[layer_id],
lookup_table.lut_ops[stage_names[1]],
sampled_ops[layer_id],
)
for layer_id in range(stage_n[0], stage_n[0] + stage_n[1])
]
blocks = stages_0 + stages_1
self.blocks = nn.Sequential(*blocks)
self.avg_pool1 = nn.Conv2d(
INIT_CH, INIT_CH, 9, 8, 1, groups=INIT_CH, bias=False
)
self.avg_pool2 = nn.Conv2d(32, 32, 3, 2, 1, groups=32, bias=False)
self.block6_1 = nn.Conv2d(96 + INIT_CH, 64, 1, 1, 0, bias=False)
self.block6_2 = MBBlock(64, 64, res=True, se=True, mid_ch=128)
self.block6_3 = SeparableConv(64, 128, 1)
self.conv7 = nn.Conv2d(128, 128, 7, 1, 0, groups=128, bias=False)
self.fc = nn.Conv2d(128, num_points * 2, 1, 1, 0, bias=True)
# init params
self.init_params()
def init_params(self):
for m in self.modules():
if isinstance(m, nn.Conv2d):
init.kaiming_normal_(m.weight, mode="fan_out")
if m.bias is not None:
init.constant_(m.bias, 0)
elif isinstance(m, nn.BatchNorm2d):
init.constant_(m.weight, 1)
init.constant_(m.bias, 0)
elif isinstance(m, nn.Linear):
init.normal_(m.weight, std=0.001)
if m.bias is not None:
init.constant_(m.bias, 0)
def forward(self, x):
"""
Parameters
----------
x : tensor
input image
Returns
-------
output: tensor
the predicted landmarks
output: tensor
the intermediate features
"""
x, y1 = self.stem(x)
out1 = x
x = self.block4_1(x)
for i, block in enumerate(self.blocks):
x = block(x)
if i == 1:
y2 = x
elif i == 4:
y3 = x
y1 = self.avg_pool1(y1)
y2 = self.avg_pool2(y2)
multi_scale = torch.cat([y3, y2, y1], 1)
y = self.block6_1(multi_scale)
y = self.block6_2(y)
y = self.block6_3(y)
y = self.conv7(y)
landmarks = self.fc(y)
return landmarks, out1
class AuxiliaryNet(nn.Module):
""" AuxiliaryNet to predict pose angles. """
def __init__(self):
super(AuxiliaryNet, self).__init__()
self.conv1 = conv_bn(INIT_CH, 64, 3, 2)
self.conv2 = conv_bn(64, 64, 3, 1)
self.conv3 = conv_bn(64, 32, 3, 2)
self.conv4 = conv_bn(32, 64, 7, 1)
self.max_pool1 = nn.MaxPool2d(3)
self.fc1 = nn.Linear(64, 32)
self.fc2 = nn.Linear(32, 3)
def forward(self, x):
"""
Parameters
----------
x : tensor
input intermediate features
Returns
-------
output: tensor
the predicted pose angles
"""
x = self.conv1(x)
x = self.conv2(x)
x = self.conv3(x)
x = self.conv4(x)
x = self.max_pool1(x)
x = x.view(x.size(0), -1)
x = self.fc1(x)
x = self.fc2(x)
return x
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import torch
import torch.nn as nn
from lib.ops import (
MBBlock,
SeparableConv,
StemBlock,
choice_blocks,
conv_bn,
)
from nni.nas.pytorch import mutables
from torch.nn import init
INIT_CH = 16
class PFLDInference(nn.Module):
""" PFLD model for facial landmark."""
def __init__(self, lookup_table, num_points=106):
"""
Parameters
----------
lookup_table : class
to manage the candidate ops, layer information and layer perf
num_points : int
the number of landmarks for prediction
"""
super(PFLDInference, self).__init__()
stage_names = [stage for stage in lookup_table.layer_num]
stage_lnum = [lookup_table.layer_num[stage] for stage in stage_names]
self.stem = StemBlock(init_ch=INIT_CH, bottleneck=False)
self.block4_1 = MBBlock(INIT_CH, 32, stride=2, mid_ch=32)
stages_0 = [
mutables.LayerChoice(
choice_blocks(
lookup_table.layer_configs[layer_id],
lookup_table.lut_ops[stage_names[0]],
)
)
for layer_id in range(stage_lnum[0])
]
stages_1 = [
mutables.LayerChoice(
choice_blocks(
lookup_table.layer_configs[layer_id],
lookup_table.lut_ops[stage_names[1]],
)
)
for layer_id in range(stage_lnum[0], stage_lnum[0] + stage_lnum[1])
]
blocks = stages_0 + stages_1
self.blocks = nn.Sequential(*blocks)
self.avg_pool1 = nn.Conv2d(
INIT_CH, INIT_CH, 9, 8, 1, groups=INIT_CH, bias=False
)
self.avg_pool2 = nn.Conv2d(32, 32, 3, 2, 1, groups=32, bias=False)
self.block6_1 = nn.Conv2d(96 + INIT_CH, 64, 1, 1, 0, bias=False)
self.block6_2 = MBBlock(64, 64, res=True, se=True, mid_ch=128)
self.block6_3 = SeparableConv(64, 128, 1)
self.conv7 = nn.Conv2d(128, 128, 7, 1, 0, groups=128, bias=False)
self.fc = nn.Conv2d(128, num_points * 2, 1, 1, 0, bias=True)
# init params
self.init_params()
def init_params(self):
for m in self.modules():
if isinstance(m, nn.Conv2d):
init.kaiming_normal_(m.weight, mode="fan_out")
if m.bias is not None:
init.constant_(m.bias, 0)
elif isinstance(m, nn.BatchNorm2d):
init.constant_(m.weight, 1)
init.constant_(m.bias, 0)
elif isinstance(m, nn.Linear):
init.normal_(m.weight, std=0.001)
if m.bias is not None:
init.constant_(m.bias, 0)
def forward(self, x):
"""
Parameters
----------
x : tensor
input image
Returns
-------
output: tensor
the predicted landmarks
output: tensor
the intermediate features
"""
x, y1 = self.stem(x)
out1 = x
x = self.block4_1(x)
for i, block in enumerate(self.blocks):
x = block(x)
if i == 1:
y2 = x
elif i == 4:
y3 = x
y1 = self.avg_pool1(y1)
y2 = self.avg_pool2(y2)
multi_scale = torch.cat([y3, y2, y1], 1)
y = self.block6_1(multi_scale)
y = self.block6_2(y)
y = self.block6_3(y)
y = self.conv7(y)
landmarks = self.fc(y)
return landmarks, out1
class AuxiliaryNet(nn.Module):
""" AuxiliaryNet to predict pose angles. """
def __init__(self):
super(AuxiliaryNet, self).__init__()
self.conv1 = conv_bn(INIT_CH, 64, 3, 2)
self.conv2 = conv_bn(64, 64, 3, 1)
self.conv3 = conv_bn(64, 32, 3, 2)
self.conv4 = conv_bn(32, 64, 7, 1)
self.max_pool1 = nn.MaxPool2d(3)
self.fc1 = nn.Linear(64, 32)
self.fc2 = nn.Linear(32, 3)
def forward(self, x):
"""
Parameters
----------
x : tensor
input intermediate features
Returns
-------
output: tensor
the predicted pose angles
"""
x = self.conv1(x)
x = self.conv2(x)
x = self.conv3(x)
x = self.conv4(x)
x = self.max_pool1(x)
x = x.view(x.size(0), -1)
x = self.fc1(x)
x = self.fc2(x)
return x
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import os
import time
import torch
import numpy as np
from nni.algorithms.nas.pytorch.fbnet import FBNetTrainer
from nni.nas.pytorch.utils import AverageMeter
from .utils import accuracy
class PFLDTrainer(FBNetTrainer):
def __init__(
self,
model,
auxiliarynet,
model_optim,
criterion,
device,
device_ids,
config,
lookup_table,
train_loader,
valid_loader,
n_epochs=300,
load_ckpt=False,
arch_path=None,
logger=None,
):
"""
Parameters
----------
model : pytorch model
the user model, which has mutables
auxiliarynet : pytorch model
the auxiliarynet to regress angle
model_optim : pytorch optimizer
the user defined optimizer
criterion : pytorch loss
the main task loss
device : pytorch device
the devices to train/search the model
device_ids : list of int
the indexes of devices used for training
config : class
configuration object for fbnet training
lookup_table : class
lookup table object for fbnet training
train_loader : pytorch data loader
data loader for the training set
valid_loader : pytorch data loader
data loader for the validation set
n_epochs : int
number of epochs to train/search
load_ckpt : bool
whether load checkpoint
arch_path : str
the path to store chosen architecture
logger : logger
the logger
"""
super(PFLDTrainer, self).__init__(
model,
model_optim,
criterion,
device,
device_ids,
lookup_table,
train_loader,
valid_loader,
n_epochs,
load_ckpt,
arch_path,
logger,
)
# DataParallel of the AuxiliaryNet to PFLD
self.auxiliarynet = auxiliarynet
self.auxiliarynet = torch.nn.DataParallel(
self.auxiliarynet, device_ids=device_ids
)
self.auxiliarynet.to(device)
def _validate(self):
"""
Do validation. During validation, LayerChoices use the mixed-op.
Returns
-------
float, float
average loss, average nme
"""
# test on validation set under eval mode
self.model.eval()
self.auxiliarynet.eval()
losses, nme = list(), list()
batch_time = AverageMeter("batch_time")
end = time.time()
with torch.no_grad():
for i, (img, land_gt, angle_gt) in enumerate(self.valid_loader):
img = img.to(self.device, non_blocking=True)
landmark_gt = land_gt.to(self.device, non_blocking=True)
angle_gt = angle_gt.to(self.device, non_blocking=True)
landmark, _ = self.model(img)
# compute the l2 loss
landmark = landmark.squeeze()
l2_diff = torch.sum((landmark_gt - landmark) ** 2, axis=1)
loss = torch.mean(l2_diff)
losses.append(loss.cpu().detach().numpy())
# compute the accuracy
landmark = landmark.cpu().detach().numpy()
landmark = landmark.reshape(landmark.shape[0], -1, 2)
landmark_gt = landmark_gt.cpu().detach().numpy()
landmark_gt = landmark_gt.reshape(landmark_gt.shape[0], -1, 2)
_, nme_i = accuracy(landmark, landmark_gt)
for item in nme_i:
nme.append(item)
# measure elapsed time
batch_time.update(time.time() - end)
end = time.time()
self.logger.info("===> Evaluate:")
self.logger.info(
"Eval set: Average loss: {:.4f} nme: {:.4f}".format(
np.mean(losses), np.mean(nme)
)
)
return np.mean(losses), np.mean(nme)
def _train_epoch(self, epoch, optimizer, arch_train=False):
"""
Train one epoch.
"""
# switch to train mode
self.model.train()
self.auxiliarynet.train()
batch_time = AverageMeter("batch_time")
data_time = AverageMeter("data_time")
losses = AverageMeter("losses")
data_loader = self.valid_loader if arch_train else self.train_loader
end = time.time()
for i, (img, landmark_gt, angle_gt) in enumerate(data_loader):
data_time.update(time.time() - end)
img = img.to(self.device, non_blocking=True)
landmark_gt = landmark_gt.to(self.device, non_blocking=True)
angle_gt = angle_gt.to(self.device, non_blocking=True)
lands, feats = self.model(img)
landmarks = lands.squeeze()
angle = self.auxiliarynet(feats)
# task loss
weighted_loss, l2_loss = self.criterion(
landmark_gt, angle_gt, angle, landmarks
)
loss = l2_loss if arch_train else weighted_loss
# hardware-aware loss
perf_cost = self._get_perf_cost(requires_grad=True)
regu_loss = self.reg_loss(perf_cost)
if self.mode.startswith("mul"):
loss = loss * regu_loss
elif self.mode.startswith("add"):
loss = loss + regu_loss
# compute gradient and do SGD step
optimizer.zero_grad()
loss.backward()
optimizer.step()
# measure elapsed time
batch_time.update(time.time() - end)
end = time.time()
# measure accuracy and record loss
losses.update(np.squeeze(loss.cpu().detach().numpy()), img.size(0))
if i % 10 == 0:
batch_log = (
"Train [{0}][{1}]\t"
"Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t"
"Data {data_time.val:.3f} ({data_time.avg:.3f})\t"
"Loss {losses.val:.4f} ({losses.avg:.4f})".format(
epoch + 1,
i,
batch_time=batch_time,
data_time=data_time,
losses=losses,
)
)
self.logger.info(batch_log)
def _warm_up(self):
"""
Warm up the model, while the architecture weights are not trained.
"""
for epoch in range(self.epoch, self.start_epoch):
self.logger.info("\n--------Warmup epoch: %d--------\n", epoch + 1)
self._train_epoch(epoch, self.model_optim)
# adjust learning rate
self.scheduler.step()
# validation
_, _ = self._validate()
if epoch % 10 == 0:
filename = os.path.join(
self.config.model_dir, "checkpoint_%s.pth" % epoch
)
self.save_checkpoint(epoch, filename)
def _train(self):
"""
Train the model: both model weights and architecture weights are trained.
Architecture weights are trained according to the schedule.
Before updating architecture weights, ```requires_grad``` is enabled.
Then, it is disabled after the updating, in order not to update
architecture weights when training model weights.
"""
arch_param_num = self.mutator.num_arch_params()
self.logger.info("#arch_params: {}".format(arch_param_num))
self.epoch = max(self.start_epoch, self.epoch)
ckpt_path = self.config.model_dir
choice_names = None
val_nme = 1e6
for epoch in range(self.epoch, self.n_epochs):
# update the weight parameters
self.logger.info("\n--------Train epoch: %d--------\n", epoch + 1)
self._train_epoch(epoch, self.model_optim)
# adjust learning rate
self.scheduler.step()
# update the architecture parameters
self.logger.info("Update architecture parameters")
self.mutator.arch_requires_grad()
self._train_epoch(epoch, self.arch_optimizer, True)
self.mutator.arch_disable_grad()
# temperature annealing
self.temp = self.temp * self.exp_anneal_rate
self.mutator.set_temperature(self.temp)
# sample the architecture of sub-network
choice_names = self._layer_choice_sample()
# validate
_, nme = self._validate()
if epoch % 10 == 0:
filename = os.path.join(ckpt_path, "checkpoint_%s.pth" % epoch)
self.save_checkpoint(epoch, filename, choice_names)
if nme < val_nme:
filename = os.path.join(ckpt_path, "checkpoint_best.pth")
self.save_checkpoint(epoch, filename, choice_names)
val_nme = nme
self.logger.info("Best nme: {:.4f}".format(val_nme))
def save_checkpoint(self, epoch, filename, choice_names=None):
"""
Save checkpoint of the whole model.
Saving model weights and architecture weights as ```filename```,
and saving currently chosen architecture in ```arch_path```.
"""
state = {
"pfld_backbone": self.model.state_dict(),
"auxiliarynet": self.auxiliarynet.state_dict(),
"optim": self.model_optim.state_dict(),
"epoch": epoch,
"arch_sample": choice_names,
}
torch.save(state, filename)
self.logger.info("Save checkpoint to {0:}".format(filename))
if self.arch_path:
self.export(self.arch_path)
def load_checkpoint(self, filename):
"""
Load the checkpoint from ```filename```.
"""
ckpt = torch.load(filename)
self.epoch = ckpt["epoch"]
self.model.load_state_dict(ckpt["pfld_backbone"])
self.auxiliarynet.load_state_dict(ckpt["auxiliarynet"])
self.model_optim.load_state_dict(ckpt["optim"])
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import torch
import numpy as np
import torch.nn as nn
def accuracy(preds, target):
"""
Calculate the NME (Normalized Mean Error).
Parameters
----------
preds : numpy array
the predicted landmarks
target : numpy array
the ground truth of landmarks
Returns
-------
output: float32
the nme value
output: list
the list of l2 distances
"""
N = preds.shape[0]
L = preds.shape[1]
rmse = np.zeros(N).astype(np.float32)
for i in range(N):
pts_pred, pts_gt = (
preds[i],
target[i],
)
if L == 19:
# aflw
interocular = 34
elif L == 29:
# cofw
interocular = np.linalg.norm(pts_gt[8] - pts_gt[9])
elif L == 68:
# interocular
interocular = np.linalg.norm(pts_gt[36] - pts_gt[45])
elif L == 98:
# euclidean dis from left eye to right eye
interocular = np.linalg.norm(pts_gt[60] - pts_gt[72])
elif L == 106:
# euclidean dis from left eye to right eye
interocular = np.linalg.norm(pts_gt[35] - pts_gt[93])
else:
raise ValueError("Number of landmarks is wrong")
pred_dis = np.sum(np.linalg.norm(pts_pred - pts_gt, axis=1))
rmse[i] = pred_dis / (interocular * L)
return np.mean(rmse), rmse
class PFLDLoss(nn.Module):
"""Weighted loss of L2 distance with the pose angle for PFLD."""
def __init__(self):
super(PFLDLoss, self).__init__()
def forward(self, landmark_gt, euler_angle_gt, angle, landmarks):
"""
Calculate weighted L2 loss for PFLD.
Parameters
----------
landmark_gt : tensor
the ground truth of landmarks
euler_angle_gt : tensor
the ground truth of pose angle
angle : tensor
the predicted pose angle
landmarks : float32
the predicted landmarks
Returns
-------
output: tensor
the weighted L2 loss
output: tensor
the normal L2 loss
"""
weight_angle = torch.sum(1 - torch.cos(angle - euler_angle_gt), axis=1)
l2_distant = torch.sum((landmark_gt - landmarks) ** 2, axis=1)
return torch.mean(weight_angle * l2_distant), torch.mean(l2_distant)
def bounded_regress_loss(
landmark_gt, landmarks_t, landmarks_s, reg_m=0.5, br_alpha=0.05
):
"""
Calculate the Bounded Regression Loss for Knowledge Distillation.
Parameters
----------
landmark_gt : tensor
the ground truth of landmarks
landmarks_t : tensor
the predicted landmarks of teacher
landmarks_s : tensor
the predicted landmarks of student
reg_m : float32
the value to control the regression constraint
br_alpha : float32
the balance value for kd loss
Returns
-------
output: tensor
the bounded regression loss
"""
l2_dis_s = (landmark_gt - landmarks_s).pow(2).sum(1)
l2_dis_s_m = l2_dis_s + reg_m
l2_dis_t = (landmark_gt - landmarks_t).pow(2).sum(1)
br_loss = l2_dis_s[l2_dis_s_m > l2_dis_t].sum()
return br_loss * br_alpha
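# Editor's sketch (hypothetical usage; this helper is not called elsewhere in
# this commit): distilling a student subnet from a teacher's landmark outputs.
#   with torch.no_grad():
#       lands_t, _ = teacher(img)
#   lands_s, _ = student(img)
#   kd = bounded_regress_loss(landmark_gt, lands_t.squeeze(), lands_s.squeeze())
#   loss = task_loss + kd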
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import argparse
import logging
import os
import time
import torch
import torchvision
import numpy as np
from datasets import PFLDDatasets
from lib.builder import search_space
from lib.ops import PRIMITIVES
from lib.utils import PFLDLoss, accuracy
from nni.algorithms.nas.pytorch.fbnet import (
LookUpTable,
NASConfig,
supernet_sample,
)
from nni.nas.pytorch.utils import AverageMeter
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def validate(model, auxiliarynet, valid_loader, device, logger):
"""Do validation."""
model.eval()
auxiliarynet.eval()
losses, nme = list(), list()
with torch.no_grad():
for i, (img, land_gt, angle_gt) in enumerate(valid_loader):
img = img.to(device, non_blocking=True)
landmark_gt = land_gt.to(device, non_blocking=True)
angle_gt = angle_gt.to(device, non_blocking=True)
landmark, _ = model(img)
# compute the l2 loss
landmark = landmark.squeeze()
l2_diff = torch.sum((landmark_gt - landmark) ** 2, axis=1)
loss = torch.mean(l2_diff)
losses.append(loss.cpu().detach().numpy())
# compute the accuracy
landmark = landmark.cpu().detach().numpy()
landmark = landmark.reshape(landmark.shape[0], -1, 2)
landmark_gt = landmark_gt.cpu().detach().numpy()
landmark_gt = landmark_gt.reshape(landmark_gt.shape[0], -1, 2)
_, nme_i = accuracy(landmark, landmark_gt)
for item in nme_i:
nme.append(item)
logger.info("===> Evaluate:")
logger.info(
"Eval set: Average loss: {:.4f} nme: {:.4f}".format(
np.mean(losses), np.mean(nme)
)
)
return np.mean(losses), np.mean(nme)
def train_epoch(
model,
auxiliarynet,
criterion,
train_loader,
device,
epoch,
optimizer,
logger,
):
"""Train one epoch."""
model.train()
auxiliarynet.train()
batch_time = AverageMeter("batch_time")
data_time = AverageMeter("data_time")
losses = AverageMeter("losses")
end = time.time()
for i, (img, landmark_gt, angle_gt) in enumerate(train_loader):
data_time.update(time.time() - end)
img = img.to(device, non_blocking=True)
landmark_gt = landmark_gt.to(device, non_blocking=True)
angle_gt = angle_gt.to(device, non_blocking=True)
lands, feats = model(img)
landmarks = lands.squeeze()
angle = auxiliarynet(feats)
# task loss
weighted_loss, _ = criterion(
landmark_gt, angle_gt, angle, landmarks
)
loss = weighted_loss
# compute gradient and do SGD step
optimizer.zero_grad()
loss.backward()
optimizer.step()
# measure elapsed time
batch_time.update(time.time() - end)
end = time.time()
# measure accuracy and record loss
losses.update(np.squeeze(loss.cpu().detach().numpy()), img.size(0))
if i % 10 == 0:
batch_log = (
"Train [{0}][{1}]\t"
"Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t"
"Data {data_time.val:.3f} ({data_time.avg:.3f})\t"
"Loss {losses.val:.4f} ({losses.avg:.4f})".format(
epoch + 1,
i,
batch_time=batch_time,
data_time=data_time,
losses=losses,
)
)
logger.info(batch_log)
def save_checkpoint(model, auxiliarynet, optimizer, filename, logger):
"""Save checkpoint of the whole model."""
state = {
"pfld_backbone": model.state_dict(),
"auxiliarynet": auxiliarynet.state_dict(),
"optim": optimizer.state_dict(),
}
torch.save(state, filename)
logger.info("Save checkpoint to {0:}".format(filename))
def main(args):
""" The main function for supernet pre-training and subnet fine-tuning. """
logging.basicConfig(
format="[%(asctime)s] [p%(process)s] [%(pathname)s\
:%(lineno)d] [%(levelname)s] %(message)s",
level=logging.INFO,
handlers=[
logging.FileHandler(args.log_file, mode="w"),
logging.StreamHandler(),
],
)
# print the information of arguments
for arg in vars(args):
s = arg + ": " + str(getattr(args, arg))
logging.info(s)
# for 106 landmarks
num_points = 106
# list of device ids, and the number of workers for data loading
device_ids = [int(id) for id in args.dev_id.split(",")]
dev_num = len(device_ids)
num_workers = 4 * dev_num
# import subnet for fine-tuning
from lib.subnet import PFLDInference, AuxiliaryNet
# the configuration for training control
nas_config = NASConfig(
model_dir=args.snapshot,
search_space=search_space,
)
# look-up table with information of search space, flops per block, etc.
lookup_table = LookUpTable(config=nas_config, primitives=PRIMITIVES)
check = torch.load(args.supernet, map_location=torch.device("cpu"))
sampled_arch = check["arch_sample"]
logging.info(sampled_arch)
# create subnet
pfld_backbone = PFLDInference(lookup_table, sampled_arch, num_points)
# pre-load the weights from pre-trained supernet
state_dict = check["pfld_backbone"]
supernet_sample(pfld_backbone, state_dict, sampled_arch, lookup_table)
# the auxiliary-net of PFLD to predict the pose angle
auxiliarynet = AuxiliaryNet()
# DataParallel
pfld_backbone = torch.nn.DataParallel(pfld_backbone, device_ids=device_ids)
pfld_backbone.to(device)
auxiliarynet = torch.nn.DataParallel(auxiliarynet, device_ids=device_ids)
auxiliarynet.to(device)
# main task loss
criterion = PFLDLoss()
# optimizer / scheduler for weight train
optimizer = torch.optim.RMSprop(
[
{"params": pfld_backbone.parameters()},
{"params": auxiliarynet.parameters()},
],
lr=args.base_lr,
momentum=0.0,
weight_decay=args.weight_decay,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=args.end_epoch, last_epoch=-1
)
# data augmentation and dataloader
transform = torchvision.transforms.Compose(
[torchvision.transforms.ToTensor()]
)
# the landmark dataset with 106 points is default used
train_dataset = PFLDDatasets(
os.path.join(args.data_root, "train_data/list.txt"),
transform,
data_root=args.data_root,
img_size=args.img_size,
)
dataloader = DataLoader(
train_dataset,
batch_size=args.train_batchsize,
shuffle=True,
num_workers=num_workers,
pin_memory=True,
drop_last=False,
)
val_dataset = PFLDDatasets(
os.path.join(args.data_root, "test_data/list.txt"),
transform,
data_root=args.data_root,
img_size=args.img_size,
)
val_dataloader = DataLoader(
val_dataset,
batch_size=args.val_batchsize,
shuffle=False,
num_workers=num_workers,
pin_memory=True,
)
# start finetune
ckpt_path = args.snapshot
val_nme = 1e6
for epoch in range(0, args.end_epoch):
logging.info("\n--------Train epoch: %d--------\n", epoch + 1)
# update the weight parameters
train_epoch(
pfld_backbone,
auxiliarynet,
criterion,
dataloader,
device,
epoch,
optimizer,
logging,
)
# adjust learning rate
scheduler.step()
# validate
_, nme = validate(
pfld_backbone, auxiliarynet, val_dataloader, device, logging
)
if epoch % 10 == 0:
filename = os.path.join(ckpt_path, "checkpoint_%s.pth" % epoch)
save_checkpoint(
pfld_backbone, auxiliarynet, optimizer, filename, logging
)
if nme < val_nme:
filename = os.path.join(ckpt_path, "checkpoint_best.pth")
save_checkpoint(
pfld_backbone, auxiliarynet, optimizer, filename, logging
)
val_nme = nme
logging.info("Best nme: {:.4f}".format(val_nme))
def parse_args():
""" Parse the user arguments. """
parser = argparse.ArgumentParser(description="Finetuning for PFLD")
parser.add_argument("--dev_id", dest="dev_id", default="0", type=str)
parser.add_argument("--base_lr", default=0.0001, type=int)
parser.add_argument("--weight-decay", "--wd", default=1e-6, type=float)
parser.add_argument("--img_size", default=112, type=int)
parser.add_argument("--supernet", default="", type=str, metavar="PATH")
parser.add_argument("--end_epoch", default=300, type=int)
parser.add_argument(
"--snapshot", default="models", type=str, metavar="PATH"
)
parser.add_argument("--log_file", default="train.log", type=str)
parser.add_argument(
"--data_root", default="/dataset", type=str, metavar="PATH"
)
parser.add_argument("--train_batchsize", default=256, type=int)
parser.add_argument("--val_batchsize", default=128, type=int)
args = parser.parse_args()
args.snapshot = os.path.join(args.snapshot, 'subnet')
args.log_file = os.path.join(args.snapshot, "{}.log".format('subnet'))
os.makedirs(args.snapshot, exist_ok=True)
return args
if __name__ == "__main__":
args = parse_args()
main(args)
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import argparse
import logging
import os
import torch
import torchvision
import numpy as np
from datasets import PFLDDatasets
from lib.builder import search_space
from lib.ops import PRIMITIVES
from lib.trainer import PFLDTrainer
from lib.utils import PFLDLoss
from nni.algorithms.nas.pytorch.fbnet import LookUpTable, NASConfig
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def main(args):
""" The main function for supernet pre-training and subnet fine-tuning. """
logging.basicConfig(
format="[%(asctime)s] [p%(process)s] [%(pathname)s\
:%(lineno)d] [%(levelname)s] %(message)s",
level=logging.INFO,
handlers=[
logging.FileHandler(args.log_file, mode="w"),
logging.StreamHandler(),
],
)
# print the information of arguments
for arg in vars(args):
s = arg + ": " + str(getattr(args, arg))
logging.info(s)
# for 106 landmarks
num_points = 106
# list of device ids, and the number of workers for data loading
device_ids = [int(id) for id in args.dev_id.split(",")]
dev_num = len(device_ids)
num_workers = 4 * dev_num
# random seed
manual_seed = 1
np.random.seed(manual_seed)
torch.manual_seed(manual_seed)
torch.cuda.manual_seed_all(manual_seed)
# import supernet for block-wise DNAS pre-training
from lib.supernet import PFLDInference, AuxiliaryNet
# the configuration for training control
nas_config = NASConfig(
model_dir=args.snapshot,
nas_lr=args.theta_lr,
mode=args.mode,
alpha=args.alpha,
beta=args.beta,
search_space=search_space,
)
# look-up table with information of search space, flops per block, etc.
lookup_table = LookUpTable(config=nas_config, primitives=PRIMITIVES)
# create supernet
pfld_backbone = PFLDInference(lookup_table, num_points)
# the auxiliary-net of PFLD to predict the pose angle
auxiliarynet = AuxiliaryNet()
# main task loss
criterion = PFLDLoss()
# optimizer for weight train
if args.opt == "adam":
optimizer = torch.optim.AdamW(
[
{"params": pfld_backbone.parameters()},
{"params": auxiliarynet.parameters()},
],
lr=args.base_lr,
weight_decay=args.weight_decay,
)
elif args.opt == "rms":
optimizer = torch.optim.RMSprop(
[
{"params": pfld_backbone.parameters()},
{"params": auxiliarynet.parameters()},
],
lr=args.base_lr,
momentum=0.0,
weight_decay=args.weight_decay,
)
# data augmentation and dataloader
transform = torchvision.transforms.Compose(
[torchvision.transforms.ToTensor()]
)
# the landmark dataset with 106 points is default used
train_dataset = PFLDDatasets(
os.path.join(args.data_root, "train_data/list.txt"),
transform,
data_root=args.data_root,
img_size=args.img_size,
)
dataloader = DataLoader(
train_dataset,
batch_size=args.train_batchsize,
shuffle=True,
num_workers=num_workers,
pin_memory=True,
drop_last=False,
)
val_dataset = PFLDDatasets(
os.path.join(args.data_root, "test_data/list.txt"),
transform,
data_root=args.data_root,
img_size=args.img_size,
)
val_dataloader = DataLoader(
val_dataset,
batch_size=args.val_batchsize,
shuffle=False,
num_workers=num_workers,
pin_memory=True,
)
# create the trainer, then search/finetune
trainer = PFLDTrainer(
pfld_backbone,
auxiliarynet,
optimizer,
criterion,
device,
device_ids,
nas_config,
lookup_table,
dataloader,
val_dataloader,
n_epochs=args.end_epoch,
logger=logging,
)
trainer.train()
def parse_args():
""" Parse the user arguments. """
parser = argparse.ArgumentParser(description="FBNet for PFLD")
parser.add_argument("--dev_id", dest="dev_id", default="0", type=str)
parser.add_argument("--opt", default="rms", type=str)
parser.add_argument("--base_lr", default=0.0001, type=int)
parser.add_argument("--weight-decay", "--wd", default=1e-6, type=float)
parser.add_argument("--img_size", default=112, type=int)
parser.add_argument("--theta-lr", "--tlr", default=0.01, type=float)
parser.add_argument(
"--mode", default="mul", type=str, choices=["mul", "add"]
)
parser.add_argument("--alpha", default=0.25, type=float)
parser.add_argument("--beta", default=0.6, type=float)
parser.add_argument("--end_epoch", default=300, type=int)
parser.add_argument(
"--snapshot", default="models", type=str, metavar="PATH"
)
parser.add_argument("--log_file", default="train.log", type=str)
parser.add_argument(
"--data_root", default="/dataset", type=str, metavar="PATH"
)
parser.add_argument("--train_batchsize", default=256, type=int)
parser.add_argument("--val_batchsize", default=128, type=int)
args = parser.parse_args()
args.snapshot = os.path.join(args.snapshot, 'supernet')
args.log_file = os.path.join(args.snapshot, "{}.log".format('supernet'))
os.makedirs(args.snapshot, exist_ok=True)
return args
if __name__ == "__main__":
args = parse_args()
main(args)
from __future__ import absolute_import
from .mutator import FBNetMutator # noqa: F401
from .trainer import FBNetTrainer # noqa: F401
from .utils import ( # noqa: F401
LookUpTable,
NASConfig,
RegularizerLoss,
model_init,
supernet_sample,
)
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import torch
from torch import nn as nn
from torch.nn import functional as F
import numpy as np
from nni.nas.pytorch.base_mutator import BaseMutator
from nni.nas.pytorch.mutables import LayerChoice
class MixedOp(nn.Module):
"""
This class is to instantiate and manage info of one LayerChoice.
It includes architecture weights and member functions for the weights.
"""
def __init__(self, mutable, latency):
"""
Parameters
----------
mutable : LayerChoice
A LayerChoice in user model
latency : List
performance cost for each op in mutable
"""
super(MixedOp, self).__init__()
self.latency = latency
n_choices = len(mutable)
self.path_alpha = nn.Parameter(
torch.FloatTensor([1.0 / n_choices for i in range(n_choices)])
)
self.path_alpha.requires_grad = False
self.temperature = 1.0
def get_path_alpha(self):
"""Return the architecture parameter."""
return self.path_alpha
def get_weighted_latency(self):
"""Return the weighted perf_cost of current mutable."""
soft_masks = self.probs_over_ops()
weighted_latency = sum(m * l for m, l in zip(soft_masks, self.latency))
return weighted_latency
def set_temperature(self, temperature):
"""
Set the annealed temperature for gumbel softmax.
Parameters
----------
temperature : float
The annealed temperature for gumbel softmax
"""
self.temperature = temperature
def to_requires_grad(self):
"""Enable gradient calculation."""
self.path_alpha.requires_grad = True
def to_disable_grad(self):
"""Disable gradient calculation."""
self.path_alpha.requires_grad = False
def probs_over_ops(self):
"""Apply gumbel softmax to generate probability distribution."""
return F.gumbel_softmax(self.path_alpha, self.temperature)
def forward(self, mutable, x):
"""
Define forward of LayerChoice.
Parameters
----------
mutable : LayerChoice
this layer's mutable
x : tensor
inputs of this layer, only support one input
Returns
-------
output: tensor
output of this layer
"""
candidate_ops = list(mutable)
soft_masks = self.probs_over_ops()
output = sum(m * op(x) for m, op in zip(soft_masks, candidate_ops))
return output
@property
def chosen_index(self):
"""
choose the op with max prob
Returns
-------
int
index of the chosen one
"""
alphas = self.path_alpha.data.detach().cpu().numpy()
index = int(np.argmax(alphas))
return index
class FBNetMutator(BaseMutator):
"""
This mutator initializes and operates all the LayerChoices of the supernet.
It is for the related trainer to control the training flow of LayerChoices,
coordinating with whole training process.
"""
def __init__(self, model, lookup_table):
"""
Init a MixedOp instance for each mutable i.e., LayerChoice.
And register the instantiated MixedOp in corresponding LayerChoice.
If it is not registered in the LayerChoice, DataParallel does not work,
because the architecture weights would not be included in the DataParallel model.
When MixedOPs are registered, we use ```requires_grad``` to control
whether calculate gradients of architecture weights.
Parameters
----------
model : pytorch model
The model that users want to tune,
it includes search space defined with nni nas apis
lookup_table : class
lookup table object to manage model space information,
including candidate ops for each stage as the model space,
input channels/output channels/stride/fm_size as the layer config,
and the performance information for perf_cost accumulation.
"""
super(FBNetMutator, self).__init__(model)
self.mutable_list = []
# Collect the op names of the candidate ops within each mutable
ops_names_mutable = dict()
left = 0
right = 1
for stage_name in lookup_table.layer_num:
right = lookup_table.layer_num[stage_name]
stage_ops = lookup_table.lut_ops[stage_name]
ops_names = [op_name for op_name in stage_ops]
for i in range(left, left + right):
ops_names_mutable[i] = ops_names
left += right
# Create the mixed op
for i, mutable in enumerate(self.undedup_mutables):
ops_names = ops_names_mutable[i]
latency_mutable = lookup_table.lut_perf[i]
latency = [latency_mutable[op_name] for op_name in ops_names]
self.mutable_list.append(mutable)
mutable.registered_module = MixedOp(mutable, latency)
def on_forward_layer_choice(self, mutable, *args, **kwargs):
"""
Callback of layer choice forward. This function defines the forward
logic of the input mutable. The mutable is only an interface; its
real implementation is defined in the mutator.
Parameters
----------
mutable: LayerChoice
forward logic of this input mutable
args: list of torch.Tensor
inputs of this mutable
kwargs: dict
inputs of this mutable
Returns
-------
torch.Tensor
output of this mutable, i.e., LayerChoice
int
index of the chosen op
"""
# FIXME: return mask, to be consistent with other algorithms
idx = mutable.registered_module.chosen_index
return mutable.registered_module(mutable, *args, **kwargs), idx
def num_arch_params(self):
"""
The number of mutables, i.e., LayerChoice
Returns
-------
int
the number of LayerChoice in user model
"""
return len(self.mutable_list)
def get_architecture_parameters(self):
"""
Get all the architecture parameters.
Yields
------
PyTorch Parameter
path_alpha of the traversed mutable
"""
for mutable in self.undedup_mutables:
yield mutable.registered_module.get_path_alpha()
def get_weighted_latency(self):
"""
Get the latency weighted by gumbel softmax coefficients.
Yields
------
torch.Tensor
the weighted_latency of the traversed mutable
"""
for mutable in self.undedup_mutables:
yield mutable.registered_module.get_weighted_latency()
def set_temperature(self, temperature):
"""
Set the annealed temperature of the op for gumbel softmax.
Parameters
----------
temperature : float
The annealed temperature for gumbel softmax
"""
for mutable in self.undedup_mutables:
mutable.registered_module.set_temperature(temperature)
def arch_requires_grad(self):
"""
Make architecture weights require gradient
"""
for mutable in self.undedup_mutables:
mutable.registered_module.to_requires_grad()
def arch_disable_grad(self):
"""
Disable gradient of architecture weights, i.e., does not
calculate gradient for them.
"""
for mutable in self.undedup_mutables:
mutable.registered_module.to_disable_grad()
def sample_final(self):
"""
Generate the final chosen architecture.
Returns
-------
dict
the choice of each mutable, i.e., LayerChoice
"""
result = dict()
for mutable in self.undedup_mutables:
assert isinstance(mutable, LayerChoice)
index = mutable.registered_module.chosen_index
# pylint: disable=not-callable
result[mutable.key] = (
F.one_hot(torch.tensor(index), num_classes=len(mutable))
.view(-1)
.bool()
)
return result
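# The following is a minimal, self-contained sketch (added for illustration,
# not part of the original module) of the Gumbel-Softmax mixing performed in
# MixedOp.forward: soft masks are sampled from the architecture weights and
# used to blend the outputs of the candidate ops. The two candidate
# convolutions are hypothetical placeholders.
if __name__ == "__main__":
    candidate_ops = [nn.Conv2d(3, 8, 3, padding=1), nn.Conv2d(3, 8, 1)]
    alpha = torch.zeros(len(candidate_ops))  # uniform logits, as in MixedOp.__init__
    soft_masks = F.gumbel_softmax(alpha, tau=1.0)  # sums to 1 over the candidates
    x = torch.randn(1, 3, 32, 32)
    out = sum(m * op(x) for m, op in zip(soft_masks, candidate_ops))
    print(out.shape)  # torch.Size([1, 8, 32, 32])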
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import json
import os
import time
import torch
import numpy as np
from nni.nas.pytorch.base_trainer import BaseTrainer
from nni.nas.pytorch.trainer import TorchTensorEncoder
from nni.nas.pytorch.utils import AverageMeter
from .mutator import FBNetMutator
from .utils import RegularizerLoss, accuracy
class FBNetTrainer(BaseTrainer):
def __init__(
self,
model,
model_optim,
criterion,
device,
device_ids,
lookup_table,
train_loader,
valid_loader,
n_epochs=120,
load_ckpt=False,
arch_path=None,
logger=None,
):
"""
Parameters
----------
model : pytorch model
the user model, which has mutables
model_optim : pytorch optimizer
the user defined optimizer
criterion : pytorch loss
the main task loss; e.g., nn.CrossEntropyLoss() for classification
device : pytorch device
the devices to train/search the model
device_ids : list of int
the indexes of devices used for training
lookup_table : class
lookup table object for fbnet training
train_loader : pytorch data loader
data loader for the training set
valid_loader : pytorch data loader
data loader for the validation set
n_epochs : int
number of epochs to train/search
load_ckpt : bool
whether load checkpoint
arch_path : str
the path to store chosen architecture
logger : logger
the logger
"""
self.model = model
self.model_optim = model_optim
self.train_loader = train_loader
self.valid_loader = valid_loader
self.device = device
self.dev_num = len(device_ids)
self.n_epochs = n_epochs
self.lookup_table = lookup_table
self.config = lookup_table.config
self.start_epoch = self.config.start_epoch
self.temp = self.config.init_temperature
self.exp_anneal_rate = self.config.exp_anneal_rate
self.mode = self.config.mode
self.load_ckpt = load_ckpt
self.arch_path = arch_path
self.logger = logger
# scheduler of learning rate
self.scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
model_optim, T_max=n_epochs, last_epoch=-1
)
# init mutator
self.mutator = FBNetMutator(model, lookup_table)
self.mutator.set_temperature(self.temp)
# DataParallel should be applied after the mutator is initialized
self.model = torch.nn.DataParallel(self.model, device_ids=device_ids)
self.model.to(device)
# build architecture optimizer
self.arch_optimizer = torch.optim.AdamW(
self.mutator.get_architecture_parameters(),
self.config.nas_lr,
weight_decay=self.config.nas_weight_decay,
)
self.reg_loss = RegularizerLoss(config=self.config)
self.criterion = criterion
self.epoch = 0
def _layer_choice_sample(self):
"""
Sample the chosen op index within each LayerChoice, by taking the
argmax of its architecture weights.
"""
stages = [stage_name for stage_name in self.lookup_table.layer_num]
stage_lnum = [self.lookup_table.layer_num[stage] for stage in stages]
# get the choice idx in each layer
choice_ids = list()
layer_id = 0
for param in self.mutator.get_architecture_parameters():
param_np = param.cpu().detach().numpy()
op_idx = np.argmax(param_np)
choice_ids.append(op_idx)
self.logger.info(
"layer {}: {}, index: {}".format(layer_id, param_np, op_idx)
)
layer_id += 1
# get the arch_sample
choice_names = list()
layer_id = 0
for i, stage_name in enumerate(stages):
ops_names = [op for op in self.lookup_table.lut_ops[stage_name]]
for j in range(stage_lnum[i]):
searched_op = ops_names[choice_ids[layer_id]]
choice_names.append(searched_op)
layer_id += 1
self.logger.info(choice_names)
return choice_names
def _get_perf_cost(self, requires_grad=True):
"""
Get the accumulated performance cost.
"""
perf_cost = torch.zeros(1, requires_grad=requires_grad).to(
self.device, non_blocking=True
)
for latency in self.mutator.get_weighted_latency():
perf_cost = perf_cost + latency
return perf_cost
def _validate(self):
"""
Do validation. During validation, LayerChoices use the mixed-op.
Returns
-------
float, float, float
average loss, average top1 accuracy, average top5 accuracy
"""
self.valid_loader.batch_sampler.drop_last = False
batch_time = AverageMeter("batch_time")
losses = AverageMeter("losses")
top1 = AverageMeter("top1")
top5 = AverageMeter("top5")
# test on validation set under eval mode
self.model.eval()
end = time.time()
with torch.no_grad():
for i, (images, labels) in enumerate(self.valid_loader):
images = images.to(self.device, non_blocking=True)
labels = labels.to(self.device, non_blocking=True)
output = self.model(images)
loss = self.criterion(output, labels)
acc1, acc5 = accuracy(output, labels, topk=(1, 5))
losses.update(loss.item(), images.size(0))
top1.update(acc1[0].item(), images.size(0))
top5.update(acc5[0].item(), images.size(0))
# measure elapsed time
batch_time.update(time.time() - end)
end = time.time()
if i % 10 == 0 or i + 1 == len(self.valid_loader):
test_log = (
"Valid" + ": [{0}/{1}]\t"
"Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t"
"Loss {loss.val:.4f} ({loss.avg:.4f})\t"
"Top-1 acc {top1.val:.3f} ({top1.avg:.3f})\t"
"Top-5 acc {top5.val:.3f} ({top5.avg:.3f})".format(
i,
len(self.valid_loader) - 1,
batch_time=batch_time,
loss=losses,
top1=top1,
top5=top5,
)
)
self.logger.info(test_log)
return losses.avg, top1.avg, top5.avg
def _train_epoch(self, epoch, optimizer, arch_train=False):
"""
Train one epoch.
"""
batch_time = AverageMeter("batch_time")
data_time = AverageMeter("data_time")
losses = AverageMeter("losses")
top1 = AverageMeter("top1")
top5 = AverageMeter("top5")
# switch to train mode
self.model.train()
data_loader = self.valid_loader if arch_train else self.train_loader
end = time.time()
for i, (images, labels) in enumerate(data_loader):
data_time.update(time.time() - end)
images = images.to(self.device, non_blocking=True)
labels = labels.to(self.device, non_blocking=True)
output = self.model(images)
loss = self.criterion(output, labels)
# hardware-aware loss
perf_cost = self._get_perf_cost(requires_grad=True)
regu_loss = self.reg_loss(perf_cost)
if self.mode.startswith("mul"):
loss = loss * regu_loss
elif self.mode.startswith("add"):
loss = loss + regu_loss
# measure accuracy and record loss
acc1, acc5 = accuracy(output, labels, topk=(1, 5))
losses.update(loss.item(), images.size(0))
top1.update(acc1[0].item(), images.size(0))
top5.update(acc5[0].item(), images.size(0))
# compute gradient and do SGD step
optimizer.zero_grad()
loss.backward()
optimizer.step()
# measure elapsed time
batch_time.update(time.time() - end)
end = time.time()
if i % 10 == 0:
batch_log = (
"Warmup Train [{0}][{1}]\t"
"Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t"
"Data {data_time.val:.3f} ({data_time.avg:.3f})\t"
"Loss {losses.val:.4f} ({losses.avg:.4f})\t"
"Top-1 acc {top1.val:.3f} ({top1.avg:.3f})\t"
"Top-5 acc {top5.val:.3f} ({top5.avg:.3f})\t".format(
epoch + 1,
i,
batch_time=batch_time,
data_time=data_time,
losses=losses,
top1=top1,
top5=top5,
)
)
self.logger.info(batch_log)
def _warm_up(self):
"""
Warm up the model, while the architecture weights are not trained.
"""
for epoch in range(self.epoch, self.start_epoch):
self.logger.info("\n--------Warmup epoch: %d--------\n", epoch + 1)
self._train_epoch(epoch, self.model_optim)
# adjust learning rate
self.scheduler.step()
# validation
val_loss, val_top1, val_top5 = self._validate()
val_log = (
"Warmup Valid [{0}/{1}]\t"
"loss {2:.3f}\ttop-1 acc {3:.3f}\ttop-5 acc {4:.3f}".format(
epoch + 1, self.start_epoch, val_loss, val_top1, val_top5
)
)
self.logger.info(val_log)
if epoch % 10 == 0:
filename = os.path.join(
self.config.model_dir, "checkpoint_%s.pth" % epoch
)
self.save_checkpoint(epoch, filename)
def _train(self):
"""
Train the model; this trains both the model weights and the
architecture weights. The architecture weights are trained according
to the schedule. Before updating the architecture weights,
``requires_grad`` is enabled; it is disabled again after the update,
so that the architecture weights stay fixed while the model weights
are trained.
"""
arch_param_num = self.mutator.num_arch_params()
self.logger.info("#arch_params: {}".format(arch_param_num))
self.epoch = max(self.start_epoch, self.epoch)
ckpt_path = self.config.model_dir
choice_names = None
top1_best = 0.0
for epoch in range(self.epoch, self.n_epochs):
self.logger.info("\n--------Train epoch: %d--------\n", epoch + 1)
# update the weight parameters
self._train_epoch(epoch, self.model_optim)
# adjust learning rate
self.scheduler.step()
self.logger.info("Update architecture parameters")
# update the architecture parameters
self.mutator.arch_requires_grad()
self._train_epoch(epoch, self.arch_optimizer, True)
self.mutator.arch_disable_grad()
# temperature annealing
self.temp = self.temp * self.exp_anneal_rate
self.mutator.set_temperature(self.temp)
# sample the architecture of sub-network
choice_names = self._layer_choice_sample()
# validate
val_loss, val_top1, val_top5 = self._validate()
val_log = (
"Valid [{0}]\t"
"loss {1:.3f}\ttop-1 acc {2:.3f} \ttop-5 acc {3:.3f}".format(
epoch + 1, val_loss, val_top1, val_top5
)
)
self.logger.info(val_log)
if epoch % 10 == 0:
filename = os.path.join(ckpt_path, "checkpoint_%s.pth" % epoch)
self.save_checkpoint(epoch, filename, choice_names)
val_top1 = float(val_top1)
if val_top1 > top1_best:
filename = os.path.join(ckpt_path, "checkpoint_best.pth")
self.save_checkpoint(epoch, filename, choice_names)
top1_best = val_top1
def save_checkpoint(self, epoch, filename, choice_names=None):
"""
Save checkpoint of the whole model.
Save the model weights and the architecture weights to ``filename``,
and save the currently chosen architecture to ``arch_path``.
"""
state = {
"model": self.model.state_dict(),
"optim": self.model_optim.state_dict(),
"epoch": epoch,
"arch_sample": choice_names,
}
torch.save(state, filename)
self.logger.info("Save checkpoint to {0:}".format(filename))
if self.arch_path:
self.export(self.arch_path)
def load_checkpoint(self, filename):
"""
Load the checkpoint from ``filename``.
"""
ckpt = torch.load(filename)
self.epoch = ckpt["epoch"]
self.model.load_state_dict(ckpt["model"])
self.model_optim.load_state_dict(ckpt["optim"])
def train(self):
"""
Train the whole model.
"""
if self.load_ckpt:
ckpt_path = self.config.model_dir
filename = os.path.join(ckpt_path, "checkpoint_best.pth")
if os.path.exists(filename):
self.load_checkpoint(filename)
if self.epoch < self.start_epoch:
self._warm_up()
self._train()
def export(self, file_name):
"""
Export the chosen architecture into a file
Parameters
----------
file_name : str
the file that stores exported chosen architecture
"""
exported_arch = self.mutator.sample_final()
with open(file_name, "w") as f:
json.dump(
exported_arch,
f,
indent=2,
sort_keys=True,
cls=TorchTensorEncoder,
)
def validate(self):
raise NotImplementedError
def checkpoint(self):
raise NotImplementedError
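# A usage sketch (added for illustration; the supernet model, criterion and
# data loaders are hypothetical placeholders -- see examples/nas/oneshot/pfld
# for the real flow):
#
#   lookup_table = LookUpTable(config=nas_config, primitives=custom_ops)
#   model = SupernetModel(lookup_table)  # user model containing LayerChoices
#   optim = torch.optim.RMSprop(model.parameters(), lr=0.1)
#   trainer = FBNetTrainer(model, optim, criterion, torch.device("cuda"),
#                          [0], lookup_table, train_loader, valid_loader,
#                          n_epochs=300)
#   trainer.train()             # warm-up epochs first, then the joint search
#   trainer.export("arch.json") # dump the finally chosen ops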
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import gc # noqa: F401
import os
import timeit
import torch
import numpy as np
import torch.nn as nn
from nni.compression.pytorch.utils.counter import count_flops_params
LUT_FILE = "lut.npy"
LUT_PATH = "lut"
class NASConfig:
def __init__(
self,
perf_metric="flops",
lut_load=False,
model_dir=None,
nas_lr=0.01,
nas_weight_decay=5e-4,
mode="mul",
alpha=0.25,
beta=0.6,
start_epoch=50,
init_temperature=5.0,
exp_anneal_rate=np.exp(-0.045),
search_space=None,
):
# LUT of performance metric
# flops means the number of multiply operations; latency means the measured time cost on the target platform
self.perf_metric = perf_metric
assert perf_metric in [
"flops",
"latency",
], "perf_metric should be ['flops', 'latency']"
# whether to load or create the LUT file
self.lut_load = lut_load
# necessary dirs
self.lut_en = model_dir is not None
if self.lut_en:
self.model_dir = model_dir
os.makedirs(model_dir, exist_ok=True)
self.lut_path = os.path.join(model_dir, LUT_PATH)
os.makedirs(self.lut_path, exist_ok=True)
# NAS learning setting
self.nas_lr = nas_lr
self.nas_weight_decay = nas_weight_decay
# hardware-aware loss setting
self.mode = mode
assert mode in ["mul", "add"], "mode should be one of ['mul', 'add']"
self.alpha = alpha
self.beta = beta
# NAS training setting
self.start_epoch = start_epoch
self.init_temperature = init_temperature
self.exp_anneal_rate = exp_anneal_rate
# definition of search blocks and space
self.search_space = search_space
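# Note (added): with the defaults above, the Gumbel-Softmax temperature is
# annealed by a factor of exp(-0.045) ~= 0.956 per search epoch, e.g.
# 5.0 -> 4.78 -> 4.57 -> ... (see FBNetTrainer._train).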
class RegularizerLoss(nn.Module):
"""Auxilliary loss for hardware-aware NAS."""
def __init__(self, config):
"""
Parameters
----------
config : class
manages the configuration for NAS training, the search space, etc.
"""
super(RegularizerLoss, self).__init__()
self.mode = config.mode
self.alpha = config.alpha
self.beta = config.beta
def forward(self, perf_cost, batch_size=1):
"""
Parameters
----------
perf_cost : tensor
the accumulated performance cost
batch_size : int
batch size for normalization
Returns
-------
output: tensor
the hardware-aware constraint loss
"""
if self.mode == "mul":
log_loss = torch.log(perf_cost / batch_size) ** self.beta
return self.alpha * log_loss
elif self.mode == "add":
linear_loss = (perf_cost / batch_size) ** self.beta
return self.alpha * linear_loss
else:
raise NotImplementedError
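# Worked example (added): with the defaults alpha=0.25, beta=0.6 and an
# accumulated perf_cost of, say, 1000.0, mode "mul" scales the task loss by
# 0.25 * log(1000) ** 0.6 ~= 0.80, while mode "add" adds a penalty of
# 0.25 * 1000 ** 0.6 ~= 15.77.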
def accuracy(output, target, topk=(1,)):
"""
Computes the precision@k for the specified values of k
Parameters
----------
output : pytorch tensor
output, e.g., predicted value
target : pytorch tensor
label
topk : tuple
specify top1 and top5
Returns
-------
list
accuracy of top1 and top5
"""
maxk = max(topk)
batch_size = target.size(0)
_, pred = output.topk(maxk, 1, True, True)
pred = pred.t()
correct = pred.eq(target.view(1, -1).expand_as(pred))
res = []
for k in topk:
correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
res.append(correct_k.mul_(100.0 / batch_size))
return res
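# Example (added): for `output` of shape (N, C) and integer `target` of
# shape (N,), `acc1, acc5 = accuracy(output, target, topk=(1, 5))` returns
# the top-1/top-5 percentages as 1-element tensors.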
def supernet_sample(model, state_dict, sampled_arch=None, lookup_table=None):
"""
Initialize the searched sub-model from supernet.
Parameters
----------
model : pytorch model
the created subnet
state_dict : checkpoint
the checkpoint of supernet, including the pre-trained params
sampled_arch : list of str
the searched layer names of the subnet
lookup_table : class
to manage the candidate ops, layer information and layer performance
"""
replace = list()
stages = [stage for stage in lookup_table.layer_num]
stage_lnum = [lookup_table.layer_num[stage] for stage in stages]
if sampled_arch:
layer_id = 0
for i, stage in enumerate(stages):
ops_names = [op_name for op_name in lookup_table.lut_ops[stage]]
for j in range(stage_lnum[i]):
searched_op = sampled_arch[layer_id]
op_i = ops_names.index(searched_op)
replace.append(
[
"blocks.{}.".format(layer_id),
"blocks.{}.op.".format(layer_id),
"blocks.{}.{}.".format(layer_id, op_i),
]
)
layer_id += 1
model_init(model, state_dict, replace=replace)
def model_init(model, state_dict, replace=None):
"""Initialize the model from state_dict."""
prefix = "module."
param_dict = dict()
for k, v in state_dict.items():
if k.startswith(prefix):
k = k[len(prefix):]
param_dict[k] = v
for name, m in model.named_modules():
if replace:
for layer_replace in replace:
assert len(layer_replace) == 3, "each replace entry should have three elements"
pre_scope, key, replace_key = layer_replace
if pre_scope in name:
name = name.replace(key, replace_key)
# Copy the state_dict to current model
if (name + ".weight" in param_dict) or (
name + ".running_mean" in param_dict
):
if isinstance(m, nn.BatchNorm2d):
shape = m.running_mean.shape
if shape == param_dict[name + ".running_mean"].shape:
if m.weight is not None:
m.weight.data = param_dict[name + ".weight"]
m.bias.data = param_dict[name + ".bias"]
m.running_mean = param_dict[name + ".running_mean"]
m.running_var = param_dict[name + ".running_var"]
elif isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
shape = m.weight.data.shape
if shape == param_dict[name + ".weight"].shape:
m.weight.data = param_dict[name + ".weight"]
if m.bias is not None:
m.bias.data = param_dict[name + ".bias"]
elif isinstance(m, nn.ConvTranspose2d):
m.weight.data = param_dict[name + ".weight"]
if m.bias is not None:
m.bias.data = param_dict[name + ".bias"]
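# Usage sketch (added), assuming a checkpoint written by
# FBNetTrainer.save_checkpoint, which stores the keys "model" and
# "arch_sample":
#
#   ckpt = torch.load("checkpoint_best.pth", map_location="cpu")
#   supernet_sample(subnet, ckpt["model"], ckpt["arch_sample"], lookup_table)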
class LookUpTable:
"""Build look-up table for NAS."""
def __init__(self, config, primitives):
"""
Parameters
----------
config : class
manages the configuration for NAS training, the search space, etc.
primitives : dict
maps each candidate op name to its constructor
"""
self.config = config
# definition of search blocks and space
self.search_space = config.search_space
# layers for NAS
self.cnt_layers = len(self.search_space["input_shape"])
# constructors for each operation
self.lut_ops = {
stage_name: {
op_name: primitives[op_name]
for op_name in self.search_space["stages"][stage_name]["ops"]
}
for stage_name in self.search_space["stages"]
}
self.layer_num = {
stage_name: self.search_space["stages"][stage_name]["layer_num"]
for stage_name in self.search_space["stages"]
}
# arguments for the op constructors; input_shapes are just for convenience
self.layer_configs, self.layer_in_shapes = self._layer_configs()
# lookup_table
self.perf_metric = config.perf_metric
if config.lut_en:
self.lut_perf = None
self.lut_file = os.path.join(config.lut_path, LUT_FILE)
if config.lut_load:
self._load_from_file()
else:
self._create_perfs()
def _layer_configs(self):
"""Generate basic params for different layers."""
# layer_configs are: c_in, c_out, stride, fm_size
layer_configs = [
[
self.search_space["input_shape"][layer_id][0],
self.search_space["channel_size"][layer_id],
self.search_space["strides"][layer_id],
self.search_space["fm_size"][layer_id],
]
for layer_id in range(self.cnt_layers)
]
# layer_in_shapes are (C_in, input_w, input_h)
layer_in_shapes = self.search_space["input_shape"]
return layer_configs, layer_in_shapes
def _create_perfs(self, cnt_of_runs=200):
"""Create performance cost for each op."""
if self.perf_metric == "latency":
self.lut_perf = self._calculate_latency(cnt_of_runs)
elif self.perf_metric == "flops":
self.lut_perf = self._calculate_flops()
self._write_lut_to_file()
def _calculate_flops(self, eps=0.001):
"""FLOPs cost."""
flops_lut = [{} for i in range(self.cnt_layers)]
layer_id = 0
for stage_name in self.lut_ops:
stage_ops = self.lut_ops[stage_name]
ops_num = self.layer_num[stage_name]
for _ in range(ops_num):
for op_name in stage_ops:
layer_config = self.layer_configs[layer_id]
key_params = {"fm_size": layer_config[3]}
op = stage_ops[op_name](*layer_config[0:3], **key_params)
# measured in FLOPs
in_shape = self.layer_in_shapes[layer_id]
x = (1, in_shape[0], in_shape[1], in_shape[2])
flops, _, _ = count_flops_params(op, x, verbose=False)
flops = eps if flops == 0.0 else flops
flops_lut[layer_id][op_name] = float(flops)
layer_id += 1
return flops_lut
def _calculate_latency(self, cnt_of_runs):
"""Latency cost."""
LATENCY_BATCH_SIZE = 1
latency_lut = [{} for i in range(self.cnt_layers)]
layer_id = 0
for stage_name in self.lut_ops:
stage_ops = self.lut_ops[stage_name]
ops_num = self.layer_num[stage_name]
for _ in range(ops_num):
for op_name in stage_ops:
layer_config = self.layer_configs[layer_id]
key_params = {"fm_size": layer_config[3]}
op = stage_ops[op_name](*layer_config[0:3], **key_params)
input_data = torch.randn(
(LATENCY_BATCH_SIZE, *self.layer_in_shapes[layer_id])
)
globals()["op"], globals()["input_data"] = op, input_data
total_time = timeit.timeit(
"output = op(input_data)",
setup="gc.enable()",
globals=globals(),
number=cnt_of_runs,
)
# measured in microseconds
latency_lut[layer_id][op_name] = (
total_time / cnt_of_runs / LATENCY_BATCH_SIZE * 1e6
)
layer_id += 1
return latency_lut
def _write_lut_to_file(self):
"""Save lut as numpy file."""
np.save(self.lut_file, self.lut_perf)
def _load_from_file(self):
"""Load numpy file."""
self.lut_perf = np.load(self.lut_file, allow_pickle=True)
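# Construction sketch (added): `search_space` and `PRIMITIVES` are
# user-supplied; with perf_metric="flops" the per-layer costs are computed
# via count_flops_params and cached to <model_dir>/lut/lut.npy:
#
#   config = NASConfig(perf_metric="flops", model_dir="./ckpt",
#                      search_space=search_space)
#   lut = LookUpTable(config, primitives=PRIMITIVES)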