"vscode:/vscode.git/clone" did not exist on "f8ba4017007dd189c8a0b9968a1f84b32e61a839"
Unverified Commit 5fe29b06 authored by Jiahang Xu, committed by GitHub

Evaluate current SPOS support (#4322)

parent cf7032a5
@@ -6,24 +6,23 @@ Introduction
`Single Path One-Shot Neural Architecture Search with Uniform Sampling <https://arxiv.org/abs/1904.00420>`__ proposes a one-shot NAS method that addresses the difficulties of training one-shot NAS models: it constructs a simplified supernet trained with a uniform path sampling method, so that all underlying architectures (and their weights) are trained fully and equally. An evolutionary algorithm is then applied to efficiently search for the best-performing architectures without any fine-tuning.
The implementation on NNI is based on the `official repo <https://github.com/megvii-model/SinglePathOneShot>`__. We implement a trainer that trains the supernet and an evolution tuner that leverages the NNI framework to speed up the evolutionary search phase.
Examples
--------
Here is a use case based on the search space in the paper. However, we apply a latency limit instead of a FLOPs limit in the architecture search phase.
:githublink:`Example code <examples/nas/oneshot/spos>`

Requirements
^^^^^^^^^^^^
Prepare ImageNet in the standard format (follow the script `here <https://gist.github.com/BIGBALLON/8a71d225eff18d88e469e6ea9b39cef4>`__\ ). Linking it to ``data/imagenet`` will be more convenient.

Download the checkpoint file from `here <https://1drv.ms/u/s!Am_mmG2-KsrnajesvSdfsq_cN48?e=aHVppN>`__ (maintained by `Megvii <https://github.com/megvii-model>`__\ ) if you don't want to retrain the supernet. Put ``checkpoint-150000.pth.tar`` under the ``data`` directory.
After preparation, it's expected to have the following code structure:
@@ -32,19 +31,16 @@ After preparation, it's expected to have the following code structure:
spos
├── architecture_final.json
├── blocks.py
├── data
│   ├── imagenet
│   │   ├── train
│   │   └── val
│   └── checkpoint-150000.pth.tar
├── network.py
├── readme.md
├── supernet.py
├── evaluation.py
├── search.py
└── utils.py
Step 1. Train Supernet
@@ -61,30 +57,18 @@ NOTE: The data loading used in the official repo is `slightly different from usu
Step 2. Evolution Search
^^^^^^^^^^^^^^^^^^^^^^^^
Single Path One-Shot leverages an evolutionary algorithm to search for the best architecture. In the paper, the search module, which is responsible for testing each sampled architecture, recalculates all batch-norm statistics on a subset of training images and evaluates the architecture on the full validation set.
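The following is a minimal sketch of that batch-norm recalibration (a sketch only, assuming a CUDA model and a standard PyTorch loader; it mirrors the ``retrain_bn`` helper in the legacy ``tester.py`` shown later in this diff):

.. code-block:: python

   import torch
   import torch.nn as nn

   def recalibrate_bn(model, loader, max_iters=200):
       # Reset the running statistics of every BatchNorm2d layer.
       for m in model.modules():
           if isinstance(m, nn.BatchNorm2d):
               m.running_mean = torch.zeros_like(m.running_mean)
               m.running_var = torch.ones_like(m.running_var)
       # Forward passes in train mode re-accumulate the statistics;
       # no_grad avoids building the autograd graph.
       model.train()
       with torch.no_grad():
           for step, (x, _) in enumerate(loader):
               model(x.cuda())
               if step + 1 >= max_iters:
                   break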
In this example, we provide an incomplete implementation of the evolution search: it only supports training from scratch, and inheriting weights from a pretrained supernet is not supported yet. To search with the regularized evolution strategy, run
.. code-block:: bash

   python search.py
The final architecture exported from every epoch of evolution can be found in ``trials`` under the working directory of your tuner, which, by default, is ``$HOME/nni-experiments/your_experiment_id/trials``.
Step 3. Train for Evaluation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash
@@ -106,7 +90,7 @@ Known Limitations
* Block search only. Channel search is not supported yet.
* In the search phase, training from scratch is required. Inheriting weights from the supernet is not supported yet.
Current Reproduction Results
----------------------------
{
    "LayerChoice1": "2",
    "LayerChoice2": "1",
    "LayerChoice3": "0",
    "LayerChoice4": "1",
    "LayerChoice5": "2",
    "LayerChoice6": "0",
    "LayerChoice7": "2",
    "LayerChoice8": "0",
    "LayerChoice9": "2",
    "LayerChoice10": "0",
    "LayerChoice11": "2",
    "LayerChoice12": "3",
    "LayerChoice13": "0",
    "LayerChoice14": "0",
    "LayerChoice15": "0",
    "LayerChoice16": "0",
    "LayerChoice17": "3",
    "LayerChoice18": "2",
    "LayerChoice19": "3",
    "LayerChoice20": "3"
}
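This exported file can be fed back into the supernet definition to instantiate the selected architecture for evaluation. A minimal sketch using the Retiarii API adopted by this commit (the file path is illustrative):

.. code-block:: python

   from nni.retiarii import fixed_arch
   from network import ShuffleNetV2OneShot

   # Every LayerChoice is fixed to the exported candidate index,
   # e.g. "2" selects the third block in the choice list.
   with fixed_arch("architecture_final.json"):
       model = ShuffleNetV2OneShot(affine=True)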
authorName: unknown
experimentName: SPOS Search
trialConcurrency: 4
maxExecDuration: 7d
maxTrialNum: 99999
trainingServicePlatform: local
searchSpacePath: nni_auto_gen_search_space.json
useAnnotation: false
tuner:
  codeDir: .
  classFileName: tuner.py
  className: EvolutionWithFlops
trial:
  command: python tester.py --spos-prep
  codeDir: .
  gpuNum: 1
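For reference, this legacy config was driven through nnictl, as the pre-commit documentation described: the serialized search space was generated first, then the experiment launched.

.. code-block:: bash

   nnictl ss_gen -t "python tester.py"
   nnictl create --config config_search.yml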
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

import os

import nvidia.dali.ops as ops
import nvidia.dali.types as types
import torch.utils.data
from nvidia.dali.pipeline import Pipeline
from nvidia.dali.plugin.pytorch import DALIClassificationIterator


class HybridTrainPipe(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, data_dir, crop, seed=12, local_rank=0, world_size=1,
                 spos_pre=False):
        super(HybridTrainPipe, self).__init__(batch_size, num_threads, device_id, seed=seed + device_id)
        color_space_type = types.BGR if spos_pre else types.RGB
        self.input = ops.FileReader(file_root=data_dir, shard_id=local_rank, num_shards=world_size, random_shuffle=True)
        self.decode = ops.ImageDecoder(device="mixed", output_type=color_space_type)
        self.res = ops.RandomResizedCrop(device="gpu", size=crop,
                                         interp_type=types.INTERP_LINEAR if spos_pre else types.INTERP_TRIANGULAR)
        self.twist = ops.ColorTwist(device="gpu")
        self.jitter_rng = ops.Uniform(range=[0.6, 1.4])
        self.cmnp = ops.CropMirrorNormalize(device="gpu",
                                            output_dtype=types.FLOAT,
                                            output_layout=types.NCHW,
                                            image_type=color_space_type,
                                            mean=0. if spos_pre else [0.485 * 255, 0.456 * 255, 0.406 * 255],
                                            std=1. if spos_pre else [0.229 * 255, 0.224 * 255, 0.225 * 255])
        self.coin = ops.CoinFlip(probability=0.5)

    def define_graph(self):
        rng = self.coin()
        self.jpegs, self.labels = self.input(name="Reader")
        images = self.decode(self.jpegs)
        images = self.res(images)
        images = self.twist(images, saturation=self.jitter_rng(),
                            contrast=self.jitter_rng(), brightness=self.jitter_rng())
        output = self.cmnp(images, mirror=rng)
        return [output, self.labels]


class HybridValPipe(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, data_dir, crop, size, seed=12, local_rank=0, world_size=1,
                 spos_pre=False, shuffle=False):
        super(HybridValPipe, self).__init__(batch_size, num_threads, device_id, seed=seed + device_id)
        color_space_type = types.BGR if spos_pre else types.RGB
        self.input = ops.FileReader(file_root=data_dir, shard_id=local_rank, num_shards=world_size,
                                    random_shuffle=shuffle)
        self.decode = ops.ImageDecoder(device="mixed", output_type=color_space_type)
        self.res = ops.Resize(device="gpu", resize_shorter=size,
                              interp_type=types.INTERP_LINEAR if spos_pre else types.INTERP_TRIANGULAR)
        self.cmnp = ops.CropMirrorNormalize(device="gpu",
                                            output_dtype=types.FLOAT,
                                            output_layout=types.NCHW,
                                            crop=(crop, crop),
                                            image_type=color_space_type,
                                            mean=0. if spos_pre else [0.485 * 255, 0.456 * 255, 0.406 * 255],
                                            std=1. if spos_pre else [0.229 * 255, 0.224 * 255, 0.225 * 255])

    def define_graph(self):
        self.jpegs, self.labels = self.input(name="Reader")
        images = self.decode(self.jpegs)
        images = self.res(images)
        output = self.cmnp(images)
        return [output, self.labels]


class ClassificationWrapper:
    def __init__(self, loader, size):
        self.loader = loader
        self.size = size

    def __iter__(self):
        return self

    def __next__(self):
        data = next(self.loader)
        return data[0]["data"], data[0]["label"].view(-1).long().cuda(non_blocking=True)

    def __len__(self):
        return self.size


def get_imagenet_iter_dali(split, image_dir, batch_size, num_threads, crop=224, val_size=256,
                           spos_preprocessing=False, seed=12, shuffle=False, device_id=None):
    world_size, local_rank = 1, 0
    if device_id is None:
        device_id = torch.cuda.device_count() - 1  # use last gpu
    if split == "train":
        pipeline = HybridTrainPipe(batch_size=batch_size, num_threads=num_threads, device_id=device_id,
                                   data_dir=os.path.join(image_dir, "train"), seed=seed,
                                   crop=crop, world_size=world_size, local_rank=local_rank,
                                   spos_pre=spos_preprocessing)
    elif split == "val":
        pipeline = HybridValPipe(batch_size=batch_size, num_threads=num_threads, device_id=device_id,
                                 data_dir=os.path.join(image_dir, "val"), seed=seed,
                                 crop=crop, size=val_size, world_size=world_size, local_rank=local_rank,
                                 spos_pre=spos_preprocessing, shuffle=shuffle)
    else:
        raise AssertionError
    pipeline.build()
    num_samples = pipeline.epoch_size("Reader")
    return ClassificationWrapper(
        DALIClassificationIterator(pipeline, size=num_samples, fill_last_batch=split == "train",
                                   auto_reset=True), (num_samples + batch_size - 1) // batch_size)
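A minimal usage sketch of this loader (assuming ImageNet under ``data/imagenet`` and at least one CUDA device; batch size and thread count are illustrative):

.. code-block:: python

   # Build a DALI-backed training iterator with SPOS preprocessing
   # (BGR channel order, 0-255 pixel range) and fetch one batch.
   train_loader = get_imagenet_iter_dali("train", "data/imagenet", batch_size=128,
                                         num_threads=4, spos_preprocessing=True)
   images, labels = next(iter(train_loader))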
@@ -9,13 +9,14 @@ import random
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from nni.retiarii import fixed_arch
from nni.retiarii.oneshot.pytorch.utils import AverageMeterGroup
from torch.utils.tensorboard import SummaryWriter

from network import ShuffleNetV2OneShot
from utils import CrossEntropyLabelSmooth, accuracy, ToBGRTensor

logger = logging.getLogger("nni.spos.scratch")
@@ -26,6 +27,7 @@ def train(epoch, model, criterion, optimizer, loader, writer, args):
    cur_lr = optimizer.param_groups[0]["lr"]
    for step, (x, y) in enumerate(loader):
        x, y = x.to('cuda'), y.to('cuda')
        cur_step = len(loader) * epoch + step
        optimizer.zero_grad()
        logits = model(x)
@@ -54,6 +56,7 @@ def validate(epoch, model, criterion, loader, writer, args):
    meters = AverageMeterGroup()
    with torch.no_grad():
        for step, (x, y) in enumerate(loader):
            x, y = x.to('cuda'), y.to('cuda')
            logits = model(x)
            loss = criterion(logits, y)
            metrics = accuracy(logits, y)
@@ -109,9 +112,9 @@ if __name__ == "__main__":
    random.seed(args.seed)
    torch.backends.cudnn.deterministic = True

    with fixed_arch(args.architecture):
        model = ShuffleNetV2OneShot(affine=True)
    model.cuda()
    if torch.cuda.device_count() > 1:  # exclude last gpu, saving for data preprocessing on gpu
        model = nn.DataParallel(model, device_ids=list(range(0, torch.cuda.device_count() - 1)))
    criterion = CrossEntropyLabelSmooth(1000, args.label_smoothing)
@@ -128,14 +131,25 @@ if __name__ == "__main__":
        raise ValueError("'%s' not supported." % args.lr_decay)
    writer = SummaryWriter(log_dir=args.tb_dir)

    if args.spos_preprocessing:
        trans = transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
            transforms.RandomHorizontalFlip(0.5),
            ToBGRTensor(),
        ])
    else:
        trans = transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.ToTensor()
        ])
    train_dataset = datasets.ImageNet(args.imagenet_dir, split='train', transform=trans)
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batch_size, num_workers=args.workers)
    val_dataset = datasets.ImageNet(args.imagenet_dir, split='val', transform=trans)
    valid_loader = torch.utils.data.DataLoader(val_dataset, batch_size=args.batch_size, num_workers=args.workers)

    for epoch in range(args.epochs):
        train(epoch, model, criterion, optimizer, train_loader, writer, args)
        validate(epoch, model, criterion, valid_loader, writer, args)
        scheduler.step()
        dump_checkpoint(model, epoch, "scratch_checkpoints")
@@ -6,8 +6,8 @@ import pickle
import re

import torch
import nni.retiarii.nn.pytorch as nn
from nni.retiarii.nn.pytorch import LayerChoice

from blocks import ShuffleNetBlock, ShuffleXceptionBlock
@@ -20,23 +20,20 @@ class ShuffleNetV2OneShot(nn.Module):
        'xception_3x3',
    ]

    def __init__(self, input_size=224, first_conv_channels=16, last_conv_channels=1024,
                 n_classes=1000, affine=False):
        super().__init__()

        assert input_size % 32 == 0
        self.stage_blocks = [4, 4, 8, 4]
        self.stage_channels = [64, 160, 320, 640]
        self._input_size = input_size
        self._feature_map_size = input_size
        self._first_conv_channels = first_conv_channels
        self._last_conv_channels = last_conv_channels
        self._n_classes = n_classes
        self._affine = affine
        self._layerchoice_count = 0

        # building first layer
        self.first_conv = nn.Sequential(
@@ -75,19 +72,15 @@ class ShuffleNetV2OneShot(nn.Module):
            base_mid_channels = channels // 2
            mid_channels = int(base_mid_channels)  # prepare for scale
            self._layerchoice_count += 1
            choice_block = LayerChoice([
                ShuffleNetBlock(inp, oup, mid_channels=mid_channels, ksize=3, stride=stride, affine=self._affine),
                ShuffleNetBlock(inp, oup, mid_channels=mid_channels, ksize=5, stride=stride, affine=self._affine),
                ShuffleNetBlock(inp, oup, mid_channels=mid_channels, ksize=7, stride=stride, affine=self._affine),
                ShuffleXceptionBlock(inp, oup, mid_channels=mid_channels, stride=stride, affine=self._affine)
            ], label="LayerChoice" + str(self._layerchoice_count))
            result.append(choice_block)
            if stride == 2:
                self._feature_map_size //= 2
        return result
@@ -104,46 +97,30 @@ class ShuffleNetV2OneShot(nn.Module):
        x = self.classifier(x)
        return x
    def _initialize_weights(self):
        for name, m in self.named_modules():
            if isinstance(m, nn.Conv2d):
                if 'first' in name:
                    torch.nn.init.normal_(m.weight, 0, 0.01)
                else:
                    torch.nn.init.normal_(m.weight, 0, 1.0 / m.weight.shape[1])
                if m.bias is not None:
                    torch.nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                if m.weight is not None:
                    torch.nn.init.constant_(m.weight, 1)
                if m.bias is not None:
                    torch.nn.init.constant_(m.bias, 0.0001)
                torch.nn.init.constant_(m.running_mean, 0)
            elif isinstance(m, nn.BatchNorm1d):
                torch.nn.init.constant_(m.weight, 1)
                if m.bias is not None:
                    torch.nn.init.constant_(m.bias, 0.0001)
                torch.nn.init.constant_(m.running_mean, 0)
            elif isinstance(m, nn.Linear):
                torch.nn.init.normal_(m.weight, 0, 0.01)
                if m.bias is not None:
                    torch.nn.init.constant_(m.bias, 0)
def load_and_parse_state_dict(filepath="./data/checkpoint-150000.pth.tar"):
# This file demos the usage of multi-trial NAS on the SPOS search space.
import click
import json
import nni.retiarii.evaluator.pytorch as pl
import nni.retiarii.strategy as strategy
from nni.retiarii import serialize
from nni.retiarii.experiment.pytorch import RetiariiExeConfig, RetiariiExperiment
from torchvision import transforms
from torchvision.datasets import CIFAR10

from nn_meter import load_latency_predictor
from network import ShuffleNetV2OneShot
from utils import get_archchoice_by_model
class LatencyFilter:
    def __init__(self, threshold, predictor, predictor_version=None, reverse=False):
        """
        Filter the models according to predicted latency.

        Parameters
        ----------
@@ -140,7 +27,7 @@ class LatencyFilter:
            determine the targeted device
        reverse: `bool`
            if reverse is `False`, then the model returns `True` when `latency < threshold`,
            and the reverse otherwise
        """
        self.predictors = load_latency_predictor(predictor, predictor_version)
        self.threshold = threshold
@@ -153,7 +40,7 @@ class LatencyFilter:
@click.command()
@click.option('--port', default=8081, help='On which port the experiment is run.')
def _main(port):
    base_model = ShuffleNetV2OneShot(32)
    base_predictor = 'cortexA76cpu_tflite21'
    transf = [
        transforms.RandomCrop(32, padding=4),
@@ -170,13 +57,12 @@ def _main(port):
                              val_dataloaders=pl.DataLoader(test_dataset, batch_size=64),
                              max_epochs=2, gpus=1)

    simple_strategy = strategy.RegularizedEvolution(
        model_filter=LatencyFilter(threshold=100, predictor=base_predictor),
        population_size=2, cycles=2)

    exp = RetiariiExperiment(base_model, trainer, strategy=simple_strategy)

    exp_config = RetiariiExeConfig('local')
    exp_config.trial_concurrency = 2
    # exp_config.max_trial_number = 2
    exp_config.trial_gpu_number = 1
    exp_config.training_service.use_active_gpu = False
    exp_config.execution_engine = 'base'
@@ -185,8 +71,10 @@ def _main(port):
    exp.run(exp_config, port)
    print('Exported models:')
    for i, model in enumerate(exp.export_top_models(formatter='dict')):
        print(model)
        with open(f'architecture_final_{i}.json', 'w') as f:
            json.dump(get_archchoice_by_model(model), f, indent=4)


if __name__ == '__main__':
@@ -8,13 +8,12 @@ import random
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from nni.retiarii.oneshot.pytorch import SinglePathTrainer

from network import ShuffleNetV2OneShot, load_and_parse_state_dict
from utils import CrossEntropyLabelSmooth, accuracy, ToBGRTensor

logger = logging.getLogger("nni.spos.supernet")
@@ -45,16 +44,17 @@ if __name__ == "__main__":
    torch.backends.cudnn.deterministic = True

    model = ShuffleNetV2OneShot()
    if args.load_checkpoint:
        if not args.spos_preprocessing:
            logger.warning("You might want to use SPOS preprocessing if you are loading their checkpoints.")
        # load the supernet checkpoint and merge it into the model's current state dict
        model_dict = model.state_dict()
        model_dict.update(load_and_parse_state_dict())
        model.load_state_dict(model_dict)
        logger.info('Model loaded from ./data/checkpoint-150000.pth.tar')
    model.cuda()
    if torch.cuda.device_count() > 1:  # exclude last gpu, saving for data preprocessing on gpu
        model = nn.DataParallel(model, device_ids=list(range(0, torch.cuda.device_count() - 1)))
    criterion = CrossEntropyLabelSmooth(1000, args.label_smoothing)
    optimizer = torch.optim.SGD(model.parameters(), lr=args.learning_rate,
                                momentum=args.momentum, weight_decay=args.weight_decay)
@@ -62,14 +62,22 @@ if __name__ == "__main__":
                                                  lambda step: (1.0 - step / args.epochs)
                                                  if step <= args.epochs else 0,
                                                  last_epoch=-1)

    if args.spos_preprocessing:
        trans = transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
            transforms.RandomHorizontalFlip(0.5),
            ToBGRTensor(),
        ])
    else:
        trans = transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.ToTensor()
        ])
    train_dataset = datasets.ImageNet(args.imagenet_dir, split='train', transform=trans)
    val_dataset = datasets.ImageNet(args.imagenet_dir, split='val', transform=trans)
    trainer = SinglePathTrainer(model, criterion, accuracy, optimizer,
                                args.epochs, train_dataset, val_dataset,
                                batch_size=args.batch_size,
                                log_frequency=args.log_frequency, workers=args.workers)
    trainer.fit()
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

import argparse
import logging
import random
import time
from itertools import cycle

import nni
import numpy as np
import torch
import torch.nn as nn
from nni.algorithms.nas.pytorch.classic_nas import get_and_apply_next_architecture
from nni.nas.pytorch.utils import AverageMeterGroup

from dataloader import get_imagenet_iter_dali
from network import ShuffleNetV2OneShot, load_and_parse_state_dict
from utils import CrossEntropyLabelSmooth, accuracy

logger = logging.getLogger("nni.spos.tester")


def retrain_bn(model, criterion, max_iters, log_freq, loader):
    with torch.no_grad():
        logger.info("Clear BN statistics...")
        for m in model.modules():
            if isinstance(m, nn.BatchNorm2d):
                m.running_mean = torch.zeros_like(m.running_mean)
                m.running_var = torch.ones_like(m.running_var)

        logger.info("Train BN with training set (BN sanitize)...")
        model.train()
        meters = AverageMeterGroup()
        for step in range(max_iters):
            inputs, targets = next(loader)
            logits = model(inputs)
            loss = criterion(logits, targets)
            metrics = accuracy(logits, targets)
            metrics["loss"] = loss.item()
            meters.update(metrics)
            if step % log_freq == 0 or step + 1 == max_iters:
                logger.info("Train Step [%d/%d] %s", step + 1, max_iters, meters)


def test_acc(model, criterion, log_freq, loader):
    logger.info("Start testing...")
    model.eval()
    meters = AverageMeterGroup()
    start_time = time.time()
    with torch.no_grad():
        for step, (inputs, targets) in enumerate(loader):
            logits = model(inputs)
            loss = criterion(logits, targets)
            metrics = accuracy(logits, targets)
            metrics["loss"] = loss.item()
            meters.update(metrics)
            if step % log_freq == 0 or step + 1 == len(loader):
                logger.info("Valid Step [%d/%d] time %.3fs acc1 %.4f acc5 %.4f loss %.4f",
                            step + 1, len(loader), time.time() - start_time,
                            meters.acc1.avg, meters.acc5.avg, meters.loss.avg)
    return meters.acc1.avg


def evaluate_acc(model, criterion, args, loader_train, loader_test):
    acc_before = test_acc(model, criterion, args.log_frequency, loader_test)
    nni.report_intermediate_result(acc_before)

    retrain_bn(model, criterion, args.train_iters, args.log_frequency, loader_train)
    acc = test_acc(model, criterion, args.log_frequency, loader_test)
    assert isinstance(acc, float)
    nni.report_intermediate_result(acc)
    nni.report_final_result(acc)


if __name__ == "__main__":
    parser = argparse.ArgumentParser("SPOS Candidate Tester")
    parser.add_argument("--imagenet-dir", type=str, default="./data/imagenet")
    parser.add_argument("--checkpoint", type=str, default="./data/checkpoint-150000.pth.tar")
    parser.add_argument("--spos-preprocessing", action="store_true", default=False,
                        help="When true, image values will range from 0 to 255 and use BGR "
                             "(as in original repo).")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--workers", type=int, default=6)
    parser.add_argument("--train-batch-size", type=int, default=128)
    parser.add_argument("--train-iters", type=int, default=200)
    parser.add_argument("--test-batch-size", type=int, default=512)
    parser.add_argument("--log-frequency", type=int, default=10)
    args = parser.parse_args()

    # using a fixed set of images (fixed seed) improves reproducibility and performance
    torch.manual_seed(args.seed)
    torch.cuda.manual_seed_all(args.seed)
    np.random.seed(args.seed)
    random.seed(args.seed)
    torch.backends.cudnn.deterministic = True

    assert torch.cuda.is_available()

    model = ShuffleNetV2OneShot()
    criterion = CrossEntropyLabelSmooth(1000, 0.1)
    get_and_apply_next_architecture(model)
    model.load_state_dict(load_and_parse_state_dict(filepath=args.checkpoint))
    model.cuda()

    train_loader = get_imagenet_iter_dali("train", args.imagenet_dir, args.train_batch_size, args.workers,
                                          spos_preprocessing=args.spos_preprocessing,
                                          seed=args.seed, device_id=0)
    val_loader = get_imagenet_iter_dali("val", args.imagenet_dir, args.test_batch_size, args.workers,
                                        spos_preprocessing=args.spos_preprocessing, shuffle=True,
                                        seed=args.seed, device_id=0)
    train_loader = cycle(train_loader)
    evaluate_acc(model, criterion, args, train_loader, val_loader)
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

from nni.nas.pytorch.spos import SPOSEvolution

from network import ShuffleNetV2OneShot


class EvolutionWithFlops(SPOSEvolution):
    """
    This tuner extends the evolution tuner by limiting the flops of the
    architectures it generates. Needs a function to examine the flops.
    """

    def __init__(self, flops_limit=330E6, **kwargs):
        super().__init__(**kwargs)
        self.model = ShuffleNetV2OneShot()
        self.flops_limit = flops_limit

    def _is_legal(self, cand):
        if not super()._is_legal(cand):
            return False
        if self.model.get_candidate_flops(cand) > self.flops_limit:
            return False
        return True
@@ -3,6 +3,8 @@
import torch
import torch.nn as nn
import numpy as np
import PIL


class CrossEntropyLabelSmooth(nn.Module):
@@ -39,3 +41,24 @@ def accuracy(output, target, topk=(1, 5)):
        correct_k = correct[:k].reshape(-1).float().sum(0)
        res["acc{}".format(k)] = correct_k.mul_(1.0 / batch_size).item()
    return res
class ToBGRTensor(object):

    def __call__(self, img):
        assert isinstance(img, (np.ndarray, PIL.Image.Image))
        if isinstance(img, PIL.Image.Image):
            img = np.asarray(img)
        img = img[:, :, ::-1]  # RGB to BGR
        img = np.transpose(img, [2, 0, 1])  # HWC to (3, H, W)
        img = np.ascontiguousarray(img)
        img = torch.from_numpy(img).float()
        return img
def get_archchoice_by_model(model):
    result = {}
    for k, v in model.items():
        assert k in v
        result[k] = model[k].split("_")[-1]
    return result
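A small usage sketch of ``get_archchoice_by_model``. It assumes the exported dict values embed the choice key plus an index suffix (hence the ``assert k in v``); the concrete values below are illustrative:

.. code-block:: python

   exported = {"LayerChoice1": "LayerChoice1_2", "LayerChoice2": "LayerChoice2_0"}
   print(get_archchoice_by_model(exported))
   # {"LayerChoice1": "2", "LayerChoice2": "0"}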
@@ -10,7 +10,7 @@ import time
from ..execution import query_available_resources, submit_models
from ..graph import ModelStatus
from .base import BaseStrategy
from .utils import dry_run_for_search_space, get_targeted_model, filter_model

_logger = logging.getLogger(__name__)
@@ -47,10 +47,12 @@ class RegularizedEvolution(BaseStrategy):
Can be one of "ignore" and "worst". If "ignore", simply give up the model and find a new one. Can be one of "ignore" and "worst". If "ignore", simply give up the model and find a new one.
If "worst", mark the model as -inf (if maximize, inf if minimize), so that the algorithm "learns" to avoid such model. If "worst", mark the model as -inf (if maximize, inf if minimize), so that the algorithm "learns" to avoid such model.
Default: ignore. Default: ignore.
model_filter: Callable[[Model], bool]
Feed the model and return a bool. This will filter the models in search space and select which to submit.
""" """
    def __init__(self, optimize_mode='maximize', population_size=100, sample_size=25, cycles=20000,
                 mutation_prob=0.05, on_failure='ignore', model_filter=None):
        assert optimize_mode in ['maximize', 'minimize']
        assert on_failure in ['ignore', 'worst']
        assert sample_size < population_size
@@ -67,6 +69,7 @@ class RegularizedEvolution(BaseStrategy):
        self._population = collections.deque()
        self._running_models = []
        self._polling_interval = 2.
        self.filter = model_filter

    def random(self, search_space):
        return {k: random.choice(v) for k, v in search_space.items()}
@@ -127,6 +130,11 @@ class RegularizedEvolution(BaseStrategy):
    def _submit_config(self, config, base_model, mutators):
        _logger.debug('Model submitted to running queue: %s', config)
        model = get_targeted_model(base_model, mutators, config)
        if not filter_model(self.filter, model):
            if self.on_failure == "worst":
                model.status = ModelStatus.Failed
                self._running_models.append((config, model))
        else:
            submit_models(model)
            self._running_models.append((config, model))
        return model
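``model_filter`` can be any callable taking a model and returning ``bool``; the ``LatencyFilter`` defined in the example above is one such filter. A trivial sketch (the acceptance rule here is illustrative):

.. code-block:: python

   import nni.retiarii.strategy as strategy

   # Accept every sampled candidate; a real filter would inspect the
   # model, e.g. predict its latency and compare against a threshold.
   accept_all = lambda model: True
   evo = strategy.RegularizedEvolution(model_filter=accept_all,
                                       population_size=100, sample_size=25)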