
New NAS algorithm: Blockwise DNAS FBNet (#3532)

FBNet
======
For mobile facial-landmark applications, we apply FBNet (block-wise DNAS) on top of the basic PFLD architecture to design a concise model with a good trade-off between latency and accuracy. References:
* `FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search <https://arxiv.org/abs/1812.03443>`__
* `PFLD: A Practical Facial Landmark Detector <https://arxiv.org/abs/1902.10859>`__
FBNet is a block-wise differentiable NAS method (block-wise DNAS), in which the best candidate building blocks are chosen by Gumbel-Softmax random sampling and differentiable training. At each layer (or stage) to be searched, the candidate blocks are placed side by side (similar in effect to structural re-parameterization), which allows the supernet to be pre-trained sufficiently. The pre-trained supernet is then sampled to obtain a subnet, which is fine-tuned for better performance. A minimal sketch of the mixed operation is given below.
.. image:: ../../img/fbnet.png
   :target: ../../img/fbnet.png
   :alt:
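As a minimal sketch of the idea (for illustration only; the actual implementation is the ``MixedOp`` class shipped in ``nni.algorithms.nas.pytorch.fbnet`` and shown later in this commit, and the class and argument names here are made up), each searchable layer computes a Gumbel-Softmax-weighted sum over its candidate blocks, so block weights and architecture weights can be trained jointly by gradient descent:

.. code-block:: python

   import torch
   import torch.nn as nn
   import torch.nn.functional as F

   class MixedLayer(nn.Module):
       """Gumbel-Softmax relaxation over candidate blocks (illustrative)."""

       def __init__(self, candidate_blocks, temperature=1.0):
           super().__init__()
           self.blocks = nn.ModuleList(candidate_blocks)
           # one architecture logit per candidate block
           self.alpha = nn.Parameter(torch.zeros(len(self.blocks)))
           self.temperature = temperature

       def forward(self, x):
           # random but differentiable sampling of the path probabilities
           probs = F.gumbel_softmax(self.alpha, tau=self.temperature)
           return sum(p * block(x) for p, block in zip(probs, self.blocks))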
PFLD is a lightweight facial-landmark model for real-time applications. The PFLD architecture is first simplified for acceleration, using the stem block of PeleeNet, average pooling implemented with depthwise convolution, and the eSE module.
To achieve a better trade-off between latency and accuracy, FBNet is then applied to the simplified PFLD to search for the best block at each layer. The search space is based on the FBNet space, and is optimized for mobile deployment, e.g., by using average pooling implemented with depthwise convolution and the eSE module. The latency-aware objective is summarized below.
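The latency-aware objective follows the FBNet paper: the task loss is either scaled (``mode="mul"``) or shifted (``mode="add"``) by a differentiable latency term accumulated from the lookup table, with ``alpha`` and ``beta`` as the coefficient and exponent (see the training code later in this commit). The multiplicative form used in the paper is

.. math::

   \mathcal{L}(a, w_a) = \mathrm{CE}(a, w_a) \cdot \alpha \, \log\big(\mathrm{LAT}(a)\big)^{\beta}

where :math:`\mathrm{CE}` is the task loss of architecture :math:`a` with weights :math:`w_a` and :math:`\mathrm{LAT}(a)` is the expected latency of the sampled architecture.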
Experiments
------------
To verify the effectiveness of FBNet applied to PFLD, we choose an open-source dataset with 106 landmark points as the benchmark:
* `Grand Challenge of 106-Point Facial Landmark Localization <https://arxiv.org/abs/1905.03469>`__
The baseline model is denoted MobileNet-V3 PFLD (`Reference baseline <https://github.com/Hsintao/pfld_106_face_landmarks>`__), and the searched model is denoted Subnet. The experimental results are listed below, with latency measured on a Qualcomm 625 CPU (ARMv8):
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Model
     - Size
     - Latency
     - Validation NME
   * - MobileNet-V3 PFLD
     - 1.01 MB
     - 10 ms
     - 6.22%
   * - Subnet
     - 693 KB
     - 1.60 ms
     - 5.58%
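Here NME is the normalized mean error as computed by ``accuracy`` in ``lib/utils.py`` (shown later in this commit): the mean point-to-point L2 distance between predicted and ground-truth landmarks, normalized by the inter-ocular distance :math:`d_{IO}` (for 106 points, the distance between landmarks 35 and 93) and the number of points :math:`L`:

.. math::

   \mathrm{NME} = \frac{1}{N} \sum_{i=1}^{N} \frac{\sum_{l=1}^{L} \lVert p_{i,l} - g_{i,l} \rVert_2}{d_{IO} \cdot L}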
Example
--------
`Example code <https://github.com/microsoft/nni/tree/master/examples/nas/oneshot/pfld>`__
Please run the following scripts from the example directory.
The Python dependencies are listed below:

.. code-block:: bash

   numpy==1.18.5
   opencv-python==4.5.1.48
   torch==1.6.0
   torchvision==0.7.0
   onnx==1.8.1
   onnx-simplifier==0.3.5
   onnxruntime==1.7.0
Data Preparation
-----------------
First, download the `106points dataset <https://drive.google.com/file/d/1I7QdnLxAlyG2Tq3L66QYzGhiBEoVfzKo/view?usp=sharing>`__ to the path ``./data/106points``. The dataset includes the train set and the test set:

.. code-block:: bash

   ./data/106points/train_data/imgs
   ./data/106points/train_data/list.txt
   ./data/106points/test_data/imgs
   ./data/106points/test_data/list.txt
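Each line of ``list.txt`` holds one sample. Judging from the data loader (``datasets.py`` in this commit), a line consists of the relative image path, 106 ``(x, y)`` landmark coordinates (212 values), and 3 Euler angles, separated by spaces. An illustrative (not real) entry:

.. code-block:: bash

   imgs/00001.png 0.31 0.42 0.33 0.45 ... 0.77 0.69 -3.1 5.2 0.8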
Quick Start
-----------
1. Search
^^^^^^^^^^
Based on the architecture of the simplified PFLD, the multi-stage search space and the hyper-parameters for searching should first be configured to construct the supernet. For example:

.. code-block:: python

   from lib.builder import search_space
   from lib.ops import PRIMITIVES
   from lib.supernet import PFLDInference, AuxiliaryNet
   from nni.algorithms.nas.pytorch.fbnet import LookUpTable, NASConfig

   # configuration of hyper-parameters
   # search_space defines the multi-stage search space
   nas_config = NASConfig(
       model_dir="./ckpt_save",
       nas_lr=0.01,
       mode="mul",
       alpha=0.25,
       beta=0.6,
       search_space=search_space,
   )
   # lookup table to manage the information
   lookup_table = LookUpTable(config=nas_config, primitives=PRIMITIVES)
   # create the supernet
   pfld_backbone = PFLDInference(lookup_table)
After creating the supernet with the specified search space and hyper-parameters, we can run the command below to start searching and training the supernet:

.. code-block:: bash

   python train.py --dev_id "0,1" --snapshot "./ckpt_save" --data_root "./data/106points"
The validation accuracy will be shown during training, and the model with the best accuracy will be saved as ``./ckpt_save/supernet/checkpoint_best.pth``.
2. Finetune
^^^^^^^^^^^^
After pre-training of the supernet, we can run the command below to sample the subnet and conduct fine-tuning:

.. code-block:: bash

   python retrain.py --dev_id "0,1" --snapshot "./ckpt_save" --data_root "./data/106points" \
                     --supernet "./ckpt_save/supernet/checkpoint_best.pth"
The validation accuracy will be shown during training, and the model with the best accuracy will be saved as ``./ckpt_save/subnet/checkpoint_best.pth``.
3. Export
^^^^^^^^^^
After fine-tuning of the subnet, we can run the command below to export the ONNX model:

.. code-block:: bash

   python export.py --supernet "./ckpt_save/supernet/checkpoint_best.pth" \
                    --resume "./ckpt_save/subnet/checkpoint_best.pth"
The ONNX model is saved as ``./output/subnet.onnx``, and can be further converted for a mobile inference engine by using `MNN <https://github.com/alibaba/MNN>`__, as sketched below.
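For example, a rough sketch of the conversion using MNN's ``MNNConvert`` tool (flags as documented by MNN; the output path here is arbitrary and not produced by this example):

.. code-block:: bash

   ./MNNConvert -f ONNX --modelFile ./output/subnet.onnx --MNNModel ./output/subnet.mnn --bizCode MNN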
The checkpoints of the pre-trained supernet and subnet are provided below:
* `Supernet <https://drive.google.com/file/d/1TCuWKq8u4_BQ84BWbHSCZ45N3JGB9kFJ/view?usp=sharing>`__
* `Subnet <https://drive.google.com/file/d/160rkuwB7y7qlBZNM3W_T53cb6MQIYHIE/view?usp=sharing>`__
* `ONNX model <https://drive.google.com/file/d/1s-v-aOiMv0cqBspPVF3vSGujTbn_T_Uo/view?usp=sharing>`__
NNI currently supports the one-shot NAS algorithms listed below and is adding more.

     - `Cyclic Differentiable Architecture Search <https://arxiv.org/pdf/2006.10724.pdf>`__ builds a cyclic feedback mechanism between the search and evaluation networks. It introduces a cyclic differentiable architecture search framework which integrates the two networks into a unified architecture.
   * - `ProxylessNAS <Proxylessnas.rst>`__
     - `ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware <https://arxiv.org/abs/1812.00332>`__. It removes the proxy and directly learns architectures for large-scale target tasks and target hardware platforms.
   * - `FBNet <FBNet.rst>`__
     - `FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search <https://arxiv.org/abs/1812.03443>`__. It is a block-wise differentiable neural architecture search method with a hardware-aware constraint.
   * - `TextNAS <TextNAS.rst>`__
     - `TextNAS: A Neural Architecture Search Space tailored for Text Representation <https://arxiv.org/pdf/1912.10729.pdf>`__. It is a neural architecture search algorithm tailored for text representation.
   * - `Cream <Cream.rst>`__

One-shot NAS algorithms leverage weight sharing among models in neural architecture search.

   SPOS <SPOS>
   CDARTS <CDARTS>
   ProxylessNAS <Proxylessnas>
   FBNet <FBNet>
   TextNAS <TextNAS>
   Cream <Cream>
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import cv2
import os
import numpy as np
from torch.utils import data
class PFLDDatasets(data.Dataset):
""" Dataset to manage the data loading, augmentation and generation. """
def __init__(self, file_list, transforms=None, data_root="", img_size=112):
"""
Parameters
----------
file_list : list
a list of file path and annotations
transforms : function
function for data augmentation
data_root : str
the root path of dataset
img_size : int
the size of image height or width
"""
self.line = None
self.path = None
self.img_size = img_size
self.land = None
self.angle = None
self.data_root = data_root
self.transforms = transforms
with open(file_list, "r") as f:
self.lines = f.readlines()
def __getitem__(self, index):
""" Get the data sample and labels with the index. """
self.line = self.lines[index].strip().split()
# load image
if self.data_root:
self.img = cv2.imread(os.path.join(self.data_root, self.line[0]))
else:
self.img = cv2.imread(self.line[0])
# resize
self.img = cv2.resize(self.img, (self.img_size, self.img_size))
# obtain gt labels
self.land = np.asarray(self.line[1: (106 * 2 + 1)], dtype=np.float32)
self.angle = np.asarray(self.line[(106 * 2 + 1):], dtype=np.float32)
# augmentation
if self.transforms:
self.img = self.transforms(self.img)
return self.img, self.land, self.angle
def __len__(self):
""" Get the size of dataset. """
return len(self.lines)
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import argparse
import onnx
import onnxsim
import os
import torch
from lib.builder import search_space
from lib.ops import PRIMITIVES
from nni.algorithms.nas.pytorch.fbnet import (
LookUpTable,
NASConfig,
model_init,
)
parser = argparse.ArgumentParser(description="Export the ONNX model")
parser.add_argument("--net", default="subnet", type=str)
parser.add_argument("--supernet", default="", type=str, metavar="PATH")
parser.add_argument("--resume", default="", type=str, metavar="PATH")
parser.add_argument("--num_points", default=106, type=int)
parser.add_argument("--img_size", default=112, type=int)
parser.add_argument("--onnx", default="./output/pfld.onnx", type=str)
parser.add_argument("--onnx_sim", default="./output/subnet.onnx", type=str)
args = parser.parse_args()
os.makedirs("./output", exist_ok=True)
if args.net == "subnet":
from lib.subnet import PFLDInference
else:
raise ValueError("Network is not implemented")
check = torch.load(args.supernet, map_location=torch.device("cpu"))
sampled_arch = check["arch_sample"]
nas_config = NASConfig(search_space=search_space)
lookup_table = LookUpTable(config=nas_config, primitives=PRIMITIVES)
pfld_backbone = PFLDInference(lookup_table, sampled_arch, args.num_points)
pfld_backbone.eval()
check_sub = torch.load(args.resume, map_location=torch.device("cpu"))
param_dict = check_sub["pfld_backbone"]
model_init(pfld_backbone, param_dict)
print("Convert PyTorch model to ONNX.")
dummy_input = torch.randn(1, 3, args.img_size, args.img_size)
input_names = ["input"]
output_names = ["output"]
torch.onnx.export(
pfld_backbone,
dummy_input,
args.onnx,
verbose=True,
input_names=input_names,
output_names=output_names,
)
print("Check ONNX model.")
model = onnx.load(args.onnx)
print("Simplifying the ONNX model.")
model_opt, check = onnxsim.simplify(args.onnx)
assert check, "Simplified ONNX model could not be validated"
onnx.save(model_opt, args.onnx_sim)
print("Onnx model simplify Ok!")
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
search_space = {
# multi-stage definition for candidate layers
# here two stages are defined for PFLD searching
"stages": {
"stage_0": {
"ops": [
"mb_k3_res",
"mb_k3_e2_res",
"mb_k3_res_d3",
"mb_k5_res",
"mb_k5_e2_res",
"sep_k3",
"sep_k5",
"gh_k3",
"gh_k5",
],
"layer_num": 2,
},
"stage_1": {
"ops": [
"mb_k3_e2_res",
"mb_k3_e4_res",
"mb_k3_e2_res_se",
"mb_k3_res_d3",
"mb_k5_res",
"mb_k5_e2_res",
"mb_k5_res_se",
"mb_k5_e2_res_se",
"gh_k5",
],
"layer_num": 3,
},
},
# layer information needed for NAS
# each input shape is (input_channels, height, width)
"input_shape": [
(32, 14, 14),
(32, 14, 14),
(32, 14, 14),
(64, 7, 7),
(64, 7, 7),
],
# output channels for each layer
"channel_size": [32, 32, 64, 64, 64],
# stride for each layer
"strides": [1, 1, 2, 1, 1],
# height of feature map for each layer
"fm_size": [14, 14, 7, 7, 7],
}
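# Editor's sketch (illustrative, not part of this file): the per-layer configs
# consumed by SingleOperation / choice_blocks in lib/ops.py have the form
# [input_channels, output_channels, stride, fm_size], which can be read off
# this dict by zipping the lists above, e.g.:
#   layer_configs = [
#       [in_shape[0], c_out, s, fm]
#       for in_shape, c_out, s, fm in zip(
#           search_space["input_shape"],
#           search_space["channel_size"],
#           search_space["strides"],
#           search_space["fm_size"],
#       )
#   ]  # -> [[32, 32, 1, 14], [32, 32, 1, 14], [32, 64, 2, 7], ...]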
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import torch
import torch.nn as nn
import torch.nn.functional as F
# Basic primitives as the network path
PRIMITIVES = {
"skip": lambda c_in, c_out, stride, **kwargs: Identity(
c_in, c_out, stride, **kwargs
),
"conv1x1": lambda c_in, c_out, stride, **kwargs: Conv1x1(
c_in, c_out, stride, **kwargs
),
"depth_conv": lambda c_in, c_out, stride, **kwargs: DepthConv(
c_in, c_out, stride, **kwargs
),
"sep_k3": lambda c_in, c_out, stride, **kwargs: SeparableConv(
c_in, c_out, stride, **kwargs
),
"sep_k5": lambda c_in, c_out, stride, **kwargs: SeparableConv(
c_in, c_out, stride, kernel=5, **kwargs
),
"gh_k3": lambda c_in, c_out, stride, **kwargs: GhostModule(
c_in, c_out, stride, **kwargs
),
"gh_k5": lambda c_in, c_out, stride, **kwargs: GhostModule(
c_in, c_out, stride, kernel=5, **kwargs
),
"mb_k3": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=3, expand=1, **kwargs
),
"mb_k3_e2": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=3, expand=2, **kwargs
),
"mb_k3_e4": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=3, expand=4, **kwargs
),
"mb_k3_res": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=3, expand=1, res=True, **kwargs
),
"mb_k3_e2_res": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=3, expand=2, res=True, **kwargs
),
"mb_k3_e4_res": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=3, expand=4, res=True, **kwargs
),
"mb_k3_d2": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=3,
expand=2,
res=False,
dilation=2,
**kwargs,
),
"mb_k3_d3": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=3,
expand=2,
res=False,
dilation=3,
**kwargs,
),
"mb_k3_res_d2": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=3,
expand=2,
res=True,
dilation=2,
**kwargs,
),
"mb_k3_res_d3": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=3,
expand=2,
res=True,
dilation=3,
**kwargs,
),
"mb_k3_res_se": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=3,
expand=1,
res=True,
dilation=1,
se=True,
**kwargs,
),
"mb_k3_e2_res_se": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=3,
expand=2,
res=True,
dilation=1,
se=True,
**kwargs,
),
"mb_k3_e4_res_se": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=3,
expand=4,
res=True,
dilation=1,
se=True,
**kwargs,
),
"mb_k5": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=5, expand=1, **kwargs
),
"mb_k5_e2": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=5, expand=2, **kwargs
),
"mb_k5_res": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=5, expand=1, res=True, **kwargs
),
"mb_k5_e2_res": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in, c_out, stride, kernel=5, expand=2, res=True, **kwargs
),
"mb_k5_res_se": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=5,
expand=1,
res=True,
dilation=1,
se=True,
**kwargs,
),
"mb_k5_e2_res_se": lambda c_in, c_out, stride, **kwargs: MBBlock(
c_in,
c_out,
stride,
kernel=5,
expand=2,
res=True,
dilation=1,
se=True,
**kwargs,
),
}
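# Illustrative usage (editor's note, not part of the library): every entry is a
# factory taking (c_in, c_out, stride, **kwargs), so a block is built as e.g.
#   block = PRIMITIVES["mb_k5_e2_res"](32, 32, 1, fm_size=14)
#   y = block(torch.randn(1, 32, 14, 14))  # output shape: (1, 32, 14, 14)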
def conv_bn(inp, oup, kernel, stride, pad=1, groups=1):
return nn.Sequential(
nn.Conv2d(inp, oup, kernel, stride, pad, groups=groups, bias=False),
nn.BatchNorm2d(oup),
nn.ReLU(inplace=True),
)
class SeparableConv(nn.Module):
"""Separable convolution."""
def __init__(self, in_ch, out_ch, stride=1, kernel=3, fm_size=7):
super(SeparableConv, self).__init__()
assert stride in [1, 2], "stride should be in [1, 2]"
pad = kernel // 2
self.conv = nn.Sequential(
conv_bn(in_ch, in_ch, kernel, stride, pad=pad, groups=in_ch),
conv_bn(in_ch, out_ch, 1, 1, pad=0),
)
def forward(self, x):
return self.conv(x)
class Conv1x1(nn.Module):
"""1x1 convolution."""
def __init__(self, in_ch, out_ch, stride=1, kernel=1, fm_size=7):
super(Conv1x1, self).__init__()
assert stride in [1, 2], "stride should be in [1, 2]"
padding = kernel // 2
self.conv = nn.Sequential(
nn.Conv2d(in_ch, out_ch, kernel, stride, padding),
nn.ReLU(inplace=True),
)
def forward(self, x):
return self.conv(x)
class DepthConv(nn.Module):
"""depth convolution."""
def __init__(self, in_ch, out_ch, stride=1, kernel=3, fm_size=7):
super(DepthConv, self).__init__()
assert stride in [1, 2], "stride should be in [1, 2]"
padding = kernel // 2
self.conv = nn.Sequential(
nn.Conv2d(in_ch, in_ch, kernel, stride, padding, groups=in_ch),
nn.ReLU(inplace=True),
nn.Conv2d(in_ch, out_ch, 1, 1, 0),
nn.ReLU(inplace=True),
)
def forward(self, x):
return self.conv(x)
class GhostModule(nn.Module):
"""Gost module."""
def __init__(self, in_ch, out_ch, stride=1, kernel=3, fm_size=7):
super(GhostModule, self).__init__()
mid_ch = out_ch // 2
self.primary_conv = conv_bn(in_ch, mid_ch, 1, stride, pad=0)
self.cheap_operation = conv_bn(
mid_ch, mid_ch, kernel, 1, kernel // 2, mid_ch
)
def forward(self, x):
x1 = self.primary_conv(x)
x2 = self.cheap_operation(x1)
return torch.cat([x1, x2], dim=1)
class StemBlock(nn.Module):
def __init__(self, in_ch=3, init_ch=32, bottleneck=True):
super(StemBlock, self).__init__()
self.stem_1 = conv_bn(in_ch, init_ch, 3, 2, 1)
mid_ch = int(init_ch // 2) if bottleneck else init_ch
self.stem_2a = conv_bn(init_ch, mid_ch, 1, 1, 0)
self.stem_2b = SeparableConv(mid_ch, init_ch, 2, 1)
self.stem_2p = nn.MaxPool2d(kernel_size=2, stride=2)
self.stem_3 = conv_bn(init_ch * 2, init_ch, 1, 1, 0)
def forward(self, x):
stem_1_out = self.stem_1(x)
stem_2a_out = self.stem_2a(stem_1_out)
stem_2b_out = self.stem_2b(stem_2a_out)
stem_2p_out = self.stem_2p(stem_1_out)
out = self.stem_3(torch.cat((stem_2b_out, stem_2p_out), 1))
return out, stem_1_out
class Identity(nn.Module):
""" Identity module."""
def __init__(self, in_ch, out_ch, stride=1, fm_size=7):
super(Identity, self).__init__()
self.conv = (
conv_bn(in_ch, out_ch, kernel=1, stride=stride, pad=0)
if in_ch != out_ch or stride != 1
else None
)
def forward(self, x):
if self.conv:
out = self.conv(x)
else:
out = x
# Add dropout to avoid over-reliance on the identity path (as in P-DARTS);
# apply it only in training mode
out = nn.functional.dropout(out, p=0.5, training=self.training)
return out
class Hsigmoid(nn.Module):
"""Hsigmoid activation function."""
def __init__(self, inplace=True):
super(Hsigmoid, self).__init__()
self.inplace = inplace
def forward(self, x):
return F.relu6(x + 3.0, inplace=self.inplace) / 6.0
class eSEModule(nn.Module):
""" The improved SE Module."""
def __init__(self, channel, fm_size=7, se=True):
super(eSEModule, self).__init__()
self.se = se
if self.se:
self.avg_pool = nn.Conv2d(
channel, channel, fm_size, 1, 0, groups=channel, bias=False
)
self.fc = nn.Conv2d(channel, channel, kernel_size=1, padding=0)
self.hsigmoid = Hsigmoid()
def forward(self, x):
if self.se:
input = x
x = self.avg_pool(x)
x = self.fc(x)
x = self.hsigmoid(x)
return input * x
else:
return x
class ChannelShuffle(nn.Module):
"""Procedure: [N,C,H,W] -> [N,g,C/g,H,W] -> [N,C/g,g,H,w] -> [N,C,H,W]."""
def __init__(self, groups):
super(ChannelShuffle, self).__init__()
self.groups = groups
def forward(self, x):
if self.groups == 1:
return x
N, C, H, W = x.size()
g = self.groups
assert C % g == 0, "group size {} is not for channel {}".format(g, C)
return (
x.view(N, g, int(C // g), H, W)
.permute(0, 2, 1, 3, 4)
.contiguous()
.view(N, C, H, W)
)
class MBBlock(nn.Module):
"""The Inverted Residual Block, with channel shuffle or eSEModule."""
def __init__(
self,
in_ch,
out_ch,
stride=1,
kernel=3,
expand=1,
res=False,
dilation=1,
se=False,
fm_size=7,
group=1,
mid_ch=-1,
):
super(MBBlock, self).__init__()
assert stride in [1, 2], "stride should be in [1, 2]"
assert kernel in [3, 5], "kernel size should be in [3, 5]"
assert dilation in [1, 2, 3, 4], "dilation should be in [1, 2, 3, 4]"
assert group in [1, 2], "group should be in [1, 2]"
self.use_res_connect = res and (stride == 1)
padding = kernel // 2 + (dilation - 1)
mid_ch = mid_ch if mid_ch > 0 else (in_ch * expand)
# Basic Modules
conv_layer = nn.Conv2d
norm_layer = nn.BatchNorm2d
activation_layer = nn.ReLU
channel_shuffle = ChannelShuffle
se_layer = eSEModule
self.ir_block = nn.Sequential(
# pointwise convolution
conv_layer(in_ch, mid_ch, 1, 1, 0, bias=False, groups=group),
norm_layer(mid_ch),
activation_layer(inplace=True),
# channel shuffle if necessary
channel_shuffle(group),
# depthwise convolution
conv_layer(
mid_ch,
mid_ch,
kernel,
stride,
padding=padding,
dilation=dilation,
groups=mid_ch,
bias=False,
),
norm_layer(mid_ch),
# eSEModule if necessary
se_layer(mid_ch, fm_size, se),
activation_layer(inplace=True),
# pointwise convolution
conv_layer(mid_ch, out_ch, 1, 1, 0, bias=False, groups=group),
norm_layer(out_ch),
)
def forward(self, x):
if self.use_res_connect:
return x + self.ir_block(x)
else:
return self.ir_block(x)
class SingleOperation(nn.Module):
"""Single operation for sampled path."""
def __init__(self, layers_configs, stage_ops, sampled_op=""):
"""
Parameters
----------
layers_configs : list
the layer config: [input_channel, output_channel, stride, height]
stage_ops : dict
the pairs of op name and layer operator
sampled_op : str
the searched layer name
"""
super(SingleOperation, self).__init__()
fm = {"fm_size": layers_configs[3]}
ops_names = [op_name for op_name in stage_ops]
sampled_op = sampled_op if sampled_op else ops_names[0]
# define the single op
self.op = stage_ops[sampled_op](*layers_configs[0:3], **fm)
def forward(self, x):
return self.op(x)
def choice_blocks(layers_configs, stage_ops):
"""
Create list of layer candidates for NNI one-shot NAS.
Parameters
----------
layers_configs : list
the layer config: [input_channel, output_channel, stride, height]
stage_ops : dict
the pairs of op name and layer operator
Returns
-------
output: list
list of layer operators
"""
ops_names = [op for op in stage_ops]
fm = {"fm_size": layers_configs[3]}
op_list = [stage_ops[op](*layers_configs[0:3], **fm) for op in ops_names]
return op_list
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import torch
import torch.nn as nn
from lib.ops import (
MBBlock,
SeparableConv,
SingleOperation,
StemBlock,
conv_bn,
)
from torch.nn import init
INIT_CH = 16
class PFLDInference(nn.Module):
""" The subnet with the architecture of PFLD. """
def __init__(self, lookup_table, sampled_ops, num_points=106):
"""
Parameters
----------
lookup_table : class
to manage the candidate ops, layer information and layer perf
sampled_ops : list of str
the searched layer names of the subnet
num_points : int
the number of landmarks for prediction
"""
super(PFLDInference, self).__init__()
stage_names = [stage_name for stage_name in lookup_table.layer_num]
stage_n = [lookup_table.layer_num[stage] for stage in stage_names]
self.stem = StemBlock(init_ch=INIT_CH, bottleneck=False)
self.block4_1 = MBBlock(INIT_CH, 32, stride=2, mid_ch=32)
stages_0 = [
SingleOperation(
lookup_table.layer_configs[layer_id],
lookup_table.lut_ops[stage_names[0]],
sampled_ops[layer_id],
)
for layer_id in range(stage_n[0])
]
stages_1 = [
SingleOperation(
lookup_table.layer_configs[layer_id],
lookup_table.lut_ops[stage_names[1]],
sampled_ops[layer_id],
)
for layer_id in range(stage_n[0], stage_n[0] + stage_n[1])
]
blocks = stages_0 + stages_1
self.blocks = nn.Sequential(*blocks)
self.avg_pool1 = nn.Conv2d(
INIT_CH, INIT_CH, 9, 8, 1, groups=INIT_CH, bias=False
)
self.avg_pool2 = nn.Conv2d(32, 32, 3, 2, 1, groups=32, bias=False)
self.block6_1 = nn.Conv2d(96 + INIT_CH, 64, 1, 1, 0, bias=False)
self.block6_2 = MBBlock(64, 64, res=True, se=True, mid_ch=128)
self.block6_3 = SeparableConv(64, 128, 1)
self.conv7 = nn.Conv2d(128, 128, 7, 1, 0, groups=128, bias=False)
self.fc = nn.Conv2d(128, num_points * 2, 1, 1, 0, bias=True)
# init params
self.init_params()
def init_params(self):
for m in self.modules():
if isinstance(m, nn.Conv2d):
init.kaiming_normal_(m.weight, mode="fan_out")
if m.bias is not None:
init.constant_(m.bias, 0)
elif isinstance(m, nn.BatchNorm2d):
init.constant_(m.weight, 1)
init.constant_(m.bias, 0)
elif isinstance(m, nn.Linear):
init.normal_(m.weight, std=0.001)
if m.bias is not None:
init.constant_(m.bias, 0)
def forward(self, x):
"""
Parameters
----------
x : tensor
input image
Returns
-------
output: tensor
the predicted landmarks
output: tensor
the intermediate features
"""
x, y1 = self.stem(x)
out1 = x
x = self.block4_1(x)
for i, block in enumerate(self.blocks):
x = block(x)
if i == 1:
y2 = x
elif i == 4:
y3 = x
y1 = self.avg_pool1(y1)
y2 = self.avg_pool2(y2)
multi_scale = torch.cat([y3, y2, y1], 1)
y = self.block6_1(multi_scale)
y = self.block6_2(y)
y = self.block6_3(y)
y = self.conv7(y)
landmarks = self.fc(y)
return landmarks, out1
class AuxiliaryNet(nn.Module):
""" AuxiliaryNet to predict pose angles. """
def __init__(self):
super(AuxiliaryNet, self).__init__()
self.conv1 = conv_bn(INIT_CH, 64, 3, 2)
self.conv2 = conv_bn(64, 64, 3, 1)
self.conv3 = conv_bn(64, 32, 3, 2)
self.conv4 = conv_bn(32, 64, 7, 1)
self.max_pool1 = nn.MaxPool2d(3)
self.fc1 = nn.Linear(64, 32)
self.fc2 = nn.Linear(32, 3)
def forward(self, x):
"""
Parameters
----------
x : tensor
input intermediate features
Returns
-------
output: tensor
the predicted pose angles
"""
x = self.conv1(x)
x = self.conv2(x)
x = self.conv3(x)
x = self.conv4(x)
x = self.max_pool1(x)
x = x.view(x.size(0), -1)
x = self.fc1(x)
x = self.fc2(x)
return x
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import torch
import torch.nn as nn
from lib.ops import (
MBBlock,
SeparableConv,
StemBlock,
choice_blocks,
conv_bn,
)
from nni.nas.pytorch import mutables
from torch.nn import init
INIT_CH = 16
class PFLDInference(nn.Module):
""" PFLD model for facial landmark."""
def __init__(self, lookup_table, num_points=106):
"""
Parameters
----------
lookup_table : class
to manage the candidate ops, layer information and layer perf
num_points : int
the number of landmarks for prediction
"""
super(PFLDInference, self).__init__()
stage_names = [stage for stage in lookup_table.layer_num]
stage_lnum = [lookup_table.layer_num[stage] for stage in stage_names]
self.stem = StemBlock(init_ch=INIT_CH, bottleneck=False)
self.block4_1 = MBBlock(INIT_CH, 32, stride=2, mid_ch=32)
stages_0 = [
mutables.LayerChoice(
choice_blocks(
lookup_table.layer_configs[layer_id],
lookup_table.lut_ops[stage_names[0]],
)
)
for layer_id in range(stage_lnum[0])
]
stages_1 = [
mutables.LayerChoice(
choice_blocks(
lookup_table.layer_configs[layer_id],
lookup_table.lut_ops[stage_names[1]],
)
)
for layer_id in range(stage_lnum[0], stage_lnum[0] + stage_lnum[1])
]
blocks = stages_0 + stages_1
self.blocks = nn.Sequential(*blocks)
self.avg_pool1 = nn.Conv2d(
INIT_CH, INIT_CH, 9, 8, 1, groups=INIT_CH, bias=False
)
self.avg_pool2 = nn.Conv2d(32, 32, 3, 2, 1, groups=32, bias=False)
self.block6_1 = nn.Conv2d(96 + INIT_CH, 64, 1, 1, 0, bias=False)
self.block6_2 = MBBlock(64, 64, res=True, se=True, mid_ch=128)
self.block6_3 = SeparableConv(64, 128, 1)
self.conv7 = nn.Conv2d(128, 128, 7, 1, 0, groups=128, bias=False)
self.fc = nn.Conv2d(128, num_points * 2, 1, 1, 0, bias=True)
# init params
self.init_params()
def init_params(self):
for m in self.modules():
if isinstance(m, nn.Conv2d):
init.kaiming_normal_(m.weight, mode="fan_out")
if m.bias is not None:
init.constant_(m.bias, 0)
elif isinstance(m, nn.BatchNorm2d):
init.constant_(m.weight, 1)
init.constant_(m.bias, 0)
elif isinstance(m, nn.Linear):
init.normal_(m.weight, std=0.001)
if m.bias is not None:
init.constant_(m.bias, 0)
def forward(self, x):
"""
Parameters
----------
x : tensor
input image
Returns
-------
output: tensor
the predicted landmarks
output: tensor
the intermediate features
"""
x, y1 = self.stem(x)
out1 = x
x = self.block4_1(x)
for i, block in enumerate(self.blocks):
x = block(x)
if i == 1:
y2 = x
elif i == 4:
y3 = x
y1 = self.avg_pool1(y1)
y2 = self.avg_pool2(y2)
multi_scale = torch.cat([y3, y2, y1], 1)
y = self.block6_1(multi_scale)
y = self.block6_2(y)
y = self.block6_3(y)
y = self.conv7(y)
landmarks = self.fc(y)
return landmarks, out1
class AuxiliaryNet(nn.Module):
""" AuxiliaryNet to predict pose angles. """
def __init__(self):
super(AuxiliaryNet, self).__init__()
self.conv1 = conv_bn(INIT_CH, 64, 3, 2)
self.conv2 = conv_bn(64, 64, 3, 1)
self.conv3 = conv_bn(64, 32, 3, 2)
self.conv4 = conv_bn(32, 64, 7, 1)
self.max_pool1 = nn.MaxPool2d(3)
self.fc1 = nn.Linear(64, 32)
self.fc2 = nn.Linear(32, 3)
def forward(self, x):
"""
Parameters
----------
x : tensor
input intermediate features
Returns
-------
output: tensor
the predicted pose angles
"""
x = self.conv1(x)
x = self.conv2(x)
x = self.conv3(x)
x = self.conv4(x)
x = self.max_pool1(x)
x = x.view(x.size(0), -1)
x = self.fc1(x)
x = self.fc2(x)
return x
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import os
import time
import torch
import numpy as np
from nni.algorithms.nas.pytorch.fbnet import FBNetTrainer
from nni.nas.pytorch.utils import AverageMeter
from .utils import accuracy
class PFLDTrainer(FBNetTrainer):
def __init__(
self,
model,
auxiliarynet,
model_optim,
criterion,
device,
device_ids,
config,
lookup_table,
train_loader,
valid_loader,
n_epochs=300,
load_ckpt=False,
arch_path=None,
logger=None,
):
"""
Parameters
----------
model : pytorch model
the user model, which has mutables
auxiliarynet : pytorch model
the auxiliarynet to regress angle
model_optim : pytorch optimizer
the user defined optimizer
criterion : pytorch loss
the main task loss
device : pytorch device
the devices to train/search the model
device_ids : list of int
the indexes of devices used for training
config : class
configuration object for fbnet training
lookup_table : class
lookup table object for fbnet training
train_loader : pytorch data loader
data loader for the training set
valid_loader : pytorch data loader
data loader for the validation set
n_epochs : int
number of epochs to train/search
load_ckpt : bool
whether load checkpoint
arch_path : str
the path to store chosen architecture
logger : logger
the logger
"""
super(PFLDTrainer, self).__init__(
model,
model_optim,
criterion,
device,
device_ids,
lookup_table,
train_loader,
valid_loader,
n_epochs,
load_ckpt,
arch_path,
logger,
)
# DataParallel of the AuxiliaryNet to PFLD
self.auxiliarynet = auxiliarynet
self.auxiliarynet = torch.nn.DataParallel(
self.auxiliarynet, device_ids=device_ids
)
self.auxiliarynet.to(device)
def _validate(self):
"""
Do validation. During validation, LayerChoices use the mixed-op.
Returns
-------
float, float
average loss, average nme
"""
# test on validation set under eval mode
self.model.eval()
self.auxiliarynet.eval()
losses, nme = list(), list()
batch_time = AverageMeter("batch_time")
end = time.time()
with torch.no_grad():
for i, (img, land_gt, angle_gt) in enumerate(self.valid_loader):
img = img.to(self.device, non_blocking=True)
landmark_gt = land_gt.to(self.device, non_blocking=True)
angle_gt = angle_gt.to(self.device, non_blocking=True)
landmark, _ = self.model(img)
# compute the l2 loss
landmark = landmark.squeeze()
l2_diff = torch.sum((landmark_gt - landmark) ** 2, axis=1)
loss = torch.mean(l2_diff)
losses.append(loss.cpu().detach().numpy())
# compute the accuracy
landmark = landmark.cpu().detach().numpy()
landmark = landmark.reshape(landmark.shape[0], -1, 2)
landmark_gt = landmark_gt.cpu().detach().numpy()
landmark_gt = landmark_gt.reshape(landmark_gt.shape[0], -1, 2)
_, nme_i = accuracy(landmark, landmark_gt)
for item in nme_i:
nme.append(item)
# measure elapsed time
batch_time.update(time.time() - end)
end = time.time()
self.logger.info("===> Evaluate:")
self.logger.info(
"Eval set: Average loss: {:.4f} nme: {:.4f}".format(
np.mean(losses), np.mean(nme)
)
)
return np.mean(losses), np.mean(nme)
def _train_epoch(self, epoch, optimizer, arch_train=False):
"""
Train one epoch.
"""
# switch to train mode
self.model.train()
self.auxiliarynet.train()
batch_time = AverageMeter("batch_time")
data_time = AverageMeter("data_time")
losses = AverageMeter("losses")
data_loader = self.valid_loader if arch_train else self.train_loader
end = time.time()
for i, (img, landmark_gt, angle_gt) in enumerate(data_loader):
data_time.update(time.time() - end)
img = img.to(self.device, non_blocking=True)
landmark_gt = landmark_gt.to(self.device, non_blocking=True)
angle_gt = angle_gt.to(self.device, non_blocking=True)
lands, feats = self.model(img)
landmarks = lands.squeeze()
angle = self.auxiliarynet(feats)
# task loss
weighted_loss, l2_loss = self.criterion(
landmark_gt, angle_gt, angle, landmarks
)
loss = l2_loss if arch_train else weighted_loss
# hardware-aware loss
perf_cost = self._get_perf_cost(requires_grad=True)
regu_loss = self.reg_loss(perf_cost)
if self.mode.startswith("mul"):
loss = loss * regu_loss
elif self.mode.startswith("add"):
loss = loss + regu_loss
# compute gradient and do SGD step
optimizer.zero_grad()
loss.backward()
optimizer.step()
# measure elapsed time
batch_time.update(time.time() - end)
end = time.time()
# measure accuracy and record loss
losses.update(np.squeeze(loss.cpu().detach().numpy()), img.size(0))
if i % 10 == 0:
batch_log = (
"Train [{0}][{1}]\t"
"Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t"
"Data {data_time.val:.3f} ({data_time.avg:.3f})\t"
"Loss {losses.val:.4f} ({losses.avg:.4f})".format(
epoch + 1,
i,
batch_time=batch_time,
data_time=data_time,
losses=losses,
)
)
self.logger.info(batch_log)
def _warm_up(self):
"""
Warm up the model, while the architecture weights are not trained.
"""
for epoch in range(self.epoch, self.start_epoch):
self.logger.info("\n--------Warmup epoch: %d--------\n", epoch + 1)
self._train_epoch(epoch, self.model_optim)
# adjust learning rate
self.scheduler.step()
# validation
_, _ = self._validate()
if epoch % 10 == 0:
filename = os.path.join(
self.config.model_dir, "checkpoint_%s.pth" % epoch
)
self.save_checkpoint(epoch, filename)
def _train(self):
"""
Train the model: both model weights and architecture weights are trained.
Architecture weights are trained according to the schedule.
Before updating architecture weights, ```requires_grad``` is enabled.
Then, it is disabled after the updating, in order not to update
architecture weights when training model weights.
"""
arch_param_num = self.mutator.num_arch_params()
self.logger.info("#arch_params: {}".format(arch_param_num))
self.epoch = max(self.start_epoch, self.epoch)
ckpt_path = self.config.model_dir
choice_names = None
val_nme = 1e6
for epoch in range(self.epoch, self.n_epochs):
# update the weight parameters
self.logger.info("\n--------Train epoch: %d--------\n", epoch + 1)
self._train_epoch(epoch, self.model_optim)
# adjust learning rate
self.scheduler.step()
# update the architecture parameters
self.logger.info("Update architecture parameters")
self.mutator.arch_requires_grad()
self._train_epoch(epoch, self.arch_optimizer, True)
self.mutator.arch_disable_grad()
# temperature annealing
self.temp = self.temp * self.exp_anneal_rate
self.mutator.set_temperature(self.temp)
# sample the architecture of sub-network
choice_names = self._layer_choice_sample()
# validate
_, nme = self._validate()
if epoch % 10 == 0:
filename = os.path.join(ckpt_path, "checkpoint_%s.pth" % epoch)
self.save_checkpoint(epoch, filename, choice_names)
if nme < val_nme:
filename = os.path.join(ckpt_path, "checkpoint_best.pth")
self.save_checkpoint(epoch, filename, choice_names)
val_nme = nme
self.logger.info("Best nme: {:.4f}".format(val_nme))
def save_checkpoint(self, epoch, filename, choice_names=None):
"""
Save checkpoint of the whole model.
Saving model weights and architecture weights as ```filename```,
and saving currently chosen architecture in ```arch_path```.
"""
state = {
"pfld_backbone": self.model.state_dict(),
"auxiliarynet": self.auxiliarynet.state_dict(),
"optim": self.model_optim.state_dict(),
"epoch": epoch,
"arch_sample": choice_names,
}
torch.save(state, filename)
self.logger.info("Save checkpoint to {0:}".format(filename))
if self.arch_path:
self.export(self.arch_path)
def load_checkpoint(self, filename):
"""
Load the checkpoint from ```filename```.
"""
ckpt = torch.load(filename)
self.epoch = ckpt["epoch"]
self.model.load_state_dict(ckpt["pfld_backbone"])
self.auxiliarynet.load_state_dict(ckpt["auxiliarynet"])
self.model_optim.load_state_dict(ckpt["optim"])
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import torch
import numpy as np
import torch.nn as nn
def accuracy(preds, target):
"""
Calculate the NME (Normalized Mean Error).
Parameters
----------
preds : numpy array
the predicted landmarks
target : numpy array
the ground truth of landmarks
Returns
-------
output: float32
the nme value
output: list
the list of l2 distances
"""
N = preds.shape[0]
L = preds.shape[1]
rmse = np.zeros(N).astype(np.float32)
for i in range(N):
pts_pred, pts_gt = (
preds[i],
target[i],
)
if L == 19:
# aflw
interocular = 34
elif L == 29:
# cofw
interocular = np.linalg.norm(pts_gt[8] - pts_gt[9])
elif L == 68:
# interocular
interocular = np.linalg.norm(pts_gt[36] - pts_gt[45])
elif L == 98:
# euclidean dis from left eye to right eye
interocular = np.linalg.norm(pts_gt[60] - pts_gt[72])
elif L == 106:
# euclidean dis from left eye to right eye
interocular = np.linalg.norm(pts_gt[35] - pts_gt[93])
else:
raise ValueError("Number of landmarks is wrong")
pred_dis = np.sum(np.linalg.norm(pts_pred - pts_gt, axis=1))
rmse[i] = pred_dis / (interocular * L)
return np.mean(rmse), rmse
class PFLDLoss(nn.Module):
"""Weighted loss of L2 distance with the pose angle for PFLD."""
def __init__(self):
super(PFLDLoss, self).__init__()
def forward(self, landmark_gt, euler_angle_gt, angle, landmarks):
"""
Calculate weighted L2 loss for PFLD.
Parameters
----------
landmark_gt : tensor
the ground truth of landmarks
euler_angle_gt : tensor
the ground truth of pose angle
angle : tensor
the predicted pose angle
landmarks : float32
the predicted landmarks
Returns
-------
output: tensor
the weighted L2 loss
output: tensor
the normal L2 loss
"""
weight_angle = torch.sum(1 - torch.cos(angle - euler_angle_gt), axis=1)
l2_distant = torch.sum((landmark_gt - landmarks) ** 2, axis=1)
return torch.mean(weight_angle * l2_distant), torch.mean(l2_distant)
def bounded_regress_loss(
landmark_gt, landmarks_t, landmarks_s, reg_m=0.5, br_alpha=0.05
):
"""
Calculate the Bounded Regression Loss for Knowledge Distillation.
Parameters
----------
landmark_gt : tensor
the ground truth of landmarks
landmarks_t : tensor
the predicted landmarks of teacher
landmarks_s : tensor
the predicted landmarks of student
reg_m : float32
the value to control the regression constraint
br_alpha : float32
the balance value for kd loss
Returns
-------
output: tensor
the bounded regression loss
"""
l2_dis_s = (landmark_gt - landmarks_s).pow(2).sum(1)
l2_dis_s_m = l2_dis_s + reg_m
l2_dis_t = (landmark_gt - landmarks_t).pow(2).sum(1)
br_loss = l2_dis_s[l2_dis_s_m > l2_dis_t].sum()
return br_loss * br_alpha
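# Editor's sketch (hypothetical usage; this helper is not called elsewhere in
# this commit): distilling a student subnet from a teacher's landmark outputs.
#   with torch.no_grad():
#       lands_t, _ = teacher(img)
#   lands_s, _ = student(img)
#   kd = bounded_regress_loss(landmark_gt, lands_t.squeeze(), lands_s.squeeze())
#   loss = task_loss + kd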
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import argparse
import logging
import os
import time
import torch
import torchvision
import numpy as np
from datasets import PFLDDatasets
from lib.builder import search_space
from lib.ops import PRIMITIVES
from lib.utils import PFLDLoss, accuracy
from nni.algorithms.nas.pytorch.fbnet import (
LookUpTable,
NASConfig,
supernet_sample,
)
from nni.nas.pytorch.utils import AverageMeter
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def validate(model, auxiliarynet, valid_loader, device, logger):
"""Do validation."""
model.eval()
auxiliarynet.eval()
losses, nme = list(), list()
with torch.no_grad():
for i, (img, land_gt, angle_gt) in enumerate(valid_loader):
img = img.to(device, non_blocking=True)
landmark_gt = land_gt.to(device, non_blocking=True)
angle_gt = angle_gt.to(device, non_blocking=True)
landmark, _ = model(img)
# compute the l2 loss
landmark = landmark.squeeze()
l2_diff = torch.sum((landmark_gt - landmark) ** 2, axis=1)
loss = torch.mean(l2_diff)
losses.append(loss.cpu().detach().numpy())
# compute the accuracy
landmark = landmark.cpu().detach().numpy()
landmark = landmark.reshape(landmark.shape[0], -1, 2)
landmark_gt = landmark_gt.cpu().detach().numpy()
landmark_gt = landmark_gt.reshape(landmark_gt.shape[0], -1, 2)
_, nme_i = accuracy(landmark, landmark_gt)
for item in nme_i:
nme.append(item)
logger.info("===> Evaluate:")
logger.info(
"Eval set: Average loss: {:.4f} nme: {:.4f}".format(
np.mean(losses), np.mean(nme)
)
)
return np.mean(losses), np.mean(nme)
def train_epoch(
model,
auxiliarynet,
criterion,
train_loader,
device,
epoch,
optimizer,
logger,
):
"""Train one epoch."""
model.train()
auxiliarynet.train()
batch_time = AverageMeter("batch_time")
data_time = AverageMeter("data_time")
losses = AverageMeter("losses")
end = time.time()
for i, (img, landmark_gt, angle_gt) in enumerate(train_loader):
data_time.update(time.time() - end)
img = img.to(device, non_blocking=True)
landmark_gt = landmark_gt.to(device, non_blocking=True)
angle_gt = angle_gt.to(device, non_blocking=True)
lands, feats = model(img)
landmarks = lands.squeeze()
angle = auxiliarynet(feats)
# task loss
weighted_loss, _ = criterion(
landmark_gt, angle_gt, angle, landmarks
)
loss = weighted_loss
# compute gradient and do SGD step
optimizer.zero_grad()
loss.backward()
optimizer.step()
# measure elapsed time
batch_time.update(time.time() - end)
end = time.time()
# measure accuracy and record loss
losses.update(np.squeeze(loss.cpu().detach().numpy()), img.size(0))
if i % 10 == 0:
batch_log = (
"Train [{0}][{1}]\t"
"Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t"
"Data {data_time.val:.3f} ({data_time.avg:.3f})\t"
"Loss {losses.val:.4f} ({losses.avg:.4f})".format(
epoch + 1,
i,
batch_time=batch_time,
data_time=data_time,
losses=losses,
)
)
logger.info(batch_log)
def save_checkpoint(model, auxiliarynet, optimizer, filename, logger):
"""Save checkpoint of the whole model."""
state = {
"pfld_backbone": model.state_dict(),
"auxiliarynet": auxiliarynet.state_dict(),
"optim": optimizer.state_dict(),
}
torch.save(state, filename)
logger.info("Save checkpoint to {0:}".format(filename))
def main(args):
""" The main function for supernet pre-training and subnet fine-tuning. """
logging.basicConfig(
format="[%(asctime)s] [p%(process)s] [%(pathname)s\
:%(lineno)d] [%(levelname)s] %(message)s",
level=logging.INFO,
handlers=[
logging.FileHandler(args.log_file, mode="w"),
logging.StreamHandler(),
],
)
# print the information of arguments
for arg in vars(args):
s = arg + ": " + str(getattr(args, arg))
logging.info(s)
# for 106 landmarks
num_points = 106
# list of device ids, and the number of workers for data loading
device_ids = [int(id) for id in args.dev_id.split(",")]
dev_num = len(device_ids)
num_workers = 4 * dev_num
# import subnet for fine-tuning
from lib.subnet import PFLDInference, AuxiliaryNet
# the configuration for training control
nas_config = NASConfig(
model_dir=args.snapshot,
search_space=search_space,
)
# look-up table with information of search space, flops per block, etc.
lookup_table = LookUpTable(config=nas_config, primitives=PRIMITIVES)
check = torch.load(args.supernet, map_location=torch.device("cpu"))
sampled_arch = check["arch_sample"]
logging.info(sampled_arch)
# create subnet
pfld_backbone = PFLDInference(lookup_table, sampled_arch, num_points)
# pre-load the weights from pre-trained supernet
state_dict = check["pfld_backbone"]
supernet_sample(pfld_backbone, state_dict, sampled_arch, lookup_table)
# the auxiliary-net of PFLD to predict the pose angle
auxiliarynet = AuxiliaryNet()
# DataParallel
pfld_backbone = torch.nn.DataParallel(pfld_backbone, device_ids=device_ids)
pfld_backbone.to(device)
auxiliarynet = torch.nn.DataParallel(auxiliarynet, device_ids=device_ids)
auxiliarynet.to(device)
# main task loss
criterion = PFLDLoss()
# optimizer / scheduler for weight train
optimizer = torch.optim.RMSprop(
[
{"params": pfld_backbone.parameters()},
{"params": auxiliarynet.parameters()},
],
lr=args.base_lr,
momentum=0.0,
weight_decay=args.weight_decay,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=args.end_epoch, last_epoch=-1
)
# data augmentation and dataloader
transform = torchvision.transforms.Compose(
[torchvision.transforms.ToTensor()]
)
# the landmark dataset with 106 points is default used
train_dataset = PFLDDatasets(
os.path.join(args.data_root, "train_data/list.txt"),
transform,
data_root=args.data_root,
img_size=args.img_size,
)
dataloader = DataLoader(
train_dataset,
batch_size=args.train_batchsize,
shuffle=True,
num_workers=num_workers,
pin_memory=True,
drop_last=False,
)
val_dataset = PFLDDatasets(
os.path.join(args.data_root, "test_data/list.txt"),
transform,
data_root=args.data_root,
img_size=args.img_size,
)
val_dataloader = DataLoader(
val_dataset,
batch_size=args.val_batchsize,
shuffle=False,
num_workers=num_workers,
pin_memory=True,
)
# start finetune
ckpt_path = args.snapshot
val_nme = 1e6
for epoch in range(0, args.end_epoch):
logging.info("\n--------Train epoch: %d--------\n", epoch + 1)
# update the weight parameters
train_epoch(
pfld_backbone,
auxiliarynet,
criterion,
dataloader,
device,
epoch,
optimizer,
logging,
)
# adjust learning rate
scheduler.step()
# validate
_, nme = validate(
pfld_backbone, auxiliarynet, val_dataloader, device, logging
)
if epoch % 10 == 0:
filename = os.path.join(ckpt_path, "checkpoint_%s.pth" % epoch)
save_checkpoint(
pfld_backbone, auxiliarynet, optimizer, filename, logging
)
if nme < val_nme:
filename = os.path.join(ckpt_path, "checkpoint_best.pth")
save_checkpoint(
pfld_backbone, auxiliarynet, optimizer, filename, logging
)
val_nme = nme
logging.info("Best nme: {:.4f}".format(val_nme))
def parse_args():
""" Parse the user arguments. """
parser = argparse.ArgumentParser(description="Finetuning for PFLD")
parser.add_argument("--dev_id", dest="dev_id", default="0", type=str)
parser.add_argument("--base_lr", default=0.0001, type=int)
parser.add_argument("--weight-decay", "--wd", default=1e-6, type=float)
parser.add_argument("--img_size", default=112, type=int)
parser.add_argument("--supernet", default="", type=str, metavar="PATH")
parser.add_argument("--end_epoch", default=300, type=int)
parser.add_argument(
"--snapshot", default="models", type=str, metavar="PATH"
)
parser.add_argument("--log_file", default="train.log", type=str)
parser.add_argument(
"--data_root", default="/dataset", type=str, metavar="PATH"
)
parser.add_argument("--train_batchsize", default=256, type=int)
parser.add_argument("--val_batchsize", default=128, type=int)
args = parser.parse_args()
args.snapshot = os.path.join(args.snapshot, 'subnet')
args.log_file = os.path.join(args.snapshot, "{}.log".format('subnet'))
os.makedirs(args.snapshot, exist_ok=True)
return args
if __name__ == "__main__":
args = parse_args()
main(args)
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import argparse
import logging
import os
import torch
import torchvision
import numpy as np
from datasets import PFLDDatasets
from lib.builder import search_space
from lib.ops import PRIMITIVES
from lib.trainer import PFLDTrainer
from lib.utils import PFLDLoss
from nni.algorithms.nas.pytorch.fbnet import LookUpTable, NASConfig
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def main(args):
""" The main function for supernet pre-training and subnet fine-tuning. """
logging.basicConfig(
format="[%(asctime)s] [p%(process)s] [%(pathname)s\
:%(lineno)d] [%(levelname)s] %(message)s",
level=logging.INFO,
handlers=[
logging.FileHandler(args.log_file, mode="w"),
logging.StreamHandler(),
],
)
# print the information of arguments
for arg in vars(args):
s = arg + ": " + str(getattr(args, arg))
logging.info(s)
# for 106 landmarks
num_points = 106
# list of device ids, and the number of workers for data loading
device_ids = [int(id) for id in args.dev_id.split(",")]
dev_num = len(device_ids)
num_workers = 4 * dev_num
# random seed
manual_seed = 1
np.random.seed(manual_seed)
torch.manual_seed(manual_seed)
torch.cuda.manual_seed_all(manual_seed)
# import supernet for block-wise DNAS pre-training
from lib.supernet import PFLDInference, AuxiliaryNet
# the configuration for training control
nas_config = NASConfig(
model_dir=args.snapshot,
nas_lr=args.theta_lr,
mode=args.mode,
alpha=args.alpha,
beta=args.beta,
search_space=search_space,
)
# look-up table with information of search space, flops per block, etc.
lookup_table = LookUpTable(config=nas_config, primitives=PRIMITIVES)
# create supernet
pfld_backbone = PFLDInference(lookup_table, num_points)
# the auxiliary-net of PFLD to predict the pose angle
auxiliarynet = AuxiliaryNet()
# main task loss
criterion = PFLDLoss()
# optimizer for weight train
if args.opt == "adam":
optimizer = torch.optim.AdamW(
[
{"params": pfld_backbone.parameters()},
{"params": auxiliarynet.parameters()},
],
lr=args.base_lr,
weight_decay=args.weight_decay,
)
elif args.opt == "rms":
optimizer = torch.optim.RMSprop(
[
{"params": pfld_backbone.parameters()},
{"params": auxiliarynet.parameters()},
],
lr=args.base_lr,
momentum=0.0,
weight_decay=args.weight_decay,
)
# data augmentation and dataloader
transform = torchvision.transforms.Compose(
[torchvision.transforms.ToTensor()]
)
# the landmark dataset with 106 points is default used
train_dataset = PFLDDatasets(
os.path.join(args.data_root, "train_data/list.txt"),
transform,
data_root=args.data_root,
img_size=args.img_size,
)
dataloader = DataLoader(
train_dataset,
batch_size=args.train_batchsize,
shuffle=True,
num_workers=num_workers,
pin_memory=True,
drop_last=False,
)
val_dataset = PFLDDatasets(
os.path.join(args.data_root, "test_data/list.txt"),
transform,
data_root=args.data_root,
img_size=args.img_size,
)
val_dataloader = DataLoader(
val_dataset,
batch_size=args.val_batchsize,
shuffle=False,
num_workers=num_workers,
pin_memory=True,
)
# create the trainer, then search/finetune
trainer = PFLDTrainer(
pfld_backbone,
auxiliarynet,
optimizer,
criterion,
device,
device_ids,
nas_config,
lookup_table,
dataloader,
val_dataloader,
n_epochs=args.end_epoch,
logger=logging,
)
trainer.train()
def parse_args():
""" Parse the user arguments. """
parser = argparse.ArgumentParser(description="FBNet for PFLD")
parser.add_argument("--dev_id", dest="dev_id", default="0", type=str)
parser.add_argument("--opt", default="rms", type=str)
parser.add_argument("--base_lr", default=0.0001, type=int)
parser.add_argument("--weight-decay", "--wd", default=1e-6, type=float)
parser.add_argument("--img_size", default=112, type=int)
parser.add_argument("--theta-lr", "--tlr", default=0.01, type=float)
parser.add_argument(
"--mode", default="mul", type=str, choices=["mul", "add"]
)
parser.add_argument("--alpha", default=0.25, type=float)
parser.add_argument("--beta", default=0.6, type=float)
parser.add_argument("--end_epoch", default=300, type=int)
parser.add_argument(
"--snapshot", default="models", type=str, metavar="PATH"
)
parser.add_argument("--log_file", default="train.log", type=str)
parser.add_argument(
"--data_root", default="/dataset", type=str, metavar="PATH"
)
parser.add_argument("--train_batchsize", default=256, type=int)
parser.add_argument("--val_batchsize", default=128, type=int)
args = parser.parse_args()
args.snapshot = os.path.join(args.snapshot, 'supernet')
args.log_file = os.path.join(args.snapshot, "{}.log".format('supernet'))
os.makedirs(args.snapshot, exist_ok=True)
return args
if __name__ == "__main__":
args = parse_args()
main(args)
from __future__ import absolute_import
from .mutator import FBNetMutator # noqa: F401
from .trainer import FBNetTrainer # noqa: F401
from .utils import ( # noqa: F401
LookUpTable,
NASConfig,
RegularizerLoss,
model_init,
supernet_sample,
)
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import torch
from torch import nn as nn
from torch.nn import functional as F
import numpy as np
from nni.nas.pytorch.base_mutator import BaseMutator
from nni.nas.pytorch.mutables import LayerChoice
class MixedOp(nn.Module):
"""
This class is to instantiate and manage info of one LayerChoice.
It includes architecture weights and member functions for the weights.
"""
def __init__(self, mutable, latency):
"""
Parameters
----------
mutable : LayerChoice
A LayerChoice in user model
latency : List
performance cost for each op in mutable
"""
super(MixedOp, self).__init__()
self.latency = latency
n_choices = len(mutable)
self.path_alpha = nn.Parameter(
torch.FloatTensor([1.0 / n_choices for i in range(n_choices)])
)
self.path_alpha.requires_grad = False
self.temperature = 1.0
def get_path_alpha(self):
"""Return the architecture parameter."""
return self.path_alpha
def get_weighted_latency(self):
"""Return the weighted perf_cost of current mutable."""
soft_masks = self.probs_over_ops()
weighted_latency = sum(m * l for m, l in zip(soft_masks, self.latency))
return weighted_latency
def set_temperature(self, temperature):
"""
Set the annealed temperature for gumbel softmax.
Parameters
----------
temperature : float
The annealed temperature for gumbel softmax
"""
self.temperature = temperature
def to_requires_grad(self):
"""Enable gradient calculation."""
self.path_alpha.requires_grad = True
def to_disable_grad(self):
"""Disable gradient calculation."""
self.path_alpha.requires_grad = False
def probs_over_ops(self):
"""Apply gumbel softmax to generate probability distribution."""
return F.gumbel_softmax(self.path_alpha, self.temperature)
def forward(self, mutable, x):
"""
Define forward of LayerChoice.
Parameters
----------
mutable : LayerChoice
this layer's mutable
x : tensor
inputs of this layer, only support one input
Returns
-------
output: tensor
output of this layer
"""
candidate_ops = list(mutable)
soft_masks = self.probs_over_ops()
output = sum(m * op(x) for m, op in zip(soft_masks, candidate_ops))
return output
@property
def chosen_index(self):
"""
choose the op with max prob
Returns
-------
int
index of the chosen one
"""
alphas = self.path_alpha.data.detach().cpu().numpy()
index = int(np.argmax(alphas))
return index
class FBNetMutator(BaseMutator):
"""
This mutator initializes and operates all the LayerChoices of the supernet.
It is for the related trainer to control the training flow of LayerChoices,
coordinating with whole training process.
"""
def __init__(self, model, lookup_table):
"""
Init a MixedOp instance for each mutable i.e., LayerChoice.
And register the instantiated MixedOp in corresponding LayerChoice.
If it is not registered in the LayerChoice, DataParallel does not work,
because the architecture weights would not be included in the DataParallel model.
When MixedOPs are registered, we use ```requires_grad``` to control
whether calculate gradients of architecture weights.
Parameters
----------
model : pytorch model
The model that users want to tune,
it includes search space defined with nni nas apis
lookup_table : class
lookup table object to manage model space information,
including candidate ops for each stage as the model space,
input channels/output channels/stride/fm_size as the layer config,
and the performance information for perf_cost accumulation.
"""
super(FBNetMutator, self).__init__(model)
self.mutable_list = []
# Collect the op names of the candidate ops within each mutable
ops_names_mutable = dict()
left = 0
right = 1
for stage_name in lookup_table.layer_num:
right = lookup_table.layer_num[stage_name]
stage_ops = lookup_table.lut_ops[stage_name]
ops_names = [op_name for op_name in stage_ops]
for i in range(left, left + right):
ops_names_mutable[i] = ops_names
left += right
# Create the mixed op
for i, mutable in enumerate(self.undedup_mutables):
ops_names = ops_names_mutable[i]
latency_mutable = lookup_table.lut_perf[i]
latency = [latency_mutable[op_name] for op_name in ops_names]
self.mutable_list.append(mutable)
mutable.registered_module = MixedOp(mutable, latency)
def on_forward_layer_choice(self, mutable, *args, **kwargs):
"""
Callback of layer choice forward. This function defines the forward
logic of the input mutable. The mutable is only an interface; its
real implementation is defined in the mutator.
Parameters
----------
mutable: LayerChoice
forward logic of this input mutable
args: list of torch.Tensor
inputs of this mutable
kwargs: dict
inputs of this mutable
Returns
-------
torch.Tensor
output of this mutable, i.e., LayerChoice
int
index of the chosen op
"""
# FIXME: return mask, to be consistent with other algorithms
idx = mutable.registered_module.chosen_index
return mutable.registered_module(mutable, *args, **kwargs), idx
def num_arch_params(self):
"""
The number of mutables, i.e., LayerChoice
Returns
-------
int
the number of LayerChoice in user model
"""
return len(self.mutable_list)
def get_architecture_parameters(self):
"""
Get all the architecture parameters.
Yields
------
PyTorch Parameter
path_alpha of the traversed mutable
"""
for mutable in self.undedup_mutables:
yield mutable.registered_module.get_path_alpha()
def get_weighted_latency(self):
"""
Get the latency weighted by gumbel softmax coefficients.
Yields
------
torch.Tensor
the weighted_latency of the traversed mutable
"""
for mutable in self.undedup_mutables:
yield mutable.registered_module.get_weighted_latency()
def set_temperature(self, temperature):
"""
Set the annealed temperature of the op for gumbel softmax.
Parameters
----------
temperature : float
The annealed temperature for gumbel softmax
"""
for mutable in self.undedup_mutables:
mutable.registered_module.set_temperature(temperature)
def arch_requires_grad(self):
"""
Make architecture weights require gradient
"""
for mutable in self.undedup_mutables:
mutable.registered_module.to_requires_grad()
def arch_disable_grad(self):
"""
Disable gradient of architecture weights, i.e., does not
calculate gradient for them.
"""
for mutable in self.undedup_mutables:
mutable.registered_module.to_disable_grad()
def sample_final(self):
"""
Generate the final chosen architecture.
Returns
-------
dict
the choice of each mutable, i.e., LayerChoice
"""
result = dict()
for mutable in self.undedup_mutables:
assert isinstance(mutable, LayerChoice)
index = mutable.registered_module.chosen_index
# pylint: disable=not-callable
result[mutable.key] = (
F.one_hot(torch.tensor(index), num_classes=len(mutable))
.view(-1)
.bool()
)
return result
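# The following is a minimal, self-contained sketch (added for illustration,
# not part of the original module) of the Gumbel-Softmax mixing performed in
# MixedOp.forward: soft masks are sampled from the architecture weights and
# used to blend the outputs of the candidate ops. The two candidate
# convolutions are hypothetical placeholders.
if __name__ == "__main__":
    candidate_ops = [nn.Conv2d(3, 8, 3, padding=1), nn.Conv2d(3, 8, 1)]
    alpha = torch.zeros(len(candidate_ops))  # uniform logits, as in MixedOp.__init__
    soft_masks = F.gumbel_softmax(alpha, tau=1.0)  # sums to 1 over the candidates
    x = torch.randn(1, 3, 32, 32)
    out = sum(m * op(x) for m, op in zip(soft_masks, candidate_ops))
    print(out.shape)  # torch.Size([1, 8, 32, 32])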
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import json
import os
import time
import torch
import numpy as np
from nni.nas.pytorch.base_trainer import BaseTrainer
from nni.nas.pytorch.trainer import TorchTensorEncoder
from nni.nas.pytorch.utils import AverageMeter
from .mutator import FBNetMutator
from .utils import RegularizerLoss, accuracy
class FBNetTrainer(BaseTrainer):
def __init__(
self,
model,
model_optim,
criterion,
device,
device_ids,
lookup_table,
train_loader,
valid_loader,
n_epochs=120,
load_ckpt=False,
arch_path=None,
logger=None,
):
"""
Parameters
----------
model : pytorch model
the user model, which has mutables
model_optim : pytorch optimizer
the user defined optimizer
criterion : pytorch loss
the main task loss; e.g., nn.CrossEntropyLoss() for classification
device : pytorch device
the devices to train/search the model
device_ids : list of int
the indexes of devices used for training
lookup_table : class
lookup table object for fbnet training
train_loader : pytorch data loader
data loader for the training set
valid_loader : pytorch data loader
data loader for the validation set
n_epochs : int
number of epochs to train/search
load_ckpt : bool
whether load checkpoint
arch_path : str
the path to store chosen architecture
logger : logger
the logger
"""
self.model = model
self.model_optim = model_optim
self.train_loader = train_loader
self.valid_loader = valid_loader
self.device = device
self.dev_num = len(device_ids)
self.n_epochs = n_epochs
self.lookup_table = lookup_table
self.config = lookup_table.config
self.start_epoch = self.config.start_epoch
self.temp = self.config.init_temperature
self.exp_anneal_rate = self.config.exp_anneal_rate
self.mode = self.config.mode
self.load_ckpt = load_ckpt
self.arch_path = arch_path
self.logger = logger
# scheduler of learning rate
self.scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
model_optim, T_max=n_epochs, last_epoch=-1
)
# init mutator
self.mutator = FBNetMutator(model, lookup_table)
self.mutator.set_temperature(self.temp)
# DataParallel should be applied after the mutator is initialized
self.model = torch.nn.DataParallel(self.model, device_ids=device_ids)
self.model.to(device)
# build architecture optimizer
self.arch_optimizer = torch.optim.AdamW(
self.mutator.get_architecture_parameters(),
self.config.nas_lr,
weight_decay=self.config.nas_weight_decay,
)
self.reg_loss = RegularizerLoss(config=self.config)
self.criterion = criterion
self.epoch = 0
def _layer_choice_sample(self):
"""
Sample the chosen op index within each LayerChoice, by taking the
argmax of its architecture weights.
"""
stages = [stage_name for stage_name in self.lookup_table.layer_num]
stage_lnum = [self.lookup_table.layer_num[stage] for stage in stages]
# get the choice idx in each layer
choice_ids = list()
layer_id = 0
for param in self.mutator.get_architecture_parameters():
param_np = param.cpu().detach().numpy()
op_idx = np.argmax(param_np)
choice_ids.append(op_idx)
self.logger.info(
"layer {}: {}, index: {}".format(layer_id, param_np, op_idx)
)
layer_id += 1
# get the arch_sample
choice_names = list()
layer_id = 0
for i, stage_name in enumerate(stages):
ops_names = [op for op in self.lookup_table.lut_ops[stage_name]]
for j in range(stage_lnum[i]):
searched_op = ops_names[choice_ids[layer_id]]
choice_names.append(searched_op)
layer_id += 1
self.logger.info(choice_names)
return choice_names
def _get_perf_cost(self, requires_grad=True):
"""
Get the accumulated performance cost.
"""
perf_cost = torch.zeros(1, requires_grad=requires_grad).to(
self.device, non_blocking=True
)
for latency in self.mutator.get_weighted_latency():
perf_cost = perf_cost + latency
return perf_cost
def _validate(self):
"""
Do validation. During validation, LayerChoices use the mixed-op.
Returns
-------
float, float, float
average loss, average top1 accuracy, average top5 accuracy
"""
self.valid_loader.batch_sampler.drop_last = False
batch_time = AverageMeter("batch_time")
losses = AverageMeter("losses")
top1 = AverageMeter("top1")
top5 = AverageMeter("top5")
# test on validation set under eval mode
self.model.eval()
end = time.time()
with torch.no_grad():
for i, (images, labels) in enumerate(self.valid_loader):
images = images.to(self.device, non_blocking=True)
labels = labels.to(self.device, non_blocking=True)
output = self.model(images)
loss = self.criterion(output, labels)
acc1, acc5 = accuracy(output, labels, topk=(1, 5))
losses.update(loss.item(), images.size(0))
top1.update(acc1[0].item(), images.size(0))
top5.update(acc5[0].item(), images.size(0))
# measure elapsed time
batch_time.update(time.time() - end)
end = time.time()
if i % 10 == 0 or i + 1 == len(self.valid_loader):
test_log = (
"Valid" + ": [{0}/{1}]\t"
"Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t"
"Loss {loss.val:.4f} ({loss.avg:.4f})\t"
"Top-1 acc {top1.val:.3f} ({top1.avg:.3f})\t"
"Top-5 acc {top5.val:.3f} ({top5.avg:.3f})".format(
i,
len(self.valid_loader) - 1,
batch_time=batch_time,
loss=losses,
top1=top1,
top5=top5,
)
)
self.logger.info(test_log)
return losses.avg, top1.avg, top5.avg
def _train_epoch(self, epoch, optimizer, arch_train=False):
"""
Train one epoch.
"""
batch_time = AverageMeter("batch_time")
data_time = AverageMeter("data_time")
losses = AverageMeter("losses")
top1 = AverageMeter("top1")
top5 = AverageMeter("top5")
# switch to train mode
self.model.train()
data_loader = self.valid_loader if arch_train else self.train_loader
end = time.time()
for i, (images, labels) in enumerate(data_loader):
data_time.update(time.time() - end)
images = images.to(self.device, non_blocking=True)
labels = labels.to(self.device, non_blocking=True)
output = self.model(images)
loss = self.criterion(output, labels)
# hardware-aware loss
perf_cost = self._get_perf_cost(requires_grad=True)
regu_loss = self.reg_loss(perf_cost)
if self.mode.startswith("mul"):
loss = loss * regu_loss
elif self.mode.startswith("add"):
loss = loss + regu_loss
# measure accuracy and record loss
acc1, acc5 = accuracy(output, labels, topk=(1, 5))
losses.update(loss.item(), images.size(0))
top1.update(acc1[0].item(), images.size(0))
top5.update(acc5[0].item(), images.size(0))
# compute gradient and do SGD step
optimizer.zero_grad()
loss.backward()
optimizer.step()
# measure elapsed time
batch_time.update(time.time() - end)
end = time.time()
if i % 10 == 0:
batch_log = (
"Warmup Train [{0}][{1}]\t"
"Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t"
"Data {data_time.val:.3f} ({data_time.avg:.3f})\t"
"Loss {losses.val:.4f} ({losses.avg:.4f})\t"
"Top-1 acc {top1.val:.3f} ({top1.avg:.3f})\t"
"Top-5 acc {top5.val:.3f} ({top5.avg:.3f})\t".format(
epoch + 1,
i,
batch_time=batch_time,
data_time=data_time,
losses=losses,
top1=top1,
top5=top5,
)
)
self.logger.info(batch_log)
def _warm_up(self):
"""
Warm up the model, while the architecture weights are not trained.
"""
for epoch in range(self.epoch, self.start_epoch):
self.logger.info("\n--------Warmup epoch: %d--------\n", epoch + 1)
self._train_epoch(epoch, self.model_optim)
# adjust learning rate
self.scheduler.step()
# validation
val_loss, val_top1, val_top5 = self._validate()
val_log = (
"Warmup Valid [{0}/{1}]\t"
"loss {2:.3f}\ttop-1 acc {3:.3f}\ttop-5 acc {4:.3f}".format(
epoch + 1, self.start_epoch, val_loss, val_top1, val_top5
)
)
self.logger.info(val_log)
if epoch % 10 == 0:
filename = os.path.join(
self.config.model_dir, "checkpoint_%s.pth" % epoch
)
self.save_checkpoint(epoch, filename)
def _train(self):
"""
Train the model; this trains both the model weights and the
architecture weights. The architecture weights are trained according
to the schedule. Before updating the architecture weights,
``requires_grad`` is enabled; it is disabled again after the update,
so that the architecture weights stay fixed while the model weights
are trained.
"""
arch_param_num = self.mutator.num_arch_params()
self.logger.info("#arch_params: {}".format(arch_param_num))
self.epoch = max(self.start_epoch, self.epoch)
ckpt_path = self.config.model_dir
choice_names = None
top1_best = 0.0
for epoch in range(self.epoch, self.n_epochs):
self.logger.info("\n--------Train epoch: %d--------\n", epoch + 1)
# update the weight parameters
self._train_epoch(epoch, self.model_optim)
# adjust learning rate
self.scheduler.step()
self.logger.info("Update architecture parameters")
# update the architecture parameters
self.mutator.arch_requires_grad()
self._train_epoch(epoch, self.arch_optimizer, True)
self.mutator.arch_disable_grad()
# temperature annealing
self.temp = self.temp * self.exp_anneal_rate
self.mutator.set_temperature(self.temp)
# sample the architecture of sub-network
choice_names = self._layer_choice_sample()
# validate
val_loss, val_top1, val_top5 = self._validate()
val_log = (
"Valid [{0}]\t"
"loss {1:.3f}\ttop-1 acc {2:.3f} \ttop-5 acc {3:.3f}".format(
epoch + 1, val_loss, val_top1, val_top5
)
)
self.logger.info(val_log)
if epoch % 10 == 0:
filename = os.path.join(ckpt_path, "checkpoint_%s.pth" % epoch)
self.save_checkpoint(epoch, filename, choice_names)
val_top1 = float(val_top1)
if val_top1 > top1_best:
filename = os.path.join(ckpt_path, "checkpoint_best.pth")
self.save_checkpoint(epoch, filename, choice_names)
top1_best = val_top1
def save_checkpoint(self, epoch, filename, choice_names=None):
"""
Save checkpoint of the whole model.
Save the model weights and the architecture weights to ``filename``,
and save the currently chosen architecture to ``arch_path``.
"""
state = {
"model": self.model.state_dict(),
"optim": self.model_optim.state_dict(),
"epoch": epoch,
"arch_sample": choice_names,
}
torch.save(state, filename)
self.logger.info("Save checkpoint to {0:}".format(filename))
if self.arch_path:
self.export(self.arch_path)
def load_checkpoint(self, filename):
"""
Load the checkpoint from ``filename``.
"""
ckpt = torch.load(filename)
self.epoch = ckpt["epoch"]
self.model.load_state_dict(ckpt["model"])
self.model_optim.load_state_dict(ckpt["optim"])
def train(self):
"""
Train the whole model.
"""
if self.load_ckpt:
ckpt_path = self.config.model_dir
filename = os.path.join(ckpt_path, "checkpoint_best.pth")
if os.path.exists(filename):
self.load_checkpoint(filename)
if self.epoch < self.start_epoch:
self._warm_up()
self._train()
def export(self, file_name):
"""
Export the chosen architecture into a file
Parameters
----------
file_name : str
the file that stores exported chosen architecture
"""
exported_arch = self.mutator.sample_final()
with open(file_name, "w") as f:
json.dump(
exported_arch,
f,
indent=2,
sort_keys=True,
cls=TorchTensorEncoder,
)
def validate(self):
raise NotImplementedError
def checkpoint(self):
raise NotImplementedError
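# A usage sketch (added for illustration; the supernet model, criterion and
# data loaders are hypothetical placeholders -- see examples/nas/oneshot/pfld
# for the real flow):
#
#   lookup_table = LookUpTable(config=nas_config, primitives=custom_ops)
#   model = SupernetModel(lookup_table)  # user model containing LayerChoices
#   optim = torch.optim.RMSprop(model.parameters(), lr=0.1)
#   trainer = FBNetTrainer(model, optim, criterion, torch.device("cuda"),
#                          [0], lookup_table, train_loader, valid_loader,
#                          n_epochs=300)
#   trainer.train()             # warm-up epochs first, then the joint search
#   trainer.export("arch.json") # dump the finally chosen ops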
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
from __future__ import absolute_import, division, print_function
import gc # noqa: F401
import os
import timeit
import torch
import numpy as np
import torch.nn as nn
from nni.compression.pytorch.utils.counter import count_flops_params
LUT_FILE = "lut.npy"
LUT_PATH = "lut"
class NASConfig:
def __init__(
self,
perf_metric="flops",
lut_load=False,
model_dir=None,
nas_lr=0.01,
nas_weight_decay=5e-4,
mode="mul",
alpha=0.25,
beta=0.6,
start_epoch=50,
init_temperature=5.0,
exp_anneal_rate=np.exp(-0.045),
search_space=None,
):
# LUT of performance metric
# flops means the number of multiply operations; latency means the measured time cost on the target platform
self.perf_metric = perf_metric
assert perf_metric in [
"flops",
"latency",
], "perf_metric should be ['flops', 'latency']"
# whether to load or create the LUT file
self.lut_load = lut_load
# necessary dirs
self.lut_en = model_dir is not None
if self.lut_en:
self.model_dir = model_dir
os.makedirs(model_dir, exist_ok=True)
self.lut_path = os.path.join(model_dir, LUT_PATH)
os.makedirs(self.lut_path, exist_ok=True)
# NAS learning setting
self.nas_lr = nas_lr
self.nas_weight_decay = nas_weight_decay
# hardware-aware loss setting
self.mode = mode
assert mode in ["mul", "add"], "mode should be one of ['mul', 'add']"
self.alpha = alpha
self.beta = beta
# NAS training setting
self.start_epoch = start_epoch
self.init_temperature = init_temperature
self.exp_anneal_rate = exp_anneal_rate
# definition of search blocks and space
self.search_space = search_space
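# Note (added): with the defaults above, the Gumbel-Softmax temperature is
# annealed by a factor of exp(-0.045) ~= 0.956 per search epoch, e.g.
# 5.0 -> 4.78 -> 4.57 -> ... (see FBNetTrainer._train).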
class RegularizerLoss(nn.Module):
"""Auxilliary loss for hardware-aware NAS."""
def __init__(self, config):
"""
Parameters
----------
config : class
manages the configuration for NAS training, the search space, etc.
"""
super(RegularizerLoss, self).__init__()
self.mode = config.mode
self.alpha = config.alpha
self.beta = config.beta
def forward(self, perf_cost, batch_size=1):
"""
Parameters
----------
perf_cost : tensor
the accumulated performance cost
batch_size : int
batch size for normalization
Returns
-------
output: tensor
the hardware-aware constraint loss
"""
if self.mode == "mul":
log_loss = torch.log(perf_cost / batch_size) ** self.beta
return self.alpha * log_loss
elif self.mode == "add":
linear_loss = (perf_cost / batch_size) ** self.beta
return self.alpha * linear_loss
else:
raise NotImplementedError
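# Worked example (added): with the defaults alpha=0.25, beta=0.6 and an
# accumulated perf_cost of, say, 1000.0, mode "mul" scales the task loss by
# 0.25 * log(1000) ** 0.6 ~= 0.80, while mode "add" adds a penalty of
# 0.25 * 1000 ** 0.6 ~= 15.77.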
def accuracy(output, target, topk=(1,)):
"""
Computes the precision@k for the specified values of k
Parameters
----------
output : pytorch tensor
output, e.g., predicted value
target : pytorch tensor
label
topk : tuple
specify top1 and top5
Returns
-------
list
accuracy of top1 and top5
"""
maxk = max(topk)
batch_size = target.size(0)
_, pred = output.topk(maxk, 1, True, True)
pred = pred.t()
correct = pred.eq(target.view(1, -1).expand_as(pred))
res = []
for k in topk:
correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
res.append(correct_k.mul_(100.0 / batch_size))
return res
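# Example (added): for `output` of shape (N, C) and integer `target` of
# shape (N,), `acc1, acc5 = accuracy(output, target, topk=(1, 5))` returns
# the top-1/top-5 percentages as 1-element tensors.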
def supernet_sample(model, state_dict, sampled_arch=None, lookup_table=None):
"""
Initialize the searched sub-model from supernet.
Parameters
----------
model : pytorch model
the created subnet
state_dict : checkpoint
the checkpoint of supernet, including the pre-trained params
sampled_arch : list of str
the searched layer names of the subnet
lookup_table : class
to manage the candidate ops, layer information and layer performance
"""
replace = list()
stages = [stage for stage in lookup_table.layer_num]
stage_lnum = [lookup_table.layer_num[stage] for stage in stages]
if sampled_arch:
layer_id = 0
for i, stage in enumerate(stages):
ops_names = [op_name for op_name in lookup_table.lut_ops[stage]]
for j in range(stage_lnum[i]):
searched_op = sampled_arch[layer_id]
op_i = ops_names.index(searched_op)
replace.append(
[
"blocks.{}.".format(layer_id),
"blocks.{}.op.".format(layer_id),
"blocks.{}.{}.".format(layer_id, op_i),
]
)
layer_id += 1
model_init(model, state_dict, replace=replace)
def model_init(model, state_dict, replace=None):
"""Initialize the model from state_dict."""
prefix = "module."
param_dict = dict()
for k, v in state_dict.items():
if k.startswith(prefix):
k = k[len(prefix):]
param_dict[k] = v
for name, m in model.named_modules():
if replace:
for layer_replace in replace:
assert len(layer_replace) == 3, "each replace entry should have three elements"
pre_scope, key, replace_key = layer_replace
if pre_scope in name:
name = name.replace(key, replace_key)
# Copy the state_dict to current model
if (name + ".weight" in param_dict) or (
name + ".running_mean" in param_dict
):
if isinstance(m, nn.BatchNorm2d):
shape = m.running_mean.shape
if shape == param_dict[name + ".running_mean"].shape:
if m.weight is not None:
m.weight.data = param_dict[name + ".weight"]
m.bias.data = param_dict[name + ".bias"]
m.running_mean = param_dict[name + ".running_mean"]
m.running_var = param_dict[name + ".running_var"]
elif isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
shape = m.weight.data.shape
if shape == param_dict[name + ".weight"].shape:
m.weight.data = param_dict[name + ".weight"]
if m.bias is not None:
m.bias.data = param_dict[name + ".bias"]
elif isinstance(m, nn.ConvTranspose2d):
m.weight.data = param_dict[name + ".weight"]
if m.bias is not None:
m.bias.data = param_dict[name + ".bias"]
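# Usage sketch (added), assuming a checkpoint written by
# FBNetTrainer.save_checkpoint, which stores the keys "model" and
# "arch_sample":
#
#   ckpt = torch.load("checkpoint_best.pth", map_location="cpu")
#   supernet_sample(subnet, ckpt["model"], ckpt["arch_sample"], lookup_table)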
class LookUpTable:
"""Build look-up table for NAS."""
def __init__(self, config, primitives):
"""
Parameters
----------
config : class
manages the configuration for NAS training, the search space, etc.
primitives : dict
maps each candidate op name to its constructor
"""
self.config = config
# definition of search blocks and space
self.search_space = config.search_space
# layers for NAS
self.cnt_layers = len(self.search_space["input_shape"])
# constructors for each operation
self.lut_ops = {
stage_name: {
op_name: primitives[op_name]
for op_name in self.search_space["stages"][stage_name]["ops"]
}
for stage_name in self.search_space["stages"]
}
self.layer_num = {
stage_name: self.search_space["stages"][stage_name]["layer_num"]
for stage_name in self.search_space["stages"]
}
# arguments for the op constructors; input_shapes are just for convenience
self.layer_configs, self.layer_in_shapes = self._layer_configs()
# lookup_table
self.perf_metric = config.perf_metric
if config.lut_en:
self.lut_perf = None
self.lut_file = os.path.join(config.lut_path, LUT_FILE)
if config.lut_load:
self._load_from_file()
else:
self._create_perfs()
def _layer_configs(self):
"""Generate basic params for different layers."""
# layer_configs are: c_in, c_out, stride, fm_size
layer_configs = [
[
self.search_space["input_shape"][layer_id][0],
self.search_space["channel_size"][layer_id],
self.search_space["strides"][layer_id],
self.search_space["fm_size"][layer_id],
]
for layer_id in range(self.cnt_layers)
]
# layer_in_shapes are (C_in, input_w, input_h)
layer_in_shapes = self.search_space["input_shape"]
return layer_configs, layer_in_shapes
def _create_perfs(self, cnt_of_runs=200):
"""Create performance cost for each op."""
if self.perf_metric == "latency":
self.lut_perf = self._calculate_latency(cnt_of_runs)
elif self.perf_metric == "flops":
self.lut_perf = self._calculate_flops()
self._write_lut_to_file()
def _calculate_flops(self, eps=0.001):
"""FLOPs cost."""
flops_lut = [{} for i in range(self.cnt_layers)]
layer_id = 0
for stage_name in self.lut_ops:
stage_ops = self.lut_ops[stage_name]
ops_num = self.layer_num[stage_name]
for _ in range(ops_num):
for op_name in stage_ops:
layer_config = self.layer_configs[layer_id]
key_params = {"fm_size": layer_config[3]}
op = stage_ops[op_name](*layer_config[0:3], **key_params)
# measured in FLOPs
in_shape = self.layer_in_shapes[layer_id]
x = (1, in_shape[0], in_shape[1], in_shape[2])
flops, _, _ = count_flops_params(op, x, verbose=False)
flops = eps if flops == 0.0 else flops
flops_lut[layer_id][op_name] = float(flops)
layer_id += 1
return flops_lut
def _calculate_latency(self, cnt_of_runs):
"""Latency cost."""
LATENCY_BATCH_SIZE = 1
latency_lut = [{} for i in range(self.cnt_layers)]
layer_id = 0
for stage_name in self.lut_ops:
stage_ops = self.lut_ops[stage_name]
ops_num = self.layer_num[stage_name]
for _ in range(ops_num):
for op_name in stage_ops:
layer_config = self.layer_configs[layer_id]
key_params = {"fm_size": layer_config[3]}
op = stage_ops[op_name](*layer_config[0:3], **key_params)
input_data = torch.randn(
(LATENCY_BATCH_SIZE, *self.layer_in_shapes[layer_id])
)
globals()["op"], globals()["input_data"] = op, input_data
total_time = timeit.timeit(
"output = op(input_data)",
setup="gc.enable()",
globals=globals(),
number=cnt_of_runs,
)
# measured in microseconds
latency_lut[layer_id][op_name] = (
total_time / cnt_of_runs / LATENCY_BATCH_SIZE * 1e6
)
layer_id += 1
return latency_lut
def _write_lut_to_file(self):
"""Save lut as numpy file."""
np.save(self.lut_file, self.lut_perf)
def _load_from_file(self):
"""Load numpy file."""
self.lut_perf = np.load(self.lut_file, allow_pickle=True)
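# Construction sketch (added): `search_space` and `PRIMITIVES` are
# user-supplied; with perf_metric="flops" the per-layer costs are computed
# via count_flops_params and cached to <model_dir>/lut/lut.npy:
#
#   config = NASConfig(perf_metric="flops", model_dir="./ckpt",
#                      search_space=search_space)
#   lut = LookUpTable(config, primitives=PRIMITIVES)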