The complete example code can be found :githublink:`here <examples/nas/oneshot/proxylessnas>`.

The trainer accepts the following parameters; a minimal usage sketch is shown after the list.

* **model** (*PyTorch model, required*\ ) - The model that users want to tune/search. It has mutables to specify the search space.
* **metrics** (*PyTorch module, required*\ ) - The main term of the loss function for model training. Receives logits and ground-truth labels, returns a loss tensor.
* **optimizer** (*PyTorch Optimizer, required*\ ) - The optimizer used for optimizing the model.
* **num_epochs** (*int, optional, default = 120*\ ) - The number of epochs to train/search.
* **dataset** (*PyTorch dataset, required*\ ) - Dataset for training. Will be split for training weights and architecture weights.
* **warmup_epochs** (*int, optional, default = 0*\ ) - The number of epochs to do during warmup.
* **workers** (*int, optional, default = 4*\ ) - Workers for data loading.
* **device** (*device, optional, default = 'cpu'*\ ) - The device that users provide to do the train/search. The trainer applies data parallel on the model for users.
* **arc_learning_rate** (*float, optional, default = 1e-3*\ ) - The learning rate of the architecture parameters optimizer.
* **grad_reg_loss_type** (*'mul#log', 'add#linear', or None, optional, default = 'add#linear'*\ ) - Regularization type of the hardware-related loss. The trainer does not apply loss regularization when ``grad_reg_loss_type`` is set to None.
* **grad_reg_loss_params** (*dict, optional, default = None*\ ) - Regularization params. 'alpha' and 'beta' are required when ``grad_reg_loss_type`` is 'mul#log'; 'lambda' is required when ``grad_reg_loss_type`` is 'add#linear'.
* **applied_hardware** (*string, optional, default = None*\ ) - Target hardware used to constrain the model's latency. Latency is predicted by Microsoft nn-Meter (https://github.com/microsoft/nn-Meter).
* **dummy_input** (*tuple, optional, default = (1, 3, 224, 224)*\ ) - The dummy input shape when applied to the target hardware.
* **ref_latency** (*float, optional, default = 65.0*\ ) - Reference latency value on the applied hardware (ms).
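
A minimal usage sketch of the trainer is given below, assuming the trainer is exposed as ``ProxylessTrainer`` under ``nni.retiarii.oneshot.pytorch`` (the module referenced later in this document). The ``fit``/``export`` calls, the supernet class, and the data paths are illustrative assumptions; the :githublink:`example code <examples/nas/oneshot/proxylessnas>` remains the authoritative reference.

.. code-block:: python

   # Minimal sketch (not verbatim from the example code): wiring the documented
   # parameters into the trainer. Import path, class name and fit()/export()
   # are assumptions; parameter names follow the list above.
   import torch
   import torch.nn as nn
   from torchvision import datasets, transforms
   from nni.retiarii.oneshot.pytorch import ProxylessTrainer  # assumed location

   model = MyImageNetSupernet()  # hypothetical model containing LayerChoice mutables
   dataset = datasets.ImageFolder(
       'data/imagenet/train',
       transform=transforms.Compose([transforms.RandomResizedCrop(224),
                                     transforms.ToTensor()]))

   trainer = ProxylessTrainer(
       model,
       metrics=nn.CrossEntropyLoss(),   # receives logits and labels, returns a loss tensor
       optimizer=torch.optim.SGD(model.parameters(), lr=0.05,
                                 momentum=0.9, weight_decay=4e-5),
       num_epochs=120,
       dataset=dataset,                 # split internally for weights and architecture weights
       warmup_epochs=0,
       workers=4,
       device=torch.device('cuda'),
       arc_learning_rate=1e-3,
       # hardware-aware loss regularization, with latency predicted by nn-Meter
       grad_reg_loss_type='add#linear',
       grad_reg_loss_params={'lambda': 0.1},
       applied_hardware='cortexA76cpu_tflite21',
       dummy_input=(1, 3, 224, 224),
       ref_latency=65.0,
   )
   trainer.fit()
   print(trainer.export())              # the chosen architecture
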
Implementation
--------------

The implementation on NNI is based on the `official implementation <https://github.com/mit-han-lab/ProxylessNAS>`__. The official implementation supports two training approaches, gradient descent and RL-based, as well as several target hardware settings, including 'mobile', 'cpu', 'gpu8', and 'flops'. Our current implementation on NNI supports the gradient descent training approach; complete support of ProxylessNAS is ongoing.

In the NNI repo, hardware latency prediction is provided by `Microsoft nn-Meter <https://github.com/microsoft/nn-Meter>`__, an accurate inference latency predictor for DNN models on diverse edge devices. nn-Meter currently supports four hardware platforms: *'cortexA76cpu_tflite21'*, *'adreno640gpu_tflite21'*, *'adreno630gpu_tflite21'*, and *'myriadvpu_openvino2019r2'*. Users can find more information about nn-Meter on its website; more hardware will be supported in the future.
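
For reference, the latency predictor can also be queried on its own. The sketch below follows the ``load_latency_predictor`` / ``predict`` usage shown in nn-Meter's documentation; exact argument names may vary between nn-Meter versions, so treat it as an approximation.

.. code-block:: python

   # Rough sketch: asking nn-Meter for a latency estimate of a PyTorch model
   # on one of the supported platforms. Based on nn-Meter's documented usage;
   # verify argument names against the installed nn-Meter version.
   import torchvision.models as models
   from nn_meter import load_latency_predictor

   predictor = load_latency_predictor('cortexA76cpu_tflite21')  # one of the four supported platforms
   net = models.mobilenet_v2()
   latency_ms = predictor.predict(net, model_type='torch', input_shape=(1, 3, 224, 224))
   print(f'predicted latency: {latency_ms:.2f} ms')
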
Below we describe the implementation details. Like other one-shot NAS algorithms on NNI, ProxylessNAS is composed of two parts: *search space* and *training approach*. For users to flexibly define their own search space and use the built-in ProxylessNAS training approach, we put the specified search space in the :githublink:`example code <examples/nas/oneshot/proxylessnas>` using the :githublink:`NNI NAS interface <nni/retiarii/oneshot/pytorch/proxyless>`.
.. image:: ../../img/proxylessnas.png
   :target: ../../img/proxylessnas.png
   :alt:
The ProxylessNAS training approach is composed of ProxylessLayerChoice and ProxylessNasTrainer. ProxylessLayerChoice instantiates a MixedOp for each mutable (i.e., LayerChoice) and manages the architecture weights in the MixedOp. **For DataParallel**\ , architecture weights should be included in the user model. Specifically, in the ProxylessNAS implementation, we add the MixedOp to the corresponding mutable (i.e., LayerChoice) as a member variable. The ProxylessLayerChoice class also exposes two member functions, ``resample`` and ``finalize_grad``, for the trainer to control the training of architecture weights.

ProxylessLayerChoice also implements the forward logic of the mutables (i.e., LayerChoice).
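
To make this concrete, here is a deliberately simplified sketch of a layer choice in this style: the architecture parameters live inside the module (so DataParallel replicates them together with the model weights) and the forward pass is a softmax-weighted sum over the candidate ops. This is not the actual NNI implementation, which additionally samples binary gates and exposes ``resample``/``finalize_grad``; the class and method names below are illustrative only.

.. code-block:: python

   # Simplified, illustrative sketch of a ProxylessNAS-style layer choice.
   # Not the actual NNI implementation; names are illustrative.
   import torch
   import torch.nn as nn
   import torch.nn.functional as F

   class SimplifiedProxylessLayerChoice(nn.Module):
       def __init__(self, candidate_ops):
           super().__init__()
           self.ops = nn.ModuleList(candidate_ops)
           # architecture weights are kept inside the module so DataParallel
           # handles them together with the ordinary model weights
           self.alpha = nn.Parameter(torch.randn(len(self.ops)) * 1e-3)

       def forward(self, x):
           weights = F.softmax(self.alpha, dim=-1)
           return sum(w * op(x) for w, op in zip(weights, self.ops))

       def export(self):
           # the chosen candidate is the one with the largest architecture weight
           return int(self.alpha.argmax())
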
Reproduce Results
-----------------

To reproduce the result, we first run the search. We found that although the search runs for many epochs, the chosen architecture converges within the first several epochs. This is probably caused by the hyper-parameters or the implementation; we are working on it.