***New use case sharing**: [Cost-effective Hyper-parameter Tuning using AdaptDL with NNI](https://medium.com/casl-project/cost-effective-hyper-parameter-tuning-using-adaptdl-with-nni-e55642888761) - _posted on Feb-23-2021_
...
...
@@ -252,7 +252,7 @@ Note:
* Download the examples via clone the source code.
Here, {nni-version} should by replaced by the version of NNI, e.g., ``master``, ``v2.0``. You can also check the latest ``bash-completion`` script :githublink:`here <tools/bash-completion>`.
Here, {nni-version} should by replaced by the version of NNI, e.g., ``master``, ``v2.1``. You can also check the latest ``bash-completion`` script :githublink:`here <tools/bash-completion>`.
This document includes two parts. The first part explains the design decision of ``@basic_unit`` and ``serializer``. The second part is the tutorial of how to write a model space with mutators.
``@basic_unit`` and ``serializer``
----------------------------------
.. _serializer:
``@basic_unit`` and ``serialize`` can be viewed as some kind of serializer. They are designed for making the whole model (including training) serializable to be executed on another process or machine.
**@basic_unit** annotates that a module is a basic unit, i.e, no need to understand the details of this module. The effect is that it prevents Retiarii to parse this module. To understand this, we first briefly explain how Retiarii works: it converts user-defined model to a graph representation (called graph IR) using `TorchScript <https://pytorch.org/docs/stable/jit.html>`__, each instantiated module in the model is converted to a subgraph. Then mutations are applied to the graph to generate new graphs. Each new graph is then converted back to PyTorch code and executed. ``@basic_unit`` here means the module will not be converted to a subgraph, instead, it is converted to a single graph node as a basic unit. That is, the module will not be unfolded anymore. When the module is not unfolded, mutations on initialization parameters of this module becomes easier.
``@basic_unit`` is usually used in the following cases:
* When users want to tune initialization parameters of a module using ``ValueChoice``, then decorate the module with ``@basic_unit``. For example, ``self.conv = MyConv(kernel_size=nn.ValueChoice([1, 3, 5]))``, here ``MyConv`` should be decorated.
* When a module cannot be successfully parsed to a subgraph, decorate the module with ``@basic_unit``. The parse failure could be due to complex control flow. Currently Retiarii does not support adhoc loop, if there is adhoc loop in a module's forward, this class should be decorated as serializable module. For example, the following ``MyModule`` should be decorated.
.. code-block:: python
@basic_unit
class MyModule(nn.Module):
def __init__(self):
...
def forward(self, x):
for i in range(10): # <- adhoc loop
...
* Some inline mutation APIs require their handled module to be decorated with ``@basic_unit``. For example, user-defined module that is provided to ``LayerChoice`` as a candidate op should be decorated.
**serialize** is mainly used for serializing model training logic. It enables re-instantiation of model evaluator in another process or machine. Re-instantiation is necessary because most of time model and evaluator should be sent to training services. ``serialize`` is implemented by recording the initialization parameters of user instantiated evaluator.
The evaluator related APIs provided by Retiarii have already supported serialization, for example ``pl.Classification``, ``pl.DataLoader``, no need to apply ``serialize`` on them. In the following case users should use ``serialize`` API manually.
If the initialization parameters of the evaluator APIs (e.g., ``pl.Classification``, ``pl.DataLoader``) are not primitive types (e.g., ``int``, ``string``), they should be applied with ``serialize``. If those parameters' initialization parameters are not primitive types, ``serialize`` should also be applied. In a word, ``serialize`` should be applied recursively if necessary.
Express Mutations with Mutators
-------------------------------
Besides inline mutations which have been demonstrated `here <./Tutorial.rst>`__, Retiarii provides a more general approach to express a model space: *Mutator*. Inline mutations APIs are also implemented with mutator, which can be seen as a special case of model mutation.
.. note:: Mutator and inline mutation APIs cannot be used together.
A mutator is a piece of logic to express how to mutate a given model. Users are free to write their own mutators. Then a model space is expressed with a base model and a list of mutators. A model in the model space is sampled by applying the mutators on the base model one after another. An example is shown below.
The input of ``mutate`` is graph IR (Intermediate Representation) of the base model (please refer to `here <./ApiReference.rst>`__ for the format and APIs of the IR), users can mutate the graph using the graph's member functions (e.g., ``get_nodes_by_label``, ``update_operation``). The mutation operations can be combined with the API ``self.choice``, in order to express a set of possible mutations. In the above example, the node's operation can be changed to any operation from ``candidate_op_list``.
Use placehoder to make mutation easier: ``nn.Placeholder``. If you want to mutate a subgraph or node of your model, you can define a placeholder in this model to represent the subgraph or node. Then, use mutator to mutate this placeholder to make it real modules.
.. code-block:: python
ph = nn.Placeholder(
label='mutable_0',
kernel_size_options=[1, 3, 5],
n_layer_options=[1, 2, 3, 4],
exp_ratio=exp_ratio,
stride=stride
)
``label`` is used by mutator to identify this placeholder. The other parameters are the information that are required by mutator. They can be accessed from ``node.operation.parameters`` as a dict, it could include any information that users want to put to pass it to user defined mutator. The complete example code can be found in :githublink:`Mnasnet base model <test/retiarii_test/mnasnet/base_mnasnet.py>`.
Starting an experiment is almost the same as using inline mutation APIs. The only difference is that the applied mutators should be passed to ``RetiariiExperiment``. Below is a simple example.
Before reading this tutorial, we highly recommend you to first go through the tutorial of how to `define a model space <./Tutorial.rst#define-your-model-space>`__.
Model Search with One-shot Trainer
----------------------------------
With a defined model space, users can explore the space in two ways. One is using strategy and single-arch evaluator as demonstrated `here <./Tutorial.rst#explore-the-defined-model-space>`__. The other is using one-shot trainer, which consumes much less computational resource compared to the first one. In this tutorial we focus on this one-shot approach. The principle of one-shot approach is combining all the models in a model space into one big model (usually called super-model or super-graph). It takes charge of both search, training and testing, by training and evaluating this big model.
We list the supported one-shot trainers here:
* DARTS trainer
* ENAS trainer
* ProxylessNAS trainer
* Single-path (random) trainer
See `API reference <./ApiReference.rst>`__ for detailed usages. Here, we show an example to use DARTS trainer manually.
.. code-block:: python
from nni.retiarii.oneshot.pytorch import DartsTrainer
One-shot experiment can be visualized with NAS UI, please refer to `here <../Visualization.rst>`__ for the usage guidance. Note that NAS visualization is under intensive development.
Customize a New One-shot Trainer
--------------------------------
One-shot trainers should inherit ``nni.retiarii.oneshot.BaseOneShotTrainer``, and need to implement ``fit()`` (used to conduct the fitting and searching process) and ``export()`` method (used to return the searched best architecture).
Writing a one-shot trainer is very different to single-arch evaluator. First of all, there are no more restrictions on init method arguments, any Python arguments are acceptable. Secondly, the model fed into one-shot trainers might be a model with Retiarii-specific modules, such as LayerChoice and InputChoice. Such model cannot directly forward-propagate and trainers need to decide how to handle those modules.
A typical example is DartsTrainer, where learnable-parameters are used to combine multiple choices in LayerChoice. Retiarii provides ease-to-use utility functions for module-replace purposes, namely ``replace_layer_choice``, ``replace_input_choice``. A simplified example is as follows:
.. code-block:: python
from nni.retiarii.oneshot import BaseOneShotTrainer
from nni.retiarii.oneshot.pytorch import replace_layer_choice, replace_input_choice
for (trn_X, trn_y), (val_X, val_y) in zip(self.train_loader, self.valid_loader):
self.train_architecture(val_X, val_y)
self.train_model_weight(trn_X, trn_y)
@torch.no_grad()
def export(self):
result = dict()
for name, module in self.nas_modules:
if name not in result:
result[name] = select_best_of_module(module)
return result
The full code of DartsTrainer is available to Retiarii source code. Please have a check at :githublink:`DartsTrainer <nni/retiarii/oneshot/pytorch/darts.py>`.
The simplest way to customize a new evaluator is with functional APIs, which is very easy when training code is already available. Users only need to write a fit function that wraps everything. This function takes one positional arguments (model) and possible keyword arguments. In this way, users get everything under their control, but exposes less information to the framework and thus fewer opportunities for possible optimization. An example is as belows:
It's recommended to write training code in PyTorch-Lightning style, that is, to write a LightningModule that defines all elements needed for training (e.g., loss function, optimizer) and to define a trainer that takes (optional) dataloaders to execute the training. Before that, please read the `document of PyTorch-lightning <https://pytorch-lightning.readthedocs.io/>` to learn the basic concepts and components provided by PyTorch-lightning.
It's recommended to write training code in PyTorch-Lightning style, that is, to write a LightningModule that defines all elements needed for training (e.g., loss function, optimizer) and to define a trainer that takes (optional) dataloaders to execute the training. Before that, please read the `document of PyTorch-lightning <https://pytorch-lightning.readthedocs.io/>`__ to learn the basic concepts and components provided by PyTorch-lightning.
In pratice, writing a new training module in NNI should inherit ``nni.retiarii.evaluator.pytorch.lightning.LightningModule``, which has a ``set_model`` that will be called after ``__init__`` to save the candidate model (generated by strategy) as ``self.model``. The rest of the process (like ``training_step``) should be the same as writing any other lightning module. Evaluators should also communicate with strategies via two API calls (``nni.report_intermediate_result`` for periodical metrics and ``nni.report_final_result`` for final metrics), added in ``on_validation_epoch_end`` and ``teardown`` respectively.
In practice, writing a new training module in Retiarii should inherit ``nni.retiarii.evaluator.pytorch.lightning.LightningModule``, which has a ``set_model`` that will be called after ``__init__`` to save the candidate model (generated by strategy) as ``self.model``. The rest of the process (like ``training_step``) should be the same as writing any other lightning module. Evaluators should also communicate with strategies via two API calls (``nni.report_intermediate_result`` for periodical metrics and ``nni.report_final_result`` for final metrics), added in ``on_validation_epoch_end`` and ``teardown`` respectively.
An example is as follows:
...
...
@@ -97,61 +94,3 @@ Then, users need to wrap everything (including LightningModule, trainer and data
One-shot trainers should inheirt ``nni.retiarii.oneshot.BaseOneShotTrainer``, and need to implement ``fit()`` (used to conduct the fitting and searching process) and ``export()`` method (used to return the searched best architecture).
Writing a one-shot trainer is very different to classic evaluators. First of all, there are no more restrictions on init method arguments, any Python arguments are acceptable. Secondly, the model feeded into one-shot trainers might be a model with Retiarii-specific modules, such as LayerChoice and InputChoice. Such model cannot directly forward-propagate and trainers need to decide how to handle those modules.
A typical example is DartsTrainer, where learnable-parameters are used to combine multiple choices in LayerChoice. Retiarii provides ease-to-use utility functions for module-replace purposes, namely ``replace_layer_choice``, ``replace_input_choice``. A simplified example is as follows:
.. code-block:: python
from nni.retiarii.oneshot import BaseOneShotTrainer
from nni.retiarii.oneshot.pytorch import replace_layer_choice, replace_input_choice
To monitor your job's code, you need to visit your studio which you create at step 5. Once the job completes, go to the Outputs + logs tab. There you can see a 70_driver_log.txt file, This file contains the standard output from a run and can be useful when you're debugging remote runs in the cloud. Learn more about aml from `here <https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-1st-experiment-hello-world>`__.
* trainingServicePlatforms. required key. This field specify the platforms used in hybrid mode, the values using yaml list format. NNI support setting ``local``, ``remote``, ``aml``, ``pai`` in this field.
.. Note:: If setting a platform in trainingServicePlatforms mode, users should also set the corresponding configuration for the platform. For example, if set ``remote`` as one of the platform, should also set ``machineList`` and ``remoteConfig`` configuration. Local platform in hybrid mode does not support windows for now.
.. Note:: If setting a platform in trainingServicePlatforms mode, users should also set the corresponding configuration for the platform. For example, if set ``remote`` as one of the platform, should also set ``machineList`` and ``remoteConfig`` configuration.
@@ -71,4 +71,4 @@ Our documentation is built with :githublink:`sphinx <docs>`.
* It's an image link which needs to be formatted with embedded html grammar, please use global URL like ``https://user-images.githubusercontent.com/44491713/51381727-e3d0f780-1b4f-11e9-96ab-d26b9198ba65.png``, which can be automatically generated by dragging picture onto `Github Issue <https://github.com/Microsoft/nni/issues/new>`__ Box.
* It cannot be re-formatted by sphinx, such as source code, please use its global URL. For source code that links to our github repo, please use URLs rooted at ``https://github.com/Microsoft/nni/tree/v2.0/`` (:githublink:`mnist.py <examples/trials/mnist-pytorch/mnist.py>` for example).
* It cannot be re-formatted by sphinx, such as source code, please use its global URL. For source code that links to our github repo, please use URLs rooted at ``https://github.com/Microsoft/nni/tree/v2.1/`` (:githublink:`mnist.py <examples/trials/mnist-pytorch/mnist.py>` for example).