Single Path One-Shot (SPOS)
===========================
Introduction
------------
Single Path One-Shot (SPOS), proposed in `Single Path One-Shot Neural Architecture Search with Uniform Sampling <https://arxiv.org/abs/1904.00420>`__\ , is a one-shot NAS method that addresses the difficulty of training one-shot NAS models by constructing a simplified supernet trained with a uniform path sampling method, so that all underlying architectures (and their weights) get trained fully and equally. An evolutionary algorithm is then applied to efficiently search for the best-performing architectures without any fine-tuning.
The implementation on NNI is based on the `official repo <https://github.com/megvii-model/SinglePathOneShot>`__. We implement a trainer that trains the supernet, and an evolution tuner that leverages the NNI framework to speed up the evolutionary search phase.
Examples
--------
Here is a use case that reproduces the search space in the paper and shows how to use a FLOPs limit to perform uniform sampling.
:githublink:`Example code <examples/nas/spos>`
Requirements
^^^^^^^^^^^^
NVIDIA DALI >= 0.16 is needed as we use DALI to accelerate the data loading of ImageNet. `Installation guide <https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/installation.html>`__
Download the flops lookup table from `here <https://1drv.ms/u/s!Am_mmG2-KsrnajesvSdfsq_cN48?e=aHVppN>`__ (maintained by `Megvii <https://github.com/megvii-model>`__\ ).
Put ``op_flops_dict.pkl`` and ``checkpoint-150000.pth.tar`` (if you don't want to retrain the supernet) under ``data`` directory.
Prepare ImageNet in the standard format (follow the script `here <https://gist.github.com/BIGBALLON/8a71d225eff18d88e469e6ea9b39cef4>`__\ ). Linking it to ``data/imagenet`` will be more convenient.
After preparation, it's expected to have the following code structure:
.. code-block:: bash
spos
├── architecture_final.json
├── blocks.py
├── config_search.yml
├── data
│   ├── imagenet
│   │   ├── train
│   │   └── val
│   └── op_flops_dict.pkl
├── dataloader.py
├── network.py
├── readme.md
├── scratch.py
├── supernet.py
├── tester.py
├── tuner.py
└── utils.py
Step 1. Train Supernet
^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash
python supernet.py
This will export the checkpoints to the ``checkpoints`` directory for the next step.
NOTE: The data loading used in the official repo is `slightly different from usual <https://github.com/megvii-model/SinglePathOneShot/issues/5>`__\ , as they use BGR tensors and intentionally keep the values between 0 and 255 to align with their own DL framework. The option ``--spos-preprocessing`` will simulate the original behavior and enable you to use the pretrained checkpoints.
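For intuition, the preprocessing described in the note amounts to something like the following sketch (our own illustration, not the actual implementation behind ``--spos-preprocessing``):

.. code-block:: python

   import torch

   def spos_like_preprocessing(batch):
       # Sketch only: assumes `batch` is an NCHW RGB tensor scaled to [0, 1].
       # Flip the channel dimension (RGB -> BGR) and restore raw 0-255 values.
       return batch.flip(1) * 255.0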
Step 2. Evolution Search
^^^^^^^^^^^^^^^^^^^^^^^^
Single Path One-Shot leverages an evolutionary algorithm to search for the best architecture. The tester, which is responsible for testing the sampled architecture, recalculates all the batch norm statistics on a subset of training images, and evaluates the architecture on the full validation set.
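A rough sketch of this batch norm recalculation, assuming a plain PyTorch model and data loader (the function below is our own illustration, not NNI's API):

.. code-block:: python

   import torch

   def recalibrate_batch_norm(model, train_loader, num_batches=200):
       # Reset running statistics and re-estimate them with a cumulative average.
       for module in model.modules():
           if isinstance(module, torch.nn.BatchNorm2d):
               module.reset_running_stats()
               module.momentum = None  # None means cumulative moving average
       model.train()
       with torch.no_grad():
           for step, (images, _) in enumerate(train_loader):
               if step >= num_batches:
                   break
               model(images)  # forward passes only, to update BN statistics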
In order to make the tuner aware of the flops limit and have the ability to calculate the flops, we created a new tuner called ``EvolutionWithFlops`` in ``tuner.py``\ , inheriting the tuner in SDK.
To have a search space ready for the NNI framework, first run
.. code-block:: bash
nnictl ss_gen -t "python tester.py"
This will generate a file called ``nni_auto_gen_search_space.json``\ , which is a serialized representation of your search space.
By default, it will use ``checkpoint-150000.pth.tar`` downloaded previously. In case you want to use the checkpoint trained by yourself from the last step, specify ``--checkpoint`` in the command in ``config_search.yml``.
Then search with evolution tuner.
.. code-block:: bash
nnictl create --config config_search.yml
The final architecture exported from every epoch of evolution can be found in ``checkpoints`` under the working directory of your tuner, which, by default, is ``$HOME/nni-experiments/your_experiment_id/log``.
Step 3. Train from Scratch
^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash
python scratch.py
By default, it will use ``architecture_final.json``. This architecture is provided by the official repo (converted into NNI format). You can use any architecture (e.g., the architecture found in step 2) with the ``--fixed-arc`` option.
Reference
---------
PyTorch
^^^^^^^
.. autoclass:: nni.algorithms.nas.pytorch.spos.SPOSEvolution
:members:
.. autoclass:: nni.algorithms.nas.pytorch.spos.SPOSSupernetTrainer
:members:
.. autoclass:: nni.algorithms.nas.pytorch.spos.SPOSSupernetTrainingMutator
:members:
Known Limitations
-----------------
* Block search only. Channel search is not supported yet.
* Only GPU version is provided here.
Current Reproduction Results
----------------------------
Reproduction is still in progress. Due to the gap between the official release and the original paper, we compare our current results with the official repo (our own run) and the paper.
* The evolution phase is almost aligned with the official repo. Our evolution algorithm shows a converging trend and reaches ~65% accuracy at the end of the search. Nevertheless, this result is not on par with the paper. For details, please refer to `this issue <https://github.com/megvii-model/SinglePathOneShot/issues/6>`__.
* The retrain phase is not aligned. Our retraining code, which uses the architecture released by the authors, reaches 72.14% accuracy, still leaving a gap to the 73.61% achieved by the official release and the 74.3% reported in the original paper.
.. role:: raw-html(raw)
:format: html
Search Space Zoo
================
DartsCell
---------
DartsCell is extracted from the :githublink:`CNN model <examples/nas/darts>`. A DartsCell is a directed acyclic graph containing an ordered sequence of N nodes, where each node stands for a latent representation (e.g., a feature map in a convolutional network). A directed edge from Node 1 to Node 2 is associated with an operation that transforms Node 1, and the result is stored on Node 2. The `candidate operators <#predefined-operations-darts>`__ between nodes are predefined and unchangeable. Each edge represents an operation chosen from the predefined ones to be applied to the starting node of the edge. One cell contains two input nodes, a single output node, and ``n_node`` other nodes. The input nodes are defined as the cell outputs of the previous two layers. The output of the cell is obtained by applying a reduction operation (e.g., concatenation) to all the intermediate nodes. To make the search space continuous, the categorical choice of a particular operation is relaxed to a softmax over all possible operations. By adjusting the softmax weights on every node, the operation with the highest probability is chosen to be part of the final structure. A CNN model can be formed by stacking several cells together, which builds a search space. Note that in the DARTS paper, all cells in the model share the same structure.
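The continuous relaxation can be pictured with a short sketch (our own illustration, not NNI's implementation): every edge keeps one learnable weight per candidate operation and outputs the softmax-weighted sum of all candidates.

.. code-block:: python

   import torch
   import torch.nn as nn
   import torch.nn.functional as F

   class MixedOp(nn.Module):
       """Illustrative mixed operation: softmax-weighted sum of candidates."""
       def __init__(self, candidate_ops):
           super().__init__()
           self.ops = nn.ModuleList(candidate_ops)
           self.alpha = nn.Parameter(torch.zeros(len(candidate_ops)))  # architecture weights

       def forward(self, x):
           weights = F.softmax(self.alpha, dim=0)
           return sum(w * op(x) for w, op in zip(weights, self.ops))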
One structure in the Darts search space is shown below. Note that, NNI merges the last one of the four intermediate nodes and the output node.
.. image:: ../../img/NAS_Darts_cell.svg
:target: ../../img/NAS_Darts_cell.svg
:alt:
The predefined operators are shown `here <#predefined-operations-darts>`__.
.. autoclass:: nni.nas.pytorch.search_space_zoo.DartsCell
:members:
Example code
^^^^^^^^^^^^
:githublink:`example code <examples/nas/search_space_zoo/darts_example.py>`
.. code-block:: bash
git clone https://github.com/Microsoft/nni.git
cd nni/examples/nas/search_space_zoo
# search the best structure
python3 darts_example.py
:raw-html:`<a name="predefined-operations-darts"></a>`
Candidate operators
^^^^^^^^^^^^^^^^^^^
All supported operators for Darts are listed below.
* MaxPool / AvgPool

  * MaxPool: calls ``torch.nn.MaxPool2d``. This operation applies a 2D max pooling over all input channels. Its parameters ``kernel_size=3`` and ``padding=1`` are fixed. The pooling result passes through a BatchNorm2d before being returned.
  * AvgPool: calls ``torch.nn.AvgPool2d``. This operation applies a 2D average pooling over all input channels. Its parameters ``kernel_size=3`` and ``padding=1`` are fixed. The pooling result passes through a BatchNorm2d before being returned.

  In short: MaxPool / AvgPool with ``kernel_size=3`` and ``padding=1``, followed by BatchNorm2d.
.. autoclass:: nni.nas.pytorch.search_space_zoo.darts_ops.PoolBN
* SkipConnect

  There is no operation between two nodes. ``torch.nn.Identity`` is called to forward the input to the output.

* Zero operation

  There is no connection between two nodes.
* DilConv3x3 / DilConv5x5

  :raw-html:`<a name="DilConv"></a>`\ DilConv3x3: (dilated) depthwise separable conv. It is a 3x3 depthwise convolution with ``C_in`` groups, followed by a 1x1 pointwise convolution. It reduces the number of parameters. Input is first passed through ReLU, then DilConv, and finally BatchNorm2d. **Note that the operation is not a dilated convolution, but we follow the convention in NAS papers to name it DilConv.** The 3x3 DilConv has parameters ``kernel_size=3``\ , ``padding=1`` and the 5x5 DilConv has parameters ``kernel_size=5``\ , ``padding=4``.
.. autoclass:: nni.nas.pytorch.search_space_zoo.darts_ops.DilConv
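For reference, the sequence described above can be sketched as follows (a simplified illustration, not the class's actual source):

.. code-block:: python

   import torch.nn as nn

   class DilConvSketch(nn.Module):
       """ReLU -> depthwise conv with C_in groups -> 1x1 pointwise conv -> BatchNorm2d."""
       def __init__(self, c_in, c_out, kernel_size=3, stride=1, padding=1):
           super().__init__()
           self.op = nn.Sequential(
               nn.ReLU(),
               nn.Conv2d(c_in, c_in, kernel_size, stride=stride,
                         padding=padding, groups=c_in, bias=False),  # depthwise
               nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),    # pointwise
               nn.BatchNorm2d(c_out),
           )

       def forward(self, x):
           return self.op(x)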
* SepConv3x3 / SepConv5x5

  Composed of two DilConvs with fixed ``kernel_size=3``\ , ``padding=1`` or ``kernel_size=5``\ , ``padding=2``\ , applied sequentially.
.. autoclass:: nni.nas.pytorch.search_space_zoo.darts_ops.SepConv
ENASMicroLayer
--------------
This layer is extracted from the model designed :githublink:`here <examples/nas/enas>`. A model contains several blocks that share the same architecture. A block is made up of some normal layers and reduction layers; ``ENASMicroLayer`` is a unified implementation of the two types of layers. The only difference between the two is that reduction layers apply all operations with ``stride=2``.
ENAS Micro employs a DAG with N nodes in one cell, where the nodes represent local computations and the edges represent the flow of information between the N nodes. One cell contains two input nodes and a single output node. Each of the following nodes chooses two previous nodes as input and applies two operations from the `predefined ones <#predefined-operations-enas>`__\ , then sums the results as the output of this node. For example, Node 4 chooses Node 1 and Node 3 as inputs, applies ``MaxPool`` and ``AvgPool`` on them respectively, then sums the results as its output. Nodes that do not serve as input for any other node are viewed as outputs of the layer. If there are multiple output nodes, the model calculates the average of these nodes as the layer output.
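The per-node computation can be summarized with a short sketch (hypothetical naming, for illustration only):

.. code-block:: python

   import torch.nn as nn

   class MicroNode(nn.Module):
       """One ENAS micro node: one op per chosen input, results summed."""
       def __init__(self, op_a, op_b):
           super().__init__()
           self.op_a, self.op_b = op_a, op_b

       def forward(self, prev_outputs, idx_a, idx_b):
           # `prev_outputs` holds the outputs of all previous nodes in the cell.
           return self.op_a(prev_outputs[idx_a]) + self.op_b(prev_outputs[idx_b])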
The ENAS micro search space is shown below.
.. image:: ../../img/NAS_ENAS_micro.svg
:target: ../../img/NAS_ENAS_micro.svg
:alt:
The predefined operators can be seen `here <#predefined-operations-enas>`__.
.. autoclass:: nni.nas.pytorch.search_space_zoo.ENASMicroLayer
:members:
The reduction layer is made up of two convolution operations followed by BatchNorm. Each convolution outputs ``C_out//2`` channels, and the two outputs are concatenated along the channel dimension as the layer output. Each convolution has ``kernel_size=1`` and ``stride=2``\ , and the two branches sample the input at alternating pixels so that the resolution is reduced without loss of information. This layer is wrapped in ``ENASMicroLayer``.
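Assuming this follows the common factorized-reduce pattern, the layer can be sketched as below (class name is ours):

.. code-block:: python

   import torch
   import torch.nn as nn

   class FactorizedReduceSketch(nn.Module):
       """Two 1x1 stride-2 convolutions sample interleaved pixels; their
       outputs are concatenated along channels and batch-normalized."""
       def __init__(self, c_in, c_out):
           super().__init__()
           self.conv1 = nn.Conv2d(c_in, c_out // 2, kernel_size=1, stride=2, bias=False)
           self.conv2 = nn.Conv2d(c_in, c_out // 2, kernel_size=1, stride=2, bias=False)
           self.bn = nn.BatchNorm2d(c_out)

       def forward(self, x):
           # The second branch is shifted by one pixel, so the two branches
           # together cover the input without discarding information.
           out = torch.cat([self.conv1(x), self.conv2(x[:, :, 1:, 1:])], dim=1)
           return self.bn(out)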
Example code
^^^^^^^^^^^^
:githublink:`example code <examples/nas/search_space_zoo/enas_micro_example.py>`
.. code-block:: bash
git clone https://github.com/Microsoft/nni.git
cd nni/examples/nas/search_space_zoo
# search the best cell structure
python3 enas_micro_example.py
:raw-html:`<a name="predefined-operations-enas"></a>`
Candidate operators
^^^^^^^^^^^^^^^^^^^
All supported operators for ENAS micro search are listed below.
* MaxPool / AvgPool

  * MaxPool: calls ``torch.nn.MaxPool2d``. This operation applies a 2D max pooling over all input channels, followed by BatchNorm2d. Its parameters are fixed to ``kernel_size=3``\ , ``stride=1`` and ``padding=1``.
  * AvgPool: calls ``torch.nn.AvgPool2d``. This operation applies a 2D average pooling over all input channels, followed by BatchNorm2d. Its parameters are fixed to ``kernel_size=3``\ , ``stride=1`` and ``padding=1``.
.. autoclass:: nni.nas.pytorch.search_space_zoo.enas_ops.Pool
* SepConv

  * SepConvBN3x3: ReLU followed by a `DilConv <#DilConv>`__ and BatchNorm. Convolution parameters are ``kernel_size=3``\ , ``stride=1`` and ``padding=1``.
  * SepConvBN5x5: performs the same operation as the previous one, but with different kernel size and padding, which are set to 5 and 2 respectively.
.. autoclass:: nni.nas.pytorch.search_space_zoo.enas_ops.SepConvBN
* SkipConnect

  Calls ``torch.nn.Identity`` to connect directly to the next cell.
ENASMacroLayer
--------------
In macro search, the controller makes two decisions for each layer: i) the `operation <#macro-operations>`__ to perform on the result of the previous layer, and ii) which previous layer to connect to for skip connections. ENAS uses a controller to design the whole model architecture instead of one of its components. The output of the operation is concatenated with the tensor of the layer chosen for the skip connection. NNI provides `predefined operators <#macro-operations>`__ for macro search, which are listed in `Candidate operators <#macro-operations>`__.
Part of one structure in the ENAS macro search space is shown below.
.. image:: ../../img/NAS_ENAS_macro.svg
:target: ../../img/NAS_ENAS_macro.svg
:alt:
.. autoclass:: nni.nas.pytorch.search_space_zoo.ENASMacroLayer
:members:
To describe the whole search space, NNI provides a model, which is built by stacking the layers.
.. autoclass:: nni.nas.pytorch.search_space_zoo.ENASMacroGeneralModel
:members:
Example code
^^^^^^^^^^^^
:githublink:`example code <examples/nas/search_space_zoo/enas_macro_example.py>`
.. code-block:: bash
git clone https://github.com/Microsoft/nni.git
cd nni/examples/nas/search_space_zoo
# search the best cell structure
python3 enas_macro_example.py
:raw-html:`<a name="macro-operations"></a>`
Candidate operators
^^^^^^^^^^^^^^^^^^^
All supported operators for ENAS macro search are listed below.
* ConvBranch

  All input first passes through a StdConv, which is made up of a 1x1 convolution followed by BatchNorm2d and ReLU. Then the intermediate result goes through one of the operations listed below. The final result is computed by a BatchNorm2d and ReLU as post-processing.

  * Separable Conv3x3: if ``separable=True``\ , the cell will use a `SepConv <#DilConv>`__ instead of a normal convolution. The SepConv's parameters are ``kernel_size=3``\ , ``stride=1`` and ``padding=1``.
  * Separable Conv5x5: the SepConv's parameters are ``kernel_size=5``\ , ``stride=1`` and ``padding=2``.
  * Normal Conv3x3: if ``separable=False``\ , the cell will use a normal convolution with ``kernel_size=3``\ , ``stride=1`` and ``padding=1``.
  * Normal Conv5x5: the convolution's parameters are ``kernel_size=5``\ , ``stride=1`` and ``padding=2``.
.. autoclass:: nni.nas.pytorch.search_space_zoo.enas_ops.ConvBranch
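The StdConv mentioned above can be sketched as follows (illustrative only, not the exact source):

.. code-block:: python

   import torch.nn as nn

   class StdConvSketch(nn.Module):
       """1x1 convolution followed by BatchNorm2d and ReLU."""
       def __init__(self, c_in, c_out):
           super().__init__()
           self.op = nn.Sequential(
               nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),
               nn.BatchNorm2d(c_out),
               nn.ReLU(),
           )

       def forward(self, x):
           return self.op(x)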
* PoolBranch

  All input first passes through a StdConv, which is made up of a 1x1 convolution followed by BatchNorm2d and ReLU. Then the intermediate result goes through a pooling operation followed by BatchNorm.

  * AvgPool: calls ``torch.nn.AvgPool2d``. This operation applies a 2D average pooling over all input channels. Its parameters are fixed to ``kernel_size=3``\ , ``stride=1`` and ``padding=1``.
  * MaxPool: calls ``torch.nn.MaxPool2d``. This operation applies a 2D max pooling over all input channels. Its parameters are fixed to ``kernel_size=3``\ , ``stride=1`` and ``padding=1``.
.. autoclass:: nni.nas.pytorch.search_space_zoo.enas_ops.PoolBranch
NAS-Bench-201
-------------
NAS Bench 201 defines a unified search space, which is algorithm agnostic. The predefined skeleton consists of a stack of cells that share the same architecture. Every cell contains four nodes and a DAG is formed by connecting edges among them, where the node represents the sum of feature maps and the edge stands for an operation transforming a tensor from the source node to the target node. The predefined candidate operators can be found in `Candidate operators <#nas-bench-201-reference>`__.
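The wiring can be illustrated with a toy sketch (our own code; the real cell is the ``NASBench201Cell`` documented below): node *j* sums the transformed outputs of all its predecessors.

.. code-block:: python

   import torch.nn as nn

   class TinyCell(nn.Module):
       """Toy NAS-Bench-201-style cell: 4 nodes, one op per directed edge."""
       def __init__(self, edge_ops):
           # `edge_ops` maps "i->j" (0 <= i < j < 4) to an nn.Module.
           super().__init__()
           self.edge_ops = nn.ModuleDict(edge_ops)

       def forward(self, x):
           nodes = [x]
           for j in range(1, 4):
               nodes.append(sum(self.edge_ops[f"{i}->{j}"](nodes[i]) for i in range(j)))
           return nodes[-1]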
The search space of NAS Bench 201 is shown below.
.. image:: ../../img/NAS_Bench_201.svg
:target: ../../img/NAS_Bench_201.svg
:alt:
.. autoclass:: nni.nas.pytorch.nasbench201.NASBench201Cell
:members:
Example code
^^^^^^^^^^^^
:githublink:`example code <examples/nas/search_space_zoo/nas_bench_201.py>`
.. code-block:: bash
# for structure searching
git clone https://github.com/Microsoft/nni.git
cd nni/examples/nas/search_space_zoo
python3 nas_bench_201.py
:raw-html:`<a name="nas-bench-201-reference"></a>`
Candidate operators
^^^^^^^^^^^^^^^^^^^
All supported operators for NAS Bench 201 are listed below.
* AvgPool

  If the number of input channels is not equal to the number of output channels, the input first passes through a ``ReLUConvBN`` layer with ``kernel_size=1``\ , ``stride=1``\ , ``padding=0``\ , and ``dilation=1``.
  Calls ``torch.nn.AvgPool2d``. This operation applies a 2D average pooling over all input channels, followed by BatchNorm2d. Its parameters are fixed to ``kernel_size=3`` and ``padding=1``.
.. autoclass:: nni.nas.pytorch.nasbench201.nasbench201_ops.Pooling
:members:
* Conv

  * Conv1x1: consists of a sequence of ReLU, ``nn.Conv2d`` and BatchNorm. The convolution's parameters are fixed to ``kernel_size=1``\ , ``padding=0``\ , and ``dilation=1``.
  * Conv3x3: consists of a sequence of ReLU, ``nn.Conv2d`` and BatchNorm. The convolution's parameters are fixed to ``kernel_size=3``\ , ``padding=1``\ , and ``dilation=1``.
.. autoclass:: nni.nas.pytorch.nasbench201.nasbench201_ops.ReLUConvBN
:members:
* SkipConnect

  Calls ``torch.nn.Identity`` to connect directly to the next cell.

* Zeroize

  Generates zero tensors, indicating that there is no connection from the source node to the target node.
.. autoclass:: nni.nas.pytorch.nasbench201.nasbench201_ops.Zero
:members:
TextNAS
=======
Introduction
------------
This is the implementation of the TextNAS algorithm proposed in the paper `TextNAS: A Neural Architecture Search Space tailored for Text Representation <https://arxiv.org/pdf/1912.10729.pdf>`__. TextNAS is a neural architecture search algorithm tailored for text representation. More specifically, TextNAS is based on a novel search space consisting of operators widely adopted to solve various NLP tasks, and it also supports multi-path ensemble within a single network to balance the width and depth of the architecture.
The search space of TextNAS contains:

* 1-D convolutional operator with filter size 1, 3, 5, 7
* recurrent operator (bi-directional GRU)
* self-attention operator
* pooling operator (max/average)
Following the ENAS algorithm, TextNAS also utilizes parameter sharing to accelerate the search speed and adopts a reinforcement-learning controller for the architecture sampling and generation. Please refer to the paper for more details of TextNAS.
Preparation
-----------
Prepare the word vectors and the SST dataset, and organize them in the ``data`` directory as shown below:
.. code-block:: bash
textnas
├── data
│   ├── sst
│   │   └── trees
│   │       ├── dev.txt
│   │       ├── test.txt
│   │       └── train.txt
│   └── glove.840B.300d.txt
├── dataloader.py
├── model.py
├── ops.py
├── README.md
├── search.py
└── utils.py
The following links might be helpful for finding and downloading the corresponding datasets:
* `GloVe: Global Vectors for Word Representation <https://nlp.stanford.edu/projects/glove/>`__
* `glove.840B.300d.txt <http://nlp.stanford.edu/data/glove.840B.300d.zip>`__
* `Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank <https://nlp.stanford.edu/sentiment/>`__
* `trainDevTestTrees_PTB.zip <https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip>`__
Examples
--------
Search Space
^^^^^^^^^^^^
:githublink:`Example code <examples/nas/textnas>`
.. code-block:: bash
# In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
git clone https://github.com/Microsoft/nni.git
# search the best architecture
cd examples/nas/textnas
# view more options for search
python3 search.py -h
After each search epoch, 10 sampled architectures will be tested directly. Their performances are expected to be 40% - 42% after 10 epochs.
By default, 20 sampled architectures will be exported into the ``checkpoints`` directory for the next step.
Retrain
^^^^^^^
.. code-block:: bash
# In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
git clone https://github.com/Microsoft/nni.git
cd examples/nas/textnas
# default to retrain on sst-2
sh run_retrain.sh
Reference
---------
TextNAS directly uses EnasTrainer, please refer to `ENAS <./ENAS.rst>`__ for the trainer APIs.
NAS Visualization (Experimental)
================================
Built-in Trainers Support
-------------------------
Currently, only ENAS and DARTS support visualization. Examples of `ENAS <./ENAS.rst>`__ and `DARTS <./DARTS.rst>`__ have demonstrated how to enable visualization in your code, namely, adding this before ``trainer.train()``\ :
.. code-block:: python
trainer.enable_visualization()
This will create a directory ``logs/<current_time_stamp>`` in your working folder, in which you will find two files ``graph.json`` and ``log``.
You don't have to wait until your program finishes to launch NAS UI, but it's important that these two files have already been created. Launch NAS UI with
.. code-block:: bash
nnictl webui nas --logdir logs/<current_time_stamp> --port <port>
Visualize a Customized Trainer
------------------------------
If you are interested in how to customize a trainer, please read this `doc <./Advanced.rst#extend-the-ability-of-one-shot-trainers>`__.
You should do two modifications to an existing trainer to enable visualization:
#. Export your graph before training, with
.. code-block:: python
vis_graph = self.mutator.graph(inputs)
# `inputs` is a dummy input to your model. For example, torch.randn((1, 3, 32, 32)).cuda()
# If your model has multiple inputs, it should be a tuple.
with open("/path/to/your/logdir/graph.json", "w") as f:
    json.dump(vis_graph, f)
#. Logging the choices you've made. You can do it once per epoch, once per mini-batch or whatever frequency you'd like.
.. code-block:: python
def __init__(self):
    # ...
    self.status_writer = open("/path/to/your/logdir/log", "w")  # create a writer

def train(self):
    # ...
    print(json.dumps(self.mutator.status()), file=self.status_writer, flush=True)  # dump a record of status
If you are implementing a customized trainer that inherits ``Trainer``\ , we have provided ``enable_visualization()`` and ``_write_graph_status()`` for ease of use. All you need to do is call ``trainer.enable_visualization()`` before training starts, and ``trainer._write_graph_status()`` each time you want to log. Remember that both of these APIs are experimental and subject to change in the future.
Last but not least, invoke NAS UI with
.. code-block:: bash
nnictl webui nas --logdir /path/to/your/logdir
NAS UI Preview
--------------
.. image:: ../../img/nasui-1.png
:target: ../../img/nasui-1.png
:alt:
.. image:: ../../img/nasui-2.png
:target: ../../img/nasui-2.png
:alt:
Limitations
-----------
* NAS visualization only works with PyTorch >=1.4. We've tested it on PyTorch 1.3.1 and it doesn't work.
* We rely on PyTorch support for tensorboard for graph export, which relies on ``torch.jit``. It will not work if your model doesn't support ``jit``.
* There are known performance issues when loading a moderate-size graph with many op choices (like DARTS search space).
Feedback
--------
NAS UI is currently experimental. We welcome your feedback. `Here <https://github.com/microsoft/nni/pull/2085>`__ we have listed all the to-do items of NAS UI in the future. Feel free to comment (or `submit a new issue <https://github.com/microsoft/nni/issues/new?template=enhancement.rst>`__\ ) if you have other suggestions.
.. role:: raw-html(raw)
   :format: html

Write A Search Space
====================
Generally, a search space describes candidate architectures from which users want to find the best one. Different search algorithms, whether classic NAS or one-shot NAS, can be applied to the search space. NNI provides APIs to unify the expression of neural architecture search spaces.
A search space can be built on a base model. This is also a common practice when a user wants to apply NAS on an existing model. Take `MNIST on PyTorch <https://github.com/pytorch/examples/blob/master/mnist/main.py>`__ as an example. Note that NNI provides the same APIs for expressing search space on PyTorch and TensorFlow.
.. code-block:: python
from nni.nas.pytorch import mutables

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = mutables.LayerChoice([
            nn.Conv2d(1, 32, 3, 1),
            nn.Conv2d(1, 32, 5, 3)
        ])  # try 3x3 kernel and 5x5 kernel
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        # ... same as original ...
        return output
The example above adds an option of choosing conv5x5 at conv1. The modification is as simple as declaring a ``LayerChoice`` with the original conv3x3 and a new conv5x5 as its parameter. That's it! You don't have to modify the forward function in any way. You can imagine conv1 as any other module without NAS.
So how about the possibilities of connections? This can be done using ``InputChoice``. To allow for a skip connection on the MNIST example, we add another layer called conv3. In the following example, a possible connection from conv2 is added to the output of conv3.
.. code-block:: python
from nni.nas.pytorch import mutables

class Net(nn.Module):
    def __init__(self):
        # ... same ...
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.conv3 = nn.Conv2d(64, 64, 1, 1)
        # declaring that there is exactly one candidate to choose from
        # search strategy will choose one or None
        self.skipcon = mutables.InputChoice(n_candidates=1)
        # ... same ...

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x0 = self.skipcon([x])  # choose one or none from [x]
        x = self.conv3(x)
        if x0 is not None:  # skip connection is open
            x += x0
        x = F.max_pool2d(x, 2)
        # ... same ...
        return output
Input choice can be thought of as a callable module that receives a list of tensors and outputs the concatenation/sum/mean of some of them (sum by default), or ``None`` if none is selected. Like layer choices, input choices should be **initialized in ``__init__`` and called in ``forward``**. This is to allow search algorithms to identify these choices and do necessary preparations.
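For example, a choice over two candidate inputs with sum reduction might look like this (a hedged snippet built on ``InputChoice``'s ``n_chosen`` and ``reduction`` parameters):

.. code-block:: python

   # Declared in __init__: pick at most one of two candidate inputs.
   self.skip = mutables.InputChoice(n_candidates=2, n_chosen=1, reduction="sum")

   # Called in forward: returns a tensor, or None if nothing is selected.
   out = self.skip([tensor_a, tensor_b])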
``LayerChoice`` and ``InputChoice`` are both **mutables**. Mutable means "changeable". As opposed to traditional deep learning layers/modules which have fixed operation types once defined, models with mutable are essentially a series of possible models.
Users can specify a **key** for each mutable. By default, NNI will assign one for you that is globally unique, but in case users want to share choices (for example, there are two ``LayerChoice``\ s with the same candidate operations and you want them to have the same choice, i.e., if first one chooses the i-th op, the second one also chooses the i-th op), they can give them the same key. The key marks the identity for this choice and will be used in the dumped checkpoint. So if you want to increase the readability of your exported architecture, manually assigning keys to each mutable would be a good idea. For advanced usage on mutables (e.g., ``LayerChoice`` and ``InputChoice``\ ), see `Mutables <./NasReference.rst>`__.
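For example, two layer choices that must always agree can share a key (a minimal sketch):

.. code-block:: python

   # Both choices carry the key "shared_conv", so if the first picks the
   # i-th candidate, the second picks the i-th candidate as well.
   self.op1 = mutables.LayerChoice([nn.Conv2d(16, 16, 3, padding=1),
                                    nn.Conv2d(16, 16, 5, padding=2)], key="shared_conv")
   self.op2 = mutables.LayerChoice([nn.Conv2d(16, 16, 3, padding=1),
                                    nn.Conv2d(16, 16, 5, padding=2)], key="shared_conv")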
With the search space defined, the next step is searching for the best model in it. Please refer to `classic NAS algorithms <./ClassicNas.rst>`__ and `one-shot NAS algorithms <./NasGuide.rst>`__ for how to search from your defined search space.
Overview
========
NNI (Neural Network Intelligence) is a toolkit to help users design and tune machine learning models (e.g., hyperparameters), neural network architectures, or complex system's parameters, in an efficient and automatic way. NNI has several appealing properties: ease-of-use, scalability, flexibility, and efficiency.
* **Ease-of-use**\ : NNI can be easily installed through pip. Only a few lines need to be added to your code in order to use NNI's power. You can use both the command line tool and WebUI to work with your experiments.
* **Scalability**\ : Tuning hyperparameters or the neural architecture often demands a large number of computational resources, while NNI is designed to fully leverage different computation resources, such as remote machines and training platforms (e.g., OpenPAI, Kubernetes). Hundreds of trials can run in parallel, depending on the capacity of your configured training platforms.
* **Flexibility**\ : Besides rich built-in algorithms, NNI allows users to customize various hyperparameter tuning algorithms, neural architecture search algorithms, early stopping algorithms, etc. Users can also extend NNI with more training platforms, such as virtual machines or Kubernetes services in the cloud. Moreover, NNI can connect to external environments to tune special applications/models on them.
* **Efficiency**\ : We are intensively working on more efficient model tuning on both the system and algorithm level. For example, we leverage early feedback to speed up the tuning procedure.
The figure below shows the high-level architecture of NNI.
.. raw:: html
<p align="center">
<img src="https://user-images.githubusercontent.com/16907603/92089316-94147200-ee00-11ea-9944-bf3c4544257f.png" alt="drawing" width="700"/>
</p>
Key Concepts
------------
* *Experiment*\ : One task of, for example, finding out the best hyperparameters of a model, finding out the best neural network architecture, etc. It consists of trials and AutoML algorithms.
* *Search Space*\ : The feasible region for tuning the model. For example, the value range of each hyperparameter.
* *Configuration*\ : An instance from the search space, that is, each hyperparameter has a specific value.
* *Trial*\ : An individual attempt at applying a new configuration (e.g., a set of hyperparameter values, a specific neural architecture, etc.). Trial code should be able to run with the provided configuration.
* *Tuner*\ : An AutoML algorithm, which generates a new configuration for the next try. A new trial will run with this configuration.
* *Assessor*\ : Analyzes a trial's intermediate results (e.g., periodically evaluated accuracy on the test dataset) to tell whether this trial can be early stopped or not.
* *Training Platform*\ : Where trials are executed. Depending on your experiment's configuration, it could be your local machine, remote servers, or a large-scale training platform (e.g., OpenPAI, Kubernetes).
Basically, an experiment runs as follows: Tuner receives search space and generates configurations. These configurations will be submitted to training platforms, such as the local machine, remote machines, or training clusters. Their performances are reported back to Tuner. Then, new configurations are generated and submitted.
For each experiment, the user only needs to define a search space and update a few lines of code, and then leverage NNI's built-in Tuner/Assessor and training platforms to search for the best hyperparameters and/or neural architecture. There are basically three steps:
..
Step 1: `Define search space <Tutorial/SearchSpaceSpec.rst>`__
Step 2: `Update model codes <TrialExample/Trials.rst>`__
Step 3: `Define Experiment <Tutorial/ExperimentConfig.rst>`__
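In trial code, the update in step 2 usually boils down to two API calls, as in this minimal sketch (``train_and_evaluate`` is a placeholder for your own training logic):

.. code-block:: python

   import nni

   params = nni.get_next_parameter()      # receive a configuration from the tuner
   accuracy = train_and_evaluate(params)  # placeholder: your training code
   nni.report_final_result(accuracy)      # report the final metric back to NNI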
.. raw:: html
<p align="center">
<img src="https://user-images.githubusercontent.com/23273522/51816627-5d13db80-2302-11e9-8f3e-627e260203d5.jpg" alt="drawing"/>
</p>
For more details about how to run an experiment, please refer to `Get Started <Tutorial/QuickStart.rst>`__.
Core Features
-------------
NNI provides a key capability to run multiple instances in parallel to find the best combinations of parameters. This feature can be used in various domains, like finding the best hyperparameters for a deep learning model, or finding the best configuration for a database or other complex systems with real data.
NNI also provides algorithm toolkits for machine learning and deep learning, especially neural architecture search (NAS) algorithms, model compression algorithms, and feature engineering algorithms.
Hyperparameter Tuning
^^^^^^^^^^^^^^^^^^^^^
This is a core and basic feature of NNI. We provide many popular `automatic tuning algorithms <Tuner/BuiltinTuner.rst>`__ (i.e., tuners) and `early stop algorithms <Assessor/BuiltinAssessor.rst>`__ (i.e., assessors). You can follow the `Quick Start <Tutorial/QuickStart.rst>`__ to tune your model (or system): follow the above three steps and then start an NNI experiment.
General NAS Framework
^^^^^^^^^^^^^^^^^^^^^
This NAS framework is for users to easily specify candidate neural architectures, for example, one can specify multiple candidate operations (e.g., separable conv, dilated conv) for a single layer, and specify possible skip connections. NNI will find the best candidate automatically. On the other hand, the NAS framework provides a simple interface for another type of user (e.g., NAS algorithm researchers) to implement new NAS algorithms. A detailed description of NAS and its usage can be found `here <NAS/Overview.rst>`__.
NNI has support for many one-shot NAS algorithms such as ENAS and DARTS through NNI trial SDK. To use these algorithms you do not have to start an NNI experiment. Instead, import an algorithm in your trial code and simply run your trial code. If you want to tune the hyperparameters in the algorithms or want to run multiple instances, you can choose a tuner and start an NNI experiment.
Other than one-shot NAS, NAS can also run in a classic mode where each candidate architecture runs as an independent trial job. In this mode, similar to hyperparameter tuning, users have to start an NNI experiment and choose a tuner for NAS.
Model Compression
^^^^^^^^^^^^^^^^^
NNI provides an easy-to-use model compression framework to compress deep neural networks. The compressed networks typically have a much smaller model size and much faster inference speed without losing performance significantly. Model compression on NNI includes pruning algorithms and quantization algorithms, provided through the NNI trial SDK. Users can directly use them in their trial code and run the trial code without starting an NNI experiment. Users can also use the NNI model compression framework to customize their own pruning and quantization algorithms.
A detailed description of model compression and its usage can be found `here <Compression/Overview.rst>`__.
Automatic Feature Engineering
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Automatic feature engineering is for users to find the best features for their tasks. A detailed description of automatic feature engineering and its usage can be found `here <FeatureEngineering/Overview.rst>`__. It is supported through NNI trial SDK, which means you do not have to create an NNI experiment. Instead, simply import a built-in auto-feature-engineering algorithm in your trial code and directly run your trial code.
The auto-feature-engineering algorithms usually have a bunch of hyperparameters themselves. If you want to automatically tune those hyperparameters, you can leverage hyperparameter tuning of NNI, that is, choose a tuning algorithm (i.e., tuner) and start an NNI experiment for it.
Learn More
----------
* `Get started <Tutorial/QuickStart.rst>`__
* `How to adapt your trial code on NNI? <TrialExample/Trials.rst>`__
* `What are tuners supported by NNI? <Tuner/BuiltinTuner.rst>`__
* `How to customize your own tuner? <Tuner/CustomizeTuner.rst>`__
* `What are assessors supported by NNI? <Assessor/BuiltinAssessor.rst>`__
* `How to customize your own assessor? <Assessor/CustomizeAssessor.rst>`__
* `How to run an experiment on local? <TrainingService/LocalMode.rst>`__
* `How to run an experiment on multiple machines? <TrainingService/RemoteMachineMode.rst>`__
* `How to run an experiment on OpenPAI? <TrainingService/PaiMode.rst>`__
* `Examples <TrialExample/MnistExamples.rst>`__
* `Neural Architecture Search on NNI <NAS/Overview.rst>`__
* `Model Compression on NNI <Compression/Overview.rst>`__
* `Automatic feature engineering on NNI <FeatureEngineering/Overview.rst>`__
.. role:: raw-html(raw)
:format: html
ChangeLog
=========
Release 1.9 - 10/22/2020
========================
Major updates
-------------
Neural architecture search
^^^^^^^^^^^^^^^^^^^^^^^^^^
* Support regularized evolution algorithm for NAS scenario (#2802)
* Add NASBench201 in search space zoo (#2766)
Model compression
^^^^^^^^^^^^^^^^^
* AMC pruner improvement: support resnet, support reproduction of the experiments (default parameters in our example code) in AMC paper (#2876 #2906)
* Support constraint-aware on some of our pruners to improve model compression efficiency (#2657)
* Support "tf.keras.Sequential" in model compression for TensorFlow (#2887)
* Support customized op in the model flops counter (#2795)
* Support quantizing bias in QAT quantizer (#2914)
Training service
^^^^^^^^^^^^^^^^
* Support configuring python environment using "preCommand" in remote mode (#2875)
* Support AML training service in Windows (#2882)
* Support reuse mode for remote training service (#2923)
WebUI & nnictl
^^^^^^^^^^^^^^
* The "Overview" page on WebUI is redesigned with new layout (#2914)
* Upgraded node, yarn and FabricUI, and enabled Eslint (#2894 #2873 #2744)
* Add/Remove columns in hyper-parameter chart and trials table in "Trials detail" page (#2900)
* JSON format utility beautify on WebUI (#2863)
* Support nnictl command auto-completion (#2857)
UT & IT
-------
* Add integration test for experiment import and export (#2878)
* Add integration test for user installed builtin tuner (#2859)
* Add unit test for nnictl (#2912)
Documentation
-------------
* Refactor of the document for model compression (#2919)
Bug fixes
---------
* Bug fix of naïve evolution tuner, correctly deal with trial fails (#2695)
* Resolve the warning "WARNING (nni.protocol) IPC pipeline not exists, maybe you are importing tuner/assessor from trial code?" (#2864)
* Fix search space issue in experiment save/load (#2886)
* Fix bug in experiment import data (#2878)
* Fix annotation in remote mode (python 3.8 ast update issue) (#2881)
* Support boolean type for "choice" hyper-parameter when customizing trial configuration on WebUI (#3003)
Release 1.8 - 8/27/2020
=======================
Major updates
-------------
Training service
^^^^^^^^^^^^^^^^
* Access trial log directly on WebUI (local mode only) (#2718)
* Add OpenPAI trial job detail link (#2703)
* Support GPU scheduler in reusable environment (#2627) (#2769)
* Add timeout for ``web_channel`` in ``trial_runner`` (#2710)
* Show environment error message in AzureML mode (#2724)
* Add more log information when copying data in OpenPAI mode (#2702)
WebUI, nnictl and nnicli
^^^^^^^^^^^^^^^^^^^^^^^^
* Improve hyper-parameter parallel coordinates plot (#2691) (#2759)
* Add pagination for trial job list (#2738) (#2773)
* Enable panel close when clicking overlay region (#2734)
* Remove support for Multiphase on WebUI (#2760)
* Support save and restore experiments (#2750)
* Add intermediate results in export result (#2706)
* Add `command <https://github.com/microsoft/nni/blob/v1.8/docs/en_US/Tutorial/Nnictl.rst#nnictl-trial>`__ to list trial results with highest/lowest metrics (#2747)
* Improve the user experience of `nnicli <https://github.com/microsoft/nni/blob/v1.8/docs/en_US/nnicli_ref.rst>`__ with `examples <https://github.com/microsoft/nni/blob/v1.8/examples/notebooks/retrieve_nni_info_with_python.ipynb>`__ (#2713)
Neural architecture search
^^^^^^^^^^^^^^^^^^^^^^^^^^
* `Search space zoo: ENAS and DARTS <https://github.com/microsoft/nni/blob/v1.8/docs/en_US/NAS/SearchSpaceZoo.rst>`__ (#2589)
* API to query intermediate results in NAS benchmark (#2728)
Model compression
^^^^^^^^^^^^^^^^^
* Support the List/Tuple Construct/Unpack operation for TorchModuleGraph (#2609)
* Model speedup improvement: Add support of DenseNet and InceptionV3 (#2719)
* Support the multiple successive tuple unpack operations (#2768)
* `Doc of comparing the performance of supported pruners <https://github.com/microsoft/nni/blob/v1.8/docs/en_US/CommunitySharings/ModelCompressionComparison.rst>`__ (#2742)
* New pruners: `Sensitivity pruner <https://github.com/microsoft/nni/blob/v1.8/docs/en_US/Compressor/Pruner.md#sensitivity-pruner>`__ (#2684) and `AMC pruner <https://github.com/microsoft/nni/blob/v1.8/docs/en_US/Compressor/Pruner.rst>`__ (#2573) (#2786)
* TensorFlow v2 support in model compression (#2755)
Backward incompatible changes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Update the default experiment folder from ``$HOME/nni/experiments`` to ``$HOME/nni-experiments``. If you want to view the experiments created by previous NNI releases, you can move the experiments folders from ``$HOME/nni/experiments`` to ``$HOME/nni-experiments`` manually. (#2686) (#2753)
* Dropped support for Python 3.5 and scikit-learn 0.20 (#2778) (#2777) (#2783) (#2787) (#2788) (#2790)
Others
^^^^^^
* Upgrade TensorFlow version in Docker image (#2732) (#2735) (#2720)
Examples
--------
* Remove gpuNum in assessor examples (#2641)
Documentation
-------------
* Improve customized tuner documentation (#2628)
* Fix several typos and grammar mistakes in documentation (#2637 #2638, thanks @tomzx)
* Improve AzureML training service documentation (#2631)
* Improve CI of Chinese translation (#2654)
* Improve OpenPAI training service documenation (#2685)
* Improve documentation of community sharing (#2640)
* Add tutorial of Colab support (#2700)
* Improve documentation structure for model compression (#2676)
Bug fixes
---------
* Fix mkdir error in training service (#2673)
* Fix bug when using chmod in remote training service (#2689)
* Fix dependency issue by making ``_graph_utils`` imported inline (#2675)
* Fix mask issue in ``SimulatedAnnealingPruner`` (#2736)
* Fix intermediate graph zooming issue (#2738)
* Fix issue when dict is unordered when querying NAS benchmark (#2728)
* Fix import issue for gradient selector dataloader iterator (#2690)
* Fix support of adding tens of machines in remote training service (#2725)
* Fix several styling issues in WebUI (#2762 #2737)
* Fix support of unusual types in metrics including NaN and Infinity (#2782)
* Fix nnictl experiment delete (#2791)
Release 1.7 - 7/8/2020
======================
Major Features
--------------
Training Service
^^^^^^^^^^^^^^^^
* Support AML(Azure Machine Learning) platform as NNI training service.
* OpenPAI jobs can be reused. When a trial is completed, the OpenPAI job won't stop, but will wait for the next trial. `Refer to the reuse flag in the OpenPAI config <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/TrainingService/PaiMode.rst#openpai-configurations>`__.
* `Support ignoring files and folders in code directory with .nniignore when uploading code directory to training service <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/TrainingService/Overview.rst#how-to-use-training-service>`__.
Neural Architecture Search (NAS)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* `Provide NAS Open Benchmarks (NasBench101, NasBench201, NDS) with friendly APIs <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/NAS/Benchmarks.rst>`__.
* `Support Classic NAS (i.e., non-weight-sharing mode) on TensorFlow 2.X <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/NAS/ClassicNas.rst>`__.
Model Compression
^^^^^^^^^^^^^^^^^
* Improve Model Speedup: track more dependencies among layers and automatically resolve mask conflict, support the speedup of pruned resnet.
* Added new pruners, including three auto model pruning algorithms: `NetAdapt Pruner <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/Compressor/Pruner.md#netadapt-pruner>`__\ , `SimulatedAnnealing Pruner <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/Compressor/Pruner.md#simulatedannealing-pruner>`__\ , `AutoCompress Pruner <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/Compressor/Pruner.md#autocompress-pruner>`__\ , and `ADMM Pruner <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/Compressor/Pruner.rst#admm-pruner>`__.
* Added `model sensitivity analysis tool <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/Compressor/CompressionUtils.rst>`__ to help users find the sensitivity of each layer to the pruning.
* `Easy flops calculation for model compression and NAS <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/Compressor/CompressionUtils.rst#model-flops-parameters-counter>`__.
* Update lottery ticket pruner to export winning ticket.
Examples
^^^^^^^^
* Automatically optimize tensor operators on NNI with a new `customized tuner OpEvo <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/TrialExample/OpEvoExamples.rst>`__.
Built-in tuners/assessors/advisors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* `Allow customized tuners/assessor/advisors to be installed as built-in algorithms <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/Tutorial/InstallCustomizedAlgos.rst>`__.
WebUI
^^^^^
* Support visualizing nested search space more friendly.
* Show trial's dict keys in hyper-parameter graph.
* Enhancements to trial duration display.
Others
^^^^^^
* Provide utility function to merge parameters received from NNI
* Support setting paiStorageConfigName in pai mode
Documentation
-------------
* Improve `documentation for model compression <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/Compressor/Overview.rst>`__
* Improve `documentation <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/NAS/Benchmarks.rst>`__
and `examples <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/NAS/BenchmarksExample.ipynb>`__ for NAS benchmarks.
* Improve `documentation for AzureML training service <https://github.com/microsoft/nni/blob/v1.7/docs/en_US/TrainingService/AMLMode.rst>`__
* Homepage migration to readthedocs.
Bug Fixes
---------
* Fix bug for model graph with shared nn.Module
* Fix nodejs OOM when ``make build``
* Fix NASUI bugs
* Fix duration and intermediate results pictures update issue.
* Fix minor WebUI table style issues.
Release 1.6 - 5/26/2020
-----------------------
Major Features
^^^^^^^^^^^^^^
New Features and improvement
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Raise the IPC message size limit to 1,000,000 (100W)
* Improve code storage upload logic among trials in non-local platform
* Support ``__version__`` for SDK version
* Support Windows dev install
Web UI
^^^^^^
* Show trial error message
* finalize homepage layout
* Refactor overview's best trials module
* Remove multiphase from webui
* add tooltip for trial concurrency in the overview page
* Show top trials for hyper-parameter graph
HPO Updates
^^^^^^^^^^^
* Improve PBT on failure handling and support experiment resume for PBT
NAS Updates
^^^^^^^^^^^
* NAS support for TensorFlow 2.0 (preview) `TF2.0 NAS examples <https://github.com/microsoft/nni/tree/v1.9/examples/nas/naive-tf>`__
* Use OrderedDict for LayerChoice
* Prettify the format of export
* Replace layer choice with selected module after applied fixed architecture
Model Compression Updates
^^^^^^^^^^^^^^^^^^^^^^^^^
* Model compression PyTorch 1.4 support
Training Service Updates
^^^^^^^^^^^^^^^^^^^^^^^^
* update pai yaml merge logic
* support windows as remote machine in remote mode `Remote Mode <https://github.com/microsoft/nni/blob/v1.9/docs/en_US/TrainingService/RemoteMachineMode.rst#windows>`__
Bug Fix
^^^^^^^
* fix dev install
* Fix SPOS example crash when the checkpoints do not have state_dict
* Fix table sort issue when experiment had failed trial
* Support multi python env (conda, pyenv etc)
Release 1.5 - 4/13/2020
-----------------------
New Features and Documentation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Hyper-Parameter Optimizing
^^^^^^^^^^^^^^^^^^^^^^^^^^
* New tuner: `Population Based Training (PBT) <https://github.com/microsoft/nni/blob/v1.9/docs/en_US/Tuner/PBTTuner.rst>`__
* Trials can now report infinity and NaN as result
Neural Architecture Search
^^^^^^^^^^^^^^^^^^^^^^^^^^
* New NAS algorithm: `TextNAS <https://github.com/microsoft/nni/blob/v1.9/docs/en_US/NAS/TextNAS.rst>`__
* ENAS and DARTS now support `visualization <https://github.com/microsoft/nni/blob/v1.9/docs/en_US/NAS/Visualization.rst>`__ through web UI.
Model Compression
^^^^^^^^^^^^^^^^^
* New Pruner: `GradientRankFilterPruner <https://github.com/microsoft/nni/blob/v1.9/docs/en_US/Compressor/Pruner.rst#gradientrankfilterpruner>`__
* Compressors will validate configuration by default
* Refactor: Adding optimizer as an input argument of pruner, for easy support of DataParallel and more efficient iterative pruning. This is a broken change for the usage of iterative pruning algorithms.
* Model compression examples are refactored and improved
* Added documentation for `implementing compressing algorithm <https://github.com/microsoft/nni/blob/v1.9/docs/en_US/Compressor/Framework.rst>`__
Training Service
^^^^^^^^^^^^^^^^
* Kubeflow now supports pytorchjob crd v1 (thanks external contributor @jiapinai)
* Experimental `DLTS <https://github.com/microsoft/nni/blob/v1.9/docs/en_US/TrainingService/DLTSMode.rst>`__ support
Overall Documentation Improvement
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Documentation is significantly improved on grammar, spelling, and wording (thanks external contributor @AHartNtkn)
Fixed Bugs
^^^^^^^^^^
* ENAS cannot have more than one LSTM layers (thanks external contributor @marsggbo)
* NNI manager's timers will never unsubscribe (thanks external contributor @guilhermehn)
* NNI manager may exhaust head memory (thanks external contributor @Sundrops)
* Batch tuner does not support customized trials (#2075)
* Experiment cannot be killed if it failed on start (#2080)
* Non-number type metrics break web UI (#2278)
* A bug in lottery ticket pruner
* Other minor glitches
Release 1.4 - 2/19/2020
-----------------------
Major Features
^^^^^^^^^^^^^^
Neural Architecture Search
^^^^^^^^^^^^^^^^^^^^^^^^^^
* Support `C-DARTS <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/NAS/CDARTS.rst>`__ algorithm and add `the example <https://github.com/microsoft/nni/tree/v1.4/examples/nas/cdarts>`__ using it
* Support a preliminary version of `ProxylessNAS <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/NAS/Proxylessnas.rst>`__ and the corresponding `example <https://github.com/microsoft/nni/tree/v1.4/examples/nas/proxylessnas>`__
* Add unit tests for the NAS framework
Model Compression
^^^^^^^^^^^^^^^^^
* Support DataParallel for compressing models, and provide `an example <https://github.com/microsoft/nni/blob/v1.4/examples/model_compress/multi_gpu.py>`__ of using DataParallel
* Support `model speedup <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/Compressor/ModelSpeedup.rst>`__ for compressed models, in Alpha version
Training Service
^^^^^^^^^^^^^^^^
* Support complete PAI configurations by allowing users to specify PAI config file path
* Add example config yaml files for the new PAI mode (i.e., paiK8S)
* Support deleting experiments using sshkey in remote mode (thanks external contributor @tyusr)
WebUI
^^^^^
* WebUI refactor: adopt fabric framework
Others
^^^^^^
* Support running `NNI experiment at foreground <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/Tutorial/Nnictl#manage-an-experiment>`__\ , i.e., ``--foreground`` argument in ``nnictl create/resume/view``
* Support canceling the trials in UNKNOWN state
* Support large search space whose size could be up to 50mb (thanks external contributor @Sundrops)
Documentation
^^^^^^^^^^^^^
* Improve `the index structure <https://nni.readthedocs.io/en/latest/>`__ of NNI readthedocs
* Improve `documentation for NAS <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/NAS/NasGuide.rst>`__
* Improve documentation for `the new PAI mode <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/TrainingService/PaiMode.rst>`__
* Add QuickStart guidance for `NAS <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/NAS/QuickStart.md>`__ and `model compression <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/Compressor/QuickStart.rst>`__
* Improve documentation for `the supported EfficientNet <https://github.com/microsoft/nni/blob/v1.4/docs/en_US/TrialExample/EfficientNet.rst>`__
Bug Fixes
^^^^^^^^^
* Correctly support NaN in metric data, JSON compliant
* Fix the out-of-range bug of ``randint`` type in search space
* Fix the bug of wrong tensor device when exporting onnx model in model compression
* Fix incorrect handling of nnimanagerIP in the new PAI mode (i.e., paiK8S)
Release 1.3 - 12/30/2019
------------------------
Major Features
^^^^^^^^^^^^^^
Neural Architecture Search Algorithms Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* `Single Path One Shot <https://github.com/microsoft/nni/tree/v1.3/examples/nas/spos/>`__ algorithm and the example using it
Model Compression Algorithms Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* `Knowledge Distillation <https://github.com/microsoft/nni/blob/v1.3/docs/en_US/TrialExample/KDExample.rst>`__ algorithm and the example using it
* Pruners
* `L2Filter Pruner <https://github.com/microsoft/nni/blob/v1.3/docs/en_US/Compressor/Pruner.rst#3-l2filter-pruner>`__
* `ActivationAPoZRankFilterPruner <https://github.com/microsoft/nni/blob/v1.3/docs/en_US/Compressor/Pruner.rst#1-activationapozrankfilterpruner>`__
* `ActivationMeanRankFilterPruner <https://github.com/microsoft/nni/blob/v1.3/docs/en_US/Compressor/Pruner.rst#2-activationmeanrankfilterpruner>`__
* `BNN Quantizer <https://github.com/microsoft/nni/blob/v1.3/docs/en_US/Compressor/Quantizer.rst#bnn-quantizer>`__
Training Service
^^^^^^^^^^^^^^^^

* NFS Support for PAI

  Instead of using HDFS as default storage, since OpenPAI v0.11, OpenPAI can have NFS, AzureBlob, or other storage as default storage. In this release, NNI extended the support for this recent change made by OpenPAI, and can integrate with OpenPAI v0.11 or later versions with various default storage.

* Kubeflow update adoption

  Adopted Kubeflow 0.7's new support for tf-operator.
Engineering (code and build automation)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Enforced `ESLint <https://eslint.org/>`__ on static code analysis.
Small changes & Bug Fixes
^^^^^^^^^^^^^^^^^^^^^^^^^
* correctly recognize builtin tuner and customized tuner
* logging in dispatcher base
* fix the bug where tuner/assessor's failure sometimes kills the experiment.
* Fix local system as remote machine `issue <https://github.com/microsoft/nni/issues/1852>`__
* de-duplicate trial configuration in smac tuner `ticket <https://github.com/microsoft/nni/issues/1364>`__
Release 1.2 - 12/02/2019
------------------------
Major Features
^^^^^^^^^^^^^^
* `Feature Engineering <https://github.com/microsoft/nni/blob/v1.2/docs/en_US/FeatureEngineering/Overview.rst>`__
* New feature engineering interface
* Feature selection algorithms: `Gradient feature selector <https://github.com/microsoft/nni/blob/v1.2/docs/en_US/FeatureEngineering/GradientFeatureSelector.md>`__ & `GBDT selector <https://github.com/microsoft/nni/blob/v1.2/docs/en_US/FeatureEngineering/GBDTSelector.rst>`__
* `Examples for feature engineering <https://github.com/microsoft/nni/tree/v1.2/examples/feature_engineering>`__
* Neural Architecture Search (NAS) on NNI
* `New NAS interface <https://github.com/microsoft/nni/blob/v1.2/docs/en_US/NAS/NasInterface.rst>`__
* NAS algorithms: `ENAS <https://github.com/microsoft/nni/blob/v1.2/docs/en_US/NAS/Overview.md#enas>`__\ , `DARTS <https://github.com/microsoft/nni/blob/v1.2/docs/en_US/NAS/Overview.md#darts>`__\ , `P-DARTS <https://github.com/microsoft/nni/blob/v1.2/docs/en_US/NAS/Overview.rst#p-darts>`__ (in PyTorch)
* NAS in classic mode (each trial runs independently)
* Model compression
* `New model pruning algorithms <https://github.com/microsoft/nni/blob/v1.2/docs/en_US/Compressor/Overview.rst>`__\ : lottery ticket pruning approach, L1Filter pruner, Slim pruner, FPGM pruner
* `New model quantization algorithms <https://github.com/microsoft/nni/blob/v1.2/docs/en_US/Compressor/Overview.rst>`__\ : QAT quantizer, DoReFa quantizer
* Support the API for exporting compressed model.
* Training Service
* Support OpenPAI token authentication
* Examples:
* `An example to automatically tune rocksdb configuration with NNI <https://github.com/microsoft/nni/tree/v1.2/examples/trials/systems/rocksdb-fillrandom>`__.
* `A new MNIST trial example supports tensorflow 2.0 <https://github.com/microsoft/nni/tree/v1.2/examples/trials/mnist-tfv2>`__.
* Engineering Improvements
* For remote training service, trial jobs that require no GPU are now scheduled with a round-robin policy instead of randomly.
* Pylint rules added to check pull requests; new pull requests need to comply with these `pylint rules <https://github.com/microsoft/nni/blob/v1.2/pylintrc>`__.
* Web Portal & User Experience
* Support users to add customized trials.
* Users can zoom in/out on detail graphs, except the hyper-parameter graph.
* Documentation
* Improved NNI API documentation with more API docstrings.
Bug fix
^^^^^^^
* Fix the table sort issue when failed trials have no metrics. -Issue #1773
* Maintain selected status (Maximal/Minimal) when the page is switched. -PR #1710
* Make the hyper-parameter graph's default metric yAxis more accurate. -PR #1736
* Fix GPU script permission issue. -Issue #1665
Release 1.1 - 10/23/2019
------------------------
Major Features
^^^^^^^^^^^^^^
* New tuner: `PPO Tuner <https://github.com/microsoft/nni/blob/v1.1/docs/en_US/Tuner/PPOTuner.rst>`__
* `View stopped experiments <https://github.com/microsoft/nni/blob/v1.1/docs/en_US/Tutorial/Nnictl.rst#view>`__
* Tuners can now use dedicated GPU resource (see ``gpuIndices`` in `tutorial <https://github.com/microsoft/nni/blob/v1.1/docs/en_US/Tutorial/ExperimentConfig.rst>`__ for details)
* Web UI improvements
* Trials detail page can now list hyperparameters of each trial, as well as their start and end time (via "add column")
* Viewing huge experiment is now less laggy
* More examples
* `EfficientNet PyTorch example <https://github.com/ultmaster/EfficientNet-PyTorch>`__
* `Cifar10 NAS example <https://github.com/microsoft/nni/blob/v1.1/examples/trials/nas_cifar10/README.rst>`__
* `Model compression toolkit - Alpha release <https://github.com/microsoft/nni/blob/v1.1/docs/en_US/Compressor/Overview.rst>`__\ : We are glad to announce the alpha release of the model compression toolkit on top of NNI. It is still in the experimental phase and might evolve based on usage feedback. We'd like to invite you to use it, give feedback, and even contribute.
Fixed Bugs
^^^^^^^^^^
* Multiphase job hangs when search space exhausted (issue #1204)
* ``nnictl`` fails when log not available (issue #1548)
Release 1.0 - 9/2/2019
----------------------
Major Features
^^^^^^^^^^^^^^
* Tuners and Assessors
* Support auto-feature generation & selection -Issue #877 -PR #1387
* Provide auto-feature interface
* Tuner based on beam search
* `Add Pakdd example <https://github.com/microsoft/nni/tree/v1.9/examples/trials/auto-feature-engineering>`__
* Add a parallel algorithm to improve the performance of TPE with large concurrency. -PR #1052
* Support multiphase for hyperband -PR #1257
* Training Service
* Support private docker registry -PR #755
* Engineering Improvements
* Python wrapper for the REST API; supports retrieving metric values in a programmatic way -PR #1318
* New Python APIs: get_experiment_id(), get_trial_id() -PR #1353 -Issue #1331 -Issue #1368
* Optimized NAS search space -PR #1393
* Unify NAS search space with _type -- "mutable_type"
* Update random search tuner
* Set gpuNum as optional -Issue #1365
* Remove outputDir and dataDir configuration in PAI mode -Issue #1342
* When creating a trial in Kubeflow mode, codeDir will no longer be copied to logDir -Issue #1224
* Web Portal & User Experience
* Show the best metric curve during search progress in WebUI -Issue #1218
* Show the current number of parameters list in multiphase experiment -Issue #1210 -PR #1348
* Add "Intermediate count" option in AddColumn -Issue #1210
* Support searching parameter values in WebUI -Issue #1208
* Enable automatic scaling of axes for metric values in the default metric graph -Issue #1360
* Add a detailed documentation link to the nnictl command in the command prompt -Issue #1260
* UX improvement for showing the error log -Issue #1173
* Documentation
* Update the docs structure -Issue #1231
* (deprecated) Multiphase document improvement -Issue #1233 -PR #1242
* Add configuration example
* `WebUI description improvement <Tutorial/WebUI.rst>`__ -PR #1419
Bug fix
^^^^^^^
* (Bug fix) Fix the broken links in the 0.9 release -Issue #1236
* (Bug fix) Script for auto-complete
* (Bug fix) Fix pipeline issue that it only checks the exit code of the last command in a script -PR #1417
* (Bug fix) quniform for tuners -Issue #1377
* (Bug fix) 'quniform' has different meanings between GridSearch and other tuners -Issue #1335
* (Bug fix) "nnictl experiment list" gives the status of a "RUNNING" experiment as "INITIALIZED" -PR #1388
* (Bug fix) SMAC cannot be installed if NNI is installed in dev mode -Issue #1376
* (Bug fix) The filter button of the intermediate result cannot be clicked -Issue #1263
* (Bug fix) API "/api/v1/nni/trial-jobs/xxx" doesn't show a trial's all parameters in multiphase experiment -Issue #1258
* (Bug fix) Succeeded trial doesn't have final result but WebUI shows ×××(FINAL) -Issue #1207
* (Bug fix) IT for nnictl stop -Issue #1298
* (Bug fix) Fix security warning
* (Bug fix) Hyper-parameter page broken -Issue #1332
* (Bug fix) Run flake8 tests to find Python syntax errors and undefined names -PR #1217
Release 0.9 - 7/1/2019
----------------------
Major Features
^^^^^^^^^^^^^^
* General NAS programming interface
* Add ``enas-mode`` and ``oneshot-mode`` for NAS interface: `PR #1201 <https://github.com/microsoft/nni/pull/1201#issue-291094510>`__
* `Gaussian Process Tuner with Matern kernel <Tuner/GPTuner.rst>`__
* (deprecated) Multiphase experiment supports
* Added new training service support for multiphase experiment: PAI mode supports multiphase experiment since v0.9.
* Added multiphase capability for the following built-in tuners: TPE, Random Search, Anneal, Naïve Evolution, SMAC, Network Morphism, Metis Tuner.
* Web Portal
* Enable trial comparison in Web Portal. For details, refer to `View trials status <Tutorial/WebUI.rst>`__
* Allow users to adjust the rendering interval of Web Portal. For details, refer to `View Summary Page <Tutorial/WebUI.rst>`__
* Show intermediate results in a more friendly way. For details, refer to `View trials status <Tutorial/WebUI.rst>`__
* `Commandline Interface <Tutorial/Nnictl.rst>`__
* ``nnictl experiment delete``\ : delete one or all experiments, including logs, results, environment information and cache. Use it to delete useless experiment results or to save disk space.
* ``nnictl platform clean``\ : use it to clean up disk space on a target platform. The provided YAML file includes the information of the target platform, and it follows the same schema as the NNI configuration file.
Bug fix and other changes
^^^^^^^^^^^^^^^^^^^^^^^^^
* Tuner Installation Improvements: add `sklearn <https://scikit-learn.org/stable/>`__ to nni dependencies.
* (Bug Fix) Failed to connect to PAI http code - `Issue #1076 <https://github.com/microsoft/nni/issues/1076>`__
* (Bug Fix) Validate file name for PAI platform - `Issue #1164 <https://github.com/microsoft/nni/issues/1164>`__
* (Bug Fix) Update GMM evaluation in Metis Tuner
* (Bug Fix) Negative time number rendering in Web Portal - `Issue #1182 <https://github.com/microsoft/nni/issues/1182>`__\ , `Issue #1185 <https://github.com/microsoft/nni/issues/1185>`__
* (Bug Fix) Hyper-parameter not shown correctly in WebUI when there is only one hyper parameter - `Issue #1192 <https://github.com/microsoft/nni/issues/1192>`__
Release 0.8 - 6/4/2019
----------------------
Major Features
^^^^^^^^^^^^^^
* Support NNI on Windows for OpenPAI/Remote mode
* NNI running on windows for remote mode
* NNI running on windows for OpenPAI mode
* Advanced features for using GPU
* Run multiple trial jobs on the same GPU for local and remote mode
* Run trial jobs on the GPU running non-NNI jobs
* Kubeflow v1beta2 operator
* Support Kubeflow TFJob/PyTorchJob v1beta2
* `General NAS programming interface <https://github.com/microsoft/nni/blob/v0.8/docs/en_US/GeneralNasInterfaces.rst>`__
* Provide NAS programming interface for users to easily express their neural architecture search space through NNI annotation
* Provide a new command ``nnictl trial codegen`` for debugging the NAS code
* Tutorial of NAS programming interface, example of NAS on MNIST, customized random tuner for NAS
* Support resume tuner/advisor's state for experiment resume
* For experiment resume, tuner/advisor will be resumed by replaying finished trial data
* Web Portal
* Improve the design of copying trial's parameters
* Support 'randint' type in hyper-parameter graph
* Use shouldComponentUpdate to avoid unnecessary re-rendering
Bug fix and other changes
^^^^^^^^^^^^^^^^^^^^^^^^^
* Bug fix that ``nnictl update`` has inconsistent command styles
* Support import data for SMAC tuner
* Bug fix that experiment state transition from ERROR back to RUNNING
* Fix bug of table entries
* Nested search space refinement
* Refine 'randint' type and support lower bound
* `Comparison of different hyper-parameter tuning algorithm <CommunitySharings/HpoComparison.rst>`__
* `Comparison of NAS algorithm <CommunitySharings/NasComparison.rst>`__
* `NNI practice on Recommenders <CommunitySharings/RecommendersSvd.rst>`__
Release 0.7 - 4/29/2019
-----------------------
Major Features
^^^^^^^^^^^^^^
* `Support NNI on Windows <Tutorial/InstallationWin.rst>`__
* NNI running on windows for local mode
* `New advisor: BOHB <Tuner/BohbAdvisor.rst>`__
* Support a new advisor BOHB, which is a robust and efficient hyperparameter tuning algorithm that combines the advantages of Bayesian optimization and Hyperband
* `Support import and export experiment data through nnictl <Tutorial/Nnictl.rst>`__
* Generate analysis results report after the experiment execution
* Support import data to tuner and advisor for tuning
* `Designated gpu devices for NNI trial jobs <Tutorial/ExperimentConfig.rst#localConfig>`__
* Specify GPU devices for NNI trial jobs via the gpuIndices configuration; if gpuIndices is set in the experiment configuration file, only the specified GPU devices are used for NNI trial jobs.
* Web Portal enhancement
* Decimal format of metrics other than default on the Web UI
* Hints in WebUI about Multi-phase
* Enable copy/paste for hyperparameters as python dict
* Enable early stopped trials data for tuners.
* NNICTL provides better error messages
* nnictl provides more meaningful error messages for YAML file format errors
Bug fix
^^^^^^^
* Unable to kill all python threads after nnictl stop in async dispatcher mode
* nnictl --version does not work with make dev-install
* All trial jobs' status stays on 'waiting' for a long time on OpenPAI platform
Release 0.6 - 4/2/2019
----------------------
Major Features
^^^^^^^^^^^^^^
* `Version checking <TrainingService/PaiMode.rst>`__
* check whether the version is consistent between nniManager and trialKeeper
* `Report final metrics for early stop job <https://github.com/microsoft/nni/issues/776>`__
* If includeIntermediateResults is true, the last intermediate result of the trial that is early stopped by assessor is sent to tuner as final result. The default value of includeIntermediateResults is false.
* `Separate Tuner/Assessor <https://github.com/microsoft/nni/issues/841>`__
* Adds two pipes to separate message receiving channels for tuner and assessor.
* Make log collection feature configurable
* Add intermediate result graph for all trials
Bug fix
^^^^^^^
* `Add shmMB config key for OpenPAI <https://github.com/microsoft/nni/issues/842>`__
* Fix the bug that no result is shown if the metric is a dict
* Fix the number calculation issue for float types in hyperband
* Fix a bug in the search space conversion in SMAC tuner
* Fix the WebUI issue when parsing experiment.json with illegal format
* Fix cold start issue in Metis Tuner
Release 0.5.2 - 3/4/2019
------------------------
Improvements
^^^^^^^^^^^^
* Curve fitting assessor performance improvement.
Documentation
^^^^^^^^^^^^^
* Chinese version document: https://nni.readthedocs.io/zh/latest/
* Debuggability/serviceability document: https://nni.readthedocs.io/en/latest/Tutorial/HowToDebug.html
* Tuner assessor reference: https://nni.readthedocs.io/en/latest/sdk_reference.html
Bug Fixes and Other Changes
^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Fix a race condition bug that does not store trial job cancel status correctly.
* Fix search space parsing error when using SMAC tuner.
* Fix cifar10 example broken pipe issue.
* Add unit test cases for nnimanager and local training service.
* Add integration test azure pipelines for remote machine, OpenPAI and kubeflow training services.
* Support Pylon in OpenPAI webhdfs client.
Release 0.5.1 - 1/31/2019
-------------------------
Improvements
^^^^^^^^^^^^
* Making `log directory <https://github.com/microsoft/nni/blob/v0.5.1/docs/ExperimentConfig.rst>`__ configurable
* Support `different levels of logs <https://github.com/microsoft/nni/blob/v0.5.1/docs/ExperimentConfig.rst>`__\ , making it easier for debugging
Documentation
^^^^^^^^^^^^^
* Reorganized documentation & New Homepage Released: https://nni.readthedocs.io/en/latest/
Bug Fixes and Other Changes
^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Fix the bug of installation in python virtualenv, and refactor the installation logic
* Fix the bug of HDFS access failure on OpenPAI mode after OpenPAI is upgraded.
* Fix the bug that sometimes in-place flushed stdout makes experiment crash
Release 0.5.0 - 01/14/2019
--------------------------
Major Features
^^^^^^^^^^^^^^
New tuner and assessor supports
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Support `Metis tuner <Tuner/MetisTuner.rst>`__ as a new NNI tuner. The Metis algorithm has been proven to perform well for **online** hyper-parameter tuning.
* Support `ENAS customized tuner <https://github.com/countif/enas_nni>`__\ , a tuner contributed by a GitHub community user. It is an algorithm for neural architecture search that learns neural network architectures via reinforcement learning and achieves better performance than NAS.
* Support `Curve fitting assessor <Assessor/CurvefittingAssessor.rst>`__ for early stop policy using learning curve extrapolation.
* Advanced Support of `Weight Sharing <https://github.com/microsoft/nni/blob/v0.5/docs/AdvancedNAS.rst>`__\ : Enable weight sharing for NAS tuners, currently through NFS.
Training Service Enhancement
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* `FrameworkController Training service <TrainingService/FrameworkControllerMode.rst>`__\ : Support running experiments using FrameworkController on Kubernetes
* FrameworkController is a controller on Kubernetes that is general enough to run (distributed) jobs with various machine learning frameworks, such as TensorFlow, PyTorch, and MXNet.
* NNI provides unified and simple specification for job definition.
* MNIST example for how to use FrameworkController.
User Experience improvements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* A better trial logging support for NNI experiments in OpenPAI, Kubeflow and FrameworkController modes:
* An improved logging architecture that sends trials' stdout/stderr to the NNI manager via HTTP POST; the NNI manager stores trials' stdout/stderr messages in a local log file.
* Show the link to the trial log file on WebUI.
* Support showing all key-value pairs of the final result.
Release 0.4.1 - 12/14/2018
--------------------------
Major Features
^^^^^^^^^^^^^^
New tuner supports
^^^^^^^^^^^^^^^^^^
* Support `network morphism <Tuner/NetworkmorphismTuner.rst>`__ as a new tuner
Training Service improvements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Migrate `Kubeflow training service <TrainingService/KubeflowMode.rst>`__\ 's dependency from kubectl CLI to `Kubernetes API <https://kubernetes.io/docs/concepts/overview/kubernetes-api/>`__ client
* `Pytorch-operator <https://github.com/kubeflow/pytorch-operator>`__ support for Kubeflow training service
* Improvement on local code files uploading to OpenPAI HDFS
* Fixed OpenPAI integration WebUI bug: WebUI doesn't show latest trial job status, which is caused by OpenPAI token expiration
NNICTL improvements
^^^^^^^^^^^^^^^^^^^
* Show version information both in nnictl and WebUI. You can run **nnictl -v** to show your currently installed NNI version
WebUI improvements
^^^^^^^^^^^^^^^^^^
* Enable modifying the concurrency number during an experiment
* Add a feedback link to the NNI GitHub 'create issue' page
* Enable customizing the top 10 trials according to metric values (largest or smallest)
* Enable downloading logs for dispatcher & nnimanager
* Enable automatic scaling of axes for metric values
* Update annotation to support displaying real choices in the search space
New examples
^^^^^^^^^^^^
* `FashionMnist <https://github.com/microsoft/nni/tree/v1.9/examples/trials/network_morphism>`__\ , work together with network morphism tuner
* `Distributed MNIST example <https://github.com/microsoft/nni/tree/v1.9/examples/trials/mnist-distributed-pytorch>`__ written in PyTorch
Release 0.4 - 12/6/2018
-----------------------
Major Features
^^^^^^^^^^^^^^
* `Kubeflow Training service <TrainingService/KubeflowMode.rst>`__
* Support tf-operator
* `Distributed trial example <https://github.com/microsoft/nni/tree/v1.9/examples/trials/mnist-distributed/dist_mnist.py>`__ on Kubeflow
* `Grid search tuner <Tuner/GridsearchTuner.rst>`__
* `Hyperband tuner <Tuner/HyperbandAdvisor.rst>`__
* Support launching NNI experiments on macOS
* WebUI
* UI support for hyperband tuner
* Remove tensorboard button
* Show experiment error message
* Show line numbers in search space and trial profile
* Support search a specific trial by trial number
* Show trial's hdfsLogPath
* Download experiment parameters
Others
^^^^^^
* Asynchronous dispatcher
* Docker file update, add pytorch library
* Refactor the 'nnictl stop' process: send SIGTERM to the NNI manager process rather than calling the stop REST API.
* OpenPAI training service bug fix
* Support NNI Manager IP configuration (nniManagerIp) in the OpenPAI cluster config file, to fix the issue that a user's machine has no eth0 device
* The file number in codeDir is now capped at 1000, to avoid users mistakenly setting the root dir as codeDir
* Don't print the useless 'metrics is empty' log in OpenPAI jobs' stdout; only print a useful message once new metrics are recorded, to reduce confusion when users check an OpenPAI trial's output for debugging purposes
* Add timestamp at the beginning of each log entry in trial keeper.
Release 0.3.0 - 11/2/2018
-------------------------
NNICTL new features and updates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Support running multiple experiments simultaneously. Before v0.3, NNI only supported running a single experiment at a time. After this release, users are able to run multiple experiments simultaneously. Each experiment requires a unique port; the first experiment will be set to the default port as in previous versions. You can specify a unique port for the other experiments as below:

  .. code-block:: bash

     nnictl create --port 8081 --config <config file path>

* Support updating the max trial number. Use ``nnictl update --help`` to learn more, or refer to `NNICTL Spec <Tutorial/Nnictl.rst>`__ for the full usage of NNICTL.
API new features and updates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* :raw-html:`<span style="color:red">**breaking change**</span>`\ : nni.get_parameters() is refactored to nni.get_next_parameter(). All examples from prior releases cannot run on v0.3; please clone the nni repo to get new examples. If you had applied NNI to your own code, please update the API accordingly.

  .. code-block:: bash

     git clone -b v0.3 https://github.com/microsoft/nni.git

* New API **nni.get_sequence_id()**. Each trial job is allocated a unique sequence number, which can be retrieved by the nni.get_sequence_id() API.
* **nni.report_final_result(result)** API supports more data types for the result parameter. It can be of the following types (see the sketch after this list):

  * int
  * float
  * A Python dict containing a 'default' key, where the value of the 'default' key should be of type int or float. The dict can contain any other key-value pairs.
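To make the accepted result types concrete, here is a minimal hedged sketch of a trial reporting its final result with these v0.3 APIs; the metric value is a placeholder.

.. code-block:: python

   import nni

   seq = nni.get_sequence_id()  # unique sequence number of this trial

   accuracy = 0.93  # placeholder metric from your own training code
   # a final result may be a plain int or float:
   # nni.report_final_result(accuracy)
   # ... or a dict with a 'default' key plus arbitrary extra key-value pairs:
   nni.report_final_result({'default': accuracy, 'sequence': seq})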
New tuner support
^^^^^^^^^^^^^^^^^
* **Batch Tuner**\ , which iterates over all parameter combinations, can be used to submit batch trial jobs.
New examples
^^^^^^^^^^^^
* An NNI Docker image for public usage:

  .. code-block:: bash

     docker pull msranni/nni:latest

* New trial example: `NNI Sklearn Example <https://github.com/microsoft/nni/tree/v1.9/examples/trials/sklearn>`__
* New competition example: `Kaggle Competition TGS Salt Example <https://github.com/microsoft/nni/tree/v1.9/examples/trials/kaggle-tgs-salt>`__
Others
^^^^^^
* UI refactoring, refer to `WebUI doc <Tutorial/WebUI.rst>`__ for how to work with the new UI.
* Continuous Integration: NNI has switched to Azure Pipelines
Release 0.2.0 - 9/29/2018
-------------------------
Major Features
^^^^^^^^^^^^^^
* Support `OpenPAI <https://github.com/microsoft/pai>`__ Training Platform (See `here <TrainingService/PaiMode.rst>`__ for instructions about how to submit an NNI job in pai mode)
* Support training services in pai mode. NNI trials will be scheduled to run on the OpenPAI cluster
* NNI trials' output (including logs and model files) will be copied to OpenPAI HDFS for further debugging and checking
* Support `SMAC <https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf>`__ tuner (See `here <Tuner/SmacTuner.rst>`__ for instructions about how to use SMAC tuner)
* `SMAC <https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf>`__ is based on Sequential Model-Based Optimization (SMBO). It adapts the most prominent previously used model class (Gaussian stochastic process models) and introduces the model class of random forests to SMBO to handle categorical parameters. The SMAC supported by NNI is a wrapper on `SMAC3 <https://github.com/automl/SMAC3>`__
* Support NNI installation on `conda <https://conda.io/docs/index.html>`__ and python virtual environment
* Others
* Update ga squad example and related documentation
* WebUI UX small enhancement and bug fix
Release 0.1.0 - 9/10/2018 (initial release)
-------------------------------------------
Initial release of Neural Network Intelligence (NNI).
Major Features
^^^^^^^^^^^^^^
* Installation and Deployment
* Support pip install and source codes install
* Support training services in local mode (including multi-GPU mode) as well as multi-machine mode
* Tuners, Assessors and Trial
* Support AutoML algorithms including: hyperopt_tpe, hyperopt_annealing, hyperopt_random, and evolution_tuner
* Support assessor (early stop) algorithms including: medianstop algorithm
* Provide Python API for user defined tuners and assessors
* Provide Python API for user to wrap trial code as NNI deployable codes
* Experiments
* Provide a command line toolkit 'nnictl' for experiment management
* Provide a WebUI for viewing experiments details and managing experiments
* Continuous Integration
* Support CI by providing out-of-box integration with `travis-ci <https://github.com/travis-ci>`__ on ubuntu
* Others
* Support simple GPU job scheduling
Research and Publications
=========================
We are working intensively on both the tool chain and research to make automatic model design and tuning truly practical and powerful. On the one hand, our main work is tool-chain-oriented development. On the other hand, our research aims to improve this tool chain, rethink challenging problems in AutoML (on both the system and algorithm sides), and propose elegant solutions. Below we list some of our research work; we encourage more research on this topic and welcome collaboration with us.
System Research
---------------
* `Retiarii: A Deep Learning Exploratory-Training Framework <https://www.usenix.org/system/files/osdi20-zhang_quanlu.pdf>`__
.. code-block:: bibtex
@inproceedings{zhang2020retiarii,
title={Retiarii: A Deep Learning Exploratory-Training Framework},
author={Zhang, Quanlu and Han, Zhenhua and Yang, Fan and Zhang, Yuge and Liu, Zhe and Yang, Mao and Zhou, Lidong},
booktitle={14th $\{$USENIX$\}$ Symposium on Operating Systems Design and Implementation ($\{$OSDI$\}$ 20)},
pages={919--936},
year={2020}
}
* `AutoSys: The Design and Operation of Learning-Augmented Systems <https://www.usenix.org/system/files/atc20-liang-chieh-jan.pdf>`__
.. code-block:: bibtex
@inproceedings{liang2020autosys,
title={AutoSys: The Design and Operation of Learning-Augmented Systems},
author={Liang, Chieh-Jan Mike and Xue, Hui and Yang, Mao and Zhou, Lidong and Zhu, Lifei and Li, Zhao Lucis and Wang, Zibo and Chen, Qi and Zhang, Quanlu and Liu, Chuanjie and others},
booktitle={2020 $\{$USENIX$\}$ Annual Technical Conference ($\{$USENIX$\}$$\{$ATC$\}$ 20)},
pages={323--336},
year={2020}
}
* `Gandiva: Introspective Cluster Scheduling for Deep Learning <https://www.usenix.org/system/files/osdi18-xiao.pdf>`__
.. code-block:: bibtex
@inproceedings{xiao2018gandiva,
title={Gandiva: Introspective cluster scheduling for deep learning},
author={Xiao, Wencong and Bhardwaj, Romil and Ramjee, Ramachandran and Sivathanu, Muthian and Kwatra, Nipun and Han, Zhenhua and Patel, Pratyush and Peng, Xuan and Zhao, Hanyu and Zhang, Quanlu and others},
booktitle={13th $\{$USENIX$\}$ Symposium on Operating Systems Design and Implementation ($\{$OSDI$\}$ 18)},
pages={595--610},
year={2018}
}
Algorithm Research
------------------
New Algorithms
^^^^^^^^^^^^^^
* `TextNAS: A Neural Architecture Search Space Tailored for Text Representation <https://arxiv.org/pdf/1912.10729.pdf>`__
.. code-block:: bibtex
@inproceedings{wang2020textnas,
title={TextNAS: A Neural Architecture Search Space Tailored for Text Representation.},
author={Wang, Yujing and Yang, Yaming and Chen, Yiren and Bai, Jing and Zhang, Ce and Su, Guinan and Kou, Xiaoyu and Tong, Yunhai and Yang, Mao and Zhou, Lidong},
booktitle={AAAI},
pages={9242--9249},
year={2020}
}
* `Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search <https://papers.nips.cc/paper/2020/file/d072677d210ac4c03ba046120f0802ec-Paper.pdf>`__
.. code-block:: bibtex
@article{peng2020cream,
title={Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search},
author={Peng, Houwen and Du, Hao and Yu, Hongyuan and Li, Qi and Liao, Jing and Fu, Jianlong},
journal={Advances in Neural Information Processing Systems},
volume={33},
year={2020}
}
* `Metis: Robustly tuning tail latencies of cloud systems <https://www.usenix.org/system/files/conference/atc18/atc18-li-zhao.pdf>`__
.. code-block:: bibtex
@inproceedings{li2018metis,
title={Metis: Robustly tuning tail latencies of cloud systems},
author={Li, Zhao Lucis and Liang, Chieh-Jan Mike and He, Wenjia and Zhu, Lianjie and Dai, Wenjun and Jiang, Jin and Sun, Guangzhong},
booktitle={2018 $\{$USENIX$\}$ Annual Technical Conference ($\{$USENIX$\}$$\{$ATC$\}$ 18)},
pages={981--992},
year={2018}
}
* `OpEvo: An Evolutionary Method for Tensor Operator Optimization <https://arxiv.org/abs/2006.05664>`__
.. code-block:: bibtex
@article{gao2020opevo,
title={OpEvo: An Evolutionary Method for Tensor Operator Optimization},
author={Gao, Xiaotian and Wei, Cui and Zhang, Lintao and Yang, Mao},
journal={arXiv preprint arXiv:2006.05664},
year={2020}
}
Measurement and Understanding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* `Deeper insights into weight sharing in neural architecture search <https://arxiv.org/pdf/2001.01431.pdf>`__
.. code-block:: bibtex
@article{zhang2020deeper,
title={Deeper insights into weight sharing in neural architecture search},
author={Zhang, Yuge and Lin, Zejun and Jiang, Junyang and Zhang, Quanlu and Wang, Yujing and Xue, Hui and Zhang, Chen and Yang, Yaming},
journal={arXiv preprint arXiv:2001.01431},
year={2020}
}
* `How Does Supernet Help in Neural Architecture Search? <https://arxiv.org/abs/2010.08219>`__
.. code-block:: bibtex
@article{zhang2020does,
title={How Does Supernet Help in Neural Architecture Search?},
author={Zhang, Yuge and Zhang, Quanlu and Yang, Yaming},
journal={arXiv preprint arXiv:2010.08219},
year={2020}
}
Applications
^^^^^^^^^^^^
* `AutoADR: Automatic Model Design for Ad Relevance <https://arxiv.org/pdf/2010.07075.pdf>`__
.. code-block:: bibtex
@inproceedings{chen2020autoadr,
title={AutoADR: Automatic Model Design for Ad Relevance},
author={Chen, Yiren and Yang, Yaming and Sun, Hong and Wang, Yujing and Xu, Yu and Shen, Wei and Zhou, Rong and Tong, Yunhai and Bai, Jing and Zhang, Ruofei},
booktitle={Proceedings of the 29th ACM International Conference on Information \& Knowledge Management},
pages={2365--2372},
year={2020}
}
.. role:: raw-html(raw)
:format: html
Framework and Library Supports
==============================
With the built-in Python API, NNI naturally supports hyper-parameter tuning and neural architecture search for all AI frameworks and libraries that support Python models (\ ``version >= 3.6``\ ). NNI also provides a set of examples and tutorials for some popular scenarios to make getting started easier.
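As a hedged illustration of why the API is framework-agnostic, the sketch below shows the shape of a typical trial: the tuner supplies hyperparameters and the trial reports a metric, regardless of which framework trains the model in between. The ``learning_rate`` key is a hypothetical search-space entry.

.. code-block:: python

   import nni

   params = nni.get_next_parameter()        # e.g. {'learning_rate': 0.01}
   lr = params.get('learning_rate', 0.001)  # 'learning_rate' is a hypothetical key

   # ... train a model here with any Python framework (PyTorch, TensorFlow, Keras, ...) ...
   accuracy = 0.9                           # placeholder metric

   nni.report_intermediate_result(accuracy) # optional per-epoch reporting
   nni.report_final_result(accuracy)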
Supported AI Frameworks
-----------------------
* :raw-html:`<b>[PyTorch]</b>` https://github.com/pytorch/pytorch
.. raw:: html
<ul>
<li><a href="../../examples/trials/mnist-distributed-pytorch">MNIST-pytorch</a><br/></li>
<li><a href="TrialExample/Cifar10Examples.md">CIFAR-10</a><br/></li>
<li><a href="../../examples/trials/kaggle-tgs-salt/README.md">TGS salt identification chanllenge</a><br/></li>
<li><a href="../../examples/trials/network_morphism/README.md">Network_morphism</a><br/></li>
</ul>
* :raw-html:`<b>[TensorFlow]</b>` https://github.com/tensorflow/tensorflow
.. raw:: html
<ul>
<li><a href="../../examples/trials/mnist-distributed">MNIST-tensorflow</a><br/></li>
<li><a href="../../examples/trials/ga_squad/README.md">Squad</a><br/></li>
</ul>
* :raw-html:`<b>[Keras]</b>` https://github.com/keras-team/keras
.. raw:: html
<ul>
<li><a href="../../examples/trials/mnist-keras">MNIST-keras</a><br/></li>
<li><a href="../../examples/trials/network_morphism/README.md">Network_morphism</a><br/></li>
</ul>
* :raw-html:`<b>[MXNet]</b>` https://github.com/apache/incubator-mxnet
* :raw-html:`<b>[Caffe2]</b>` https://github.com/BVLC/caffe
* :raw-html:`<b>[CNTK (Python language)]</b>` https://github.com/microsoft/CNTK
* :raw-html:`<b>[Spark MLlib]</b>` http://spark.apache.org/mllib/
* :raw-html:`<b>[Chainer]</b>` https://chainer.org/
* :raw-html:`<b>[Theano]</b>` https://pypi.org/project/Theano/ :raw-html:`<br/>`
You are encouraged to `contribute more examples <Tutorial/Contributing.rst>`__ for other NNI users.
Supported Library
-----------------
NNI also supports all libraries written in Python. Here are some common libraries, including GBDT-based algorithms: XGBoost, CatBoost and LightGBM.
* :raw-html:`<b>[Scikit-learn]</b>` https://scikit-learn.org/stable/
.. raw:: html
<ul>
<li><a href="TrialExample/SklearnExamples.md">Scikit-learn</a><br/></li>
</ul>
* :raw-html:`<b>[XGBoost]</b>` https://xgboost.readthedocs.io/en/latest/
* :raw-html:`<b>[CatBoost]</b>` https://catboost.ai/
* :raw-html:`<b>[LightGBM]</b>` https://lightgbm.readthedocs.io/en/latest/
.. raw:: html

   <ul>
   <li><a href="TrialExample/GbdtExample.md">Auto-gbdt</a><br/></li>
   </ul>
Here is just a small list of the libraries supported by NNI. If you are interested in NNI, you can refer to the `tutorial <TrialExample/Trials.rst>`__ to complete your own hacks.
In addition to the above examples, we welcome more users to apply NNI to their own work; if you have any questions, please refer to `Write a Trial Run on NNI <TrialExample/Trials.md>`__. In particular, if you want to become a contributor to NNI, whether by sharing examples, writing tuners, or otherwise, we look forward to your participation. For more information please refer to `here <Tutorial/Contributing.rst>`__.
Run an Experiment on Azure Machine Learning
===========================================
NNI supports running an experiment on `AML <https://azure.microsoft.com/en-us/services/machine-learning/>`__\ , called aml mode.
Setup environment
-----------------
Step 1. Install NNI, follow the install guide `here <../Tutorial/QuickStart.rst>`__.
Step 2. Create an Azure account/subscription using this `link <https://azure.microsoft.com/en-us/free/services/machine-learning/>`__. If you already have an Azure account/subscription, skip this step.
Step 3. Install the Azure CLI on your machine, follow the install guide `here <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__.
Step 4. Authenticate to your Azure subscription from the CLI. To authenticate interactively, open a command line or terminal and use the following command:
.. code-block:: bash
az login
Step 5. Log into your Azure account with a web browser and create a Machine Learning resource. You will need to choose a resource group and specify a workspace name. Then download ``config.json``, which will be used later.
.. image:: ../../img/aml_workspace.png
:target: ../../img/aml_workspace.png
:alt:
Step 6. Create an AML cluster as the computeTarget.
.. image:: ../../img/aml_cluster.png
:target: ../../img/aml_cluster.png
:alt:
Step 7. Open a command line and install the AML packages into your Python environment.
.. code-block:: bash
python3 -m pip install azureml
python3 -m pip install azureml-sdk
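As an optional sanity check that the SDK is installed and the ``config.json`` from Step 5 is valid, you can try loading the workspace in Python. This is a hedged sketch using the azureml-core ``Workspace.from_config`` helper; the file path is an assumption about where you saved ``config.json``.

.. code-block:: python

   # optional sanity check: load the workspace from the config.json of Step 5
   from azureml.core import Workspace

   ws = Workspace.from_config(path="config.json")  # adjust the path if needed
   print(ws.name, ws.resource_group, ws.subscription_id)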
Run an experiment
-----------------
Use ``examples/trials/mnist-tfv1`` as an example. The NNI config YAML file's content is as follows:
.. code-block:: yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
trainingServicePlatform: aml
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
#SMAC (SMAC should be installed through nnictl)
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: .
image: msranni/nni
gpuNum: 1
amlConfig:
subscriptionId: ${replace_to_your_subscriptionId}
resourceGroup: ${replace_to_your_resourceGroup}
workspaceName: ${replace_to_your_workspaceName}
computeTarget: ${replace_to_your_computeTarget}
Note: You should set ``trainingServicePlatform: aml`` in NNI config YAML file if you want to start experiment in aml mode.
Compared with `LocalMode <LocalMode.rst>`__\ , trial configuration in aml mode has these additional keys:

* image

  * required key. The docker image name used in the job. NNI supports the image ``msranni/nni`` for running aml jobs. Note: this image is built on a CUDA environment, so it may not be suitable for CPU clusters in AML.
amlConfig:
* subscriptionId

  * required key, the subscriptionId of your account
* resourceGroup

  * required key, the resourceGroup of your account
* workspaceName

  * required key, the workspaceName of your account
* computeTarget

  * required key, the compute cluster name you want to use in your AML workspace (`refer <https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target>`__\ ). See Step 6.
* maxTrialNumPerGpu

  * optional key, default 1. Used to specify the max number of concurrent trials on a GPU device.
* useActiveGpu

  * optional key, default false. Used to specify whether to use a GPU when another process is running on it. By default, NNI will use a GPU only if there is no other active process on it.
The required information for amlConfig can be found in the ``config.json`` downloaded in Step 5.
Run the following commands to start the example experiment:
.. code-block:: bash
git clone -b ${NNI_VERSION} https://github.com/microsoft/nni
cd nni/examples/trials/mnist-tfv1
# modify config_aml.yml ...
nnictl create --config config_aml.yml
Replace ``${NNI_VERSION}`` with a released version name or branch name, e.g., ``v1.9``.
Run an Experiment on AdaptDL
============================
Now NNI supports running experiments on `AdaptDL <https://github.com/petuum/adaptdl>`__. Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster, either on-premises or `Azure Kubernetes Service (AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , and an Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is set up to connect to your Kubernetes cluster. In AdaptDL mode, your trial program will run as an AdaptDL job in the Kubernetes cluster.
AdaptDL aims to make distributed deep learning easy and efficient in dynamic-resource environments such as shared clusters and the cloud.
Prerequisite for Kubernetes Service
-----------------------------------
#. A **Kubernetes** cluster using Kubernetes 1.14 or later with storage. Follow this guideline to set up Kubernetes `on Azure <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , or `on-premise <https://kubernetes.io/docs/setup/>`__ with `cephfs <https://kubernetes.io/docs/concepts/storage/storage-classes/#ceph-rbd>`__\ , or `microk8s with storage add-on enabled <https://microk8s.io/docs/addons>`__.
#. Helm install **AdaptDL Scheduler** to your Kubernetes cluster. Follow this `guideline <https://adaptdl.readthedocs.io/en/latest/installation/install-adaptdl.html>`__ to setup AdaptDL scheduler.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. If your NNI trial job needs GPU resource, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure **Nvidia device plugin for Kubernetes**.
#. (Optional) Prepare a **NFS server** and export a general purpose mount as external storage.
#. Install **NNI**\ , follow the install guide `here <../Tutorial/QuickStart.rst>`__.
Verify Prerequisites
^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash
nnictl --version
# Expected: <version_number>
.. code-block:: bash
kubectl version
# Expected that the kubectl client version matches the server version.
.. code-block:: bash
kubectl api-versions | grep adaptdl
# Expected: adaptdl.petuum.com/v1
Run an experiment
-----------------
We have a CIFAR10 example that fully leverages the AdaptDL scheduler under the ``examples/trials/cifar10_pytorch`` folder (\ ``main_adl.py`` and ``config_adl.yaml``\ ).
Here is a template configuration specification to use AdaptDL as a training service.
.. code-block:: yaml
authorName: default
experimentName: minimal_adl
trainingServicePlatform: adl
nniManagerIp: 10.1.10.11
logCollection: http
tuner:
builtinTunerName: GridSearch
searchSpacePath: search_space.json
trialConcurrency: 2
maxTrialNum: 2
trial:
adaptive: false # optional.
image: <image_tag>
imagePullSecrets: # optional
- name: stagingsecret
codeDir: .
command: python main.py
gpuNum: 1
cpuNum: 1 # optional
memorySize: 8Gi # optional
nfs: # optional
server: 10.20.41.55
path: /
containerMountPath: /nfs
checkpoint: # optional
storageClass: microk8s-hostpath
storageSize: 1Gi
Configs not mentioned below follow the `default specs defined in the NNI doc </Tutorial/ExperimentConfig.html#configuration-spec>`__.
* **trainingServicePlatform**\ : Choose ``adl`` to use the Kubernetes cluster with AdaptDL scheduler.
* **nniManagerIp**\ : *Required* for the ``adl`` training service to get the correct info and metrics back from the cluster. It is the IP address of the machine with NNI manager (NNICTL) that launches the NNI experiment.
* **logCollection**\ : *Recommended* to set as ``http``. It will collect the trial logs on cluster back to your machine via http.
* **tuner**\ : It supports the Tuun tuner and all NNI built-in tuners (except for the checkpoint feature of the NNI PBT tuners).
* **trial**\ : It defines the specs of an ``adl`` trial.
* **adaptive**\ : (*Optional*\ ) Boolean for the AdaptDL trainer. When ``true``\ , the job is preemptible and adaptive.
* **image**\ : Docker image for the trial
* **imagePullSecrets**\ : (*Optional*\ ) If you are using a private registry,
you need to provide the secret to successfully pull the image.
* **codeDir**\ : the working directory of the container. ``.`` means the default working directory defined by the image.
* **command**\ : the bash command to start the trial
* **gpuNum**\ : the number of GPUs requested for this trial. It must be a non-negative integer.
* **cpuNum**\ : (*Optional*\ ) the number of CPUs requested for this trial. It must be a non-negative integer.
* **memorySize**\ : (*Optional*\ ) the size of memory requested for this trial. It must follow the Kubernetes
`default format <https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory>`__.
* **nfs**\ : (*Optional*\ ) mounting external storage. For more information about using NFS please check the below paragraph.
* **checkpoint**\ : (*Optional*\ ) `storage settings <https://kubernetes.io/docs/concepts/storage/storage-classes/>`__ for AdaptDL internal checkpoints. You can leave it unset if you are not a dev user.
NFS Storage
^^^^^^^^^^^
As you may have noticed in the above configuration spec, an *optional* section is available to configure NFS external storage. It is optional when no external storage is required, for example when a Docker image is sufficient with code and data inside.
Note that the ``adl`` training service does NOT mount the NFS onto the local dev machine, so you can manually mount it locally, manage the filesystem, and copy data or code.
The ``adl`` training service can then mount it into Kubernetes for every trial, with the proper configurations:
* **server**\ : NFS server address, e.g. IP address or domain
* **path**\ : NFS server export path, i.e. the absolute path in NFS that can be mounted to trials
* **containerMountPath**\ : In container absolute path to mount the NFS **path** above,
so that every trial will have the access to the NFS.
In the trial containers, you can access the NFS with this path.
Use cases:

* If your training trials depend on a large dataset, you may want to download it onto the NFS first, and mount it so that it can be shared across multiple trials.
* The storage for containers is ephemeral and the trial containers will be deleted after a trial's lifecycle is over. So if you want to export your trained models, you may mount the NFS to the trial to persist and export them.
In short, a trial is not limited in how it reads from or writes to the NFS storage, so you may use it flexibly as per your needs. A minimal sketch of the model-export use case follows.
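For instance, here is a minimal hedged sketch of persisting a trained model onto the NFS mount. The mount point ``/nfs`` matches the ``containerMountPath`` example above, and the checkpoint file name ``model.pt`` is hypothetical.

.. code-block:: python

   import os
   import shutil

   # assumption: the NFS share is mounted at /nfs via containerMountPath
   model_dir = "/nfs/models"
   os.makedirs(model_dir, exist_ok=True)

   # copy a locally saved checkpoint (hypothetical file name) onto the NFS share,
   # named after this trial so that concurrent trials don't overwrite each other
   trial_id = os.getenv("NNI_TRIAL_JOB_ID", "local-test")
   shutil.copy("model.pt", os.path.join(model_dir, trial_id + ".pt"))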
Monitor via Log Stream
----------------------
Follow the log streaming of a certain trial:
.. code-block:: bash
nnictl log trial --trial_id=<trial_id>
.. code-block:: bash
nnictl log trial <experiment_id> --trial_id=<trial_id>
Note that *after* a trial has finished and its pod has been deleted,
no logs can be retrieved via this command.
However, you may still be able to access past trial logs
using the following approach.
Monitor via TensorBoard
-----------------------
In the context of NNI, an experiment has multiple trials.
For easy comparison across trials during a model tuning process,
we support TensorBoard integration: each experiment has
an independent TensorBoard logging directory and thus its own dashboard.
You can only use the TensorBoard while the monitored experiment is running.
In other words, it is not supported to monitor stopped experiments.
In the trial container you may have access to two environment variables:
* ``ADAPTDL_TENSORBOARD_LOGDIR``\ : the TensorBoard logging directory for the current experiment,
* ``NNI_TRIAL_JOB_ID``\ : the ``trial`` job id for the current trial.
It is recommended to join them as the logging directory for the trial,
for example in Python:
.. code-block:: python
import os
tensorboard_logdir = os.path.join(
os.getenv("ADAPTDL_TENSORBOARD_LOGDIR"),
os.getenv("NNI_TRIAL_JOB_ID")
)
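Continuing the snippet above, here is a hedged example of actually writing metrics into this directory, assuming PyTorch's bundled TensorBoard writer is available in the trial image:

.. code-block:: python

   # assumption: torch (with tensorboard support) is installed in the trial container
   from torch.utils.tensorboard import SummaryWriter

   writer = SummaryWriter(log_dir=tensorboard_logdir)
   for step in range(100):
       loss = 1.0 / (step + 1)  # placeholder training loss
       writer.add_scalar("train/loss", loss, step)
   writer.close()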
If an experiment is stopped, the data logged here
(defined by *the above envs*, for monitoring with the following commands)
will be lost. To persist the logged data, you can use external storage (e.g. mount an NFS)
to export it and view TensorBoard locally.
With the above setting, you can monitor the experiment easily
via TensorBoard by
.. code-block:: bash
nnictl tensorboard start
If multiple experiments are running at the same time, you may use
.. code-block:: bash
nnictl tensorboard start <experiment_id>
It will provide you the web URL to access TensorBoard.
Note that you have the flexibility to set the local ``--port``
for TensorBoard.
Run an Experiment on DLTS
=========================
NNI supports running an experiment on `DLTS <https://github.com/microsoft/DLWorkspace.git>`__\ , called dlts mode. Before starting to use NNI dlts mode, you should have an account to access DLTS dashboard.
Setup Environment
-----------------
Step 1. Choose a cluster from the DLTS dashboard; ask the administrator for the cluster dashboard URL.
.. image:: ../../img/dlts-step1.png
:target: ../../img/dlts-step1.png
:alt: Choose Cluster
Step 2. Prepare a NNI config YAML like the following:
.. code-block:: yaml
# Set this field to "dlts"
trainingServicePlatform: dlts
authorName: your_name
experimentName: auto_mnist
trialConcurrency: 2
maxExecDuration: 3h
maxTrialNum: 100
searchSpacePath: search_space.json
useAnnotation: false
tuner:
builtinTunerName: TPE
classArgs:
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: .
gpuNum: 1
image: msranni/nni
# Configuration to access DLTS
dltsConfig:
dashboard: # Ask administrator for the cluster dashboard URL
Remember to fill in the cluster dashboard URL on the last line.
Step 3. Open your working directory on the cluster and paste the NNI config as well as related code into a directory.
.. image:: ../../img/dlts-step3.png
:target: ../../img/dlts-step3.png
:alt: Copy Config
Step 4. Submit an NNI manager job to the specified cluster.
.. image:: ../../img/dlts-step4.png
:target: ../../img/dlts-step4.png
:alt: Submit Job
Step 5. Go to the Endpoints tab of the newly created job and click the Port 40000 link to check the trials' information.
.. image:: ../../img/dlts-step5.png
:target: ../../img/dlts-step5.png
:alt: View NNI WebUI
Run an Experiment on FrameworkController
========================================
NNI supports running experiments using `FrameworkController <https://github.com/Microsoft/frameworkcontroller>`__\ , called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, so you don't need to install Kubeflow or a framework-specific operator like tf-operator or pytorch-operator. You can now use FrameworkController as the training service to run NNI experiments.
Prerequisite for on-premises Kubernetes Service
-----------------------------------------------
#. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this `guideline <https://kubernetes.io/docs/setup/>`__ to set up Kubernetes
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. If your NNI trial job needs GPU resource, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure **Nvidia device plugin for Kubernetes**.
#. Prepare an **NFS server** and export a general purpose mount (we recommend mapping your NFS server path with the ``root_squash`` option; otherwise permission issues may arise when NNI copies files to NFS. Refer to this `page <https://linux.die.net/man/5/exports>`__ to learn what the root_squash option is), or **Azure File Storage**.
#. Install **NFS client** on the machine where you install NNI and run nnictl to create experiment. Run this command to install NFSv4 client:
.. code-block:: bash
apt-get install nfs-common
#. Install **NNI**\ , follow the install guide `here <../Tutorial/QuickStart>`__.
Prerequisite for Azure Kubernetes Service
-----------------------------------------
#. NNI supports FrameworkController based on Azure Kubernetes Service; follow the `guideline <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__ to set up Azure Kubernetes Service.
#. Install `Azure CLI <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__ and **kubectl**. Use ``az login`` to set the Azure account, and connect the kubectl client to AKS; refer to this `guideline <https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster>`__.
#. Follow the `guideline <https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal>`__ to create an Azure file storage account. If you use Azure Kubernetes Service, NNI needs the Azure Storage Service to store code files and output files.
#. To access the Azure storage service, NNI needs the access key of the storage account, and NNI uses the `Azure Key Vault <https://azure.microsoft.com/en-us/services/key-vault/>`__ Service to protect your private key. Set up the Azure Key Vault Service and add a secret to Key Vault to store the access key of the Azure storage account. Follow this `guideline <https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli>`__ to store the access key.
Setup FrameworkController
-------------------------
Follow the `guideline <https://github.com/Microsoft/frameworkcontroller/tree/master/example/run>`__ to set up FrameworkController in the Kubernetes cluster; NNI supports FrameworkController in the stateful set mode. If your cluster enforces authorization, you need to create a service account with granted permission for FrameworkController, and then pass the name of the FrameworkController service account to the NNI experiment config (`refer <https://github.com/Microsoft/frameworkcontroller/tree/master/example/run#run-by-kubernetes-statefulset>`__\ ).
Design
------
Please refer to the design of the `Kubeflow training service <KubeflowMode.rst>`__\ ; the FrameworkController training service pipeline is similar.
Example
-------
The FrameworkController config file format is:
.. code-block:: yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 10h
maxTrialNum: 100
#choice: local, remote, pai, kubeflow, frameworkcontroller
trainingServicePlatform: frameworkcontroller
searchSpacePath: ~/nni/examples/trials/mnist-tfv1/search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
assessor:
builtinAssessorName: Medianstop
classArgs:
optimize_mode: maximize
trial:
codeDir: ~/nni/examples/trials/mnist-tfv1
taskRoles:
- name: worker
taskNum: 1
command: python3 mnist.py
gpuNum: 1
cpuNum: 1
memoryMB: 8192
image: msranni/nni:latest
frameworkAttemptCompletionPolicy:
minFailedTaskCount: 1
minSucceededTaskCount: 1
frameworkcontrollerConfig:
storage: nfs
nfs:
server: {your_nfs_server}
path: {your_nfs_server_exported_path}
If you use Azure Kubernetes Service, you should set ``frameworkcontrollerConfig`` in your config YAML file as follows:
.. code-block:: yaml
frameworkcontrollerConfig:
storage: azureStorage
serviceAccountName: {your_frameworkcontroller_service_account_name}
keyVault:
vaultName: {your_vault_name}
name: {your_secert_name}
azureStorage:
accountName: {your_storage_account_name}
azureShare: {your_azure_share_name}
Note: You should explicitly set ``trainingServicePlatform: frameworkcontroller`` in the NNI config YAML file if you want to start an experiment in frameworkcontroller mode.
The trial's config format for NNI frameworkcontroller mode is a simplified version of FrameworkController's official config; you can refer to the `Tensorflow example of FrameworkController <https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml>`__ for a deeper understanding.
Trial configuration in frameworkcontroller mode has the following configuration keys:
* taskRoles: you can set multiple task roles in the config file; each task role is a basic unit to process in the Kubernetes cluster.
* name: the name of the task role, e.g., "worker", "ps", "master".
* taskNum: the replica number of the task role.
* command: the user's command to be run in the container.
* gpuNum: the number of GPU devices used in the container.
* cpuNum: the number of CPU devices used in the container.
* memoryMB: the memory limitation to be specified in the container.
* image: the docker image used to create the pod and run the program.
* frameworkAttemptCompletionPolicy: the policy for running the framework; please refer to the `user-manual <https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.rst#frameworkattemptcompletionpolicy>`__ for details. Users can use this policy to control the pods; for example, if the workers stop but ps does not, the completion policy can help stop ps.
How to run example
------------------
After you prepare a config file, you can run your experiment with nnictl. The way to start an experiment on FrameworkController is similar to Kubeflow; please refer to the `document <KubeflowMode.rst>`__ for more information.
Version check
-------------
NNI supports the version check feature since version 0.6; `refer <PaiMode.rst>`__.
How to Implement Training Service in NNI
========================================
Overview
--------
TrainingService is a module related to platform management and job scheduling in NNI. TrainingService is designed to be easy to implement: we define an abstract class TrainingService as the parent class of all kinds of TrainingServices, and users just need to inherit the parent class and complete their own child class if they want to implement a customized TrainingService.
System architecture
-------------------
.. image:: ../../img/NNIDesign.jpg
:target: ../../img/NNIDesign.jpg
:alt:
The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of the system, in charge of calling TrainingService to manage trial jobs and of the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module that manages trial jobs; it communicates with the NNIManager module and has different instances for different training platforms. For the time being, NNI supports the `local platform <LocalMode.md>`__\ , `remote platform <RemoteMachineMode.md>`__\ , `PAI platform <PaiMode.md>`__\ , `kubeflow platform <KubeflowMode.md>`__ and `FrameworkController platform <FrameworkControllerMode.rst>`__.
In this document, we introduce the brief design of TrainingService. If users want to add a new TrainingService instance, they just need to complete a child class implementing TrainingService; they don't need to understand the code details of NNIManager, Dispatcher or other modules.
Folder structure of code
------------------------
NNI's folder structure is shown below:
.. code-block:: bash
nni
|- deployment
|- docs
|- examples
|- src
| |- nni_manager
| | |- common
| | |- config
| | |- core
| | |- coverage
| | |- dist
| | |- rest_server
| | |- training_service
| | | |- common
| | | |- kubernetes
| | | |- local
| | | |- pai
| | | |- remote_machine
| | | |- test
| |- sdk
| |- webui
|- test
|- tools
| |-nni_annotation
| |-nni_cmd
| |-nni_gpu_tool
| |-nni_trial_tool
The ``nni/src/`` folder stores most of NNI's source code. The code in this folder is related to NNIManager, TrainingService, SDK, WebUI and other modules. Users can find the abstract class of TrainingService in the ``nni/src/nni_manager/common/trainingService.ts`` file, and they should put their own implemented TrainingService in the ``nni/src/nni_manager/training_service`` folder. If users have implemented their own TrainingService code, they should also supplement it with unit tests and place them in the ``nni/src/nni_manager/training_service/test`` folder.
Function annotation of TrainingService
--------------------------------------
.. code-block:: typescript
abstract class TrainingService {
public abstract listTrialJobs(): Promise<TrialJobDetail[]>;
public abstract getTrialJob(trialJobId: string): Promise<TrialJobDetail>;
public abstract addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
public abstract removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
public abstract submitTrialJob(form: JobApplicationForm): Promise<TrialJobDetail>;
public abstract updateTrialJob(trialJobId: string, form: JobApplicationForm): Promise<TrialJobDetail>;
public abstract get isMultiPhaseJobSupported(): boolean;
public abstract cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean): Promise<void>;
public abstract setClusterMetadata(key: string, value: string): Promise<void>;
public abstract getClusterMetadata(key: string): Promise<string>;
public abstract cleanUp(): Promise<void>;
public abstract run(): Promise<void>;
}
The parent class of TrainingService has a few abstract functions, users need to inherit the parent class and implement all of these abstract functions.
**setClusterMetadata(key: string, value: string)**
ClusterMetadata is the data related to platform details. For example, the ClusterMetadata defined for the remote machine platform is:
.. code-block:: typescript
export class RemoteMachineMeta {
public readonly ip : string;
public readonly port : number;
public readonly username : string;
public readonly passwd?: string;
public readonly sshKeyPath?: string;
public readonly passphrase?: string;
public gpuSummary : GPUSummary | undefined;
/* GPU Reservation info, the key is GPU index, the value is the job id which reserves this GPU*/
public gpuReservation : Map<number, string>;
constructor(ip : string, port : number, username : string, passwd : string,
sshKeyPath : string, passphrase : string) {
this.ip = ip;
this.port = port;
this.username = username;
this.passwd = passwd;
this.sshKeyPath = sshKeyPath;
this.passphrase = passphrase;
this.gpuReservation = new Map<number, string>();
}
}
The metadata includes the host address, the username and other configuration related to the platform. Users need to define their own metadata format and set the metadata instance in this function. This function is called before the experiment starts, to set the configuration of, for example, remote machines.
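For instance, a child class might deserialize the value and keep it for later use. This is a hypothetical sketch: the key name and the ``platformConfig`` field are illustrations, not part of NNI's API.

.. code-block:: typescript

   // Inside the child class: store platform configuration for later use.
   public async setClusterMetadata(key: string, value: string): Promise<void> {
       if (key === 'my_platform_config') {            // hypothetical key name
           // The value is typically a serialized string, e.g. JSON.
           this.platformConfig = JSON.parse(value);   // hypothetical field
       }
   }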
**getClusterMetadata(key: string)**
This function returns the metadata value according to the given key; it can be left empty if users don't need to use it.
**submitTrialJob(form: JobApplicationForm)**
SubmitTrialJob is the function for submitting new trial jobs; users should generate a job instance of the TrialJobDetail type. TrialJobDetail is defined as follows:
.. code-block:: typescript
interface TrialJobDetail {
readonly id: string;
readonly status: TrialJobStatus;
readonly submitTime: number;
readonly startTime?: number;
readonly endTime?: number;
readonly tags?: string[];
readonly url?: string;
readonly workingDirectory: string;
readonly form: JobApplicationForm;
readonly sequenceId: number;
isEarlyStopped?: boolean;
}
Depending on the implementation, users can put the job details into a job queue and keep fetching jobs from the queue to prepare and start them. Alternatively, they can finish the preparing and starting process inside this function and return the job detail after the job is submitted.
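A minimal sketch of the queue-based pattern could look like this. Note that ``this.jobQueue``, ``this.nextSequenceId``, the id generation, and the working directory layout are hypothetical helpers, not part of the SDK.

.. code-block:: typescript

   // Inside the child class: enqueue the job and let run() launch it later.
   public async submitTrialJob(form: JobApplicationForm): Promise<TrialJobDetail> {
       const trialJobId: string = Math.random().toString(36).substring(2, 10); // unique id
       const detail: TrialJobDetail = {
           id: trialJobId,
           status: 'WAITING',                            // queued, not yet running
           submitTime: Date.now(),
           workingDirectory: `/tmp/nni/trials/${trialJobId}`, // hypothetical layout
           form: form,
           sequenceId: this.nextSequenceId++
       };
       this.trialJobsMap.set(trialJobId, detail);
       this.jobQueue.push(trialJobId);                   // picked up by run()
       return detail;
   }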
**cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean)**
If this function is called, the trial started by the platform should be canceled. Different platforms have different methods to cancel a running job; this function should be implemented according to the specific platform.
**updateTrialJob(trialJobId: string, form: JobApplicationForm)**
This function is called to update the trial job's status. The trial job's status should be detected according to the platform and updated to ``RUNNING``\ , ``SUCCEEDED``\ , ``FAILED``\ , etc.
**getTrialJob(trialJobId: string)**
This function returns a trialJob detail instance according to trialJobId.
**listTrialJobs()**
Users should put the details of all trial jobs into a list and return the list.
**addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)**
NNI will hold an EventEmitter to get job metrics; when new job metrics are detected, the EventEmitter will be triggered. Users should start the EventEmitter in this function.
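On the emitting side, when the platform reports a new metric (for example, through a callback from TrialKeeper, described below), the implementation can forward it to the registered listeners. A hedged sketch:

.. code-block:: typescript

   // Inside the child class: forward a collected metric to all listeners.
   // The payload shape here is an assumption based on the TrialJobMetric type.
   private handleTrialMetric(trialJobId: string, metricData: string): void {
       this.metricsEmitter.emit('metric', { id: trialJobId, data: metricData });
   }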
**removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)**
Remove the given listener from the EventEmitter.
**run()**
The run() function is the main loop of a TrainingService. Users can set up a while loop that executes their logic and exits when the experiment is stopped.
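A minimal sketch of such a loop, assuming the hypothetical queue-based ``submitTrialJob`` shown earlier:

.. code-block:: typescript

   // Inside the child class: poll the job queue until the experiment stops.
   public async run(): Promise<void> {
       while (!this.stopping) {
           while (this.jobQueue.length > 0) {
               const trialJobId: string = this.jobQueue.shift()!;
               // Prepare files, launch the trial on the platform, and update
               // the corresponding TrialJobDetail status to RUNNING here.
           }
           // Sleep briefly before polling again.
           await new Promise((resolve) => setTimeout(resolve, 3000));
       }
   }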
**cleanUp()**
This function is called to clean up the environment when an experiment is stopped. Users should do the platform-related cleanup in this function.
TrialKeeper tool
----------------
NNI offers a TrialKeeper tool to help maintain trial jobs. Users can find its source code in ``nni/tools/nni_trial_tool``. If users want to run trial jobs on a cloud platform, this tool is a fine choice to keep trials running on the platform.
The running architecture of TrialKeeper is shown as follows:
.. image:: ../../img/trialkeeper.jpg
:target: ../../img/trialkeeper.jpg
:alt:
When users submit a trial job to a cloud platform, they should wrap their trial command inside TrialKeeper and start a TrialKeeper process on the cloud platform. Notice that TrialKeeper uses a RESTful server to communicate with TrainingService, so users should start a RESTful server on the local machine to receive metrics sent from TrialKeeper. The source code of the RESTful server can be found in ``nni/src/nni_manager/training_service/common/clusterJobRestServer.ts``.
Reference
---------
For more information about how to debug, please refer to `How to Debug <../Tutorial/HowToDebug.rst>`__.
For guidelines on how to contribute, please refer to `Contributing <../Tutorial/Contributing.rst>`__.
Run an Experiment on Kubeflow
=============================
Now NNI supports running experiments on `Kubeflow <https://github.com/kubeflow/kubeflow>`__\ , called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or `Azure Kubernetes Service (AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , and an Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is set up to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, `here <https://kubernetes.io/docs/tutorials/kubernetes-basics/>`__ is a good start. In kubeflow mode, your trial program will run as a Kubeflow job in a Kubernetes cluster.
Prerequisite for on-premises Kubernetes Service
-----------------------------------------------
#. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this `guideline <https://kubernetes.io/docs/setup/>`__ to set up Kubernetes.
#. Download, set up, and deploy **Kubeflow** to your Kubernetes cluster. Follow this `guideline <https://www.kubeflow.org/docs/started/getting-started/>`__ to set up Kubeflow.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. If your NNI trial job needs GPU resources, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure the **Nvidia device plugin for Kubernetes**.
#. Prepare an **NFS server** and export a general purpose mount (we recommend mapping your NFS server path with the ``root_squash`` option; otherwise permission issues may arise when NNI copies files to NFS. Refer to this `page <https://linux.die.net/man/5/exports>`__ to learn what the root_squash option is), or prepare **Azure File Storage**.
#. Install the **NFS client** on the machine where you install NNI and run nnictl to create experiments. Run this command to install the NFSv4 client:
.. code-block:: bash
apt-get install nfs-common
#. Install **NNI**\ , following the install guide `here <../Tutorial/QuickStart.rst>`__.
Prerequisite for Azure Kubernetes Service
-----------------------------------------
#. NNI supports Kubeflow based on Azure Kubernetes Service; follow the `guideline <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__ to set up Azure Kubernetes Service.
#. Install `Azure CLI <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__ and **kubectl**. Use ``az login`` to set your Azure account and connect the kubectl client to AKS; refer to this `guideline <https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster>`__.
#. Deploy Kubeflow on Azure Kubernetes Service, following the `guideline <https://www.kubeflow.org/docs/started/getting-started/>`__.
#. Follow the `guideline <https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal>`__ to create an Azure file storage account. If you use Azure Kubernetes Service, NNI needs Azure Storage Service to store code files and output files.
#. To access the Azure storage service, NNI needs the access key of the storage account, and NNI uses the `Azure Key Vault <https://azure.microsoft.com/en-us/services/key-vault/>`__ service to protect your private key. Set up Azure Key Vault Service and add a secret to Key Vault to store the access key of the Azure storage account. Follow this `guideline <https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli>`__ to store the access key.
Design
------
.. image:: ../../img/kubeflow_training_design.png
:target: ../../img/kubeflow_training_design.png
:alt:
The Kubeflow training service instantiates a Kubernetes REST client to interact with your K8s cluster's API server.
For each trial, we will upload all the files in your local codeDir path (configured in nni_config.yml), together with NNI-generated files like parameter.cfg, into a storage volume. Right now we support two kinds of storage volumes: `NFS <https://en.wikipedia.org/wiki/Network_File_System>`__ and `Azure file storage <https://azure.microsoft.com/en-us/services/storage/files/>`__\ ; you should configure the storage volume in the NNI config YAML file. After the files are prepared, the Kubeflow training service will call the K8S REST API to create Kubeflow jobs (\ `tf-operator <https://github.com/kubeflow/tf-operator>`__ jobs or `pytorch-operator <https://github.com/kubeflow/pytorch-operator>`__ jobs) in K8S and mount your storage volume into the job's pod. The output files of a Kubeflow job, like stdout, stderr, trial.log or model files, will also be copied back to the storage volume. NNI will show the storage volume's URL for each trial in the WebUI, to allow users to browse the log files and the job's output files.
Supported operator
------------------
NNI only supports the tf-operator and pytorch-operator of Kubeflow; other operators are not tested.
Users can set the operator type in the config file.
The setting for tf-operator:
.. code-block:: yaml
kubeflowConfig:
operator: tf-operator
The setting for pytorch-operator:
.. code-block:: yaml
kubeflowConfig:
operator: pytorch-operator
Users who want to use tf-operator can set ``ps`` and ``worker`` in the trial config. Users who want to use pytorch-operator can set ``master`` and ``worker`` in the trial config.
Supported storage type
----------------------
NNI supports NFS and Azure Storage to store the code and output files; users can set the storage type in the config file along with the corresponding configuration.
The setting for NFS storage is as follows:
.. code-block:: yaml
kubeflowConfig:
storage: nfs
nfs:
# Your NFS server IP, like 10.10.10.10
server: {your_nfs_server_ip}
# Your NFS server export path, like /var/nfs/nni
path: {your_nfs_server_export_path}
If you use Azure storage, you should set ``kubeflowConfig`` in your config YAML file as follows:
.. code-block:: yaml
kubeflowConfig:
storage: azureStorage
keyVault:
vaultName: {your_vault_name}
    name: {your_secret_name}
azureStorage:
accountName: {your_storage_account_name}
azureShare: {your_azure_share_name}
Run an experiment
-----------------
Use ``examples/trials/mnist-tfv1`` as an example. This is a TensorFlow job that uses the tf-operator of Kubeflow. The NNI config YAML file's content is like:
.. code-block:: yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 2
maxExecDuration: 1h
maxTrialNum: 20
#choice: local, remote, pai, kubeflow
trainingServicePlatform: kubeflow
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
assessor:
builtinAssessorName: Medianstop
classArgs:
optimize_mode: maximize
trial:
codeDir: .
worker:
replicas: 2
command: python3 dist_mnist.py
gpuNum: 1
cpuNum: 1
memoryMB: 8196
image: msranni/nni:latest
ps:
replicas: 1
command: python3 dist_mnist.py
gpuNum: 0
cpuNum: 1
memoryMB: 8196
image: msranni/nni:latest
kubeflowConfig:
operator: tf-operator
apiVersion: v1alpha2
storage: nfs
nfs:
# Your NFS server IP, like 10.10.10.10
server: {your_nfs_server_ip}
# Your NFS server export path, like /var/nfs/nni
path: {your_nfs_server_export_path}
Note: You should explicitly set ``trainingServicePlatform: kubeflow`` in the NNI config YAML file if you want to start an experiment in kubeflow mode.
If you want to run PyTorch jobs, you can set your config file as follows:
.. code-block:: yaml
authorName: default
experimentName: example_mnist_distributed_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: kubeflow
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: minimize
trial:
codeDir: .
master:
replicas: 1
command: python3 dist_mnist.py
gpuNum: 1
cpuNum: 1
memoryMB: 2048
image: msranni/nni:latest
worker:
replicas: 1
command: python3 dist_mnist.py
gpuNum: 0
cpuNum: 1
memoryMB: 2048
image: msranni/nni:latest
kubeflowConfig:
operator: pytorch-operator
apiVersion: v1alpha2
nfs:
# Your NFS server IP, like 10.10.10.10
server: {your_nfs_server_ip}
# Your NFS server export path, like /var/nfs/nni
path: {your_nfs_server_export_path}
The trial configuration in kubeflow mode has the following configuration keys:

* codeDir

  * the code directory, where you put the training code and config files

* worker (required). This config section is used to configure the TensorFlow worker role.
* replicas

  * Required key. Should be a positive number, depending on how many replicas you want to run for the TensorFlow worker role.

* command

  * Required key. The command to launch your trial job, like ``python mnist.py``.

* memoryMB

  * Required key. Should be a positive number based on your trial program's memory requirement.

* cpuNum
* gpuNum
* image

  * Required key. In kubeflow mode, your trial program will be scheduled by Kubernetes to run in a `Pod <https://kubernetes.io/docs/concepts/workloads/pods/pod/>`__. This key is used to specify the Docker image used to create the pod in which your trial program will run.
  * We have already built a Docker image :githublink:`msranni/nni <deployment/docker/Dockerfile>`. You can either use this image directly in your config file or build your own image based on it.

* privateRegistryAuthPath

  * Optional field. Specifies the ``config.json`` file path that holds an authorization token for a Docker registry, used to pull images from a private registry. `Refer here <https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/>`__.

* apiVersion

  * Required key. The API version of your Kubeflow.

* ps (optional). This config section is used to configure the TensorFlow parameter server role.
* master (optional). This config section is used to configure the PyTorch master role.
Once you have filled in the NNI experiment config file and saved it (for example, as exp_kubeflow.yml), run the following command
.. code-block:: bash
nnictl create --config exp_kubeflow.yml
to start the experiment in kubeflow mode. NNI will create Kubeflow tfjob or pytorchjob for each trial, and the job name format is something like ``nni_exp_{experiment_id}_trial_{trial_id}``.
You can see the Kubeflow tfjob created by NNI in your Kubernetes dashboard.
Notice: In kubeflow mode, NNIManager will start a RESTful server and listen on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is ``8080``\ , the RESTful server will listen on ``8081`` to receive metrics from trial jobs running in Kubernetes. So you should open TCP port ``8081`` in your firewall rules to allow incoming traffic.
Once a trial job is completed, you can go to NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.
Version check
-------------
NNI has supported the version check feature since version 0.6; refer to `PaiMode <PaiMode.rst>`__.
If you have any problems using NNI in kubeflow mode, please create an issue in the `NNI GitHub repo <https://github.com/Microsoft/nni>`__.
**Tutorial: Create and Run an Experiment Locally with NNI API**
================================================================
In this tutorial, we will use the example in ``examples/trials/mnist-tfv1`` to explain how to create and run an experiment locally with the NNI API.
**Before start**\ : You have an implementation of an MNIST classifier using convolutional layers; the Python code is in ``mnist_before.py``.
**Step 1 - Update model codes**
To enable NNI API, make the following changes:
* Declare NNI API: include ``import nni`` in your trial code to use NNI APIs.
* Get predefined parameters
Use the following code snippet:
.. code-block:: python
RECEIVED_PARAMS = nni.get_next_parameter()
to get the hyper-parameter values assigned by the tuner. ``RECEIVED_PARAMS`` is an object; for example:
.. code-block:: json
{"conv_size": 2, "hidden_size": 124, "learning_rate": 0.0307, "dropout_rate": 0.2029}
* Report NNI results: Use the API ``nni.report_intermediate_result(accuracy)`` to send ``accuracy`` to the assessor,
  and the API ``nni.report_final_result(accuracy)`` to send ``accuracy`` to the tuner.
We have made these changes and saved the code to ``mnist.py``.
**NOTE**\ :
.. code-block:: text
accuracy - The `accuracy` could be any python object, but if you use the NNI built-in tuner/assessor, `accuracy` should be a numerical variable (e.g. float, int).
assessor - The assessor will decide which trials should stop early based on the trial's history performance (the intermediate results of one trial).
tuner - The tuner will generate the next parameters/architecture based on the exploration history (the final results of all trials).
**Step 2 - Define SearchSpace**
The hyper-parameters used in ``Step 1.2 - Get predefined parameters`` are defined in a ``search_space.json`` file like below:
.. code-block:: json
{
"dropout_rate":{"_type":"uniform","_value":[0.1,0.5]},
"conv_size":{"_type":"choice","_value":[2,3,5,7]},
"hidden_size":{"_type":"choice","_value":[124, 512, 1024]},
"learning_rate":{"_type":"uniform","_value":[0.0001, 0.1]}
}
Refer to `define search space <../Tutorial/SearchSpaceSpec.rst>`__ to learn more about search space.
**Step 3 - Define Experiment**
**3.1 Enable NNI API mode**
To enable NNI API mode, you need to set useAnnotation to *false* and provide the path of the SearchSpace file (the file you just defined in Step 2):
.. code-block:: yaml
useAnnotation: false
searchSpacePath: /path/to/your/search_space.json
To run an experiment in NNI, you only need to:
* Provide a runnable trial
* Provide or choose a tuner
* Provide a YAML experiment configuration file
* (optional) Provide or choose an assessor
**Prepare trial**\ :
Let's use a simple trial example provided by NNI, e.g. mnist. After you install NNI, the NNI examples are placed in ``~/nni/examples``\ ; run ``ls ~/nni/examples/trials`` to see all the trial examples. You can simply execute the following command to run the NNI mnist example:
.. code-block:: bash
python ~/nni/examples/trials/mnist-annotation/mnist.py
This command will be filled into the YAML configuration file below. Please refer to `here <../TrialExample/Trials.rst>`__ for how to write your own trial.
**Prepare tuner**\ : NNI supports several popular AutoML algorithms, including Random Search, Tree of Parzen Estimators (TPE), Evolution algorithm, etc. Users can also write their own tuner (refer to `here <../Tuner/CustomizeTuner.rst>`__\ ), but for simplicity, here we choose a tuner provided by NNI:
.. code-block:: yaml
tuner:
builtinTunerName: TPE
classArgs:
optimize_mode: maximize
*builtinTunerName* is used to specify a tuner in NNI, *classArgs* are the arguments passed to the tuner (the spec of the builtin tuners can be found `here <../Tuner/BuiltinTuner.rst>`__\ ), and *optimize_mode* indicates whether you want to maximize or minimize your trial's result.
**Prepare configuration file**\ : Since you already know which trial code you are going to run and which tuner you are going to use, it is time to prepare the YAML configuration file. NNI provides a demo configuration file for each trial example; run ``cat ~/nni/examples/trials/mnist-annotation/config.yml`` to see it. Its content is basically shown below:
.. code-block:: yaml
authorName: your_name
experimentName: auto_mnist
# how many trials could be concurrently running
trialConcurrency: 1
# maximum experiment running duration
maxExecDuration: 3h
# empty means never stop
maxTrialNum: 100
# choice: local, remote
trainingServicePlatform: local
# search space file
searchSpacePath: search_space.json
# choice: true, false
useAnnotation: true
tuner:
builtinTunerName: TPE
classArgs:
optimize_mode: maximize
trial:
command: python mnist.py
codeDir: ~/nni/examples/trials/mnist-annotation
gpuNum: 0
Here *useAnnotation* is true because this trial example uses our python annotation (refer to `here <../Tutorial/AnnotationSpec.rst>`__ for details). For the trial, we should provide the *command* to run the trial and the *codeDir* where the trial code is; the command will be executed in this directory. We should also provide how many GPUs a trial requires.
With all these steps done, we can run the experiment with the following command:
.. code-block:: bash
nnictl create --config ~/nni/examples/trials/mnist-annotation/config.yml
You can refer to `here <../Tutorial/Nnictl.rst>`__ for more usage guide of *nnictl* command line tool.
View experiment results
-----------------------
The experiment should be running now. Besides *nnictl*\ , NNI also provides a WebUI for you to view the experiment's progress, control your experiment, and use some other appealing features.
Using multiple local GPUs to speed up search
--------------------------------------------
The following steps assume that you have 4 NVIDIA GPUs installed locally and `tensorflow with GPU support <https://www.tensorflow.org/install/gpu>`__. The demo enables 4 concurrent trial jobs, and each trial job uses 1 GPU.
**Prepare configuration file**\ : NNI provides a demo configuration file for the setting above; run ``cat ~/nni/examples/trials/mnist-annotation/config_gpu.yml`` to see it. The *trialConcurrency* and *gpuNum* values are different from the basic configuration file:
.. code-block:: yaml
...
# how many trials could be concurrently running
trialConcurrency: 4
...
trial:
command: python mnist.py
codeDir: ~/nni/examples/trials/mnist-annotation
gpuNum: 1
We can run the experiment with the following command:
.. code-block:: bash
nnictl create --config ~/nni/examples/trials/mnist-annotation/config_gpu.yml
You can use the *nnictl* command line tool or the WebUI to trace the training progress. The *nvidia-smi* command line tool can also help you monitor GPU usage during training.
Training Service
================
What is Training Service?
-------------------------
NNI training service is designed to allow users to focus on AutoML itself, agnostic to the underlying computing infrastructure where the trials are actually run. When migrating from one cluster to another (e.g., local machine to Kubeflow), users only need to tweak several configurations, and the experiment can be easily scaled.
Users can use the training services provided by NNI to run trial jobs on the `local machine <./LocalMode.rst>`__\ , on `remote machines <./RemoteMachineMode.rst>`__\ , and on clusters like `PAI <./PaiMode.rst>`__\ , `Kubeflow <./KubeflowMode.rst>`__\ , `AdaptDL <./AdaptDLMode.rst>`__\ , `FrameworkController <./FrameworkControllerMode.rst>`__\ , `DLTS <./DLTSMode.rst>`__ and `AML <./AMLMode.rst>`__. These are called *built-in training services*.
If the computing resource you want to use is not listed above, NNI provides an interface that allows users to build their own training service easily. Please refer to "\ `how to implement training service <./HowToImplementTrainingService.rst>`__\ " for details.
How to use Training Service?
----------------------------
The training service needs to be chosen and configured properly in the experiment configuration YAML file. Users can refer to each training service's document for how to write the configuration. Also, the `reference <../Tutorial/ExperimentConfig.rst>`__ provides more details on the specification of the experiment configuration file.
Next, users should prepare code directory, which is specified as ``codeDir`` in config file. Please note that in non-local mode, the code directory will be uploaded to remote or cluster before the experiment. Therefore, we limit the number of files to 2000 and total size to 300MB. If the code directory contains too many files, users can choose which files and subfolders should be excluded by adding a ``.nniignore`` file that works like a ``.gitignore`` file. For more details on how to write this file, see :githublink:`this example <examples/trials/mnist-tfv1/.nniignore>` and the `git documentation <https://git-scm.com/docs/gitignore#_pattern_format>`__.
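For instance, a hypothetical ``.nniignore`` placed in the code directory might exclude data and checkpoint folders like this (the directory names below are only examples):

.. code-block:: text

   # exclude the local dataset and training checkpoints from upload
   data/
   checkpoints/
   *.ckpt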
If users intend to use large files in their experiment (like large-scale datasets) and are not using local mode, they can either: 1) download the data before each trial launches by putting the download into the trial command; or 2) use a shared storage that is accessible to worker nodes. Usually, training platforms are equipped with shared storage, and NNI allows users to use them easily. Refer to the docs of each built-in training service for details.
Built-in Training Services
--------------------------
.. list-table::
:header-rows: 1
:widths: auto
* - TrainingService
- Brief Introduction
* - `**Local** <./LocalMode.rst>`__
- NNI supports running an experiment on the local machine, called local mode. Local mode means that NNI will run the trial jobs and the nniManager process on the same machine, with GPU scheduling support for trial jobs.
* - `**Remote** <./RemoteMachineMode.rst>`__
- NNI supports running an experiment on multiple machines through the SSH channel, called remote mode. NNI assumes that you have access to those machines and have already set up the environment for running deep learning training code. NNI will submit the trial jobs to the remote machines and, if specified, schedule a suitable machine with enough GPU resources.
* - `**PAI** <./PaiMode.rst>`__
- NNI supports running an experiment on `OpenPAI <https://github.com/Microsoft/pai>`__ (aka PAI), called PAI mode. Before starting to use NNI PAI mode, you should have an account to access an `OpenPAI <https://github.com/Microsoft/pai>`__ cluster. See `here <https://github.com/Microsoft/pai#how-to-deploy>`__ if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In PAI mode, your trial program will run in PAI's container created by Docker.
* - `**Kubeflow** <./KubeflowMode.rst>`__
- NNI supports running experiment on `Kubeflow <https://github.com/kubeflow/kubeflow>`__\ , called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or `Azure Kubernetes Service(AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , a Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, `here <https://kubernetes.io/docs/tutorials/kubernetes-basics/>`__ is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.
* - `**AdaptDL** <./AdaptDLMode.rst>`__
- NNI supports running experiments on `AdaptDL <https://github.com/petuum/adaptdl>`__\ , called AdaptDL mode. Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster.
* - `**FrameworkController** <./FrameworkControllerMode.rst>`__
- NNI supports running experiments using `FrameworkController <https://github.com/Microsoft/frameworkcontroller>`__\ , called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes; you don't need to install a Kubeflow operator for a specific deep learning framework, like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiments.
* - `**DLTS** <./DLTSMode.rst>`__
- NNI supports running experiment using `DLTS <https://github.com/microsoft/DLWorkspace.git>`__\ , which is an open source toolkit, developed by Microsoft, that allows AI scientists to spin up an AI cluster in turn-key fashion.
* - `**AML** <./AMLMode.rst>`__
- NNI supports running an experiment on `AML <https://azure.microsoft.com/en-us/services/machine-learning/>`__\ , called aml mode.
What does Training Service do?
------------------------------
.. raw:: html
<p align="center">
<img src="https://user-images.githubusercontent.com/23273522/51816536-ed055580-2301-11e9-8ad8-605a79ee1b9a.png" alt="drawing" width="700"/>
</p>
According to the architecture shown in the `Overview <../Overview.rst>`__\ , the training service (platform) is actually responsible for three tasks: 1) initiating a new trial; 2) collecting metrics and communicating with the NNI core (NNI manager); 3) monitoring trial job status. To demonstrate in detail how a training service works, we show its workflow from the very beginning to the moment when the first trial succeeds.
Step 1. **Validate config and prepare the training platform.** The training service will first check whether the training platform the user specified is valid (e.g., whether there is anything wrong with authentication). After that, the training service will start to prepare for the experiment by making the code directory (\ ``codeDir``\ ) accessible to the training platform.
.. Note:: Different training services have different ways to handle ``codeDir``. For example, the local training service directly runs trials in ``codeDir``. The remote training service packs ``codeDir`` into a zip and uploads it to each machine. K8S-based training services copy ``codeDir`` onto a shared storage, which is either provided by the training platform itself or configured by users in the config file.
Step 2. **Submit the first trial.** To initiate a trial, usually (in non-reuse mode), NNI copies a few more files (including parameters, the launch script, etc.) onto the training platform. After that, NNI launches the trial, e.g., through a subprocess, SSH, or a RESTful API.
.. Warning:: The working directory of the trial command has exactly the same content as ``codeDir``, but can have a different path (even on different machines). Local mode is the only training service that shares one ``codeDir`` across all trials. The other training services copy ``codeDir`` from the shared copy prepared in step 1, and each trial has an independent working directory. We strongly advise users not to rely on the shared behavior in local mode, as it will make your experiments difficult to scale to other training services.
Step 3. **Collect metrics.** NNI then monitors the status of the trial, updates the recorded status (e.g., from ``WAITING`` to ``RUNNING``\ , ``RUNNING`` to ``SUCCEEDED``\ ), and collects the metrics. Currently, most training services are implemented in an "active" way, i.e., the training service calls the RESTful API on the NNI manager to update the metrics. Note that this usually requires the machine that runs the NNI manager to be accessible from the worker node.
.. role:: raw-html(raw)
:format: html
**Run an Experiment on OpenPAI**
====================================
NNI supports running an experiment on `OpenPAI <https://github.com/Microsoft/pai>`__\ , called pai mode. Before starting to use NNI pai mode, you should have an account to access an `OpenPAI <https://github.com/Microsoft/pai>`__ cluster. See `here <https://github.com/Microsoft/pai#how-to-deploy>`__ if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in a Docker container created on PAI.
Setup environment
-----------------
**Step 1. Install NNI, follow the install guide** `here <../Tutorial/QuickStart.rst>`__.
**Step 2. Get token.**
Open the web portal of OpenPAI, and click the ``My profile`` button at the top-right corner.
.. image:: ../../img/pai_profile.jpg
:scale: 80%
Click the ``copy`` button on the page to copy a JWT token.
.. image:: ../../img/pai_token.jpg
:scale: 67%
**Step 3. Mount NFS storage to local machine.**
Click the ``Submit job`` button in the web portal.
.. image:: ../../img/pai_job_submission_page.jpg
:scale: 50%
Find the data management region on the job submission page.
.. image:: ../../img/pai_data_management_page.jpg
:scale: 33%
The ``Preview container paths`` field shows the NFS host and path that OpenPAI provides. You need to mount the corresponding host and path to your local machine first; then NNI can use OpenPAI's NFS storage.\ :raw-html:`<br>`
For example, use the following command:
.. code-block:: bash
sudo mount -t nfs4 gcr-openpai-infra02:/pai/data /local/mnt
Then the ``/data`` folder in the container will be mounted to the ``/local/mnt`` folder on your local machine.\ :raw-html:`<br>`
You could use the following configuration in your NNI's config file:
.. code-block:: yaml
nniManagerNFSMountPath: /local/mnt
**Step 4. Get OpenPAI's storage config name and nniManagerNFSMountPath**
The ``Team share storage`` field is the storage configuration used to specify storage values in OpenPAI. You can get the ``paiStorageConfigName`` and ``containerNFSMountPath`` fields from ``Team share storage``\ , for example:
.. code-block:: yaml
paiStorageConfigName: confignfs-data
containerNFSMountPath: /mnt/confignfs-data
Run an experiment
-----------------
Use ``examples/trials/mnist-annotation`` as an example. The NNI config YAML file's content is like:
.. code-block:: yaml
authorName: your_name
experimentName: auto_mnist
# how many trials could be concurrently running
trialConcurrency: 2
# maximum experiment running duration
maxExecDuration: 3h
# empty means never stop
maxTrialNum: 100
# choice: local, remote, pai
trainingServicePlatform: pai
# search space file
searchSpacePath: search_space.json
# choice: true, false
useAnnotation: true
tuner:
builtinTunerName: TPE
classArgs:
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: ~/nni/examples/trials/mnist-annotation
gpuNum: 0
cpuNum: 1
memoryMB: 8196
image: msranni/nni:latest
virtualCluster: default
nniManagerNFSMountPath: /local/mnt
containerNFSMountPath: /mnt/confignfs-data
paiStorageConfigName: confignfs-data
# Configuration to access OpenPAI Cluster
paiConfig:
userName: your_pai_nni_user
token: your_pai_token
host: 10.1.1.1
# optional, experimental feature.
reuse: true
Note: You should set ``trainingServicePlatform: pai`` in the NNI config YAML file if you want to start an experiment in pai mode. The host field in the configuration file is the URI of PAI's job submission page, like ``10.10.5.1``. The default protocol in NNI is ``http``\ ; if your PAI cluster has HTTPS enabled, please use the URI in the ``https://10.10.5.1`` format.
Trial configurations
^^^^^^^^^^^^^^^^^^^^
Compared with `LocalMode <LocalMode.rst>`__ and `RemoteMachineMode <RemoteMachineMode.rst>`__\ , the ``trial`` configuration in pai mode has the following additional keys:
* cpuNum

  Optional key. Should be a positive number based on your trial program's CPU requirement. If it is not set in the trial configuration, it should be set in the config file specified in the ``paiConfigPath`` field.

* memoryMB

  Optional key. Should be a positive number based on your trial program's memory requirement. If it is not set in the trial configuration, it should be set in the config file specified in the ``paiConfigPath`` field.

* image

  Optional key. In pai mode, your trial program will be scheduled by OpenPAI to run in a `Docker container <https://www.docker.com/>`__. This key is used to specify the Docker image used to create the container in which your trial will run.
  We have already built a Docker image :githublink:`msranni/nni <deployment/docker/Dockerfile>`. You can either use this image directly in your config file or build your own image based on it. If it is not set in the trial configuration, it should be set in the config file specified in the ``paiConfigPath`` field.

* virtualCluster

  Optional key. Sets the virtualCluster of OpenPAI. If omitted, the job will run on the default virtual cluster.

* nniManagerNFSMountPath

  Required key. Sets the mount path on your nniManager machine.

* containerNFSMountPath

  Required key. Sets the mount path in the container used in OpenPAI.

* paiStorageConfigName

  Optional key. Sets the storage name used in OpenPAI. If it is not set in the trial configuration, it should be set in the config file specified in the ``paiConfigPath`` field.

* command

  Optional key. Sets the command used in the OpenPAI container.

* paiConfigPath

  Optional key. Sets the file path of the OpenPAI job configuration; the file is in YAML format.
If users set ``paiConfigPath`` in NNI's configuration file, there is no need to specify the fields ``command``\ , ``paiStorageConfigName``\ , ``virtualCluster``\ , ``image``\ , ``memoryMB``\ , ``cpuNum``\ , and ``gpuNum`` in the ``trial`` configuration. These fields will use the values from the config file specified by ``paiConfigPath``.
Note:

#. The job name in OpenPAI's configuration file will be replaced by a new job name created by NNI; the name format is ``nni_exp_${this.experimentId}_trial_${trialJobId}``.
#. If users set multiple taskRoles in OpenPAI's configuration file, NNI will wrap all of these taskRoles and start multiple tasks in one trial job. Users should ensure that only one taskRole reports metrics to NNI; otherwise, conflict errors may occur.
OpenPAI configurations
^^^^^^^^^^^^^^^^^^^^^^
``paiConfig`` includes the OpenPAI-specific configurations:

* userName

  Required key. The user name of the OpenPAI platform.

* token

  Required key. The authentication key of the OpenPAI platform.

* host

  Required key. The host of the OpenPAI platform. It's the URI of OpenPAI's job submission page, like ``10.10.5.1``. The default protocol in NNI is ``http``\ ; if your OpenPAI cluster has HTTPS enabled, please use the URI in the ``https://10.10.5.1`` format.

* reuse (experimental feature)

  Optional key, default is false. If it's true, NNI will reuse OpenPAI jobs to run as many trials as possible. It can save the time of creating new jobs. Users need to make sure that each trial can run independently in the same job; for example, avoid loading checkpoints from previous trials.
Once you have filled in the NNI experiment config file and saved it (for example, as exp_pai.yml), run the following command
.. code-block:: bash
nnictl create --config exp_pai.yml
to start the experiment in pai mode. NNI will create an OpenPAI job for each trial, and the job name format is something like ``nni_exp_{experiment_id}_trial_{trial_id}``.
You can see jobs created by NNI in the OpenPAI cluster's web portal, like:
.. image:: ../../img/nni_pai_joblist.jpg
:target: ../../img/nni_pai_joblist.jpg
:alt:
Notice: In pai mode, NNIManager will start a RESTful server and listen on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is ``8080``\ , the RESTful server will listen on ``8081`` to receive metrics from trial jobs running on OpenPAI. So you should open TCP port ``8081`` in your firewall rules to allow incoming traffic.
Once a trial job is completed, you can go to NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.
Expand a trial's information in the trial list view, and click the logPath link, like:
.. image:: ../../img/nni_webui_joblist.jpg
:scale: 30%
And you will be redirected to the HDFS web portal to browse the output files of that trial in HDFS:
.. image:: ../../img/nni_trial_hdfs_output.jpg
:scale: 80%
You can see there are three files in the output folder: stderr, stdout, and trial.log.
Data management
---------------
Before using NNI to start your experiment, users should set the corresponding data mount path on the nniManager machine. OpenPAI has its own storage (NFS, AzureBlob, etc.), and the storage used in OpenPAI will be mounted to the container when it starts a job. Users should set the OpenPAI storage type via the ``paiStorageConfigName`` field to choose a storage in OpenPAI. Then users should mount the storage to their nniManager machine and set the ``nniManagerNFSMountPath`` field in the configuration file. NNI will generate bash files, copy the data in ``codeDir`` to the ``nniManagerNFSMountPath`` folder, and then start a trial job. The data in ``nniManagerNFSMountPath`` will be synced to OpenPAI storage and mounted into OpenPAI's container. The data path in the container is set in ``containerNFSMountPath``\ ; NNI will enter this folder first and then run scripts to start a trial job.
Version check
-------------
NNI has supported the version check feature since version 0.6. It is a policy to ensure that the version of NNIManager is consistent with that of trialKeeper, avoiding errors caused by version incompatibility.
Check policy:
#. NNIManager before v0.6 could run any version of trialKeeper; trialKeeper supports backward compatibility.
#. Since version 0.6, the NNIManager version should be the same as the trialKeeper version. For example, if the NNIManager version is 0.6, the trialKeeper version should be 0.6 too.
#. Note that the version check only checks the first two digits of the version. For example, NNIManager v0.6.1 could use trialKeeper v0.6 or trialKeeper v0.6.2, but could not use trialKeeper v0.5.1 or trialKeeper v0.7.
If you cannot run your experiment and want to know whether it is caused by the version check, you can check the WebUI, where there will be an error message about the version check.
.. image:: ../../img/version_check.png
:scale: 80%
.. role:: raw-html(raw)
:format: html
**Run an Experiment on OpenpaiYarn**
========================================
The original ``pai`` mode has been changed to ``paiYarn`` mode, which is a distributed training platform based on Yarn.
Setup environment
-----------------
Install NNI, follow the install guide `here <../Tutorial/QuickStart.rst>`__.
Run an experiment
-----------------
Use ``examples/trials/mnist-tfv1`` as an example. The NNI config YAML file's content is like:
.. code-block:: yaml
authorName: your_name
experimentName: auto_mnist
# how many trials could be concurrently running
trialConcurrency: 2
# maximum experiment running duration
maxExecDuration: 3h
# empty means never stop
maxTrialNum: 100
# choice: local, remote, pai, paiYarn
trainingServicePlatform: paiYarn
# search space file
searchSpacePath: search_space.json
# choice: true, false
useAnnotation: false
tuner:
builtinTunerName: TPE
classArgs:
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: ~/nni/examples/trials/mnist-tfv1
gpuNum: 0
cpuNum: 1
memoryMB: 8196
image: msranni/nni:latest
# Configuration to access OpenpaiYarn Cluster
paiYarnConfig:
userName: your_paiYarn_nni_user
passWord: your_paiYarn_password
host: 10.1.1.1
Note: You should set ``trainingServicePlatform: paiYarn`` in the NNI config YAML file if you want to start an experiment in paiYarn mode.
Compared with `LocalMode <LocalMode.rst>`__ and `RemoteMachineMode <RemoteMachineMode.rst>`__\ , the trial configuration in paiYarn mode has these additional keys:
* cpuNum
* Required key. Should be a positive number based on your trial program's CPU requirement.
* memoryMB
* Required key. Should be a positive number based on your trial program's memory requirement.
* image
* Required key. In paiYarn mode, your trial program will be scheduled by OpenpaiYarn to run in a `Docker container <https://www.docker.com/>`__. This key is used to specify the Docker image used to create the container in which your trial will run.
* We have already built a Docker image :githublink:`msranni/nni <deployment/docker/Dockerfile>`. You can either use this image directly in your config file or build your own image based on it.
* virtualCluster
* Optional key. Sets the virtualCluster of OpenpaiYarn. If omitted, the job will run on the default virtual cluster.
* shmMB
* Optional key. Sets the shmMB configuration of OpenpaiYarn; it sets the shared memory for one task in the task role.
* authFile
* Optional key. Sets the auth file path for a private registry while using paiYarn mode (`refer here <https://github.com/microsoft/paiYarn/blob/2ea69b45faa018662bc164ed7733f6fdbb4c42b3/docs/faq.rst#q-how-to-use-private-docker-registry-job-image-when-submitting-an-openpaiYarn-job>`__\ ). You can prepare the authFile and simply provide the local path of this file; NNI will upload this file to HDFS for you.
* portList
* Optional key. Sets the portList configuration of OpenpaiYarn; it specifies a list of ports used in the container, `refer here <https://github.com/microsoft/paiYarn/blob/b2324866d0280a2d22958717ea6025740f71b9f0/docs/job_tutorial.rst#specification>`__.\ :raw-html:`<br>`
The config schema in NNI is shown below:
.. code-block:: yaml
portList:
- label: test
beginAt: 8080
portNumber: 2
Let's say you want to launch TensorBoard in the mnist example using one of these ports. The first step is to write a wrapper script ``launch_paiYarn.sh`` around ``mnist.py``.
.. code-block:: bash
export TENSORBOARD_PORT=paiYarn_PORT_LIST_${paiYarn_CURRENT_TASK_ROLE_NAME}_0_tensorboard
tensorboard --logdir . --port ${!TENSORBOARD_PORT} &
python3 mnist.py
The portList section of the config file should be filled in as follows:
.. code-block:: yaml
trial:
command: bash launch_paiYarn.sh
portList:
- label: tensorboard
beginAt: 0
portNumber: 1
NNI supports two kinds of authorization methods in paiYarn, password and paiYarn token; `refer here <https://github.com/microsoft/paiYarn/blob/b6bd2ab1c8890f91b7ac5859743274d2aa923c22/docs/rest-server/API.rst#2-authentication>`__. The authorization is configured in the ``paiYarnConfig`` field.\ :raw-html:`<br>`
For password authorization, the ``paiYarnConfig`` schema is:
.. code-block:: yaml
paiYarnConfig:
userName: your_paiYarn_nni_user
passWord: your_paiYarn_password
host: 10.1.1.1
For paiYarn token authorization, the ``paiYarnConfig`` schema is:
.. code-block:: yaml
paiYarnConfig:
userName: your_paiYarn_nni_user
token: your_paiYarn_token
host: 10.1.1.1
Once you have filled in the NNI experiment config file and saved it (for example, as exp_paiYarn.yml), run the following command
.. code-block:: bash
nnictl create --config exp_paiYarn.yml
to start the experiment in paiYarn mode. NNI will create an OpenpaiYarn job for each trial, and the job name format is something like ``nni_exp_{experiment_id}_trial_{trial_id}``.
You can see jobs created by NNI in the OpenpaiYarn cluster's web portal, like:
.. image:: ../../img/nni_pai_joblist.jpg
:target: ../../img/nni_pai_joblist.jpg
:alt:
Notice: In paiYarn mode, NNIManager will start a RESTful server and listen on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is ``8080``\ , the RESTful server will listen on ``8081`` to receive metrics from trial jobs running on OpenpaiYarn. So you should open TCP port ``8081`` in your firewall rules to allow incoming traffic.
Once a trial job is completed, you can go to NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.
Expand a trial's information in the trial list view, and click the logPath link, like:
.. image:: ../../img/nni_webui_joblist.jpg
:target: ../../img/nni_webui_joblist.jpg
:alt:
And you will be redirected to HDFS web portal to browse the output files of that trial in HDFS:
.. image:: ../../img/nni_trial_hdfs_output.jpg
:target: ../../img/nni_trial_hdfs_output.jpg
:alt:
You can see there are three files in the output folder: stderr, stdout, and trial.log.
Data management
---------------
If your training data is not too large, it can be put into codeDir, and NNI will upload the data to HDFS; alternatively, you can build your own Docker image containing the data. If you have a large dataset, it's not appropriate to put the data in codeDir, and you can follow the `guidance <https://github.com/microsoft/paiYarn/blob/master/docs/user/storage.rst>`__ to mount the data folder in the container.
If you also want to save the trial's other outputs to HDFS, like model files, you can use the environment variable ``NNI_OUTPUT_DIR`` in your trial code to save your own output files, and the NNI SDK will copy all the files in ``NNI_OUTPUT_DIR`` from the trial's container to HDFS; the target path is ``hdfs://host:port/{username}/nni/{experiments}/{experimentId}/trials/{trialId}/nnioutput``.
Version check
-------------
NNI has supported the version check feature since version 0.6. It is a policy to ensure that the version of NNIManager is consistent with that of trialKeeper, avoiding errors caused by version incompatibility.
Check policy:
#. NNIManager before v0.6 could run any version of trialKeeper; trialKeeper supports backward compatibility.
#. Since version 0.6, the NNIManager version should be the same as the trialKeeper version. For example, if the NNIManager version is 0.6, the trialKeeper version should be 0.6 too.
#. Note that the version check only checks the first two digits of the version. For example, NNIManager v0.6.1 could use trialKeeper v0.6 or trialKeeper v0.6.2, but could not use trialKeeper v0.5.1 or trialKeeper v0.7.
If you cannot run your experiment and want to know whether it is caused by the version check, you can check the WebUI, where there will be an error message about the version check.
.. image:: ../../img/version_check.png
:target: ../../img/version_check.png
:alt:
Run an Experiment on Remote Machines
====================================
NNI can run one experiment on multiple remote machines through SSH, called ``remote`` mode. It's like a lightweight training platform. In this mode, NNI is started from your computer and dispatches trials to the remote machines in parallel.
The supported OSes for remote machines are ``Linux``\ , ``Windows 10``\ , and ``Windows Server 2019``.
Requirements
------------
* Make sure the default environment of the remote machines meets the requirements of your trial code. If the default environment does not meet the requirements, the setup script can be added to the ``command`` field of the NNI config.
* Make sure the remote machines can be accessed through SSH from the machine that runs the ``nnictl`` command. Both password and key authentication of SSH are supported. For advanced usage, please refer to the `machineList part of the configuration <../Tutorial/ExperimentConfig.rst>`__.
* Make sure the NNI version on each machine is consistent.
* Make sure the command of the trial is compatible with the remote OSes, if you want to use remote Linux and Windows machines together. For example, the default python 3.x executable is called ``python3`` on Linux and ``python`` on Windows.
Linux
^^^^^
* Follow `installation <../Tutorial/InstallationLinux.rst>`__ to install NNI on the remote machine.
Windows
^^^^^^^
* Follow `installation <../Tutorial/InstallationWin.rst>`__ to install NNI on the remote machine.
* Install and start ``OpenSSH Server``.

  #. Open the ``Settings`` app on Windows.
  #. Click ``Apps``\ , then click ``Optional features``.
  #. Click ``Add a feature``\ , search for and select ``OpenSSH Server``\ , and then click ``Install``.
  #. Once it's installed, run the commands below to start the service and set it to start automatically.

     .. code-block:: bat

        sc config sshd start=auto
        net start sshd

* Make sure the remote account is an administrator, so that it can stop running trials.
* Make sure there is no welcome message beyond the default, since extra output causes the ssh2 library in Node.js to fail. For example, if you're using a Data Science VM on Azure, you need to remove the extra echo commands in ``C:\dsvm\tools\setup\welcome.bat``.
Output like the following is OK when opening a new command window.
.. code-block:: text
Microsoft Windows [Version 10.0.17763.1192]
(c) 2018 Microsoft Corporation. All rights reserved.
(py37_default) C:\Users\AzureUser>
Run an experiment
-----------------
For example, there are three machines that can be logged into with a username and password.
.. list-table::
:header-rows: 1
:widths: auto
* - IP
- Username
- Password
* - 10.1.1.1
- bob
- bob123
* - 10.1.1.2
- bob
- bob123
* - 10.1.1.3
- bob
- bob123
Install and run NNI on one of those three machines or on another machine that has network access to them.
Use ``examples/trials/mnist-annotation`` as the example. Below is the content of ``examples/trials/mnist-annotation/config_remote.yml``\ :
.. code-block:: yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: remote
# search space file
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: true
tuner:
#choice: TPE, Random, Anneal, Evolution, BatchTuner
#SMAC (SMAC should be installed through nnictl)
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: .
gpuNum: 0
#machineList can be empty if the platform is local
machineList:
- ip: 10.1.1.1
username: bob
passwd: bob123
#port can be skip if using default ssh port 22
#port: 22
- ip: 10.1.1.2
username: bob
passwd: bob123
- ip: 10.1.1.3
username: bob
passwd: bob123
Files in ``codeDir`` will be uploaded to the remote machines automatically. You can run the command below on Windows, Linux, or macOS to spawn trials on remote Linux machines:
.. code-block:: bash
nnictl create --config examples/trials/mnist-annotation/config_remote.yml
Configure python environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
By default, commands and scripts will be executed in the default environment of the remote machine. If there are multiple python virtual environments on your remote machine and you want to run experiments in a specific environment, use **preCommand** to specify a python environment on your remote machine.
Use ``examples/trials/mnist-tfv2`` as the example. Below is the content of ``examples/trials/mnist-tfv2/config_remote.yml``\ :
.. code-block:: yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: remote
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner
#SMAC (SMAC should be installed through nnictl)
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: .
gpuNum: 0
#machineList can be empty if the platform is local
machineList:
- ip: ${replace_to_your_remote_machine_ip}
username: ${replace_to_your_remote_machine_username}
sshKeyPath: ${replace_to_your_remote_machine_sshKeyPath}
# Pre-command will be executed before the remote machine executes other commands.
# Below is an example of specifying python environment.
# If you want to execute multiple commands, please use "&&" to connect them.
# preCommand: source ${replace_to_absolute_path_recommended_here}/bin/activate
# preCommand: source ${replace_to_conda_path}/bin/activate ${replace_to_conda_env_name}
preCommand: export PATH=${replace_to_python_environment_path_in_your_remote_machine}:$PATH
The **preCommand** will be executed before the remote machine executes other commands. So you can configure the python environment path like this:
.. code-block:: yaml
# Linux remote machine
preCommand: export PATH=${replace_to_python_environment_path_in_your_remote_machine}:$PATH
# Windows remote machine
preCommand: set path=${replace_to_python_environment_path_in_your_remote_machine};%path%
Or if you want to activate the ``virtualenv`` environment:
.. code-block:: yaml
# Linux remote machine
preCommand: source ${replace_to_absolute_path_recommended_here}/bin/activate
# Windows remote machine
preCommand: ${replace_to_absolute_path_recommended_here}\\scripts\\activate
Or if you want to activate the ``conda`` environment:
.. code-block:: yaml
# Linux remote machine
preCommand: source ${replace_to_conda_path}/bin/activate ${replace_to_conda_env_name}
# Windows remote machine
preCommand: call activate ${replace_to_conda_env_name}
If you want multiple commands to be executed, you can use ``&&`` to connect these commands:
.. code-block:: yaml
preCommand: command1 && command2 && command3
**Note**\ : Because **preCommand** will be executed before other commands each time, it is strongly recommended not to set a **preCommand** that makes changes to the system, e.g. ``mkdir`` or ``touch``.