`Metis <https://www.microsoft.com/en-us/research/publication/metis-robustly-tuning-tail-latencies-cloud-systems/>`__ offers several benefits over other tuning algorithms. While most tools only predict the optimal configuration, Metis gives you two outputs: a prediction for the optimal configuration and a suggestion for the next trial. No more guesswork!
While most tools assume training datasets do not have noisy data, Metis actually tells you if you need to resample a particular hyper-parameter.
While most tools tend to be exploitation-heavy, Metis' search strategy balances exploration, exploitation, and (optional) resampling.
Metis belongs to the class of sequential model-based optimization (SMBO) algorithms and is based on the Bayesian Optimization framework. To model the parameter-vs-performance space, Metis uses both a Gaussian Process (GP) and a Gaussian Mixture Model (GMM). Since each trial can impose a high time cost, Metis trades extra inference computation for fewer naive trials. At each iteration, Metis performs two tasks:
* It finds the global optimal point in the Gaussian Process space. This point represents the optimal configuration.
* It identifies the next hyper-parameter candidate, by inferring the potential information gain of exploration, exploitation, and resampling (a simplified sketch of the loop follows below).
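The sketch below illustrates such an SMBO loop in the spirit of Metis, not NNI's actual implementation: a scikit-learn Gaussian Process models a one-dimensional configuration space against a synthetic performance function, the GP-mean optimum stands in for the predicted optimal configuration, and an upper-confidence-bound score stands in for the information-gain analysis that picks the next trial.

.. code-block:: python

   import numpy as np
   from sklearn.gaussian_process import GaussianProcessRegressor

   rng = np.random.default_rng(0)

   def performance(x):
       # Synthetic stand-in for a real trial's measured performance.
       return np.sin(3 * x)

   observed_x = rng.uniform(0.0, 1.0, size=(5, 1))  # configurations tried so far
   observed_y = performance(observed_x[:, 0])       # their measured performance

   for _ in range(10):
       gp = GaussianProcessRegressor().fit(observed_x, observed_y)
       candidates = rng.uniform(0.0, 1.0, size=(200, 1))
       mean, std = gp.predict(candidates, return_std=True)

       best_config = candidates[np.argmax(mean)]              # output 1: predicted optimum
       next_trial = candidates[np.argmax(mean + 1.96 * std)]  # output 2: next trial (UCB)

       observed_x = np.vstack([observed_x, next_trial[None, :]])
       observed_y = np.append(observed_y, performance(next_trial[0]))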
Note that the only acceptable types within the search space are ``quniform``\ , ``uniform``\ , ``randint``\ , and numerical ``choice``.
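For example, a search space restricted to these types might look like the following hypothetical ``search_space.json``, written here as its Python equivalent:

.. code-block:: python

   # A hypothetical search space using only the types Metis accepts.
   search_space = {
       "batch_size":   {"_type": "choice",   "_value": [16, 32, 64]},  # numerical choice only
       "dropout_rate": {"_type": "uniform",  "_value": [0.1, 0.5]},
       "hidden_size":  {"_type": "quniform", "_value": [64, 512, 64]},
       "num_layers":   {"_type": "randint",  "_value": [1, 4]},
   }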
More details can be found in our `paper <https://www.microsoft.com/en-us/research/publication/metis-robustly-tuning-tail-latencies-cloud-systems/>`__.
Population Based Training (PBT) comes from `Population Based Training of Neural Networks <https://arxiv.org/abs/1711.09846v1>`__. It's a simple asynchronous optimization algorithm which effectively utilizes a fixed computational budget to jointly optimize a population of models and their hyperparameters to maximize performance. Importantly, PBT discovers a schedule of hyperparameter settings rather than following the generally sub-optimal strategy of trying to find a single fixed set to use for the whole course of training.
.. image:: ../../img/pbt.jpg
:target: ../../img/pbt.jpg
:alt:
PBTTuner initializes a population with several trials (i.e., ``population_size``). The figure above shows four steps; each trial runs only one step at a time. The length of one step is controlled by the trial code, e.g., one epoch. When a trial starts, it loads a checkpoint specified by PBTTuner, runs one step, saves a checkpoint to a directory specified by PBTTuner, and exits. The trials in a population run steps synchronously; that is, only after all trials finish the ``i``-th step can the ``(i+1)``-th step start. Exploitation and exploration in PBT happen between two consecutive steps.
Provide checkpoint directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Since some trials need to load other trials' checkpoints, users should provide a directory (i.e., ``all_checkpoint_dir``) which is accessible by every trial. This is easy in local mode: users can directly use the default directory or specify any directory on the local machine. For other training services, users should follow `the document of those training services <../TrainingService/Overview.rst>`__ to provide a directory in shared storage, such as NFS or Azure storage.
Modify your trial code
^^^^^^^^^^^^^^^^^^^^^^
Before running a step, a trial needs to load a checkpoint; the checkpoint directory is specified in the hyper-parameter configuration generated by PBTTuner, i.e., ``params['load_checkpoint_dir']``. Similarly, the directory for saving a checkpoint is also included in the configuration, i.e., ``params['save_checkpoint_dir']``. Here, ``all_checkpoint_dir`` is the base folder of ``load_checkpoint_dir`` and ``save_checkpoint_dir``, whose format is ``all_checkpoint_dir/<population-id>/<step>``.
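A minimal sketch of such a trial follows; ``build_model``, ``train_one_step``, and ``evaluate`` are hypothetical helpers, while the two checkpoint-directory keys come from PBTTuner as described above.

.. code-block:: python

   import os
   import nni

   params = nni.get_next_parameter()  # includes the two checkpoint directories

   model = build_model(params)        # hypothetical model constructor
   ckpt = os.path.join(params['load_checkpoint_dir'], 'model.ckpt')
   if os.path.isfile(ckpt):
       model.load(ckpt)               # resume from the checkpoint chosen by PBTTuner

   train_one_step(model)              # one "step", e.g. one epoch

   os.makedirs(params['save_checkpoint_dir'], exist_ok=True)
   model.save(os.path.join(params['save_checkpoint_dir'], 'model.ckpt'))

   nni.report_final_result(evaluate(model))  # metric used for exploit/explore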
This is a tuner geared for NNI's Neural Architecture Search (NAS) interface. It uses the `ppo algorithm <https://arxiv.org/abs/1707.06347>`__. The implementation inherits the main logic of the ppo2 OpenAI implementation `here <https://github.com/openai/baselines/tree/master/baselines/ppo2>`__ and is adapted for the NAS scenario.
We successfully tuned the mnist-nas example and obtained the following result:
**NOTE: we are refactoring this example to the latest NAS interface, will publish the example codes after the refactor.**
.. image:: ../../img/ppo_mnist.png
:target: ../../img/ppo_mnist.png
:alt:
We also tuned :githublink:`the macro search space for image classification in the ENAS paper <examples/trials/nas_cifar10>` (with a limited epoch number for each trial, i.e., 8 epochs), which is implemented using the NAS interface and tuned with PPOTuner. Here is Figure 7 from the `ENAS paper <https://arxiv.org/pdf/1802.03268.pdf>`__ to show what the search space looks like:
.. image:: ../../img/enas_search_space.png
:target: ../../img/enas_search_space.png
:alt:
The figure above shows a chosen architecture. Each square is a layer whose operation is chosen from 6 options. Each dashed line is a skip connection; each square layer can choose 0 or 1 skip connection, getting the output from a previous layer. **Note that** in the original macro search space, each square layer could choose any number of skip connections, while in our implementation only 0 or 1 is allowed.
The results are shown in the figure below (see the experimental config :githublink:`here <examples/trials/nas_cifar10/config_ppo.yml>`):
`SMAC <https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf>`__ is based on Sequential Model-Based Optimization (SMBO). It adapts the most prominent previously used model class (Gaussian stochastic process models) and introduces the model class of random forests to SMBO in order to handle categorical parameters. The SMAC supported by NNI is a wrapper of `the SMAC3 GitHub repo <https://github.com/automl/SMAC3>`__.

Note that SMAC on NNI only supports a subset of the types in the `search space spec <../Tutorial/SearchSpaceSpec.rst>`__: ``choice``, ``randint``, ``uniform``, ``loguniform``, and ``quniform``.
To improve user experience and reduce user effort, we designed an annotation grammar. Using NNI annotation, users can adapt their code to NNI just by adding some standalone annotation strings, which do not affect the execution of the original code.
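For example, consider annotating a learning rate (this is the example the next paragraph walks through):

.. code-block:: python

   '''@nni.variable(nni.choice(0.1, 0.01, 0.001), name=learning_rate)'''
   learning_rate = 0.1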
The meaning of this example is that NNI will choose one of several values (0.1, 0.01, 0.001) to assign to the ``learning_rate`` variable. Specifically, the first line is an NNI annotation, which is a single string. Following it is an assignment statement. What NNI does here is replace the right-hand value of this assignment statement according to the information provided by the annotation line.

In this way, users can either run the Python code directly or launch NNI to tune hyper-parameters in this code, without changing any code.
Types of Annotation:
--------------------
In NNI, there are mainly four types of annotation:
1. Annotate variables
^^^^^^^^^^^^^^^^^^^^^
``'''@nni.variable(sampling_algo, name)'''``
``@nni.variable`` is used in NNI to annotate a variable.
**Arguments**
* **sampling_algo**\ : Sampling algorithm that specifies a search space. Users should replace it with a built-in NNI sampling function whose name consists of an ``nni.`` identification and a search space type specified in `SearchSpaceSpec <SearchSpaceSpec.rst>`__, such as ``choice`` or ``uniform``.
* **name**\ : The name of the variable that the selected value will be assigned to. Note that this argument should be the same as the left-hand side of the following assignment statement.
There are 10 types to express your search space; four of them are shown below (see `SearchSpaceSpec <SearchSpaceSpec.rst>`__ for the full list):
* ``randint(low, high)``: the variable value is a value like round(uniform(low, high)). For now, the type of the chosen value is float; if you want an integer value, please convert it explicitly.
* ``quniform(low, high, q)``: the variable value is a value like clip(round(uniform(low, high) / q) * q, low, high), where the clip operation constrains the generated value within the bounds.
* ``loguniform(low, high)``: the variable value is drawn according to exp(uniform(low, high)) so that the logarithm of the return value is uniformly distributed.
* ``qloguniform(low, high, q)``: the variable value is a value like clip(round(loguniform(low, high) / q) * q, low, high), where the clip operation constrains the generated value within the bounds.
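For instance, a ``quniform`` annotation might look like this (the values are illustrative):

.. code-block:: python

   '''@nni.variable(nni.quniform(16, 128, 16), name=batch_size)'''
   batch_size = 64
   batch_size = int(batch_size)  # the chosen value is a float; convert explicitly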
2. Annotate functions
^^^^^^^^^^^^^^^^^^^^^

``'''@nni.function_choice(functions, name)'''``

``@nni.function_choice`` is used to choose one from several functions.
**Arguments**
* **functions**\ : Several functions to be selected from. Note that each should be a complete function call with arguments, such as ``max_pool(hidden_layer, pool_size)``.
* **name**\ : The name of the function that will be replaced in the following assignment statement.
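For example, to let NNI choose between two pooling functions (``avg_pool`` is an assumed alternative here; ``max_pool(hidden_layer, pool_size)`` is taken from the argument description above):

.. code-block:: python

   '''@nni.function_choice(max_pool(hidden_layer, pool_size), avg_pool(hidden_layer, pool_size), name=max_pool)'''
   h_pooling = max_pool(hidden_layer, pool_size)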
3. Annotate intermediate result
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``'''@nni.report_intermediate_result(metrics)'''``

``@nni.report_intermediate_result`` is used to report an intermediate result, whose usage is the same as ``nni.report_intermediate_result`` in the doc `Write a trial run on NNI <../TrialExample/Trials.rst>`__.
4. Annotate final result
^^^^^^^^^^^^^^^^^^^^^^^^
``'''@nni.report_final_result(metrics)'''``
``@nni.report_final_result`` is used to report the final result of the current trial, whose usage is the same as ``nni.report_final_result`` in the doc `Write a trial run on NNI <../TrialExample/Trials.rst>`__.
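A sketch of how the two reporting annotations fit into a training loop (``train_one_epoch`` and ``evaluate`` are hypothetical helpers):

.. code-block:: python

   for epoch in range(10):
       train_one_epoch(model)      # hypothetical training helper
       test_acc = evaluate(model)  # hypothetical evaluation helper
       '''@nni.report_intermediate_result(test_acc)'''
   '''@nni.report_final_result(test_acc)'''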
Great!! We are always on the lookout for more contributors to our code base.
Firstly, if you are unsure or afraid of anything, just ask or submit the issue or pull request anyway. You won't be yelled at for giving your best effort. The worst that can happen is that you'll be politely asked to change something. We appreciate any sort of contribution and don't want a wall of rules to get in the way of that.
However, for those individuals who want a bit more guidance on the best way to contribute to the project, read on. This document will cover all the points we're looking for in your contributions, raising your chances of quickly merging or addressing your contributions.
Looking for a quickstart? Get acquainted with our `Get Started <QuickStart.rst>`__ guide.
There are a few simple guidelines that you need to follow before providing your hacks.
Raising Issues
--------------
When raising issues, please specify the following:
* Setup details, filled in clearly as specified in the issue template, for the reviewer to check.
* A scenario where the issue occurred (with details on how to reproduce it).
* Errors and log messages that are displayed by the software.
* Any other details that might be useful.
Submit Proposals for New Features
---------------------------------
* There is always something more that is required, to make it easier to suit your use-cases. Feel free to join the discussion on new features or raise a PR with your proposed change.
* Fork the repository under your own GitHub handle and clone it. Add, commit, squash (if necessary), and push the changes with detailed commit messages to your fork, from where you can proceed to make a pull request.
Contributing to Source Code and Bug Fixes
-----------------------------------------
Provide PRs with appropriate tags for bug fixes or enhancements to the source code. Follow the correct naming conventions and code styles as you work, and try to address all code-review comments along the way.

If you are looking for how to develop and debug the NNI source code, you can refer to the `How to set up NNI developer environment doc <./SetupNniDeveloperEnvironment.rst>`__ file in the ``docs`` folder.

Similarly for the `Quick Start <QuickStart.rst>`__. For everything else, refer to the `NNI Home page <http://nni.readthedocs.io>`__.
Solve Existing Issues
---------------------
Head over to `issues <https://github.com/Microsoft/nni/issues>`__ to find issues where help is needed from contributors. You can find issues tagged with 'good-first-issue' or 'help-wanted' to contribute to.

A person looking to contribute can take up an issue by claiming it in a comment or having their GitHub ID assigned to it. If there is no PR or update in progress on the issue for a week, it reopens for anyone to take up again. High-priority issues and regressions are an exception; for those, the response time should be about a day.
Code Styles & Naming Conventions
--------------------------------
* We follow `PEP8 <https://www.python.org/dev/peps/pep-0008/>`__ for Python code and naming conventions; do try to adhere to it when making a pull request or making a change. One can also take the help of linters such as ``flake8`` or ``pylint``.
* We also follow `NumPy Docstring Style <https://www.sphinx-doc.org/en/master/usage/extensions/example_numpy.html#example-numpy>`__ for Python Docstring Conventions. During the `documentation building <Contributing.rst#documentation>`__\ , we use `sphinx.ext.napoleon <https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html>`__ to generate Python API documentation from Docstring.
* For docstrings, please refer to `numpydoc docstring guide <https://numpydoc.readthedocs.io/en/latest/format.html>`__ and `pandas docstring guide <https://python-sprints.github.io/pandas/guide/pandas_docstring.html>`__
* For function docstrings, **description**, **Parameters**, and **Returns**/**Yields** are mandatory.
* For class docstrings, **description** and **Attributes** are mandatory.
* For docstrings describing ``dict``, which is commonly used in our hyper-param format description, please refer to the RiboKit `Internal Guideline on Writing Standards <https://ribokit.github.io/docs/text/>`__.
Documentation
-------------
Our documentation is built with :githublink:`sphinx <docs>`.
* Before submitting the documentation change, please **build homepage locally**: ``cd docs/en_US && make html``; then you can see all the built documentation webpages under the folder ``docs/en_US/_build/html``. It's also highly recommended to take care of **every WARNING** during the build, which is very likely the signal of a **dead link** and other annoying issues.
* For links, please consider using **relative paths** first. However, if the documentation is written in Markdown format, and:

  * It's an image link which needs to be formatted with embedded HTML grammar, please use a global URL like ``https://user-images.githubusercontent.com/44491713/51381727-e3d0f780-1b4f-11e9-96ab-d26b9198ba65.png``, which can be generated automatically by dragging the picture onto the `Github Issue <https://github.com/Microsoft/nni/issues/new>`__ box.
  * It cannot be re-formatted by sphinx, such as source code, please use its global URL. For source code that links to our github repo, please use URLs rooted at ``https://github.com/Microsoft/nni/tree/v1.9/`` (:githublink:`mnist.py <examples/trials/mnist-tfv1/mnist.py>` for example).
Template
--------
* **Light weight (without Annotation and Assessor)**
.. code-block:: yaml

   authorName:
   experimentName:
   trialConcurrency:
   maxExecDuration:
   maxTrialNum:
   #choice: local, remote, pai, kubeflow
   trainingServicePlatform:
   searchSpacePath:
   #choice: true, false, default: false
   useAnnotation:
   #choice: true, false, default: false
   multiThread:
   tuner:
     #choice: TPE, Random, Anneal, Evolution
     builtinTunerName:
     classArgs:
       #choice: maximize, minimize
       optimize_mode:
     gpuIndices:
   trial:
     command:
     codeDir:
     gpuNum:
   #machineList can be empty if the platform is local
   machineList:
     - ip:
       port:
       username:
       passwd:
* **Use Assessor**
.. code-block:: yaml

   authorName:
   experimentName:
   trialConcurrency:
   maxExecDuration:
   maxTrialNum:
   #choice: local, remote, pai, kubeflow
   trainingServicePlatform:
   searchSpacePath:
   #choice: true, false, default: false
   useAnnotation:
   #choice: true, false, default: false
   multiThread:
   tuner:
     #choice: TPE, Random, Anneal, Evolution
     builtinTunerName:
     classArgs:
       #choice: maximize, minimize
       optimize_mode:
     gpuIndices:
   assessor:
     #choice: Medianstop
     builtinAssessorName:
     classArgs:
       #choice: maximize, minimize
       optimize_mode:
   trial:
     command:
     codeDir:
     gpuNum:
   #machineList can be empty if the platform is local
   machineList:
     - ip:
       port:
       username:
       passwd:
* **Use Annotation**
.. code-block:: yaml

   authorName:
   experimentName:
   trialConcurrency:
   maxExecDuration:
   maxTrialNum:
   #choice: local, remote, pai, kubeflow
   trainingServicePlatform:
   #choice: true, false, default: false
   useAnnotation:
   #choice: true, false, default: false
   multiThread:
   tuner:
     #choice: TPE, Random, Anneal, Evolution
     builtinTunerName:
     classArgs:
       #choice: maximize, minimize
       optimize_mode:
     gpuIndices:
   assessor:
     #choice: Medianstop
     builtinAssessorName:
     classArgs:
       #choice: maximize, minimize
       optimize_mode:
   trial:
     command:
     codeDir:
     gpuNum:
   #machineList can be empty if the platform is local
   machineList:
     - ip:
       port:
       username:
       passwd:
Configuration Spec
------------------
authorName
^^^^^^^^^^
Required. String.
The name of the author who created the experiment.
*TBD: add default value.*
experimentName
^^^^^^^^^^^^^^
Required. String.
The name of the experiment created.
*TBD: add default value.*
trialConcurrency
^^^^^^^^^^^^^^^^
Required. Integer between 1 and 99999.
Specifies the maximum number of trial jobs that run simultaneously.

If a trial's **gpuNum** is larger than the number of free GPUs, so that the number of simultaneously running trial jobs cannot reach **trialConcurrency**, some trial jobs will be put into a queue to wait for GPU allocation.
maxExecDuration
^^^^^^^^^^^^^^^
Optional. String. Default: 999d.
**maxExecDuration** specifies the maximum duration of an experiment. The unit of the time is {**s**, **m**, **h**, **d**}, which means {*seconds*, *minutes*, *hours*, *days*}.

Note: The maxExecDuration spec sets the duration of an experiment, not of a trial job. When the experiment reaches the max duration, it will not stop, but it can no longer submit new trial jobs.
versionCheck
^^^^^^^^^^^^
Optional. Bool. Default: true.
NNI will check the version of the nniManager process and the version of trialKeeper on the remote, pai and kubernetes platforms. If you want to disable version check, you can set versionCheck to false.
debug
^^^^^
Optional. Bool. Default: false.
Debug mode will set versionCheck to false and logLevel to 'debug'.
maxTrialNum
^^^^^^^^^^^
Optional. Integer between 1 and 99999. Default: 99999.
Specifies the max number of trial jobs created by NNI, including succeeded and failed jobs.
trainingServicePlatform
^^^^^^^^^^^^^^^^^^^^^^^
Required. String.
Specifies the platform to run the experiment, including **local**, **remote**, **pai**, **kubeflow**, and **frameworkcontroller**.

* **local**: run an experiment on the local Ubuntu machine.
* **remote**: submit trial jobs to remote Ubuntu machines; the **machineList** field should be filled in order to set up the SSH connection to the remote machines.
* **pai**: submit trial jobs to `OpenPAI <https://github.com/Microsoft/pai>`__ of Microsoft. For more details of pai configuration, please refer to the `Guide to PAI Mode <../TrainingService/PaiMode.rst>`__.
* **kubeflow**: submit trial jobs to `kubeflow <https://www.kubeflow.org/docs/about/kubeflow/>`__. NNI supports kubeflow based on normal kubernetes and `azure kubernetes <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__. For details, please refer to the `Kubeflow Docs <../TrainingService/KubeflowMode.rst>`__.
* **adl**: submit trial jobs to `AdaptDL <https://github.com/petuum/adaptdl>`__. NNI supports AdaptDL on a Kubernetes cluster. For details, please refer to the `AdaptDL Docs <../TrainingService/AdaptDLMode.rst>`__.
* **frameworkcontroller**: *TODO: explain frameworkcontroller.*
searchSpacePath
^^^^^^^^^^^^^^^
Optional. Path to existing file.
Specifies the path of the search space file, which should be a valid path on the local machine.

The only case in which **searchSpacePath** can be omitted is when ``useAnnotation=True``.
useAnnotation
^^^^^^^^^^^^^
Optional. Bool. Default: false.
Use annotation to analyze trial code and generate the search space.
Note: if **useAnnotation** is true, the searchSpacePath field should be removed.
multiThread
^^^^^^^^^^^
Optional. Bool. Default: false.
Enable multi-thread mode for dispatcher. If multiThread is enabled, dispatcher will start a thread to process each command from NNI Manager.
nniManagerIp
^^^^^^^^^^^^
Optional. String. Default: eth0 device IP.
Set the IP address of the machine on which NNI manager process runs. This field is optional, and if it's not set, eth0 device IP will be used instead.
Note: run ``ifconfig`` on the NNI manager's machine to check if the eth0 device exists. If it does not, it is recommended to set **nniManagerIp** explicitly.
logDir
^^^^^^
Optional. Path to a directory. Default: ``<user home directory>/nni-experiments``.
Configures the directory to store logs and data of the experiment.
logLevel
^^^^^^^^
Optional. String. Default: ``info``.
Sets log level for the experiment. Available log levels are: ``trace``\ , ``debug``\ , ``info``\ , ``warning``\ , ``error``\ , ``fatal``.
logCollection
^^^^^^^^^^^^^
Optional. ``http`` or ``none``. Default: ``none``.
Sets the way to collect logs on the remote, pai, kubeflow, and frameworkcontroller platforms. There are two ways to collect logs. With ``http``, the trial keeper posts the log content back via HTTP requests, but this may slow down the trial keeper's processing of logs. With ``none``, the trial keeper does not post log content back and only posts job metrics. If your log content is large, consider setting this param to ``none``.
tuner
^^^^^
Required.
Specifies the tuner algorithm in the experiment. There are two ways to set the tuner. One way is to use a tuner provided by the NNI SDK (a built-in tuner), in which case you need to set **builtinTunerName** and **classArgs**. The other way is to use a user-defined tuner file, in which case **codeDir**, **classFileName**, **className** and **classArgs** are needed. *Users must choose exactly one way.*
builtinTunerName
^^^^^^^^^^^^^^^^
Required if using built-in tuners. String.
Specifies the name of the built-in tuner. The NNI SDK provides different tuners, introduced `here <../Tuner/BuiltinTuner.rst>`__.
codeDir
^^^^^^^
Required if using customized tuners. Path relative to the location of config file.
Specifies the directory of tuner code.
classFileName
^^^^^^^^^^^^^
Required if using customized tuners. File path relative to **codeDir**.
Specifies the name of tuner file.
className
^^^^^^^^^
Required if using customized tuners. String.
Specifies the name of tuner class.
classArgs
^^^^^^^^^
Optional. Key-value pairs. Default: empty.
Specifies the arguments of tuner algorithm. Please refer to `this file <../Tuner/BuiltinTuner.rst>`__ for the configurable arguments of each built-in tuner.
gpuIndices
^^^^^^^^^^
Optional. String. Default: empty.
Specifies the GPUs that can be used by the tuner process. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by comma ``,``. For example, ``1``\ , or ``0,1,3``. If the field is not set, no GPU will be visible to tuner (by setting ``CUDA_VISIBLE_DEVICES`` to be an empty string).
includeIntermediateResults
^^^^^^^^^^^^^^^^^^^^^^^^^^
Optional. Bool. Default: false.
If **includeIntermediateResults** is true, the last intermediate result of the trial that is early stopped by assessor is sent to tuner as final result.
assessor
^^^^^^^^
Specifies the assessor algorithm to run an experiment. Similar to tuners, there are two ways to set the assessor. One way is to use an assessor provided by the NNI SDK; users need to set **builtinAssessorName** and **classArgs**. The other way is to use a user-defined assessor file; users need to set **codeDir**, **classFileName**, **className** and **classArgs**. *Users must choose exactly one way.*
By default, there is no assessor enabled.
builtinAssessorName
^^^^^^^^^^^^^^^^^^^
Required if using built-in assessors. String.
Specifies the name of the built-in assessor. The NNI SDK provides different assessors, introduced `here <../Assessor/BuiltinAssessor.rst>`__.
codeDir
^^^^^^^
Required if using customized assessors. Path relative to the location of config file.
Specifies the directory of assessor code.
classFileName
^^^^^^^^^^^^^
Required if using customized assessors. File path relative to **codeDir**.
Specifies the name of assessor file.
className
^^^^^^^^^
Required if using customized assessors. String.
Specifies the name of assessor class.
classArgs
^^^^^^^^^
Optional. Key-value pairs. Default: empty.
Specifies the arguments of assessor algorithm.
advisor
^^^^^^^
Optional.
Specifies the advisor algorithm in the experiment. Similar to tuners and assessors, there are two ways to specify the advisor. One way is to use an advisor provided by the NNI SDK, in which case **builtinAdvisorName** and **classArgs** need to be set. The other way is to use a user-defined advisor file, in which case **codeDir**, **classFileName**, **className** and **classArgs** need to be set.

When an advisor is enabled, the settings of tuners and assessors are bypassed.
builtinAdvisorName
^^^^^^^^^^^^^^^^^^
Required if using built-in advisors. String.

Specifies the name of a built-in advisor. The NNI SDK provides `BOHB <../Tuner/BohbAdvisor.rst>`__ and `Hyperband <../Tuner/HyperbandAdvisor.rst>`__.
codeDir
^^^^^^^
Required if using customized advisors. Path relative to the location of config file.
Specifies the directory of advisor code.
classFileName
^^^^^^^^^^^^^
Required if using customized advisors. File path relative to **codeDir**.
Specifies the name of advisor file.
className
^^^^^^^^^
Required if using customized advisors. String.
Specifies the name of advisor class.
classArgs
^^^^^^^^^
Optional. Key-value pairs. Default: empty.
Specifies the arguments of advisor.
gpuIndices
^^^^^^^^^^
Optional. String. Default: empty.
Specifies the GPUs that can be used by the advisor process. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by comma (``,``). For example, ``1``, or ``0,1,3``. If the field is not set, no GPU will be visible to the advisor (by setting ``CUDA_VISIBLE_DEVICES`` to an empty string).
trial
^^^^^
Required. Key-value pairs.
In local and remote mode, the following keys are required.
* **command**\ : Required string. Specifies the command to run the trial process.
* **codeDir**\ : Required string. Specifies the directory of your own trial file. This directory will be automatically uploaded in remote mode.
* **gpuNum**\ : Optional integer. Specifies the number of GPUs used to run the trial process. The default value is 0.
In PAI mode, the following keys are required.
* **command**\ : Required string. Specifies the command to run the trial process.
* **codeDir**\ : Required string. Specifies the directory of your own trial file. Files in the directory will be uploaded in PAI mode.
* **gpuNum**\ : Required integer. Specifies the number of GPUs used to run the trial process. The default value is 0.
* **cpuNum**\ : Required integer. Specifies the number of CPUs to be used in the pai container.
* **memoryMB**\ : Required integer. Sets the memory size to be used in the pai container, in megabytes.
* **image**\ : Required string. Sets the image to be used in pai.
* **authFile**\ : Optional string. Used to provide a Docker registry that needs authentication for image pull in PAI. `Reference <https://github.com/microsoft/pai/blob/2ea69b45faa018662bc164ed7733f6fdbb4c42b3/docs/faq.rst#q-how-to-use-private-docker-registry-job-image-when-submitting-an-openpai-job>`__.
* **shmMB**\ : Optional integer. Shared memory size of the container.
* **portList**\ : List of key-value pairs with ``label``, ``beginAt``, ``portNumber``. See the `job tutorial of PAI <https://github.com/microsoft/pai/blob/master/docs/job_tutorial.rst>`__ for details.
In Kubeflow mode, the following keys are required.
* **codeDir**\ : The local directory where the code files are.
* **ps**\ : An optional configuration for kubeflow's tensorflow-operator, which includes:

  * **replicas**\ : The replica number of the **ps** role.
  * **command**\ : The run script in **ps**\ 's container.
  * **gpuNum**\ : The number of GPUs to be used in the **ps** container.
  * **cpuNum**\ : The number of CPUs to be used in the **ps** container.
  * **memoryMB**\ : The memory size of the container.
  * **image**\ : The image to be used in **ps**.

* **worker**\ : An optional configuration for kubeflow's tensorflow-operator, which includes:

  * **replicas**\ : The replica number of the **worker** role.
  * **command**\ : The run script in **worker**\ 's container.
  * **gpuNum**\ : The number of GPUs to be used in the **worker** container.
  * **cpuNum**\ : The number of CPUs to be used in the **worker** container.
  * **memoryMB**\ : The memory size of the container.
  * **image**\ : The image to be used in **worker**.
localConfig
^^^^^^^^^^^
Optional in local mode. Key-value pairs.
Only applicable if **trainingServicePlatform** is set to ``local``; otherwise there should not be a **localConfig** section in the configuration file.
gpuIndices
^^^^^^^^^^
Optional. String. Default: none.
Used to specify designated GPU devices for NNI. If it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified; multiple GPU indices should be separated with a comma (``,``), such as ``1`` or ``0,1,3``. By default, all available GPUs will be used.
maxTrialNumPerGpu
^^^^^^^^^^^^^^^^^
Optional. Integer. Default: 1.
Used to specify the maximum number of concurrent trials on a GPU device.
useActiveGpu
^^^^^^^^^^^^
Optional. Bool. Default: false.
Used to specify whether to use a GPU that another process is using. By default, NNI will use a GPU only if there is no other active process on it. If **useActiveGpu** is set to true, NNI will use the GPU regardless of other processes. This field is not applicable to NNI on Windows.
machineList
^^^^^^^^^^^
Required in remote mode. A list of key-value pairs with the following keys.
ip
^^
Required. String.

The IP address or host name of the remote machine, which must be accessible from the current machine.
port
^^^^
Optional. Integer. Valid port. Default: 22.
The SSH port used to connect to the remote machine.
username
^^^^^^^^
Required for username/password authentication. String.

The account name on the remote machine.
passwd
^^^^^^
Required for username/password authentication. String.
Specifies the password of the account.
sshKeyPath
^^^^^^^^^^
Required if authentication with ssh key. Path to private key file.
If users use an SSH key to log in to the remote machine, **sshKeyPath** should be a valid path to an SSH private key file.
*Note: if users set passwd and sshKeyPath simultaneously, NNI will try passwd first.*
passphrase
^^^^^^^^^^
Optional. String.
Used to protect the SSH key; it can be omitted if the key has no passphrase.
gpuIndices
^^^^^^^^^^
Optional. String. Default: none.
Used to specify designated GPU devices for NNI. If it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified; multiple GPU indices should be separated with a comma (``,``), such as ``1`` or ``0,1,3``. By default, all available GPUs will be used.
maxTrialNumPerGpu
^^^^^^^^^^^^^^^^^
Optional. Integer. Default: 1.
Used to specify the maximum number of concurrent trials on a GPU device.
useActiveGpu
^^^^^^^^^^^^
Optional. Bool. Default: false.
Used to specify whether to use a GPU that another process is using. By default, NNI will use a GPU only if there is no other active process on it. If **useActiveGpu** is set to true, NNI will use the GPU regardless of other processes. This field is not applicable to NNI on Windows.
preCommand
^^^^^^^^^^
Optional. String.
Specifies a pre-command that will be executed before the remote machine executes other commands. Users can configure the experimental environment on the remote machine by setting **preCommand**. If multiple commands need to be executed, use ``&&`` to connect them, such as ``preCommand: command1 && command2 && ...``.

**Note**: Because **preCommand** will execute before other commands each time, it is strongly recommended not to set a **preCommand** that makes changes to the system, e.g. ``mkdir`` or ``touch``.
remoteConfig
^^^^^^^^^^^^
Optional in remote mode. Users can set per-machine information in the ``machineList`` field, and global configuration for remote mode in this field.
reuse
^^^^^
Optional. Bool. Default: ``false``. This is an experimental feature.

If true, NNI will reuse remote jobs to run as many trials as possible, which saves the time of creating new jobs. Users need to make sure each trial can run independently in the same job; for example, avoid loading checkpoints from previous trials.
kubeflowConfig
^^^^^^^^^^^^^^
operator
^^^^^^^^
Required. String. Has to be ``tf-operator`` or ``pytorch-operator``.
Specifies the kubeflow operator to be used; NNI supports ``tf-operator`` and ``pytorch-operator`` in the current version.
storage
^^^^^^^
Optional. String. Default: ``nfs``.
Specifies the storage type of kubeflow, including ``nfs`` and ``azureStorage``.
nfs
^^^
Required if using nfs. Key-value pairs.
* **server** is the host of the NFS server.
* **path** is the mounted path of NFS.
keyVault
^^^^^^^^
Required if using azure storage. Key-value pairs.
Set **keyVault** to store the private key of your Azure storage account. Refer to https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2.

* **vaultName** is the value of ``--vault-name`` used in the az command.
* **name** is the value of ``--name`` used in the az command.
azureStorage
^^^^^^^^^^^^
Required if using azure storage. Key-value pairs.
Set the Azure storage account to store code files.

* **accountName** is the name of the Azure storage account.
* **azureShare** is the share of the Azure file storage.
uploadRetryCount
^^^^^^^^^^^^^^^^
Required if using azure storage. Integer between 1 and 99999.
If uploading files to Azure storage fails, NNI will retry; this field specifies the number of attempts to re-upload files.
paiConfig
^^^^^^^^^
userName
^^^^^^^^
Required. String.
The user name of your pai account.
password
^^^^^^^^
Required if using password authentication. String.
The password of the pai account.
token
^^^^^
Required if using token authentication. String.
Personal access token that can be retrieved from PAI portal.
host
^^^^
Required. String.
The hostname or IP address of PAI.
reuse
^^^^^
Optional. Bool. Default: ``false``. This is an experimental feature.

If true, NNI will reuse OpenPAI jobs to run as many trials as possible, which saves the time of creating new jobs. Users need to make sure each trial can run independently in the same job; for example, avoid loading checkpoints from previous trials.
Examples
--------
Local mode
^^^^^^^^^^
If users want to run trial jobs on the local machine and use annotation to generate the search space, they can use the following config:
.. code-block:: yaml

   authorName: test
   experimentName: test_experiment
   trialConcurrency: 3
   maxExecDuration: 1h
   maxTrialNum: 10
   #choice: local, remote, pai, kubeflow
   trainingServicePlatform: local
   #choice: true, false
   useAnnotation: true
   tuner:
     #choice: TPE, Random, Anneal, Evolution
     builtinTunerName: TPE
     classArgs:
       #choice: maximize, minimize
       optimize_mode: maximize
   trial:
     command: python3 mnist.py
     codeDir: /nni/mnist
     gpuNum: 0
You can add an assessor configuration:
.. code-block:: yaml

   authorName: test
   experimentName: test_experiment
   trialConcurrency: 3
   maxExecDuration: 1h
   maxTrialNum: 10
   #choice: local, remote, pai, kubeflow
   trainingServicePlatform: local
   searchSpacePath: /nni/search_space.json
   #choice: true, false
   useAnnotation: false
   tuner:
     #choice: TPE, Random, Anneal, Evolution
     builtinTunerName: TPE
     classArgs:
       #choice: maximize, minimize
       optimize_mode: maximize
   assessor:
     #choice: Medianstop
     builtinAssessorName: Medianstop
     classArgs:
       #choice: maximize, minimize
       optimize_mode: maximize
   trial:
     command: python3 mnist.py
     codeDir: /nni/mnist
     gpuNum: 0
Or you can specify your own tuner and assessor files as follows:
.. code-block:: yaml

   authorName: test
   experimentName: test_experiment
   trialConcurrency: 3
   maxExecDuration: 1h
   maxTrialNum: 10
   #choice: local, remote, pai, kubeflow
   trainingServicePlatform: local
   searchSpacePath: /nni/search_space.json
   #choice: true, false
   useAnnotation: false
   tuner:
     codeDir: /nni/tuner
     classFileName: mytuner.py
     className: MyTuner
     classArgs:
       #choice: maximize, minimize
       optimize_mode: maximize
   assessor:
     codeDir: /nni/assessor
     classFileName: myassessor.py
     className: MyAssessor
     classArgs:
       #choice: maximize, minimize
       optimize_mode: maximize
   trial:
     command: python3 mnist.py
     codeDir: /nni/mnist
     gpuNum: 0
Remote mode
^^^^^^^^^^^
To run trial jobs on remote machines, users can specify the remote machine information in the following format:
.. code-block:: yaml

   authorName: test
   experimentName: test_experiment
   trialConcurrency: 3
   maxExecDuration: 1h
   maxTrialNum: 10
   #choice: local, remote, pai, kubeflow
   trainingServicePlatform: remote
   searchSpacePath: /nni/search_space.json
   #choice: true, false
   useAnnotation: false
   tuner:
     #choice: TPE, Random, Anneal, Evolution
     builtinTunerName: TPE
     classArgs:
       #choice: maximize, minimize
       optimize_mode: maximize
   trial:
     command: python3 mnist.py
     codeDir: /nni/mnist
     gpuNum: 0
   #machineList can be empty if the platform is local
   machineList:
     - ip: 10.10.10.10
       port: 22
       username: test
       passwd: test
     - ip: 10.10.10.11
       port: 22
       username: test
       passwd: test
     - ip: 10.10.10.12
       port: 22
       username: test
       sshKeyPath: /nni/sshkey
       passphrase: qwert
       # Pre-command will be executed before the remote machine executes other commands.
       # Below is an example of specifying python environment.
       # If you want to execute multiple commands, please use "&&" to connect them.
This page is for frequently asked questions and answers.
tmp folder full
^^^^^^^^^^^^^^^
nnictl uses the tmp folder as a temporary folder to copy files under codeDir when creating an experiment.

When you meet errors like the one below, try cleaning up the **tmp** folder first.
.. code-block:: text

   OSError: [Errno 28] No space left on device
Cannot get trials' metrics in OpenPAI mode
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In OpenPAI training mode, we start a REST server in the NNI Manager that listens on port 51189 to receive metrics reported from trials running in the OpenPAI cluster. If you don't see any metrics in the WebUI in OpenPAI mode, check the machine where the NNI manager runs and make sure port 51189 is open in the firewall rules.
If your machine doesn't have an eth0 device, please set `nniManagerIp <ExperimentConfig.rst>`__ in your config file manually.
Exceed the MaxDuration but didn't stop
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When the duration of the experiment reaches the maximum duration, nniManager will not create new trials, but existing trials will continue unless the user manually stops the experiment.
Could not stop an experiment using ``nnictl stop``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you upgrade NNI or delete some of NNI's config files while an experiment is running, this kind of issue may happen because of the lost config files. You can use ``ps -ef | grep node`` to find the PID of your experiment and use ``kill -9 {pid}`` to kill it manually.
Could not get ``default metric`` in webUI of virtual machines
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Configure the virtual machine's network to bridge mode or another mode that makes the virtual machine accessible from external machines, and make sure the virtual machine's port is not blocked by the firewall.
Could not open webUI link
^^^^^^^^^^^^^^^^^^^^^^^^^
Unable to open the WebUI may have the following reasons:
* ``http://127.0.0.1``, ``http://172.17.0.1`` and ``http://10.0.0.15`` refer to localhost. If you start your experiment on a server or remote machine, you can replace the IP with your server IP to view the WebUI, like ``http://[your_server_ip]:8080``.
* If you still can't see the WebUI after using the server IP, check the proxy and the firewall of your machine, or use a browser on the machine where you started your NNI experiment.
* Another reason may be that your experiment failed and NNI could not get the experiment information. You can check the NNIManager log in the following directory: ``~/nni-experiments/[your_experiment_id]/log/nnimanager.log``
Restful server start failed
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Probably it's a problem with your network config. Here is a checklist.
* You might need to link ``127.0.0.1`` with ``localhost``. Add a line ``127.0.0.1 localhost`` to ``/etc/hosts``.
* It's also possible that you have set some proxy config. Check your environment for variables like ``HTTP_PROXY`` or ``HTTPS_PROXY`` and unset them if they are set.
NNI on Windows problems
^^^^^^^^^^^^^^^^^^^^^^^
Please refer to `NNI on Windows <InstallationWin.rst>`__
More FAQ issues
^^^^^^^^^^^^^^^
`NNI Issues with FAQ labels <https://github.com/microsoft/nni/labels/FAQ>`__
Help us improve
^^^^^^^^^^^^^^^
Please search https://github.com/Microsoft/nni/issues to see whether the problem has already been reported, and create a new issue if no existing one covers it.
There are three parts of NNI that might produce logs: nnimanager, dispatcher, and trial. Here we introduce them succinctly; for more information, please refer to `Overview <../Overview.rst>`__.
* **NNI controller**\ : NNI controller (nnictl) is the nni command-line tool that is used to manage experiments (e.g., start an experiment).
* **nnimanager**\ : nnimanager is the core of NNI, whose log is important when the whole experiment fails (e.g., no webUI or training service fails)
* **Dispatcher**\ : Dispatcher calls the methods of **Tuner** and **Assessor**. Logs of dispatcher are related to the tuner or assessor code.
* **Tuner**\ : Tuner is an AutoML algorithm, which generates a new configuration for the next try. A new trial will run with this configuration.
* **Assessor**\ : Assessor analyzes trial's intermediate results (e.g., periodically evaluated accuracy on test dataset) to tell whether this trial can be early stopped or not.
* **Trial**\ : Trial code is the code you write to run your experiment; it is an individual attempt at applying a new configuration (e.g., a set of hyperparameter values or a specific neural architecture).
Where is the log
----------------
There are three kinds of logs in NNI. When creating a new experiment, you can set the log level to debug by adding ``--debug``. Besides, you can set a more detailed log level in your configuration file by using the ``logLevel`` field.
All possible errors that happen when launching an NNI experiment can be found here.
You can use ``nnictl log stderr`` to find error information. For more options, please refer to `NNICTL <Nnictl.rst>`__.
Experiment Root Directory
^^^^^^^^^^^^^^^^^^^^^^^^^
Every experiment has a root folder, which is shown in the top-right corner of the webUI. In case of webUI failure, you can also assemble it by replacing ``experiment_id`` with your actual experiment ID in the path ``~/nni-experiments/experiment_id/``. The ``experiment_id`` is shown when you run ``nnictl create ...`` to create a new experiment.
..
For flexibility, we also offer a ``logDir`` option in your configuration, which specifies the directory to store all experiments (defaults to ``~/nni-experiments``\ ). Please refer to `Configuration <ExperimentConfig.rst>`__ for more details.
Under that directory, there is another directory named ``log``\ , where ``nnimanager.log`` and ``dispatcher.log`` are placed.
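For instance, the two log files can be located like this (the experiment ID here is hypothetical):

.. code-block:: python

   import os

   experiment_id = 'GExxxxxx'  # hypothetical ID printed by `nnictl create`
   log_dir = os.path.expanduser(os.path.join('~', 'nni-experiments', experiment_id, 'log'))
   print(os.path.join(log_dir, 'nnimanager.log'))  # NNI manager log
   print(os.path.join(log_dir, 'dispatcher.log'))  # dispatcher log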
Trial Root Directory
^^^^^^^^^^^^^^^^^^^^
Usually in the webUI, you can click ``+`` to the left of every trial to expand it and see the trial's log path.
Besides, there is another directory under experiment root directory, named ``trials``\ , which stores all the trials.
Every trial has a unique id as its directory name. In this directory, a file named ``stderr`` records trial error and another named ``trial.log`` records this trial's log.
Different kinds of errors
-------------------------
There are different kinds of errors. However, they can be divided into three categories based on their severity. So when nni fails, check each part sequentially.
Generally, if webUI is started successfully, there is a ``Status`` in the ``Overview`` tab, serving as a possible indicator of what kind of error happens. Otherwise you should check manually.
**NNI** Fails
^^^^^^^^^^^^^^^^^
This is the most serious error. When this happens, the whole experiment fails and no trial will be run. Usually this might be related to some installation problem.
When this happens, you should check ``nnictl``\ 's error output file ``stderr`` (i.e., ``nnictl log stderr``) and then ``nnimanager``\ 's log to find whether there is any error.
**Dispatcher** Fails
^^^^^^^^^^^^^^^^^^^^^^^^
Usually, for new users of NNI, a dispatcher failure means that the tuner fails. You can check the dispatcher's log to see what happened to your dispatcher. For built-in tuners, common errors include an invalid search space (an unsupported search space type, or an inconsistency between the initialization args in the configuration file and the actual args of the tuner's ``__init__`` function).
Take the latter situation as an example. If you write a customized tuner whose ``__init__`` function has an argument called ``optimize_mode``, which you do not provide in your configuration file, NNI will fail to run your tuner, so the experiment fails. You can see errors in the webUI like:
.. image:: ../../img/dispatcher_error.jpg
:target: ../../img/dispatcher_error.jpg
:alt:
Here we can see it is a dispatcher error, so we can check the dispatcher's log for the corresponding error message.
In this situation, NNI can still run and create new trials.
**Trial** Fails
^^^^^^^^^^^^^^^

This means your trial code (which is run by NNI) fails. This kind of error is strongly related to your trial code. Please check the trial's log to fix any possible errors shown there.
A common example of this would be running the MNIST example without installing TensorFlow. In that case there is an ImportError (TensorFlow is imported in the trial code but not installed), and thus every trial fails.
.. image:: ../../img/trial_error.jpg
:target: ../../img/trial_error.jpg
:alt:
As it shows, every trial has a log path, where you can find trial's log and stderr.
In addition to experiment-level debugging, NNI also provides the capability to debug a single trial without starting the entire experiment. Refer to `standalone mode <../TrialExample/Trials#standalone-mode-for-debugging>`__ for more information about debugging single trial code.
`Docker <https://www.docker.com/>`__ is a tool that makes it easier for users to deploy and run applications on their own operating system via containers. Docker is not a virtual machine: it does not create a virtual operating system, but lets different applications use the same OS kernel while isolating them in containers.
Users can start NNI experiments using Docker. NNI also provides an official Docker image `msranni/nni <https://hub.docker.com/r/msranni/nni>`__ on Docker Hub.
Using Docker in local machine
-----------------------------
Step 1: Installation of Docker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Before you start using Docker for NNI experiments, you should install Docker on your local machine. `See here <https://docs.docker.com/install/linux/docker-ce/ubuntu/>`__.
Step 2: Start a Docker container
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you have installed the Docker package in your local machine, you can start a Docker container instance to run NNI examples. You should notice that because NNI will start a web UI process in a container and continue to listen to a port, you need to specify the port mapping between your host machine and Docker container to give access to web UI outside the container. By visiting the host IP address and port, you can redirect to the web UI process started in Docker container and visit web UI content.
For example, you can start a new Docker container with the following command:
.. code-block:: bash

   docker run -i -t -p [hostPort]:[containerPort] [image]
``-i``: Start the Docker container in interactive mode.

``-t``: Assign the container an input terminal.

``-p``: Port mapping; map a host port to a container port.
For more information about Docker commands, please `refer to this <https://docs.docker.com/v17.09/edge/engine/reference/run/>`__.
Note:
.. code-block:: text

   NNI only supports Ubuntu and MacOS systems in local mode for the moment. Please use the correct Docker image type. If you want to use a GPU in a Docker container, please use nvidia-docker.
Step 3: Run NNI in a Docker container
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you start a Docker image using NNI's official image ``msranni/nni``\ , you can directly start NNI experiments by using the ``nnictl`` command. Our official image has NNI's running environment and basic python and deep learning frameworks preinstalled.
If you start your own Docker image, you may need to install the NNI package first; please refer to `NNI installation <InstallationLinux.rst>`__.
If you want to run NNI's official examples, you may need to clone the NNI repo in GitHub using
.. code-block:: bash

   git clone https://github.com/Microsoft/nni.git
then you can enter ``nni/examples/trials`` to start an experiment.
After you prepare NNI's environment, you can start a new experiment using the ``nnictl`` command. `See here <QuickStart.rst>`__.
Using Docker on a remote platform
---------------------------------
NNI supports starting experiments in `remoteTrainingService <../TrainingService/RemoteMachineMode.rst>`__\ , and running trial jobs on remote machines. As Docker can start an independent Ubuntu system as an SSH server, a Docker container can be used as the remote machine in NNI's remote mode.
Step 1: Setting a Docker environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You should install the Docker software on your remote machine first, please `refer to this <https://docs.docker.com/install/linux/docker-ce/ubuntu/>`__.
To make sure your Docker container can be connected by NNI experiments, you should build your own Docker image to set an SSH server or use images with an SSH configuration. If you want to use a Docker container as an SSH server, you should configure the SSH password login or private key login; please `refer to this <https://docs.docker.com/engine/examples/running_ssh_service/>`__.
Note:
.. code-block:: text

   NNI's official image msranni/nni does not support SSH servers for the time being; you should build your own Docker image with an SSH configuration or use other images as a remote server.
Step 2: Start a Docker container on a remote machine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
An SSH server needs a port; you need to expose Docker's SSH port to NNI as the connection port. For example, if you set your container's SSH port to ``A``, you should map the container's port ``A`` to another port ``B`` on your remote host machine. NNI will connect to port ``B`` as the SSH port, and the host machine will forward the connection from port ``B`` to port ``A``, so NNI can connect to your Docker container.
For example, you could start your Docker container using the following commands:
.. code-block:: bash

   docker run -dit -p [hostPort]:[containerPort] [image]
The ``containerPort`` is the SSH port used in your Docker container and the ``hostPort`` is your host machine's port exposed to NNI. You can set your NNI's config file to connect to ``hostPort`` and the connection will be transmitted to your Docker container.
For more information about Docker commands, please `refer to this <https://docs.docker.com/v17.09/edge/engine/reference/run/>`__.
Note:
.. code-block:: text

   If you use your own Docker image as a remote server, please make sure that this image has a basic python environment and an NNI SDK runtime environment. If you want to use a GPU in a Docker container, please use nvidia-docker.
Step 3: Run NNI experiments
^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can set your config file to the remote platform and set the ``machineList`` configuration to connect to your Docker SSH server; `refer to this <../TrainingService/RemoteMachineMode.rst>`__. Note that you should set the correct ``port``, ``username``, and ``passwd`` or ``sshKeyPath`` of your host machine.
``port``: The host machine's port, mapped to Docker's SSH port.

``username``: The username of the Docker container.

``passwd``: The password of the Docker container.

``sshKeyPath``: The path of the private key of the Docker container.
After the configuration of the config file, you could start an experiment, `refer to this <QuickStart.rst>`__.
NNI provides many `builtin tuners <../Tuner/BuiltinTuner.rst>`__\ , `advisors <../Tuner/HyperbandAdvisor.rst>`__ and `assessors <../Assessor/BuiltinAssessor.rst>`__ that can be used directly for hyper-parameter optimization; some extra algorithms can be installed via ``nnictl package install --name <name>`` after NNI is installed. You can check these extra algorithms via the ``nnictl package list`` command.

NNI also provides the ability to build your own customized tuners, advisors and assessors. To use a customized algorithm, users can simply follow the spec in the experiment config file to properly reference the algorithm, as illustrated in the tutorials for `customized tuners <../Tuner/CustomizeTuner.rst>`__\ /\ `advisors <../Tuner/CustomizeAdvisor.rst>`__\ /\ `assessors <../Assessor/CustomizeAssessor.rst>`__.

NNI also allows users to install a customized algorithm as a builtin algorithm, so that it can be used the same way as NNI builtin tuners/advisors/assessors. More importantly, this makes it much easier to share or distribute an implemented algorithm to others. Once customized tuners/advisors/assessors are installed into NNI as builtin algorithms, you can use them in your experiment configuration file exactly like builtin ones. For example, if you built a customized tuner and installed it into NNI under the builtin name ``mytuner``, you can use it in your configuration file like below:
.. code-block:: yaml

   tuner:
     builtinTunerName: mytuner
Install customized algorithms as builtin tuners, assessors and advisors
------------------------------------------------------------------------
NNI provides a ``ClassArgsValidator`` interface for authors of customized algorithms to validate the classArgs parameters that the experiment configuration file passes to the customized algorithm's constructor.
The ``ClassArgsValidator`` interface is defined as:
.. code-block:: python

   class ClassArgsValidator(object):
       def validate_class_args(self, **kwargs):
           """
           The classArgs fields in experiment configuration are packed as a dict and
           passed to the validator as kwargs.
           """
           pass
For example, you can implement your validator as follows:
.. code-block:: python

   from schema import Schema, Optional

   from nni import ClassArgsValidator

   class MedianstopClassArgsValidator(ClassArgsValidator):
       def validate_class_args(self, **kwargs):
           # Body completed as a minimal sketch: check that `optimize_mode`,
           # if provided, is one of the values Medianstop accepts.
           Schema({
               Optional('optimize_mode'): lambda v: v in ('maximize', 'minimize'),
           }).validate(kwargs)
The validator will be invoked before experiment is started to check whether the classArgs fields are valid for your customized algorithms.
3. Prepare package installation source
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In order to be installed as builtin tuners, assessors or advisors, customized algorithms need to be packaged as an installable source that can be recognized by the ``pip`` command; under the hood, NNI calls ``pip`` to install the package.
Besides being a common pip source, the package needs to provide meta information in the ``classifiers`` field.
The format of the classifiers field is as follows:
.. code-block:: text

   NNI Package :: <type> :: <builtin name> :: <full class name of tuner> :: <full class name of class args validator>
* ``type``\ : type of algorithms, could be one of ``tuner``\ , ``assessor``\ , ``advisor``
* ``builtin name``\ : builtin name used in experiment configuration file
* ``full class name of tuner``\ : tuner class name, including its module name, for example: ``demo_tuner.DemoTuner``
* ``full class name of class args validator``\ : class args validator class name, including its module name, for example: ``demo_tuner.MyClassArgsValidator``
Following is an example of classifiers in a package's ``setup.py``:
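The sketch below is assembled from the classifier format above and the ``demo_tuner`` names used earlier; the other package metadata values are hypothetical.

.. code-block:: python

   from setuptools import setup

   setup(
       name='demo-tuner',       # hypothetical package name
       version='0.1',           # hypothetical version
       packages=['demo_tuner'],
       classifiers=[
           # format: NNI Package :: <type> :: <builtin name> :: <tuner class> :: <validator class>
           'NNI Package :: tuner :: demotuner :: demo_tuner.DemoTuner :: demo_tuner.MyClassArgsValidator'
       ],
   )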
Once you have the meta info in ``setup.py``\ , you can build your pip installation source via:
* Run command ``python setup.py develop`` from the package directory, this command will build the directory as a pip installation source.
* Run command ``python setup.py bdist_wheel`` from the package directory; this command builds a whl file which is a pip installation source.
NNI looks for the classifier that starts with ``NNI Package`` to retrieve the package meta information when the package is installed with the ``nnictl package install <source>`` command.
Reference `customized tuner example <../Tuner/InstallCustomizedTuner.rst>`__ for a full example.
4. Install customized algorithms package into NNI
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If your installation source is prepared as a directory with ``python setup.py develop``, you can install the package with the ``nnictl package install <source>`` command, where ``<source>`` is the package directory.
Once your customized algorithm is installed, you can use it in your experiment configuration file the same way as other builtin tuners/assessors/advisors, for example:
.. code-block:: yaml

   tuner:
     builtinTunerName: demotuner
     classArgs:
       #choice: maximize, minimize
       optimize_mode: maximize
Manage packages using ``nnictl package``
--------------------------------------------
List installed packages
^^^^^^^^^^^^^^^^^^^^^^^
Run the following command to list the installed packages:
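This is the same ``nnictl package list`` command described in the ``nnictl`` reference later in this document:

.. code-block:: bash

   nnictl package list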
You can also install NNI in a docker image. Please follow the instructions :githublink:`here <deployment/docker/README.rst>` to build an NNI docker image. The NNI docker image can also be retrieved from Docker Hub through the command ``docker pull msranni/nni:latest``.
Verify installation
-------------------
The following example is built on TensorFlow 1.x. Make sure **TensorFlow 1.x is used** when running it.
*
Download the examples via cloning the source code.
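A sketch of the commands, assuming the repository layout used elsewhere in this document (the MNIST example and its ``config.yml`` live under ``examples/trials/mnist-tfv1``\ ; a specific release branch may be checked out instead of master):

.. code-block:: bash

   git clone https://github.com/Microsoft/nni.git

   # start the MNIST example experiment
   nnictl create --config nni/examples/trials/mnist-tfv1/config.yml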
Wait for the message ``INFO: Successfully started experiment!`` in the command line. This message indicates that your experiment has been successfully started. You can explore the experiment using the ``Web UI url``.
* Open the ``Web UI url`` in your browser, you can view detailed information about the experiment and all the submitted trial jobs as shown below. `Here <../Tutorial/WebUI.rst>`__ are more Web UI pages.
.. image:: ../../img/webui_overview_page.png
:target: ../../img/webui_overview_page.png
:alt: overview
.. image:: ../../img/webui_trialdetail_page.png
:target: ../../img/webui_trialdetail_page.png
:alt: detail
System requirements
-------------------
Due to potential programming changes, the minimum system requirements of NNI may change over time.
Linux
^^^^^
.. list-table::
:header-rows: 1
:widths: auto
* -
- Recommended
- Minimum
* - **Operating System**
- Ubuntu 16.04 or above
-
* - **CPU**
- Intel® Core™ i5 or AMD Phenom™ II X3 or better
- Intel® Core™ i3 or AMD Phenom™ X3 8650
* - **GPU**
- NVIDIA® GeForce® GTX 660 or better
- NVIDIA® GeForce® GTX 460
* - **Memory**
- 6 GB RAM
- 4 GB RAM
* - **Storage**
- 30 GB available hard drive space
-
* - **Internet**
- Broadband internet connection
-
* - **Resolution**
- 1024 x 768 minimum display resolution
-
macOS
^^^^^
.. list-table::
:header-rows: 1
:widths: auto
* -
- Recommended
- Minimum
* - **Operating System**
- macOS 10.14.1 or above
-
* - **CPU**
- Intel® Core™ i7-4770 or better
- Intel® Core™ i5-760 or better
* - **GPU**
- AMD Radeon™ R9 M395X or better
- NVIDIA® GeForce® GT 750M or AMD Radeon™ R9 M290 or better
* - **Memory**
- 8 GB RAM
- 4 GB RAM
* - **Storage**
- 70 GB available space on an SSD
- 70 GB available space on a 7200 RPM HDD
* - **Internet**
- Broadband internet connection
-
* - **Resolution**
- 1024 x 768 minimum display resolution
-
Further reading
---------------
* `Overview <../Overview.rst>`__
* `Use command line tool nnictl <Nnictl.rst>`__
* `Use NNIBoard <WebUI.rst>`__
* `Define search space <SearchSpaceSpec.rst>`__
* `Config an experiment <ExperimentConfig.rst>`__
* `How to run an experiment on local (with multiple GPUs)? <../TrainingService/LocalMode.rst>`__
* `How to run an experiment on multiple machines? <../TrainingService/RemoteMachineMode.rst>`__
* `How to run an experiment on OpenPAI? <../TrainingService/PaiMode.rst>`__
* `How to run an experiment on Kubernetes through Kubeflow? <../TrainingService/KubeflowMode.rst>`__
* `How to run an experiment on Kubernetes through FrameworkController? <../TrainingService/FrameworkControllerMode.rst>`__
* `How to run an experiment on Kubernetes through AdaptDL? <../TrainingService/AdaptDLMode.rst>`__
Python 3.6 (or above) 64-bit. `Anaconda <https://www.anaconda.com/products/individual>`__ or `Miniconda <https://docs.conda.io/en/latest/miniconda.html>`__ is highly recommended to manage multiple Python environments on Windows.
*
If it's a newly installed Python environment, you need to install `Microsoft C++ Build Tools <https://visualstudio.microsoft.com/visual-cpp-build-tools/>`__ to support building NNI dependencies like ``scikit-learn``.
.. code-block:: bat
pip install cython wheel
*
git for verifying installation.
Install NNI
-----------
In most cases, you can install and upgrade NNI from pip package. It's easy and fast.
If you are interested in special or the latest code versions, you can install NNI through source code.
If you want to contribute to NNI, refer to `setup development environment <SetupNniDeveloperEnvironment.rst>`__.
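For example, to install or upgrade from pip on Windows (the same command appears in the installation section of this document):

.. code-block:: bat

   python -m pip install --upgrade nni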
Note: If you are familiar with other frameworks, you can choose the corresponding example under ``examples\trials``. You need to change the trial command ``python3`` to ``python`` in each example YAML, since the default installation provides a ``python.exe`` executable, not ``python3.exe``.
*
Wait for the message ``INFO: Successfully started experiment!`` in the command line. This message indicates that your experiment has been successfully started. You can explore the experiment using the ``Web UI url``.
* Open the ``Web UI url`` in your browser, you can view detailed information about the experiment and all the submitted trial jobs as shown below. `Here <../Tutorial/WebUI.rst>`__ are more Web UI pages.
.. image:: ../../img/webui_overview_page.png
:target: ../../img/webui_overview_page.png
:alt: overview
.. image:: ../../img/webui_trialdetail_page.png
:target: ../../img/webui_trialdetail_page.png
:alt: detail
System requirements
-------------------
Below are the minimum system requirements for NNI on Windows; Windows 10 version 1809 is well tested and recommended. Due to potential programming changes, the minimum system requirements for NNI may change over time.
.. list-table::
:header-rows: 1
:widths: auto
* -
- Recommended
- Minimum
* - **Operating System**
- Windows 10 1809 or above
-
* - **CPU**
- Intel® Core™ i5 or AMD Phenom™ II X3 or better
- Intel® Core™ i3 or AMD Phenom™ X3 8650
* - **GPU**
- NVIDIA® GeForce® GTX 660 or better
- NVIDIA® GeForce® GTX 460
* - **Memory**
- 6 GB RAM
- 4 GB RAM
* - **Storage**
- 30 GB available hard drive space
-
* - **Internet**
- Broadband internet connection
-
* - **Resolution**
- 1024 x 768 minimum display resolution
-
FAQ
---
simplejson failed when installing NNI
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Make sure a C++ 14.0 compiler is installed.
..
building 'simplejson._speedups' extension error: [WinError 3] The system cannot find the path specified
Trial failed with missing DLL in command line or PowerShell
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This error is caused by missing ``LIBIFCOREMD.DLL`` and ``LIBMMD.DLL``\ , which makes SciPy fail to install. Using Anaconda or Miniconda with 64-bit Python can solve it.
..
ImportError: DLL load failed
Trial failed on webUI
^^^^^^^^^^^^^^^^^^^^^
Please check the trial log file; if there is a ``stderr`` file, check it for more details. Two possible causes are:
* forgetting to change the trial command ``python3`` to ``python`` in each experiment YAML.
* forgetting to install experiment dependencies such as TensorFlow, Keras and so on.
Fail to use BOHB on Windows
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Make sure a C++ 14.0 compiler is installed when trying to run ``nnictl package install --name=BOHB`` to install the dependencies.
Tuners not supported on Windows
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SMAC is not supported currently; for the specific reason refer to this `GitHub issue <https://github.com/automl/SMAC3/issues/483>`__.
Use Windows as a remote worker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Refer to `Remote Machine mode <../TrainingService/RemoteMachineMode.rst>`__.
Segmentation fault (core dumped) when installing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Refer to `FAQ <FAQ.rst>`__.
Further reading
---------------
* `Overview <../Overview.rst>`__
* `Use command line tool nnictl <Nnictl.rst>`__
* `Use NNIBoard <WebUI.rst>`__
* `Define search space <SearchSpaceSpec.rst>`__
* `Config an experiment <ExperimentConfig.rst>`__
* `How to run an experiment on local (with multiple GPUs)? <../TrainingService/LocalMode.rst>`__
* `How to run an experiment on multiple machines? <../TrainingService/RemoteMachineMode.rst>`__
* `How to run an experiment on OpenPAI? <../TrainingService/PaiMode.rst>`__
* `How to run an experiment on Kubernetes through Kubeflow? <../TrainingService/KubeflowMode.rst>`__
* `How to run an experiment on Kubernetes through FrameworkController? <../TrainingService/FrameworkControllerMode.rst>`__
**nnictl** is a command line tool, which can be used to control experiments, such as start/stop/resume an experiment, start/stop NNIBoard, etc.
Commands
--------
nnictl supports the following commands:
* `nnictl create <#create>`__
* `nnictl resume <#resume>`__
* `nnictl view <#view>`__
* `nnictl stop <#stop>`__
* `nnictl update <#update>`__
* `nnictl trial <#trial>`__
* `nnictl top <#top>`__
* `nnictl experiment <#experiment>`__
* `nnictl platform <#platform>`__
* `nnictl config <#config>`__
* `nnictl log <#log>`__
* `nnictl webui <#webui>`__
* `nnictl tensorboard <#tensorboard>`__
* `nnictl package <#package>`__
* `nnictl ss_gen <#ss_gen>`__
* `nnictl --version <#version>`__
Manage an experiment
^^^^^^^^^^^^^^^^^^^^
:raw-html:`<a name="create"></a>`
nnictl create
^^^^^^^^^^^^^
*
Description
You can use this command to create a new experiment, using the configuration specified in the config file.
After this command completes successfully, the context will be set to this experiment, which means subsequent commands you issue are associated with this experiment, unless you explicitly change the context (not supported yet).
*
Usage
.. code-block:: bash
nnictl create [OPTIONS]
*
Options
.. list-table::
:header-rows: 1
:widths: auto
* - Name, shorthand
- Required
- Default
- Description
* - --config, -c
- True
-
- YAML configure file of the experiment
* - --port, -p
- False
-
- the port of restful server
* - --debug, -d
- False
-
- set debug mode
* - --foreground, -f
- False
-
- set foreground mode, print log content to terminal
*
Examples
..
create a new experiment with the default port: 8080
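A sketch of the command (the config path is illustrative, pointing at the MNIST example used throughout this document):

.. code-block:: bash

   nnictl create --config nni/examples/trials/mnist-tfv1/config.yml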
- The interval to update the experiment status; the unit of time is seconds, and the default value is 3 seconds.
:raw-html:`<a name="experiment"></a>`
Manage experiment information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*
**nnictl experiment show**
*
Description
Show the information of the experiment.
*
Usage
.. code-block:: bash
nnictl experiment show
*
Options
.. list-table::
:header-rows: 1
:widths: auto
* - Name, shorthand
- Required
- Default
- Description
* - id
- False
-
- ID of the experiment you want to set
*
**nnictl experiment status**
*
Description
Show the status of the experiment.
*
Usage
.. code-block:: bash
nnictl experiment status
*
Options
.. list-table::
:header-rows: 1
:widths: auto
* - Name, shorthand
- Required
- Default
- Description
* - id
- False
-
- ID of the experiment you want to set
*
**nnictl experiment list**
*
Description
Show the information of all the (running) experiments.
*
Usage
.. code-block:: bash
nnictl experiment list [OPTIONS]
*
Options
.. list-table::
:header-rows: 1
:widths: auto
* - Name, shorthand
- Required
- Default
- Description
* - --all
- False
-
- list all experiments
*
**nnictl experiment delete**
*
Description
Delete one or all experiments, including logs, results, environment information, and cache. It is used to delete useless experiment results or to save disk space.
*
Usage
.. code-block:: bash
nnictl experiment delete [OPTIONS]
*
Options
.. list-table::
:header-rows: 1
:widths: auto
* - Name, shorthand
- Required
- Default
- Description
* - id
- False
-
- ID of the experiment
* - --all
- False
-
- delete all experiments
*
**nnictl experiment export**
*
Description
You can use this command to export the reward & hyper-parameters of trial jobs to a csv or json file.
*
Usage
.. code-block:: bash
nnictl experiment export [OPTIONS]
*
Options
.. list-table::
:header-rows: 1
:widths: auto
* - Name, shorthand
- Required
- Default
- Description
* - id
- False
-
- ID of the experiment
* - --filename, -f
- True
-
- File path of the output file
* - --type
- True
-
- Type of output file, only support "csv" and "json"
* - --intermediate, -i
- False
-
- Whether to include intermediate results
*
Examples
..
export all trial data in an experiment as json format
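A sketch of the command, combining the options above (the experiment ID and file name are illustrative placeholders):

.. code-block:: bash

   nnictl experiment export [experiment_id] --filename exp_data.json --type json --intermediate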
You can use this command to import several prior or supplementary trial hyperparameters & results for NNI hyperparameter tuning. The data are fed to the tuning algorithm (e.g., tuner or advisor).
*
Usage
.. code-block:: bash
nnictl experiment import [OPTIONS]
*
Options
.. list-table::
:header-rows: 1
:widths: auto
* - Name, shorthand
- Required
- Default
- Description
* - id
- False
-
- The id of the experiment you want to import data into
* - --filename, -f
- True
-
- a file with data you want to import in json format
*
Details
NNI supports users importing their own data; please express the data in the correct format. An example is shown below:
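A minimal sketch, assuming an experiment whose search space has two hyperparameters ``x`` and ``y`` (the names and numbers are illustrative):

.. code-block:: json

   [
       {"parameter": {"x": 0.5, "y": 0.9}, "value": 0.03},
       {"parameter": {"x": 0.4, "y": 0.8}, "value": 0.05},
       {"parameter": {"x": 0.3, "y": 0.7}, "value": 0.04}
   ]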
Every element in the top level list is a sample. For our built-in tuners/advisors, each sample should have at least two keys: ``parameter`` and ``value``. The ``parameter`` must match this experiment's search space, that is, all the keys (or hyperparameters) in ``parameter`` must match the keys in the search space. Otherwise, tuner/advisor may have unpredictable behavior. ``Value`` should follow the same rule of the input in ``nni.report_final_result``\ , that is, either a number or a dict with a key named ``default``. For your customized tuner/advisor, the file could have any json content depending on how you implement the corresponding methods (e.g., ``import_data``\ ).
You can also use `nnictl experiment export <#export>`__ to export a valid json file including previous experiment trial hyperparameters and results.
Currently, the following tuners and advisors support importing data:
*If you want to import data to the BOHB advisor, users are suggested to add "TRIAL_BUDGET" in the parameter as NNI does; otherwise, BOHB will use max_budget as "TRIAL_BUDGET". Here is an example:*
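A sketch (the hyperparameter names and values are illustrative; the extra ``TRIAL_BUDGET`` key is the point):

.. code-block:: json

   [
       {"parameter": {"TRIAL_BUDGET": 3, "x": 0.5, "y": 0.9}, "value": 0.03}
   ]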
It is used to clean up disk space on a target platform. The provided YAML file includes the information of the target platform, and it follows the same schema as the NNI configuration file.
*
Note
If the target platform is being used by other users, it may cause unexpected errors to others.
*
Usage
.. code-block:: bash
nnictl platform clean [OPTIONS]
*
Options
.. list-table::
:header-rows: 1
:widths: auto
* - Name, shorthand
- Required
- Default
- Description
* - --config
- True
-
- the path of the yaml config file used when creating the experiment
- ID of the trial whose log path is to be found; required when the experiment ID is not empty.
:raw-html:`<a name="webui"></a>`
Manage webui
^^^^^^^^^^^^
*
**nnictl webui url**
*
Description
Show an experiment's webui url
*
Usage
.. code-block:: bash
nnictl webui url [options]
*
Options
.. list-table::
:header-rows: 1
:widths: auto
* - Name, shorthand
- Required
- Default
- Description
* - id
- False
-
- Experiment ID
:raw-html:`<a name="tensorboard"></a>`
Manage tensorboard
^^^^^^^^^^^^^^^^^^
*
**nnictl tensorboard start**
*
Description
Start the tensorboard process.
*
Usage
.. code-block:: bash
nnictl tensorboard start
*
Options
.. list-table::
:header-rows: 1
:widths: auto
* - Name, shorthand
- Required
- Default
- Description
* - id
- False
-
- ID of the experiment you want to set
* - --trial_id, -T
- False
-
- ID of the trial
* - --port
- False
- 6006
- The port of the tensorboard process
*
Detail
#. For the moment, NNICTL supports the tensorboard function on local and remote platforms; other platforms will be supported later.
#. If you want to use tensorboard, you need to write your tensorboard log data to the path given by the environment variable ``NNI_OUTPUT_DIR``.
#. In local mode, nnictl sets ``--logdir=[NNI_OUTPUT_DIR]`` directly and starts a tensorboard process.
#. In remote mode, nnictl first creates an ssh client to copy log data from the remote machine to a local temp directory, and then starts a tensorboard process on your local machine. Note that nnictl copies the log data only once when you run the command; if you want to see later tensorboard results, you should execute the nnictl tensorboard command again.
#. If there is only one trial job, you don't need to set the trial id. If there are multiple trial jobs running, you should set the trial id, or you can use ``nnictl tensorboard start --trial_id all`` to map ``--logdir`` to all trial log paths.
*
**nnictl tensorboard stop**
*
Description
Stop all of the tensorboard processes.
*
Usage
.. code-block:: bash
nnictl tensorboard stop
*
Options
.. list-table::
:header-rows: 1
:widths: auto
* - Name, shorthand
- Required
- Default
- Description
* - id
- False
-
- ID of the experiment you want to set
:raw-html:`<a name="package"></a>`
Manage package
^^^^^^^^^^^^^^
*
**nnictl package install**
*
Description
Install a package (customized algorithms or NNI-provided algorithms) as a builtin tuner/assessor/advisor.
*
Usage
.. code-block:: bash
nnictl package install --name <package name>
The available ``<package name>`` can be checked via ``nnictl package list`` command.
or
.. code-block:: bash
nnictl package install <installation source>
Refer to `Install customized algorithms <InstallCustomizedAlgos.rst>`__ to prepare the installation source.
We currently support Linux, macOS, and Windows. Ubuntu 16.04 or higher, macOS 10.14.1, and Windows 10.1809 are tested and supported. Simply run the following ``pip install`` in an environment that has ``python >= 3.6``.
Linux and macOS
^^^^^^^^^^^^^^^
.. code-block:: bash
python3 -m pip install --upgrade nni
Windows
^^^^^^^
.. code-block:: bash
python -m pip install --upgrade nni
.. Note:: For Linux and macOS, ``--user`` can be added if you want to install NNI in your home directory; this does not require any special privileges.
.. Note:: If there is an error like ``Segmentation fault``, please refer to the :doc:`FAQ <FAQ>`.
.. Note:: For the system requirements of NNI, please refer to :doc:`Install NNI on Linux & Mac <InstallationLinux>` or :doc:`Windows <InstallationWin>`.
Enable NNI Command-line Auto-Completion (Optional)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
After the installation, you may want to enable the auto-completion feature for **nnictl** commands. Please refer to this `tutorial <../CommunitySharings/AutoCompletion.rst>`__.
"Hello World" example on MNIST
------------------------------
NNI is a toolkit to help users run automated machine learning experiments. It can automatically do the cyclic process of getting hyperparameters, running trials, testing results, and tuning hyperparameters. Here, we'll show how to use NNI to help you find the optimal hyperparameters for a MNIST model.
Here is an example of training a CNN on the MNIST dataset **without NNI**\ ; for the full implementation, please refer to :githublink:`examples/trials/mnist-tfv1/mnist_before.py <examples/trials/mnist-tfv1/mnist_before.py>`.
Such code can only try one set of parameters at a time; if we want to tune the learning rate, we need to manually modify the hyperparameter and start the trial again and again.
NNI was born to help users with tuning jobs; the NNI working process is presented below:
.. code-block:: text
input: search space, trial code, config file
output: one optimal hyperparameter configuration
1: For t = 0, 1, 2, ..., maxTrialNum,
2: hyperparameter = choose a set of parameters from the search space
3: final result = run_trial_and_evaluate(hyperparameter)
4: report the final result to NNI
5: If the upper time limit is reached,
6: Stop the experiment
7: return the hyperparameter value with the best final result
If you want to use NNI to automatically train your model and find the optimal hyper-parameters, you need to make three changes to your code:
Three steps to start an experiment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**Step 1**\ : Write a ``Search Space`` file in JSON, including the ``name`` and the ``distribution`` (discrete-valued or continuous-valued) of all the hyperparameters you need to search; a sketch is shown below.
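A minimal sketch of such a file (the ``dropout_rate`` entry mirrors the search space discussion later in this document; the other entries are illustrative):

.. code-block:: json

   {
       "dropout_rate": {"_type": "uniform", "_value": [0.1, 0.5]},
       "batch_size": {"_type": "choice", "_value": [16, 32, 64, 128]},
       "learning_rate": {"_type": "choice", "_value": [0.0001, 0.001, 0.01, 0.1]}
   }

**Step 2**\ : Modify your ``Trial`` code to get the hyperparameter set from NNI and report the final result back, i.e., call ``nni.get_next_parameter()`` for the generated hyperparameters and ``nni.report_final_result()`` for the result.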
**Step 3**\ : Define a ``config`` file in YAML which declares the ``path`` to the search space and trial files. It also gives other information such as the tuning algorithm, max trial number, and max duration arguments.
.. code-block:: yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
trainingServicePlatform: local
# The path to Search Space
searchSpacePath: search_space.json
useAnnotation: false
tuner:
builtinTunerName: TPE
# The path and the running command of trial
trial:
command: python3 mnist.py
codeDir: .
gpuNum: 0
.. Note:: If you are planning to use remote machines or clusters as your :doc:`training service <../TrainingService/Overview>`, to avoid too much pressure on network, we limit the number of files to 2000 and total size to 300MB. If your codeDir contains too many files, you can choose which files and subfolders should be excluded by adding a ``.nniignore`` file that works like a ``.gitignore`` file. For more details on how to write this file, see the `git documentation <https://git-scm.com/docs/gitignore#_pattern_format>`__.
.. Note:: If you're using NNI on Windows, you probably need to change ``python3`` to ``python`` in the config.yml file or use the config_windows.yml file to start the experiment.
.. Note:: ``nnictl`` is a command line tool that can be used to control experiments, such as start/stop/resume an experiment, start/stop NNIBoard, etc. Click :doc:`here <Nnictl>` for more usage of ``nnictl``.
Wait for the message ``INFO: Successfully started experiment!`` in the command line. This message indicates that your experiment has been successfully started.
If you prepared ``trial``\ , ``search space``\ , and ``config`` according to the above steps and successfully created an NNI job, NNI will automatically tune the optimal hyper-parameters and run different hyper-parameter sets for each trial according to the requirements you set. You can clearly see its progress through the NNI WebUI.
WebUI
-----
After you start your experiment in NNI successfully, you can find a message in the command-line interface that tells you the ``Web UI url`` like this:
.. code-block:: text
The Web UI urls are: [Your IP]:8080
Open the ``Web UI url`` (Here it's: ``[Your IP]:8080``\ ) in your browser; you can view detailed information about the experiment and all the submitted trial jobs as shown below. If you cannot open the WebUI link in your terminal, please refer to the `FAQ <FAQ.rst>`__.
View summary page
^^^^^^^^^^^^^^^^^
Click the "Overview" tab.
Information about this experiment will be shown in the WebUI, including the experiment trial profile and search space message. NNI also supports downloading this information and the parameters through the **Download** button. You can download the experiment results anytime while the experiment is running, or you can wait until the end of the execution.
.. image:: ../../img/QuickStart1.png
:target: ../../img/QuickStart1.png
:alt:
The top 10 trials will be listed on the Overview page. You can browse all the trials on the "Trials Detail" page.
.. image:: ../../img/QuickStart2.png
:target: ../../img/QuickStart2.png
:alt:
View trials detail page
^^^^^^^^^^^^^^^^^^^^^^^
Click the "Default Metric" tab to see the point graph of all trials. Hover to see specific default metrics and search space messages.
.. image:: ../../img/QuickStart3.png
:target: ../../img/QuickStart3.png
:alt:
Click the "Hyper Parameter" tab to see the parallel graph.
* You can select the percentage to see the top trials.
* Choose two axes to swap their positions.
.. image:: ../../img/QuickStart4.png
:target: ../../img/QuickStart4.png
:alt:
Click the "Trial Duration" tab to see the bar graph.
.. image:: ../../img/QuickStart5.png
:target: ../../img/QuickStart5.png
:alt:
Below is the status of all trials. Specifically:
* Trial detail: trial's id, duration, start time, end time, status, accuracy, and search space file.
* If you run on the OpenPAI platform, you can also see the hdfsLogPath.
* Kill: you can kill a job that has the ``Running`` status.
* Support: Used to search for a specific trial.
.. image:: ../../img/QuickStart6.png
:target: ../../img/QuickStart6.png
:alt:
* Intermediate Result Graph
.. image:: ../../img/QuickStart7.png
:target: ../../img/QuickStart7.png
:alt:
Related Topic
-------------
* `Try different Tuners <../Tuner/BuiltinTuner.rst>`__
* `Try different Assessors <../Assessor/BuiltinAssessor.rst>`__
* `How to use command line tool nnictl <Nnictl.rst>`__
* `How to write a trial <../TrialExample/Trials.rst>`__
* `How to run an experiment on local (with multiple GPUs)? <../TrainingService/LocalMode.rst>`__
* `How to run an experiment on multiple machines? <../TrainingService/RemoteMachineMode.rst>`__
* `How to run an experiment on OpenPAI? <../TrainingService/PaiMode.rst>`__
* `How to run an experiment on Kubernetes through Kubeflow? <../TrainingService/KubeflowMode.rst>`__
* `How to run an experiment on Kubernetes through FrameworkController? <../TrainingService/FrameworkControllerMode.rst>`__
* `How to run an experiment on Kubernetes through AdaptDL? <../TrainingService/AdaptDLMode.rst>`__
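A representative search space definition looks like the following sketch; the first line defines ``dropout_rate``\ , and the remaining entries are illustrative:

.. code-block:: json

   {
       "dropout_rate": {"_type": "uniform", "_value": [0.1, 0.5]},
       "conv_size": {"_type": "choice", "_value": [2, 3, 5, 7]},
       "hidden_size": {"_type": "choice", "_value": [124, 512, 1024]},
       "batch_size": {"_type": "choice", "_value": [1, 4, 8, 16, 32]},
       "learning_rate": {"_type": "choice", "_value": [0.0001, 0.001, 0.01, 0.1]}
   }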
Take the first line as an example: ``dropout_rate`` is defined as a variable whose prior distribution is a uniform distribution ranging from ``0.1`` to ``0.5``.
Note that the available sampling strategies within a search space depend on the tuner you want to use. We list the supported types for each builtin tuner below. For a customized tuner, you don't have to follow our convention and you will have the flexibility to define any type you want.
Types
-----
All types of sampling strategies and their parameters are listed here:
*
``{"_type": "choice", "_value": options}``
* The variable's value is one of the options. Here ``options`` should be a list of numbers or a list of strings. Using arbitrary objects as members of this list (like sublists, a mixture of numbers and strings, or null values) should work in most cases, but may trigger undefined behaviors.
* ``options`` can also be a nested sub-search-space; this sub-search-space takes effect only when the corresponding element is chosen. The variables in this sub-search-space can be seen as conditional variables. Here is a simple :githublink:`example of nested search space definition <examples/trials/mnist-nested-search-space/search_space.json>`. If an element in the options list is a dict, it is a sub-search-space, and for our built-in tuners you have to add a ``_name`` key in this dict, which helps you to identify which element is chosen. Accordingly, here is a :githublink:`sample <examples/trials/mnist-nested-search-space/sample.json>` that users can get from NNI with a nested search space definition. See the table below for the tuners which support nested search spaces.
*
``{"_type": "quniform", "_value": [low, high, q]}``
* The variable value is determined using ``clip(round(uniform(low, high) / q) * q, low, high)``\ , where the clip operation is used to constrain the generated value within the bounds. For example, for ``_value`` specified as [0, 10, 2.5], possible values are [0, 2.5, 5.0, 7.5, 10.0]; for ``_value`` specified as [2, 10, 5], possible values are [2, 5, 10] (see the runnable sketch after this list).
* Suitable for a discrete value with respect to which the objective is still somewhat "smooth", but which should be bounded both above and below. If you want to uniformly choose an integer from a range [low, high], you can write ``_value`` like this: ``[low, high, 1]``.
*
``{"_type": "loguniform", "_value": [low, high]}``
* The variable value is drawn from a range [low, high] according to a loguniform distribution like exp(uniform(log(low), log(high))), so that the logarithm of the return value is uniformly distributed.
* When optimizing, this variable is constrained to be positive.
*
``{"_type": "qloguniform", "_value": [low, high, q]}``
* The variable value is determined using ``clip(round(loguniform(low, high) / q) * q, low, high)``\ , where the clip operation is used to constrain the generated value within the bounds.
* Suitable for a discrete variable with respect to which the objective is "smooth" and gets smoother with the size of the value, but which should be bounded both above and below.
*
``{"_type": "normal", "_value": [mu, sigma]}``
* The variable value is a real value that's normally-distributed with mean mu and standard deviation sigma. When optimizing, this is an unconstrained variable.
*
``{"_type": "qnormal", "_value": [mu, sigma, q]}``
* The variable value is determined using ``round(normal(mu, sigma) / q) * q``
* Suitable for a discrete variable that probably takes a value around mu, but is fundamentally unbounded.
*
``{"_type": "lognormal", "_value": [mu, sigma]}``
* The variable value is drawn according to ``exp(normal(mu, sigma))`` so that the logarithm of the return value is normally distributed. When optimizing, this variable is constrained to be positive.
*
``{"_type": "qlognormal", "_value": [mu, sigma, q]}``
* The variable value is determined using ``round(exp(normal(mu, sigma)) / q) * q``
* Suitable for a discrete variable with respect to which the objective is smooth and gets smoother with the size of the variable, which is bounded from one side.
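The q-based types above all share the same round-and-clip pattern. Below is a quick, self-contained sketch of the ``quniform`` formula (a plain illustration of the math, not NNI's implementation):

.. code-block:: python

   import random

   def quniform_sample(low, high, q):
       # clip(round(uniform(low, high) / q) * q, low, high)
       raw = random.uniform(low, high)
       return min(max(round(raw / q) * q, low), high)

   # _value = [0, 10, 2.5] -> possible values {0, 2.5, 5.0, 7.5, 10.0}
   print(sorted({quniform_sample(0, 10, 2.5) for _ in range(1000)}))
   # _value = [2, 10, 5] -> possible values {2, 5, 10}
   print(sorted({quniform_sample(2, 10, 5) for _ in range(1000)}))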
Search Space Types Supported by Each Tuner
------------------------------------------
.. list-table::
:header-rows: 1
:widths: auto
* -
- choice
- choice(nested)
- randint
- uniform
- quniform
- loguniform
- qloguniform
- normal
- qnormal
- lognormal
- qlognormal
* - TPE Tuner
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
* - Random Search Tuner
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
* - Anneal Tuner
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
* - Evolution Tuner
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
* - SMAC Tuner
- :raw-html:`✓`
-
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
-
-
-
-
-
* - Batch Tuner
- :raw-html:`✓`
-
-
-
-
-
-
-
-
-
-
* - Grid Search Tuner
- :raw-html:`✓`
-
- :raw-html:`✓`
-
- :raw-html:`✓`
-
-
-
-
-
-
* - Hyperband Advisor
- :raw-html:`✓`
-
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
* - Metis Tuner
- :raw-html:`✓`
-
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
-
-
-
-
-
-
* - GP Tuner
- :raw-html:`✓`
-
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
- :raw-html:`✓`
-
-
-
-
Known Limitations:
*
GP Tuner and Metis Tuner support only **numerical values** in the search space (\ ``choice``\ -type values can be non-numerical with other tuners, e.g., string values). Both GP Tuner and Metis Tuner use a Gaussian Process Regressor (GPR). GPR makes predictions based on a kernel function and the 'distance' between different points, and it's hard to get a true distance between non-numerical values.
*
Note that for nested search spaces:
* Only the Random Search, TPE, Anneal, and Evolution tuners support nested search spaces
The NNI development environment supports Ubuntu 16.04 (or above) and Windows 10, with 64-bit Python 3.
Installation
------------
The installation steps are similar to installing from source code, but the installation is linked to the code directory, so that code changes can be applied to the installation as easily as possible.
1. Clone source code
^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash
git clone https://github.com/Microsoft/nni.git
Note: if you want to contribute code back, you need to fork your own NNI repo and clone from there.
Nothing to do; the code is already linked to the package folders.
TypeScript
^^^^^^^^^^
* If ``src/nni_manager`` is changed, run ``yarn watch`` under this folder. It will watch and build code continually. ``nnictl`` needs to be restarted to reload the NNI manager.
* If ``src/webui`` is changed, run ``yarn dev``\ , which will run a mock API server and a webpack dev server simultaneously. Use ``EXPERIMENT`` environment variable (e.g., ``mnist-tfv1-running``\ ) to specify the mock data being used. Built-in mock experiments are listed in ``src/webui/mock``. An example of the full command is ``EXPERIMENT=mnist-tfv1-running yarn dev``.
* If ``src/nasui`` is changed, run ``yarn start`` under the corresponding folder. The web UI will refresh automatically if code is changed. There is also a mock API server that is useful when developing. It can be launched via ``node server.js``.
5. Submit Pull Request
^^^^^^^^^^^^^^^^^^^^^^
All changes are merged into the master branch from your forked repo. The description of a pull request must be meaningful and useful.
We will review the changes as soon as possible. Once the pull request passes review, we will merge it into the master branch.
For more contribution guidelines and coding styles, you can refer to the `contributing document <Contributing.rst>`__.
* On the overview tab, you can see the experiment information and status and the performance of the top trials. If you want to see the config and search space, please click the buttons "Config" and "Search space" on the right.
.. image:: ../../img/webui-img/full-oview.png
:target: ../../img/webui-img/full-oview.png
:alt:
* If your experiment has many trials, you can change the refresh interval here.
The trial may have many intermediate results during the training process. In order to see the trend of some trials more clearly, we provide a filtering function for the intermediate result graph.
You may find that some trials get better or worse at a particular intermediate result; this indicates that it is an important and relevant intermediate result. To take a closer look at this point, enter its corresponding X-value at #Intermediate, then input the range of metrics at this intermediate result. In the picture below, we choose the No. 4 intermediate result and set the range of metrics to 0.8-1.
Click the tab "Trials Detail" to see the status of all trials. Specifically:
* Trial detail: trial's id, trial's duration, start time, end time, status, accuracy, and search space file.
.. image:: ../../img/webui-img/detail-local.png
:target: ../../img/webui-img/detail-local.png
:alt:
* The button named "Add column" lets you select which columns to show in the table. If you run an experiment whose final result is a dict, you can see other keys in the table. You can choose the column "Intermediate count" to watch the trial's progress.
.. image:: ../../img/webui-img/addColumn.png
:target: ../../img/webui-img/addColumn.png
:alt:
* If you want to compare some trials, you can select them and then click "Compare" to see the results.
.. image:: ../../img/webui-img/select-trial.png
:target: ../../img/webui-img/select-trial.png
:alt:
.. image:: ../../img/webui-img/compare.png
:target: ../../img/webui-img/compare.png
:alt:
* Support to search for a specific trial by its id, status, Trial No., and parameters.
.. image:: ../../img/webui-img/search-trial.png
:target: ../../img/webui-img/search-trial.png
:alt:
* You can use the button named "Copy as python" to copy the trial's parameters.
.. image:: ../../img/webui-img/copyParameter.png
:target: ../../img/webui-img/copyParameter.png
:alt:
* If you run on the OpenPAI or Kubeflow platform, you can also see the nfs log.
.. image:: ../../img/webui-img/detail-pai.png
:target: ../../img/webui-img/detail-pai.png
:alt:
* Intermediate Result Graph: you can see the default metric in this graph by clicking the intermediate button.
.. image:: ../../img/webui-img/intermediate.png
:target: ../../img/webui-img/intermediate.png
:alt:
* Kill: you can kill a job whose status is ``Running``.