NNI supports running an experiment on `DLTS <https://github.com/microsoft/DLWorkspace.git>`__\ , called dlts mode. Before starting to use NNI dlts mode, you should have an account to access the DLTS dashboard.
Setup Environment
-----------------
Step 1. Choose a cluster from the DLTS dashboard; ask the administrator for the cluster dashboard URL.
.. image:: ../../img/dlts-step1.png
:target: ../../img/dlts-step1.png
:alt: Choose Cluster
Step 2. Prepare an NNI config YAML like the following:
.. code-block:: yaml
# Set this field to "dlts"
trainingServicePlatform: dlts
authorName: your_name
experimentName: auto_mnist
trialConcurrency: 2
maxExecDuration: 3h
maxTrialNum: 100
searchSpacePath: search_space.json
useAnnotation: false
tuner:
builtinTunerName: TPE
classArgs:
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: .
gpuNum: 1
image: msranni/nni
# Configuration to access DLTS
dltsConfig:
dashboard: # Ask administrator for the cluster dashboard URL
Remember to fill in the cluster dashboard URL on the last line.
Step 3. Open your working directory on the cluster and copy the NNI config as well as related code into a directory.
.. image:: ../../img/dlts-step3.png
:target: ../../img/dlts-step3.png
:alt: Copy Config
Step 4. Submit an NNI manager job to the specified cluster.
.. image:: ../../img/dlts-step4.png
:target: ../../img/dlts-step4.png
:alt: Submit Job
Step 5. Go to the Endpoints tab of the newly created job and click the Port 40000 link to check the trial's information.
Step 1.3 - Report NNI results: Use the API ``nni.report_intermediate_result(accuracy)`` to send ``accuracy`` to the assessor, and the API ``nni.report_final_result(accuracy)`` to send ``accuracy`` to the tuner.
**NOTE**\ :

* ``accuracy`` could be any Python object, but if you use an NNI built-in tuner/assessor, it should be a numerical value (e.g. float, int).
* The tuner generates the next parameters/architecture based on the exploration history (the final results of all trials).
* The assessor decides which trial should stop early based on the trial's performance history (the intermediate results of one trial).
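The reporting loop above can be sketched as follows. This is a minimal sketch: the training step and metric values are illustrative, and since ``nni`` is only importable inside a real NNI environment, the sketch falls back to printing when the package is absent.

.. code-block:: python

   try:
       import nni
       report_intermediate = nni.report_intermediate_result
       report_final = nni.report_final_result
   except ImportError:
       # Standalone fallback: just print the metrics instead of reporting them
       report_intermediate = lambda metric: print("intermediate:", metric)
       report_final = lambda metric: print("final:", metric)

   def train_one_epoch(epoch):
       """Hypothetical training step; returns a fake accuracy."""
       return 0.5 + 0.05 * epoch

   accuracy = 0.0
   for epoch in range(5):
       accuracy = train_one_epoch(epoch)
       report_intermediate(accuracy)  # per-epoch metric, consumed by the assessor
   report_final(accuracy)             # final metric, consumed by the tuner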
Step 2 - Define SearchSpace

The hyper-parameters used in ``Step 1.2 - Get predefined parameters`` are defined in a ``search_space.json`` file.
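For instance, a ``search_space.json`` for the mnist example might look like the following (parameter names and ranges are illustrative):

.. code-block:: json

   {
       "batch_size": {"_type": "choice", "_value": [16, 32, 64, 128]},
       "hidden_size": {"_type": "choice", "_value": [128, 256, 512, 1024]},
       "lr": {"_type": "choice", "_value": [0.0001, 0.001, 0.01, 0.1]},
       "momentum": {"_type": "uniform", "_value": [0, 1]}
   }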
Refer to `define search space <../Tutorial/SearchSpaceSpec.rst>`__ to learn more about search space.
Step 3 - Define Experiment
To run an experiment in NNI, you only need to:
* Provide a runnable trial
* Provide or choose a tuner
* Provide a YAML experiment configuration file
* (optional) Provide or choose an assessor
**Prepare trial**\ :
Let's use a simple trial example, e.g. mnist, provided by NNI. After you clone the NNI source, the examples are located in ``~/nni/examples``\ ; run ``ls ~/nni/examples/trials`` to see all the trial examples. You can simply execute the following command to run the NNI mnist example:
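(Assuming the NNI source was cloned to ``~/nni`` as described above:)

.. code-block:: bash

   python3 ~/nni/examples/trials/mnist-pytorch/mnist.py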
This command will be filled into the YAML configuration file below. Please refer to `here <../TrialExample/Trials.rst>`__ for how to write your own trial.
**Prepare tuner**\ : NNI supports several popular AutoML algorithms, including Random Search, Tree of Parzen Estimators (TPE), the Evolution algorithm, etc. Users can write their own tuner (refer to `here <../Tuner/CustomizeTuner.rst>`__\ ), but for simplicity, here we choose a tuner provided by NNI as below:
.. code-block:: yaml
tuner:
name: TPE
classArgs:
optimize_mode: maximize
*name* is used to specify a tuner in NNI, *classArgs* are the arguments passed to the tuner (the spec of built-in tuners can be found `here <../Tuner/BuiltinTuner.rst>`__\ ), and *optimize_mode* indicates whether you want to maximize or minimize your trial's result.
**Prepare the configuration file**\ : Since you already know which trial code you are going to run and which tuner you are going to use, it is time to prepare the YAML configuration file. NNI provides a demo configuration file for each trial example; run ``cat ~/nni/examples/trials/mnist-pytorch/config.yml`` to see it. Its content is basically shown below:
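A sketch of that demo config (field values are illustrative and may differ between NNI versions):

.. code-block:: yaml

   searchSpaceFile: search_space.json
   trialCommand: python3 mnist.py
   trialGpuNumber: 0
   trialConcurrency: 1
   maxTrialNumber: 10
   tuner:
     name: TPE
     classArgs:
       optimize_mode: maximize
   trainingService:
     platform: local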
You can refer to `here <../Tutorial/Nnictl.rst>`__ for more usage guide of *nnictl* command line tool.
View experiment results
-----------------------
The experiment has been running now. Other than *nnictl*\ , NNI also provides a WebUI for you to view experiment progress, control your experiment, and use some other appealing features.
Using multiple local GPUs to speed up search
--------------------------------------------
The following steps assume that you have 4 NVIDIA GPUs installed locally and PyTorch with CUDA support. The demo enables 4 concurrent trial jobs, and each trial job uses 1 GPU.
**Prepare the configuration file**\ : NNI provides a demo configuration file for the setting above; run ``cat ~/nni/examples/trials/mnist-pytorch/config_detailed.yml`` to see it. The ``trialConcurrency`` and ``trialGpuNumber`` fields differ from the basic configuration file:
.. code-block:: yaml
...
trialGpuNumber: 1
trialConcurrency: 4
...
trainingService:
platform: local
useActiveGpu: false # set to "true" if you are using graphical OS like Windows 10 and Ubuntu desktop
We can run the experiment with the following command:
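(Assuming the detailed config shipped with the examples:)

.. code-block:: bash

   nnictl create --config ~/nni/examples/trials/mnist-pytorch/config_detailed.yml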
You can use the *nnictl* command line tool or the WebUI to trace the training progress. The *nvidia-smi* command line tool can also help you monitor GPU usage during training.
NNI supports standalone mode, which allows trial code to run without launching an NNI experiment. This makes it easier to find bugs in trial code. NNI Annotation supports standalone mode natively, because the added NNI-related lines take the form of comments. NNI trial APIs behave differently in standalone mode: some APIs return dummy values, and some APIs do not report values. Please refer to the table below for the full list of these APIs.
This simple annealing algorithm begins by sampling from the prior but tends over time to sample from points closer and closer to the best ones observed. This algorithm is a simple variation on random search that leverages smoothness in the response surface. The annealing rate is not adaptive.
Usage
-----
classArgs Requirements
^^^^^^^^^^^^^^^^^^^^^^
* **optimize_mode** (*maximize or minimize, optional, default = maximize*) - If 'maximize', the tuner will try to maximize metrics. If 'minimize', the tuner will try to minimize metrics.
Batch tuner allows users to simply provide several configurations (i.e., choices of hyper-parameters) for their trial code. After finishing all the configurations, the experiment is done. Batch tuner only supports the type ``choice`` in the `search space spec <../Tutorial/SearchSpaceSpec.rst>`__.
Suggested scenario: If the configurations you want to try have been decided, you can list them in the SearchSpace file (using ``choice``) and run them using the batch tuner.
Usage
-----
Example Configuration
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: yaml
# config.yml
tuner:
name: BatchTuner
Note that the search space for BatchTuner should look like:
The search space file should include the high-level key ``combine_params``. The type of params in the search space must be ``choice`` and the ``values`` must include all the combined params values.
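For example (hyper-parameter names and values are illustrative):

.. code-block:: json

   {
       "combine_params": {
           "_type": "choice",
           "_value": [
               {"optimizer": "Adam", "learning_rate": 0.001},
               {"optimizer": "Adam", "learning_rate": 0.0001},
               {"optimizer": "SGD", "learning_rate": 0.01}
           ]
       }
   }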
In `Random Search for Hyper-Parameter Optimization <http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf>`__ we show that Random Search might be surprisingly effective despite its simplicity.
We suggest using Random Search as a baseline when no knowledge about the prior distribution of hyper-parameters is available.
The Tree-structured Parzen Estimator (TPE) is a sequential model-based optimization (SMBO) approach.
SMBO methods sequentially construct models to approximate the performance of hyperparameters based on historical measurements,
and then subsequently choose new hyperparameters to test based on this model.
The TPE approach models P(x|y) and P(y), where x represents hyperparameters and y the associated evaluation metric.
P(x|y) is modeled by transforming the generative process of hyperparameters,
replacing the distributions of the configuration prior with non-parametric densities.
This optimization approach is described in detail in `Algorithms for Hyper-Parameter Optimization <https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf>`__.
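Loosely, TPE splits the observed configurations at a quantile :math:`\gamma` of their metric values into a "good" set with density :math:`\ell(x)` and a "bad" set with density :math:`g(x)`; the paper shows that the expected improvement of a candidate :math:`x` is then proportional to

.. math::

   EI_{y^*}(x) \propto \left( \gamma + \frac{g(x)}{\ell(x)} \, (1 - \gamma) \right)^{-1}

so TPE prefers points that are likely under :math:`\ell` and unlikely under :math:`g`.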
Parallel TPE optimization
^^^^^^^^^^^^^^^^^^^^^^^^^
TPE approaches were actually run asynchronously in order to make use of multiple compute nodes and to avoid wasting time waiting for trial evaluations to complete.
However, the original algorithm design was optimized for sequential computation, so using TPE with high concurrency would degrade its performance.
We have optimized for this case using the Constant Liar algorithm.
For these principles of optimization, please refer to our `research blog <../CommunitySharings/ParallelizingTpeSearch.rst>`__.
Usage
-----
To use TPE, you should add the following spec in your experiment's YAML config file:
.. code-block:: yaml
## minimal config ##
tuner:
name: TPE
classArgs:
optimize_mode: minimize
.. code-block:: yaml
## advanced config ##
tuner:
name: TPE
classArgs:
optimize_mode: maximize
seed: 12345
tpe_args:
constant_liar_type: 'mean'
n_startup_jobs: 10
n_ei_candidates: 20
linear_forgetting: 100
prior_weight: 0
gamma: 0.5
classArgs
^^^^^^^^^
.. list-table::
:widths: 10 20 10 60
:header-rows: 1
* - Field
- Type
- Default
- Description
* - ``optimize_mode``
- ``'minimize' | 'maximize'``
- ``'minimize'``
- Whether to minimize or maximize trial metrics.
* - ``seed``
- ``int | null``
- ``null``
- The random seed.
* - ``tpe_args.constant_liar_type``
- ``'best' | 'worst' | 'mean' | null``
- ``'best'``
- The TPE algorithm itself does not support parallel tuning. This parameter specifies how to optimize for trial_concurrency > 1. How each liar works is explained in section 6.1 of the paper.
In general, ``best`` suits a small number of trials and ``worst`` suits a large number of trials.
* - ``tpe_args.n_startup_jobs``
- ``int``
- ``20``
- The first N hyper-parameters are generated fully randomly for warming up.
If the search space is large, you can increase this value. Or if max_trial_number is small, you may want to decrease it.
* - ``tpe_args.n_ei_candidates``
- ``int``
- ``24``
- For each iteration, TPE samples the EI of N sets of parameters and chooses the best one (loosely speaking).
* - ``tpe_args.linear_forgetting``
- ``int``
- ``25``
- TPE will lower the weights of old trials. This controls how many iterations it takes before a trial's weight starts to decay.
* - ``tpe_args.prior_weight``
- ``float``
- ``1.0``
- TPE treats user provided search space as prior.
When generating new trials, it also incorporates the prior in trial history by transforming the search space to
one trial configuration (i.e., each parameter of this configuration chooses the mean of its candidate range).
Here, prior_weight determines the weight of this trial configuration in the history trial configurations.
With prior weight 1.0, the search space is treated as one good trial.
For example, "normal(0, 1)" effectively equals a trial with x = 0 which has yielded a good result.
* - ``tpe_args.gamma``
- ``float``
- ``0.25``
- Controls how many trials are considered "good".
The number is calculated as "min(gamma * sqrt(N), linear_forgetting)".
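As a quick sketch of that rule (the exact rounding NNI applies is an assumption here):

.. code-block:: python

   import math

   def n_good_trials(n_observed, gamma=0.25, linear_forgetting=25):
       """Number of trials treated as "good": min(gamma * sqrt(N), linear_forgetting)."""
       return min(math.ceil(gamma * math.sqrt(n_observed)), linear_forgetting)

   print(n_good_trials(16))     # 0.25 * sqrt(16) = 1.0 -> 1
   print(n_good_trials(10000))  # 0.25 * sqrt(10000) = 25, capped at linear_forgetting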
This is the previous version (V1) of experiment configuration specification. It is still supported for now, but we recommend users to use `the new version of experiment configuration (V2) <../reference/experiment_config.rst>`_.
A config file is needed when creating an experiment. The path of the config file is provided to ``nnictl``.
The config file is in YAML format.
This document describes the rules to write the config file, and provides some examples and templates.
* `Kubeflow with azure storage <#kubeflow-with-azure-storage>`__
Template
--------
* **Light weight (without Annotation and Assessor)**
.. code-block:: yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName:
classArgs:
#choice: maximize, minimize
optimize_mode:
gpuIndices:
trial:
command:
codeDir:
gpuNum:
#machineList can be empty if the platform is local
machineList:
- ip:
port:
username:
passwd:
* **Use Assessor**
.. code-block:: yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName:
classArgs:
#choice: maximize, minimize
optimize_mode:
gpuIndices:
assessor:
#choice: Medianstop
builtinAssessorName:
classArgs:
#choice: maximize, minimize
optimize_mode:
trial:
command:
codeDir:
gpuNum:
#machineList can be empty if the platform is local
machineList:
- ip:
port:
username:
passwd:
* **Use Annotation**
.. code-block:: yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName:
classArgs:
#choice: maximize, minimize
optimize_mode:
gpuIndices:
assessor:
#choice: Medianstop
builtinAssessorName:
classArgs:
#choice: maximize, minimize
optimize_mode:
trial:
command:
codeDir:
gpuNum:
#machineList can be empty if the platform is local
machineList:
- ip:
port:
username:
passwd:
Configuration Spec
------------------
authorName
^^^^^^^^^^
Required. String.
The name of the author who created the experiment.
*TBD: add default value.*
experimentName
^^^^^^^^^^^^^^
Required. String.
The name of the experiment created.
*TBD: add default value.*
trialConcurrency
^^^^^^^^^^^^^^^^
Required. Integer between 1 and 99999.
Specifies the maximum number of trial jobs that run simultaneously.
If **trialGpuNum** is larger than the number of free GPUs, and the number of trial jobs running simultaneously cannot reach **trialConcurrency**\ , some trial jobs will be put into a queue to wait for GPU allocation.
maxExecDuration
^^^^^^^^^^^^^^^
Optional. String. Default: 999d.
**maxExecDuration** specifies the max duration time of an experiment. The unit of the time is {**s**\ , **m**\ , **h**\ , **d**\ }, which means {*seconds*\ , *minutes*\ , *hours*\ , *days*\ }.
Note: The maxExecDuration spec sets the duration of an experiment, not of a trial job. If the experiment reaches the max duration time, it will not stop, but it cannot submit new trial jobs any more.
versionCheck
^^^^^^^^^^^^
Optional. Bool. Default: true.
NNI will check the version of the nniManager process and the version of trialKeeper on the remote, pai, and kubernetes platforms. If you want to disable the version check, you can set versionCheck to false.
debug
^^^^^
Optional. Bool. Default: false.
Debug mode sets versionCheck to false and logLevel to 'debug'.
maxTrialNum
^^^^^^^^^^^
Optional. Integer between 1 and 99999. Default: 99999.
Specifies the max number of trial jobs created by NNI, including succeeded and failed jobs.
maxTrialDuration
^^^^^^^^^^^^^^^^
Optional. String. Default: 999d.
**maxTrialDuration** specifies the max duration time of each trial job. The unit of the time is {**s**\ , **m**\ , **h**\ , **d**\ }, which means {*seconds*\ , *minutes*\ , *hours*\ , *days*\ }. If the current trial job reaches the max duration time, it will be stopped.
trainingServicePlatform
^^^^^^^^^^^^^^^^^^^^^^^
Required. String.
Specifies the platform to run the experiment, including **local**\ , **remote**\ , **pai**\ , **kubeflow**\ , **frameworkcontroller**.
*
**local** run an experiment on local ubuntu machine.
*
**remote** submit trial jobs to remote ubuntu machines, and the **machineList** field should be filled in order to set up the SSH connection to the remote machine.
*
**pai** submit trial jobs to `OpenPAI <https://github.com/Microsoft/pai>`__ of Microsoft. For more details of pai configuration, please refer to `Guide to PAI Mode <../TrainingService/PaiMode.rst>`__
*
**kubeflow** submit trial jobs to `kubeflow <https://www.kubeflow.org/docs/about/kubeflow/>`__\ ; NNI supports kubeflow based on normal kubernetes and `azure kubernetes <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__. For details please refer to `Kubeflow Docs <../TrainingService/KubeflowMode.rst>`__
*
**adl** submit trial jobs to `AdaptDL <https://www.kubeflow.org/docs/about/kubeflow/>`__\ ; NNI supports AdaptDL on Kubernetes clusters. For details please refer to `AdaptDL Docs <../TrainingService/AdaptDLMode.rst>`__
*
TODO: explain frameworkcontroller.
searchSpacePath
^^^^^^^^^^^^^^^
Optional. Path to existing file.
Specifies the path of the search space file, which should be a valid path on the local Linux machine.
The only case in which **searchSpacePath** may be omitted is when ``useAnnotation=True``.
useAnnotation
^^^^^^^^^^^^^
Optional. Bool. Default: false.
Use annotation to analyze trial code and generate the search space.
Note: if **useAnnotation** is true, the searchSpacePath field should be removed.
multiThread
^^^^^^^^^^^
Optional. Bool. Default: false.
Enable multi-thread mode for dispatcher. If multiThread is enabled, dispatcher will start a thread to process each command from NNI Manager.
nniManagerIp
^^^^^^^^^^^^
Optional. String. Default: eth0 device IP.
Sets the IP address of the machine on which the NNI manager process runs. This field is optional; if it is not set, the eth0 device IP will be used instead.
Note: run ``ifconfig`` on the NNI manager's machine to check whether the eth0 device exists. If it does not, it is recommended to set **nniManagerIp** explicitly.
logDir
^^^^^^
Optional. Path to a directory. Default: ``<user home directory>/nni-experiments``.
Configures the directory to store logs and data of the experiment.
logLevel
^^^^^^^^
Optional. String. Default: ``info``.
Sets log level for the experiment. Available log levels are: ``trace``\ , ``debug``\ , ``info``\ , ``warning``\ , ``error``\ , ``fatal``.
logCollection
^^^^^^^^^^^^^
Optional. ``http`` or ``none``. Default: ``none``.
Sets the way to collect logs on the remote, pai, kubeflow, and frameworkcontroller platforms. There are two ways to collect logs. One is ``http``\ : the trial keeper posts log content back via http requests, but this may slow down log processing in the trial keeper. The other is ``none``\ : the trial keeper does not post log content back and only posts job metrics. If your log content is too big, consider setting this param to ``none``.
tuner
^^^^^
Required.
Specifies the tuner algorithm of the experiment. There are two ways to set the tuner. One is to use a tuner provided by the NNI sdk (built-in tuners), in which case you need to set **builtinTunerName** and **classArgs**. The other is to use your own tuner file, in which case **codeDirectory**\ , **classFileName**\ , **className** and **classArgs** are needed. *Users must choose exactly one way.*
builtinTunerName
^^^^^^^^^^^^^^^^
Required if using built-in tuners. String.
Specifies the name of system tuner, NNI sdk provides different tuners introduced `here <../Tuner/BuiltinTuner.rst>`__.
codeDir
^^^^^^^
Required if using customized tuners. Path relative to the location of config file.
Specifies the directory of tuner code.
classFileName
^^^^^^^^^^^^^
Required if using customized tuners. File path relative to **codeDir**.
Specifies the name of tuner file.
className
^^^^^^^^^
Required if using customized tuners. String.
Specifies the name of tuner class.
classArgs
^^^^^^^^^
Optional. Key-value pairs. Default: empty.
Specifies the arguments of tuner algorithm. Please refer to `this file <../Tuner/BuiltinTuner.rst>`__ for the configurable arguments of each built-in tuner.
gpuIndices
^^^^^^^^^^
Optional. String. Default: empty.
Specifies the GPUs that can be used by the tuner process. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by comma ``,``. For example, ``1``\ , or ``0,1,3``. If the field is not set, no GPU will be visible to tuner (by setting ``CUDA_VISIBLE_DEVICES`` to be an empty string).
includeIntermediateResults
^^^^^^^^^^^^^^^^^^^^^^^^^^
Optional. Bool. Default: false.
If **includeIntermediateResults** is true, the last intermediate result of the trial that is early stopped by assessor is sent to tuner as final result.
assessor
^^^^^^^^
Specifies the assessor algorithm of the experiment. Similar to tuners, there are two ways to set the assessor. One is to use an assessor provided by the NNI sdk, in which case you need to set **builtinAssessorName** and **classArgs**. The other is to use your own assessor file, in which case **codeDirectory**\ , **classFileName**\ , **className** and **classArgs** are needed. *Users must choose exactly one way.*
By default, there is no assessor enabled.
builtinAssessorName
^^^^^^^^^^^^^^^^^^^
Required if using built-in assessors. String.
Specifies the name of built-in assessor, NNI sdk provides different assessors introduced `here <../Assessor/BuiltinAssessor.rst>`__.
codeDir
^^^^^^^
Required if using customized assessors. Path relative to the location of config file.
Specifies the directory of assessor code.
classFileName
^^^^^^^^^^^^^
Required if using customized assessors. File path relative to **codeDir**.
Specifies the name of assessor file.
className
^^^^^^^^^
Required if using customized assessors. String.
Specifies the name of assessor class.
classArgs
^^^^^^^^^
Optional. Key-value pairs. Default: empty.
Specifies the arguments of assessor algorithm.
advisor
^^^^^^^
Optional.
Specifies the advisor algorithm of the experiment. Similar to tuners and assessors, there are two ways to specify an advisor. One is to use an advisor provided by the NNI sdk, which requires setting **builtinAdvisorName** and **classArgs**. The other is to use your own advisor file, which requires setting **codeDirectory**\ , **classFileName**\ , **className** and **classArgs**.
When an advisor is enabled, the tuner and assessor settings will be ignored.
builtinAdvisorName
^^^^^^^^^^^^^^^^^^
Required if using built-in advisors. String.
Specifies the name of a built-in advisor. The NNI sdk provides `BOHB <../Tuner/BohbAdvisor.rst>`__ and `Hyperband <../Tuner/HyperbandAdvisor.rst>`__.
codeDir
^^^^^^^
Required if using customized advisors. Path relative to the location of config file.
Specifies the directory of advisor code.
classFileName
^^^^^^^^^^^^^
Required if using customized advisors. File path relative to **codeDir**.
Specifies the name of advisor file.
className
^^^^^^^^^
Required if using customized advisors. String.
Specifies the name of advisor class.
classArgs
^^^^^^^^^
Optional. Key-value pairs. Default: empty.
Specifies the arguments of advisor.
gpuIndices
^^^^^^^^^^
Optional. String. Default: empty.
Specifies the GPUs that can be used by the advisor process. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by comma ``,``. For example, ``1``\ , or ``0,1,3``. If the field is not set, no GPU will be visible to the advisor (by setting ``CUDA_VISIBLE_DEVICES`` to be an empty string).
trial
^^^^^
Required. Key-value pairs.
In local and remote mode, the following keys are required.
*
**command**\ : Required string. Specifies the command to run trial process.
*
**codeDir**\ : Required string. Specifies the directory of your own trial file. This directory will be automatically uploaded in remote mode.
*
**gpuNum**\ : Optional integer. Specifies the number of GPUs used to run the trial process. Default value is 0.
In PAI mode, the following keys are required.
*
**command**\ : Required string. Specifies the command to run trial process.
*
**codeDir**\ : Required string. Specifies the directory of the own trial file. Files in the directory will be uploaded in PAI mode.
*
**gpuNum**\ : Required integer. Specifies the number of GPUs used to run the trial process. Default value is 0.
*
**cpuNum**\ : Required integer. Specifies the number of CPUs to be used in the pai container.
*
**memoryMB**\ : Required integer. Set the memory size to be used in pai container, in megabytes.
*
**image**\ : Required string. Set the image to be used in pai.
*
**authFile**\ : Optional string. Used to provide Docker registry which needs authentication for image pull in PAI. `Reference <https://github.com/microsoft/pai/blob/2ea69b45faa018662bc164ed7733f6fdbb4c42b3/docs/faq.rst#q-how-to-use-private-docker-registry-job-image-when-submitting-an-openpai-job>`__.
*
**shmMB**\ : Optional integer. Shared memory size of container.
*
**portList**\ : List of key-values pairs with ``label``\ , ``beginAt``\ , ``portNumber``. See `job tutorial of PAI <https://github.com/microsoft/pai/blob/master/docs/job_tutorial.rst>`__ for details.
.. cannot find `Reference <https://github.com/microsoft/pai/blob/2ea69b45faa018662bc164ed7733f6fdbb4c42b3/docs/faq.rst#q-how-to-use-private-docker-registry-job-image-when-submitting-an-openpai-job>`__ and `job tutorial of PAI <https://github.com/microsoft/pai/blob/master/docs/job_tutorial.rst>`__
In Kubeflow mode, the following keys are required.
*
**codeDir**\ : The local directory containing the code files.
*
**ps**\ : An optional configuration for kubeflow's tensorflow-operator, which includes
*
**replicas**\ : The replica number of **ps** role.
*
**command**\ : The run script in **ps**\ 's container.
*
**gpuNum**\ : The gpu number to be used in **ps** container.
*
**cpuNum**\ : The cpu number to be used in **ps** container.
*
**memoryMB**\ : The memory size of the container.
*
**image**\ : The image to be used in **ps**.
*
**worker**\ : An optional configuration for kubeflow's tensorflow-operator.
*
**replicas**\ : The replica number of **worker** role.
*
**command**\ : The run script in **worker**\ 's container.
*
**gpuNum**\ : The gpu number to be used in **worker** container.
*
**cpuNum**\ : The cpu number to be used in **worker** container.
*
**memoryMB**\ : The memory size of the container.
*
**image**\ : The image to be used in **worker**.
localConfig
^^^^^^^^^^^
Optional in local mode. Key-value pairs.
Only applicable if **trainingServicePlatform** is set to ``local``\ ; otherwise there should not be a **localConfig** section in the configuration file.
gpuIndices
^^^^^^^^^^
Optional. String. Default: none.
Used to specify designated GPU devices for NNI, if it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified. Multiple GPU indices should be separated with comma (\ ``,``\ ), such as ``1`` or ``0,1,3``. By default, all GPUs available will be used.
maxTrialNumPerGpu
^^^^^^^^^^^^^^^^^
Optional. Integer. Default: 1.
Used to specify the maximum number of concurrent trials on a GPU device.
useActiveGpu
^^^^^^^^^^^^
Optional. Bool. Default: false.
Used to specify whether to use a GPU on which another process is running. By default, NNI will use a GPU only if there is no other active process on it. If **useActiveGpu** is set to true, NNI will use the GPU regardless of other processes. This field is not applicable to NNI on Windows.
machineList
^^^^^^^^^^^
Required in remote mode. A list of key-value pairs with the following keys.
ip
^^
Required. IP address or host name that is accessible from the current machine.
The IP address or host name of remote machine.
port
^^^^
Optional. Integer. Valid port. Default: 22.
The ssh port to be used to connect to the machine.
username
^^^^^^^^
Required if authentication with username/password. String.
The account of remote machine.
passwd
^^^^^^
Required if authentication with username/password. String.
Specifies the password of the account.
sshKeyPath
^^^^^^^^^^
Required if authentication with ssh key. Path to private key file.
If users use an ssh key to log in to the remote machine, **sshKeyPath** should be a valid path to an ssh key file.
*Note: if users set passwd and sshKeyPath simultaneously, NNI will try passwd first.*
passphrase
^^^^^^^^^^
Optional. String.
Used to protect the ssh key; it can be empty if the key has no passphrase.
gpuIndices
^^^^^^^^^^
Optional. String. Default: none.
Used to specify designated GPU devices for NNI, if it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified. Multiple GPU indices should be separated with comma (\ ``,``\ ), such as ``1`` or ``0,1,3``. By default, all GPUs available will be used.
maxTrialNumPerGpu
^^^^^^^^^^^^^^^^^
Optional. Integer. Default: 1.
Used to specify the maximum number of concurrent trials on a GPU device.
useActiveGpu
^^^^^^^^^^^^
Optional. Bool. Default: false.
Used to specify whether to use a GPU on which another process is running. By default, NNI will use a GPU only if there is no other active process on it. If **useActiveGpu** is set to true, NNI will use the GPU regardless of other processes. This field is not applicable to NNI on Windows.
pythonPath
^^^^^^^^^^
Optional. String.
Users can configure the python path environment on remote machine by setting **pythonPath**.
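Putting the fields above together, a ``machineList`` entry might look like the following (host and account values are illustrative):

.. code-block:: yaml

   machineList:
     - ip: 10.1.1.1
       port: 22
       username: alice
       sshKeyPath: ~/.ssh/id_rsa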
remoteConfig
^^^^^^^^^^^^
Optional in remote mode. Users can set per-machine information in the ``machineList`` field, and global configuration for remote mode in this field.
reuse
^^^^^
Optional. Bool. default: ``false``. It's an experimental feature.
If true, NNI will reuse remote jobs to run as many trials as possible. This can save the time of creating new jobs. Users need to make sure each trial can run independently in the same job; for example, avoid loading checkpoints from previous trials.
kubeflowConfig
^^^^^^^^^^^^^^
operator
^^^^^^^^
Required. String. Has to be ``tf-operator`` or ``pytorch-operator``.
Specifies the kubeflow operator to be used; NNI supports ``tf-operator`` in the current version.
storage
^^^^^^^
Optional. String. Default: ``nfs``.
Specifies the storage type of kubeflow, including ``nfs`` and ``azureStorage``.
nfs
^^^
Required if using nfs. Key-value pairs.
*
**server** is the host of nfs server.
*
**path** is the mounted path of nfs.
keyVault
^^^^^^^^
Required if using azure storage. Key-value pairs.
Set **keyVault** to store the private key of your azure storage account. Refer to `the doc <https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2>`__ .
*
**vaultName** is the value of ``--vault-name`` used in az command.
*
**name** is the value of ``--name`` used in az command.
azureStorage
^^^^^^^^^^^^
Required if using azure storage. Key-value pairs.
Set azure storage account to store code files.
*
**accountName** is the name of azure storage account.
*
**azureShare** is the share of the azure file storage.
uploadRetryCount
^^^^^^^^^^^^^^^^
Required if using azure storage. Integer between 1 and 99999.
If uploading files to azure storage fails, NNI will retry; this field specifies the number of attempts to re-upload files.
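For instance, a minimal ``kubeflowConfig`` using NFS storage might look like the following (server and path are illustrative):

.. code-block:: yaml

   kubeflowConfig:
     operator: tf-operator
     storage: nfs
     nfs:
       server: 10.10.10.10
       path: /var/nfs/general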
paiConfig
^^^^^^^^^
userName
^^^^^^^^
Required. String.
The user name of your pai account.
password
^^^^^^^^
Required if using password authentication. String.
The password of the pai account.
token
^^^^^
Required if using token authentication. String.
Personal access token that can be retrieved from PAI portal.
host
^^^^
Required. String.
The hostname or IP address of PAI.
reuse
^^^^^
Optional. Bool. default: ``false``. It's an experimental feature.
If true, NNI will reuse OpenPAI jobs to run as many trials as possible. This can save the time of creating new jobs. Users need to make sure each trial can run independently in the same job; for example, avoid loading checkpoints from previous trials.
sharedStorage
^^^^^^^^^^^^^
storageType
^^^^^^^^^^^
Required. String.
The type of the storage; ``NFS`` and ``AzureBlob`` are supported.
localMountPoint
^^^^^^^^^^^^^^^
Required. String.
The absolute or relative path where the storage has been or will be mounted locally. If the path does not exist, it will be created automatically. An absolute path is recommended, e.g. ``/tmp/nni-shared-storage``.
remoteMountPoint
^^^^^^^^^^^^^^^^
Required. String.
The absolute or relative path where the storage will be mounted on the remote side. If the path does not exist, it will be created automatically. Note that the directory must be empty if using AzureBlob. A relative path is recommended, e.g. ``./nni-shared-storage``.
localMounted
^^^^^^^^^^^^
Required. String.
One of ``usermount``, ``nnimount`` or ``nomount``. ``usermount`` means you have already mounted this storage at the localMountPoint. ``nnimount`` means NNI will try to mount this storage at the localMountPoint. ``nomount`` means the storage will not be mounted on the local machine; support for partial storage will be added in the future.
nfsServer
^^^^^^^^^
Optional. String.
Required if using NFS storage. The NFS server host.
exportedDirectory
^^^^^^^^^^^^^^^^^
Optional. String.
Required if using NFS storage. The exported directory of NFS server.
storageAccountName
^^^^^^^^^^^^^^^^^^
Optional. String.
Required if using AzureBlob storage. The azure storage account name.
storageAccountKey
^^^^^^^^^^^^^^^^^
Optional. String.
Required if using AzureBlob storage. The azure storage account key.
containerName
^^^^^^^^^^^^^
Optional. String.
Required if using AzureBlob storage. The AzureBlob container name.
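Putting these fields together, a ``sharedStorage`` section might look like one of the following sketches (the hosts, paths, account names, and keys are placeholders):

.. code-block:: yaml

   # NFS
   sharedStorage:
     storageType: NFS
     localMountPoint: /tmp/nni-shared-storage
     remoteMountPoint: ./nni-shared-storage
     localMounted: nnimount
     nfsServer: 10.20.30.40             # placeholder host
     exportedDirectory: /exported/path  # placeholder path

   # AzureBlob
   sharedStorage:
     storageType: AzureBlob
     localMountPoint: /tmp/nni-shared-storage
     remoteMountPoint: ./nni-shared-storage
     localMounted: nnimount
     storageAccountName: your_account   # placeholder
     storageAccountKey: your_key        # placeholder
     containerName: nni                 # placeholder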
Examples
--------
Local mode
^^^^^^^^^^
If users want to run trial jobs on the local machine and use annotation to generate the search space, they could use the following config:

.. code-block:: yaml

   authorName: test
   experimentName: test_experiment
   trialConcurrency: 3
   maxExecDuration: 1h
   maxTrialNum: 10
   #choice: local, remote, pai, kubeflow
   trainingServicePlatform: local
   #choice: true, false
   useAnnotation: true
   tuner:
     #choice: TPE, Random, Anneal, Evolution
     builtinTunerName: TPE
     classArgs:
       #choice: maximize, minimize
       optimize_mode: maximize
   trial:
     command: python3 mnist.py
     codeDir: /nni/mnist
     gpuNum: 0
You can also add an assessor configuration:

.. code-block:: yaml

   authorName: test
   experimentName: test_experiment
   trialConcurrency: 3
   maxExecDuration: 1h
   maxTrialNum: 10
   #choice: local, remote, pai, kubeflow
   trainingServicePlatform: local
   searchSpacePath: /nni/search_space.json
   #choice: true, false
   useAnnotation: false
   tuner:
     #choice: TPE, Random, Anneal, Evolution
     builtinTunerName: TPE
     classArgs:
       #choice: maximize, minimize
       optimize_mode: maximize
   assessor:
     #choice: Medianstop
     builtinAssessorName: Medianstop
     classArgs:
       #choice: maximize, minimize
       optimize_mode: maximize
   trial:
     command: python3 mnist.py
     codeDir: /nni/mnist
     gpuNum: 0
Or you could specify your own tuner and assessor files as follows:

.. code-block:: yaml

   authorName: test
   experimentName: test_experiment
   trialConcurrency: 3
   maxExecDuration: 1h
   maxTrialNum: 10
   #choice: local, remote, pai, kubeflow
   trainingServicePlatform: local
   searchSpacePath: /nni/search_space.json
   #choice: true, false
   useAnnotation: false
   tuner:
     codeDir: /nni/tuner
     classFileName: mytuner.py
     className: MyTuner
     classArgs:
       #choice: maximize, minimize
       optimize_mode: maximize
   assessor:
     codeDir: /nni/assessor
     classFileName: myassessor.py
     className: MyAssessor
     classArgs:
       #choice: maximize, minimize
       optimize_mode: maximize
   trial:
     command: python3 mnist.py
     codeDir: /nni/mnist
     gpuNum: 0
Remote mode
^^^^^^^^^^^
To run trial jobs on remote machines, users could specify the remote machine information in the following format:

.. code-block:: yaml

   authorName: test
   experimentName: test_experiment
   trialConcurrency: 3
   maxExecDuration: 1h
   maxTrialNum: 10
   #choice: local, remote, pai, kubeflow
   trainingServicePlatform: remote
   searchSpacePath: /nni/search_space.json
   #choice: true, false
   useAnnotation: false
   tuner:
     #choice: TPE, Random, Anneal, Evolution
     builtinTunerName: TPE
     classArgs:
       #choice: maximize, minimize
       optimize_mode: maximize
   trial:
     command: python3 mnist.py
     codeDir: /nni/mnist
     gpuNum: 0
   #machineList can be empty if the platform is local
   machineList:
     - ip: 10.10.10.10
       port: 22
       username: test
       passwd: test
     - ip: 10.10.10.11
       port: 22
       username: test
       passwd: test
     - ip: 10.10.10.12
       port: 22
       username: test
       sshKeyPath: /nni/sshkey
       passphrase: qwert
       # Below is an example of specifying python environment.
       pythonPath: ${replace_to_python_environment_path_in_your_remote_machine}
NNI is a toolkit to help users run automated machine learning experiments. It can automatically do the cyclic process of getting hyperparameters, running trials, testing results, and tuning hyperparameters. Here, we'll show how to use NNI to help you find the optimal hyperparameters on the MNIST dataset.
Here is an example script to train a CNN on the MNIST dataset **without NNI** (see :githublink:`mnist.py <examples/trials/mnist-pytorch/mnist.py>` for the full script).
The above code can only try one set of parameters at a time. If you want to tune the learning rate, you need to manually modify the hyperparameter and start the trial again and again.
NNI is designed to help users tune jobs; its working process is presented below:

.. code-block:: text

   input: search space, trial code, config file
   output: one optimal hyperparameter configuration

   1: For t = 0, 1, 2, ..., maxTrialNum,
   2:     hyperparameter = choose a set of parameters from search space
   3:     final result = run_trial_and_evaluate(hyperparameter)
   4:     report final result to NNI
   5:     If the time limit is reached,
   6:         Stop the experiment
   7: return hyperparameter value with best final result
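The loop above can be sketched in plain Python. This is only an illustration of the process: the objective function and search space here are made up, and a random choice stands in for NNI's tuner, which would pick the next hyperparameters more intelligently.

```python
import random

def run_trial_and_evaluate(hyperparameter):
    # Stand-in objective: pretend accuracy peaks when lr is near 0.1.
    return 1.0 - abs(hyperparameter["lr"] - 0.1)

def naive_search(search_space, max_trial_num=10, seed=0):
    rng = random.Random(seed)
    best_hp, best_result = None, float("-inf")
    for t in range(max_trial_num):
        # step 2: choose a set of parameters from the search space
        hp = {"lr": rng.uniform(*search_space["lr"])}
        # step 3: run the trial and evaluate it
        result = run_trial_and_evaluate(hp)
        # step 4: report the final result
        if result > best_result:
            best_hp, best_result = hp, result
    # step 7: return the hyperparameter value with the best final result
    return best_hp, best_result

best_hp, best_result = naive_search({"lr": (0.0001, 0.5)})
```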
.. note::
If you want to use NNI to automatically train your model and find the optimal hyper-parameters, there are two approaches:
1. Write a config file and start the experiment from the command line.
2. Config and launch the experiment directly from a Python file
In this part, we will focus on the first approach. For the second approach, please refer to `this tutorial <HowToLaunchFromPython.rst>`__.
Step 1: Modify the ``Trial`` Code
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Modify your ``Trial`` file to get the hyperparameter set from NNI and report the final results to NNI.
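As a sketch, the integration usually looks like the following. The ``train`` function here is a fabricated stand-in for your real training loop; ``nni.get_next_parameter``, ``nni.report_intermediate_result``, and ``nni.report_final_result`` are the actual NNI APIs.

```python
def train(params, report=None):
    """Stand-in training loop: the 'accuracy' here is fabricated for illustration."""
    accuracy = 0.0
    for epoch in range(3):
        accuracy += 0.1 * params["lr"] * params["batch_size"]
        if report is not None:
            report(accuracy)  # send per-epoch accuracy to the assessor
    return accuracy

if __name__ == "__main__":
    try:
        import nni
    except ImportError:
        nni = None  # running standalone, outside an NNI experiment
    if nni:
        # Receive one hyperparameter set chosen by the tuner.
        params = nni.get_next_parameter()
        accuracy = train(params, report=nni.report_intermediate_result)
        # Send the final accuracy back to the tuner.
        nni.report_final_result(accuracy)
    else:
        print(train({"lr": 0.01, "batch_size": 32}))
```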
Step 2: Define the Search Space
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Define a ``Search Space`` in a YAML file, including the ``name`` and the ``distribution`` (discrete-valued or continuous-valued) of all the hyperparameters you want to search.
You can also write your search space in a JSON file and specify the file path in the configuration. For detailed tutorial on how to write the search space, please see `here <SearchSpaceSpec.rst>`__.
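For instance, a ``search_space.json`` for the MNIST example might look like this sketch (the parameter names here are illustrative; ``choice`` and ``loguniform`` are NNI's built-in distribution types):

.. code-block:: json

   {
     "batch_size": {"_type": "choice", "_value": [16, 32, 64, 128]},
     "lr": {"_type": "loguniform", "_value": [0.0001, 0.1]}
   }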
Step 3: Config the Experiment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In addition to the search space defined in `step 2 <step-2-define-the-search-space>`__, you need to configure the experiment in the YAML file. It specifies the key information of the experiment, such as the trial files, the tuning algorithm, the max trial number, and the max duration.

.. code-block:: yaml

   experimentName: MNIST           # An optional name to distinguish the experiments
   trialCommand: python3 mnist.py  # NOTE: change "python3" to "python" if you are using Windows
   trialConcurrency: 2             # Run 2 trials concurrently
   maxTrialNumber: 10              # Generate at most 10 trials
   maxExperimentDuration: 1h       # Stop generating trials after 1 hour
   tuner:                          # Configure the tuning algorithm
     name: TPE
     classArgs:                    # Algorithm specific arguments
       optimize_mode: maximize
   trainingService:                # Configure the training platform
     platform: local
The experiment config reference can be found `here <../reference/experiment_config.rst>`__.
.. _nniignore:
.. Note:: If you are planning to use remote machines or clusters as your :doc:`training service <../TrainingService/Overview>`, to avoid too much pressure on network, NNI limits the number of files to 2000 and total size to 300MB. If your codeDir contains too many files, you can choose which files and subfolders should be excluded by adding a ``.nniignore`` file that works like a ``.gitignore`` file. For more details on how to write this file, see the `git documentation <https://git-scm.com/docs/gitignore#_pattern_format>`__.
*Example:* :githublink:`config_detailed.yml <examples/trials/mnist-pytorch/config_detailed.yml>` and :githublink:`.nniignore <examples/trials/mnist-pytorch/.nniignore>`
All the code above is already prepared and stored in :githublink:`examples/trials/mnist-pytorch <examples/trials/mnist-pytorch>`.
Step 4: Launch the Experiment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Linux and macOS
***************
Run the experiment from your command line with the **config_detailed.yml** file.
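Assuming the config file lives in the example directory shown above, the launch command looks like:

.. code-block:: bash

   nnictl create --config examples/trials/mnist-pytorch/config_detailed.yml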
Windows
*******

Change ``python3`` to ``python`` in the ``trialCommand`` field of the **config_detailed.yml** file, then run the file from your command line to start the experiment.
.. Note:: ``nnictl`` is a command line tool that can be used to control experiments, such as start/stop/resume an experiment, start/stop NNIBoard, etc. Click :doc:`here <../reference/nnictl>` for more usage of ``nnictl``.
Wait for the message ``INFO: Successfully started experiment!`` in the command line. This message indicates that your experiment has been started successfully.
If you prepared ``trial``\ , ``search space``\ , and ``config`` according to the above steps and successfully created an NNI job, NNI will automatically tune the optimal hyper-parameters and run different hyper-parameter sets for each trial according to the defined search space. You can see its progress through the WebUI clearly.
Step 5: View the Experiment
^^^^^^^^^^^^^^^^^^^^^^^^^^^
After starting the experiment successfully, you can find a message in the command-line interface that tells you the ``Web UI url`` like this:

.. code-block:: text

   The Web UI urls are: [Your IP]:8080
Open the ``Web UI url`` (here it's ``[Your IP]:8080``) in your browser; you can view detailed information about the experiment and all the submitted trial jobs as shown below. If you cannot open the WebUI link in your terminal, please refer to the `FAQ <FAQ.rst#could-not-open-webui-link>`__.
In order to save on computing resources, NNI supports an early stopping policy and has an interface called **Assessor** to do this job.
The Assessor receives the intermediate results from a trial and decides, using a specific algorithm, whether the trial should be killed. Once a trial meets the early stopping condition (which means the Assessor is pessimistic about its final result), the Assessor kills the trial and the trial's status becomes ``EARLY_STOPPED``.
Here is an experimental result of MNIST after using the 'Curvefitting' Assessor in 'maximize' mode. You can see that Assessor successfully **early stopped** many trials with bad hyperparameters in advance. If you use Assessor, you may get better hyperparameters using the same computing resources.
NNI provides an easy way to adopt hyperparameter tuning algorithms, which we call **Tuners**.
A Tuner receives metrics from a ``Trial`` to evaluate the performance of a specific hyperparameter/architecture configuration, and sends the next configuration to the Trial.
The following table briefly describes the built-in tuners provided by NNI. Click the **Tuner's name** to get the Tuner's installation requirements, suggested scenario, and an example configuration. A link for a detailed description of each algorithm is located at the end of the suggested scenario for each tuner. Here is an `article <../CommunitySharings/HpoComparison.rst>`__ comparing different Tuners on several problems.
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Tuner
     - Brief Introduction of Algorithm
   * - `TPE <./TpeTuner.rst>`__
     - The Tree-structured Parzen Estimator (TPE) is a sequential model-based optimization (SMBO) approach. SMBO methods sequentially construct models to approximate the performance of hyperparameters based on historical measurements, and then subsequently choose new hyperparameters to test based on this model. `Reference Paper <https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf>`__
   * - `Random Search <./RandomTuner.rst>`__
     - *Random Search for Hyper-Parameter Optimization* shows that Random Search might be surprisingly simple and effective. We suggest using Random Search as the baseline when there is no knowledge about the prior distribution of hyper-parameters. `Reference Paper <http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf>`__
   * - `Anneal <./AnnealTuner.rst>`__
     - This simple annealing algorithm begins by sampling from the prior, but tends over time to sample from points closer and closer to the best ones observed. This algorithm is a simple variation on random search that leverages smoothness in the response surface. The annealing rate is not adaptive.
   * - `Naïve Evolution <./EvolutionTuner.rst>`__
     - Naïve Evolution comes from *Large-Scale Evolution of Image Classifiers*. It randomly initializes a population based on the search space. For each generation, it chooses the better candidates and applies some mutations (e.g., changing a hyperparameter, adding/removing one layer) to them to get the next generation. Naïve Evolution requires many trials to work, but it is very simple and easy to extend with new features. `Reference paper <https://arxiv.org/pdf/1703.01041.pdf>`__
   * - `SMAC <./SmacTuner.rst>`__
     - SMAC is based on Sequential Model-Based Optimization (SMBO). It adapts the most prominent previously used model class (Gaussian stochastic process models) and introduces the model class of random forests to SMBO in order to handle categorical parameters. The SMAC supported by NNI is a wrapper on the SMAC3 GitHub repo. Note that SMAC needs to be installed via the ``pip install nni[SMAC]`` command. `Reference Paper <https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf>`__, `GitHub Repo <https://github.com/automl/SMAC3>`__
   * - `Batch tuner <./BatchTuner.rst>`__
     - Batch tuner allows users to simply provide several configurations (i.e., choices of hyper-parameters) for their trial code. After finishing all the configurations, the experiment is done. Batch tuner only supports the ``choice`` type in the search space spec.
   * - `Grid Search <./GridsearchTuner.rst>`__
     - Grid Search performs an exhaustive search through the search space.
   * - `Hyperband <./HyperbandAdvisor.rst>`__
     - Hyperband tries to use limited resources to explore as many configurations as possible and returns the most promising ones as the final result. The basic idea is to generate many configurations and run them for a small number of trials. The least-promising half of the configurations are thrown out, and the remaining ones are further trained along with a selection of new configurations. The size of these populations is sensitive to resource constraints (e.g. allotted search time). `Reference Paper <https://arxiv.org/pdf/1603.06560.pdf>`__
   * - `Metis Tuner <./MetisTuner.rst>`__
     - Metis offers the following benefits when it comes to tuning parameters: while most tools only predict the optimal configuration, Metis gives you two outputs: (a) the current prediction of the optimal configuration, and (b) a suggestion for the next trial. No more guesswork. While most tools assume training datasets do not have noisy data, Metis actually tells you if you need to re-sample a particular hyper-parameter. `Reference Paper <https://www.microsoft.com/en-us/research/publication/metis-robustly-tuning-tail-latencies-cloud-systems/>`__
   * - `BOHB <./BohbAdvisor.rst>`__
     - BOHB is a follow-up work to Hyperband. It targets the weakness of Hyperband that new configurations are generated randomly without leveraging finished trials. In the name BOHB, HB means Hyperband and BO means Bayesian Optimization. BOHB leverages finished trials by building multiple TPE models; a proportion of new configurations are generated through these models. `Reference Paper <https://arxiv.org/abs/1807.01774>`__
   * - `GP Tuner <./GPTuner.rst>`__
     - Gaussian Process Tuner is a sequential model-based optimization (SMBO) approach with Gaussian Process as the surrogate. `Reference Paper <https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf>`__, `Github Repo <https://github.com/fmfn/BayesianOptimization>`__
   * - `PBT Tuner <./PBTTuner.rst>`__
     - PBT Tuner is a simple asynchronous optimization algorithm which effectively utilizes a fixed computational budget to jointly optimize a population of models and their hyperparameters to maximize performance. `Reference Paper <https://arxiv.org/abs/1711.09846v1>`__
   * - `DNGO Tuner <./DngoTuner.rst>`__
     - Uses neural networks as an alternative to GPs to model distributions over functions in Bayesian optimization.