Unverified Commit 843b642f authored by Yuge Zhang's avatar Yuge Zhang Committed by GitHub

Docs improvement: configurations and more (#1823)

* docs improvement

* docs improvement

* docs improvement

* docs improvement

* docs improvement

* docs improvement

* docs improvement

* docs improvement

* docs improvement

* update

* update
parent 2c17da7d
## Run an experiment
Install NNI on another machine that has network access to the three machines above, or simply run `nnictl` on any one of the three to launch the experiment.

We use `examples/trials/mnist-annotation` as an example here. Shown here is `examples/trials/mnist-annotation/config_remote.yml`:
```yaml
authorName: default
# ... (intermediate fields omitted) ...
machineList:
  - username: bob
    passwd: bob123
```
Files in `codeDir` will be automatically uploaded to the remote machines. You can run NNI on different operating systems (Windows, Linux, macOS) to spawn experiments on the remote machines (which themselves must run Linux):

```bash
nnictl create --config examples/trials/mnist-annotation/config_remote.yml
```
You can also use public/private key pairs instead of username/password for authentication. For advanced usages, please refer to [Experiment Config Reference](../Tutorial/ExperimentConfig.md).
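For example, a `machineList` entry that authenticates with an ssh key pair might look like this (the IP, key path and passphrase below are placeholders):

```yaml
machineList:
  - ip: 10.1.1.1              # placeholder: address of the remote machine
    port: 22
    username: bob
    sshKeyPath: ~/.ssh/id_rsa # path to the private key file
    passphrase: mypassphrase  # placeholder: omit if the key has no passphrase
```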
## Version check

NNI supports the version check feature since version 0.6; see [PAI mode](PaiMode.md) for reference.
# Experiment Config Reference
A config file is needed when creating an experiment. The path of the config file is provided to `nnictl`.
The config file is in YAML format.
This document describes the rules to write the config file, and provides some examples and templates.
- [Experiment Config Reference](#experiment-config-reference)
* [Template](#template)
* [Configuration Spec](#configuration-spec)
+ [authorName](#authorname)
+ [experimentName](#experimentname)
+ [trialConcurrency](#trialconcurrency)
+ [maxExecDuration](#maxexecduration)
+ [versionCheck](#versioncheck)
+ [debug](#debug)
+ [maxTrialNum](#maxtrialnum)
+ [trainingServicePlatform](#trainingserviceplatform)
+ [searchSpacePath](#searchspacepath)
+ [useAnnotation](#useannotation)
+ [multiPhase](#multiphase)
+ [multiThread](#multithread)
+ [nniManagerIp](#nnimanagerip)
+ [logDir](#logdir)
+ [logLevel](#loglevel)
+ [logCollection](#logcollection)
+ [tuner](#tuner)
- [builtinTunerName](#builtintunername)
- [codeDir](#codedir)
- [classFileName](#classfilename)
- [className](#classname)
- [classArgs](#classargs)
- [gpuIndices](#gpuindices)
- [includeIntermediateResults](#includeintermediateresults)
+ [assessor](#assessor)
- [builtinAssessorName](#builtinassessorname)
- [codeDir](#codedir-1)
- [classFileName](#classfilename-1)
- [className](#classname-1)
- [classArgs](#classargs-1)
+ [advisor](#advisor)
- [builtinAdvisorName](#builtinadvisorname)
- [codeDir](#codedir-2)
- [classFileName](#classfilename-2)
- [className](#classname-2)
- [classArgs](#classargs-2)
- [gpuIndices](#gpuindices-1)
+ [trial](#trial)
+ [localConfig](#localconfig)
- [gpuIndices](#gpuindices-2)
- [maxTrialNumPerGpu](#maxtrialnumpergpu)
- [useActiveGpu](#useactivegpu)
+ [machineList](#machinelist)
- [ip](#ip)
- [port](#port)
- [username](#username)
- [passwd](#passwd)
- [sshKeyPath](#sshkeypath)
- [passphrase](#passphrase)
- [gpuIndices](#gpuindices-3)
- [maxTrialNumPerGpu](#maxtrialnumpergpu-1)
- [useActiveGpu](#useactivegpu-1)
+ [kubeflowConfig](#kubeflowconfig)
- [operator](#operator)
- [storage](#storage)
- [nfs](#nfs)
- [keyVault](#keyvault)
- [azureStorage](#azurestorage)
- [uploadRetryCount](#uploadretrycount)
+ [paiConfig](#paiconfig)
- [userName](#username)
- [password](#password)
- [token](#token)
- [host](#host)
* [Examples](#examples)
+ [Local mode](#local-mode)
+ [Remote mode](#remote-mode)
+ [PAI mode](#pai-mode)
+ [Kubeflow mode](#kubeflow-mode)
+ [Kubeflow with azure storage](#kubeflow-with-azure-storage)
## Template
* __Light weight (without Annotation and Assessor)__
```yaml
authorName:
# ... (intermediate fields omitted) ...
machineList:
  - passwd:
```
## Configuration Spec
### authorName

Required. String.

The name of the author who creates the experiment.

*TBD: add default value.*

### experimentName

Required. String.

The name of the experiment created.

*TBD: add default value.*

### trialConcurrency

Required. Integer between 1 and 99999.

Specifies the maximum number of trial jobs that run simultaneously.

Note: if trialGpuNum is bigger than the number of free GPUs, and the number of trials running simultaneously therefore cannot reach __trialConcurrency__, some trial jobs will be put into a queue to wait for GPU allocation.
### maxExecDuration

Optional. String. Default: 999d.

__maxExecDuration__ specifies the maximum duration of an experiment. The unit of the time is {__s__, __m__, __h__, __d__}, which means {_seconds_, _minutes_, _hours_, _days_}.

Note: maxExecDuration limits the duration of an experiment, not of a trial job. When the experiment reaches the maximum duration, it will not stop, but it can no longer submit new trial jobs.

### versionCheck

Optional. Bool. Default: true.

NNI will check the version of the nniManager process against the version of trialKeeper on the remote, pai and kubernetes platforms. To disable version check, set versionCheck to false.

### debug

Optional. Bool. Default: false.

Debug mode will set versionCheck to false and set logLevel to 'debug'.

### maxTrialNum

Optional. Integer between 1 and 99999. Default: 99999.

Specifies the maximum number of trial jobs created by NNI, including succeeded and failed jobs.
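Taken together, the fields above can be sketched in a config file like this (all values are illustrative):

```yaml
authorName: bob
experimentName: example_mnist
trialConcurrency: 4      # run at most 4 trials at the same time
maxExecDuration: 2h      # stop submitting new trials after 2 hours
maxTrialNum: 100         # create at most 100 trials in total
versionCheck: true
debug: false
```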
### trainingServicePlatform

Required. String.

Specifies the platform to run the experiment: __local__, __remote__, __pai__, __kubeflow__ or __frameworkcontroller__.

* __local__ runs an experiment on the local ubuntu machine.
* __remote__ submits trial jobs to remote ubuntu machines; the __machineList__ field must be filled in order to set up the SSH connection to the remote machines.
* __pai__ submits trial jobs to [OpenPAI](https://github.com/Microsoft/pai) of Microsoft. For more details of pai configuration, please refer to [Guide to PAI Mode](../TrainingService/PaiMode.md).
* __kubeflow__ submits trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/); NNI supports kubeflow based on normal kubernetes and [azure kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/). For details please refer to [Kubeflow Docs](../TrainingService/KubeflowMode.md).
* TODO: explain frameworkcontroller.
### searchSpacePath

Optional. Path to an existing file.

Specifies the path of the search space file, which should be a valid path on the local linux machine.

The only case in which __searchSpacePath__ may be omitted is when `useAnnotation=True`.

### useAnnotation

Optional. Bool. Default: false.

Use annotation to analyze trial code and generate the search space.

Note: if __useAnnotation__ is true, the searchSpacePath field should be removed.
### multiPhase

Optional. Bool. Default: false.

Enable [multi-phase experiment](../AdvancedFeature/MultiPhase.md).

### multiThread

Optional. Bool. Default: false.

Enable multi-thread mode for the dispatcher. If multiThread is enabled, the dispatcher will start a thread to process each command from NNI Manager.
### nniManagerIp

Optional. String. Default: eth0 device IP.

Sets the IP address of the machine on which the NNI manager process runs. This field is optional; if it is not set, the eth0 device IP will be used instead.

Note: run `ifconfig` on the NNI manager's machine to check whether the eth0 device exists. If not, setting __nniManagerIp__ explicitly is recommended.

### logDir

Optional. Path to a directory. Default: `<user home directory>/nni/experiment`.

Configures the directory to store logs and data of the experiment.

### logLevel

Optional. String. Default: `info`.

Sets the log level for the experiment. Available log levels are: `trace`, `debug`, `info`, `warning`, `error`, `fatal`.
### logCollection

Optional. `http` or `none`. Default: `none`.

Sets the way to collect logs on the remote, pai, kubeflow and frameworkcontroller platforms. There are two ways to collect logs. With `http`, the trial keeper posts log content back through http requests, but this may slow down log processing in the trial keeper. With `none`, the trial keeper does not post log content back and only posts job metrics. If your log content is too big, consider setting this parameter to `none`.

### tuner

Required.

Specifies the tuner algorithm of the experiment. There are two ways to set the tuner. One is to use a tuner provided by the NNI sdk (built-in tuners), in which case you need to set __builtinTunerName__ and __classArgs__. The other is to use your own tuner file, in which case __codeDir__, __classFileName__, __className__ and __classArgs__ are needed. *Users must choose exactly one way.*
#### builtinTunerName

Required if using built-in tuners. String.

Specifies the name of a built-in tuner. The NNI sdk provides different tuners introduced [here](../Tuner/BuiltinTuner.md).

#### codeDir

Required if using customized tuners. Path relative to the location of the config file.

Specifies the directory of the tuner code.

#### classFileName

Required if using customized tuners. File path relative to __codeDir__.

Specifies the name of the tuner file.

#### className

Required if using customized tuners. String.

Specifies the name of the tuner class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of the tuner algorithm. Please refer to [this file](../Tuner/BuiltinTuner.md) for the configurable arguments of each built-in tuner.

#### gpuIndices

Optional. String. Default: empty.

Specifies the GPUs that can be used by the tuner process. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by comma (`,`), for example `1` or `0,1,3`. If the field is not set, no GPU will be visible to the tuner (`CUDA_VISIBLE_DEVICES` is set to an empty string).

#### includeIntermediateResults

Optional. Bool. Default: false.

If __includeIntermediateResults__ is true, the last intermediate result of a trial that is early stopped by the assessor is sent to the tuner as the final result.
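As a sketch, a built-in tuner can be configured as follows (TPE is one of the built-in tuners; the values are illustrative):

```yaml
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
  gpuIndices: '0'        # optional: make only GPU 0 visible to the tuner
```

A customized tuner would instead set `codeDir`, `classFileName`, `className` and `classArgs`; the two styles must not be mixed.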
### assessor

Specifies the assessor algorithm of the experiment. Similar to tuners, there are two ways to set the assessor. One is to use an assessor provided by the NNI sdk, in which case you need to set __builtinAssessorName__ and __classArgs__. The other is to use your own assessor file, in which case __codeDir__, __classFileName__, __className__ and __classArgs__ are needed. *Users must choose exactly one way.*

By default, no assessor is enabled.

#### builtinAssessorName

Required if using built-in assessors. String.

Specifies the name of a built-in assessor. The NNI sdk provides different assessors introduced [here](../Assessor/BuiltinAssessor.md).

#### codeDir

Required if using customized assessors. Path relative to the location of the config file.

Specifies the directory of the assessor code.

#### classFileName

Required if using customized assessors. File path relative to __codeDir__.

Specifies the name of the assessor file.

#### className

Required if using customized assessors. String.

Specifies the name of the assessor class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of the assessor algorithm.
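For example, a built-in assessor can be enabled like this (Medianstop is one of the built-in assessors; the values are illustrative):

```yaml
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
```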
### advisor

Optional.

Specifies the advisor algorithm of the experiment. Similar to tuners and assessors, there are two ways to specify the advisor. One is to use an advisor provided by the NNI sdk, in which case you need to set __builtinAdvisorName__ and __classArgs__. The other is to use your own advisor file, in which case __codeDir__, __classFileName__, __className__ and __classArgs__ are needed.

When advisor is enabled, settings of tuners and assessors will be bypassed.

#### builtinAdvisorName

Specifies the name of a built-in advisor. The NNI sdk provides [BOHB](../Tuner/BohbAdvisor.md) and [Hyperband](../Tuner/HyperbandAdvisor.md).

#### codeDir

Required if using customized advisors. Path relative to the location of the config file.

Specifies the directory of the advisor code.

#### classFileName

Required if using customized advisors. File path relative to __codeDir__.

Specifies the name of the advisor file.

#### className

Required if using customized advisors. String.

Specifies the name of the advisor class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of the advisor.

#### gpuIndices

Optional. String. Default: empty.

Specifies the GPUs that can be used by the advisor process. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by comma (`,`), for example `1` or `0,1,3`. If the field is not set, no GPU will be visible to the advisor (`CUDA_VISIBLE_DEVICES` is set to an empty string).
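As a sketch, enabling a built-in advisor looks like this (the values are illustrative; when an advisor section is present, tuner and assessor settings are bypassed):

```yaml
advisor:
  builtinAdvisorName: Hyperband
  classArgs:
    optimize_mode: maximize
```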
### trial

Required. Key-value pairs.

In local and remote mode, the following keys are required.

* __command__: Required string. Specifies the command to run the trial process.
* __codeDir__: Required string. Specifies the directory of your own trial file. This directory will be automatically uploaded in remote mode.
* __gpuNum__: Optional integer. Specifies the number of GPUs to run the trial process. Default value is 0.

In PAI mode, the following keys are required.

* __command__: Required string. Specifies the command to run the trial process.
* __codeDir__: Required string. Specifies the directory of your own trial file. Files in the directory will be uploaded in PAI mode.
* __gpuNum__: Required integer. Specifies the number of GPUs to run the trial process.
* __cpuNum__: Required integer. Specifies the number of CPUs to be used in the pai container.
* __memoryMB__: Required integer. Sets the memory size to be used in the pai container, in megabytes.
* __image__: Required string. Sets the image to be used in pai.
* __authFile__: Optional string. Used to provide a Docker registry which needs authentication for image pull in PAI. [Reference](https://github.com/microsoft/pai/blob/2ea69b45faa018662bc164ed7733f6fdbb4c42b3/docs/faq.md#q-how-to-use-private-docker-registry-job-image-when-submitting-an-openpai-job).
* __shmMB__: Optional integer. Shared memory size of the container.
* __portList__: List of key-value pairs with `label`, `beginAt`, `portNumber`. See the [job tutorial of PAI](https://github.com/microsoft/pai/blob/master/docs/job_tutorial.md) for details.

In Kubeflow mode, the following keys are required.

* __codeDir__: The local directory where the code files are.
* __ps__: An optional configuration for kubeflow's tensorflow-operator, which includes
    * __replicas__: The replica number of the __ps__ role.
    * __command__: The run script in __ps__'s container.
    * __gpuNum__: The number of GPUs to be used in the __ps__ container.
    * __cpuNum__: The number of CPUs to be used in the __ps__ container.
    * __memoryMB__: The memory size of the container.
    * __image__: The image to be used in __ps__.
* __worker__: An optional configuration for kubeflow's tensorflow-operator, which includes
    * __replicas__: The replica number of the __worker__ role.
    * __command__: The run script in __worker__'s container.
    * __gpuNum__: The number of GPUs to be used in the __worker__ container.
    * __cpuNum__: The number of CPUs to be used in the __worker__ container.
    * __memoryMB__: The memory size of the container.
    * __image__: The image to be used in __worker__.
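For instance, in local mode a minimal trial section could look like this (the command and path depend on your project):

```yaml
trial:
  command: python3 mnist.py                        # command that starts one trial
  codeDir: ~/nni/examples/trials/mnist-annotation  # directory containing the trial code
  gpuNum: 0                                        # optional in local/remote mode
```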
### localConfig

Optional in local mode. Key-value pairs.

Only applicable if __trainingServicePlatform__ is set to `local`; otherwise there should be no __localConfig__ section in the configuration file.

#### gpuIndices

Optional. String. Default: none.

Used to specify designated GPU devices for NNI. If it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified. Multiple GPU indices should be separated with comma (`,`), such as `1` or `0,1,3`. By default, all available GPUs will be used.

#### maxTrialNumPerGpu

Optional. Integer. Default: 99999.

Used to specify the maximum number of concurrent trials on a GPU device.

#### useActiveGpu

Optional. Bool. Default: false.

Used to specify whether to use a GPU on which there is another process. By default, NNI will use a GPU only if there is no other active process on it. If __useActiveGpu__ is set to true, NNI will use the GPU regardless of other processes. This field is not applicable for NNI on Windows.
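Combining these fields, a localConfig section might look like this (the values are illustrative):

```yaml
localConfig:
  gpuIndices: '0,1'      # restrict NNI trials to GPU 0 and 1
  maxTrialNumPerGpu: 2   # at most 2 concurrent trials per GPU
  useActiveGpu: false
```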
__storage__ specify the storage type of kubeflow, including {__nfs__, __azureStorage__}. This field is optional, and the default value is __nfs__. If the config use azureStorage, this field must be completed.
### machineList
* __nfs__
Required in remote mode. A list of key-value pairs with the following keys.
__server__ is the host of nfs server
#### ip
__path__ is the mounted path of nfs
Required. IP address that is accessible from the current machine.
* __keyVault__
The IP address of remote machine.
If users want to use azure kubernetes service, they should set keyVault to storage the private key of your azure storage account. Refer: https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2
#### port
* __vaultName__
Optional. Integer. Valid port. Default: 22.
__vaultName__ is the value of `--vault-name` used in az command.
The ssh port to be used to connect machine.
* __name__
#### username
__name__ is the value of `--name` used in az command.
Required if authentication with username/password. String.
* __azureStorage__
The account of remote machine.
If users use azure kubernetes service, they should set azure storage account to store code files.
#### passwd
* __accountName__
Required if authentication with username/password. String.
__accountName__ is the name of azure storage account.
Specifies the password of the account.
* __azureShare__
#### sshKeyPath
__azureShare__ is the share of the azure file storage.
Required if authentication with ssh key. Path to private key file.
* __uploadRetryCount__
If users use ssh key to login remote machine, __sshKeyPath__ should be a valid path to a ssh key file.
If upload files to azure storage failed, NNI will retry the process of uploading, this field will specify the number of attempts to re-upload files.
*Note: if users set passwd and sshKeyPath simultaneously, NNI will try passwd first.*
* __paiConfig__
#### passphrase
* __userName__
Optional. String.
__userName__ is the user name of your pai account.
Used to protect ssh key, which could be empty if users don't have passphrase.
* __password__
#### gpuIndices
__password__ is the password of the pai account.
Optional. String. Default: none.
* __host__
Used to specify designated GPU devices for NNI, if it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified. Multiple GPU indices should be separated with comma (`,`), such as `1` or `0,1,3`. By default, all GPUs available will be used.
__host__ is the host of pai.
#### maxTrialNumPerGpu
Optional. Integer. Default: 99999.
Used to specify the max concurrency trial number on a GPU device.
#### useActiveGpu
Optional. Bool. Default: false.
Used to specify whether to use a GPU if there is another process running on it. By default, NNI uses a GPU only if there is no other active process on it. If __useActiveGpu__ is set to true, NNI will use the GPU regardless of other processes. This field is not applicable for NNI on Windows.
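Putting the fields above together, a sketch of a `machineList` entry that authenticates with an ssh key might look like the following (the IP, username, key path, and passphrase are placeholder values):

```yaml
machineList:
  - ip: 10.1.1.1
    port: 22
    username: bob
    sshKeyPath: ~/.ssh/id_rsa
    passphrase: qwert
    gpuIndices: "0,1"
    maxTrialNumPerGpu: 1
    useActiveGpu: false
```

With this entry, NNI trials dispatched to `10.1.1.1` would only run on GPUs 0 and 1, at most one trial per GPU.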
### kubeflowConfig
#### operator
Required. String. Has to be `tf-operator` or `pytorch-operator`.
Specifies which Kubeflow operator to use.
#### storage
Optional. String. Default: `nfs`.
Specifies the storage type for Kubeflow: either `nfs` or `azureStorage`.
#### nfs
Required if using nfs. Key-value pairs.
* __server__ is the host of nfs server.
* __path__ is the mounted path of nfs.
#### keyVault
Required if using azure storage. Key-value pairs.
Set __keyVault__ to store the private key of your Azure storage account. Refer to https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2.
* __vaultName__ is the value of `--vault-name` used in az command.
* __name__ is the value of `--name` used in az command.
#### azureStorage
Required if using azure storage. Key-value pairs.
Set azure storage account to store code files.
* __accountName__ is the name of azure storage account.
* __azureShare__ is the share of the azure file storage.
#### uploadRetryCount
Required if using azure storage. Integer between 1 and 99999.
If uploading files to Azure storage fails, NNI will retry the upload; this field specifies the number of attempts.
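As a sketch, the Azure-related fields above combine as follows (the vault, secret, account, and share names are hypothetical placeholders):

```yaml
kubeflowConfig:
  operator: tf-operator
  storage: azureStorage
  keyVault:
    vaultName: your-vault-name      # value of --vault-name in the az command
    name: your-secret-name          # value of --name in the az command
  azureStorage:
    accountName: yourstorageaccount
    azureShare: your-file-share
  uploadRetryCount: 10
```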
### paiConfig
#### userName
Required. String.
The user name of your pai account.
#### password
Required if using password authentication. String.
The password of the pai account.
#### token
Required if using token authentication. String.
Personal access token that can be retrieved from PAI portal.
#### host
Required. String.
The hostname or IP address of PAI.
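For illustration, a minimal `paiConfig` using token authentication might look like this (the user name, token, and host are placeholder values):

```yaml
paiConfig:
  userName: your_pai_user
  token: your-personal-access-token   # retrieved from the PAI portal
  host: 10.10.10.10
```

With password authentication, `token` would be replaced by a `password` field.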
## Examples
### Local mode
If users want to run trial jobs on the local machine and use annotation to generate the search space, they could use the following config:
```yaml
authorName: test
# ...
gpuNum: 0
```
You can add assessor configuration.
```yaml
authorName: test
# ...
gpuNum: 0
```
Or you could specify your own tuner and assessor files as follows:
```yaml
authorName: test
# ...
gpuNum: 0
```
### Remote mode
To run trial jobs on remote machines, users could specify the remote machine information in the following format:
```yaml
authorName: test
# ...
passphrase: qwert
```
### PAI mode
```yaml
authorName: test
# ...
host: 10.10.10.10
```
### Kubeflow mode
Kubeflow with NFS storage:
```yaml
# ...
path: /var/nfs/general
```
### Kubeflow with azure storage
```yaml
authorName: default
# ...
```
### Could not open webUI link
Failure to open the WebUI may have the following causes:
* `http://127.0.0.1`, `http://172.17.0.1` and `http://10.0.0.15` refer to localhost. If you start your experiment on a server or remote machine, replace the IP with your server's IP to view the WebUI, like `http://[your_server_ip]:8080`
* If you still can't see the WebUI after you use the server IP, you can check the proxy and the firewall of your machine. Or use the browser on the machine where you start your NNI experiment.
* Another reason may be that your experiment failed, so NNI could not get the experiment information. You can check the NNIManager log in the following location: `~/nni/experiment/[your_experiment_id]/log/nnimanager.log`
### Restful server start failed
Probably it's a problem with your network config. Here is a checklist.
* You might need to link `127.0.0.1` with `localhost`. Add a line `127.0.0.1 localhost` to `/etc/hosts`.
* It's also possible that you have set some proxy config. Check your environment for variables like `HTTP_PROXY` or `HTTPS_PROXY` and unset if they are set.
### NNI on Windows problems
Please refer to [NNI on Windows](NniOnWindows.md).
# Installation of NNI
Currently we support installation on Linux, Mac and Windows.
## **Installation on Linux & Mac**
# NNI on Windows (experimental feature)
Running NNI on Windows is an experimental feature. Windows 10.1809 is well tested and recommended.
## **Installation on Windows**
### Not supported tuner on Windows
SMAC is not supported currently; see this [GitHub issue](https://github.com/automl/SMAC3/issues/483) for the specific reason.
### Use a Windows server as a remote worker
Currently you can't.
Note:
* If there is any error like `Segmentation fault`, please refer to [FAQ](FAQ.md).