Unverified Commit 843b642f authored by Yuge Zhang, committed by GitHub

Docs improvement: configurations and more (#1823)

* docs improvement

* docs improvement

* docs improvement

* docs improvement

* docs improvement

* docs improvement

* docs improvement

* docs improvement

* docs improvement

* update

* update
parent 2c17da7d
...@@ -16,9 +16,9 @@ Install NNI on each of your machines following the install guide [here](../Tutor

## Run an experiment

Install NNI on another machine which has network accessibility to those three machines above, or you can just run `nnictl` on any one of the three to launch the experiment.

We use `examples/trials/mnist-annotation` as an example here. Shown here is `examples/trials/mnist-annotation/config_remote.yml`:

```yaml
authorName: default
...@@ -57,24 +57,15 @@ machineList:
  username: bob
  passwd: bob123
```
Files in `codeDir` will be automatically uploaded to the remote machine. You can run NNI on different operating systems (Windows, Linux, MacOS) to spawn experiments on the remote machines (the remote machines themselves must be Linux):

```bash
nnictl create --config examples/trials/mnist-annotation/config_remote.yml
```

You can also use public/private key pairs instead of username/password for authentication. For advanced usages, please refer to [Experiment Config Reference](../Tutorial/ExperimentConfig.md).
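For example, a `machineList` entry using key-based authentication might look like the sketch below (the address, key path and passphrase are illustrative placeholders):

```yaml
machineList:
  - ip: 10.1.1.20
    port: 22
    username: bob
    sshKeyPath: ~/.ssh/id_rsa
    # passphrase: mypassphrase   # only needed if the private key is protected
```

If both `passwd` and `sshKeyPath` are given, NNI tries the password first.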
## Version check

NNI supports the version check feature since version 0.6, [reference](PaiMode.md).
# Experiment Config Reference

A config file is needed when creating an experiment. The path of the config file is provided to `nnictl`.
The config file is in YAML format.
This document describes the rules to write the config file, and provides some examples and templates.
- [Experiment Config Reference](#experiment-config-reference)
  * [Template](#template)
  * [Configuration Spec](#configuration-spec)
    + [authorName](#authorname)
    + [experimentName](#experimentname)
    + [trialConcurrency](#trialconcurrency)
    + [maxExecDuration](#maxexecduration)
    + [versionCheck](#versioncheck)
    + [debug](#debug)
    + [maxTrialNum](#maxtrialnum)
    + [trainingServicePlatform](#trainingserviceplatform)
    + [searchSpacePath](#searchspacepath)
    + [useAnnotation](#useannotation)
    + [multiPhase](#multiphase)
    + [multiThread](#multithread)
    + [nniManagerIp](#nnimanagerip)
    + [logDir](#logdir)
    + [logLevel](#loglevel)
    + [logCollection](#logcollection)
    + [tuner](#tuner)
      - [builtinTunerName](#builtintunername)
      - [codeDir](#codedir)
      - [classFileName](#classfilename)
      - [className](#classname)
      - [classArgs](#classargs)
      - [gpuIndices](#gpuindices)
      - [includeIntermediateResults](#includeintermediateresults)
    + [assessor](#assessor)
      - [builtinAssessorName](#builtinassessorname)
      - [codeDir](#codedir-1)
      - [classFileName](#classfilename-1)
      - [className](#classname-1)
      - [classArgs](#classargs-1)
    + [advisor](#advisor)
      - [builtinAdvisorName](#builtinadvisorname)
      - [codeDir](#codedir-2)
      - [classFileName](#classfilename-2)
      - [className](#classname-2)
      - [classArgs](#classargs-2)
      - [gpuIndices](#gpuindices-1)
    + [trial](#trial)
    + [localConfig](#localconfig)
      - [gpuIndices](#gpuindices-2)
      - [maxTrialNumPerGpu](#maxtrialnumpergpu)
      - [useActiveGpu](#useactivegpu)
    + [machineList](#machinelist)
      - [ip](#ip)
      - [port](#port)
      - [username](#username)
      - [passwd](#passwd)
      - [sshKeyPath](#sshkeypath)
      - [passphrase](#passphrase)
      - [gpuIndices](#gpuindices-3)
      - [maxTrialNumPerGpu](#maxtrialnumpergpu-1)
      - [useActiveGpu](#useactivegpu-1)
    + [kubeflowConfig](#kubeflowconfig)
      - [operator](#operator)
      - [storage](#storage)
      - [nfs](#nfs)
      - [keyVault](#keyvault)
      - [azureStorage](#azurestorage)
      - [uploadRetryCount](#uploadretrycount)
    + [paiConfig](#paiconfig)
      - [userName](#username-1)
      - [password](#password)
      - [token](#token)
      - [host](#host)
  * [Examples](#examples)
    + [Local mode](#local-mode)
    + [Remote mode](#remote-mode)
    + [PAI mode](#pai-mode)
    + [Kubeflow mode](#kubeflow-mode)
    + [Kubeflow with azure storage](#kubeflow-with-azure-storage)
## Template

* __Light weight (without Annotation and Assessor)__

```yaml
authorName:
...@@ -130,434 +199,481 @@ machineList:
  passwd:
```

## Configuration Spec
### authorName

Required. String.

The name of the author who created the experiment.

*TBD: add default value.*

### experimentName

Required. String.

The name of the experiment created.

*TBD: add default value.*

### trialConcurrency

Required. Integer between 1 and 99999.

Specifies the max number of trial jobs that run simultaneously.

Note: if trialGpuNum is bigger than the number of free GPUs, and the number of trial jobs running simultaneously cannot reach __trialConcurrency__, some trial jobs will be put into a queue to wait for GPU allocation.

### maxExecDuration

Optional. String. Default: 999d.

Specifies the max duration time of an experiment. The unit of the time is {__s__, __m__, __h__, __d__}, which means {_seconds_, _minutes_, _hours_, _days_}.

Note: The maxExecDuration spec sets the duration of an experiment, not of a trial job. If the experiment reaches the max duration time, it will not stop, but cannot submit new trial jobs any more.

### versionCheck

Optional. Bool. Default: false.

NNI will check the version of the nniManager process against the version of trialKeeper on remote, pai and kubernetes platforms. If you want to disable version check, you could set versionCheck to false.
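For instance, the two fields above might be set as follows (the values are illustrative):

```yaml
maxExecDuration: 10h   # stop submitting new trials after 10 hours
versionCheck: true     # verify that nniManager and trialKeeper versions match
```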
### debug
Optional. Bool. Default: false.
Debug mode will set versionCheck to false and set logLevel to 'debug'.
### maxTrialNum
Optional. Integer between 1 and 99999. Default: 99999.
Specifies the max number of trial jobs created by NNI, including succeeded and failed jobs.
### trainingServicePlatform
Required. String.
Specifies the platform to run the experiment, including __local__, __remote__, __pai__, __kubeflow__, __frameworkcontroller__.

* __local__ run an experiment on local ubuntu machine.

* __remote__ submit trial jobs to remote ubuntu machines, and __machineList__ field should be filled in order to set up SSH connection to remote machine.

* __pai__ submit trial jobs to [OpenPAI](https://github.com/Microsoft/pai) of Microsoft. For more details of pai configuration, please refer to [Guide to PAI Mode](../TrainingService/PaiMode.md)

* __kubeflow__ submit trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/), NNI supports kubeflow based on normal kubernetes and [azure kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/). For details please refer to [Kubeflow Docs](../TrainingService/KubeflowMode.md)

* TODO: explain frameworkcontroller.

### searchSpacePath

Optional. Path to existing file.

Specifies the path of the search space file, which should be a valid path on the local linux machine.

The only case where __searchSpacePath__ can be left unset is when `useAnnotation=True`.
### useAnnotation

Optional. Bool. Default: false.

Use annotation to analyze trial code and generate search space.

Note: if __useAnnotation__ is true, the searchSpacePath field should be removed.

### multiPhase

Optional. Bool. Default: false.

Enable [multi-phase experiment](../AdvancedFeature/MultiPhase.md).

### multiThread

Optional. Bool. Default: false.

Enable multi-thread mode for dispatcher. If multiThread is enabled, dispatcher will start a thread to process each command from NNI Manager.

### nniManagerIp

Optional. String. Default: eth0 device IP.

Set the IP address of the machine on which NNI manager process runs. If it's not set, eth0 device IP will be used instead.

Note: run `ifconfig` on NNI manager's machine to check if the eth0 device exists. If not, it is recommended to set __nniManagerIp__ explicitly.

### logDir

Optional. Path to a directory. Default: `<user home directory>/nni/experiment`.

Configures the directory to store logs and data of the experiment.

### logLevel

Optional. String. Default: `info`.

Sets log level for the experiment. Available log levels are: `trace`, `debug`, `info`, `warning`, `error`, `fatal`.

### logCollection
Optional. `http` or `none`. Default: `none`.

Set the way to collect logs on remote, pai, kubeflow and frameworkcontroller platforms. There are two ways to collect logs: with `http`, trial keeper will post log content back through http requests, but this may slow down the speed of processing logs in trialKeeper; with `none`, trial keeper will not post log content back, and will only post job metrics. If your log content is too big, you could consider setting this param to `none`.

### tuner

Required.

Specifies the tuner algorithm in the experiment. There are two kinds of ways to set the tuner. One way is to use a tuner provided by the NNI sdk (built-in tuners), in which case you need to set __builtinTunerName__ and __classArgs__. Another way is to use a user's own tuner file, in which case __codeDir__, __classFileName__, __className__ and __classArgs__ are needed. *Users must choose exactly one way.*
#### builtinTunerName

Required if using built-in tuners. String.

Specifies the name of a system tuner. The NNI sdk provides different tuners introduced [here](../Tuner/BuiltinTuner.md).

#### codeDir

Required if using customized tuners. Path relative to the location of config file.

Specifies the directory of tuner code.

#### classFileName

Required if using customized tuners. File path relative to __codeDir__.

Specifies the name of tuner file.

#### className

Required if using customized tuners. String.

Specifies the name of tuner class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of tuner algorithm. Please refer to [this file](../Tuner/BuiltinTuner.md) for the configurable arguments of each built-in tuner.

#### gpuIndices

Optional. String. Default: empty.

Specifies the GPUs that can be used by the tuner process. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by comma `,`. For example, `1`, or `0,1,3`. If the field is not set, no GPU will be visible to the tuner (by setting `CUDA_VISIBLE_DEVICES` to an empty string).

#### includeIntermediateResults

Optional. Bool. Default: false.

If __includeIntermediateResults__ is true, the last intermediate result of a trial that is early stopped by the assessor is sent to the tuner as the final result.

### assessor
Specifies the assessor algorithm to run an experiment. Similar to tuners, there are two kinds of ways to set the assessor. One way is to use an assessor provided by the NNI sdk, in which case users need to set __builtinAssessorName__ and __classArgs__. Another way is to use a user's own assessor file, in which case __codeDir__, __classFileName__, __className__ and __classArgs__ are needed. *Users must choose exactly one way.*

By default, there is no assessor enabled.

#### builtinAssessorName

Required if using built-in assessors. String.

Specifies the name of a built-in assessor. The NNI sdk provides different assessors introduced [here](../Assessor/BuiltinAssessor.md).

#### codeDir

Required if using customized assessors. Path relative to the location of config file.

Specifies the directory of assessor code.

#### classFileName

Required if using customized assessors. File path relative to __codeDir__.

Specifies the name of assessor file.

#### className

Required if using customized assessors. String.

Specifies the name of assessor class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of assessor algorithm.
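For example, a built-in assessor could be enabled like this (`Medianstop` and its argument come from the built-in assessor reference; treat the values as illustrative):

```yaml
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
```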
* __gpuNum__ ### advisor
__gpuNum__ specifies the num of gpu to run the trial process. Default value is 0. Optional.
* __trial(pai)__ Specifies the advisor algorithm in the experiment. Similar to tuners and assessors, there are two kinds of ways to specify advisor. One way is to use advisor provided by NNI sdk, need to set __builtinAdvisorName__ and __classArgs__. Another way is to use users' own advisor file, and need to set __codeDirectory__, __classFileName__, __className__ and __classArgs__.
* __command__ When advisor is enabled, settings of tuners and advisors will be bypassed.
__command__ specifies the command to run trial process. #### builtinAdvisorName
* __codeDir__ Specifies the name of a built-in advisor. NNI sdk provides [BOHB](../Tuner/BohbAdvisor.md) and [Hyperband](../Tuner/HyperbandAdvisor.md).
__codeDir__ specifies the directory of the own trial file. #### codeDir
* __gpuNum__ Required if using customized advisors. Path relative to the location of config file.
__gpuNum__ specifies the num of gpu to run the trial process. Default value is 0. Specifies the directory of advisor code.
* __cpuNum__ #### classFileName
__cpuNum__ is the cpu number of cpu to be used in pai container. Required if using customized advisors. File path relative to __codeDir__.
* __memoryMB__ Specifies the name of advisor file.
__memoryMB__ set the momory size to be used in pai's container. #### className
* __image__ Required if using customized advisors. String.
__image__ set the image to be used in pai. Specifies the name of advisor class.
* __trial(kubeflow)__ #### classArgs
* __codeDir__ Optional. Key-value pairs. Default: empty.
__codeDir__ is the local directory where the code files in. Specifies the arguments of advisor.
* __ps(optional)__ #### gpuIndices
__ps__ is the configuration for kubeflow's tensorflow-operator. Optional. String. Default: empty.
* __replicas__ Specifies the GPUs that can be used. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by comma `,`. For example, `1`, or `0,1,3`. If the field is not set, no GPU will be visible to tuner (by setting `CUDA_VISIBLE_DEVICES` to be an empty string).
### trial

Required. Key-value pairs.

In local and remote mode, the following keys are available.

* __command__: Required string. Specifies the command to run the trial process.
* __codeDir__: Required string. Specifies the directory of your own trial file. This directory will be automatically uploaded in remote mode.
* __gpuNum__: Optional integer. Specifies the number of GPUs to run the trial process. Default value is 0.

In PAI mode, the following keys are available.

* __command__: Required string. Specifies the command to run the trial process.
* __codeDir__: Required string. Specifies the directory of the own trial file. Files in the directory will be uploaded in PAI mode.
* __gpuNum__: Required integer. Specifies the number of GPUs to run the trial process.
* __cpuNum__: Required integer. Specifies the number of CPUs to be used in the pai container.
* __memoryMB__: Required integer. Sets the memory size to be used in the pai container, in megabytes.
* __image__: Required string. Sets the image to be used in pai.
* __authFile__: Optional string. Used to provide a Docker registry which needs authentication for image pull in PAI. [Reference](https://github.com/microsoft/pai/blob/2ea69b45faa018662bc164ed7733f6fdbb4c42b3/docs/faq.md#q-how-to-use-private-docker-registry-job-image-when-submitting-an-openpai-job).
* __shmMB__: Optional integer. Shared memory size of the container.
* __portList__: List of key-value pairs with `label`, `beginAt`, `portNumber`. See [job tutorial of PAI](https://github.com/microsoft/pai/blob/master/docs/job_tutorial.md) for details.

In Kubeflow mode, the following keys are available.

* __codeDir__: The local directory where the code files are.
* __ps__: An optional configuration for kubeflow's tensorflow-operator, which includes
    * __replicas__: The replica number of the __ps__ role.
    * __command__: The run script in __ps__'s container.
    * __gpuNum__: The number of GPUs to be used in the __ps__ container.
    * __cpuNum__: The number of CPUs to be used in the __ps__ container.
    * __memoryMB__: The memory size of the container.
    * __image__: The image to be used in __ps__.
* __worker__: An optional configuration for kubeflow's tensorflow-operator.
    * __replicas__: The replica number of the __worker__ role.
    * __command__: The run script in __worker__'s container.
    * __gpuNum__: The number of GPUs to be used in the __worker__ container.
    * __cpuNum__: The number of CPUs to be used in the __worker__ container.
    * __memoryMB__: The memory size of the container.
    * __image__: The image to be used in __worker__.
### localConfig
Optional in local mode. Key-value pairs.
Only applicable if __trainingServicePlatform__ is set to `local`, otherwise there should not be __localConfig__ section in configuration file.
#### gpuIndices
Optional. String. Default: none.
Used to specify designated GPU devices for NNI, if it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified. Multiple GPU indices should be separated with comma (`,`), such as `1` or `0,1,3`. By default, all GPUs available will be used.
#### maxTrialNumPerGpu
Optional. Integer. Default: 99999.
Used to specify the max concurrency trial number on a GPU device.

#### useActiveGpu

Optional. Bool. Default: false.

Used to specify whether to use a GPU if there is another process running on it. By default, NNI will use a GPU only if there is no other active process on it. If __useActiveGpu__ is set to true, NNI will use the GPU regardless of other processes. This field is not applicable for NNI on Windows.
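A sketch of a `localConfig` section restricting NNI to two GPUs (the values are illustrative):

```yaml
localConfig:
  gpuIndices: "0,1"        # only these GPU devices are used for trials
  maxTrialNumPerGpu: 2     # at most two concurrent trials per GPU
  useActiveGpu: false      # skip GPUs that already have active processes
```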
### machineList

Required in remote mode. A list of key-value pairs with the following keys.

#### ip

Required. An IP address that is accessible from the current machine.

The IP address of the remote machine.

#### port

Optional. Integer. Valid port. Default: 22.

The ssh port to be used to connect to the machine.

#### username

Required if authenticating with username/password. String.

The account of the remote machine.

#### passwd

Required if authenticating with username/password. String.

Specifies the password of the account.

#### sshKeyPath

Required if authenticating with an ssh key. Path to private key file.

If users use an ssh key to log in to the remote machine, __sshKeyPath__ should be a valid path to an ssh key file.

*Note: if users set passwd and sshKeyPath simultaneously, NNI will try passwd first.*
#### passphrase
Optional. String.
Used to protect the ssh key, which could be empty if users don't have a passphrase.
#### gpuIndices
Optional. String. Default: none.
Used to specify designated GPU devices for NNI on this remote machine. If it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified. Multiple GPU indices should be separated with a comma (`,`), such as `1` or `0,1,3`. By default, all GPUs available will be used.
#### maxTrialNumPerGpu
Optional. Integer. Default: 99999.
Used to specify the maximum number of concurrent trials on a GPU device.
#### useActiveGpu
Optional. Bool. Default: false.
Used to specify whether to use a GPU if there is another process. By default, NNI will use the GPU only if there is no other active process on the GPU. If __useActiveGpu__ is set to true, NNI will use the GPU regardless of other processes. This field is not applicable for NNI on Windows.
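Putting the `machineList` fields above together, one entry could look like the following sketch (the address, credentials and GPU settings are illustrative, not from a real deployment):

```yaml
machineList:
  - ip: 10.1.1.1
    port: 22
    username: bob
    sshKeyPath: ~/.ssh/id_rsa
    gpuIndices: "0,1"
    maxTrialNumPerGpu: 2
    useActiveGpu: false
```

Since `sshKeyPath` is set and `passwd` is not, NNI would authenticate with the ssh key here.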
### kubeflowConfig
#### operator
Required. String. Has to be `tf-operator` or `pytorch-operator`.
Specifies the kubeflow operator to be used. NNI supports `tf-operator` in the current version.
#### storage
Optional. String. Default: `nfs`.
Specifies the storage type of kubeflow, including `nfs` and `azureStorage`.
#### nfs
Required if using nfs. Key-value pairs.

* __server__ is the host of the nfs server.
* __path__ is the mounted path of nfs.

#### keyVault

Required if using azure storage. Key-value pairs.

Set __keyVault__ to store the private key of your azure storage account. Refer to https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2.

* __vaultName__ is the value of `--vault-name` used in the az command.
* __name__ is the value of `--name` used in the az command.

#### azureStorage

Required if using azure storage. Key-value pairs.

Set the azure storage account to store code files.

* __accountName__ is the name of the azure storage account.
* __azureShare__ is the share of the azure file storage.

#### uploadRetryCount

Required if using azure storage. Integer between 1 and 99999.

If uploading files to azure storage fails, NNI will retry the upload; this field specifies the number of attempts to re-upload files.
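For instance, a `kubeflowConfig` using nfs storage could be sketched as follows (the server address and mounted path are illustrative):

```yaml
kubeflowConfig:
  operator: tf-operator
  storage: nfs
  nfs:
    server: 10.10.10.10
    path: /var/nfs/general
```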
### paiConfig

#### userName

Required. String.

The user name of your pai account.

#### password

Required if using password authentication. String.

The password of the pai account.

#### token

Required if using token authentication. String.

Personal access token that can be retrieved from the PAI portal.

#### host

Required. String.

The hostname or IP address of PAI.
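Similarly, a `paiConfig` section authenticating with a password might look like this sketch (account details are illustrative):

```yaml
paiConfig:
  userName: test
  password: test123
  host: 10.10.10.10
```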
## Examples

### Local mode

If users want to run trial jobs on the local machine and use annotation to generate the search space, they could use the following config:
```yaml
authorName: test
...
gpuNum: 0
```
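For reference, a complete local mode configuration of this shape could be sketched as follows (the experiment settings, tuner choice and paths are illustrative):

```yaml
authorName: test
experimentName: test_experiment
trialConcurrency: 3
maxExecDuration: 1h
maxTrialNum: 10
trainingServicePlatform: local
useAnnotation: true
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: /nni/mnist
  gpuNum: 0
```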
You can add assessor configuration.
```yaml
authorName: test
...
gpuNum: 0
```
Or you could specify your own tuner and assessor files as follows:
```yaml
authorName: test
...
gpuNum: 0
```
### Remote mode

If running trial jobs on a remote machine, users could specify the remote machine information in the following format:
```yaml
authorName: test
...
passphrase: qwert
```
### PAI mode
```yaml
authorName: test
...
host: 10.10.10.10
```
### Kubeflow mode

kubeflow with nfs storage.

```yaml
...
path: /var/nfs/general
```
### Kubeflow with azure storage

```yaml
authorName: default
...
```
### Could not open webUI link

Failure to open the WebUI may have the following causes:

* `http://127.0.0.1`, `http://172.17.0.1` and `http://10.0.0.15` refer to localhost. If you start your experiment on a server or remote machine, replace the IP with your server IP to view the WebUI, like `http://[your_server_ip]:8080`.
* If you still can't see the WebUI after using the server IP, check the proxy and firewall settings of your machine, or use the browser on the machine where you started your NNI experiment.
* Another reason may be that your experiment failed and NNI could not get the experiment information. You can check the log of NNIManager at the following path: `~/nni/experiment/[your_experiment_id]/log/nnimanager.log`
### Restful server start failed
Probably it's a problem with your network config. Here is a checklist.
* You might need to link `127.0.0.1` with `localhost`. Add a line `127.0.0.1 localhost` to `/etc/hosts`.
* It's also possible that you have set some proxy config. Check your environment for variables like `HTTP_PROXY` or `HTTPS_PROXY` and unset if they are set.
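The checklist above can be run as shell commands (assuming a Linux machine; the proxy variable names cover the common lowercase and uppercase spellings):

```shell
# verify that localhost is mapped in /etc/hosts; if nothing prints,
# append a "127.0.0.1 localhost" line to /etc/hosts
grep "localhost" /etc/hosts

# list any proxy variables that may interfere with the restful server
env | grep -i "proxy" || echo "no proxy variables set"

# unset them for the current shell session before starting NNI
unset HTTP_PROXY HTTPS_PROXY http_proxy https_proxy
```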
### NNI on Windows problems

Please refer to [NNI on Windows](NniOnWindows.md)
# Installation of NNI

Currently we support installation on Linux, Mac and Windows.

## **Installation on Linux & Mac**
# NNI on Windows (experimental feature)

Running NNI on Windows is an experimental feature. Windows 10.1809 is well tested and recommended.

## **Installation on Windows**
### Not supported tuner on Windows

SMAC is not supported currently; see this [GitHub issue](https://github.com/automl/SMAC3/issues/483) for the specific reason.
### Use a Windows server as a remote worker
Currently you can't.
Note:

* If there is any error like `Segmentation fault`, please refer to [FAQ](FAQ.md)