Unverified commit 45c6508e authored by Chi Song, committed by GitHub

fix format of doc, change nni to NNI, yaml to yml. (#660)

fix indents of doc,
change nni to NNI
yaml to yml(file) and YAML(doc)
parent bc9eab33
Dockerfile
===
## 1.Description
-This is the Dockerfile of nni project. It includes serveral popular deep learning frameworks and NNI. It is tested on `Ubuntu 16.04 LTS`:
+This is the Dockerfile of NNI project. It includes serveral popular deep learning frameworks and NNI. It is tested on `Ubuntu 16.04 LTS`:
```
CUDA 9.0, CuDNN 7.0
...
@@ -8,7 +8,8 @@ Currently we recommend sharing weights through NFS (Network File System), which
### Weight Sharing through NFS file
With the NFS setup (see below), trial code can share model weight through loading & saving files. Here we recommend that user feed the tuner with the storage path:
-```yaml
+```yml
tuner:
  codeDir: path/to/customer_tuner
  classFileName: customer_tuner.py
@@ -17,6 +18,7 @@ tuner:
    ...
    save_dir_root: /nfs/storage/path/
```
And let tuner decide where to save & load weights and feed the paths to trials through `nni.get_next_parameters()`:
<img src="https://user-images.githubusercontent.com/23273522/51817667-93ebf080-2306-11e9-8395-b18b322062bc.png" alt="drawing" width="700"/>
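To make the weight-sharing flow concrete, below is a minimal trial-side sketch. It assumes the tuner embeds `load_path` and `save_path` fields in the parameters returned by `nni.get_next_parameters()`; those field names, and the toy file-based "weights", are illustrative only, not part of the NNI API.
```python
import os
import nni

if __name__ == '__main__':
    params = nni.get_next_parameters()   # configuration generated by the tuner
    # Assumed fields: the tuner is expected to put the shared NFS paths here.
    load_path = params.get('load_path')  # weights saved by the parent trial, if any
    save_path = params['save_path']      # where this trial publishes its weights

    weight = 0.0
    if load_path and os.path.exists(load_path):
        with open(load_path) as f:       # warm-start from the parent's weights
            weight = float(f.read())

    weight += params.get('lr', 0.1)      # stand-in for real training
    accuracy = min(weight, 1.0)          # stand-in for real evaluation

    with open(save_path, 'w') as f:      # make weights available to child trials
        f.write(str(weight))
    nni.report_final_result(accuracy)
```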
...
@@ -32,7 +32,7 @@ It is applicable in a wide range of performance curves, thus, can be used in var
**Usage example:**
-```yaml
+```yml
# config.yml
assessor:
  builtinAssessorName: Medianstop
@@ -62,7 +62,7 @@ It is applicable in a wide range of performance curves, thus, can be used in var
**Usage example:**
-```yaml
+```yml
# config.yml
assessor:
  builtinAssessorName: Curvefitting
...
@@ -39,7 +39,7 @@ TPE, as a black-box optimization, can be used in various scenarios and shows goo
**Usage example:**
-```yaml
+```yml
# config.yml
tuner:
  builtinTunerName: TPE
@@ -65,7 +65,7 @@ Random search is suggested when each trial does not take too long (e.g., each tr
**Usage example**
-```yaml
+```yml
# config.yml
tuner:
  builtinTunerName: Random
@@ -91,7 +91,7 @@ Anneal is suggested when each trial does not take too long, and you have enough
**Usage example**
-```yaml
+```yml
# config.yml
tuner:
  builtinTunerName: Anneal
@@ -117,7 +117,7 @@ Its requirement of computation resource is relatively high. Specifically, it req
**Usage example**
-```yaml
+```yml
# config.yml
tuner:
  builtinTunerName: Evolution
@@ -143,7 +143,7 @@ Similar to TPE, SMAC is also a black-box tuner which can be tried in various sce
**Usage example**
-```yaml
+```yml
# config.yml
tuner:
  builtinTunerName: SMAC
@@ -165,7 +165,7 @@ If the configurations you want to try have been decided, you can list them in se
**Usage example**
-```yaml
+```yml
# config.yml
tuner:
  builtinTunerName: BatchTuner
@@ -206,7 +206,7 @@ It is suggested when search space is small, it is feasible to exhaustively sweep
**Usage example**
-```yaml
+```yml
# config.yml
tuner:
  builtinTunerName: GridSearch
@@ -232,7 +232,7 @@ It is suggested when you have limited computation resource but have relatively l
**Usage example**
-```yaml
+```yml
# config.yml
advisor:
  builtinAdvisorName: Hyperband
@@ -268,7 +268,7 @@ It is suggested that you want to apply deep learning methods to your task (your
**Usage example**
-```yaml
+```yml
# config.yml
tuner:
  builtinTunerName: NetworkMorphism
@@ -304,7 +304,7 @@ Similar to TPE and SMAC, Metis is a black-box tuner. If your system takes a long
**Usage example**
-```yaml
+```yml
# config.yml
tuner:
  builtinTunerName: MetisTuner
...
@@ -6,7 +6,7 @@ So, if user want to implement a customized Advisor, she/he only need to:
1. Define an Advisor inheriting from the MsgDispatcherBase class
1. Implement the methods with prefix `handle_` except `handle_request`
-1. Configure your customized Advisor in experiment yaml config file
+1. Configure your customized Advisor in experiment YAML config file
Here is an example:
@@ -24,11 +24,11 @@ class CustomizedAdvisor(MsgDispatcherBase):
Please refer to the implementation of Hyperband ([src/sdk/pynni/nni/hyperband_advisor/hyperband_advisor.py](../src/sdk/pynni/nni/hyperband_advisor/hyperband_advisor.py)) for how to implement the methods.
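For orientation, here is a bare-bones sketch of such an Advisor. The import path and the exact set of `handle_` methods are assumptions based on the SDK layout of this release (see `src/sdk/pynni/nni/msg_dispatcher_base.py` and the Hyperband advisor above); verify them against your installed version.
```python
from nni.msg_dispatcher_base import MsgDispatcherBase  # assumed import path

class CustomizedAdvisor(MsgDispatcherBase):
    """Skeleton only: each handle_* method below is assumed from the base
    class; fill them with a real generation/assessment strategy."""

    def __init__(self, optimize_mode='maximize'):
        super().__init__()
        self.optimize_mode = optimize_mode

    def handle_initialize(self, data):
        # `data` is the search space pushed by NNI manager.
        pass

    def handle_request_trial_jobs(self, data):
        # `data` is the number of trial jobs requested; generate and
        # dispatch that many parameter configurations here.
        pass

    def handle_report_metric_data(self, data):
        # Intermediate or final metrics of a trial; update the strategy.
        pass

    def handle_trial_end(self, data):
        # Book-keeping when a trial finishes (succeeded, failed or killed).
        pass

    def handle_update_search_space(self, data):
        pass

    def handle_add_customized_trial(self, data):
        pass

    def handle_import_data(self, data):
        pass
```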
-**3) Configure your customized Advisor in experiment yaml config file**
+**3) Configure your customized Advisor in experiment YAML config file**
Similar to tuner and assessor. NNI needs to locate your customized Advisor class and instantiate the class, so you need to specify the location of the customized Advisor class and pass literal values as parameters to the \_\_init__ constructor.
-```yaml
+```yml
advisor:
  codeDir: /home/abc/myadvisor
  classFileName: my_customized_advisor.py
...
@@ -8,7 +8,7 @@ If you want to implement a customized Assessor, there are three things for you t
1) Inherit an assessor of a base Assessor class
2) Implement assess_trial function
-3) Configure your customized Assessor in experiment yaml config file
+3) Configure your customized Assessor in experiment YAML config file
**1. Inherit an assessor of a base Assessor class**
@@ -38,11 +38,11 @@ class CustomizedAssessor(Assessor):
    ...
```
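Putting steps 1 and 2 together, a minimal `CustomizedAssessor` might look like the sketch below; the `nni.assessor` import path and the `AssessResult` return values match the SDK of this release, while the stopping rule itself is a toy placeholder.
```python
from nni.assessor import Assessor, AssessResult

class CustomizedAssessor(Assessor):
    def __init__(self, higher_is_better=True):
        self.higher_is_better = higher_is_better

    def assess_trial(self, trial_job_id, trial_history):
        # trial_history is the list of intermediate results reported so far.
        # Returning AssessResult.Bad asks NNI to early-stop this trial.
        if len(trial_history) < 3:
            return AssessResult.Good  # too few points to judge
        improved = trial_history[-1] > trial_history[0]
        if not self.higher_is_better:
            improved = trial_history[-1] < trial_history[0]
        return AssessResult.Good if improved else AssessResult.Bad
```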
-**3. Configure your customized Assessor in experiment yaml config file**
+**3. Configure your customized Assessor in experiment YAML config file**
NNI needs to locate your customized Assessor class and instantiate the class, so you need to specify the location of the customized Assessor class and pass literal values as parameters to the \_\_init__ constructor.
-```yaml
+```yml
assessor:
  codeDir: /home/abc/myassessor
...
@@ -8,7 +8,7 @@ If you want to implement and use your own tuning algorithm, you can implement a
1) Inherit a tuner of a base Tuner class
2) Implement receive_trial_result and generate_parameter function
-3) Configure your customized tuner in experiment yaml config file
+3) Configure your customized tuner in experiment YAML config file
Here is an example:
@@ -91,11 +91,11 @@ _fd = open(os.path.join(_pwd, 'data.txt'), 'r')
This is because your tuner is not executed in the directory of your tuner (i.e., `pwd` is not the directory of your own tuner).
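As a complete (if naive) reference, the sketch below wires the two required methods into a working tuner; it assumes the base class lives at `nni.tuner` and that the search space only contains `choice` variables — both points should be checked against your SDK and search space file.
```python
import random
from nni.tuner import Tuner

class CustomizedTuner(Tuner):
    def __init__(self, optimize_mode='maximize'):
        self.optimize_mode = optimize_mode
        self.search_space = {}

    def update_search_space(self, search_space):
        # Called by NNI with the parsed content of the search space file.
        self.search_space = search_space

    def generate_parameters(self, parameter_id):
        # Return one configuration; NNI hands it to a new trial.
        # Toy strategy: sample uniformly from each `choice` variable.
        return {name: random.choice(spec['_value'])
                for name, spec in self.search_space.items()
                if spec.get('_type') == 'choice'}

    def receive_trial_result(self, parameter_id, parameters, value):
        # `value` is the final result the trial reported; a real tuner
        # would learn from it here.
        pass
```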
-**3. Configure your customized tuner in experiment yaml config file**
+**3. Configure your customized tuner in experiment YAML config file**
NNI needs to locate your customized tuner class and instantiate the class, so you need to specify the location of the customized tuner class and pass literal values as parameters to the \_\_init__ constructor.
-```yaml
+```yml
tuner:
  codeDir: /home/abc/mytuner
...
@@ -3,7 +3,7 @@
Assessor module is for assessing running trials. One common use case is early stopping, which terminates unpromising trial jobs based on their intermediate results.
## Using NNI built-in Assessor
-Here we use the same example `examples/trials/mnist-annotation`. We use `Medianstop` assessor for this experiment. The yaml configure file is shown below:
+Here we use the same example `examples/trials/mnist-annotation`. We use `Medianstop` assessor for this experiment. The yml configure file is shown below:
```
authorName: your_name
experimentName: auto_mnist
@@ -33,7 +33,7 @@ trial:
For our built-in assessors, you need to fill two fields: `builtinAssessorName` which chooses NNI provided assessors (refer to [here]() for built-in assessors), `optimize_mode` which includes maximize and minimize (you want to maximize or minimize your trial result).
## Using user customized Assessor
-You can also write your own assessor following the guidance [here](). For example, you wrote an assessor for `examples/trials/mnist-annotation`. You should prepare the yaml configure below:
+You can also write your own assessor following the guidance [here](). For example, you wrote an assessor for `examples/trials/mnist-annotation`. You should prepare the yml configure below:
```
authorName: your_name
experimentName: auto_mnist
...
# Experiment config reference
A config file is needed when create an experiment, the path of the config file is provide to nnictl.
-The config file is written in yaml format, and need to be written correctly.
+The config file is written in YAML format, and need to be written correctly.
This document describes the rule to write config file, and will provide some examples and templates.
- [Template](#Template) (the templates of an config file)
@@ -149,7 +149,7 @@ machineList:
* __maxTrialNum__
  * Description
-  __maxTrialNum__ specifies the max number of trial jobs created by nni, including succeeded and failed jobs.
+  __maxTrialNum__ specifies the max number of trial jobs created by NNI, including succeeded and failed jobs.
* __trainingServicePlatform__
  * Description
@@ -164,7 +164,7 @@ machineList:
  * __pai__ submit trial jobs to [OpenPai](https://github.com/Microsoft/pai) of Microsoft. For more details of pai configuration, please reference [PAIMOdeDoc](./PAIMode.md)
-  * __kubeflow__ submit trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/), nni support kubeflow based on normal kubernetes and [azure kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/).
+  * __kubeflow__ submit trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/), NNI support kubeflow based on normal kubernetes and [azure kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/).
* __searchSpacePath__
  * Description
@@ -182,7 +182,7 @@ machineList:
* __nniManagerIp__
  * Description
-  __nniManagerIp__ set the IP address of the machine on which nni manager process runs. This field is optional, and if it's not set, eth0 device IP will be used instead.
+  __nniManagerIp__ set the IP address of the machine on which NNI manager process runs. This field is optional, and if it's not set, eth0 device IP will be used instead.
  Note: run ifconfig on NNI manager's machine to check if eth0 device exists. If not, we recommend to set nnimanagerIp explicitly.
@@ -200,11 +200,11 @@ machineList:
* __tuner__
  * Description
-  __tuner__ specifies the tuner algorithm in the experiment, there are two kinds of ways to set tuner. One way is to use tuner provided by nni sdk, need to set __builtinTunerName__ and __classArgs__. Another way is to use users' own tuner file, and need to set __codeDirectory__, __classFileName__, __className__ and __classArgs__.
+  __tuner__ specifies the tuner algorithm in the experiment, there are two kinds of ways to set tuner. One way is to use tuner provided by NNI sdk, need to set __builtinTunerName__ and __classArgs__. Another way is to use users' own tuner file, and need to set __codeDirectory__, __classFileName__, __className__ and __classArgs__.
  * __builtinTunerName__ and __classArgs__
    * __builtinTunerName__
-    __builtinTunerName__ specifies the name of system tuner, nni sdk provides four kinds of tuner, including {__TPE__, __Random__, __Anneal__, __Evolution__, __BatchTuner__, __GridSearch__}
+    __builtinTunerName__ specifies the name of system tuner, NNI sdk provides four kinds of tuner, including {__TPE__, __Random__, __Anneal__, __Evolution__, __BatchTuner__, __GridSearch__}
    * __classArgs__
    __classArgs__ specifies the arguments of tuner algorithm. If the __builtinTunerName__ is in {__TPE__, __Random__, __Anneal__, __Evolution__}, user should set __optimize_mode__.
@@ -231,11 +231,11 @@ machineList:
  * Description
-  __assessor__ specifies the assessor algorithm to run an experiment, there are two kinds of ways to set assessor. One way is to use assessor provided by nni sdk, users need to set __builtinAssessorName__ and __classArgs__. Another way is to use users' own assessor file, and need to set __codeDirectory__, __classFileName__, __className__ and __classArgs__.
+  __assessor__ specifies the assessor algorithm to run an experiment, there are two kinds of ways to set assessor. One way is to use assessor provided by NNI sdk, users need to set __builtinAssessorName__ and __classArgs__. Another way is to use users' own assessor file, and need to set __codeDirectory__, __classFileName__, __className__ and __classArgs__.
  * __builtinAssessorName__ and __classArgs__
    * __builtinAssessorName__
-    __builtinAssessorName__ specifies the name of system assessor, nni sdk provides one kind of assessor {__Medianstop__}
+    __builtinAssessorName__ specifies the name of system assessor, NNI sdk provides one kind of assessor {__Medianstop__}
    * __classArgs__
    __classArgs__ specifies the arguments of assessor algorithm
@@ -383,7 +383,7 @@ machineList:
  If users use ssh key to login remote machine, could set __sshKeyPath__ in config file. __sshKeyPath__ is the path of ssh key file, which should be valid.
-  Note: if users set passwd and sshKeyPath simultaneously, nni will try passwd.
+  Note: if users set passwd and sshKeyPath simultaneously, NNI will try passwd.
* __passphrase__
@@ -393,7 +393,7 @@ machineList:
  * __operator__
-  __operator__ specify the kubeflow's operator to be used, nni support __tf-operator__ in current version.
+  __operator__ specify the kubeflow's operator to be used, NNI support __tf-operator__ in current version.
  * __storage__
@@ -611,11 +611,11 @@ trial:
  gpuNum: 4
  cpuNum: 2
  memoryMB: 10000
-  #The docker image to run nni job on pai
+  #The docker image to run NNI job on pai
  image: msranni/nni:latest
  #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
  dataDir: hdfs://10.11.12.13:9000/test
-  #The hdfs directory to store output data generated by nni, format 'hdfs://host:port/directory'
+  #The hdfs directory to store output data generated by NNI, format 'hdfs://host:port/directory'
  outputDir: hdfs://10.11.12.13:9000/test
paiConfig:
  #The username to login pai
...
@@ -9,14 +9,14 @@ When met errors like below, try to clean up **tmp** folder first.
> OSError: [Errno 28] No space left on device
### Cannot get trials' metrics in OpenPAI mode
-In OpenPAI training mode, we start a rest server which listens on 51189 port in nniManager to receive metrcis reported from trials running in OpenPAI cluster. If you didn't see any metrics from WebUI in OpenPAI mode, check your machine where nniManager runs on to make sure 51189 port is turned on in the firewall rule.
+In OpenPAI training mode, we start a rest server which listens on 51189 port in NNI Manager to receive metrcis reported from trials running in OpenPAI cluster. If you didn't see any metrics from WebUI in OpenPAI mode, check your machine where NNI manager runs on to make sure 51189 port is turned on in the firewall rule.
### Segmentation Fault (core dumped) when installing
> make: *** [install-XXX] Segmentation fault (core dumped)
Please try the following solutions in turn:
* Update or reinstall you current python's pip like `python3 -m pip install -U pip`
-* Install nni with `--no-cache-dir` flag like `python3 -m pip install nni --no-cache-dir`
+* Install NNI with `--no-cache-dir` flag like `python3 -m pip install nni --no-cache-dir`
### Job management error: getIPV4Address() failed because os.networkInterfaces().eth0 is undefined.
Your machine don't have eth0 device, please set nniManagerIp in your config file manually. [refer](https://github.com/Microsoft/nni/blob/master/docs/ExperimentConfig.md)
@@ -25,7 +25,7 @@ Your machine don't have eth0 device, please set nniManagerIp in your config file
When the duration of experiment reaches the maximum duration, nniManager will not create new trials, but the existing trials will continue unless user manually stop the experiment.
### Could not stop an experiment using `nnictl stop`
-If you upgrade your nni or you delete some config files of nni when there is an experiment running, this kind of issue may happen because the loss of config file. You could use `ps -ef | grep node` to find the pid of your experiment, and use `kill -9 {pid}` to kill it manually.
+If you upgrade your NNI or you delete some config files of NNI when there is an experiment running, this kind of issue may happen because the loss of config file. You could use `ps -ef | grep node` to find the pid of your experiment, and use `kill -9 {pid}` to kill it manually.
### Could not get `default metric` in webUI of virtual machines
Config the network mode to bridge mode or other mode that could make virtual machine's host accessible from external machine, and make sure the port of virtual machine is not forbidden by firewall.
...
@@ -6,7 +6,7 @@ NNI supports running experiment using [FrameworkController](https://github.com/M
1. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this [guideline](https://kubernetes.io/docs/setup/) to set up Kubernetes
2. Prepare a **kubeconfig** file, which will be used by NNI to interact with your kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer this [guideline]( https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
3. If your NNI trial job needs GPU resource, you should follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure **Nvidia device plugin for Kubernetes**.
-4. Prepare a **NFS server** and export a general purpose mount (we recommend to map your NFS server path in `root_squash option`, otherwise permission issue may raise when nni copy files to NFS. Refer this [page](https://linux.die.net/man/5/exports) to learn what root_squash option is), or **Azure File Storage**.
+4. Prepare a **NFS server** and export a general purpose mount (we recommend to map your NFS server path in `root_squash option`, otherwise permission issue may raise when NNI copies files to NFS. Refer this [page](https://linux.die.net/man/5/exports) to learn what root_squash option is), or **Azure File Storage**.
5. Install **NFS client** on the machine where you install NNI and run nnictl to create experiment. Run this command to install NFSv4 client:
```
apt-get install nfs-common
@@ -17,12 +17,12 @@ NNI supports running experiment using [FrameworkController](https://github.com/M
## Prerequisite for Azure Kubernetes Service
1. NNI support kubeflow based on Azure Kubernetes Service, follow the [guideline](https://azure.microsoft.com/en-us/services/kubernetes-service/) to set up Azure Kubernetes Service.
2. Install [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) and __kubectl__. Use `az login` to set azure account, and connect kubectl client to AKS, refer this [guideline](https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster).
-3. Follow the [guideline](https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal) to create azure file storage account. If you use Azure Kubernetes Service, nni need Azure Storage Service to store code files and the output files.
+3. Follow the [guideline](https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal) to create azure file storage account. If you use Azure Kubernetes Service, NNI need Azure Storage Service to store code files and the output files.
-4. To access Azure storage service, nni need the access key of the storage account, and nni use [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) Service to protect your private key. Set up Azure Key Vault Service, add a secret to Key Vault to store the access key of Azure storage account. Follow this [guideline](https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli) to store the access key.
+4. To access Azure storage service, NNI need the access key of the storage account, and NNI uses [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) Service to protect your private key. Set up Azure Key Vault Service, add a secret to Key Vault to store the access key of Azure storage account. Follow this [guideline](https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli) to store the access key.
## Set up FrameworkController
-Follow the [guideline](https://github.com/Microsoft/frameworkcontroller/tree/master/example/run) to set up frameworkcontroller in the kubernetes cluster, nni support frameworkcontroller by the statefulset mode.
+Follow the [guideline](https://github.com/Microsoft/frameworkcontroller/tree/master/example/run) to set up frameworkcontroller in the kubernetes cluster, NNI supports frameworkcontroller by the statefulset mode.
## Design
Please refer the design of [kubeflow training service](./KubeflowMode.md), frameworkcontroller training service pipeline is similar.
@@ -71,7 +71,7 @@ frameworkcontrollerConfig:
    server: {your_nfs_server}
    path: {your_nfs_server_exported_path}
```
-If you use Azure Kubernetes Service, you should set `frameworkcontrollerConfig` in your config yaml file as follows:
+If you use Azure Kubernetes Service, you should set `frameworkcontrollerConfig` in your config YAML file as follows:
```
frameworkcontrollerConfig:
  storage: azureStorage
@@ -82,9 +82,9 @@ frameworkcontrollerConfig:
    accountName: {your_storage_account_name}
    azureShare: {your_azure_share_name}
```
-Note: You should explicitly set `trainingServicePlatform: frameworkcontroller` in nni config yaml file if you want to start experiment in frameworkcontrollerConfig mode.
+Note: You should explicitly set `trainingServicePlatform: frameworkcontroller` in NNI config YAML file if you want to start experiment in frameworkcontrollerConfig mode.
-The trial's config format for nni frameworkcontroller mode is a simple version of frameworkcontroller's offical config, you could refer the [tensorflow example of frameworkcontroller](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for deep understanding.
+The trial's config format for NNI frameworkcontroller mode is a simple version of frameworkcontroller's offical config, you could refer the [tensorflow example of frameworkcontroller](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for deep understanding.
Trial configuration in frameworkcontroller mode have the following configuration keys:
* taskRoles: you could set multiple task roles in config file, and each task role is a basic unit to process in kubernetes cluster.
  * name: the name of task role specified, like "worker", "ps", "master".
...
@@ -34,7 +34,7 @@ An experiment is to run multiple trial jobs, each trial job tries a configuratio
* Provide a runnable trial
* Provide or choose a tuner
-* Provide a yaml experiment configure file
+* Provide a YAML experiment configure file
* (optional) Provide or choose an assessor
**Prepare trial**: Let's use a simple trial example, e.g. mnist, provided by NNI. After you installed NNI, NNI examples have been put in ~/nni/examples, run `ls ~/nni/examples/trials` to see all the trial examples. You can simply execute the following command to run the NNI mnist example:
@@ -43,11 +43,11 @@ An experiment is to run multiple trial jobs, each trial job tries a configuratio
python3 ~/nni/examples/trials/mnist-annotation/mnist.py
```
-This command will be filled in the yaml configure file below. Please refer to [here](howto_1_WriteTrial.md) for how to write your own trial.
+This command will be filled in the YAML configure file below. Please refer to [here](howto_1_WriteTrial.md) for how to write your own trial.
**Prepare tuner**: NNI supports several popular automl algorithms, including Random Search, Tree of Parzen Estimators (TPE), Evolution algorithm etc. Users can write their own tuner (refer to [here](howto_2_CustomizedTuner.md), but for simplicity, here we choose a tuner provided by NNI as below:
-```yaml
+```yml
tuner:
  builtinTunerName: TPE
  classArgs:
@@ -56,9 +56,9 @@ tuner:
*builtinTunerName* is used to specify a tuner in NNI, *classArgs* are the arguments pass to the tuner, *optimization_mode* is to indicate whether you want to maximize or minimize your trial's result.
-**Prepare configure file**: Since you have already known which trial code you are going to run and which tuner you are going to use, it is time to prepare the yaml configure file. NNI provides a demo configure file for each trial example, `cat ~/nni/examples/trials/mnist-annotation/config.yml` to see it. Its content is basically shown below:
+**Prepare configure file**: Since you have already known which trial code you are going to run and which tuner you are going to use, it is time to prepare the YAML configure file. NNI provides a demo configure file for each trial example, `cat ~/nni/examples/trials/mnist-annotation/config.yml` to see it. Its content is basically shown below:
-```yaml
+```yml
authorName: your_name
experimentName: auto_mnist
...
@@ -14,7 +14,7 @@ Moreover, in GridSearch Tuner, for users' convenience, the definition of `qunifo
## 2. Usage
-Since Grid Search Tuner will exhaust all possible hyper-parameter combination according to the search space file without any hyper-parameter for tuner itself, all you need to do is to specify tuner name in your experiment's yaml config file:
+Since Grid Search Tuner will exhaust all possible hyper-parameter combination according to the search space file without any hyper-parameter for tuner itself, all you need to do is to specify tuner name in your experiment's YAML config file:
```
tuner:
...
@@ -29,8 +29,8 @@ This optimization approach is described in detail in [Algorithms for Hyper-Param
_Suggested scenario_: TPE, as a black-box optimization, can be used in various scenarios, and shows good performance in general. Especially when you have limited computation resource and can only try a small number of trials. From a large amount of experiments, we could found that TPE is far better than Random Search.
_Usage_:
-```yaml
+```yml
-# config.yaml
+# config.yml
tuner:
  builtinTunerName: TPE
  classArgs:
@@ -46,8 +46,8 @@ In [Random Search for Hyper-Parameter Optimization][2] show that Random Search m
_Suggested scenario_: Random search is suggested when each trial does not take too long (e.g., each trial can be completed very soon, or early stopped by assessor quickly), and you have enough computation resource. Or you want to uniformly explore the search space. Random Search could be considered as baseline of search algorithm.
_Usage_:
-```yaml
+```yml
-# config.yaml
+# config.yml
tuner:
  builtinTunerName: Random
```
@@ -60,8 +60,8 @@ This simple annealing algorithm begins by sampling from the prior, but tends ove
_Suggested scenario_: Anneal is suggested when each trial does not take too long, and you have enough computation resource(almost same with Random Search). Or the variables in search space could be sample from some prior distribution.
_Usage_:
-```yaml
+```yml
-# config.yaml
+# config.yml
tuner:
  builtinTunerName: Anneal
  classArgs:
@@ -77,8 +77,8 @@ Naive Evolution comes from [Large-Scale Evolution of Image Classifiers][3]. It r
_Suggested scenario_: Its requirement of computation resource is relatively high. Specifically, it requires large inital population to avoid falling into local optimum. If your trial is short or leverages assessor, this tuner is a good choice. And, it is more suggested when your trial code supports weight transfer, that is, the trial could inherit the converged weights from its parent(s). This can greatly speed up the training progress.
_Usage_:
-```yaml
+```yml
-# config.yaml
+# config.yml
tuner:
  builtinTunerName: Evolution
  classArgs:
@@ -89,9 +89,9 @@ _Usage_:
<a name="SMAC"></a>
**SMAC**
-[SMAC][4] is based on Sequential Model-Based Optimization (SMBO). It adapts the most prominent previously used model class (Gaussian stochastic process models) and introduces the model class of random forests to SMBO, in order to handle categorical parameters. The SMAC supported by nni is a wrapper on [the SMAC3 github repo][5].
+[SMAC][4] is based on Sequential Model-Based Optimization (SMBO). It adapts the most prominent previously used model class (Gaussian stochastic process models) and introduces the model class of random forests to SMBO, in order to handle categorical parameters. The SMAC supported by NNI is a wrapper on [the SMAC3 github repo][5].
-Note that SMAC on nni only supports a subset of the types in [search space spec](./SearchSpaceSpec.md), including `choice`, `randint`, `uniform`, `loguniform`, `quniform(q=1)`.
+Note that SMAC on NNI only supports a subset of the types in [search space spec](./SearchSpaceSpec.md), including `choice`, `randint`, `uniform`, `loguniform`, `quniform(q=1)`.
_Installation_:
* Install swig first. (`sudo apt-get install swig` for Ubuntu users)
@@ -100,8 +100,8 @@ _Installation_:
_Suggested scenario_: Similar to TPE, SMAC is also a black-box tuner which can be tried in various scenarios, and is suggested when computation resource is limited. It is optimized for discrete hyperparameters, thus, suggested when most of your hyperparameters are discrete.
_Usage_:
-```yaml
+```yml
-# config.yaml
+# config.yml
tuner:
  builtinTunerName: SMAC
  classArgs:
@@ -117,8 +117,8 @@ Batch tuner allows users to simply provide several configurations (i.e., choices
_Suggested sceanrio_: If the configurations you want to try have been decided, you can list them in searchspace file (using `choice`) and run them using batch tuner.
_Usage_:
-```yaml
+```yml
-# config.yaml
+# config.yml
tuner:
  builtinTunerName: BatchTuner
```
@@ -149,8 +149,8 @@ Note that the only acceptable types of search space are `choice`, `quniform`, `q
_Suggested scenario_: It is suggested when search space is small, it is feasible to exhaustively sweeping the whole search space.
_Usage_:
-```yaml
+```yml
-# config.yaml
+# config.yml
tuner:
  builtinTunerName: GridSearch
```
@@ -163,8 +163,8 @@ _Usage_:
_Suggested scenario_: It is suggested when you have limited computation resource but have relatively large search space. It performs good in the scenario that intermediate result (e.g., accuracy) can reflect good or bad of final result (e.g., accuracy) to some extent.
_Usage_:
-```yaml
+```yml
-# config.yaml
+# config.yml
advisor:
  builtinAdvisorName: Hyperband
  classArgs:
@@ -189,8 +189,8 @@ NetworkMorphism requires [pyTorch](https://pytorch.org/get-started/locally), so
_Suggested scenario_: It is suggested that you want to apply deep learning methods to your task (your own dataset) but you have no idea of how to choose or design a network. You modify the [example](../examples/trials/network_morphism/cifar10/cifar10_keras.py) to fit your own dataset and your own data augmentation method. Also you can change the batch size, learning rate or optimizer. It is feasible for different tasks to find a good network architecture. Now this tuner only supports the cv domain.
_Usage_:
-```yaml
+```yml
-# config.yaml
+# config.yml
tuner:
  builtinTunerName: NetworkMorphism
  classArgs:
@@ -232,11 +232,11 @@ Metis Tuner requires [sklearn](https://scikit-learn.org/), so users should insta
_Suggested scenario_:
-Similar to TPE and SMAC, Metis is a black-box tuner. If your system takes a long time to finish each trial, Metis is more favorable than other approaches such as random search. Furthermore, Metis provides guidance on the subsequent trial. Here is an [example](../examples/trials/auto-gbdt/search_space_metis.json) about the use of Metis. User only need to send the final result like `accuracy` to tuner, by calling the nni SDK.
+Similar to TPE and SMAC, Metis is a black-box tuner. If your system takes a long time to finish each trial, Metis is more favorable than other approaches such as random search. Furthermore, Metis provides guidance on the subsequent trial. Here is an [example](../examples/trials/auto-gbdt/search_space_metis.json) about the use of Metis. User only need to send the final result like `accuracy` to tuner, by calling the NNI SDK.
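From the trial's point of view this is a single SDK call; a hedged sketch (the parameter name and the scoring line are illustrative stand-ins for real training code):
```python
import nni

params = nni.get_next_parameters()  # configuration proposed by the tuner
# Stand-in for real training: score the suggested learning rate.
accuracy = 1.0 - abs(params.get('learning_rate', 0.1) - 0.05)
nni.report_final_result(accuracy)   # the final metric Metis learns from
```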
_Usage_:
-```yaml
+```yml
-# config.yaml
+# config.yml
tuner:
  builtinTunerName: MetisTuner
  classArgs:
@@ -262,7 +262,7 @@ Medianstop is a simple early stopping rule mentioned in the [paper][8]. It stops
_Suggested scenario_: It is applicable in a wide range of performance curves, thus, can be used in various scenarios to speed up the tuning progress.
_Usage_:
-```yaml
+```yml
assessor:
  builtinAssessorName: Medianstop
  classArgs:
@@ -282,7 +282,7 @@ Curve Fitting Assessor is a LPA(learning, predicting, assessing) algorithm. It s
_Suggested scenario_: It is applicable in a wide range of performance curves, thus, can be used in various scenarios to speed up the tuning progress. Even better, it's able to handle and assess curves with similar performance.
_Usage_:
-```yaml
+```yml
assessor:
  builtinAssessorName: Curvefitting
  classArgs:
...
...@@ -7,7 +7,7 @@ Now NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/ku ...@@ -7,7 +7,7 @@ Now NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/ku
2. Download, set up, and deploy **Kubelow** to your Kubernetes cluster. Follow this [guideline](https://www.kubeflow.org/docs/started/getting-started/) to set up Kubeflow 2. Download, set up, and deploy **Kubelow** to your Kubernetes cluster. Follow this [guideline](https://www.kubeflow.org/docs/started/getting-started/) to set up Kubeflow
3. Prepare a **kubeconfig** file, which will be used by NNI to interact with your kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer this [guideline]( https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig. 3. Prepare a **kubeconfig** file, which will be used by NNI to interact with your kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer this [guideline]( https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
4. If your NNI trial job needs GPU resource, you should follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure **Nvidia device plugin for Kubernetes**. 4. If your NNI trial job needs GPU resource, you should follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure **Nvidia device plugin for Kubernetes**.
5. Prepare a **NFS server** and export a general purpose mount (we recommend to map your NFS server path in `root_squash option`, otherwise permission issue may raise when nni copy files to NFS. Refer this [page](https://linux.die.net/man/5/exports) to learn what root_squash option is), or **Azure File Storage**. 5. Prepare a **NFS server** and export a general purpose mount (we recommend to map your NFS server path in `root_squash option`, otherwise permission issue may raise when NNI copy files to NFS. Refer this [page](https://linux.die.net/man/5/exports) to learn what root_squash option is), or **Azure File Storage**.
6. Install **NFS client** on the machine where you install NNI and run nnictl to create experiment. Run this command to install NFSv4 client: 6. Install **NFS client** on the machine where you install NNI and run nnictl to create experiment. Run this command to install NFSv4 client:
``` ```
apt-get install nfs-common apt-get install nfs-common
...@@ -19,14 +19,14 @@ Now NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/ku ...@@ -19,14 +19,14 @@ Now NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/ku
1. NNI support kubeflow based on Azure Kubernetes Service, follow the [guideline](https://azure.microsoft.com/en-us/services/kubernetes-service/) to set up Azure Kubernetes Service. 1. NNI support kubeflow based on Azure Kubernetes Service, follow the [guideline](https://azure.microsoft.com/en-us/services/kubernetes-service/) to set up Azure Kubernetes Service.
2. Install [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) and __kubectl__. Use `az login` to set azure account, and connect kubectl client to AKS, refer this [guideline](https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster). 2. Install [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) and __kubectl__. Use `az login` to set azure account, and connect kubectl client to AKS, refer this [guideline](https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster).
3. Deploy kubeflow on Azure Kubernetes Service, follow the [guideline](https://www.kubeflow.org/docs/started/getting-started/). 3. Deploy kubeflow on Azure Kubernetes Service, follow the [guideline](https://www.kubeflow.org/docs/started/getting-started/).
4. Follow the [guideline](https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal) to create azure file storage account. If you use Azure Kubernetes Service, nni need Azure Storage Service to store code files and the output files. 4. Follow the [guideline](https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal) to create azure file storage account. If you use Azure Kubernetes Service, NNI need Azure Storage Service to store code files and the output files.
5. To access Azure storage service, nni need the access key of the storage account, and nni use [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) Service to protect your private key. Set up Azure Key Vault Service, add a secret to Key Vault to store the access key of Azure storage account. Follow this [guideline](https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli) to store the access key. 5. To access Azure storage service, NNI need the access key of the storage account, and NNI use [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) Service to protect your private key. Set up Azure Key Vault Service, add a secret to Key Vault to store the access key of Azure storage account. Follow this [guideline](https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli) to store the access key.
## Design ## Design
![](./img/kubeflow_training_design.png) ![](./img/kubeflow_training_design.png)
Kubeflow training service instantiates a kubernetes rest client to interact with your K8s cluster's API server. Kubeflow training service instantiates a kubernetes rest client to interact with your K8s cluster's API server.
For each trial, we will upload all the files in your local codeDir path (configured in nni_config.yaml) together with NNI generated files like parameter.cfg into a storage volumn. Right now we support two kinds of storage volumns: [nfs](https://en.wikipedia.org/wiki/Network_File_System) and [azure file storage](https://azure.microsoft.com/en-us/services/storage/files/), you should configure the storage volumn in nni config yaml file. After files are prepared, Kubeflow training service will call K8S rest API to create kubeflow jobs ([tf-operator](https://github.com/kubeflow/tf-operator) job or [pytorch-operator](https://github.com/kubeflow/pytorch-operator) job) in K8S, and mount your storage volumn into the job's pod. Output files of kubeflow job, like stdout, stderr, trial.log or model files, will also be copied back to the storage volumn. NNI will show the storage volumn's URL for each trial in WebUI, to allow user browse the log files and job's output files. For each trial, we will upload all the files in your local codeDir path (configured in nni_config.yml) together with NNI generated files like parameter.cfg into a storage volumn. Right now we support two kinds of storage volumns: [nfs](https://en.wikipedia.org/wiki/Network_File_System) and [azure file storage](https://azure.microsoft.com/en-us/services/storage/files/), you should configure the storage volumn in NNI config YAML file. After files are prepared, Kubeflow training service will call K8S rest API to create kubeflow jobs ([tf-operator](https://github.com/kubeflow/tf-operator) job or [pytorch-operator](https://github.com/kubeflow/pytorch-operator) job) in K8S, and mount your storage volumn into the job's pod. Output files of kubeflow job, like stdout, stderr, trial.log or model files, will also be copied back to the storage volumn. NNI will show the storage volumn's URL for each trial in WebUI, to allow user browse the log files and job's output files.
## Supported operator
NNI only supports the tf-operator and pytorch-operator of Kubeflow; other operators are not tested.
If you use NFS, you should set `kubeflowConfig` in your config YAML file as follows:
```
kubeflowConfig:
  storage: nfs
  nfs:
    # Your NFS server export path, like /var/nfs/nni
    path: {your_nfs_server_export_path}
```
If you use Azure storage, you should set `kubeflowConfig` in your config YAML file as follows:
```
kubeflowConfig:
  storage: azureStorage
  ...
```
## Run an experiment
Use `examples/trials/mnist` as an example. This is a TensorFlow job that uses the tf-operator of Kubeflow. The NNI config YAML file's content is like:
```
authorName: default
experimentName: example_mnist
...
kubeflowConfig:
  storage: nfs
  nfs:
    path: {your_nfs_server_export_path}
```
Note: You should explicitly set `trainingServicePlatform: kubeflow` in the NNI config YAML file if you want to start the experiment in kubeflow mode.

If you want to run PyTorch jobs, you could set up your config file in a similar way, as sketched below.
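Below is a minimal sketch of what the `trial` section could look like for a pytorch-operator job; the key layout mirrors the tf-operator example (with a `master` role instead of `ps`), and the command, image name, and resource values are illustrative placeholders rather than the exact example shipped with NNI:

```yml
trial:
  codeDir: .
  master:
    # PyTorch parameter server role (see the `master` key below)
    replicas: 1
    command: python3 dist_mnist.py
    gpuNum: 1
    cpuNum: 1
    memoryMB: 8192
    image: msranni/nni:latest
  worker:
    replicas: 1
    command: python3 dist_mnist.py
    gpuNum: 0
    cpuNum: 1
    memoryMB: 8192
    image: msranni/nni:latest
```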
Trial configuration in kubeflow mode has the following configuration keys:
* ps (optional). This config section is used to configure the TensorFlow parameter server role.
* master (optional). This config section is used to configure the PyTorch parameter server role.

Once you complete filling in the NNI experiment config file and save it (for example, as exp_kubeflow.yml), run the following command
```bash
nnictl create --config exp_kubeflow.yml
```
to start the experiment in kubeflow mode. NNI will create a Kubeflow tfjob or pytorchjob for each trial, and the job name format is something like `nni_exp_{experiment_id}_trial_{trial_id}`.
You can see the Kubeflow tfjob created by NNI in your Kubernetes dashboard.
# nnictl
## Introduction
__nnictl__ is a command line tool, which can be used to control experiments, such as start/stop/resume an experiment, start/stop NNIBoard, etc.
## Commands
nnictl supports the following commands:
* [nnictl create](#create)
* [nnictl resume](#resume)
* [nnictl stop](#stop)
* [nnictl update](#update)
* [nnictl trial](#trial)
* [nnictl top](#top)
* [nnictl experiment](#experiment)
* [nnictl config](#config)
* [nnictl log](#log)
* [nnictl webui](#webui)
* [nnictl tensorboard](#tensorboard)
* [nnictl package](#package)
* [nnictl --version](#version)
### Manage an experiment
<a name="create"></a>
* __nnictl create__
  * Description

    You can use this command to create a new experiment, using the configuration specified in the config file.
    After this command completes successfully, the context will be set to this experiment, which means the following commands you issue are associated with this experiment, unless you explicitly change the context (not supported yet).
  * Usage

    ```bash
    nnictl create [OPTIONS]
    ```
  * Options

    |Name, shorthand|Required|Default|Description|
    |------|------|------|------|
    |--config, -c|True||YAML configure file of the experiment|
    |--port, -p|False||the port of restful server|
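  * Example

    An illustrative invocation; the config path and port are placeholders:

    ```bash
    nnictl create --config ~/nni/examples/trials/mnist/config.yml --port 8080
    ```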
<a name="resume"></a>
* __nnictl resume__
  * Description

    You can use this command to resume a stopped experiment.
  * Usage

    ```bash
    nnictl resume [OPTIONS]
    ```
  * Options

    |Name, shorthand|Required|Default|Description|
    |------|------|------|------|
    |id|False||The id of the experiment you want to resume|
    |--port, -p|False||Rest port of the experiment you want to resume|
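  * Example

    An illustrative invocation; the experiment id is a placeholder:

    ```bash
    nnictl resume GvX3AtSp --port 8081
    ```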
<a name="stop"></a>
* __nnictl stop__
  * Description

    You can use this command to stop a running experiment or multiple experiments.
  * Detail

    1. If an id is specified, and the id matches a running experiment, nnictl will stop the corresponding experiment; otherwise it will print an error message.
    2. If no id is specified, and there is an experiment running, nnictl will stop that experiment; otherwise it will print an error message.
    3. If the id ends with *, nnictl will stop all experiments whose ids match the pattern.
    4. If the id does not exist but matches the prefix of an experiment id, nnictl will stop the matched experiment.
    5. If the id does not exist but matches multiple prefixes of experiment ids, nnictl will print the matching id information.
    6. Users can use 'nnictl stop all' to stop all experiments.
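  * Example

    Illustrative invocations; the experiment id is a placeholder:

    ```bash
    nnictl stop              # stop the experiment in the current context
    nnictl stop GvX3AtSp     # stop the experiment matching this id (or id prefix)
    nnictl stop all          # stop all running experiments
    ```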
<a name="update"></a>
* __nnictl update__
* __nnictl update searchspace__
  * Description

    You can use this command to update an experiment's search space.
  * Usage

    ```bash
    nnictl update searchspace [OPTIONS]
    ```
  * Options

    |Name, shorthand|Required|Default|Description|
    |------|------|------|------|
    |id|False||ID of the experiment you want to set|
    |--filename, -f|True||the file storing your new search space|
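  * Example

    An illustrative invocation; the search space file name is a placeholder:

    ```bash
    nnictl update searchspace --filename new_search_space.json
    ```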
* __nnictl update concurrency__
  * Description

    You can use this command to update an experiment's concurrency.
  * Usage

    ```bash
    nnictl update concurrency [OPTIONS]
    ```
  * Options

    |Name, shorthand|Required|Default|Description|
    |------|------|------|------|
    |id|False||ID of the experiment you want to set|
    |--value, -v|True||the number of allowed concurrent trials|
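  * Example

    An illustrative invocation; the value is a placeholder:

    ```bash
    nnictl update concurrency --value 4
    ```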
* __nnictl update duration__
  * Description

    You can use this command to update an experiment's duration.
  * Usage

    ```bash
    nnictl update duration [OPTIONS]
    ```
  * Options

    |Name, shorthand|Required|Default|Description|
    |------|------|------|------|
    |id|False||ID of the experiment you want to set|
    |--value, -v|True||the new experiment duration: NUMBER seconds, where SUFFIX may be 's' for seconds (the default), 'm' for minutes, 'h' for hours or 'd' for days|
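  * Example

    An illustrative invocation; the value is a placeholder:

    ```bash
    nnictl update duration --value 2h
    ```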
* __nnictl update trialnum__
  * Description

    You can use this command to update an experiment's maxtrialnum.
  * Usage

    ```bash
    nnictl update trialnum [OPTIONS]
    ```
  * Options

    |Name, shorthand|Required|Default|Description|
    |------|------|------|------|
    |id|False||ID of the experiment you want to set|
    |--value, -v|True||the new number of maxtrialnum you want to set|
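  * Example

    An illustrative invocation; the value is a placeholder:

    ```bash
    nnictl update trialnum --value 30
    ```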
<a name="trial"></a>
* __nnictl trial__
* __nnictl trial ls__
  * Description

    You can use this command to show the trials' information.
  * Usage

    ```bash
    nnictl trial ls
    ```
  * Options

    |Name, shorthand|Required|Default|Description|
    |------|------|------|------|
    |id|False||ID of the experiment you want to set|
* __nnictl trial kill__
  * Description

    You can use this command to kill a trial job.
  * Usage

    ```bash
    nnictl trial kill [OPTIONS]
    ```
  * Options

    |Name, shorthand|Required|Default|Description|
    |------|------|------|------|
    |id|False||ID of the experiment you want to set|
    |--trialid, -t|True||ID of the trial you want to kill.|
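  * Example

    An illustrative invocation; the trial id is a placeholder:

    ```bash
    nnictl trial kill --trialid ab2dE
    ```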
<a name="top"></a>
* __nnictl top__
  * Description

    Monitor the status of all running experiments.
  * Usage

    ```bash
    nnictl top
    ```
  * Options

    |Name, shorthand|Required|Default|Description|
    |------|------|------|------|
    |id|False||ID of the experiment you want to set|
    |--time, -t|False||The interval to update the experiment status; the unit is seconds, and the default value is 3 seconds.|
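  * Example

    An illustrative invocation that refreshes the status every 10 seconds:

    ```bash
    nnictl top --time 10
    ```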
<a name="experiment"></a>
### Manage experiment information
* __nnictl experiment show__
  * Description

    Show the information of experiments.
  * Usage

    ```bash
    nnictl experiment show
    ```
  * Options

    |Name, shorthand|Required|Default|Description|
    |------|------|------|------|
    |id|False||ID of the experiment you want to set|
* __nnictl experiment status__
  * Description

    Show the status of the experiment.
  * Usage

    ```bash
    nnictl experiment status
    ```
  * Options

    |Name, shorthand|Required|Default|Description|
    |------|------|------|------|
    |id|False||ID of the experiment you want to set|
* __nnictl experiment list__
  * Description

    Show the information of all the (running) experiments.
  * Usage

    ```bash
    nnictl experiment list
    ```
<a name="config"></a>
* __nnictl config show__
  * Description

    Display the current context information.
  * Usage

    ```bash
    nnictl config show
    ```
<a name="log"></a>
### Manage log
* __nnictl log stdout__
  * Description

    Show the stdout log content.
  * Usage

    ```bash
    nnictl log stdout [options]
    ```
  * Options

    |Name, shorthand|Required|Default|Description|
    |------|------|------|------|
    |id|False||ID of the experiment you want to set|
    |--head, -h|False||show head lines of stdout|
    |--tail, -t|False||show tail lines of stdout|
    |--path, -p|False||show the path of stdout file|
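  * Example

    An illustrative invocation that shows the last 20 lines of stdout:

    ```bash
    nnictl log stdout --tail 20
    ```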
* __nnictl log stderr__
  * Description

    Show the stderr log content.
  * Usage

    ```bash
    nnictl log stderr [options]
    ```
  * Options

    |Name, shorthand|Required|Default|Description|
    |------|------|------|------|
    |id|False||ID of the experiment you want to set|
    |--head, -h|False||show head lines of stderr|
    |--tail, -t|False||show tail lines of stderr|
    |--path, -p|False||show the path of stderr file|
* __nnictl log trial__
  * Description

    Show the trial log path.
  * Usage

    ```bash
    nnictl log trial [options]
    ```
  * Options

    |Name, shorthand|Required|Default|Description|
    |------|------|------|------|
    |id|False||the id of trial|
<a name="webui"></a>
### Manage webui

<a name="tensorboard"></a>
### Manage tensorboard
* __nnictl tensorboard start__
  * Description

    Start the tensorboard process.
  * Usage

    ```bash
    nnictl tensorboard start
    ```
  * Options

    |Name, shorthand|Required|Default|Description|
    |------|------|------|------|
    |id|False||ID of the experiment you want to set|
    |--trialid|False||ID of the trial|
    |--port|False|6006|The port of the tensorboard process|
* __nnictl tensorboard stop__
  * Description

    Stop all of the tensorboard processes.
  * Usage

    ```bash
    nnictl tensorboard stop
    ```
  * Options

    |Name, shorthand|Required|Default|Description|
    |------|------|------|------|
    |id|False||ID of the experiment you want to set|
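  * Example

    An illustrative start/stop sequence; the trial id and port are placeholders:

    ```bash
    nnictl tensorboard start --trialid ab2dE --port 6007
    nnictl tensorboard stop
    ```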
<a name="package"></a>
### Manage package
* __nnictl package install__
  * Description

    Install the packages needed in NNI experiments.
  * Usage

    ```bash
    nnictl package install [OPTIONS]
    ```
  * Options

    |Name, shorthand|Required|Default|Description|
    |------|------|------|------|
    |--name|True||The name of package to be installed|
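  * Example

    An illustrative invocation, assuming SMAC is among the supported packages (run `nnictl package show` to check):

    ```bash
    nnictl package install --name SMAC
    ```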
* __nnictl package show__
  * Description

    List the packages supported.
  * Usage

    ```bash
    nnictl package show
    ```
<a name="version"></a>
### Check NNI version
* __nnictl --version__
  * Description

    Describe the current version of NNI installed.
  * Usage

    ```bash
    nnictl --version
    ```
NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai mode).
Install NNI, following the install guide [here](GetStarted.md).
## Run an experiment
Use `examples/trials/mnist-annotation` as an example. The NNI config YAML file's content is like:
```yml
authorName: your_name
experimentName: auto_mnist
# how many trials could be concurrently running
trialConcurrency: 1
...
paiConfig:
  ...
  host: 10.1.1.1
```
Note: You should set `trainingServicePlatform: pai` in the NNI config YAML file if you want to start the experiment in pai mode.

Compared with LocalMode and [RemoteMachineMode](RemoteMachineMode.md), trial configuration in pai mode has five additional keys:
* cpuNum
  * Required key. Should be a positive number based on your trial program's CPU requirement
* memoryMB
  * Required key. Should be a positive number based on your trial program's memory requirement
* image
  * Required key. In pai mode, your trial program will be scheduled by OpenPAI to run in a [Docker container](https://www.docker.com/). This key is used to specify the Docker image used to create the container in which your trial will run.
  * We have already built a Docker image [msranni/nni](https://hub.docker.com/r/msranni/nni/) on [Docker Hub](https://hub.docker.com/). It contains NNI python packages, Node modules and javascript artifact files required to start an experiment, and all of NNI's dependencies. The Docker file used to build this image can be found [here](../deployment/Dockerfile.build.base). You can either use this image directly in your config file, or build your own image based on it.
* dataDir
  * Optional key. It specifies the HDFS data directory for the trial to download data. The format should be something like hdfs://{your HDFS host}:9000/{your data directory}
* outputDir
  * Optional key. It specifies the HDFS output directory for the trial. Once the trial is completed (whether it succeeds or fails), the trial's stdout and stderr will be copied to this directory by the NNI SDK automatically. The format should be something like hdfs://{your HDFS host}:9000/{your output directory}
Once you complete filling in the NNI experiment config file and save it (for example, as exp_pai.yml), run the following command
```bash
nnictl create --config exp_pai.yml
```
to start the experiment in pai mode. NNI will create an OpenPAI job for each trial, and the job name format is something like `nni_exp_{experiment_id}_trial_{trial_id}`.
You can see the pai jobs created by NNI in your OpenPAI cluster's web portal.
You can see there are three files in the output folder: stderr, stdout, and trial.log.
If you also want to save the trial's other output into HDFS, like model files, you can use the environment variable `NNI_OUTPUT_DIR` in your trial code to save your own output files, and the NNI SDK will copy all the files in `NNI_OUTPUT_DIR` from the trial's container to HDFS.
If you have any problems when using NNI in pai mode, please create issues on the [NNI GitHub repo](https://github.com/Microsoft/nni).
If you want to use NNI to automatically train your model and find the optimal hyper-parameters:

*Implemented code directory: [mnist.py](../examples/trials/mnist/mnist.py)*

**Step 3**: Define a `config` file in YAML, which declares the `path` to the search space and trial files, and also gives `other information` such as the tuning algorithm, max trial number, and max runtime arguments.
```yml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
...
```
# ChangeLog
## Release 0.5.0 - 01/14/2019
### Major Features
#### New tuner and assessor supports
* Support [Metis tuner](./HowToChooseTuner.md#MetisTuner) as a new NNI tuner. The Metis algorithm has been proven to perform well for **online** hyper-parameter tuning.
* Support [ENAS customized tuner](https://github.com/countif/enas_nni), contributed by a GitHub community user. It is an algorithm for neural architecture search that learns network architectures via reinforcement learning and performs better than NAS.
* Support [Curve fitting assessor](./HowToChooseTuner.md#Curvefitting) for early-stop policy using learning curve extrapolation.
* Advanced support of [Weight Sharing](./AdvancedNAS.md): enable weight sharing for NAS tuners, currently through NFS.
#### Training Service Enhancement
* [FrameworkController Training service](./FrameworkControllerMode.md): Support running experiments using FrameworkController on Kubernetes
  * FrameworkController is a controller on Kubernetes that is general enough to run (distributed) jobs with various machine learning frameworks, such as TensorFlow, PyTorch, MXNet.
  * NNI provides a unified and simple specification for job definition.
  * An MNIST example shows how to use FrameworkController.
#### User Experience improvements
* A better trial logging support for NNI experiments in PAI, Kubeflow and FrameworkController mode:
  * An improved logging architecture to send stdout/stderr of trials to the NNI manager via HTTP POST. The NNI manager will store the trial's stdout/stderr messages in a local log file.
  * Show the link for the trial log file on WebUI.
* Support showing all key-value pairs of the final result.
## Release 0.4.1 - 12/14/2018
### Major Features
#### New tuner supports
* Support [network morphism](./HowToChooseTuner.md#NetworkMorphism) as a new tuner
#### Training Service improvements
* Migrate [Kubeflow training service](https://github.com/Microsoft/nni/blob/master/docs/KubeflowMode.md)'s dependency from the kubectl CLI to the [Kubernetes API](https://kubernetes.io/docs/concepts/overview/kubernetes-api/) client
* [Pytorch-operator](https://github.com/kubeflow/pytorch-operator) support for Kubeflow training service
* Improvement on local code files uploading to OpenPAI HDFS
* Fixed an OpenPAI integration WebUI bug: WebUI doesn't show the latest trial job status, which is caused by OpenPAI token expiration
#### NNICTL improvements
* Show version information both in nnictl and WebUI. You can run **nnictl -v** to show your currently installed NNI version
#### WebUI improvements
* Enable modifying the concurrency number during an experiment
* Add a feedback link to the NNI GitHub 'create issue' page
* Enable customizing the top 10 trials regarding metric numbers (largest or smallest)
* Enable downloading logs for dispatcher & nnimanager
* Enable automatic scaling of axes for metric numbers
* Update annotation to support displaying the real choice in the search space
### New examples
* [FashionMnist](https://github.com/Microsoft/nni/tree/master/examples/trials/network_morphism), working together with the network morphism tuner
* [Distributed MNIST example](https://github.com/Microsoft/nni/tree/master/examples/trials/mnist-distributed-pytorch) written in PyTorch
## Release 0.4 - 12/6/2018
### Major Features
* [Kubeflow Training service](./KubeflowMode.md)
  * Support tf-operator
  * [Distributed trial example](../examples/trials/mnist-distributed/dist_mnist.py) on Kubeflow
* [Grid search tuner](../src/sdk/pynni/nni/README.md#Grid)
* [Hyperband tuner](../src/sdk/pynni/nni/README.md#Hyperband)
* Support launching NNI experiments on macOS
* WebUI
  * UI support for Hyperband tuner
  * Remove tensorboard button
  * Show experiment error message
* Support searching for a specific trial by trial number
* Show trial's hdfsLogPath
* Download experiment parameters
### Others
* Asynchronous dispatcher
* Docker file update, add PyTorch library
* Refactor the 'nnictl stop' process: send SIGTERM to the NNI manager process, rather than calling the stop REST API.
* OpenPAI training service bug fixes
  * Support NNI Manager IP configuration (nniManagerIp) in the PAI cluster config file, to fix the issue that the user's machine has no eth0 device
  * The file number in codeDir is now capped at 1000, to avoid users mistakenly filling in the root dir for codeDir
  * Don't print the useless 'metrics is empty' log in the PAI job's stdout; only print a useful message once new metrics are recorded, to reduce confusion when users check the PAI trial's output for debugging purposes
  * Add a timestamp at the beginning of each log entry in trial keeper.
## Release 0.3.0 - 11/2/2018
### NNICTL new features and updates
* Support running multiple experiments simultaneously.

  Before v0.3, NNI only supported running a single experiment at a time. After this release, users are able to run multiple experiments simultaneously. Each experiment requires a unique port; the first experiment will be set to the default port as in previous versions. You can specify a unique port for the other experiments as below:
  ```bash
  nnictl create --port 8081 --config <config file path>
  ```
* Support updating the max trial number.

  Use `nnictl update --help` to learn more, or refer to [NNICTL Spec](https://github.com/Microsoft/nni/blob/master/docs/NNICTLDOC.md) for the full usage of NNICTL.
### API new features and updates
* <span style="color:red">**breaking change**</span>: nni.get_parameters() is refactored to nni.get_next_parameter(). All examples of prior releases cannot run on v0.3; please clone the NNI repo to get the new examples. If you have applied NNI to your own code, please update the API accordingly.
* New API **nni.get_sequence_id()**.

  Each trial job is allocated a unique sequence number, which can be retrieved by the nni.get_sequence_id() API.
  ```bash
  git clone -b v0.3 https://github.com/Microsoft/nni.git
  ```
* The **nni.report_final_result(result)** API supports more data types for the result parameter.

  It can be of the following types:
  * int
  * float
  * A Python dict containing a 'default' key, whose value should be of type int or float. The dict can contain any other key-value pairs.
### New tuner support
* **Batch Tuner**, which iterates over all parameter combinations, can be used to submit batch trial jobs.
### New examples
* An NNI Docker image for public usage:
  ```bash
  docker pull msranni/nni:latest
  ```
* New trial example: [NNI Sklearn Example](https://github.com/Microsoft/nni/tree/master/examples/trials/sklearn)
* New competition example: [Kaggle Competition TGS Salt Example](https://github.com/Microsoft/nni/tree/master/examples/trials/kaggle-tgs-salt)
### Others
* UI refactoring; refer to the [WebUI doc](WebUI.md) for how to work with the new UI.
* Continuous Integration: NNI has switched to Azure Pipelines
* [Known Issues in release 0.3.0](https://github.com/Microsoft/nni/labels/nni030knownissues).
## Release 0.2.0 - 9/29/2018
### Major Features
* Support [OpenPAI](https://github.com/Microsoft/pai) (aka pai) Training Service (See [here](./PAIMode.md) for instructions about how to submit an NNI job in pai mode)
  * Support training services on pai mode. NNI trials will be scheduled to run on the OpenPAI cluster
  * NNI trial's output (including logs and model file) will be copied to OpenPAI HDFS for further debugging and checking
* Support [SMAC](https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf) tuner (See [here](HowToChooseTuner.md) for instructions about how to use the SMAC tuner)
  * [SMAC](https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf) is based on Sequential Model-Based Optimization (SMBO). It adapts the most prominent previously used model class (Gaussian stochastic process models) and introduces the model class of random forests to SMBO to handle categorical parameters. The SMAC supported by NNI is a wrapper on [SMAC3](https://github.com/automl/SMAC3)
* Support NNI installation on [conda](https://conda.io/docs/index.html) and python virtual environment
* Others
  * Update ga squad example and related documentation
  * WebUI UX small enhancement and bug fix
### Known Issues
[Known Issues in release 0.2.0](https://github.com/Microsoft/nni/labels/nni020knownissues).
## Release 0.1.0 - 9/10/2018 (initial release)
Initial release of Neural Network Intelligence (NNI).
### Major Features
* Installation and Deployment
  * Support pip install and source code install
  * Support training services on local mode (including multi-GPU mode) as well as multi-machine mode
* Tuners, Assessors and Trial
  * Support AutoML algorithms including: hyperopt_tpe, hyperopt_annealing, hyperopt_random, and evolution_tuner
  * Support assessor (early stop) algorithms including: medianstop algorithm
  * Provide Python API for user-defined tuners and assessors
  * Provide Python API for users to wrap trial code as NNI deployable code
* Experiments
  * Provide a command line toolkit 'nnictl' for experiment management
  * Provide a WebUI for viewing experiment details and managing experiments
* Continuous Integration
  * Support CI by providing out-of-box integration with [travis-ci](https://github.com/travis-ci) on Ubuntu
* Others
  * Support simple GPU job scheduling
### Known Issues
[Known Issues in release 0.1.0](https://github.com/Microsoft/nni/labels/nni010knownissues).
Install NNI on another machine which has network accessibility to those three machines.
We use `examples/trials/mnist-annotation` as an example here. Run `cat ~/nni/examples/trials/mnist-annotation/config_remote.yml` to see the detailed configuration file:
```yml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
...
```