Commit 9fb25ccc authored by SparkSnail, committed by GitHub

Merge pull request #189 from microsoft/master

merge master
parents 1500458a 7c4bc33b
......@@ -17,10 +17,11 @@
NNI (Neural Network Intelligence) is a toolkit to help users run automated machine learning (AutoML) experiments.
The tool dispatches and runs trial jobs generated by tuning algorithms to search the best neural architecture and/or hyper-parameters in different environments like local machine, remote servers and cloud.
### **NNI [v0.8](https://github.com/Microsoft/nni/releases) has been released!**
### **NNI [v0.9](https://github.com/Microsoft/nni/releases) has been released! &nbsp;<a href="#nni-released-reminder"><img width="48" src="docs/img/release_icon.png"></a>**
<p align="center">
<a href="#nni-v05-has-been-released"><img src="docs/img/overview.svg" /></a>
<a href="#nni-has-been-released"><img src="docs/img/overview.svg" /></a>
</p>
<div>
<table>
<tbody>
<tr align="center" valign="bottom">
......@@ -37,7 +38,7 @@ The tool dispatches and runs trial jobs generated by tuning algorithms to search
<img src="docs/img/bar.png"/>
</td>
</tr>
<tr/>
</tr>
<tr valign="top">
<td>
<ul>
......@@ -51,40 +52,91 @@ The tool dispatches and runs trial jobs generated by tuning algorithms to search
<li>Theano</li>
</ul>
</td>
<td>
<a href="docs/en_US/BuiltinTuner.md">Tuner</a>
<td align="left">
<a href="docs/en_US/Tuner/BuiltinTuner.md">Tuner</a>
<br />
<ul>
<li><a href="docs/en_US/BuiltinTuner.md#TPE">TPE</a></li>
<li><a href="docs/en_US/BuiltinTuner.md#Random">Random Search</a></li>
<li><a href="docs/en_US/BuiltinTuner.md#Anneal">Anneal</a></li>
<li><a href="docs/en_US/BuiltinTuner.md#Evolution">Naïve Evolution</a></li>
<li><a href="docs/en_US/BuiltinTuner.md#SMAC">SMAC</a></li>
<li><a href="docs/en_US/BuiltinTuner.md#Batch">Batch</a></li>
<li><a href="docs/en_US/BuiltinTuner.md#GridSearch">Grid Search</a></li>
<li><a href="docs/en_US/BuiltinTuner.md#Hyperband">Hyperband</a></li>
<li><a href="docs/en_US/BuiltinTuner.md#NetworkMorphism">Network Morphism</a></li>
<li><a href="examples/tuners/enas_nni/README.md">ENAS</a></li>
<li><a href="docs/en_US/BuiltinTuner.md#MetisTuner">Metis Tuner</a></li>
<li><a href="docs/en_US/BuiltinTuner.md#BOHB">BOHB</a></li>
<b style="margin-left:-20px"><font size=4 color=#800000>General Tuner</font></b>
<li><a href="docs/en_US/Tuner/BuiltinTuner.md#Random"><font size=2.9>Random Search</font></a></li>
<li><a href="docs/en_US/Tuner/BuiltinTuner.md#Evolution"><font size=2.9>Naïve Evolution</font></a></li>
<b><font size=4 color=#800000 style="margin-left:-20px">Tuner for HPO</font></b>
<li><a href="docs/en_US/Tuner/BuiltinTuner.md#TPE"><font size=2.9>TPE</font></a></li>
<li><a href="docs/en_US/Tuner/BuiltinTuner.md#Anneal"><font size=2.9>Anneal</font></a></li>
<li><a href="docs/en_US/Tuner/BuiltinTuner.md#SMAC"><font size=2.9>SMAC</font></a></li>
<li><a href="docs/en_US/Tuner/BuiltinTuner.md#Batch"><font size=2.9>Batch</font></a></li>
<li><a href="docs/en_US/Tuner/BuiltinTuner.md#GridSearch"><font size=2.9>Grid Search</font></a></li>
<li><a href="docs/en_US/Tuner/BuiltinTuner.md#Hyperband"><font size=2.9>Hyperband</font></a></li>
<li><a href="docs/en_US/Tuner/BuiltinTuner.md#MetisTuner"><font size=2.9>Metis Tuner</font></a></li>
<li><a href="docs/en_US/Tuner/BuiltinTuner.md#BOHB"><font size=2.9>BOHB</font></a></li>
<li><a href="docs/en_US/Tuner/BuiltinTuner.md#GPTuner"><font size=2.9>GP Tuner</font></a></li>
<b style="margin-left:-20px"><font size=4 color=#800000 style="margin-left:-20px">Tuner for NAS</font></b>
<li><a href="docs/en_US/Tuner/BuiltinTuner.md#NetworkMorphism"><font size=2.9>Network Morphism</font></a></li>
<li><a href="examples/tuners/enas_nni/README.md"><font size=2.9>ENAS</font></a></li>
</ul>
<a href="docs/en_US/BuiltinAssessor.md">Assessor</a>
<a href="docs/en_US/Assessor/BuiltinAssessor.md">Assessor</a>
<ul>
<li><a href="docs/en_US/BuiltinAssessor.md#Medianstop">Median Stop</a></li>
<li><a href="docs/en_US/BuiltinAssessor.md#Curvefitting">Curve Fitting</a></li>
<li><a href="docs/en_US/Assessor/BuiltinAssessor.md#Medianstop"><font size=2.9>Median Stop</font></a></li>
<li><a href="docs/en_US/Assessor/BuiltinAssessor.md#Curvefitting"><font size=2.9>Curve Fitting</font></a></li>
</ul>
</td>
<td>
<ul>
<li><a href="docs/en_US/LocalMode.md">Local Machine</a></li>
<li><a href="docs/en_US/RemoteMachineMode.md">Remote Servers</a></li>
<li><a href="docs/en_US/PaiMode.md">OpenPAI</a></li>
<li><a href="docs/en_US/KubeflowMode.md">Kubeflow</a></li>
<li><a href="docs/en_US/FrameworkControllerMode.md">FrameworkController on K8S (AKS etc.)</a></li>
<li><a href="docs/en_US/TrainingService/LocalMode.md">Local Machine</a></li>
<li><a href="docs/en_US/TrainingService/RemoteMachineMode.md">Remote Servers</a></li>
<li><b>Kubernetes based services</b></li>
<ul><li><a href="docs/en_US/TrainingService/PaiMode.md">OpenPAI</a></li>
<li><a href="docs/en_US/TrainingService/KubeflowMode.md">Kubeflow</a></li>
<li><a href="docs/en_US/TrainingService/FrameworkControllerMode.md">FrameworkController on K8S (AKS etc.)</a></li>
</ul>
</ul>
</td>
</tr>
<tr align="center" valign="bottom">
<td style="border-top:#FF0000 solid 0px;">
<b>References</b>
<img src="docs/img/bar.png"/>
</td>
<td style="border-top:#FF0000 solid 0px;">
<b>References</b>
<img src="docs/img/bar.png"/>
</td>
<td style="border-top:#FF0000 solid 0px;">
<b>References</b>
<img src="docs/img/bar.png"/>
</td>
</tr>
<tr valign="top">
<td style="border-top:#FF0000 solid 0px;">
<ul>
<li><a href="docs/en_US/sdk_reference.rst">Python API</a></li>
<li><a href="docs/en_US/Tutorial/AnnotationSpec.md">NNI Annotation</a></li>
<li><a href="docs/en_US/TrialExample/Trials.md#nni-python-annotation">Annotation tutorial</a></li>
</ul>
</td>
<td style="border-top:#FF0000 solid 0px;">
<ul>
<li><a href="docs/en_US/tuners.rst">Try different tuners</a></li>
<li><a href="docs/en_US/assessors.rst">Try different assessors</a></li>
<li><a href="docs/en_US/Tuner/CustomizeTuner.md">Implement a customized tuner</a></li>
<li><a href="docs/en_US/Tuner/CustomizeAdvisor.md">Implement a customized advisor</a></li>
<li><a href="docs/en_US/Assessor/CustomizeAssessor.md">Implement a customized assessor </a></li>
<li><a href="docs/en_US/CommunitySharings/HpoComparision.md">HPO Comparison</a></li>
<li><a href="docs/en_US/CommunitySharings/NasComparision.md">NAS Comparison</a></li>
<li><a href="docs/en_US/CommunitySharings/RecommendersSvd.md">Automatically tuning SVD on NNI</a></li>
</ul>
</td>
<td style="border-top:#FF0000 solid 0px;">
<ul>
<li><a href="docs/en_US/TrainingService/HowToImplementTrainingService.md">Implement TrainingService in NNI</a></li>
<li><a href="docs/en_US/TrainingService/LocalMode.md">Run an experiment on local</a></li>
<li><a href="docs/en_US/TrainingService/KubeflowMode.md">Run an experiment on Kubeflow</a></li>
<li><a href="docs/en_US/TrainingService/PaiMode.md">Run an experiment on OpenPAI?</a></li>
<li><a href="docs/en_US/TrainingService/RemoteMachineMode.md">Run an experiment on multiple machines?</a></li>
</ul>
</td>
</tbody>
</table>
</div>
## **Who should consider using NNI**
......@@ -126,7 +178,7 @@ Note:
* `--user` can be added if you want to install NNI in your home directory, which does not require any special privileges.
* Currently NNI on Windows supports local, remote and pai mode. Anaconda or Miniconda is highly recommended for installing NNI on Windows.
* If there is any error like `Segmentation fault`, please refer to [FAQ](docs/en_US/FAQ.md)
* If there is any error like `Segmentation fault`, please refer to [FAQ](docs/en_US/Tutorial/FAQ.md)
**Install through source code**
......@@ -137,7 +189,7 @@ Linux and MacOS
* Run the following commands in an environment that has `python >= 3.5`, `git` and `wget`.
```bash
git clone -b v0.8 https://github.com/Microsoft/nni.git
git clone -b v0.9 https://github.com/Microsoft/nni.git
cd nni
source install.sh
```
......@@ -147,14 +199,14 @@ Windows
* Run the following commands in an environment that has `python >=3.5`, `git` and `PowerShell`
```bash
git clone -b v0.8 https://github.com/Microsoft/nni.git
git clone -b v0.9 https://github.com/Microsoft/nni.git
cd nni
powershell -ExecutionPolicy Bypass -file install.ps1
```
For the system requirements of NNI, please refer to [Install NNI](docs/en_US/Installation.md)
For the system requirements of NNI, please refer to [Install NNI](docs/en_US/Tutorial/Installation.md)
For NNI on Windows, please refer to [NNI on Windows](docs/en_US/NniOnWindows.md)
For NNI on Windows, please refer to [NNI on Windows](docs/en_US/Tutorial/NniOnWindows.md)
**Verify install**
......@@ -163,7 +215,7 @@ The following example is an experiment built on TensorFlow. Make sure you have *
* Download the examples by cloning the source code.
```bash
git clone -b v0.8 https://github.com/Microsoft/nni.git
git clone -b v0.9 https://github.com/Microsoft/nni.git
```
Linux and MacOS
......@@ -210,7 +262,7 @@ You can use these commands to get more information about the experiment
-----------------------------------------------------------------------
```
* Open the `Web UI url` in your browser to view detailed information about the experiment and all the submitted trial jobs, as shown below. [Here](docs/en_US/WebUI.md) are more Web UI pages.
* Open the `Web UI url` in your browser to view detailed information about the experiment and all the submitted trial jobs, as shown below. [Here](docs/en_US/Tutorial/WebUI.md) are more Web UI pages.
<table style="border: none">
<th><img src="./docs/img/webui_overview_page.png" alt="drawing" width="395"/></th>
......@@ -218,44 +270,63 @@ You can use these commands to get more information about the experiment
</table>
## **Documentation**
Our primary documentation is [here](https://nni.readthedocs.io/en/latest/Overview.html) and is generated from this repository.<br/>
You may want to read:
* [NNI overview](docs/en_US/Overview.md)
* [Quick start](docs/en_US/QuickStart.md)
* [Quick start](docs/en_US/Tutorial/QuickStart.md)
* [Contributing](docs/en_US/Tutorial/Contributing.md)
* [Examples](docs/en_US/examples.rst)
* [References](docs/en_US/reference.rst)
* [WebUI tutorial](docs/en_US/Tutorial/WebUI.md)
## **How to**
* [Install NNI](docs/en_US/Installation.md)
* [Use command line tool nnictl](docs/en_US/Nnictl.md)
* [Use NNIBoard](docs/en_US/WebUI.md)
* [How to define search space](docs/en_US/SearchSpaceSpec.md)
* [How to define a trial](docs/en_US/Trials.md)
* [How to choose tuner/search-algorithm](docs/en_US/BuiltinTuner.md)
* [Config an experiment](docs/en_US/ExperimentConfig.md)
* [How to use annotation](docs/en_US/Trials.md#nni-python-annotation)
* [Install NNI](docs/en_US/Tutorial/Installation.md)
* [Use command line tool nnictl](docs/en_US/Tutorial/Nnictl.md)
* [Use NNIBoard](docs/en_US/Tutorial/WebUI.md)
* [How to define search space](docs/en_US/Tutorial/SearchSpaceSpec.md)
* [How to define a trial](docs/en_US/TrialExample/Trials.md)
* [How to choose tuner/search-algorithm](docs/en_US/Tuner/BuiltinTuner.md)
* [Config an experiment](docs/en_US/Tutorial/ExperimentConfig.md)
* [How to use annotation](docs/en_US/TrialExample/Trials.md#nni-python-annotation)
## **Tutorials**
* [Run an experiment on local (with multiple GPUs)?](docs/en_US/LocalMode.md)
* [Run an experiment on multiple machines?](docs/en_US/RemoteMachineMode.md)
* [Run an experiment on OpenPAI?](docs/en_US/PaiMode.md)
* [Run an experiment on Kubeflow?](docs/en_US/KubeflowMode.md)
* [Run an experiment on local (with multiple GPUs)?](docs/en_US/LocalMode.md)
* [Run an experiment on multiple machines?](docs/en_US/RemoteMachineMode.md)
* [Try different tuners](docs/en_US/tuners.rst)
* [Try different assessors](docs/en_US/assessors.rst)
* [Implement a customized tuner](docs/en_US/CustomizeTuner.md)
* [Implement a customized tuner](docs/en_US/Tuner/CustomizeTuner.md)
* [Implement a customized assessor](docs/en_US/CustomizeAssessor.md)
* [Use Genetic Algorithm to find good model architectures for Reading Comprehension task](examples/trials/ga_squad/README.md)
## **Contribute**
This project welcomes contributions and there are many ways in which you can participate in the project, for example:
* Review [source code changes](https://github.com/microsoft/nni/pulls)
* Review the [documentation](https://github.com/microsoft/nni/tree/master/docs) and make pull requests for anything from typos to new content
* Find the issues tagged with ['good first issue'](https://github.com/Microsoft/nni/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) or ['help-wanted'](https://github.com/microsoft/nni/issues?q=is%3Aopen+is%3Aissue+label%3A%22help+wanted%22); these are simple and easy to start with, and we recommend them to new contributors.
Before providing your hacks, there are a few simple guidelines that you need to follow:
* [How to debug](docs/en_US/Tutorial/HowToDebug.md)
* [Code Styles & Naming Conventions](docs/en_US/Tutorial/Contributing.md)
* How to Set up [NNI developer environment](docs/en_US/Tutorial/SetupNniDeveloperEnvironment.md)
* Review the [Contributing Instruction](docs/en_US/Tutorial/Contributing.md) and get familiar with the NNI Code Contribution Guideline
## **External Repositories**
We now have some external usage examples that run on NNI, provided by our contributors. Thanks to our lovely contributors, and we welcome more and more people to join us!
* Run [ENAS](examples/tuners/enas_nni/README.md) in NNI
* Run [Neural Network Architecture Search](examples/trials/nas_cifar10/README.md) in NNI
## **Feedback**
* Open [bug reports](https://github.com/microsoft/nni/issues/new/choose).<br/>
* Request a [new feature](https://github.com/microsoft/nni/issues/new/choose).
* Discuss on the NNI [Gitter](https://gitter.im/Microsoft/nni?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge).
* Ask a question with NNI tags on [Stack Overflow](https://stackoverflow.com/questions/tagged/nni?sort=Newest&edited=true) or [file an issue](https://github.com/microsoft/nni/issues/new/choose) on GitHub.
* We are still writing the instructions for [How to Debug](docs/en_US/Tutorial/HowToDebug.md); you are welcome to contribute questions or suggestions in this area.
This project welcomes contributions and suggestions, we use [GitHub issues](https://github.com/Microsoft/nni/issues) for tracking requests and bugs.
Issues with the **good first issue** label are simple and easy-to-start ones that we recommend new contributors to start with.
To set up the environment for NNI development, refer to the instructions: [Set up NNI developer environment](docs/en_US/SetupNniDeveloperEnvironment.md)
Before you start coding, review and get familiar with the NNI Code Contribution Guideline: [Contributing](docs/en_US/Contributing.md)
We are still writing the instructions for [How to Debug](docs/en_US/HowToDebug.md); you are welcome to contribute questions or suggestions in this area.
## **License**
......
......@@ -10,7 +10,7 @@
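NNI (Neural Network Intelligence) is a toolkit for automated machine learning (AutoML). It uses a variety of tuning algorithms to search for the best neural architecture and/or hyper-parameters, and supports different running environments such as a single machine, multiple local machines, and the cloud.
### **NNI [v0.8](https://github.com/Microsoft/nni/releases) has been released!**
### **NNI [v0.9](https://github.com/Microsoft/nni/releases) has been released!**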
<p align="center">
<a href="#nni-v05-has-been-released"><img src="docs/img/overview.svg" /></a>
......@@ -61,11 +61,12 @@ NNI (Neural Network Intelligence) is a toolkit for automated machine learning (AutoML)
<li><a href="examples/tuners/enas_nni/README_zh_CN.md">ENAS</a></li>
<li><a href="docs/zh_CN/BuiltinTuner.md#MetisTuner">Metis Tuner</a></li>
<li><a href="docs/zh_CN/BuiltinTuner.md#BOHB">BOHB</a></li>
<li><a href="docs/zh_CN/BuiltinTuner.md#GPTuner">GP Tuner</a></li>
</ul>
<a href="docs/zh_CN/BuiltinAssessors.md">Assessor</a>
<a href="docs/zh_CN/BuiltinAssessor.md">Assessor</a>
<ul>
<li><a href="docs/zh_CN/BuiltinAssessors.md#Medianstop">Median Stop</a></li>
<li><a href="docs/zh_CN/BuiltinAssessors.md#Curvefitting">Curve Fitting</a></li>
<li><a href="docs/zh_CN/BuiltinAssessor.md#Medianstop">Median Stop</a></li>
<li><a href="docs/zh_CN/BuiltinAssessor.md#Curvefitting">Curve Fitting</a></li>
</ul>
</td>
<td>
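......@@ -101,12 +102,6 @@ NNI (Neural Network Intelligence) is a toolkit for automated machine learning (AutoML)
## **Installation and Verification**
In Windows local mode, if this is the first time you use PowerShell to run scripts, you need to run the following command once **with administrator privileges**: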
```bash
Set-ExecutionPolicy -ExecutionPolicy Unrestricted
```
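**Install through pip**
* Linux, MacOS and Windows (local, remote and OpenPAI modes) are currently supported, and have been tested on Ubuntu 16.04 or higher, MacOS 10.14.1 and Windows 10.1809. In an environment with `python >= 3.5`, simply run `pip install` to complete the installation.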
......@@ -131,14 +126,14 @@ python -m pip install --upgrade nni
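**Install through source code**
* Linux (Ubuntu 16.04 or higher), MacOS (10.14.1) and Windows 10 (version 1809) are currently supported.
* Linux (Ubuntu 16.04 or higher), MacOS (10.14.1) and Windows 10 (version 1809) are currently supported.
Linux and macOS
* Run the following commands in an environment that has `python >= 3.5`, and make sure `git` and `wget` are installed.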
```bash
git clone -b v0.7 https://github.com/Microsoft/nni.git
git clone -b v0.8 https://github.com/Microsoft/nni.git
cd nni
source install.sh
```
......@@ -148,9 +143,9 @@ Windows
* Run the following commands in an environment that has `python >=3.5`, and make sure `git` and `PowerShell` are installed.
```bash
git clone -b v0.7 https://github.com/Microsoft/nni.git
git clone -b v0.8 https://github.com/Microsoft/nni.git
cd nni
powershell .\install.ps1
powershell -ExecutionPolicy Bypass -file install.ps1
```
For the system requirements, refer to [Install NNI](docs/zh_CN/Installation.md).
......@@ -164,7 +159,7 @@ For Windows, refer to [NNI on Windows](docs/zh_CN/NniOnWindows.md).
* Download the examples by cloning the source code.
```bash
git clone -b v0.7 https://github.com/Microsoft/nni.git
git clone -b v0.8 https://github.com/Microsoft/nni.git
```
Linux and macOS
......
......@@ -13,6 +13,11 @@ jobs:
- script: |
source install.sh
displayName: 'Install nni toolkit via source code'
- script: |
python3 -m pip install flake8 --user
IGNORE=./tools/nni_annotation/testcase/*:F821,./examples/trials/mnist-nas/mnist.py:F821
python3 -m flake8 . --count --per-file-ignores=$IGNORE --select=E9,F63,F72,F82 --show-source --statistics
displayName: 'Run flake8 tests to find Python syntax errors and undefined names'
- script: |
cd test
source unittest.sh
......
......@@ -10,13 +10,13 @@ To facilitate NAS innovations (e.g., design/implement new NAS models, compare di
A new programming interface for designing and searching for a model is often demanded in two scenarios. 1) When designing a neural network, the designer may have multiple choices for a layer, sub-model, or connection, and may not be sure which one or which combination performs best. It would be appealing to have an easy way to express the candidate layers/sub-models they want to try. 2) Researchers working on automatic NAS want a unified way to express the search space of neural architectures, and want unchanged trial code to work with different search algorithms.
We designed a simple and flexible programming interface based on [NNI annotation](./AnnotationSpec.md). It is elaborated through examples below.
We designed a simple and flexible programming interface based on [NNI annotation](../Tutorial/AnnotationSpec.md). It is elaborated through examples below.
### Example: choose an operator for a layer
When designing the following model there might be several choices in the fourth layer that may make this model perform good. In the script of this model, we can use annotation for the fourth layer as shown in the figure. In this annotation, there are five fields in total:
When designing the following model there might be several choices in the fourth layer that may make this model perform well. In the script of this model, we can use annotation for the fourth layer as shown in the figure. In this annotation, there are five fields in total:
![](../img/example_layerchoice.png)
![](../../img/example_layerchoice.png)
* __layer_choice__: It is a list of function calls; each function should be defined in the user's script or in imported libraries. The input arguments of the function should follow the format: `def XXX(inputs, arg2, arg3, ...)`, where `inputs` is a list with two elements. One is the list of `fixed_inputs`, and the other is a list of the chosen inputs from `optional_inputs`. `conv` and `pool` in the figure are examples of function definitions. For the function calls in this list, there is no need to write the first argument (i.e., `inputs`). Note that only one of the function calls is chosen for this layer.
* __fixed_inputs__: It is a list of variables; a variable could be an output tensor from a previous layer. It could be the `layer_output` of another `nni.mutable_layer` before this layer, or another Python variable defined before this layer. All the variables in this list will be fed into the chosen function in `layer_choice` (as the first element of the input list).
......@@ -32,25 +32,25 @@ __Debugging__: We provided an `nnictl trial codegen` command to help debugging y
Designing connections of layers is critical for making a high performance model. With our provided interface, users could annotate which connections a layer takes (as inputs). They could choose several ones from a set of connections. Below is an example which chooses two inputs from three candidate inputs for `concat`. Here `concat` always takes the output of its previous layer using `fixed_inputs`.
![](../img/example_connectchoice.png)
![](../../img/example_connectchoice.png)
### Example: choose both operators and connections
In this example, we choose one from the three operators and choose two connections for it. As there are multiple variables in inputs, we call `concat` at the beginning of the functions.
![](../img/example_combined.png)
![](../../img/example_combined.png)
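
As a rough illustration of the function format described above, the sketch below uses plain Python lists in place of tensors. The functions `concat` and `scale_sum` and their bodies are hypothetical stand-ins written for this example, not NNI or TensorFlow APIs; only the `inputs = [fixed_inputs, chosen_optional_inputs]` convention comes from the text.

```python
def concat(inputs):
    """Flatten [fixed_inputs, chosen_optional_inputs] into one list of values."""
    fixed_inputs, chosen_inputs = inputs
    merged = []
    for value in list(fixed_inputs) + list(chosen_inputs):
        merged.extend(value)
    return merged


def scale_sum(inputs, factor):
    """Hypothetical operator: merge all inputs first, then reduce them."""
    merged = concat(inputs)
    return factor * sum(merged)


# One fixed input plus two inputs chosen from three optional candidates.
fixed_inputs = [[1.0, 2.0]]
chosen_inputs = [[3.0], [4.0, 5.0]]
result = scale_sum([fixed_inputs, chosen_inputs], factor=2)
```

Calling `concat` first means the operator body does not need to care how many inputs were chosen for it.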
### Example: [ENAS][1] macro search space
To illustrate the convenience of the programming interface, we use the interface to implement the trial code of "ENAS + macro search space". The left figure is the macro search space in the ENAS paper.
![](../img/example_enas.png)
![](../../img/example_enas.png)
## Unified NAS search space specification
After finishing the trial code through the annotation above, users have implicitly specified the search space of neural architectures in the code. Based on the code, NNI will automatically generate a search space file which could be fed into tuning algorithms. This search space file has the following JSON format.
```json
```javascript
{
"mutable_1": {
"layer_1": {
......@@ -67,7 +67,7 @@ After finishing the trial code through the annotation above, users have implicit
Accordingly, a specified neural architecture (generated by tuning algorithm) is expressed as follows:
```json
```javascript
{
"mutable_1": {
"layer_1": {
......@@ -91,7 +91,7 @@ With the specification of the format of search space and architecture (choice) e
NNI's annotation compiler transforms the annotated trial code into code that can receive an architecture choice and build the corresponding model (i.e., graph). The NAS search space can be seen as a full graph (here, full graph means enabling all the provided operators and connections to build a graph), and the architecture chosen by the tuning algorithm is a subgraph of it. By default, the compiled trial code only builds and executes the subgraph.
![](../img/nas_on_nni.png)
![](../../img/nas_on_nni.png)
The above figure shows how the trial code runs on NNI. `nnictl` processes user trial code to generate a search space file and compiled trial code. The former is fed to tuner, and the latter is used to run trials.
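
To make the relation between the search space file and a chosen architecture more concrete, here is a minimal sketch of a random tuner step. The field names (`layer_choice`, `optional_inputs`, `optional_input_size`, `chosen_layer`, `chosen_inputs`) and the helper `sample_architecture` are assumptions made for illustration, since the JSON bodies are elided in this diff; this is not NNI's tuner implementation.

```python
import json
import random

# Assumed search space layout; the field names below are illustrative.
search_space = {
    "mutable_1": {
        "layer_1": {
            "layer_choice": ["conv(ch=128)", "pool", "identity"],
            "optional_inputs": ["out1", "out2", "out3"],
            "optional_input_size": 2,
        }
    }
}


def sample_architecture(space, rng):
    """Pick one operator and a fixed number of inputs for every layer."""
    architecture = {}
    for block_name, layers in space.items():
        architecture[block_name] = {}
        for layer_name, spec in layers.items():
            architecture[block_name][layer_name] = {
                "chosen_layer": rng.choice(spec["layer_choice"]),
                "chosen_inputs": rng.sample(
                    spec["optional_inputs"], spec["optional_input_size"]
                ),
            }
    return architecture


rng = random.Random(0)
architecture = sample_architecture(search_space, rng)
print(json.dumps(architecture, indent=2))
```

The sampled dictionary plays the role of the subgraph: the compiled trial code would only build the operator and connections named in it.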
......@@ -101,7 +101,7 @@ The above figure shows how the trial code runs on NNI. `nnictl` processes user t
Sharing weights among chosen architectures (i.e., trials) could speed up model search. For example, properly inheriting the weights of completed trials could speed up the convergence of new trials. One-Shot NAS (e.g., ENAS, Darts) is more aggressive: the training of different architectures (i.e., subgraphs) shares the same copy of the weights in the full graph.
![](../img/nas_weight_share.png)
![](../../img/nas_weight_share.png)
We believe weight sharing (transferring) plays a key role in speeding up NAS, while finding efficient ways of sharing weights is still a hot research topic. We provide a key-value store for users to store and load weights. Tuners and trials use a provided KV client library to access the storage.
......@@ -111,9 +111,9 @@ Example of weight sharing on NNI.
One-Shot NAS is a popular approach to find a good neural architecture within a limited time and resource budget. Basically, it builds a full graph based on the search space, and uses gradient descent to finally find the best subgraph. There are different training approaches, such as [training subgraphs (per mini-batch)][1], [training full graph through dropout][6], [training with architecture weights (regularization)][3]. Here we focus on the first approach, i.e., training subgraphs (ENAS).
With the same annotated trial code, users could choose One-Shot NAS as the execution mode on NNI. Specifically, the compiled trial code builds the full graph (rather than the subgraph demonstrated above); it receives a chosen architecture, trains this architecture on the full graph for a mini-batch, and then requests another chosen architecture. It is supported by [NNI multi-phase](./multiPhase.md). We support this training approach because training a subgraph is very fast, so rebuilding the graph every time a subgraph is trained would induce too much overhead.
With the same annotated trial code, users could choose One-Shot NAS as the execution mode on NNI. Specifically, the compiled trial code builds the full graph (rather than the subgraph demonstrated above); it receives a chosen architecture, trains this architecture on the full graph for a mini-batch, and then requests another chosen architecture. It is supported by [NNI multi-phase](MultiPhase.md). We support this training approach because training a subgraph is very fast, so rebuilding the graph every time a subgraph is trained would induce too much overhead.
![](../img/one-shot_training.png)
![](../../img/one-shot_training.png)
The design of One-Shot NAS on NNI is shown in the above figure. One-Shot NAS usually has only one trial job with the full graph. NNI supports running multiple such trial jobs, each of which runs independently. As One-Shot NAS is not stable, running multiple instances helps find a better model. Moreover, trial jobs are also able to synchronize weights while running (i.e., there is only one copy of the weights, as in asynchronous parameter-server mode). This may speed up convergence.
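
The per-mini-batch loop described above can be sketched with a toy model in which each candidate operator owns a single scalar weight. Everything here is hypothetical and only meant to show the control flow (build the full set of weights once, then alternate "receive architecture" and "train one mini-batch"); `receive_architecture` stands in for the tuner and does not use the NNI multi-phase API.

```python
import random

# Shared weights of the "full graph": one scalar weight per candidate operator.
OPERATORS = ["conv", "pool", "identity"]
shared_weights = {op: 0.0 for op in OPERATORS}
update_counts = {op: 0 for op in OPERATORS}


def receive_architecture(rng):
    """Stand-in for asking the tuner for the next chosen architecture."""
    return rng.choice(OPERATORS)


def train_one_minibatch(chosen_op, batch, lr=0.1):
    """Update only the weight used by the chosen subgraph."""
    target = sum(batch) / len(batch)
    gradient = shared_weights[chosen_op] - target  # d/dw of 0.5 * (w - target)^2
    shared_weights[chosen_op] -= lr * gradient
    update_counts[chosen_op] += 1


rng = random.Random(0)
minibatches = [[1.0, 2.0, 3.0]] * 30
for batch in minibatches:
    chosen_op = receive_architecture(rng)  # one architecture per mini-batch
    train_one_minibatch(chosen_op, batch)  # the full graph is never rebuilt
```

Because the weights live in one shared structure, switching architectures between mini-batches costs nothing beyond selecting which weights to update.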
......
......@@ -6,15 +6,15 @@ Curve Fitting Assessor is a LPA(learning, predicting, assessing) algorithm. It s
In this algorithm, we use 12 curves to fit the learning curve; this set of parametric curve models is taken from the [reference paper][1]. The shapes of these curves coincide with our prior knowledge about the form of learning curves: they are typically increasing, saturating functions.
![](../img/curvefitting_learning_curve.PNG)
![](../../img/curvefitting_learning_curve.PNG)
We combine all learning curve models into a single, more powerful model. This combined model is given by a weighted linear combination:
![](../img/curvefitting_f_comb.gif)
![](../../img/curvefitting_f_comb.gif)
where the new combined parameter vector
![](../img/curvefitting_expression_xi.gif)
![](../../img/curvefitting_expression_xi.gif)
We assume additive Gaussian noise, and the noise parameter is initialized to its maximum likelihood estimate.
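
The weighted linear combination can be illustrated with a small sketch. It uses only two curve families instead of the twelve, and the parameter values, the equal weights, and the helper names (`pow3`, `exp_saturation`, `f_comb`) are assumptions chosen for illustration; in the actual assessor the weights and curve parameters are fitted from the trial history rather than fixed by hand.

```python
import math


def pow3(x, c, a, alpha):
    """Power-law curve: c - a * x^(-alpha)."""
    return c - a * x ** (-alpha)


def exp_saturation(x, c, rate):
    """Hypothetical saturating curve: c * (1 - exp(-rate * x))."""
    return c * (1.0 - math.exp(-rate * x))


def f_comb(x, weights, curves):
    """Weighted linear combination of the individual curve models."""
    return sum(w * curve(x) for w, curve in zip(weights, curves))


# Each entry closes over its own (illustrative) parameter values.
curves = [
    lambda x: pow3(x, c=0.9, a=0.5, alpha=0.7),
    lambda x: exp_saturation(x, c=0.95, rate=0.3),
]
weights = [0.5, 0.5]

# Predicted accuracy for epochs 1..20.
predictions = [f_comb(epoch, weights, curves) for epoch in range(1, 21)]
```

Both component curves are increasing and saturating, so the combined prediction inherits the same shape.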
......@@ -30,7 +30,7 @@ Concretely,this algorithm goes through three stages of learning, predicting and
The figure below is the result of our algorithm on MNIST trial history data, where the green points represent the data obtained by the Assessor, the blue points represent the future but unknown data, and the red line is the curve predicted by the Curve Fitting Assessor.
![](../img/curvefitting_example.PNG)
![](../../img/curvefitting_example.PNG)
## 2. Usage
To use Curve Fitting Assessor, you should add the following spec in your experiment's YAML config file:
......
......@@ -5,15 +5,15 @@ Comparison of Hyperparameter Optimization algorithms on several problems.
Hyperparameter Optimization algorithms are listed below:
- [Random Search](../BuiltinTuner.md)
- [Grid Search](../BuiltinTuner.md)
- [Evolution](../BuiltinTuner.md)
- [Anneal](../BuiltinTuner.md)
- [Metis](../BuiltinTuner.md)
- [TPE](../BuiltinTuner.md)
- [SMAC](../BuiltinTuner.md)
- [HyperBand](../BuiltinTuner.md)
- [BOHB](../BuiltinTuner.md)
- [Random Search](../Tuner/BuiltinTuner.md)
- [Grid Search](../Tuner/BuiltinTuner.md)
- [Evolution](../Tuner/BuiltinTuner.md)
- [Anneal](../Tuner/BuiltinTuner.md)
- [Metis](../Tuner/BuiltinTuner.md)
- [TPE](../Tuner/BuiltinTuner.md)
- [SMAC](../Tuner/BuiltinTuner.md)
- [HyperBand](../Tuner/BuiltinTuner.md)
- [BOHB](../Tuner/BuiltinTuner.md)
All algorithms run in NNI local environment.
......@@ -34,7 +34,7 @@ is running in docker?: no
### Problem Description
Nonconvex problem on the hyper-parameter search of [AutoGBDT](../gbdt_example.md) example.
Nonconvex problem on the hyper-parameter search of [AutoGBDT](../TrialExample/GbdtExample.md) example.
### Search Space
......@@ -98,8 +98,11 @@ The total search space is 1,204,224, we set the number of maximum trial to 1000.
| HyperBand |0.414065|0.415222|0.417628|
| HyperBand |0.416807|0.417549|0.418828|
| HyperBand |0.415550|0.415977|0.417186|
| GP |0.414353|0.418563|0.420263|
| GP |0.414395|0.418006|0.420431|
| GP |0.412943|0.416566|0.418443|
For Metis, there are only about 300 trials because it runs slowly: the Gaussian Process it uses has O(n^3) time complexity.
In this example, all the algorithms are used with default parameters. For Metis, there are only about 300 trials because it runs slowly: the Gaussian Process it uses has O(n^3) time complexity.
## RocksDB Benchmark 'fillrandom' and 'readrandom'
......
......@@ -7,6 +7,6 @@ In addition to the official tutorials and examples, we encourage community contri
.. toctree::
:maxdepth: 2
NNI Practice Sharing<nni_practice_sharing>
Neural Architecture Search Comparison<CommunitySharings/NasComparison>
Hyper-parameter Tuning Algorithm Comparison<CommunitySharings/HpoComparison>
NNI in Recommenders <RecommendersSvd>
Neural Architecture Search Comparison <NasComparision>
Hyper-parameter Tuning Algorithm Comparison <HpoComparision>
......@@ -33,27 +33,27 @@ Basically, an experiment runs as follows: Tuner receives search space and genera
For each experiment, user only needs to define a search space and update a few lines of code, and then leverage NNI built-in Tuner/Assessor and training platforms to search the best hyperparameters and/or neural architecture. There are basically 3 steps:
>Step 1: [Define search space](SearchSpaceSpec.md)
>Step 1: [Define search space](Tutorial/SearchSpaceSpec.md)
>Step 2: [Update model codes](Trials.md)
>Step 2: [Update model codes](TrialExample/Trials.md)
>Step 3: [Define Experiment](ExperimentConfig.md)
>Step 3: [Define Experiment](Tutorial/ExperimentConfig.md)
<p align="center">
<img src="https://user-images.githubusercontent.com/23273522/51816627-5d13db80-2302-11e9-8f3e-627e260203d5.jpg" alt="drawing"/>
</p>
For more details about how to run an experiment, please refer to [Get Started](QuickStart.md).
For more details about how to run an experiment, please refer to [Get Started](Tutorial/QuickStart.md).
## Learn More
* [Get started](QuickStart.md)
* [How to adapt your trial code on NNI?](Trials.md)
* [What are tuners supported by NNI?](BuiltinTuner.md)
* [How to customize your own tuner?](CustomizeTuner.md)
* [What are assessors supported by NNI?](BuiltinAssessors.md)
* [How to customize your own assessor?](CustomizeAssessor.md)
* [How to run an experiment on local?](LocalMode.md)
* [How to run an experiment on multiple machines?](RemoteMachineMode.md)
* [How to run an experiment on OpenPAI?](PaiMode.md)
* [Examples](MnistExamples.md)
\ No newline at end of file
* [Get started](Tutorial/QuickStart.md)
* [How to adapt your trial code on NNI?](TrialExample/Trials.md)
* [What are tuners supported by NNI?](Tuner/BuiltinTuner.md)
* [How to customize your own tuner?](Tuner/CustomizeTuner.md)
* [What are assessors supported by NNI?](Assessor/BuiltinAssessor.md)
* [How to customize your own assessor?](Assessor/CustomizeAssessor.md)
* [How to run an experiment on local?](TrainingService/LocalMode.md)
* [How to run an experiment on multiple machines?](TrainingService/RemoteMachineMode.md)
* [How to run an experiment on OpenPAI?](TrainingService/PaiMode.md)
* [Examples](TrialExample/MnistExamples.md)
\ No newline at end of file
# ChangeLog
# Release 0.8 - 6/4/2019
## Major Features
* [Support NNI on Windows for PAI/Remote mode]
* NNI running on windows for remote mode
* NNI running on windows for PAI mode
* [Advanced features for using GPU]
* Run multiple trial jobs on the same GPU for local and remote mode
* Run trial jobs on the GPU running non-NNI jobs
* [Kubeflow v1beta2 operator]
* Support Kubeflow TFJob/PyTorchJob v1beta2
* [General NAS programming interface](./GeneralNasInterfaces.md)
* Provide NAS programming interface for users to easily express their neural architecture search space through NNI annotation
* Provide a new command `nnictl trial codegen` for debugging the NAS code
* Tutorial of NAS programming interface, example of NAS on mnist, customized random tuner for NAS
* [Support resume tuner/advisor's state for experiment resume]
* For experiment resume, tuner/advisor will be resumed by replaying finished trial data
* [Web Portal]
* Improve the design of copying trial's parameters
* Support 'randint' type in hyper-parameter graph
* Use should ComponentUpdate to avoid unnecessary render
## Bug fix and other changes
* [Bug fix that `nnictl update` has inconsistent command styles]
* [Support import data for SMAC tuner]
* [Bug fix that experiment state transition from ERROR back to RUNNING]
* [Fix bug of table entries]
* [Nested search space refinement]
* [Refine 'randint' type and support lower bound]
* [Comparison of different hyper-parameter tuning algorithm](./CommunitySharings/HpoComparision.md)
* [Comparison of NAS algorithm](./CommunitySharings/NasComparision.md)
* [NNI practice on Recommenders](./CommunitySharings/NniPracticeSharing/RecommendersSvd.md)
## Release 0.9 - 7/1/2019
### Major Features
* General NAS programming interface
* Add `enas-mode` and `oneshot-mode` for NAS interface: [PR #1201](https://github.com/microsoft/nni/pull/1201#issue-291094510)
* [Gaussian Process Tuner with Matern kernel](Tuner/GPTuner.md)
* Multiphase experiment supports
* Added new training service support for multiphase experiment: PAI mode supports multiphase experiment since v0.9.
* Added multiphase capability for the following builtin tuners:
* TPE, Random Search, Anneal, Naïve Evolution, SMAC, Network Morphism, Metis Tuner.
For details, please refer to [Write a tuner that leverages multi-phase](AdvancedFeature/MultiPhase.md)
* Web Portal
* Enable trial comparison in Web Portal. For details, refer to [View trials status](Tutorial/WebUI.md)
* Allow users to adjust rendering interval of Web Portal. For details, refer to [View Summary Page](Tutorial/WebUI.md)
* Show intermediate results in a friendlier way. For details, refer to [View trials status](Tutorial/WebUI.md)
* [Commandline Interface](Tutorial/Nnictl.md)
* `nnictl experiment delete`: deletes one or all experiments, including logs, results, environment information and cache. Use it to remove useless experiment results or to save disk space.
* `nnictl platform clean`: cleans up disk space on a target platform. The provided YAML file includes the information of the target platform, and it follows the same schema as the NNI configuration file.
### Bug fix and other changes
* Tuner Installation Improvements: add [sklearn](https://scikit-learn.org/stable/) to nni dependencies.
* (Bug Fix) Failed to connect to PAI http code - [Issue #1076](https://github.com/microsoft/nni/issues/1076)
* (Bug Fix) Validate file name for PAI platform - [Issue #1164](https://github.com/microsoft/nni/issues/1164)
* (Bug Fix) Update GMM evaluation in Metis Tuner
* (Bug Fix) Negative time number rendering in Web Portal - [Issue #1182](https://github.com/microsoft/nni/issues/1182), [Issue #1185](https://github.com/microsoft/nni/issues/1185)
* (Bug Fix) Hyper-parameter not shown correctly in WebUI when there is only one hyper parameter - [Issue #1192](https://github.com/microsoft/nni/issues/1192)
## Release 0.8 - 6/4/2019
### Major Features
* Support NNI on Windows for OpenPAI/Remote mode
* NNI running on windows for remote mode
* NNI running on windows for OpenPAI mode
* Advanced features for using GPU
* Run multiple trial jobs on the same GPU for local and remote mode
* Run trial jobs on the GPU running non-NNI jobs
* Kubeflow v1beta2 operator
* Support Kubeflow TFJob/PyTorchJob v1beta2
* [General NAS programming interface](AdvancedFeature/GeneralNasInterfaces.md)
* Provide NAS programming interface for users to easily express their neural architecture search space through NNI annotation
* Provide a new command `nnictl trial codegen` for debugging the NAS code
* Tutorial of NAS programming interface, example of NAS on MNIST, customized random tuner for NAS
* Support resume tuner/advisor's state for experiment resume
* For experiment resume, tuner/advisor will be resumed by replaying finished trial data
* Web Portal
* Improve the design of copying trial's parameters
* Support 'randint' type in hyper-parameter graph
* Use `shouldComponentUpdate` to avoid unnecessary rendering
### Bug fix and other changes
* Bug fix that `nnictl update` has inconsistent command styles
* Support import data for SMAC tuner
* Bug fix that experiment state transition from ERROR back to RUNNING
* Fix bug of table entries
* Nested search space refinement
* Refine 'randint' type and support lower bound
* [Comparison of different hyper-parameter tuning algorithm](CommunitySharings/HpoComparision.md)
* [Comparison of NAS algorithm](CommunitySharings/NasComparision.md)
* [NNI practice on Recommenders](CommunitySharings/RecommendersSvd.md)
## Release 0.7 - 4/29/2019
### Major Features
* [Support NNI on Windows](./WindowsLocalMode.md)
* [Support NNI on Windows](Tutorial/NniOnWindows.md)
* NNI running on windows for local mode
* [New advisor: BOHB](./BohbAdvisor.md)
* [New advisor: BOHB](Tuner/BohbAdvisor.md)
* Support a new advisor, BOHB, a robust and efficient hyperparameter tuning algorithm that combines the advantages of Bayesian optimization and Hyperband
* [Support import and export experiment data through nnictl](./Nnictl.md#experiment)
* [Support import and export experiment data through nnictl](Tutorial/Nnictl.md#experiment)
* Generate analysis results report after the experiment execution
* Support import data to tuner and advisor for tuning
* [Designated gpu devices for NNI trial jobs](./ExperimentConfig.md#localConfig)
* [Designated gpu devices for NNI trial jobs](Tutorial/ExperimentConfig.md#localConfig)
* Specify GPU devices for NNI trial jobs with the gpuIndices configuration. If gpuIndices is set in the experiment configuration file, only the specified GPU devices are used for NNI trial jobs.
* Web Portal enhancement
* Decimal format of metrics other than default on the Web UI
......@@ -56,7 +89,7 @@
* Unable to kill all python threads after nnictl stop in async dispatcher mode
* nnictl --version does not work with make dev-install
* All trail jobs status stays on 'waiting' for long time on PAI platform
* All trial jobs' status stays on 'waiting' for a long time on OpenPAI platform
## Release 0.6 - 4/2/2019
......@@ -73,7 +106,7 @@
### Bug fix
* [Add shmMB config key for PAI](https://github.com/Microsoft/nni/issues/842)
* [Add shmMB config key for OpenPAI](https://github.com/Microsoft/nni/issues/842)
* Fix the bug that doesn't show any result if metrics is dict
* Fix the number calculation issue for float types in hyperband
* Fix a bug in the search space conversion in SMAC tuner
......@@ -118,14 +151,14 @@
#### New tuner and assessor supports
* Support [Metis tuner](MetisTuner.md) as a new NNI tuner. Metis algorithm has been proofed to be well performed for **online** hyper-parameter tuning.
* Support [Metis tuner](Tuner/MetisTuner.md) as a new NNI tuner. The Metis algorithm has been shown to perform well for **online** hyper-parameter tuning.
* Support [ENAS customized tuner](https://github.com/countif/enas_nni), a tuner contributed by a GitHub community user. ENAS is a neural network search algorithm that learns the network architecture via reinforcement learning and achieves better performance than NAS.
* Support [Curve fitting assessor](CurvefittingAssessor.md) for early stop policy using learning curve extrapolation.
* Advanced Support of [Weight Sharing](./AdvancedNas.md): Enable weight sharing for NAS tuners, currently through NFS.
* Support [Curve fitting assessor](Assessor/CurvefittingAssessor.md) for early stop policy using learning curve extrapolation.
* Advanced Support of [Weight Sharing](AdvancedFeature/AdvancedNas.md): Enable weight sharing for NAS tuners, currently through NFS.
#### Training Service Enhancement
* [FrameworkController Training service](./FrameworkControllerMode.md): Support run experiments using frameworkcontroller on kubernetes
* [FrameworkController Training service](TrainingService/FrameworkControllerMode.md): Support run experiments using frameworkcontroller on kubernetes
* FrameworkController is a Controller on kubernetes that is general enough to run (distributed) jobs with various machine learning frameworks, such as tensorflow, pytorch, MXNet.
* NNI provides unified and simple specification for job definition.
* MNIST example for how to use FrameworkController.
......@@ -143,11 +176,11 @@
#### New tuner supports
* Support [network morphism](NetworkmorphismTuner.md) as a new tuner
* Support [network morphism](Tuner/NetworkmorphismTuner.md) as a new tuner
#### Training Service improvements
* Migrate [Kubeflow training service](KubeflowMode.md)'s dependency from kubectl CLI to [Kubernetes API](https://kubernetes.io/docs/concepts/overview/kubernetes-api/) client
* Migrate [Kubeflow training service](TrainingService/KubeflowMode.md)'s dependency from kubectl CLI to [Kubernetes API](https://kubernetes.io/docs/concepts/overview/kubernetes-api/) client
* [Pytorch-operator](https://github.com/kubeflow/pytorch-operator) support for Kubeflow training service
* Improvement on local code files uploading to OpenPAI HDFS
* Fixed OpenPAI integration WebUI bug: WebUI doesn't show latest trial job status, which is caused by OpenPAI token expiration
......@@ -174,11 +207,11 @@
### Major Features
* [Kubeflow Training service](./KubeflowMode.md)
* [Kubeflow Training service](TrainingService/KubeflowMode.md)
* Support tf-operator
* [Distributed trial example](https://github.com/Microsoft/nni/tree/master/examples/trials/mnist-distributed/dist_mnist.py) on Kubeflow
* [Grid search tuner](GridsearchTuner.md)
* [Hyperband tuner](HyperbandAdvisor.md)
* [Grid search tuner](Tuner/GridsearchTuner.md)
* [Hyperband tuner](Tuner/HyperbandAdvisor.md)
* Support launch NNI experiment on MAC
* WebUI
* UI support for hyperband tuner
......@@ -213,7 +246,7 @@
```
* Support updating max trial number.
use `nnictl update --help` to learn more. Or refer to [NNICTL Spec](Nnictl.md) for the fully usage of NNICTL.
Use `nnictl update --help` to learn more, or refer to [NNICTL Spec](Tutorial/Nnictl.md) for the full usage of NNICTL.
### API new features and updates
......@@ -250,7 +283,7 @@
### Others
* UI refactoring, refer to [WebUI doc](WebUI.md) for how to work with the new UI.
* UI refactoring, refer to [WebUI doc](Tutorial/WebUI.md) for how to work with the new UI.
* Continuous Integration: NNI had switched to Azure pipelines
* [Known Issues in release 0.3.0](https://github.com/Microsoft/nni/labels/nni030knownissues).
......@@ -258,10 +291,10 @@
### Major Features
* Support [OpenPAI](https://github.com/Microsoft/pai) Training Platform (See [here](./PaiMode.md) for instructions about how to submit NNI job in pai mode)
* Support [OpenPAI](https://github.com/Microsoft/pai) Training Platform (See [here](TrainingService/PaiMode.md) for instructions about how to submit NNI job in pai mode)
* Support training services on pai mode. NNI trials will be scheduled to run on OpenPAI cluster
* NNI trial's output (including logs and model file) will be copied to OpenPAI HDFS for further debugging and checking
* Support [SMAC](https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf) tuner (See [here](SmacTuner.md) for instructions about how to use SMAC tuner)
* Support [SMAC](https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf) tuner (See [here](Tuner/SmacTuner.md) for instructions about how to use SMAC tuner)
* [SMAC](https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf) is based on Sequential Model-Based Optimization (SMBO). It adapts the most prominent previously used model class (Gaussian stochastic process models) and introduces the model class of random forests to SMBO to handle categorical parameters. The SMAC supported by NNI is a wrapper on [SMAC3](https://github.com/automl/SMAC3)
* Support NNI installation on [conda](https://conda.io/docs/index.html) and python virtual environment
* Others
......
**Run an Experiment on FrameworkController**
# Run an Experiment on FrameworkController
===
NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install kubeflow for specific deeplearning framework like tf-operator or pytorch-operator. Now you can use frameworkcontroller as the training service to run NNI experiment.
NNI supports running experiments using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, so you don't need to install Kubeflow for a specific deep learning framework such as tf-operator or pytorch-operator. You can now use FrameworkController as the training service to run NNI experiments.
## Prerequisite for on-premises Kubernetes Service
1. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this [guideline](https://kubernetes.io/docs/setup/) to set up Kubernetes
2. Prepare a **kubeconfig** file, which will be used by NNI to interact with your kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer this [guideline]( https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
2. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this [guideline](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
3. If your NNI trial job needs GPU resource, you should follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure **Nvidia device plugin for Kubernetes**.
4. Prepare an **NFS server** and export a general purpose mount (we recommend mapping your NFS server path with the `root_squash` option, otherwise permission issues may arise when NNI copies files to NFS. Refer to this [page](https://linux.die.net/man/5/exports) to learn what the root_squash option is), or **Azure File Storage**.
5. Install **NFS client** on the machine where you install NNI and run nnictl to create experiment. Run this command to install NFSv4 client:
```
```bash
apt-get install nfs-common
```
6. Install **NNI**, follow the install guide [here](QuickStart.md).
6. Install **NNI**, follow the install guide [here](../Tutorial/QuickStart.md).
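
The kubeconfig lookup described in step 2 can be sketched as follows. This is only an illustration of the stated rule, not NNI's actual code: `resolveKubeconfigPath` and its `env` argument are hypothetical names, and the environment is passed in as a plain object so the snippet has no runtime dependencies.

```typescript
// Hypothetical helper: picks the kubeconfig path the way step 2 describes.
// `env` stands in for the process environment (for example `process.env` in Node.js).
function resolveKubeconfigPath(env: { [name: string]: string | undefined }): string {
    const explicitPath = env["KUBECONFIG"];
    if (explicitPath !== undefined && explicitPath.length > 0) {
        // An explicitly set KUBECONFIG takes precedence over the default location.
        return explicitPath;
    }
    const home = env["HOME"];
    if (home === undefined || home.length === 0) {
        throw new Error("Neither KUBECONFIG nor HOME is set");
    }
    // Default location: $(HOME)/.kube/config
    return home + "/.kube/config";
}

const defaultPath = resolveKubeconfigPath({ HOME: "/home/alice" });
const overriddenPath = resolveKubeconfigPath({
    HOME: "/home/alice",
    KUBECONFIG: "/etc/k8s/dev.config"
});
```

Here `defaultPath` falls back to the home directory, while `overriddenPath` follows the environment variable. The sample paths are made up for the example.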
## Prerequisite for Azure Kubernetes Service
1. NNI support kubeflow based on Azure Kubernetes Service, follow the [guideline](https://azure.microsoft.com/en-us/services/kubernetes-service/) to set up Azure Kubernetes Service.
1. NNI supports Kubeflow based on Azure Kubernetes Service; follow the [guideline](https://azure.microsoft.com/en-us/services/kubernetes-service/) to set up Azure Kubernetes Service.
2. Install [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) and __kubectl__. Use `az login` to set your Azure account, and connect the kubectl client to AKS; refer to this [guideline](https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster).
3. Follow the [guideline](https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal) to create an Azure file storage account. If you use Azure Kubernetes Service, NNI needs Azure Storage Service to store code files and output files.
4. To access Azure storage service, NNI needs the access key of the storage account, and NNI uses [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) Service to protect your private key. Set up Azure Key Vault Service and add a secret to Key Vault to store the access key of the Azure storage account. Follow this [guideline](https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli) to store the access key.
## Setup FrameworkController
## Set up FrameworkController
Follow the [guideline](https://github.com/Microsoft/frameworkcontroller/tree/master/example/run) to set up frameworkcontroller in the kubernetes cluster, NNI supports frameworkcontroller by the statefulset mode.
Follow the [guideline](https://github.com/Microsoft/frameworkcontroller/tree/master/example/run) to set up FrameworkController in the Kubernetes cluster. NNI supports FrameworkController in StatefulSet mode.
## Design
Please refer the design of [kubeflow training service](./KubeflowMode.md), frameworkcontroller training service pipeline is similar.
Please refer to the design of the [Kubeflow training service](KubeflowMode.md); the FrameworkController training service pipeline is similar.
## Example
The frameworkcontroller config file format is:
```
The FrameworkController config file format is:
```yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
......@@ -71,8 +77,10 @@ frameworkcontrollerConfig:
server: {your_nfs_server}
path: {your_nfs_server_exported_path}
```
If you use Azure Kubernetes Service, you should set `frameworkcontrollerConfig` in your config YAML file as follows:
```
```yaml
frameworkcontrollerConfig:
storage: azureStorage
keyVault:
......@@ -82,22 +90,27 @@ frameworkcontrollerConfig:
accountName: {your_storage_account_name}
azureShare: {your_azure_share_name}
```
Note: You should explicitly set `trainingServicePlatform: frameworkcontroller` in NNI config YAML file if you want to start experiment in frameworkcontrollerConfig mode.
The trial's config format for NNI frameworkcontroller mode is a simple version of frameworkcontroller's offical config, you could refer the [tensorflow example of frameworkcontroller](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for deep understanding.
The trial's config format for NNI frameworkcontroller mode is a simplified version of FrameworkController's official config. Refer to the [Tensorflow example of FrameworkController](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for a deeper understanding.
Trial configuration in frameworkcontroller mode has the following configuration keys:
* taskRoles: you could set multiple task roles in config file, and each task role is a basic unit to process in kubernetes cluster.
* name: the name of task role specified, like "worker", "ps", "master".
* taskNum: the replica number of the task role.
* command: the users' command to be used in the container.
* gpuNum: the number of gpu device used in container.
* cpuNum: the number of cpu device used in container.
* memoryMB: the memory limitaion to be specified in container.
* image: the docker image used to create pod and run the program.
* frameworkAttemptCompletionPolicy: the policy to run framework, please refer the [user-manual](https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md#frameworkattemptcompletionpolicy) to get the specific information. Users could use the policy to control the pod, for example, if ps does not stop, only worker stops, this completionpolicy could helps stop ps.
* taskRoles: you can set multiple task roles in the config file, and each task role is a basic unit to process in the Kubernetes cluster.
* name: the name of the task role, like "worker", "ps", "master".
* taskNum: the replica number of the task role.
* command: the user's command to be run in the container.
* gpuNum: the number of GPU devices used in the container.
* cpuNum: the number of CPU devices used in the container.
* memoryMB: the memory limitation to be specified for the container.
* image: the docker image used to create the pod and run the program.
* frameworkAttemptCompletionPolicy: the policy to run the framework; please refer to the [user-manual](https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md#frameworkattemptcompletionpolicy) for the specific information. Users can use the policy to control the pod. For example, if ps does not stop and only the worker stops, the completion policy can help stop ps.
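
As a rough illustration of the keys listed above, the sketch below models one task role as a TypeScript type and runs a few sanity checks on it. Everything here is an assumption made for the example: the `TaskRole` interface, the `validateTaskRole` helper, the specific rules (such as requiring at least one CPU) and the sample values are hypothetical and are not NNI's own definitions. The completion policy is left as an opaque object because its fields are documented in the FrameworkController user manual, not here.

```typescript
// Hypothetical type mirroring the keys listed above; not NNI's own definition.
interface TaskRole {
    name: string;      // e.g. "worker", "ps", "master"
    taskNum: number;   // replica number of the task role
    command: string;   // command run in the container
    gpuNum: number;
    cpuNum: number;
    memoryMB: number;
    image: string;     // docker image used to create the pod
    // Left opaque; see the FrameworkController user manual for its fields.
    frameworkAttemptCompletionPolicy?: { [field: string]: number };
}

function isWholeNumber(value: number): boolean {
    return isFinite(value) && Math.floor(value) === value;
}

// Returns a list of problems; an empty list means the role passed these assumed checks.
function validateTaskRole(role: TaskRole): string[] {
    const problems: string[] = [];
    if (role.name.length === 0) { problems.push("name must not be empty"); }
    if (!isWholeNumber(role.taskNum) || role.taskNum < 1) { problems.push("taskNum must be a positive integer"); }
    if (role.command.length === 0) { problems.push("command must not be empty"); }
    if (!isWholeNumber(role.gpuNum) || role.gpuNum < 0) { problems.push("gpuNum must be a non-negative integer"); }
    if (!isWholeNumber(role.cpuNum) || role.cpuNum < 1) { problems.push("cpuNum must be a positive integer"); }
    if (!isFinite(role.memoryMB) || role.memoryMB <= 0) { problems.push("memoryMB must be positive"); }
    if (role.image.length === 0) { problems.push("image must not be empty"); }
    return problems;
}

// Sample values are invented for the example.
const workerRole: TaskRole = {
    name: "worker",
    taskNum: 1,
    command: "python3 mnist.py",
    gpuNum: 0,
    cpuNum: 1,
    memoryMB: 8192,
    image: "example/nni-image:latest"
};
const workerProblems = validateTaskRole(workerRole);
```

A check like this is only a convenience before submitting an experiment; the authoritative validation is whatever NNI and FrameworkController perform themselves.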
## How to run example
After you prepare a config file, you could run your experiment by nnictl. The way to start an experiment on frameworkcontroller is similar to kubeflow, please refer the [document](./KubeflowMode.md) for more information.
After you prepare a config file, you can run your experiment with nnictl. The way to start an experiment on FrameworkController is similar to Kubeflow; please refer to the [document](KubeflowMode.md) for more information.
## Version check
NNI support version check feature in since version 0.6, [refer](PaiMode.md)
\ No newline at end of file
NNI has supported the version check feature since version 0.6; for details, refer to [PaiMode](PaiMode.md).
......@@ -2,12 +2,13 @@
===
## Overview
TrainingService is a module related to platform management and job schedule in NNI. TrainingService is designed to be easily implemented, we define an abstract class TrainingService as the parent class of all kinds of TrainignService, users just need to inherit the parent class and complete their own clild class if they want to implement customized TrainingService.
TrainingService is a module related to platform management and job scheduling in NNI. TrainingService is designed to be easy to implement: we define an abstract class TrainingService as the parent class of all kinds of TrainingService, and users who want a customized TrainingService just need to inherit the parent class and complete their own child class.
## System architecture
![](../img/NNIDesign.jpg)
![](../../img/NNIDesign.jpg)
The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of system, in charge of calling TrainingService to manage trial jobs and the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module to manage trial jobs, it communicates with nniManager module, and has different instance according to different training platform. For the time being, NNI supports [local platfrom](LocalMode.md), [remote platfrom](RemoteMachineMode.md), [PAI platfrom](PaiMode.md), [kubeflow platform](KubeflowMode.md) and [FrameworkController platfrom](FrameworkControllerMode.md).
The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of the system, in charge of calling TrainingService to manage trial jobs and of the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module to manage trial jobs; it communicates with the nniManager module and has a different instance for each training platform. For the time being, NNI supports local platform, [remote platform](RemoteMachineMode.md), [PAI platform](PaiMode.md), [kubeflow platform](KubeflowMode.md) and [FrameworkController platform](FrameworkController.md).
In this document, we introduce the brief design of TrainingService. If users want to add a new TrainingService instance, they just need to complete a child class that implements TrainingService; they do not need to understand the code details of NNIManager, Dispatcher or other modules.
## Folder structure of code
......@@ -63,6 +64,7 @@ abstract class TrainingService {
The parent class of TrainingService has a few abstract functions, users need to inherit the parent class and implement all of these abstract functions.
__setClusterMetadata(key: string, value: string)__
ClusterMetadata is the data related to platform details. For example, the ClusterMetadata defined for the remote machine server is:
```
export class RemoteMachineMeta {
......@@ -91,9 +93,11 @@ export class RemoteMachineMeta {
The metadata includes the host address, the username or other configuration related to the platform. Users need to define their own metadata format, and set the metadata instance in this function. This function is called before the experiment is started to set the configuration of remote machines.
__getClusterMetadata(key: string)__
This function returns the metadata value for the given key. It can be left empty if users don't need it.
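
The two metadata functions can be pictured with a small sketch. It is a simplification under stated assumptions, not NNI's implementation: `MyPlatformMeta`, `MyPlatformMetadataStore` and the key name `my_platform_config` are hypothetical, the value is assumed to arrive as a JSON string, and the methods are kept synchronous for brevity even though a real TrainingService may need asynchronous versions.

```typescript
// Hypothetical metadata format for an imaginary platform; define your own fields.
interface MyPlatformMeta {
    host: string;
    username: string;
}

class MyPlatformMetadataStore {
    // A prototype-free object, so keys such as "toString" are not found by accident.
    private rawValues: { [key: string]: string } = Object.create(null);
    private platformMeta: MyPlatformMeta | undefined;

    // Called before the experiment starts; `value` is assumed to be a JSON string.
    public setClusterMetadata(key: string, value: string): void {
        this.rawValues[key] = value;
        if (key === "my_platform_config") {   // hypothetical key name
            this.platformMeta = JSON.parse(value) as MyPlatformMeta;
        }
    }

    // Returns the raw value stored under `key`, or an empty string if none was set.
    public getClusterMetadata(key: string): string {
        const stored = this.rawValues[key];
        return stored === undefined ? "" : stored;
    }

    public getPlatformMeta(): MyPlatformMeta | undefined {
        return this.platformMeta;
    }
}

const metadataStore = new MyPlatformMetadataStore();
metadataStore.setClusterMetadata("my_platform_config", '{"host":"10.0.0.5","username":"nni"}');
```

Keeping both the raw string and the parsed object lets `getClusterMetadata` return exactly what was set while the rest of the class works with typed fields.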
__submitTrialJob(form: JobApplicationForm)__
SubmitTrialJob is a function to submit new trial jobs; users should generate a job instance of type TrialJobDetail. TrialJobDetail is defined as follows:
```
interface TrialJobDetail {
......@@ -113,37 +117,49 @@ interface TrialJobDetail {
According to different kinds of implementation, users could put the job detail into a job queue, and keep fetching the job from the queue and start preparing and running them. Or they could finish preparing and running process in this function, and return job detail after the submit work.
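
The first approach (put the job into a queue, then keep fetching and starting jobs) might look like the sketch below. The types `SketchJobForm` and `SketchJobDetail` are hypothetical, trimmed-down stand-ins for `JobApplicationForm` and `TrialJobDetail`, the class and method names other than `submitTrialJob` are invented, and the `"WAITING"` initial status is an assumption. The sketch is synchronous for brevity.

```typescript
// Hypothetical, trimmed-down stand-ins for JobApplicationForm and TrialJobDetail.
interface SketchJobForm {
    command: string;
}

interface SketchJobDetail {
    id: string;
    status: string;   // "WAITING" is an assumed initial status
    form: SketchJobForm;
}

class QueueingSubmitter {
    private jobQueue: SketchJobDetail[] = [];
    private nextJobNumber: number = 0;

    // Enqueue the job and return its detail right away.
    public submitTrialJob(form: SketchJobForm): SketchJobDetail {
        const detail: SketchJobDetail = {
            id: "trial-" + this.nextJobNumber,
            status: "WAITING",
            form: form
        };
        this.nextJobNumber += 1;
        this.jobQueue.push(detail);
        return detail;
    }

    // One pass of the fetch loop: take the oldest queued job and start it.
    // `launch` is a placeholder for the platform-specific start-up code.
    public startNextJob(launch: (job: SketchJobDetail) => void): SketchJobDetail | undefined {
        const job = this.jobQueue.shift();
        if (job === undefined) {
            return undefined;   // nothing is waiting
        }
        launch(job);
        job.status = "RUNNING";
        return job;
    }
}

const submitter = new QueueingSubmitter();
const firstJob = submitter.submitTrialJob({ command: "python3 mnist.py" });
const launchedCommands: string[] = [];
submitter.startNextJob((job) => { launchedCommands.push(job.form.command); });
```

Because the returned detail is the same object that sits in the queue, callers holding `firstJob` see its status change once the job has been started. The second approach from the text would simply do the launch inside `submitTrialJob` itself.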
__cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean)__
If this function is called, the trial started by the platform should be canceled. Different kinds of platforms have different methods to cancel a running job, so this function should be implemented according to the specific platform.
__updateTrialJob(trialJobId: string, form: JobApplicationForm)__
This function is called to update the trial job's status, trial job's status should be detected according to different platform, and be updated to `RUNNING`, `SUCCEED`, `FAILED` etc.
__getTrialJob(trialJobId: string)__
This function returns a trialJob detail instance according to trialJobId.
__listTrialJobs()__
Users should put all of trial job detail information into a list, and return the list.
__addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)__
NNI will hold an EventEmitter to get job metrics; when new job metrics are detected, the EventEmitter will be triggered. Users should start the EventEmitter in this function.
__removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)__
Close the EventEmitter.
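
The listener pair can be illustrated with a dependency-free sketch. In a Node.js implementation an `EventEmitter` would normally play this role; here a plain array of callbacks is swapped in so the example stays self-contained. `SketchMetric`, `MetricBroadcaster` and `emitMetric` are hypothetical names, and the metric fields are assumed for the example.

```typescript
// Hypothetical, simplified metric shape; NNI's TrialJobMetric may differ.
interface SketchMetric {
    id: string;
    data: string;
}

type MetricListener = (metric: SketchMetric) => void;

class MetricBroadcaster {
    private listeners: MetricListener[] = [];

    public addTrialJobMetricListener(listener: MetricListener): void {
        this.listeners.push(listener);
    }

    public removeTrialJobMetricListener(listener: MetricListener): void {
        const index = this.listeners.indexOf(listener);
        if (index >= 0) {
            this.listeners.splice(index, 1);
        }
    }

    // Called by platform-specific code whenever a new metric is detected.
    public emitMetric(metric: SketchMetric): void {
        // Iterate over a copy so listeners may remove themselves while being notified.
        const snapshot = this.listeners.slice();
        for (const listener of snapshot) {
            listener(metric);
        }
    }
}

const broadcaster = new MetricBroadcaster();
const receivedMetrics: string[] = [];
const recordMetric: MetricListener = (metric) => {
    receivedMetrics.push(metric.id + "=" + metric.data);
};
broadcaster.addTrialJobMetricListener(recordMetric);
broadcaster.emitMetric({ id: "trial-0", data: "0.93" });
broadcaster.removeTrialJobMetricListener(recordMetric);
broadcaster.emitMetric({ id: "trial-0", data: "0.95" });   // no listener is registered any more
```

Only the first metric reaches `recordMetric`, since the listener is removed before the second one is emitted.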
__run()__
The run() function is a main loop function in TrainingService, users could set a while loop to execute their logic code, and finish executing them when the experiment is stopped.
__cleanUp()__
This function is called to clean up the environment when an experiment is stopped. Users should perform the platform-related cleanup operations in this function.
## TrialKeeper tool
NNI offers a TrialKeeper tool to help maintain trial jobs. Users can find the source code in `nni/tools/nni_trial_tool`. If users want to run trial jobs on a cloud platform, this tool is a good choice to help keep trials running on the platform.
The running architecture of TrialKeeper is shown as follows:
![](../img/trialkeeper.jpg)
![](../../img/trialkeeper.jpg)
When users submit a trial job to a cloud platform, they should wrap their trial command into TrialKeeper and start a TrialKeeper process on the cloud platform. Note that TrialKeeper uses a RESTful server to communicate with TrainingService, so users should start a RESTful server on the local machine to receive metrics sent from TrialKeeper. The source code of the RESTful server can be found in `nni/src/nni_manager/training_service/common/clusterJobRestServer.ts`.
## Reference
For more information about how to debug, please [refer](HowToDebug.md).
The guide line of how to contribute, please [refer](Contributing.md).
For more information about how to debug, please refer to [How to Debug](../Tutorial/HowToDebug.md).
For the guideline on how to contribute, please refer to [Contributing](../Tutorial/Contributing.md).
**Run an Experiment on Kubeflow**
# Run an Experiment on Kubeflow
===
Now NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a kubernetes cluster, either on-prem or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is setup to connect to your kubernetes cluster. If you are not familiar with kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as kubeflow job in kubernetes cluster.
NNI now supports running experiments on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), and an Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is set up to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as a Kubeflow job in the Kubernetes cluster.
## Prerequisite for on-premises Kubernetes Service
1. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this [guideline](https://kubernetes.io/docs/setup/) to set up Kubernetes
2. Download, set up, and deploy **Kubelow** to your Kubernetes cluster. Follow this [guideline](https://www.kubeflow.org/docs/started/getting-started/) to set up Kubeflow
3. Prepare a **kubeconfig** file, which will be used by NNI to interact with your kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer this [guideline]( https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
2. Download, set up, and deploy **Kubeflow** to your Kubernetes cluster. Follow this [guideline](https://www.kubeflow.org/docs/started/getting-started/) to set up Kubeflow.
3. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this [guideline](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
4. If your NNI trial job needs GPU resource, you should follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure **Nvidia device plugin for Kubernetes**.
5. Prepare an **NFS server** and export a general purpose mount (we recommend mapping your NFS server path with the `root_squash` option, otherwise permission issues may arise when NNI copies files to NFS. Refer to this [page](https://linux.die.net/man/5/exports) to learn what the root_squash option is), or **Azure File Storage**.
6. Install **NFS client** on the machine where you install NNI and run nnictl to create experiment. Run this command to install NFSv4 client:
......@@ -13,40 +16,50 @@ Now NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/ku
apt-get install nfs-common
```
7. Install **NNI**, follow the install guide [here](QuickStart.md).
7. Install **NNI**, follow the install guide [here](../Tutorial/QuickStart.md).
## Prerequisite for Azure Kubernetes Service
1. NNI support kubeflow based on Azure Kubernetes Service, follow the [guideline](https://azure.microsoft.com/en-us/services/kubernetes-service/) to set up Azure Kubernetes Service.
1. NNI supports Kubeflow on Azure Kubernetes Service. Follow the [guideline](https://azure.microsoft.com/en-us/services/kubernetes-service/) to set up Azure Kubernetes Service.
2. Install [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) and __kubectl__. Use `az login` to set your Azure account, then connect the kubectl client to AKS; refer to this [guideline](https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster).
3. Deploy kubeflow on Azure Kubernetes Service, follow the [guideline](https://www.kubeflow.org/docs/started/getting-started/).
3. Deploy Kubeflow on Azure Kubernetes Service by following the [guideline](https://www.kubeflow.org/docs/started/getting-started/).
4. Follow the [guideline](https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal) to create an Azure file storage account. If you use Azure Kubernetes Service, NNI needs Azure Storage Service to store code files and output files.
5. To access Azure Storage Service, NNI needs the access key of the storage account, and NNI uses [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) Service to protect your private key. Set up Azure Key Vault Service and add a secret to Key Vault to store the access key of the Azure storage account. Follow this [guideline](https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli) to store the access key.
## Design
![](../img/kubeflow_training_design.png)
Kubeflow training service instantiates a kubernetes rest client to interact with your K8s cluster's API server.
For each trial, we will upload all the files in your local codeDir path (configured in nni_config.yml) together with NNI generated files like parameter.cfg into a storage volumn. Right now we support two kinds of storage volumns: [nfs](https://en.wikipedia.org/wiki/Network_File_System) and [azure file storage](https://azure.microsoft.com/en-us/services/storage/files/), you should configure the storage volumn in NNI config YAML file. After files are prepared, Kubeflow training service will call K8S rest API to create kubeflow jobs ([tf-operator](https://github.com/kubeflow/tf-operator) job or [pytorch-operator](https://github.com/kubeflow/pytorch-operator) job) in K8S, and mount your storage volumn into the job's pod. Output files of kubeflow job, like stdout, stderr, trial.log or model files, will also be copied back to the storage volumn. NNI will show the storage volumn's URL for each trial in WebUI, to allow user browse the log files and job's output files.
![](../../img/kubeflow_training_design.png)
Kubeflow training service instantiates a Kubernetes REST client to interact with your K8s cluster's API server.
For each trial, we will upload all the files in your local codeDir path (configured in nni_config.yml) together with NNI generated files like parameter.cfg into a storage volume. Right now we support two kinds of storage volumes: [nfs](https://en.wikipedia.org/wiki/Network_File_System) and [azure file storage](https://azure.microsoft.com/en-us/services/storage/files/); you should configure the storage volume in the NNI config YAML file. After the files are prepared, Kubeflow training service will call the K8S REST API to create Kubeflow jobs ([tf-operator](https://github.com/kubeflow/tf-operator) job or [pytorch-operator](https://github.com/kubeflow/pytorch-operator) job) in K8S, and mount your storage volume into the job's pod. Output files of the Kubeflow job, like stdout, stderr, trial.log or model files, will also be copied back to the storage volume. NNI will show the storage volume's URL for each trial in WebUI, to allow users to browse the log files and the job's output files.
## Supported operator
NNI only support tf-operator and pytorch-operator of kubeflow, other operators is not tested.
NNI only supports tf-operator and pytorch-operator of Kubeflow; other operators are not tested.
Users can set the operator type in the config file.
The setting of tf-operator:
```
```yaml
kubeflowConfig:
operator: tf-operator
```
The setting of pytorch-operator:
```
```yaml
kubeflowConfig:
operator: pytorch-operator
```
To use tf-operator, set `ps` and `worker` in the trial config. To use pytorch-operator, set `master` and `worker` in the trial config.
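The pairing between operator and role sections can be sketched as a small lookup. This is only an illustration and not NNI's own validation code: the helper name `unexpected_roles` and the `ROLES_BY_OPERATOR` table are hypothetical, and only the operator and role names are taken from the text above.

```python
# Hypothetical lookup: role sections each operator accepts, per the text above.
ROLES_BY_OPERATOR = {
    "tf-operator": {"ps", "worker"},
    "pytorch-operator": {"master", "worker"},
}


def unexpected_roles(operator, trial_config):
    """Return role sections in trial_config that the chosen operator does not use."""
    if operator not in ROLES_BY_OPERATOR:
        raise ValueError(f"unsupported operator: {operator!r}")
    allowed = ROLES_BY_OPERATOR[operator]
    # Every role name known to any operator; other keys (e.g. codeDir) are ignored.
    known_roles = set().union(*ROLES_BY_OPERATOR.values())
    return sorted(
        key for key in trial_config if key in known_roles and key not in allowed
    )
```

For example, a trial config that mixes `master` into a tf-operator experiment would be reported, while keys such as `codeDir` are left alone.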
## Supported storage type
NNI supports NFS and Azure Storage to store the code and output files. Users can set the storage type in the config file, together with the corresponding settings.
The settings for NFS storage are as follows:
```
```yaml
kubeflowConfig:
storage: nfs
nfs:
......@@ -55,8 +68,10 @@ kubeflowConfig:
# Your NFS server export path, like /var/nfs/nni
path: {your_nfs_server_export_path}
```
If you use Azure storage, you should set `kubeflowConfig` in your config YAML file as follows:
```
```yaml
kubeflowConfig:
storage: azureStorage
keyVault:
......@@ -67,10 +82,11 @@ kubeflowConfig:
azureShare: {your_azure_share_name}
```
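As a rough illustration of how the `storage` key selects between the two back ends, the sketch below reads it from an already-parsed `kubeflowConfig` mapping. The function name `storage_kind` is hypothetical, and treating a missing or unknown `storage` value as an error is an assumption rather than documented NNI behavior.

```python
# The two storage types named in this document.
SUPPORTED_STORAGE = ("nfs", "azureStorage")


def storage_kind(kubeflow_config):
    """Return the storage type from a parsed kubeflowConfig mapping.

    Assumption: a missing or unrecognized value is rejected here; NNI itself
    may behave differently.
    """
    kind = kubeflow_config.get("storage")
    if kind not in SUPPORTED_STORAGE:
        raise ValueError(
            f"storage must be one of {SUPPORTED_STORAGE}, got {kind!r}"
        )
    return kind
```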
## Run an experiment
Use `examples/trials/mnist` as an example. This is a tensorflow job, and use tf-operator of kubeflow. The NNI config YAML file's content is like:
```
Use `examples/trials/mnist` as an example. This is a TensorFlow job, and it uses the tf-operator of Kubeflow. The NNI config YAML file's content is like:
```yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 2
......@@ -122,7 +138,8 @@ kubeflowConfig:
Note: You should explicitly set `trainingServicePlatform: kubeflow` in NNI config YAML file if you want to start experiment in kubeflow mode.
If you want to run PyTorch jobs, you could set your config file as follows:
```
```yaml
authorName: default
experimentName: example_mnist_distributed_pytorch
trialConcurrency: 1
......@@ -166,37 +183,41 @@ kubeflowConfig:
```
Trial configuration in kubeflow mode has the following configuration keys:
* codeDir
* code directory, where you put training code and config files
* code directory, where you put training code and config files
* worker (required). This config section is used to configure tensorflow worker role
* replicas
* Required key. Should be positive number depends on how many replication your want to run for tensorflow worker role.
* command
* Required key. Command to launch your trial job, like ```python mnist.py```
* memoryMB
* Required key. Should be positive number based on your trial program's memory requirement
* cpuNum
* gpuNum
* image
* Required key. In kubeflow mode, your trial program will be scheduled by Kubernetes to run in [Pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/). This key is used to specify the Docker image used to create the pod where your trail program will run.
* We already build a docker image [msranni/nni](https://hub.docker.com/r/msranni/nni/) on [Docker Hub](https://hub.docker.com/). It contains NNI python packages, Node modules and javascript artifact files required to start experiment, and all of NNI dependencies. The docker file used to build this image can be found at [here](https://github.com/Microsoft/nni/tree/master/deployment/docker/Dockerfile). You can either use this image directly in your config file, or build your own image based on it.
* apiVersion
* Required key. The API version of your kubeflow.
* ps (optional). This config section is used to configure tensorflow parameter server role.
* master(optional). This config section is used to configure pytorch parameter server role.
* replicas
* Required key. Should be a positive number, depending on how many replicas you want to run for the TensorFlow worker role.
* command
* Required key. Command to launch your trial job, like ```python mnist.py```
* memoryMB
* Required key. Should be positive number based on your trial program's memory requirement
* cpuNum
* gpuNum
* image
* Required key. In kubeflow mode, your trial program will be scheduled by Kubernetes to run in a [Pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/). This key is used to specify the Docker image used to create the pod where your trial program will run.
* We have already built a Docker image [msranni/nni](https://hub.docker.com/r/msranni/nni/) on [Docker Hub](https://hub.docker.com/). It contains the NNI Python packages, Node modules and JavaScript artifact files required to start an experiment, and all of NNI's dependencies. The Dockerfile used to build this image can be found [here](https://github.com/Microsoft/nni/tree/master/deployment/docker/Dockerfile). You can either use this image directly in your config file, or build your own image based on it.
* apiVersion
* Required key. The API version of your Kubeflow.
* ps (optional). This config section is used to configure the TensorFlow parameter server role.
* master (optional). This config section is used to configure the PyTorch master role.
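The required keys listed for a role section can be checked with a few lines of code. The following is a minimal sketch, assuming the config has already been parsed into a dictionary; `role_problems` and `REQUIRED_ROLE_KEYS` are hypothetical names, and the check covers only the keys the list above marks as required inside a role section (`replicas`, `command`, `memoryMB`, `image`).

```python
# Keys the list above marks as required inside a role section such as `worker`.
REQUIRED_ROLE_KEYS = ("replicas", "command", "memoryMB", "image")


def _is_positive_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and value > 0
    )


def role_problems(role_name, section):
    """Return human-readable problems found in one role section."""
    problems = [
        f"{role_name}.{key} is missing"
        for key in REQUIRED_ROLE_KEYS
        if key not in section
    ]
    # replicas and memoryMB are described as positive numbers.
    for key in ("replicas", "memoryMB"):
        if key in section and not _is_positive_number(section[key]):
            problems.append(f"{role_name}.{key} must be a positive number")
    return problems
```

An empty list means the section passes this sketch's checks; it does not guarantee that NNI or Kubeflow will accept the configuration.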
Once you have filled in the NNI experiment config file and saved it (for example, as exp_kubeflow.yml), run the following command
```
```bash
nnictl create --config exp_kubeflow.yml
```
to start the experiment in kubeflow mode. NNI will create Kubeflow tfjob or pytorchjob for each trial, and the job name format is something like `nni_exp_{experiment_id}_trial_{trial_id}`.
You can see the kubeflow tfjob created by NNI in your Kubernetes dashboard.
You can see the Kubeflow tfjob created by NNI in your Kubernetes dashboard.
Notice: In kubeflow mode, NNIManager will start a REST server and listen on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is `8080`, the REST server will listen on `8081` to receive metrics from trial jobs running in Kubernetes. So you should open TCP port `8081` in your firewall rules to allow incoming traffic.
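The job naming pattern and the port rule above can be written as two small helpers. This is a sketch for illustration: the function names are hypothetical, the name pattern is the "something like" format quoted above rather than a guaranteed one, and the port range check is an added assumption.

```python
def trial_job_name(experiment_id, trial_id):
    """Build a job name following the pattern quoted in this document."""
    return f"nni_exp_{experiment_id}_trial_{trial_id}"


def rest_server_port(webui_port):
    """Return the port to open for metrics: the WebUI port plus 1."""
    # Assumption: reject ports whose successor would fall outside 1..65535.
    if not 1 <= webui_port <= 65534:
        raise ValueError(f"WebUI port out of range: {webui_port}")
    return webui_port + 1
```

With the default WebUI port of `8080`, `rest_server_port(8080)` gives `8081`, which is the port to allow in the firewall rule.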
Once a trial job is completed, you can goto NNI WebUI's overview page (like http://localhost:8080/oview) to check trial's information.
Once a trial job is completed, you can go to NNI WebUI's overview page (like http://localhost:8080/oview) to check trial's information.
## Version check
NNI has supported the version check feature since version 0.6; refer to [PaiMode](PaiMode.md) for details.
Any problems when using NNI in kubeflow mode, please create issues on [NNI Github repo](https://github.com/Microsoft/nni).
If you have any problems when using NNI in Kubeflow mode, please create an issue on the [NNI GitHub repo](https://github.com/Microsoft/nni).
......@@ -56,7 +56,7 @@ The hyper-parameters used in `Step 1.2 - Get predefined parameters` is defined i
"learning_rate":{"_type":"uniform","_value":[0.0001, 0.1]}
}
```
Refer to [SearchSpaceSpec.md](SearchSpaceSpec.md) to learn more about search space.
Refer to [SearchSpaceSpec.md](../Tutorial/SearchSpaceSpec.md) to learn more about search space.
>Step 3 - Define Experiment
......@@ -83,16 +83,16 @@ Let's use a simple trial example, e.g. mnist, provided by NNI. After you install
python ~/nni/examples/trials/mnist-annotation/mnist.py
This command will be filled in the YAML configure file below. Please refer to [here](Trials.md) for how to write your own trial.
This command will be filled in the YAML configure file below. Please refer to [here](../TrialExample/Trials.md) for how to write your own trial.
**Prepare tuner**: NNI supports several popular automl algorithms, including Random Search, Tree of Parzen Estimators (TPE), Evolution algorithm etc. Users can write their own tuner (refer to [here](CustomizeTuner.md)), but for simplicity, here we choose a tuner provided by NNI as below:
**Prepare tuner**: NNI supports several popular automl algorithms, including Random Search, Tree of Parzen Estimators (TPE), Evolution algorithm etc. Users can write their own tuner (refer to [here](../Tuner/CustomizeTuner.md)), but for simplicity, here we choose a tuner provided by NNI as below:
tuner:
builtinTunerName: TPE
classArgs:
optimize_mode: maximize
*builtinTunerName* is used to specify a tuner in NNI, *classArgs* are the arguments pass to the tuner (the spec of builtin tuners can be found [here](BuiltinTuner.md)), *optimization_mode* is to indicate whether you want to maximize or minimize your trial's result.
*builtinTunerName* is used to specify a tuner in NNI, *classArgs* are the arguments passed to the tuner (the spec of builtin tuners can be found [here](../Tuner/BuiltinTuner.md)), and *optimize_mode* indicates whether you want to maximize or minimize your trial's result.
**Prepare configure file**: Since you already know which trial code you are going to run and which tuner you are going to use, it is time to prepare the YAML configure file. NNI provides a demo configure file for each trial example; run `cat ~/nni/examples/trials/mnist-annotation/config.yml` to see it. Its content is basically shown below:
......@@ -124,13 +124,13 @@ trial:
gpuNum: 0
```
Here *useAnnotation* is true because this trial example uses our python annotation (refer to [here](AnnotationSpec.md) for details). For trial, we should provide *trialCommand* which is the command to run the trial, provide *trialCodeDir* where the trial code is. The command will be executed in this directory. We should also provide how many GPUs a trial requires.
Here *useAnnotation* is true because this trial example uses our python annotation (refer to [here](../Tutorial/AnnotationSpec.md) for details). For trial, we should provide *trialCommand* which is the command to run the trial, provide *trialCodeDir* where the trial code is. The command will be executed in this directory. We should also provide how many GPUs a trial requires.
With all these steps done, we can run the experiment with the following command:
nnictl create --config ~/nni/examples/trials/mnist-annotation/config.yml
You can refer to [here](Nnictl.md) for more usage guide of *nnictl* command line tool.
You can refer to [here](../Tutorial/Nnictl.md) for more usage guide of *nnictl* command line tool.
## View experiment results
The experiment is now running. Besides *nnictl*, NNI also provides a WebUI for you to view experiment progress, control your experiment, and use some other appealing features.
......
......@@ -3,7 +3,7 @@
NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in pai's container created by Docker.
## Setup environment
Install NNI, follow the install guide [here](QuickStart.md).
Install NNI, follow the install guide [here](../Tutorial/QuickStart.md).
## Run an experiment
Use `examples/trials/mnist-annotation` as an example. The NNI config YAML file's content is like:
......@@ -43,7 +43,7 @@ paiConfig:
Note: You should set `trainingServicePlatform: pai` in NNI config YAML file if you want to start experiment in pai mode.
Compared with LocalMode and [RemoteMachineMode](RemoteMachineMode.md), trial configuration in pai mode have these additional keys:
Compared with [LocalMode](LocalMode.md) and [RemoteMachineMode](RemoteMachineMode.md), trial configuration in pai mode has these additional keys:
* cpuNum
* Required key. Should be positive number based on your trial program's CPU requirement
* memoryMB
......@@ -66,17 +66,17 @@ nnictl create --config exp_pai.yml
```
to start the experiment in pai mode. NNI will create OpenPAI job for each trial, and the job name format is something like `nni_exp_{experiment_id}_trial_{trial_id}`.
You can see jobs created by NNI in the OpenPAI cluster's web portal, like:
![](../img/nni_pai_joblist.jpg)
![](../../img/nni_pai_joblist.jpg)
Notice: In pai mode, NNIManager will start a REST server and listen on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is `8080`, the REST server will listen on `8081` to receive metrics from trial jobs running in Kubernetes. So you should open TCP port `8081` in your firewall rules to allow incoming traffic.
Once a trial job is completed, you can go to NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.
Expand a trial's information in the trial list view and click the logPath link, like:
![](../img/nni_webui_joblist.jpg)
![](../../img/nni_webui_joblist.jpg)
And you will be redirected to HDFS web portal to browse the output files of that trial in HDFS:
![](../img/nni_trial_hdfs_output.jpg)
![](../../img/nni_trial_hdfs_output.jpg)
You can see there are three files in the output folder: stderr, stdout, and trial.log
......@@ -92,4 +92,4 @@ Check policy:
3. Note that the version check feature only checks the first two digits of the version. For example, NNIManager v0.6.1 could use trialKeeper v0.6 or trialKeeper v0.6.2, but could not use trialKeeper v0.5.1 or trialKeeper v0.7.
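The two-digit comparison in the check policy can be sketched as follows. This is an illustration of the rule as stated, not the code NNI actually runs; the function names are hypothetical, and the version strings are assumed to look like `v0.6.1` or `0.6`.

```python
def major_minor(version):
    """Return the first two numeric components of a version like 'v0.6.1'."""
    parts = version.lstrip("v").split(".")
    return tuple(int(part) for part in parts[:2])


def versions_compatible(manager_version, keeper_version):
    """True when NNIManager and trialKeeper agree on the first two digits."""
    return major_minor(manager_version) == major_minor(keeper_version)
```

Using the examples from the policy, `versions_compatible("v0.6.1", "v0.6.2")` holds, while `versions_compatible("v0.6.1", "v0.7")` does not.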
If you could not run your experiment and want to know if it is caused by version check, you could check your webUI, and there will be an error message about version check.
![](../img/version_check.png)
\ No newline at end of file
![](../../img/version_check.png)
\ No newline at end of file