Unverified Commit 12410686 authored by chicm-ms, committed by GitHub

Merge pull request #20 from microsoft/master

pull code
parents 611a45fc 61fec446
......@@ -2,7 +2,7 @@
A config file is needed when creating an experiment; the path of the config file is provided to nnictl.
The config file is written in YAML format and needs to be written correctly.
This document describes the rules for writing a config file, and provides some examples and templates.
- [Experiment config reference](#experiment-config-reference)
- [Template](#template)
......@@ -125,7 +125,7 @@ machineList:
## Configuration spec
* __authorName__
* Description
__authorName__ is the name of the author who creates the experiment.
TBD: add default value
......@@ -133,20 +133,20 @@ machineList:
* __experimentName__
* Description
__experimentName__ is the name of the experiment created.
TBD: add default value
* __trialConcurrency__
* Description
__trialConcurrency__ specifies the max number of trial jobs that run simultaneously.
Note: if trialGpuNum is bigger than the number of free GPUs, and the number of trial jobs running simultaneously cannot reach the trialConcurrency value, some trial jobs will be put into a queue to wait for GPU allocation.
* __maxExecDuration__
* Description
__maxExecDuration__ specifies the max duration time of an experiment. The unit of the time is {__s__, __m__, __h__, __d__}, which means {_seconds_, _minutes_, _hours_, _days_}.
Note: The maxExecDuration spec sets the duration of an experiment, not of a trial job. If the experiment reaches the max duration time, it will not stop, but it can no longer submit new trial jobs.
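As a hedged illustration, these top-level fields might appear together in a config file like this (the values are placeholders, not recommendations):
```
authorName: your_name
experimentName: example_experiment
trialConcurrency: 4    # at most 4 trial jobs run at the same time
maxExecDuration: 2h    # no new trial jobs are submitted after 2 hours
```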
......@@ -158,16 +158,16 @@ machineList:
* __maxTrialNum__
* Description
__maxTrialNum__ specifies the max number of trial jobs created by NNI, including succeeded and failed jobs.
* __trainingServicePlatform__
* Description
__trainingServicePlatform__ specifies the platform to run the experiment, including {__local__, __remote__, __pai__, __kubeflow__}.
* __local__ runs an experiment on the local Ubuntu machine.
* __remote__ submits trial jobs to remote Ubuntu machines; the __machineList__ field should be filled in order to set up the SSH connection to the remote machines.
* __pai__ submits trial jobs to Microsoft's [OpenPAI](https://github.com/Microsoft/pai). For more details of PAI configuration, please refer to the [PAI mode doc](./PaiMode.md)
......@@ -396,7 +396,7 @@ machineList:
__localConfig__ is applicable only if __trainingServicePlatform__ is set to `local`; otherwise there should not be a __localConfig__ section in the configuration file.
* __gpuIndices__
__gpuIndices__ is used to specify designated GPU devices for NNI; if it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified; multiple GPU indices are separated by a comma (`,`), such as `1` or `0,1,3`.
* __maxTrialNumPerGpu__
......@@ -413,11 +413,11 @@ machineList:
__machineList__ should be set if __trainingServicePlatform__ is set to remote, or it should be empty.
* __ip__
__ip__ is the IP address of the remote machine.
* __port__
__port__ is the SSH port used to connect to the machine.
Note: if users leave the port empty, the default value will be 22.
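For illustration, a minimal `machineList` entry following these rules might look like this (the address and credentials are placeholders):
```
machineList:
  - ip: 10.1.1.1
    port: 22              # optional; defaults to 22 if left empty
    username: your_user
    passwd: your_password
```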
......@@ -439,7 +439,7 @@ machineList:
__passphrase__ is used to protect the SSH key, and could be empty if users don't have a passphrase.
* __gpuIndices__
__gpuIndices__ is used to specify designated GPU devices for NNI on this remote machine; if it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified; multiple GPU indices are separated by a comma (`,`), such as `1` or `0,1,3`.
* __maxTrialNumPerGpu__
......
......@@ -21,10 +21,10 @@ Please try the following solutions in turn:
Your machine doesn't have an eth0 device; please set [nniManagerIp](ExperimentConfig.md) in your config file manually.
### Exceed the MaxDuration but didn't stop
When the duration of an experiment reaches the maximum duration, nniManager will not create new trials, but the existing trials will continue unless the user manually stops the experiment.
### Could not stop an experiment using `nnictl stop`
If you upgrade your NNI or delete some NNI config files while an experiment is running, this kind of issue may happen because of the lost config files. You could use `ps -ef | grep node` to find the PID of your experiment, and use `kill -9 {pid}` to kill it manually.
### Could not get `default metric` in webUI of virtual machines
Configure the network mode of the virtual machine to bridge mode or another mode that makes the virtual machine accessible from an external machine, and make sure the virtual machine's port is not blocked by a firewall.
......@@ -34,10 +34,10 @@ Unable to open the WebUI may have the following reasons:
* http://127.0.0.1, http://172.17.0.1 and http://10.0.0.15 refer to localhost. If you start your experiment on a server or remote machine, you can replace the IP with your server IP to view the WebUI, like http://[your_server_ip]:8080
* If you still can't see the WebUI after you use the server IP, you can check the proxy and the firewall of your machine. Or use the browser on the machine where you start your NNI experiment.
* Another reason may be that your experiment failed and NNI failed to get the experiment information. You can check the log of NNIManager in the following directory: ~/nni/experiment/[your_experiment_id]/log/nnimanager.log
### NNI on Windows problems
Please refer to [NNI on Windows](NniOnWindows.md).
### Help us improve
Please search https://github.com/Microsoft/nni/issues to see whether other people have already reported the problem, and create a new issue if there is no existing one.
......@@ -9,8 +9,8 @@ NNI supports running experiment using [FrameworkController](https://github.com/M
4. Prepare an **NFS server** and export a general purpose mount (we recommend mapping your NFS server path with the `root_squash` option, otherwise permission issues may arise when NNI copies files to NFS. Refer to this [page](https://linux.die.net/man/5/exports) to learn what the root_squash option is), or **Azure File Storage**.
5. Install an **NFS client** on the machine where you install NNI and run nnictl to create the experiment. Run this command to install the NFSv4 client:
```
apt-get install nfs-common
```
6. Install **NNI**, follow the install guide [here](QuickStart.md).
......@@ -82,9 +82,9 @@ frameworkcontrollerConfig:
accountName: {your_storage_account_name}
azureShare: {your_azure_share_name}
```
Note: You should explicitly set `trainingServicePlatform: frameworkcontroller` in the NNI config YAML file if you want to start the experiment in frameworkcontroller mode.
The trial's config format for NNI frameworkcontroller mode is a simplified version of frameworkcontroller's official config; you could refer to the [tensorflow example of frameworkcontroller](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for a deeper understanding.
Trial configuration in frameworkcontroller mode has the following configuration keys:
* taskRoles: you could set multiple task roles in the config file, and each task role is a basic unit to process in the kubernetes cluster.
* name: the name of the specified task role, like "worker", "ps", "master".
......
# GBDT in NNI
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion as other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
Gradient boosting decision trees have many popular implementations, such as [lightgbm](https://github.com/Microsoft/LightGBM), [xgboost](https://github.com/dmlc/xgboost), and [catboost](https://github.com/catboost/catboost). GBDT is a great tool for solving traditional machine learning problems. Since GBDT is a robust algorithm, it can be used in many domains. The better the hyper-parameters for GBDT, the better performance you can achieve.
NNI is a great platform for tuning hyper-parameters; you can try various built-in search algorithms in NNI and run multiple trials concurrently.
## 1. Search Space in GBDT
There are many hyper-parameters in GBDT, but which parameters will affect performance or speed? Based on some practical experience, here are some suggestions (taking lightgbm as an example):
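As a hedged illustration only (the parameter names follow lightgbm, and the ranges are arbitrary examples rather than tuned recommendations), a search space file for such an experiment might look like:
```json
{
    "num_leaves": {"_type": "randint", "_value": [20, 150]},
    "learning_rate": {"_type": "loguniform", "_value": [0.001, 0.1]},
    "bagging_fraction": {"_type": "uniform", "_value": [0.7, 1.0]},
    "bagging_freq": {"_type": "choice", "_value": [1, 2, 4, 8]}
}
```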
......
# General Programming Interface for Neural Architecture Search (experimental feature)
_*This is an experimental feature. Currently, we have only implemented the general NAS programming interface. Weight sharing and one-shot NAS based on this programming interface will be supported in future releases._
Automatic neural architecture search is taking an increasingly important role in finding better models. Recent research has proved the feasibility of automatic NAS and has found models that beat manually designed and tuned ones. Some representative works are [NASNet][2], [ENAS][1], [DARTS][3], [Network Morphism][4], and [Evolution][5]. New innovations keep emerging. However, it takes great effort to implement those algorithms, and it is hard to reuse the code base of one algorithm to implement another.
To facilitate NAS innovations (e.g., design/implement new NAS models, compare different NAS models side-by-side), an easy-to-use and flexible programming interface is crucial.
## Programming interface
......@@ -10,19 +12,21 @@ To facilitate NAS innovations (e.g., design/implement new NAS models, compare di
We designed a simple and flexible programming interface based on [NNI annotation](./AnnotationSpec.md). It is elaborated through examples below.
### Example: choose an operator for a layer
When designing the following model, there might be several choices in the fourth layer that could make this model perform well. In the script of this model, we can use an annotation for the fourth layer as shown in the figure. In this annotation, there are five fields in total:
![](../img/example_layerchoice.png)
* __layer_choice__: It is a list of function calls; each function should be defined in the user's script or imported libraries. The input arguments of the function should follow the format: `def XXX(inputs, arg2, arg3, ...)`, where `inputs` is a list with two elements. One is the list of `fixed_inputs`, and the other is a list of the chosen inputs from `optional_inputs`. `conv` and `pool` in the figure are examples of function definitions. For the function calls in this list, there is no need to write the first argument (i.e., `inputs`). Note that only one of the function calls is chosen for this layer.
* __fixed_inputs__: It is a list of variables; a variable could be an output tensor from a previous layer. The variable could be `layer_output` of another `nni.mutable_layer` before this layer, or another python variable before this layer. All the variables in this list will be fed into the chosen function in `layer_choice` (as the first element of the `inputs` list).
* __optional_inputs__: It is a list of variables; a variable could be an output tensor from a previous layer. The variable could be `layer_output` of another `nni.mutable_layer` before this layer, or another python variable before this layer. Only `optional_input_size` variables will be fed into the chosen function in `layer_choice` (as the second element of the `inputs` list).
* __optional_input_size__: It indicates how many inputs are chosen from `optional_inputs`. It could be a number or a range. A range [1,3] means it chooses 1, 2, or 3 inputs.
* __layer_output__: The name of the output(s) of this layer, in this case it represents the return of the function call in `layer_choice`. This will be a variable name that can be used in the following python code or `nni.mutable_layer`.
There are two ways to write annotation for this example. For the upper one, `input` of the function calls is `[[],[out3]]`. For the bottom one, `input` is `[[out3],[]]`.
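As a schematic sketch only (the function and variable names are illustrative, and the exact syntax is defined by the [NNI annotation](./AnnotationSpec.md) spec), such an annotation might look like:
```python
"""@nni.mutable_layer(
{
    layer_choice: [conv(size=3), conv(size=5), pool(size=3)],
    fixed_inputs: [out3],
    optional_inputs: [out1, out2],
    optional_input_size: 1,
    layer_output: out4
}
)"""
```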
__Debugging__: We provide an `nnictl trial codegen` command to help debug your NAS programming code on NNI. If the trial with trial_id `XXX` in your experiment `YYY` failed, you could run `nnictl trial codegen YYY --trial_id XXX` to generate executable code for this trial under your current directory. With this code, you can directly run the trial command without NNI to check why this trial failed. Basically, this command compiles your trial code and replaces the NNI NAS code with the actually chosen layers and inputs.
### Example: choose input connections for a layer
......@@ -32,7 +36,7 @@ Designing connections of layers is critical for making a high performance model.
### Example: choose both operators and connections
In this example, we choose one from the three operators and choose two connections for it. As there are multiple variables in `inputs`, we call `concat` at the beginning of the functions.
![](../img/example_combined.png)
......@@ -42,10 +46,9 @@ To illustrate the convenience of the programming interface, we use the interface
![](../img/example_enas.png)
## Unified NAS search space specification
After finishing the trial code through the annotation above, users have implicitly specified the search space of neural architectures in the code. Based on the code, NNI will automatically generate a search space file which could be fed into tuning algorithms. This search space file follows the following JSON format.
```json
{
......@@ -78,7 +81,7 @@ Accordingly, a specified neural architecture (generated by tuning algorithm) is
}
```
With the specification of the format of search space and architecture (choice) expression, users are free to implement various (general) tuning algorithms for neural architecture search on NNI. One future work is to provide a general NAS algorithm.
=============================================================
......@@ -90,11 +93,11 @@ NNI's annotation compiler transforms the annotated trial code to the code that c
![](../img/nas_on_nni.png)
The above figure shows how the trial code runs on NNI. `nnictl` processes user trial code to generate a search space file and compiled trial code. The former is fed to tuner, and the latter is used to run trials.
[Simple example of NAS on NNI](https://github.com/microsoft/nni/tree/v0.8/examples/trials/mnist-nas).
### [__TODO__] Weight sharing
Sharing weights among chosen architectures (i.e., trials) could speed up model search. For example, properly inheriting the weights of completed trials could speed up the convergence of new trials. One-Shot NAS (e.g., ENAS, DARTS) is more aggressive: the training of different architectures (i.e., subgraphs) shares the same copy of the weights of the full graph.
......@@ -102,9 +105,9 @@ Sharing weights among chosen architectures (i.e., trials) could speedup model se
We believe weight sharing (transferring) plays a key role in speeding up NAS, while finding efficient ways of sharing weights is still a hot research topic. We provide a key-value store for users to store and load weights. Tuners and trials use a provided KV client lib to access the storage.
Example of weight sharing on NNI.
### [__TODO__] Support of One-Shot NAS
One-Shot NAS is a popular approach to finding a good neural architecture within a limited time and resource budget. Basically, it builds a full graph based on the search space and uses gradient descent to eventually find the best subgraph. There are different training approaches, such as [training subgraphs (per mini-batch)][1], [training the full graph through dropout][6], and [training with architecture weights (regularization)][3]. Here we focus on the first approach, i.e., training subgraphs (ENAS).
......@@ -112,27 +115,26 @@ With the same annotated trial code, users could choose One-Shot NAS as execution
![](../img/one-shot_training.png)
The design of One-Shot NAS on NNI is shown in the above figure. One-Shot NAS usually has only one trial job with the full graph. NNI supports running multiple such trial jobs, each of which runs independently. As One-Shot NAS is not stable, running multiple instances helps find a better model. Moreover, trial jobs are also able to synchronize weights during running (i.e., there is only one copy of weights, like the asynchronous parameter-server mode). This may speed up convergence.
Example of One-Shot NAS on NNI.
## [__TODO__] General tuning algorithms for NAS
Like hyperparameter tuning, a relatively general algorithm for NAS is required. The general programming interface makes this task easier to some extent. We have an RL-based tuner algorithm for NAS from our contributors. We look forward to efforts from the community to design and implement better NAS algorithms.
More tuning algorithms for NAS.
## [__TODO__] Export best neural architecture and code
After the NNI experiment is done, users could run `nnictl experiment export --code` to export the trial code with the best neural architecture.
## Conclusion and Future work
There could be different NAS algorithms and execution modes, but they could be supported with the same programming interface as demonstrated above.
There are many interesting research topics in this area, both system and machine learning.
[1]: https://arxiv.org/abs/1802.03268
[2]: https://arxiv.org/abs/1707.07012
......
......@@ -7,7 +7,7 @@ TrainingService is a module related to platform management and job schedule in N
## System architecture
![](../img/NNIDesign.jpg)
The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of the system, in charge of calling TrainingService to manage trial jobs and handling communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module that manages trial jobs; it communicates with the NNIManager module and has different instances for different training platforms. For the time being, NNI supports the local platform, [remote platform](RemoteMachineMode.md), [PAI platform](PaiMode.md), [kubeflow platform](KubeflowMode.md) and [FrameworkController platform](FrameworkController.md).
In this document, we introduce the brief design of TrainingService. If users want to add a new TrainingService instance, they just need to implement a child class of TrainingService; there is no need to understand the code details of NNIManager, Dispatcher or other modules.
## Folder structure of code
......@@ -62,7 +62,7 @@ abstract class TrainingService {
```
The parent class of TrainingService has a few abstract functions; users need to inherit the parent class and implement all of them.
__setClusterMetadata(key: string, value: string)__
ClusterMetadata is the data related to platform details. For example, the ClusterMetadata defined for the remote machine platform is:
```
export class RemoteMachineMeta {
......@@ -76,7 +76,7 @@ export class RemoteMachineMeta {
/* GPU Reservation info, the key is GPU index, the value is the job id which reserves this GPU*/
public gpuReservation : Map<number, string>;
constructor(ip : string, port : number, username : string, passwd : string,
sshKeyPath : string, passphrase : string) {
this.ip = ip;
this.port = port;
......@@ -90,10 +90,10 @@ export class RemoteMachineMeta {
```
The metadata includes the host address, the username or other configuration related to the platform. Users need to define their own metadata format, and set the metadata instance in this function. This function is called before the experiment is started to set the configuration of remote machines.
__getClusterMetadata(key: string)__
This function returns the metadata value according to the key; it could be left empty if users don't need to use it.
__submitTrialJob(form: JobApplicationForm)__
SubmitTrialJob is a function to submit new trial jobs; users should generate a job instance of the TrialJobDetail type. TrialJobDetail is defined as follows:
```
interface TrialJobDetail {
......@@ -112,38 +112,38 @@ interface TrialJobDetail {
```
Depending on the implementation, users could put the job detail into a job queue and keep fetching jobs from the queue to prepare and start them. Or they could finish the preparing and running process in this function, and return the job detail after the submit work is done.
__cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean)__
If this function is called, the trial started by the platform should be canceled. Different platforms have different methods to cancel a running job; this function should be implemented according to the specific platform.
__updateTrialJob(trialJobId: string, form: JobApplicationForm)__
This function is called to update the trial job's status; the status should be detected according to the specific platform and updated to `RUNNING`, `SUCCEED`, `FAILED` etc.
__getTrialJob(trialJobId: string)__
This function returns a trialJob detail instance according to trialJobId.
__listTrialJobs()__
Users should put all trial job detail information into a list, and return the list.
__addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)__
NNI will hold an EventEmitter to get job metrics; if new job metrics are detected, the EventEmitter will be triggered. Users should start the EventEmitter in this function.
__removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)__
Close the EventEmitter.
__run()__
The run() function is the main loop in TrainingService; users could set a while loop to execute their logic and finish executing it when the experiment is stopped.
__cleanUp()__
This function is called to clean up the environment when an experiment is stopped. Users should do the platform-related cleanup operations in this function.
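Putting these functions together, a minimal child class might be sketched as follows (the import path, async signatures, and return types are assumptions based on the excerpt above, not NNI's exact interface):
```
// A minimal sketch of a custom TrainingService child class.
import { TrainingService, TrialJobDetail, TrialJobMetric, JobApplicationForm } from 'common/trainingService'; // path is an assumption

class ExampleTrainingService extends TrainingService {
    private readonly jobs: Map<string, TrialJobDetail> = new Map();
    private stopping: boolean = false;

    public async setClusterMetadata(key: string, value: string): Promise<void> {
        // Parse and store platform configuration (hosts, credentials, ...).
    }

    public async getClusterMetadata(key: string): Promise<string> {
        return ''; // Optional: return stored metadata by key.
    }

    public async submitTrialJob(form: JobApplicationForm): Promise<TrialJobDetail> {
        // Create a TrialJobDetail and either queue it or start it right away.
        throw new Error('Method not implemented.');
    }

    public async cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean): Promise<void> {
        // Cancel the job using the platform's own mechanism.
    }

    public async updateTrialJob(trialJobId: string, form: JobApplicationForm): Promise<TrialJobDetail> {
        throw new Error('Method not implemented.');
    }

    public async getTrialJob(trialJobId: string): Promise<TrialJobDetail> {
        return this.jobs.get(trialJobId)!;
    }

    public async listTrialJobs(): Promise<TrialJobDetail[]> {
        return Array.from(this.jobs.values());
    }

    public addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void {
        // Register the listener on an EventEmitter that reports new metrics.
    }

    public removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void {
        // Deregister the listener / close the EventEmitter.
    }

    public async run(): Promise<void> {
        while (!this.stopping) {
            // Poll the platform, update job statuses, emit metric events.
            await new Promise((resolve) => setTimeout(resolve, 5000));
        }
    }

    public async cleanUp(): Promise<void> {
        this.stopping = true; // Plus any platform-related cleanup.
    }
}
```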
## TrialKeeper tool
NNI offers a TrialKeeper tool to help maintain trial jobs. Users can find the source code in `nni/tools/nni_trial_tool`. If users want to run trial jobs on a cloud platform, this tool is a fine choice to help keep trials running on the platform.
The running architecture of TrialKeeper is shown as follows:
![](../img/trialkeeper.jpg)
When users submit a trial job to a cloud platform, they should wrap their trial command into TrialKeeper and start a TrialKeeper process on the cloud platform. Notice that TrialKeeper uses a RESTful server to communicate with TrainingService; users should start a RESTful server on the local machine to receive metrics sent from TrialKeeper. The source code of the RESTful server can be found in `nni/src/nni_manager/training_service/common/clusterJobRestServer.ts`.
## Reference
For more information about how to debug, please [refer](HowToDebug.md).
For guidelines on how to contribute, please [refer](Contributing.md).
......@@ -18,12 +18,12 @@ If you have installed the docker package in your local machine, you could start
For example, you could start a new docker container with the following command:
```
docker run -i -t -p [hostPort]:[containerPort] [image]
```
`-i:` Start a docker container in interactive mode.
`-t:` Docker assigns the container an input terminal.
`-p:` Port mapping, maps a host port to a container port.
For more information about docker command, please [refer](https://docs.docker.com/v17.09/edge/engine/reference/run/)
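For instance, a hedged invocation that runs NNI's official image and maps the default WebUI port (the port choice here is just an illustration) might be:
```
docker run -i -t -p 8080:8080 msranni/nni
```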
......@@ -33,11 +33,11 @@ Note:
```
### Step3: Run NNI in docker container
If you start a docker container using NNI's official image `msranni/nni`, you can directly start NNI experiments by using the `nnictl` command. Our official image has NNI's running environment and basic Python and deep learning framework environments.
If you start your own docker image, you may need to install the NNI package first; please [refer](Installation.md).
If you want to run NNI's official examples, you may need to clone the NNI repo from GitHub using
```
git clone https://github.com/Microsoft/nni.git
```
......
......@@ -7,6 +7,7 @@ Currently we support installation on Linux, Mac and Windows(local, remote and pa
* __Install NNI through pip__
Prerequisite: `python >= 3.5`
```bash
python3 -m pip install --upgrade nni
```
......@@ -14,6 +15,7 @@ Currently we support installation on Linux, Mac and Windows(local, remote and pa
* __Install NNI through source code__
Prerequisite: `python >= 3.5`, `git`, `wget`
```bash
git clone -b v0.8 https://github.com/Microsoft/nni.git
cd nni
......@@ -24,15 +26,10 @@ Currently we support installation on Linux, Mac and Windows(local, remote and pa
You can also install NNI in a docker image. Please follow the instructions [here](https://github.com/Microsoft/nni/tree/master/deployment/docker/README.md) to build NNI docker image. The NNI docker image can also be retrieved from Docker Hub through the command `docker pull msranni/nni:latest`.
## **Installation on Windows**
Anaconda or Miniconda is highly recommended.
* __Install NNI through pip__
Prerequisite: `python(64-bit) >= 3.5`
......@@ -44,13 +41,11 @@ Currently we support installation on Linux, Mac and Windows(local, remote and pa
* __Install NNI through source code__
Prerequisite: `python >= 3.5`, `git`, `PowerShell`.
You can install NNI as administrator or as the current user as follows:
```bash
git clone -b v0.8 https://github.com/Microsoft/nni.git
cd nni
powershell -ExecutionPolicy Bypass -file install.ps1
```
## **System requirements**
......
**Run an Experiment on Kubeflow**
===
Now NNI supports running experiments on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service (AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), and an Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is set up to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as a Kubeflow job in the Kubernetes cluster.
## Prerequisite for on-premises Kubernetes Service
1. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this [guideline](https://kubernetes.io/docs/setup/) to set up Kubernetes
2. Download, set up, and deploy **Kubeflow** to your Kubernetes cluster. Follow this [guideline](https://www.kubeflow.org/docs/started/getting-started/) to set up Kubeflow
3. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this [guideline](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
4. If your NNI trial job needs GPU resources, you should follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure the **Nvidia device plugin for Kubernetes**.
5. Prepare an **NFS server** and export a general purpose mount (we recommend mapping your NFS server path with the `root_squash` option, otherwise permission issues may arise when NNI copies files to NFS. Refer to this [page](https://linux.die.net/man/5/exports) to learn what the root_squash option is), or **Azure File Storage**.
6. Install an **NFS client** on the machine where you install NNI and run nnictl to create the experiment. Run this command to install the NFSv4 client:
```
apt-get install nfs-common
```
7. Install **NNI**, follow the install guide [here](QuickStart.md).
......@@ -22,11 +22,11 @@ Now NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/ku
4. Follow the [guideline](https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal) to create an Azure file storage account. If you use Azure Kubernetes Service, NNI needs the Azure Storage Service to store code files and output files.
5. To access the Azure storage service, NNI needs the access key of the storage account, and NNI uses the [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) service to protect your private key. Set up the Azure Key Vault service and add a secret to Key Vault to store the access key of the Azure storage account. Follow this [guideline](https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli) to store the access key.
## Design
![](../img/kubeflow_training_design.png)
The Kubeflow training service instantiates a Kubernetes REST client to interact with your K8s cluster's API server.
For each trial, we will upload all the files in your local codeDir path (configured in nni_config.yml), together with NNI generated files like parameter.cfg, into a storage volume. Right now we support two kinds of storage volumes: [nfs](https://en.wikipedia.org/wiki/Network_File_System) and [azure file storage](https://azure.microsoft.com/en-us/services/storage/files/); you should configure the storage volume in the NNI config YAML file. After the files are prepared, the Kubeflow training service will call the K8S REST API to create Kubeflow jobs ([tf-operator](https://github.com/kubeflow/tf-operator) job or [pytorch-operator](https://github.com/kubeflow/pytorch-operator) job) in K8S, and mount your storage volume into the job's pod. Output files of the Kubeflow job, like stdout, stderr, trial.log or model files, will also be copied back to the storage volume. NNI will show the storage volume's URL for each trial in the WebUI, to allow users to browse the log files and the job's output files.
## Supported operator
NNI only supports the tf-operator and pytorch-operator of Kubeflow; other operators are not tested.
......@@ -41,10 +41,10 @@ The setting of pytorch-operator:
kubeflowConfig:
operator: pytorch-operator
```
If users want to use the tf-operator, they could set `ps` and `worker` in the trial config, as sketched below. If users want to use the pytorch-operator, they could set `master` and `worker` in the trial config.
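For illustration, a tf-operator trial section might be sketched like this (the replica counts, resources, commands, and image are placeholders, and some keys are assumptions rather than the exact spec):
```
trial:
  codeDir: ~/nni/examples/trials/mnist
  ps:
    replicas: 1
    command: python3 mnist.py
    gpuNum: 0
    cpuNum: 1
    memoryMB: 8192
    image: msranni/nni:latest
  worker:
    replicas: 2
    command: python3 mnist.py
    gpuNum: 1
    cpuNum: 1
    memoryMB: 8192
    image: msranni/nni:latest
```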
## Supported storage type
NNI supports NFS and Azure Storage to store code and output files; users could set the storage type in the config file and set the corresponding configuration.
The setting for NFS storage is as follows:
```
kubeflowConfig:
......@@ -69,7 +69,7 @@ kubeflowConfig:
## Run an experiment
Use `examples/trials/mnist` as an example. This is a tensorflow job that uses the tf-operator of Kubeflow. The NNI config YAML file's content is like:
```
authorName: default
experimentName: example_mnist
......@@ -119,7 +119,7 @@ kubeflowConfig:
path: {your_nfs_server_export_path}
```
Note: You should explicitly set `trainingServicePlatform: kubeflow` in the NNI config YAML file if you want to start the experiment in kubeflow mode.
If you want to run PyTorch jobs, you could set your config file as follows:
```
......@@ -178,7 +178,7 @@ Trial configuration in kubeflow mode have the following configuration keys:
* cpuNum
* gpuNum
* image
* Required key. In kubeflow mode, your trial program will be scheduled by Kubernetes to run in a [Pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/). This key is used to specify the Docker image used to create the pod where your trial program will run.
* We have already built a docker image [msranni/nni](https://hub.docker.com/r/msranni/nni/) on [Docker Hub](https://hub.docker.com/). It contains the NNI python packages, the Node modules and JavaScript artifact files required to start an experiment, and all of NNI's dependencies. The docker file used to build this image can be found [here](https://github.com/Microsoft/nni/tree/master/deployment/docker/Dockerfile). You can either use this image directly in your config file, or build your own image based on it.
* apiVersion
* Required key. The API version of your kubeflow.
......@@ -189,12 +189,12 @@ Once complete to fill NNI experiment config file and save (for example, save as
```
nnictl create --config exp_kubeflow.yml
```
to start the experiment in kubeflow mode. NNI will create a Kubeflow tfjob or pytorchjob for each trial, and the job name format is something like `nni_exp_{experiment_id}_trial_{trial_id}`.
You can see the Kubeflow tfjob created by NNI in your Kubernetes dashboard.
Notice: In kubeflow mode, NNIManager will start a REST server and listen on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is `8080`, the REST server will listen on `8081` to receive metrics from trial jobs running in Kubernetes. So you should `enable 8081` TCP port in your firewall rule to allow incoming traffic.
Once a trial job is completed, you can go to NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.
## Version check
NNI supports the version check feature since version 0.6, [refer](PaiMode.md)
......
......@@ -19,26 +19,26 @@ To enable NNI API, make the following changes:
RECEIVED_PARAMS = nni.get_next_parameter()
to get the hyper-parameters' values assigned by the tuner. `RECEIVED_PARAMS` is an object, for example:
{"conv_size": 2, "hidden_size": 124, "learning_rate": 0.0307, "dropout_rate": 0.2029}
1.3 Report NNI results
Use the API:
`nni.report_intermediate_result(accuracy)`
to send `accuracy` to the assessor.
Use the API:
`nni.report_final_result(accuracy)`
to send `accuracy` to the tuner.
~~~~
We have made the changes and saved them to `mnist.py`.
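A minimal sketch of a trial script that combines these three API calls (the training step is a placeholder, not the actual mnist code):
```python
import nni

def train_one_epoch(params):
    # Placeholder: a real trial would build and train a model here
    # and return its measured accuracy.
    return 0.0

if __name__ == '__main__':
    RECEIVED_PARAMS = nni.get_next_parameter()     # hyper-parameters from the tuner
    accuracy = 0.0
    for epoch in range(10):
        accuracy = train_one_epoch(RECEIVED_PARAMS)
        nni.report_intermediate_result(accuracy)   # sent to the assessor
    nni.report_final_result(accuracy)              # sent to the tuner
```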
**NOTE**:
~~~~
accuracy - The `accuracy` could be any python object, but if you use NNI built-in tuner/assessor, `accuracy` should be a numerical variable (e.g. float, int).
assessor - The assessor will decide which trial should early stop based on the history performance of trial (intermediate result of one trial).
......@@ -47,7 +47,7 @@ tuner - The tuner will generate next parameters/architecture based on the exp
>Step 2 - Define SearchSpace
The hyper-parameters used in `Step 1.2 - Get predefined parameters` are defined in a `search_space.json` file like below:
```
{
"dropout_rate":{"_type":"uniform","_value":[0.1,0.5]},
......@@ -76,10 +76,10 @@ To run an experiment in NNI, you only needed:
* Provide a YAML experiment configure file
* (optional) Provide or choose an assessor
**Prepare trial**:
>A set of examples can be found in ~/nni/examples after your installation, run `ls ~/nni/examples/trials` to see all the trial examples.
Let's use a simple trial example, e.g. mnist, provided by NNI. After you install NNI, the NNI examples are put in ~/nni/examples; run `ls ~/nni/examples/trials` to see all the trial examples. You can simply execute the following command to run the NNI mnist example:
python ~/nni/examples/trials/mnist-annotation/mnist.py
......@@ -109,10 +109,10 @@ maxExecDuration: 3h
# empty means never stop
maxTrialNum: 100
# choice: local, remote
# choice: local, remote
trainingServicePlatform: local
# choice: true, false
useAnnotation: true
tuner:
builtinTunerName: TPE
......@@ -122,7 +122,7 @@ trial:
command: python mnist.py
codeDir: ~/nni/examples/trials/mnist-annotation
gpuNum: 0
```
Here *useAnnotation* is true because this trial example uses our python annotation (refer to [here](AnnotationSpec.md) for details). For the trial, we should provide *trialCommand*, which is the command to run the trial, and *trialCodeDir*, where the trial code is. The command will be executed in this directory. We should also provide how many GPUs a trial requires.
......@@ -136,7 +136,7 @@ You can refer to [here](Nnictl.md) for more usage guide of *nnictl* command line
The experiment has been running now. Other than *nnictl*, NNI also provides WebUI for you to view experiment progress, to control your experiment, and some other appealing features.
## Using multiple local GPUs to speed up search
The following steps assume that you have 4 NVIDIA GPUs installed locally and [tensorflow with GPU support](https://www.tensorflow.org/install/gpu). The demo enables 4 concurrent trial jobs, and each trial job uses 1 GPU.
**Prepare configure file**: NNI provides a demo configuration file for the setting above; `cat ~/nni/examples/trials/mnist-annotation/config_gpu.yml` to see it. The trialConcurrency and gpuNum are different from the basic configure file:
......@@ -152,7 +152,7 @@ trial:
command: python mnist.py
codeDir: ~/nni/examples/trials/mnist-annotation
gpuNum: 1
```
We can run the experiment with the following command:
......
......@@ -6,7 +6,7 @@ Typically each trial job gets a single configuration (e.g., hyperparameters) fro
2. Some types of models have to be trained phase by phase, where the configuration of the next phase depends on the results of the previous phase(s). For example, to find the best quantization for a model, the training procedure is often as follows: the auto-quantization algorithm (i.e., the tuner in NNI) chooses a size in bits (e.g., 16 bits), a trial job gets this configuration, trains the model for some epochs, and reports the result (e.g., accuracy). The algorithm receives this result and decides whether to change 16 bits to 8 bits, or back to 32 bits. This process is repeated for a configured number of times.
The above cases can be supported by the same feature, i.e., multi-phase execution. To support those cases, a trial job should basically be able to request multiple configurations from the tuner. The tuner is aware of whether two configuration requests are from the same trial job or different ones. Also, in multi-phase a trial job can report multiple final results.
Note that `nni.get_next_parameter()` and `nni.report_final_result()` should be called sequentially: __call the former, then call the latter; and repeat this pattern__. If `nni.get_next_parameter()` is called multiple times consecutively, and then `nni.report_final_result()` is called once, the result is associated with the last configuration, which was retrieved from the last get_next_parameter call. So there is no result associated with previous get_next_parameter calls, and this may break some multi-phase algorithms.
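A hedged sketch of this call pattern in a multi-phase trial (the training function is a placeholder):
```python
import nni

def train_with(params):
    # Placeholder: train for some epochs with the given configuration
    # and return the resulting metric.
    return 0.0

# Each phase requests one configuration, trains, and reports one final result.
for phase in range(4):
    params = nni.get_next_parameter()  # the tuner knows these calls come from one trial
    result = train_with(params)
    nni.report_final_result(result)    # associated with the configuration above
```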
......
......@@ -54,11 +54,11 @@ net = build_graph_from_json(RCV_CONFIG)
nni.report_final_result(best_acc)
```
If you want to save and **load the best model**, the following methods are recommended.
```python
# 1. Use NNI API
## You can get the best model ID from WebUI
## or `nni/experiments/experiment_id/log/model_path/best_model.txt`
## read the json string from model file and load it with NNI API
......@@ -66,7 +66,7 @@ with open("best-model.json") as json_file:
json_of_model = json_file.read()
model = build_graph_from_json(json_of_model)
# 2. Use Framework API (Related to Framework)
## 2.1 Keras API
## Save the model with Keras API in the trial code
......@@ -106,9 +106,9 @@ The tuner has a lot of different files, functions and classes. Here we will only
- `networkmorphism_tuner.py` is a tuner which uses network morphism techniques.
- `bayesian.py` is a Bayesian method to estimate the metric of unseen models based on the models we have already searched.
- `graph.py` is the meta graph data structure. The Graph class represents the neural architecture graph of a model.
- Graph extracts the neural architecture graph from a model.
- Each node in the graph is an intermediate tensor between layers.
- Each layer is an edge in the graph.
- Notably, multiple edges may refer to the same layer.
......
......@@ -4,7 +4,7 @@ Currently we support local, remote and pai mode on Windows. Windows 10.1809 is w
## **Installation on Windows**
Please refer to [Installation](Installation.md) for more details.
When these things are done, use the **config_windows.yml** configuration to start an experiment for validation.
......@@ -21,16 +21,6 @@ For other examples you need to change trial command `python3` into `python` in e
Make sure the C++ 14.0 compiler is installed.
>building 'simplejson._speedups' extension error: [WinError 3] The system cannot find the path specified
### Trial failed with missing DLL in command line or PowerShell
This error is caused by missing LIBIFCOREMD.DLL and LIBMMD.DLL, and SciPy failing to install. Using Anaconda or Miniconda with Python (64-bit) can solve it.
......@@ -38,11 +28,7 @@ This error caused by missing LIBIFCOREMD.DLL and LIBMMD.DLL and fail to install
### Trial failed on webUI
Please check the trial log file stderr for more details.
If there is a stderr file, please check it out. Two possible cases are as follows:
......
......@@ -29,7 +29,7 @@ nnictl support commands:
* Description
You can use this command to create a new experiment, using the configuration specified in the config file.
After this command is successfully done, the context will be set to this experiment, which means the following commands you issue are associated with this experiment, unless you explicitly change the context (not supported yet).
......@@ -91,7 +91,7 @@ Debug mode will disable version check function in Trialkeeper.
|Name, shorthand|Required|Default|Description|
|------|------|------ |------|
|id| True| |The id of the experiment you want to resume|
|--port, -p| False| |Rest port of the experiment you want to resume|
|--debug, -d|False||Set debug mode|
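For example, to resume an experiment on a specific port (the port value is illustrative):

```bash
nnictl resume [experiment_id] --port 8088
```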
......@@ -170,7 +170,7 @@ Debug mode will disable version check function in Trialkeeper.
nnictl update searchspace [experiment_id] --filename examples/trials/mnist/search_space.json
```
* __nnictl update concurrency__
* Description
......@@ -197,11 +197,11 @@ Debug mode will disable version check function in Trialkeeper.
nnictl update concurrency [experiment_id] --value [concurrency_number]
```
* __nnictl update duration__
* Description
You can use this command to update an experiment's duration.
* Usage
......@@ -224,7 +224,7 @@ Debug mode will disable version check function in Trialkeeper.
nnictl update duration [experiment_id] --value [duration]
```
* __nnictl update trialnum__
* Description
You can use this command to update an experiment's maxTrialNum.
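For example (the value is illustrative, following the same pattern as the other update commands):

```bash
nnictl update trialnum [experiment_id] --value 30
```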
......@@ -312,7 +312,7 @@ Debug mode will disable version check function in Trialkeeper.
nnictl top
```
* Options
|Name, shorthand|Required|Default|Description|
|------|------|------ |------|
......@@ -525,12 +525,12 @@ Debug mode will disable version check function in Trialkeeper.
* __nnictl log trial__
* Description
Show trial log path.
* Usage
```bash
nnictl log trial [options]
```
......@@ -554,7 +554,7 @@ Debug mode will disable version check function in Trialkeeper.
* Description
Start the tensorboard process.
* Usage
```bash
......@@ -571,8 +571,8 @@ Debug mode will disable version check function in Trialkeeper.
* Detail
1. For now, NNICTL supports the tensorboard function on local and remote platforms; other platforms will be supported later.
2. If you want to use tensorboard, you need to write your tensorboard log data to the path given by the environment variable [NNI_OUTPUT_DIR].
3. In local mode, nnictl will set --logdir=[NNI_OUTPUT_DIR] directly and start a tensorboard process.
4. In remote mode, nnictl creates an SSH client to copy log data from the remote machine to a local temp directory first, and then starts a tensorboard process on your local machine. Note that nnictl copies the log data only once when you run the command; if you want to see later tensorboard results, execute the nnictl tensorboard command again.
5. If there is only one trial job, you don't need to set the trial id. If there are multiple trial jobs running, you should set the trial id, or you can use [nnictl tensorboard start --trial_id all] to map --logdir to all trial log paths.
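For example:

```bash
# single running trial: no trial id needed
nnictl tensorboard start
# multiple running trials: map --logdir to all trial log paths
nnictl tensorboard start --trial_id all
```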
......
......@@ -11,7 +11,7 @@ The figure below shows high-level architecture of NNI.
<p align="center">
<img src="https://user-images.githubusercontent.com/23273522/51816536-ed055580-2301-11e9-8ad8-605a79ee1b9a.png" alt="drawing" width="700"/>
</p>
## Key Concepts
......@@ -42,7 +42,7 @@ For each experiment, user only needs to define a search space and update a few l
<p align="center">
<img src="https://user-images.githubusercontent.com/23273522/51816627-5d13db80-2302-11e9-8f3e-627e260203d5.jpg" alt="drawing"/>
</p>
For more details about how to run an experiment, please refer to [Get Started](QuickStart.md).
......
......@@ -19,7 +19,7 @@ maxExecDuration: 3h
maxTrialNum: 100
# choice: local, remote, pai
trainingServicePlatform: pai
# choice: true, false
useAnnotation: true
tuner:
builtinTunerName: TPE
......@@ -31,7 +31,7 @@ trial:
gpuNum: 0
cpuNum: 1
memoryMB: 8196
image: msranni/nni:latest
dataDir: hdfs://10.1.1.1:9000/nni
outputDir: hdfs://10.1.1.1:9000/nni
# Configuration to access OpenPAI Cluster
......@@ -53,7 +53,7 @@ Compared with LocalMode and [RemoteMachineMode](RemoteMachineMode.md), trial con
* We have already built a docker image [msranni/nni](https://hub.docker.com/r/msranni/nni/) on [Docker Hub](https://hub.docker.com/). It contains NNI python packages, Node modules and JavaScript artifact files required to start an experiment, and all NNI dependencies. The docker file used to build this image can be found [here](https://github.com/Microsoft/nni/tree/master/deployment/docker/Dockerfile). You can either use this image directly in your config file, or build your own image based on it.
* dataDir
* Optional key. It specifies the HDFS data directory for the trial to download data. The format should be something like hdfs://{your HDFS host}:9000/{your data directory}
* outputDir
* Optional key. It specifies the HDFS output directory for the trial. Once the trial is completed (whether it succeeds or fails), the trial's stdout and stderr will be copied to this directory by the NNI sdk automatically. The format should be something like hdfs://{your HDFS host}:9000/{your output directory}
* virtualCluster
* Optional key. Set the virtualCluster of OpenPAI. If omitted, the job will run on the default virtual cluster.
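Putting these keys together, a pai-mode trial section might look like the sketch below (host and paths are placeholders):

```yaml
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 0
  cpuNum: 1
  memoryMB: 8196
  image: msranni/nni:latest
  dataDir: hdfs://10.1.1.1:9000/nni/data
  outputDir: hdfs://10.1.1.1:9000/nni/output
  virtualCluster: default
```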
......@@ -64,13 +64,13 @@ Once complete to fill NNI experiment config file and save (for example, save as
```
nnictl create --config exp_pai.yml
```
to start the experiment in pai mode. NNI will create an OpenPAI job for each trial, and the job name format is something like `nni_exp_{experiment_id}_trial_{trial_id}`.
You can see jobs created by NNI in the OpenPAI cluster's web portal, like:
![](../img/nni_pai_joblist.jpg)
Notice: In pai mode, NNIManager will start a rest server and listen on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is `8080`, the rest server will listen on `8081` to receive metrics from trial jobs running on OpenPAI. So you should enable the `8081` TCP port in your firewall rules to allow incoming traffic.
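For example, on an Ubuntu machine using ufw, you might open the port like this (adjust to your own firewall; the port is your WebUI port plus 1):

```bash
sudo ufw allow 8081/tcp
```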
Once a trial job is completed, you can go to NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.
Expand a trial's information in the trial list view, and click the logPath link like:
![](../img/nni_webui_joblist.jpg)
......@@ -80,16 +80,16 @@ And you will be redirected to HDFS web portal to browse the output files of that
You can see there are three files in the output folder: stderr, stdout, and trial.log.
If you also want to save a trial's other output into HDFS, such as model files, you can use the environment variable `NNI_OUTPUT_DIR` in your trial code to save your own output files, and the NNI SDK will copy all the files in `NNI_OUTPUT_DIR` from the trial's container to HDFS.
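A minimal sketch of that pattern in trial code (the file name and content are placeholders for a real model save):

```python
import os

# Write custom artifacts under NNI_OUTPUT_DIR; the NNI SDK copies this
# directory from the trial's container to HDFS when the trial completes.
output_dir = os.environ.get("NNI_OUTPUT_DIR", ".")
model_path = os.path.join(output_dir, "model.txt")
with open(model_path, "w") as f:
    f.write("replace with your framework's model save call")
```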
For any problems when using NNI in pai mode, please create an issue on the [NNI github repo](https://github.com/Microsoft/nni).
## Version check
NNI supports the version check feature since version 0.6. It is a policy to ensure that the version of NNIManager is consistent with trialKeeper, and to avoid errors caused by version incompatibility.

Check policy:

1. NNIManager before v0.6 could run any version of trialKeeper; trialKeeper supports backward compatibility.
2. Since version 0.6, the NNIManager version should be the same as the trialKeeper version. For example, if the NNIManager version is 0.6, the trialKeeper version should be 0.6 too.
3. Note that the version check only checks the first two digits of the version. For example, NNIManager v0.6.1 could use trialKeeper v0.6 or trialKeeper v0.6.2, but could not use trialKeeper v0.5.1 or trialKeeper v0.7.

If you cannot run your experiment and want to know whether it is caused by version check, check the webUI, and there will be an error message about the version check.
![](../img/version_check.png)
......@@ -10,11 +10,7 @@ We support Linux MacOS and Windows in current stage, Ubuntu 16.04 or higher, Mac
```
#### Windows
If you are using NNI on Windows, you need to run the PowerShell command below as administrator the first time.
```powershell
Set-ExecutionPolicy -ExecutionPolicy Unrestricted
```
Then install nni through pip:
```bash
python -m pip install --upgrade nni
```
......@@ -130,7 +126,7 @@ useAnnotation: false
tuner:
builtinTunerName: TPE
# The path and the running command of trial
trial:
command: python3 mnist.py
codeDir: .
gpuNum: 0
......
# ChangeLog
## Release 0.8 - 6/4/2019
### Major Features
* [Support NNI on Windows for PAI/Remote mode]
* NNI running on Windows for remote mode
* NNI running on Windows for PAI mode
* [Advanced features for using GPU]
* Run multiple trial jobs on the same GPU for local and remote mode
* Run trial jobs on the GPU running non-NNI jobs
* [Kubeflow v1beta2 operator]
* Support Kubeflow TFJob/PyTorchJob v1beta2
* [General NAS programming interface](./GeneralNasInterfaces.md)
* Provide NAS programming interface for users to easily express their neural architecture search space through NNI annotation
* Provide a new command `nnictl trial codegen` for debugging the NAS code
* Tutorial of NAS programming interface, example of NAS on mnist, customized random tuner for NAS
* [Support resume tuner/advisor's state for experiment resume]
* For experiment resume, tuner/advisor will be resumed by replaying finished trial data
* [Web Portal]
* Improve the design of copying trial's parameters
* Support 'randint' type in hyper-parameter graph
* Use shouldComponentUpdate to avoid unnecessary render
### Bug Fixes and Other Changes
* [Bug fix that `nnictl update` has inconsistent command styles]
* [Support import data for SMAC tuner]
* [Bug fix that experiment state transition from ERROR back to RUNNING]
* [Fix bug of table entries]
* [Nested search space refinement]
* [Refine 'randint' type and support lower bound]
* [Comparison of different hyper-parameter tuning algorithm](./CommunitySharings/HpoComparision.md)
* [Comparison of NAS algorithm](./CommunitySharings/NasComparision.md)
* [NNI practice on Recommenders](./CommunitySharings/NniPracticeSharing/RecommendersSvd.md)
## Release 0.7 - 4/29/2019
### Major Features
......@@ -71,14 +102,14 @@
## Release 0.5.1 - 1/31/2019
### Improvements
* Making [log directory](https://github.com/Microsoft/nni/blob/v0.5.1/docs/en_US/ExperimentConfig.md) configurable
* Support [different levels of logs](https://github.com/Microsoft/nni/blob/v0.5.1/docs/en_US/ExperimentConfig.md), making it easier for debugging
### Documentation
* Reorganized documentation & New Homepage Released: https://nni.readthedocs.io/en/latest/
### Bug Fixes and Other Changes
* Fix the bug of installation in python virtualenv, and refactor the installation logic
* Fix the bug of HDFS access failure on OpenPAI mode after OpenPAI is upgraded.
* Fix the bug that sometimes in-place flushed stdout makes experiment crash
## Release 0.5.0 - 01/14/2019
......@@ -146,7 +177,7 @@
* [Kubeflow Training service](./KubeflowMode.md)
* Support tf-operator
* [Distributed trial example](https://github.com/Microsoft/nni/tree/master/examples/trials/mnist-distributed/dist_mnist.py) on Kubeflow
* [Grid search tuner](GridsearchTuner.md)
* [Hyperband tuner](HyperbandAdvisor.md)
* Support launching NNI experiments on macOS
* WebUI
......@@ -161,10 +192,10 @@
### Others
* Asynchronous dispatcher
* Docker file update, add pytorch library
* Refactor 'nnictl stop' process, send SIGTERM to nni manager process, rather than calling stop Rest API.
* OpenPAI training service bug fix
* Support NNI Manager IP configuration(nniManagerIp) in OpenPAI cluster config file, to fix the issue that user’s machine has no eth0 device
* File number in codeDir is capped to 1000 now, to avoid user mistakenly fill root dir for codeDir
* Don’t print useless ‘metrics is empty’ log in OpenPAI job’s stdout. Only print useful message once new metrics are recorded, to reduce confusion when user checks OpenPAI trial’s output for debugging purpose
* Add timestamp at the beginning of each log entry in trial keeper.
......@@ -188,7 +219,7 @@
* <span style="color:red">**breaking change**</span>: nni.get_parameters() is refactored to nni.get_next_parameter. All examples of prior releases cannot run on v0.3; please clone the nni repo to get new examples. If you have applied NNI to your own code, please update the API accordingly.
* New API **nni.get_sequence_id()**.
Each trial job is allocated a unique sequence number, which can be retrieved by the nni.get_sequence_id() API.
```bash
......
......@@ -38,7 +38,7 @@ All types of sampling strategies and their parameter are listed here:
* {"_type":"randint","_value":[lower, upper]}
* For now, we implement the "randint" distribution with "quniform", which means the variable value is a value like round(uniform(lower, upper)). The type of the chosen value is float. If you want to use an integer value, please convert it explicitly.
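A hypothetical trial snippet showing the explicit conversion (the parameter name is illustrative):

```python
import nni

params = nni.get_next_parameter()
# 'randint' values arrive as floats (e.g. 4.0), so cast before integer use
num_layers = int(params["num_layers"])
```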
* {"_type":"uniform","_value":[low, high]}
* Which means the variable value is uniformly distributed between low and high.
......@@ -92,7 +92,7 @@ Known Limitations:
* Note that in the Grid Search Tuner, for users' convenience, the definitions of `quniform` and `qloguniform` change, where q here specifies the number of values that will be sampled. Details about them are listed as follows (a sketch follows this list):
* Type 'quniform' will receive three values [low, high, q], where [low, high] specifies a range and 'q' specifies the number of values that will be sampled evenly. Note that q should be at least 2. It will be sampled in a way that the first sampled value is 'low', and each of the following values is (high-low)/q larger than the value in front of it.
* Type 'qloguniform' behaves like 'quniform' except that it will first change the range to [log(low), log(high)], sample, and then change the sampled value back.
* Note that Metis Tuner only supports numerical `choice` now
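A sketch of the quniform/qloguniform sampling described above (following this description, not the tuner's actual code):

```python
import math

def grid_quniform(low, high, q):
    # q evenly spaced values: the first is low, each next one is (high - low) / q larger
    step = (high - low) / q
    return [low + i * step for i in range(q)]

def grid_qloguniform(low, high, q):
    # move the range to log space, sample evenly, then map the values back
    return [math.exp(v) for v in grid_quniform(math.log(low), math.log(high), q)]
```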
......
......@@ -63,4 +63,4 @@ After the code changes, use **step 3** to rebuild your codes, then the changes w
---
Finally, we wish you a wonderful day.

For more contribution guidelines on making PRs or issues to NNI source code, you can refer to our [Contributing](./Contributing.md) document.