Commit ae7a72bc authored by Hongarc's avatar Hongarc Committed by Chi Song
Browse files

Remove all whitespace at end of line (#1162)

parent 14c1b31c
......@@ -14,7 +14,7 @@
[简体中文](README_zh_CN.md)
NNI (Neural Network Intelligence) is a toolkit to help users run automated machine learning (AutoML) experiments.
NNI (Neural Network Intelligence) is a toolkit to help users run automated machine learning (AutoML) experiments.
The tool dispatches and runs trial jobs generated by tuning algorithms to search the best neural architecture and/or hyper-parameters in different environments like local machine, remote servers and cloud.
### **NNI [v0.8](https://github.com/Microsoft/nni/releases) has been released!**
......@@ -67,7 +67,7 @@ The tool dispatches and runs trial jobs generated by tuning algorithms to search
<li><a href="docs/en_US/BuiltinTuner.md#MetisTuner">Metis Tuner</a></li>
<li><a href="docs/en_US/BuiltinTuner.md#BOHB">BOHB</a></li>
</ul>
<a href="docs/en_US/BuiltinAssessors.md">Assessor</a>
<a href="docs/en_US/BuiltinAssessors.md">Assessor</a>
<ul>
<li><a href="docs/en_US/BuiltinAssessors.md#Medianstop">Median Stop</a></li>
<li><a href="docs/en_US/BuiltinAssessors.md#Curvefitting">Curve Fitting</a></li>
......@@ -106,7 +106,7 @@ We encourage researchers and students leverage these projects to accelerate the
## **Install & Verify**
**Install through pip**
**Install through pip**
* We support Linux, MacOS and Windows(local, remote and pai mode) in current stage, Ubuntu 16.04 or higher, MacOS 10.14.1 along with Windows 10.1809 are tested and supported. Simply run the following `pip install` in an environment that has `python >= 3.5`.
......@@ -136,7 +136,7 @@ Note:
**Install through source code**
* We support Linux (Ubuntu 16.04 or higher), MacOS (10.14.1) and Windows (10.1809) in our current stage.
* We support Linux (Ubuntu 16.04 or higher), MacOS (10.14.1) and Windows (10.1809) in our current stage.
Linux and MacOS
......
......@@ -31,7 +31,7 @@ RUN DEBIAN_FRONTEND=noninteractive && \
vim \
unzip \
wget \
build-essential \
build-essential \
cmake \
libopenblas-dev \
automake \
......@@ -51,9 +51,9 @@ RUN DEBIAN_FRONTEND=noninteractive && \
#
RUN python3 -m pip install --upgrade pip
# numpy 1.14.3 scipy 1.1.0
# numpy 1.14.3 scipy 1.1.0
RUN python3 -m pip --no-cache-dir install \
numpy==1.14.3 scipy==1.1.0
numpy==1.14.3 scipy==1.1.0
#
# Tensorflow 1.10.0
......@@ -66,7 +66,7 @@ RUN python3 -m pip --no-cache-dir install tensorflow-gpu==1.10.0
RUN python3 -m pip --no-cache-dir install Keras==2.1.6
#
# PyTorch
# PyTorch
#
RUN python3 -m pip --no-cache-dir install torch==0.4.1
RUN python3 -m pip install torchvision==0.2.1
......
......@@ -14,11 +14,11 @@ pandas 0.23.4
lightgbm 2.2.2
NNI v0.7
```
You can take this Dockerfile as a reference for your own customized Dockerfile.
You can take this Dockerfile as a reference for your own customized Dockerfile.
## 2.How to build and run
__Use the following command from `nni/deployment/docker` to build docker image__
```
```
docker build -t nni/nni .
```
__Run the docker image__
......
......@@ -7,7 +7,7 @@ ifeq ($(UNAME_S), Linux)
else ifeq ($(UNAME_S), Darwin)
OS_SPEC := darwin
WHEEL_SPEC := macosx_10_9_x86_64
else
else
$(error platform $(UNAME_S) not supported)
endif
......
......@@ -43,7 +43,7 @@ $env:PATH = $NNI_NODE_FOLDER+';'+$env:PATH
cd $CWD\..\..\src\nni_manager
yarn
yarn build
cd $CWD\..\..\src\webui
cd $CWD\..\..\src\webui
yarn
yarn build
if(Test-Path $CWD\nni){
......
......@@ -88,4 +88,4 @@ For details, please refer to this [simple weight sharing example](https://github
[2]: https://arxiv.org/abs/1707.07012
[3]: https://arxiv.org/abs/1806.09055
[4]: https://arxiv.org/abs/1806.10282
[5]: https://arxiv.org/abs/1703.01041
[5]: https://arxiv.org/abs/1703.01041
......@@ -29,7 +29,7 @@ In NNI, there are mainly four types of annotation:
**Arguments**
- **sampling_algo**: Sampling algorithm that specifies a search space. User should replace it with a built-in NNI sampling function whose name consists of an `nni.` identification and a search space type specified in [SearchSpaceSpec](SearchSpaceSpec.md) such as `choice` or `uniform`.
- **sampling_algo**: Sampling algorithm that specifies a search space. User should replace it with a built-in NNI sampling function whose name consists of an `nni.` identification and a search space type specified in [SearchSpaceSpec](SearchSpaceSpec.md) such as `choice` or `uniform`.
- **name**: The name of the variable that the selected value will be assigned to. Note that this argument should be the same as the left value of the following assignment statement.
There are 10 types to express your search space as follows:
......
......@@ -195,8 +195,8 @@ Note that the search space that BatchTuner supported like:
"_value" : [{"optimizer": "Adam", "learning_rate": 0.00001},
{"optimizer": "Adam", "learning_rate": 0.0001},
{"optimizer": "Adam", "learning_rate": 0.001},
{"optimizer": "SGD", "learning_rate": 0.01},
{"optimizer": "SGD", "learning_rate": 0.005},
{"optimizer": "SGD", "learning_rate": 0.01},
{"optimizer": "SGD", "learning_rate": 0.005},
{"optimizer": "SGD", "learning_rate": 0.0002}]
}
}
......
......@@ -28,7 +28,7 @@ To avoid over-fitting in **CIFAR-10**, we also compare the models in the other f
We do not change the default fine-tuning technique in their source code. In order to match each task, the codes of input image shape and output numbers are changed.
We do not change the default fine-tuning technique in their source code. In order to match each task, the codes of input image shape and output numbers are changed.
Search phase time for all NAS methods is **two days** as well as the retrain time. Average results are reported based on **three repeat times**. Our evaluation machines have one Nvidia Tesla P100 GPU, 112GB of RAM and one 2.60GHz CPU (Intel E5-2690).
......@@ -57,7 +57,7 @@ The best or average results reported in the paper:
For AutoKeras, it has relatively worse performance across all datasets due to its random factor on network morphism.
For ENAS, ENAS (macro) shows good results in OUI-Adience-Age and ENAS (micro) shows good results in CIFAR-10.
For ENAS, ENAS (macro) shows good results in OUI-Adience-Age and ENAS (micro) shows good results in CIFAR-10.
For DARTS, it has a good performance on some datasets but we found its high variance in other datasets. The difference among three runs of benchmarks can be up to 5.37% in OUI-Adience-Age and 4.36% in ImageNet-10-1.
......
......@@ -8,7 +8,7 @@ However, for those individuals who want a bit more guidance on the best way to c
Looking for a quickstart, get acquainted with our [Get Started](./QuickStart.md) guide.
There are a few simple guidelines that you need to follow before providing your hacks.
There are a few simple guidelines that you need to follow before providing your hacks.
## Raising Issues
......@@ -20,9 +20,9 @@ When raising issues, please specify the following:
## Submit Proposals for New Features
- There is always something more that is required, to make it easier to suit your use-cases. Feel free to join the discussion on new features or raise a PR with your proposed change.
- There is always something more that is required, to make it easier to suit your use-cases. Feel free to join the discussion on new features or raise a PR with your proposed change.
- Fork the repository under your own github handle. After cloning the repository. Add, commit, push and sqaush (if necessary) the changes with detailed commit messages to your fork. From where you can proceed to making a pull request.
- Fork the repository under your own github handle. After cloning the repository. Add, commit, push and sqaush (if necessary) the changes with detailed commit messages to your fork. From where you can proceed to making a pull request.
## Contributing to Source Code and Bug Fixes
......@@ -33,15 +33,15 @@ If you are looking for How to develop and debug the NNI source code, you can ref
Similarly for [Quick Start](QuickStart.md). For everything else, refer to [NNI Home page](http://nni.readthedocs.io).
## Solve Existing Issues
Head over to [issues](https://github.com/Microsoft/nni/issues) to find issues where help is needed from contributors. You can find issues tagged with 'good-first-issue' or 'help-wanted' to contribute in.
Head over to [issues](https://github.com/Microsoft/nni/issues) to find issues where help is needed from contributors. You can find issues tagged with 'good-first-issue' or 'help-wanted' to contribute in.
A person looking to contribute can take up an issue by claiming it as a comment/assign their Github ID to it. In case there is no PR or update in progress for a week on the said issue, then the issue reopens for anyone to take up again. We need to consider high priority issues/regressions where response time must be a day or so.
A person looking to contribute can take up an issue by claiming it as a comment/assign their Github ID to it. In case there is no PR or update in progress for a week on the said issue, then the issue reopens for anyone to take up again. We need to consider high priority issues/regressions where response time must be a day or so.
## Code Styles & Naming Conventions
## Code Styles & Naming Conventions
* We follow [PEP8](https://www.python.org/dev/peps/pep-0008/) for Python code and naming conventions, do try to adhere to the same when making a pull request or making a change. One can also take the help of linters such as `flake8` or `pylint`
* We also follow [NumPy Docstring Style](https://www.sphinx-doc.org/en/master/usage/extensions/example_numpy.html#example-numpy) for Python Docstring Conventions. During the [documentation building](Contributing.md#documentation), we use [sphinx.ext.napoleon](https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html) to generate Python API documentation from Docstring.
## Documentation
## Documentation
Our documentation is built with [sphinx](http://sphinx-doc.org/), supporting [Markdown](https://guides.github.com/features/mastering-markdown/) and [reStructuredText](http://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html) format. All our documentations are placed under [docs/en_US](https://github.com/Microsoft/nni/tree/master/docs).
* Before submitting the documentation change, please __build homepage locally__: `cd docs/en_US && make html`, then you can see all the built documentation webpage under the folder `docs/en_US/_build/html`. It's also highly recommended taking care of __every WARNING__ during the build, which is very likely the signal of a __deadlink__ and other annoying issues.
......
......@@ -2,7 +2,7 @@ Curve Fitting Assessor on NNI
===
## 1. Introduction
Curve Fitting Assessor is a LPA(learning, predicting, assessing) algorithm. It stops a pending trial X at step S if the prediction of final epoch's performance is worse than the best final performance in the trial history.
Curve Fitting Assessor is a LPA(learning, predicting, assessing) algorithm. It stops a pending trial X at step S if the prediction of final epoch's performance is worse than the best final performance in the trial history.
In this algorithm, we use 12 curves to fit the learning curve, the large set of parametric curve models are chosen from [reference paper][1]. The learning curves' shape coincides with our prior knowlwdge about the form of learning curves: They are typically increasing, saturating functions.
......
......@@ -34,7 +34,7 @@ advisor:
classFileName: my_customized_advisor.py
className: CustomizedAdvisor
# Any parameter need to pass to your advisor class __init__ constructor
# can be specified in this optional classArgs field, for example
# can be specified in this optional classArgs field, for example
classArgs:
arg1: value1
```
......@@ -47,7 +47,7 @@ assessor:
classFileName: my_customized_assessor.py
className: CustomizedAssessor
# Any parameter need to pass to your Assessor class __init__ constructor
# can be specified in this optional classArgs field, for example
# can be specified in this optional classArgs field, for example
classArgs:
arg1: value1
```
......
......@@ -96,7 +96,7 @@ tuner:
classFileName: my_customized_tuner.py
className: CustomizedTuner
# Any parameter need to pass to your tuner class __init__ constructor
# can be specified in this optional classArgs field, for example
# can be specified in this optional classArgs field, for example
classArgs:
arg1: value1
......
......@@ -2,7 +2,7 @@
A config file is needed when create an experiment, the path of the config file is provide to nnictl.
The config file is written in YAML format, and need to be written correctly.
This document describes the rule to write config file, and will provide some examples and templates.
This document describes the rule to write config file, and will provide some examples and templates.
- [Experiment config reference](#experiment-config-reference)
- [Template](#template)
......@@ -125,7 +125,7 @@ machineList:
## Configuration spec
* __authorName__
* Description
* Description
__authorName__ is the name of the author who create the experiment.
TBD: add default value
......@@ -133,20 +133,20 @@ machineList:
* __experimentName__
* Description
__experimentName__ is the name of the experiment created.
__experimentName__ is the name of the experiment created.
TBD: add default value
* __trialConcurrency__
* Description
__trialConcurrency__ specifies the max num of trial jobs run simultaneously.
__trialConcurrency__ specifies the max num of trial jobs run simultaneously.
Note: if trialGpuNum is bigger than the free gpu numbers, and the trial jobs running simultaneously can not reach trialConcurrency number, some trial jobs will be put into a queue to wait for gpu allocation.
* __maxExecDuration__
* Description
__maxExecDuration__ specifies the max duration time of an experiment.The unit of the time is {__s__, __m__, __h__, __d__}, which means {_seconds_, _minutes_, _hours_, _days_}.
__maxExecDuration__ specifies the max duration time of an experiment.The unit of the time is {__s__, __m__, __h__, __d__}, which means {_seconds_, _minutes_, _hours_, _days_}.
Note: The maxExecDuration spec set the time of an experiment, not a trial job. If the experiment reach the max duration time, the experiment will not stop, but could not submit new trial jobs any more.
......@@ -158,16 +158,16 @@ machineList:
* __maxTrialNum__
* Description
__maxTrialNum__ specifies the max number of trial jobs created by NNI, including succeeded and failed jobs.
__maxTrialNum__ specifies the max number of trial jobs created by NNI, including succeeded and failed jobs.
* __trainingServicePlatform__
* Description
__trainingServicePlatform__ specifies the platform to run the experiment, including {__local__, __remote__, __pai__, __kubeflow__}.
__trainingServicePlatform__ specifies the platform to run the experiment, including {__local__, __remote__, __pai__, __kubeflow__}.
* __local__ run an experiment on local ubuntu machine.
* __local__ run an experiment on local ubuntu machine.
* __remote__ submit trial jobs to remote ubuntu machines, and __machineList__ field should be filed in order to set up SSH connection to remote machine.
* __remote__ submit trial jobs to remote ubuntu machines, and __machineList__ field should be filed in order to set up SSH connection to remote machine.
* __pai__ submit trial jobs to [OpenPai](https://github.com/Microsoft/pai) of Microsoft. For more details of pai configuration, please reference [PAIMOdeDoc](./PaiMode.md)
......@@ -396,7 +396,7 @@ machineList:
__localConfig__ is applicable only if __trainingServicePlatform__ is set to `local`, otherwise there should not be __localConfig__ section in configuration file.
* __gpuIndices__
__gpuIndices__ is used to specify designated GPU devices for NNI, if it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified, multiple GPU indices are seperated by comma(,), such as `1` or `0,1,3`.
* __maxTrialNumPerGpu__
......@@ -413,11 +413,11 @@ machineList:
__machineList__ should be set if __trainingServicePlatform__ is set to remote, or it should be empty.
* __ip__
__ip__ is the ip address of remote machine.
* __port__
__port__ is the ssh port to be used to connect machine.
Note: if users set port empty, the default value will be 22.
......@@ -439,7 +439,7 @@ machineList:
__passphrase__ is used to protect ssh key, which could be empty if users don't have passphrase.
* __gpuIndices__
__gpuIndices__ is used to specify designated GPU devices for NNI on this remote machine, if it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified, multiple GPU indices are seperated by comma(,), such as `1` or `0,1,3`.
* __maxTrialNumPerGpu__
......
......@@ -21,7 +21,7 @@ Please try the following solutions in turn:
Your machine don't have eth0 device, please set [nniManagerIp](ExperimentConfig.md) in your config file manually.
### Exceed the MaxDuration but didn't stop
When the duration of experiment reaches the maximum duration, nniManager will not create new trials, but the existing trials will continue unless user manually stop the experiment.
When the duration of experiment reaches the maximum duration, nniManager will not create new trials, but the existing trials will continue unless user manually stop the experiment.
### Could not stop an experiment using `nnictl stop`
If you upgrade your NNI or you delete some config files of NNI when there is an experiment running, this kind of issue may happen because the loss of config file. You could use `ps -ef | grep node` to find the PID of your experiment, and use `kill -9 {pid}` to kill it manually.
......@@ -37,7 +37,7 @@ Unable to open the WebUI may have the following reasons:
* Another reason may be your experiment is failed and NNI may fail to get the experiment information. You can check the log of NNIManager in the following directory: ~/nni/experiment/[your_experiment_id] /log/nnimanager.log
### NNI on Windows problems
Please refer to [NNI on Windows](NniOnWindows.md)
Please refer to [NNI on Windows](NniOnWindows.md)
### Help us improve
Please inquiry the problem in https://github.com/Microsoft/nni/issues to see whether there are other people already reported the problem, create a new one if there are no existing issues been created.
......@@ -9,8 +9,8 @@ NNI supports running experiment using [FrameworkController](https://github.com/M
4. Prepare a **NFS server** and export a general purpose mount (we recommend to map your NFS server path in `root_squash option`, otherwise permission issue may raise when NNI copies files to NFS. Refer this [page](https://linux.die.net/man/5/exports) to learn what root_squash option is), or **Azure File Storage**.
5. Install **NFS client** on the machine where you install NNI and run nnictl to create experiment. Run this command to install NFSv4 client:
```
apt-get install nfs-common
```
apt-get install nfs-common
```
6. Install **NNI**, follow the install guide [here](QuickStart.md).
......@@ -82,9 +82,9 @@ frameworkcontrollerConfig:
accountName: {your_storage_account_name}
azureShare: {your_azure_share_name}
```
Note: You should explicitly set `trainingServicePlatform: frameworkcontroller` in NNI config YAML file if you want to start experiment in frameworkcontrollerConfig mode.
Note: You should explicitly set `trainingServicePlatform: frameworkcontroller` in NNI config YAML file if you want to start experiment in frameworkcontrollerConfig mode.
The trial's config format for NNI frameworkcontroller mode is a simple version of frameworkcontroller's offical config, you could refer the [tensorflow example of frameworkcontroller](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for deep understanding.
The trial's config format for NNI frameworkcontroller mode is a simple version of frameworkcontroller's offical config, you could refer the [tensorflow example of frameworkcontroller](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for deep understanding.
Trial configuration in frameworkcontroller mode have the following configuration keys:
* taskRoles: you could set multiple task roles in config file, and each task role is a basic unit to process in kubernetes cluster.
* name: the name of task role specified, like "worker", "ps", "master".
......
# GBDT in nni
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion as other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion as other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
Gradient boosting decision tree has many popular implementations, such as [lightgbm](https://github.com/Microsoft/LightGBM), [xgboost](https://github.com/dmlc/xgboost), and [catboost](https://github.com/catboost/catboost), etc. GBDT is a great tool for solving the problem of traditional machine learning problem. Since GBDT is a robust algorithm, it could use in many domains. The better hyper-parameters for GBDT, the better performance you could achieve.
NNI is a great platform for tuning hyper-parameters, you could try various builtin search algorithm in nni and run multiple trials concurrently.
NNI is a great platform for tuning hyper-parameters, you could try various builtin search algorithm in nni and run multiple trials concurrently.
## 1. Search Space in GBDT
There are many hyper-parameters in GBDT, but what kind of parameters will affect the performance or speed? Based on some practical experience, some suggestion here(Take lightgbm as example):
......
......@@ -7,7 +7,7 @@ TrainingService is a module related to platform management and job schedule in N
## System architecture
![](../img/NNIDesign.jpg)
The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of system, in charge of calling TrainingService to manage trial jobs and the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module to manage trial jobs, it communicates with nniManager module, and has different instance according to different training platform. For the time being, NNI supports local platfrom, [remote platfrom](RemoteMachineMode.md), [PAI platfrom](PaiMode.md), [kubeflow platform](KubeflowMode.md) and [FrameworkController platfrom](FrameworkController.md).
The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of system, in charge of calling TrainingService to manage trial jobs and the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module to manage trial jobs, it communicates with nniManager module, and has different instance according to different training platform. For the time being, NNI supports local platfrom, [remote platfrom](RemoteMachineMode.md), [PAI platfrom](PaiMode.md), [kubeflow platform](KubeflowMode.md) and [FrameworkController platfrom](FrameworkController.md).
In this document, we introduce the brief design of TrainingService. If users want to add a new TrainingService instance, they just need to complete a child class to implement TrainingService, don't need to understand the code detail of NNIManager, Dispatcher or other modules.
## Folder structure of code
......@@ -62,7 +62,7 @@ abstract class TrainingService {
```
The parent class of TrainingService has a few abstract functions, users need to inherit the parent class and implement all of these abstract functions.
__setClusterMetadata(key: string, value: string)__
__setClusterMetadata(key: string, value: string)__
ClusterMetadata is the data related to platform details, for examples, the ClusterMetadata defined in remote machine server is:
```
export class RemoteMachineMeta {
......@@ -76,7 +76,7 @@ export class RemoteMachineMeta {
/* GPU Reservation info, the key is GPU index, the value is the job id which reserves this GPU*/
public gpuReservation : Map<number, string>;
constructor(ip : string, port : number, username : string, passwd : string,
constructor(ip : string, port : number, username : string, passwd : string,
sshKeyPath : string, passphrase : string) {
this.ip = ip;
this.port = port;
......@@ -90,10 +90,10 @@ export class RemoteMachineMeta {
```
The metadata includes the host address, the username or other configuration related to the platform. Users need to define their own metadata format, and set the metadata instance in this function. This function is called before the experiment is started to set the configuration of remote machines.
__getClusterMetadata(key: string)__
__getClusterMetadata(key: string)__
This function will return the metadata value according to the values, it could be left empty if users don't need to use it.
__submitTrialJob(form: JobApplicationForm)__
__submitTrialJob(form: JobApplicationForm)__
SubmitTrialJob is a function to submit new trial jobs, users should generate a job instance in TrialJobDetail type. TrialJobDetail is defined as follow:
```
interface TrialJobDetail {
......@@ -112,38 +112,38 @@ interface TrialJobDetail {
```
According to different kinds of implementation, users could put the job detail into a job queue, and keep fetching the job from the queue and start preparing and running them. Or they could finish preparing and running process in this function, and return job detail after the submit work.
__cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean)__
__cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean)__
If this function is called, the trial started by the platform should be canceled. Different kind of platform has diffenent methods to calcel a running job, this function should be implemented according to specific platform.
__updateTrialJob(trialJobId: string, form: JobApplicationForm)__
__updateTrialJob(trialJobId: string, form: JobApplicationForm)__
This function is called to update the trial job's status, trial job's status should be detected according to different platform, and be updated to `RUNNING`, `SUCCEED`, `FAILED` etc.
__getTrialJob(trialJobId: string)__
__getTrialJob(trialJobId: string)__
This function returns a trialJob detail instance according to trialJobId.
__listTrialJobs()__
__listTrialJobs()__
Users should put all of trial job detail information into a list, and return the list.
__addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)__
__addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)__
NNI will hold an EventEmitter to get job metrics, if there is new job metrics detected, the EventEmitter will be triggered. Users should start the EventEmitter in this function.
__removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)__
__removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)__
Close the EventEmitter.
__run()__
__run()__
The run() function is a main loop function in TrainingService, users could set a while loop to execute their logic code, and finish executing them when the experiment is stopped.
__cleanUp()__
This function is called to clean up the environment when a experiment is stopped. Users should do the platform-related cleaning operation in this function.
__cleanUp()__
This function is called to clean up the environment when a experiment is stopped. Users should do the platform-related cleaning operation in this function.
## TrialKeeper tool
NNI offers a TrialKeeper tool to help maintaining trial jobs. Users can find the source code in `nni/tools/nni_trial_tool`. If users want to run trial jobs in cloud platform, this tool will be a fine choice to help keeping trial running in the platform.
The running architecture of TrialKeeper is show as follow:
![](../img/trialkeeper.jpg)
When users submit a trial job to cloud platform, they should wrap their trial command into TrialKeeper, and start a TrialKeeper process in cloud platform. Notice that TrialKeeper use restful server to communicate with TrainingService, users should start a restful server in local machine to receive metrics sent from TrialKeeper. The source code about restful server could be found in `nni/src/nni_manager/training_service/common/clusterJobRestServer.ts`.
The running architecture of TrialKeeper is show as follow:
![](../img/trialkeeper.jpg)
When users submit a trial job to cloud platform, they should wrap their trial command into TrialKeeper, and start a TrialKeeper process in cloud platform. Notice that TrialKeeper use restful server to communicate with TrainingService, users should start a restful server in local machine to receive metrics sent from TrialKeeper. The source code about restful server could be found in `nni/src/nni_manager/training_service/common/clusterJobRestServer.ts`.
## Reference
For more information about how to debug, please [refer](HowToDebug.md).
The guide line of how to contribute, please [refer](Contributing.md).
\ No newline at end of file
For more information about how to debug, please [refer](HowToDebug.md).
The guide line of how to contribute, please [refer](Contributing.md).
......@@ -18,12 +18,12 @@ If you have installed the docker package in your local machine, you could start
For example, you could start a new docker container from following command:
```
docker run -i -t -p [hostPort]:[containerPort] [image]
```
`-i:` Start a docker in an interactive mode.
```
`-i:` Start a docker in an interactive mode.
`-t:` Docker assign the container a input terminal.
`-t:` Docker assign the container a input terminal.
`-p:` Port mapping, map host port to a container port.
`-p:` Port mapping, map host port to a container port.
For more information about docker command, please [refer](https://docs.docker.com/v17.09/edge/engine/reference/run/)
......@@ -33,11 +33,11 @@ Note:
```
### Step3: Run NNI in docker container
If you start a docker image using NNI's offical image `msranni/nni`, you could directly start NNI experiments by using `nnictl` command. Our offical image has NNI's running environment and basic python and deep learning frameworks environment.
If you start a docker image using NNI's offical image `msranni/nni`, you could directly start NNI experiments by using `nnictl` command. Our offical image has NNI's running environment and basic python and deep learning frameworks environment.
If you start your own docker image, you may need to install NNI package first, please [refer](Installation.md).
If you want to run NNI's offical examples, you may need to clone NNI repo in github using
If you want to run NNI's offical examples, you may need to clone NNI repo in github using
```
git clone https://github.com/Microsoft/nni.git
```
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment