Unverified commit ef176d29, authored by SparkSnail, committed by GitHub

Merge pull request #116 from Microsoft/master

merge master
parents 97866505 4553de75
......@@ -14,9 +14,71 @@
NNI (Neural Network Intelligence) is a toolkit to help users run automated machine learning (AutoML) experiments.
The tool dispatches and runs trial jobs generated by tuning algorithms to search for the best neural architecture and/or hyperparameters in different environments, such as a local machine, remote servers, and the cloud.
### **NNI [v0.5](https://github.com/Microsoft/nni/releases) has been released!**
<p align="center">
<img src="./docs/img/nni_arch_overview.png" alt="drawing"/>
<a href=#><img src="https://rawgit.com/QuanluZhang/nni/update-doc11/overview.svg" /></a>
</p>
<table>
<tbody>
<tr align="center">
<td>
<b>User Code + SDK (import nni)</b>
<img src="https://user-images.githubusercontent.com/44491713/51381727-e3d0f780-1b4f-11e9-96ab-d26b9198ba65.png"/>
</td>
<td>
<b>Tuning Algorithm Extensions</b>
<img src="https://user-images.githubusercontent.com/44491713/51381727-e3d0f780-1b4f-11e9-96ab-d26b9198ba65.png"/>
</td>
<td>
<b>Training Service Extensions</b>
<img src="https://user-images.githubusercontent.com/44491713/51381727-e3d0f780-1b4f-11e9-96ab-d26b9198ba65.png"/>
</td>
</tr>
<tr/>
<tr valign="top">
<td>
<ul>
<li>CNTK</li>
<li>TensorFlow</li>
<li>PyTorch</li>
<li>Keras</li>
<li>...</li>
</ul>
(Python-based frameworks)
</td>
<td>
<a href="docs/HowToChooseTuner.md">Tuner</a>
<ul>
<li><a href="docs/HowToChooseTuner.md#TPE">TPE</a></li>
<li><a href="docs/HowToChooseTuner.md#Random">Random Search</a></li>
<li><a href="docs/HowToChooseTuner.md#Anneal">Anneal</a></li>
<li><a href="docs/HowToChooseTuner.md#Evolution">Naive Evolution</a></li>
<li><a href="docs/HowToChooseTuner.md#SMAC">SMAC</a></li>
<li><a href="docs/HowToChooseTuner.md#Batch">Batch</a></li>
<li><a href="docs/HowToChooseTuner.md#Grid">Grid Search</a></li>
<li><a href="docs/HowToChooseTuner.md#Hyperband">Hyperband</a></li>
<li><a href="docs/HowToChooseTuner.md#NetworkMorphism">Network Morphism</a></li>
<li><a href="examples/tuners/enas_nni/README.md">ENAS</a></li>
<li><a href="docs/HowToChooseTuner.md#MetisTuner">Metis Tuner</a></li>
</ul>
<a href="docs/HowToChooseTuner.md#assessor">Assessor</a>
<ul>
<li><a href="docs/HowToChooseTuner.md#Medianstop">Median Stop</a></li>
<li><a href="docs/HowToChooseTuner.md#Curvefitting">Curve Fitting</a></li>
</ul>
</td>
<td>
<ul>
<li><a href="docs/tutorial_1_CR_exp_local_api.md">Local Machine</a></li>
<li><a href="docs/tutorial_2_RemoteMachineMode.md">Remote Servers</a></li>
<li><a href="docs/PAIMode.md">OpenPAI</a></li>
<li><a href="docs/KubeflowMode.md">Kubeflow</a></li>
<li><a href="docs/KubeflowMode.md">FrameworkController on K8S (AKS etc.)</a></li>
</ul>
</td>
</tr>
</tbody>
</table>
## **Who should consider using NNI**
* Those who want to try different AutoML algorithms in their training code (model) on their local machine.
......@@ -35,12 +97,14 @@ We encourage researchers and students leverage these projects to accelerate the
**Install through pip**
* We currently support Linux and macOS; Ubuntu 16.04 or higher and macOS 10.14.1 are tested and supported. Simply run the following `pip install` in an environment that has `python >= 3.5`.
```bash
python3 -m pip install --upgrade nni
```
* Note:
* If you are in docker container (as root), please remove `--user` from the installation command.
* If there is any error like `Segmentation fault`, please refer to [FAQ](docs/FAQ.md)
Note:
* `--user` can be added if you want to install NNI in your home directory, which does not require any special privileges.
* If there is any error like `Segmentation fault`, please refer to [FAQ](docs/FAQ.md)
**Install through source code**
* We currently support Linux (Ubuntu 16.04 or higher) and macOS (10.14.1).
......
trigger:
- dev-it
- master
- dev-remote-ci
jobs:
- job: 'Ubuntu_16_04'
pool:
vmImage: 'Ubuntu 16.04'
strategy:
matrix:
Python36:
PYTHON_VERSION: '3.6'
pool: 'NNI CI GPU'
steps:
- script: python3 -m pip install --upgrade pip setuptools
- script: python3 -m pip install --upgrade pip setuptools --user
displayName: 'Install python tools'
- script: |
source install.sh
displayName: 'Install nni toolkit via source code'
- script: |
python3 -m pip install scikit-learn==0.20.0 --user
python3 -m pip install torch==0.4.1 --user
python3 -m pip install torchvision==0.2.1 --user
python3 -m pip install keras==2.1.6 --user
python3 -m pip install tensorflow-gpu==1.10.0 --user
displayName: 'Install dependencies for integration tests'
- script: |
cd test
source unittest.sh
......@@ -25,11 +27,19 @@ jobs:
- script: |
cd test
PATH=$HOME/.local/bin:$PATH python3 naive_test.py
displayName: 'Integration tests'
displayName: 'Naive test'
- script: |
cd test
PATH=$HOME/.local/bin:$PATH python3 tuner_test.py
displayName: 'Built-in tuners / assessors tests'
- script: |
cd test
PATH=$HOME/.local/bin:$PATH python3 config_test.py --ts local
displayName: 'Examples and advanced features tests on local machine'
- script: |
cd test
PATH=$HOME/.local/bin:$PATH python3 sdk_test.py
displayName: 'Built-in dispatcher tests'
PATH=$HOME/.local/bin:$PATH python3 metrics_test.py
displayName: 'Trial job metrics test'
- job: 'macOS_10_13'
pool:
......@@ -52,8 +62,8 @@ jobs:
- script: |
cd test
PATH=$HOME/Library/Python/3.7/bin:$PATH python3 naive_test.py
displayName: 'Integration tests'
displayName: 'Naive test'
- script: |
cd test
PATH=$HOME/Library/Python/3.7/bin:$PATH python3 sdk_test.py
displayName: 'Built-in dispatcher tests'
\ No newline at end of file
PATH=$HOME/Library/Python/3.7/bin:$PATH python3 tuner_test.py
displayName: 'Built-in tuners / assessors tests'
......@@ -110,3 +110,4 @@ The experiment has been running now, NNI provides WebUI for you to view experime
* [How to run an experiment on local (with multiple GPUs)?](tutorial_1_CR_exp_local_api.md)
* [How to run an experiment on multiple machines?](tutorial_2_RemoteMachineMode.md)
* [How to run an experiment on OpenPAI?](PAIMode.md)
* [How to create a multi-phase experiment?](multiPhase.md)
......@@ -244,7 +244,7 @@ _Usage_:
optimize_mode: maximize
```
<a name="assessor"></a>
# How to use the assessors that NNI supports?
For now, NNI supports the following assessor algorithms.
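As a concrete illustration, an assessor is enabled from the experiment configuration file. The fragment below is only a sketch; the exact field names (e.g. `builtinAssessorName`) should be checked against the configuration reference for your NNI version:

```yml
assessor:
  # one of the built-in assessors listed above
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
```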
......
nnictl
# nnictl
===
## Introduction
......
## Create multi-phase experiment
Typically, each trial job gets a single configuration (e.g. a set of hyperparameters) from the tuner, runs some kind of experiment (say, training a model with those hyperparameters), and reports the result back to the tuner. Sometimes you may want to train multiple models within one trial job, either to share information between models or to save system resources by creating fewer trial jobs. For example:
1. Train multiple models sequentially in one trial job, so that later models can leverage the weights or other information of prior models, and may use different hyperparameters.
2. Train a large number of models on limited system resources: combining multiple models into one trial job saves the resources that creating a large number of trial jobs would cost.
3. Any other scenario in which you would like to train multiple models with different hyperparameters in one trial job. Be aware that if you allocate multiple GPUs to a trial job and train multiple models concurrently within one trial job, your trial code needs to allocate GPU resources properly.
In the above cases, you can leverage NNI's multi-phase experiments to train multiple models with different hyperparameters within each trial job.
Multi-phase experiments are experiments whose trial jobs request multiple sets of hyperparameters from the tuner and report multiple final results to NNI.
To use a multi-phase experiment, follow these steps:
1. Implement nni.multi_phase.MultiPhaseTuner. For example, this [ENAS tuner](https://github.com/countif/enas_nni/blob/master/nni/examples/tuners/enas/nni_controller_ptb.py) is a multi-phase tuner that implements nni.multi_phase.MultiPhaseTuner. While implementing your MultiPhaseTuner, you may want to use the trial_job_id parameter of the generate_parameters method to generate hyperparameters for each trial job.
2. Set the `multiPhase` field to `true`, and configure the tuner implemented in step 1 as a customized tuner in the configuration file, for example:
```yml
...
multiPhase: true
tuner:
codeDir: tuners/enas
classFileName: nni_controller_ptb.py
className: ENASTuner
classArgs:
say_hello: "hello"
...
```
3. Invoke the nni.get_next_parameter() API multiple times as needed in a trial, for example:
```python
for i in range(5):
# get parameter from tuner
tuner_param = nni.get_next_parameter()
    # consume the params and compute a result
    # ...
    # report a final result for the parameter set retrieved above
    nni.report_final_result(result)
# ...
```
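To make the flow above concrete, here is a self-contained sketch of a multi-phase trial; `StubTuner` is a hypothetical stand-in for a real MultiPhaseTuner, and the accuracy computation only simulates training:

```python
class StubTuner:
    """Hypothetical stand-in for a MultiPhaseTuner: it keys its state on
    trial_job_id, so each phase of a trial job gets different parameters."""
    def __init__(self):
        self.phase_count = {}

    def generate_parameters(self, trial_job_id):
        # Count how many times this trial job has requested parameters
        # and shrink the learning rate by 10x on each phase.
        n = self.phase_count.get(trial_job_id, 0)
        self.phase_count[trial_job_id] = n + 1
        return {'lr': 0.1 / (10 ** n)}

tuner = StubTuner()
results = []
for _ in range(3):                                 # one trial job, three phases
    params = tuner.generate_parameters('trial-0')  # nni.get_next_parameter() in real code
    accuracy = 1.0 - params['lr']                  # pretend we trained a model here
    results.append(accuracy)                       # nni.report_final_result(accuracy) in real code
print(results)
```

Each call with the same trial_job_id advances that job's phase, which is exactly why the tuner needs the trial_job_id parameter rather than a single global counter.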
'''Train CIFAR10 with PyTorch.'''
from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.optim as optim
......@@ -174,6 +174,10 @@ def test(epoch):
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=200)
args, _ = parser.parse_known_args()
try:
RCV_CONFIG = nni.get_next_parameter()
#RCV_CONFIG = {'lr': 0.1, 'optimizer': 'Adam', 'model':'senet18'}
......@@ -182,7 +186,7 @@ if __name__ == '__main__':
prepare(RCV_CONFIG)
acc = 0.0
best_acc = 0.0
for epoch in range(start_epoch, start_epoch+200):
for epoch in range(start_epoch, start_epoch+args.epochs):
train(epoch)
acc, best_acc = test(epoch)
nni.report_intermediate_result(acc)
......
"""A deep MNIST classifier using convolutional layers."""
import argparse
import logging
import math
import tempfile
......@@ -180,7 +181,7 @@ def main(params):
test_acc = 0.0
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
"""@nni.variable(nni.choice(1, 4, 8, 16, 32), name=batch_size)"""
"""@nni.variable(nni.choice(16, 32), name=batch_size)"""
batch_size = params['batch_size']
for i in range(params['batch_num']):
batch = mnist.train.next_batch(batch_size)
......@@ -210,29 +211,27 @@ def main(params):
logger.debug('Final result is %g', test_acc)
logger.debug('Send final result done.')
def generate_default_params():
'''
Generate default parameters for mnist network.
'''
params = {
'data_dir': '/tmp/tensorflow/mnist/input_data',
'dropout_rate': 0.5,
'channel_1_num': 32,
'channel_2_num': 64,
'conv_size': 5,
'pool_size': 2,
'hidden_size': 1024,
'learning_rate': 1e-4,
'batch_num': 2000,
'batch_size': 32}
return params
def get_params():
''' Get parameters from command line '''
parser = argparse.ArgumentParser()
parser.add_argument("--data_dir", type=str, default='/tmp/tensorflow/mnist/input_data', help="data directory")
parser.add_argument("--dropout_rate", type=float, default=0.5, help="dropout rate")
parser.add_argument("--channel_1_num", type=int, default=32)
parser.add_argument("--channel_2_num", type=int, default=64)
parser.add_argument("--conv_size", type=int, default=5)
parser.add_argument("--pool_size", type=int, default=2)
parser.add_argument("--hidden_size", type=int, default=1024)
parser.add_argument("--learning_rate", type=float, default=1e-4)
parser.add_argument("--batch_num", type=int, default=2000)
parser.add_argument("--batch_size", type=int, default=32)
args, _ = parser.parse_known_args()
return args
if __name__ == '__main__':
'''@nni.get_next_parameter()'''
try:
main(generate_default_params())
main(vars(get_params()))
except Exception as exception:
logger.exception(exception)
raise
"""A deep MNIST classifier using convolutional layers."""
import argparse
import logging
import math
import tempfile
......@@ -198,33 +199,30 @@ def main(params):
logger.debug('Final result is %g', test_acc)
logger.debug('Send final result done.')
def generate_default_params():
'''
Generate default parameters for mnist network.
'''
params = {
'data_dir': '/tmp/tensorflow/mnist/input_data',
'dropout_rate': 0.5,
'channel_1_num': 32,
'channel_2_num': 64,
'conv_size': 5,
'pool_size': 2,
'hidden_size': 1024,
'learning_rate': 1e-4,
'batch_num': 2000,
'batch_size': 32}
return params
def get_params():
''' Get parameters from command line '''
parser = argparse.ArgumentParser()
parser.add_argument("--data_dir", type=str, default='/tmp/tensorflow/mnist/input_data', help="data directory")
parser.add_argument("--dropout_rate", type=float, default=0.5, help="dropout rate")
parser.add_argument("--channel_1_num", type=int, default=32)
parser.add_argument("--channel_2_num", type=int, default=64)
parser.add_argument("--conv_size", type=int, default=5)
parser.add_argument("--pool_size", type=int, default=2)
parser.add_argument("--hidden_size", type=int, default=1024)
parser.add_argument("--learning_rate", type=float, default=1e-4)
parser.add_argument("--batch_num", type=int, default=2000)
parser.add_argument("--batch_size", type=int, default=32)
args, _ = parser.parse_known_args()
return args
if __name__ == '__main__':
try:
# get parameters form tuner
RCV_PARAMS = nni.get_next_parameter()
logger.debug(RCV_PARAMS)
# run
params = generate_default_params()
params.update(RCV_PARAMS)
tuner_params = nni.get_next_parameter()
logger.debug(tuner_params)
params = vars(get_params())
params.update(tuner_params)
main(params)
except Exception as exception:
logger.exception(exception)
......
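The defaults-then-update pattern introduced in this diff (argparse defaults merged with tuner output) can be sketched in isolation; `tuner_params` below is a hard-coded stand-in for whatever nni.get_next_parameter() would return:

```python
import argparse

def get_params():
    '''Get parameters from the command line, ignoring unknown flags.'''
    parser = argparse.ArgumentParser()
    parser.add_argument("--dropout_rate", type=float, default=0.5)
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    # parse_known_args tolerates extra flags the launching platform may inject;
    # an explicit empty list is passed here only to keep the sketch self-contained.
    args, _ = parser.parse_known_args([])
    return args

params = vars(get_params())             # Namespace -> plain dict of defaults
tuner_params = {'learning_rate': 1e-3}  # stand-in for nni.get_next_parameter()
params.update(tuner_params)             # tuner values override the defaults
print(params)
```

Replacing the hard-coded dict with defaults lets the same script run both under NNI and standalone, which is the point of the change above.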
<svg id="图层_1" data-name="图层 1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" viewBox="0 0 806.55 233.23"><defs><style>.cls-1,.cls-10,.cls-11,.cls-12,.cls-2,.cls-3,.cls-5,.cls-9{fill:none;}.cls-1{clip-rule:evenodd;}.cls-14,.cls-2{fill-rule:evenodd;}.cls-11,.cls-3{stroke:#c1c1c1;}.cls-3{stroke-miterlimit:10;}.cls-14,.cls-4{fill:#fff;}.cls-12,.cls-5{stroke:#000;}.cls-10,.cls-11,.cls-12,.cls-5,.cls-9{stroke-miterlimit:8;stroke-width:1.53px;}.cls-6{font-size:18px;fill:#0071bc;font-family:SegoeUI, Segoe UI;}.cls-7{letter-spacing:-0.1em;}.cls-8{font-family:SegoeUIBlack, Segoe UI;}.cls-10,.cls-9{stroke:#505050;}.cls-10,.cls-11,.cls-12{stroke-linecap:square;}.cls-13{clip-path:url(#clip-path);}</style><clipPath id="clip-path"><polygon class="cls-1" points="643.38 226.46 702.27 226.46 702.27 171.36 643.38 171.36 643.38 226.46 643.38 226.46"/></clipPath></defs><title>overview</title><path class="cls-3" d="M701.49,57.48H104.59"/><path class="cls-3" d="M110.48,198.91H686.9"/><path class="cls-3" d="M104.59,198.91a70.72,70.72,0,0,1,0-141.43"/><path class="cls-3" d="M698.51,57.48a70.72,70.72,0,0,1,0,141.43"/><path class="cls-4" d="M396.26,164a34.61,34.61,0,1,1-34.59,34.6A34.6,34.6,0,0,1,396.26,164"/><path class="cls-4" d="M134.8,154.46a42,42,0,1,1-42,42,42,42,0,0,1,42-42"/><path class="cls-5" d="M377.34,201.92c0-1-.15-2-.15-3a20.37,20.37,0,0,1,20.27-20.48c1,0,1.84.14,2.69.14m-8.36,40.11a24.59,24.59,0,0,0,5.67.71,20.37,20.37,0,0,0,20.27-20.48,12.59,12.59,0,0,0-.15-2.44M397.46,215.1l-5.81,3.43,3.54,5.73m-.71-42.11,5.81-3.44L396.75,173M414,184h-8.36v8.31H414V184ZM383,216.1a5.88,5.88,0,1,0-5.81-5.87A5.9,5.9,0,0,0,383,216.1Z"/><path class="cls-4" d="M672.21,154.46a42,42,0,1,1-42,42,42,42,0,0,1,42-42"/><rect class="cls-4" x="210.1" y="23.41" width="86.5" height="113"/><rect class="cls-4" x="488.78" y="25.62" width="107" height="113"/><text class="cls-6" transform="translate(174.39 103.07)">Command Line <tspan class="cls-7" x="124.94" y="0">T</tspan><tspan 
x="132.61" y="0">ool</tspan><tspan class="cls-8"><tspan x="45.01" y="23.6">NNICTL</tspan></tspan></text><text class="cls-6" transform="translate(493.22 105.07)">Visualized UI<tspan class="cls-8"><tspan x="3.81" y="23.6">NNI Board</tspan></tspan></text><path class="cls-9" d="M240.73,53.54a12.69,12.69,0,1,1,12.69,12.38,12.55,12.55,0,0,1-12.69-12.38Zm20.42-21.33a23.92,23.92,0,0,0-7.73-1.28c-12.84,0-23.19,10.1-23.19,22.61A22.58,22.58,0,0,0,243.5,74m2.19.85a22.63,22.63,0,0,0,7.73,1.28c12.83,0,23.19-10.1,23.19-22.62a22.48,22.48,0,0,0-1.17-7m-.87-2.42a23.09,23.09,0,0,0-10.79-10.81m-4.09,5.55-1.17,3.27m10.8,5.54L265.23,49m-.14,9.25,4.08,1.56m-8.9,8.82-1.89-3.7m-11.09,3.84,1.61-3.69M238.1,59.8l3.65-1.42M238.1,47.71,241.6,49m7.15-7-1.6-3.12"/><path class="cls-10" d="M567.21,74.12h-46.5V37.28h46.5V74.12m-46.5-27.63h45.74m-27.14,6.15h-12.4V68h12.4V52.64m4.65,0h9.3M544,58.77h9.3M544,64.91h6.2"/><polyline class="cls-3" points="357.3 194.3 361.8 198.62 357.3 202.95"/><polyline class="cls-3" points="625.71 194.3 630.21 198.62 625.71 202.95"/><path class="cls-4" d="M265.88,179.36a20,20,0,1,1-20,20,20,20,0,0,1,20-20"/><path class="cls-4" d="M537.65,179.34a20,20,0,1,1-20,20,20,20,0,0,1,20-20"/><path class="cls-11" d="M265.88,187.49A11.87,11.87,0,1,1,254,199.35a11.86,11.86,0,0,1,11.86-11.86ZM264,205.39l6-6-6-6"/><path class="cls-11" d="M537.65,187.49a11.87,11.87,0,1,1-11.86,11.86,11.86,11.86,0,0,1,11.86-11.86Zm-1.9,17.9,6-6-6-6"/><path class="cls-5" 
d="M112,177.18v8.74m17.11-8.74v8.74M146,177.18v8.74m-22.51-2.73v-3.28a2.88,2.88,0,0,0-3-2.73h0a2.88,2.88,0,0,0-3,2.73v3.28a3,3,0,0,0,3,2.91h0a3,3,0,0,0,3-2.91Zm16.93,0v-3.28a2.89,2.89,0,0,0-3-2.73h0a2.83,2.83,0,0,0-2.79,2.73v3.28a3,3,0,0,0,2.79,2.91h0a3,3,0,0,0,3-2.91Zm17.11,0v-3.28a3,3,0,0,0-3-2.73h0a2.88,2.88,0,0,0-3,2.73v3.28a3,3,0,0,0,3,2.91h0a3.17,3.17,0,0,0,3-2.91ZM112,206.85v8.92m17.11-8.92v8.92M146,206.85v8.92m-22.51-2.91v-3.1a2.9,2.9,0,0,0-3-2.91h0a2.9,2.9,0,0,0-3,2.91v3.1a2.91,2.91,0,0,0,3,2.91h0a2.91,2.91,0,0,0,3-2.91Zm16.93,0v-3.1a2.91,2.91,0,0,0-3-2.91h0a2.86,2.86,0,0,0-2.79,2.91v3.1a2.87,2.87,0,0,0,2.79,2.91h0a2.92,2.92,0,0,0,3-2.91Zm17.11,0v-3.1a3,3,0,0,0-3-2.91h0a2.9,2.9,0,0,0-3,2.91v3.1a2.91,2.91,0,0,0,3,2.91h0a3,3,0,0,0,3-2.91Zm-34-20.57V201m16.93-8.73V201M118,198.29V195a2.87,2.87,0,0,0-3-2.73h0a2.87,2.87,0,0,0-3,2.73v3.27a3,3,0,0,0,3,2.92h0a3,3,0,0,0,3-2.92Zm16.93,0V195a2.88,2.88,0,0,0-3-2.73h0a2.83,2.83,0,0,0-2.79,2.73v3.27a3,3,0,0,0,2.79,2.92h0a3,3,0,0,0,3-2.92Zm17.11,0V195a3,3,0,0,0-3-2.73h0a2.87,2.87,0,0,0-3,2.73v3.27a3,3,0,0,0,3,2.92h0a3.17,3.17,0,0,0,3-2.92Zm5.58-6V201"/><path class="cls-12" d="M691,208.69v.17H658.49a14.59,14.59,0,0,1-14.34-14.73,14.45,14.45,0,0,1,14.34-14.56,17.76,17.76,0,0,1,3,.17,15.64,15.64,0,0,1,13.34-7.62,15.84,15.84,0,0,1,15.67,14.39h0a11.1,11.1,0,0,1,.5,22.18Z"/><g class="cls-13"><polygon class="cls-14" points="670.15 197.38 699.21 197.38 699.21 226.46 670.15 226.46 670.15 197.38 670.15 197.38"/><g class="cls-13"><path class="cls-5" d="M682,194.24a3,3,0,0,1,6.05,0v1h4v4H678v-4h4v-1Zm-4,1h-4v25.87h22.17V195.24h-4m1,9.95h-8.07m8.07,6h-8.07m8.07,6h-8.07m-7-13.93,2,2,3-3m-5,7,2,2,3-3m-5,7,2,2,3-3"/></g></g></svg>
......@@ -25,6 +25,7 @@ const UPDATE_SEARCH_SPACE = 'SS';
const ADD_CUSTOMIZED_TRIAL_JOB = 'AD';
const TRIAL_END = 'EN';
const TERMINATE = 'TE';
const PING = 'PI';
const INITIALIZED = 'ID';
const NEW_TRIAL_JOB = 'TR';
......@@ -39,6 +40,7 @@ const TUNER_COMMANDS: Set<string> = new Set([
UPDATE_SEARCH_SPACE,
ADD_CUSTOMIZED_TRIAL_JOB,
TERMINATE,
PING,
INITIALIZED,
NEW_TRIAL_JOB,
......@@ -63,6 +65,7 @@ export {
ADD_CUSTOMIZED_TRIAL_JOB,
TRIAL_END,
TERMINATE,
PING,
INITIALIZED,
NEW_TRIAL_JOB,
NO_MORE_TRIAL_JOBS,
......
......@@ -35,15 +35,15 @@ import {
import {
TrainingService, TrialJobApplicationForm, TrialJobDetail, TrialJobMetric, TrialJobStatus
} from '../common/trainingService';
import { delay, getLogDir, getCheckpointDir, getMsgDispatcherCommand, mkDirP } from '../common/utils';
import { delay, getCheckpointDir, getLogDir, getMsgDispatcherCommand, mkDirP } from '../common/utils';
import {
ADD_CUSTOMIZED_TRIAL_JOB, INITIALIZE, INITIALIZED, KILL_TRIAL_JOB, NEW_TRIAL_JOB, NO_MORE_TRIAL_JOBS,
ADD_CUSTOMIZED_TRIAL_JOB, INITIALIZE, INITIALIZED, KILL_TRIAL_JOB, NEW_TRIAL_JOB, NO_MORE_TRIAL_JOBS, PING,
REPORT_METRIC_DATA, REQUEST_TRIAL_JOBS, SEND_TRIAL_JOB_PARAMETER, TERMINATE, TRIAL_END, UPDATE_SEARCH_SPACE
} from './commands';
import { createDispatcherInterface, IpcInterface } from './ipcInterface';
/**
* NNIManager
* NNIManager which implements Manager interface
*/
class NNIManager implements Manager {
private trainingService: TrainingService;
......@@ -360,6 +360,16 @@ class NNIManager implements Manager {
}
}
private async pingDispatcher(): Promise<void> {
if (this.dispatcher === undefined) {
throw new Error('Error: tuner has not been setup');
}
while (!['ERROR', 'STOPPING', 'STOPPED'].includes(this.status.status)) {
await delay(1000 * 5);
this.dispatcher.sendCommand(PING);
}
}
private async requestTrialJobsStatus(): Promise<number> {
let finishedTrialJobNum: number = 0;
if (this.dispatcher === undefined) {
......@@ -424,7 +434,7 @@ class NNIManager implements Manager {
if (this.dispatcher === undefined) {
throw new Error('Error: tuner has not been setup');
}
let allFinishedTrialJobNum: number = 0;
let allFinishedTrialJobNum: number = this.currSubmittedTrialNum;
let waitSubmittedToFinish: number;
while (this.status.status !== 'STOPPING' && this.status.status !== 'STOPPED') {
const finishedTrialJobNum: number = await this.requestTrialJobsStatus();
......@@ -536,6 +546,9 @@ class NNIManager implements Manager {
await Promise.all([
this.periodicallyUpdateExecDuration(),
this.pingDispatcher().catch((err: Error) => {
throw new NNIError('Dispatcher error', `Dispatcher error: ${err.message}`, err);
}),
this.trainingService.run().catch((err: Error) => {
throw new NNIError('Training service error', `Training service error: ${err.message}`, err);
}),
......
......@@ -108,6 +108,7 @@ export namespace ValidationSchemas {
}),
frameworkcontroller_config: joi.object({
storage: joi.string().min(1),
serviceAccountName: joi.string().min(1),
nfs: joi.object({
server: joi.string().min(1).required(),
path: joi.string().min(1).required()
......
......@@ -18,8 +18,11 @@
*/
'use strict';
import * as assert from 'assert';
import { KubernetesTrialConfig, KubernetesTrialConfigTemplate } from '../kubernetesConfig'
import { KubernetesTrialConfig, KubernetesTrialConfigTemplate, KubernetesClusterConfigAzure,
KubernetesClusterConfigNFS, NFSConfig, KubernetesStorageKind, keyVaultConfig, AzureStorage, KubernetesClusterConfig,
StorageConfig } from '../kubernetesConfig'
export class FrameworkAttemptCompletionPolicy {
public readonly minFailedTaskCount: number;
......@@ -57,6 +60,80 @@ export class FrameworkControllerTrialConfig extends KubernetesTrialConfig{
}
}
export class FrameworkControllerClusterConfig extends KubernetesClusterConfig {
public readonly serviceAccountName: string;
constructor(apiVersion: string, serviceAccountName: string) {
super(apiVersion);
this.serviceAccountName = serviceAccountName;
}
}
export class FrameworkControllerClusterConfigNFS extends KubernetesClusterConfigNFS {
public readonly serviceAccountName: string;
constructor(
serviceAccountName: string,
apiVersion: string,
nfs: NFSConfig,
storage?: KubernetesStorageKind
) {
super(apiVersion, nfs, storage);
this.serviceAccountName = serviceAccountName;
}
public static getInstance(jsonObject: object): FrameworkControllerClusterConfigNFS {
let kubeflowClusterConfigObjectNFS = <FrameworkControllerClusterConfigNFS>jsonObject;
assert (kubeflowClusterConfigObjectNFS !== undefined)
return new FrameworkControllerClusterConfigNFS(
kubeflowClusterConfigObjectNFS.serviceAccountName,
kubeflowClusterConfigObjectNFS.apiVersion,
kubeflowClusterConfigObjectNFS.nfs,
kubeflowClusterConfigObjectNFS.storage
);
}
}
export class FrameworkControllerClusterConfigAzure extends KubernetesClusterConfigAzure {
public readonly serviceAccountName: string;
constructor(
serviceAccountName: string,
apiVersion: string,
keyVault: keyVaultConfig,
azureStorage: AzureStorage,
storage?: KubernetesStorageKind
) {
super(apiVersion, keyVault, azureStorage,storage);
this.serviceAccountName = serviceAccountName;
}
public static getInstance(jsonObject: object): FrameworkControllerClusterConfigAzure {
let kubeflowClusterConfigObjectAzure = <FrameworkControllerClusterConfigAzure>jsonObject;
return new FrameworkControllerClusterConfigAzure(
kubeflowClusterConfigObjectAzure.serviceAccountName,
kubeflowClusterConfigObjectAzure.apiVersion,
kubeflowClusterConfigObjectAzure.keyVault,
kubeflowClusterConfigObjectAzure.azureStorage,
kubeflowClusterConfigObjectAzure.storage
);
}
}
export class FrameworkControllerClusterConfigFactory {
public static generateFrameworkControllerClusterConfig(jsonObject: object): FrameworkControllerClusterConfig {
let storageConfig = <StorageConfig>jsonObject;
if(!storageConfig) {
throw new Error("Invalid json object as a StorageConfig instance");
}
if(storageConfig.storage && storageConfig.storage === 'azureStorage') {
return FrameworkControllerClusterConfigAzure.getInstance(jsonObject);
} else if (storageConfig.storage === undefined || storageConfig.storage === 'nfs') {
return FrameworkControllerClusterConfigNFS.getInstance(jsonObject);
}
throw new Error(`Invalid json object ${jsonObject}`);
}
}
export type FrameworkControllerJobStatus = 'AttemptRunning' | 'Completed' | 'AttemptCreationPending' | 'AttemptCreationRequested' | 'AttemptPreparing' | 'AttemptCompleted';
export type FrameworkControllerJobCompleteStatus = 'Succeeded' | 'Failed';
\ No newline at end of file
......@@ -32,12 +32,13 @@ import {
TrialJobDetail, NNIManagerIpConfig
} from '../../../common/trainingService';
import { delay, generateParamFileName, getExperimentRootDir, uniqueString } from '../../../common/utils';
import { NFSConfig, KubernetesClusterConfigNFS, KubernetesClusterConfigAzure, KubernetesClusterConfigFactory } from '../kubernetesConfig'
import { NFSConfig } from '../kubernetesConfig'
import { KubernetesTrialJobDetail } from '../kubernetesData';
import { validateCodeDir } from '../../common/util';
import { AzureStorageClientUtility } from '../azureStorageClientUtils';
import { KubernetesTrainingService } from '../kubernetesTrainingService';
import { FrameworkControllerTrialConfig } from './frameworkcontrollerConfig';
import { FrameworkControllerTrialConfig, FrameworkControllerClusterConfig, FrameworkControllerClusterConfigAzure, FrameworkControllerClusterConfigNFS,
FrameworkControllerClusterConfigFactory} from './frameworkcontrollerConfig';
import { FrameworkControllerJobRestServer } from './frameworkcontrollerJobRestServer';
import { FrameworkControllerClient } from './frameworkcontrollerApiClient';
import { FrameworkControllerJobInfoCollector } from './frameworkcontrollerJobInfoCollector';
......@@ -50,6 +51,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
private fcTrialConfig?: FrameworkControllerTrialConfig; // frameworkcontroller trial configuration
private fcJobInfoCollector: FrameworkControllerJobInfoCollector; // frameworkcontroller job info collector
private fcContainerPortMap = new Map<string, number>(); // store frameworkcontroller container port
private fcClusterConfig?: FrameworkControllerClusterConfig;
constructor() {
super();
......@@ -73,7 +75,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
}
public async submitTrialJob(form: JobApplicationForm): Promise<TrialJobDetail> {
if(!this.kubernetesClusterConfig) {
if(!this.fcClusterConfig) {
throw new Error('frameworkcontrollerClusterConfig is not initialized');
}
if(!this.kubernetesCRDClient) {
......@@ -129,13 +131,13 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
* return: trialJobOutputUrl
*/
private async uploadCodeFiles(trialJobId: string, trialLocalTempFolder: string): Promise<string> {
if(!this.kubernetesClusterConfig) {
if(!this.fcClusterConfig) {
throw new Error('Kubeflow Cluster config is not initialized');
}
let trialJobOutputUrl: string = '';
if(this.kubernetesClusterConfig.storageType === 'azureStorage') {
if(this.fcClusterConfig.storageType === 'azureStorage') {
try{
//upload local files to azure storage
await AzureStorageClientUtility.uploadDirectory(this.azureStorageClient,
......@@ -146,8 +148,8 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
this.log.error(error);
return Promise.reject(error);
}
} else if(this.kubernetesClusterConfig.storageType === 'nfs') {
let nfsFrameworkControllerClusterConfig: KubernetesClusterConfigNFS = <KubernetesClusterConfigNFS>this.kubernetesClusterConfig;
} else if(this.fcClusterConfig.storageType === 'nfs') {
let nfsFrameworkControllerClusterConfig: FrameworkControllerClusterConfigNFS = <FrameworkControllerClusterConfigNFS>this.fcClusterConfig;
// Creat work dir for current trial in NFS directory
await cpp.exec(`mkdir -p ${this.trialLocalNFSTempFolder}/nni/${getExperimentId()}/${trialJobId}`);
// Copy code files from local dir to NFS mounted dir
......@@ -170,7 +172,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
throw new Error('frameworkcontroller trial config is not initialized');
}
for(let taskRole of this.fcTrialConfig.taskRoles) {
portScript += `${taskRole.name}_port=${this.fcContainerPortMap.get(taskRole.name)} `;
portScript += `FB_${taskRole.name.toUpperCase()}_PORT=${this.fcContainerPortMap.get(taskRole.name)} `;
}
return `${portScript} . /mnt/frameworkbarrier/injector.sh && ${command}`;
}
......@@ -229,9 +231,9 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
case TrialConfigMetadataKey.FRAMEWORKCONTROLLER_CLUSTER_CONFIG:
let frameworkcontrollerClusterJsonObject = JSON.parse(value);
this.kubernetesClusterConfig = KubernetesClusterConfigFactory.generateKubernetesClusterConfig(frameworkcontrollerClusterJsonObject);
if(this.kubernetesClusterConfig.storageType === 'azureStorage') {
let azureFrameworkControllerClusterConfig = <KubernetesClusterConfigAzure>this.kubernetesClusterConfig;
this.fcClusterConfig = FrameworkControllerClusterConfigFactory.generateFrameworkControllerClusterConfig(frameworkcontrollerClusterJsonObject);
if(this.fcClusterConfig.storageType === 'azureStorage') {
let azureFrameworkControllerClusterConfig = <FrameworkControllerClusterConfigAzure>this.fcClusterConfig;
this.azureStorageAccountName = azureFrameworkControllerClusterConfig.azureStorage.accountName;
this.azureStorageShare = azureFrameworkControllerClusterConfig.azureStorage.azureShare;
await this.createAzureStorage(
......@@ -240,8 +242,8 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
azureFrameworkControllerClusterConfig.azureStorage.accountName,
azureFrameworkControllerClusterConfig.azureStorage.azureShare
);
} else if(this.kubernetesClusterConfig.storageType === 'nfs') {
let nfsFrameworkControllerClusterConfig = <KubernetesClusterConfigNFS>this.kubernetesClusterConfig;
} else if(this.fcClusterConfig.storageType === 'nfs') {
let nfsFrameworkControllerClusterConfig = <FrameworkControllerClusterConfigNFS>this.fcClusterConfig;
await this.createNFSStorage(
nfsFrameworkControllerClusterConfig.nfs.server,
nfsFrameworkControllerClusterConfig.nfs.path
......@@ -292,7 +294,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
* @param podResources pod template
*/
private generateFrameworkControllerJobConfig(trialJobId: string, trialWorkingFolder: string, frameworkcontrollerJobName : string, podResources : any) : any {
if(!this.kubernetesClusterConfig) {
if(!this.fcClusterConfig) {
throw new Error('frameworkcontroller Cluster config is not initialized');
}
......@@ -346,16 +348,16 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
private generateTaskRoleConfig(trialWorkingFolder: string, replicaImage: string, runScriptFile: string, podResources: any, containerPort: number): any {
if(!this.kubernetesClusterConfig) {
if(!this.fcClusterConfig) {
throw new Error('frameworkcontroller Cluster config is not initialized');
}
if(!this.fcTrialConfig) {
throw new Error('frameworkcontroller trial config is not initialized');
}
let volumeSpecMap = new Map<string, object>();
if(this.kubernetesClusterConfig.storageType === 'azureStorage'){
if(this.fcClusterConfig.storageType === 'azureStorage'){
volumeSpecMap.set('nniVolumes', [
{
name: 'nni-vol',
......@@ -369,7 +371,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
emptyDir: {}
}])
}else {
-let frameworkcontrollerClusterConfigNFS: KubernetesClusterConfigNFS = <KubernetesClusterConfigNFS> this.kubernetesClusterConfig;
+let frameworkcontrollerClusterConfigNFS: FrameworkControllerClusterConfigNFS = <FrameworkControllerClusterConfigNFS> this.fcClusterConfig;
volumeSpecMap.set('nniVolumes', [
{
name: 'nni-vol',
@@ -382,41 +384,49 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
emptyDir: {}
}])
}
+let containers = [
+{
+name: 'framework',
+image: replicaImage,
+command: ["sh", `${path.join(trialWorkingFolder, runScriptFile)}`],
+volumeMounts: [
+{
+name: 'nni-vol',
+mountPath: this.CONTAINER_MOUNT_PATH
+},{
+name: 'frameworkbarrier-volume',
+mountPath: '/mnt/frameworkbarrier'
+}],
+resources: podResources,
+ports: [{
+containerPort: containerPort
+}]
+}]
+let initContainers = [
+{
+name: 'frameworkbarrier',
+image: 'frameworkcontroller/frameworkbarrier',
+volumeMounts: [
+{
+name: 'frameworkbarrier-volume',
+mountPath: '/mnt/frameworkbarrier'
+}]
+}]
+let spec: any = {
+containers: containers,
+initContainers: initContainers,
+restartPolicy: 'OnFailure',
+volumes: volumeSpecMap.get('nniVolumes'),
+hostNetwork: false
+};
+if(this.fcClusterConfig.serviceAccountName) {
+spec.serviceAccountName = this.fcClusterConfig.serviceAccountName;
+}
let taskRole = {
pod: {
-spec: {
-containers: [
-{
-name: 'framework',
-image: replicaImage,
-command: ["sh", `${path.join(trialWorkingFolder, runScriptFile)}`],
-volumeMounts: [
-{
-name: 'nni-vol',
-mountPath: this.CONTAINER_MOUNT_PATH
-},{
-name: 'frameworkbarrier-volume',
-mountPath: '/mnt/frameworkbarrier'
-}],
-resources: podResources,
-ports: [{
-containerPort: containerPort
-}]
-}],
-initContainers: [
-{
-name: 'frameworkbarrier',
-image: 'frameworkcontroller/frameworkbarrier',
-volumeMounts: [
-{
-name: 'frameworkbarrier-volume',
-mountPath: '/mnt/frameworkbarrier'
-}]
-}],
-restartPolicy: 'OnFailure',
-volumes: volumeSpecMap.get('nniVolumes'),
-hostNetwork: false
-}
+spec: spec
}
}
return taskRole;
......
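The refactor above extracts the pod `spec` into a local so that `serviceAccountName` can be attached conditionally before the spec is nested into the task role. A hypothetical Python sketch of that pattern (function and field names are illustrative):

```python
# Build the pod spec as a standalone dict so optional fields (here
# serviceAccountName, newly supported by this change) can be added
# before nesting it into the task role.
def build_task_role(containers, init_containers, volumes, service_account_name=None):
    spec = {
        'containers': containers,
        'initContainers': init_containers,
        'restartPolicy': 'OnFailure',
        'volumes': volumes,
        'hostNetwork': False,
    }
    # Only set serviceAccountName when the cluster config provides one.
    if service_account_name:
        spec['serviceAccountName'] = service_account_name
    return {'pod': {'spec': spec}}
```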
@@ -32,8 +32,8 @@ export type OperatorApiVersion = 'v1alpha2' | 'v1beta1';
export class KubeflowClusterConfig extends KubernetesClusterConfig {
public readonly operator: KubeflowOperator;
-constructor(codeDir: string, operator: KubeflowOperator) {
-super(codeDir);
+constructor(apiVersion: string, operator: KubeflowOperator) {
+super(apiVersion);
this.operator = operator;
}
}
......
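The constructor change above passes the operator `apiVersion` to the base cluster config instead of `codeDir`. A Python rendering of the resulting class relationship (class names mirror the diff; the implementation is illustrative):

```python
# Base config now keyed on the operator API version.
class KubernetesClusterConfig:
    def __init__(self, api_version):
        self.api_version = api_version

# Kubeflow subclass forwards apiVersion to the base and keeps the operator.
class KubeflowClusterConfig(KubernetesClusterConfig):
    def __init__(self, api_version, operator):
        super().__init__(api_version)
        self.operator = operator
```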
@@ -83,7 +84,8 @@ class MsgDispatcherBase(Recoverable):
_logger.debug('handle request: command: [{}], data: [{}]'.format(command, data))
-data = json_tricks.loads(data)
+if data:
+    data = json_tricks.loads(data)
command_handlers = {
# Tuner commands:
@@ -96,12 +97,16 @@ class MsgDispatcherBase(Recoverable):
CommandType.ReportMetricData: self.handle_report_metric_data,
CommandType.TrialEnd: self.handle_trial_end,
+CommandType.Ping: self.handle_ping,
}
if command not in command_handlers:
raise AssertionError('Unsupported command: {}'.format(command))
return command_handlers[command](data)
+def handle_ping(self, data):
+    pass
def handle_initialize(self, data):
raise NotImplementedError('handle_initialize not implemented')
......
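The dispatcher hunks above add a `Ping` keep-alive command, guard the JSON decode behind `if data:`, and reject unknown commands. A minimal Python sketch of that pattern (the real code lives in `MsgDispatcherBase` and uses `json_tricks`; plain `json` and the handler names here are illustrative):

```python
import json

class Dispatcher:
    def handle_ping(self, data):
        # Keep-alive command: nothing to do.
        return None

    def handle_report_metric_data(self, data):
        return ('metric', data)

    def process_command(self, command, data):
        # Decode the payload only when one is present, mirroring the
        # `if data:` guard added in the diff.
        if data:
            data = json.loads(data)
        command_handlers = {
            'PI': self.handle_ping,
            'ME': self.handle_report_metric_data,
        }
        if command not in command_handlers:
            raise AssertionError('Unsupported command: {}'.format(command))
        return command_handlers[command](data)
```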
@@ -33,6 +33,7 @@ class CommandType(Enum):
AddCustomizedTrialJob = b'AD'
TrialEnd = b'EN'
Terminate = b'TE'
+Ping = b'PI'
# out
Initialized = b'ID'
......
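The enum extension above registers the new keep-alive command as a two-byte code. A sketch reproducing just the members shown in the hunk (the real `CommandType` has more members):

```python
from enum import Enum

# Each protocol command is identified by a two-byte code; Ping (b'PI')
# is the newly added keep-alive command.
class CommandType(Enum):
    AddCustomizedTrialJob = b'AD'
    TrialEnd = b'EN'
    Terminate = b'TE'
    Ping = b'PI'
```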
@@ -6,7 +6,7 @@ import {
Experiment, TableObj,
Parameters, TrialNumber
} from '../static/interface';
-import { getFinalResult } from '../static/function';
+import { getFinal } from '../static/function';
import SuccessTable from './overview/SuccessTable';
import Title1 from './overview/Title1';
import Progressed from './overview/Progress';
@@ -62,17 +62,7 @@ class Overview extends React.Component<{}, OverviewState> {
tuner: {},
trainingServicePlatform: ''
},
-tableData: [{
-key: 0,
-sequenceId: 0,
-id: '',
-duration: 0,
-status: '',
-acc: 0,
-description: {
-parameters: {}
-}
-}],
+tableData: [],
option: {},
noData: '',
// accuracy
@@ -224,7 +214,7 @@ class Overview extends React.Component<{}, OverviewState> {
parameters: {}
};
const duration = (tableData[item].endTime - tableData[item].startTime) / 1000;
-const acc = getFinalResult(tableData[item].finalMetricData);
+const acc = getFinal(tableData[item].finalMetricData);
// if hyperParameters is undefined, show an error message; otherwise show the parameter values
if (tableData[item].hyperParameters) {
const parameters = JSON.parse(tableData[item].hyperParameters[0]).parameters;
@@ -256,16 +246,16 @@ class Overview extends React.Component<{}, OverviewState> {
const { isTop10 } = this.state;
if (isTop10 === true) {
topTableData.sort((a: TableObj, b: TableObj) => {
-if (a.acc && b.acc) {
-return b.acc - a.acc;
+if (a.acc !== undefined && b.acc !== undefined) {
+return JSON.parse(b.acc.default) - JSON.parse(a.acc.default);
} else {
return NaN;
}
});
} else {
topTableData.sort((a: TableObj, b: TableObj) => {
-if (a.acc && b.acc) {
-return a.acc - b.acc;
+if (a.acc !== undefined && b.acc !== undefined) {
+return JSON.parse(a.acc.default) - JSON.parse(b.acc.default);
} else {
return NaN;
}
@@ -275,7 +265,7 @@ class Overview extends React.Component<{}, OverviewState> {
let bestDefaultMetric = 0;
if (topTableData[0] !== undefined) {
if (topTableData[0].acc !== undefined) {
-bestDefaultMetric = topTableData[0].acc;
+bestDefaultMetric = JSON.parse(topTableData[0].acc.default);
}
}
if (this._isMounted) {
@@ -308,7 +298,7 @@ class Overview extends React.Component<{}, OverviewState> {
const indexarr: Array<number> = [];
Object.keys(sourcePoint).map(item => {
const items = sourcePoint[item];
-accarr.push(items.acc);
+accarr.push(items.acc.default);
indexarr.push(items.sequenceId);
});
const accOption = {
......
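The Overview.tsx changes above reflect a new metric shape: `acc` is now an object whose `default` field is a JSON-encoded string, so it must be parsed before numeric comparison. A Python sketch of the resulting sort (the record layout is illustrative, not the exact `TableObj` shape):

```python
import json

# Sort trial records by the parsed default metric, skipping rows
# without a final metric, mirroring the undefined-checks in the diff.
def top_trials(table, descending=True):
    rows = [r for r in table if r.get('acc') is not None]
    return sorted(rows,
                  key=lambda r: json.loads(r['acc']['default']),
                  reverse=descending)
```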
@@ -3,7 +3,7 @@ import axios from 'axios';
import { MANAGER_IP } from '../static/const';
import { Row, Col, Tabs, Input, Select, Button } from 'antd';
const Option = Select.Option;
-import { TableObjFianl, Parameters, DetailAccurPoint, TooltipForAccuracy } from '../static/interface';
+import { TableObj, Parameters, DetailAccurPoint, TooltipForAccuracy } from '../static/interface';
import { getFinalResult, getFinal } from '../static/function';
import Accuracy from './overview/Accuracy';
import Duration from './trial-detail/Duration';
@@ -16,8 +16,8 @@ import '../static/style/trialsDetail.scss';
interface TrialDetailState {
accSource: object;
accNodata: string;
-tableListSource: Array<TableObjFianl>;
-searchResultSource: Array<TableObjFianl>;
+tableListSource: Array<TableObj>;
+searchResultSource: Array<TableObj>;
isHasSearch: boolean;
experimentStatus: string;
entriesTable: number;
@@ -136,7 +136,7 @@ class TrialsDetail extends React.Component<{}, TrialDetailState> {
.then(res => {
if (res.status === 200) {
const trialJobs = res.data;
-const trialTable: Array<TableObjFianl> = [];
+const trialTable: Array<TableObj> = [];
Object.keys(trialJobs).map(item => {
// only succeeded trials have finalMetricData
let desc: Parameters = {
@@ -189,7 +189,7 @@ class TrialsDetail extends React.Component<{}, TrialDetailState> {
Object.keys(searchResultSource).map(index => {
temp.push(searchResultSource[index].id);
});
-const searchResultList: Array<TableObjFianl> = [];
+const searchResultList: Array<TableObj> = [];
for (let i = 0; i < temp.length; i++) {
Object.keys(trialTable).map(key => {
const item = trialTable[key];
@@ -221,7 +221,7 @@ class TrialsDetail extends React.Component<{}, TrialDetailState> {
.then(res => {
if (res.status === 200) {
const trialJobs = res.data;
-const trialTable: Array<TableObjFianl> = [];
+const trialTable: Array<TableObj> = [];
Object.keys(trialJobs).map(item => {
// only succeeded trials have finalMetricData
let desc: Parameters = {
@@ -312,7 +312,7 @@ class TrialsDetail extends React.Component<{}, TrialDetailState> {
} else {
window.clearInterval(this.interAllTableList);
const { tableListSource } = this.state;
-const searchResultList: Array<TableObjFianl> = [];
+const searchResultList: Array<TableObj> = [];
Object.keys(tableListSource).map(key => {
const item = tableListSource[key];
if (item.sequenceId.toString() === targetValue || item.id.includes(targetValue)) {
......