Unverified commit ef176d29, authored by SparkSnail and committed by GitHub

Merge pull request #116 from Microsoft/master

merge master
parents 97866505 4553de75
...@@ -14,9 +14,71 @@
NNI (Neural Network Intelligence) is a toolkit to help users run automated machine learning (AutoML) experiments.
The tool dispatches and runs trial jobs generated by tuning algorithms to search the best neural architecture and/or hyper-parameters in different environments like local machine, remote servers and cloud.
### **NNI [v0.5](https://github.com/Microsoft/nni/releases) has been released!**
<p align="center">
<a href=#><img src="https://rawgit.com/QuanluZhang/nni/update-doc11/overview.svg" /></a>
</p>
<table>
<tbody>
<tr align="center">
<td>
<b>User Code + SDK (import nni)</b>
<img src="https://user-images.githubusercontent.com/44491713/51381727-e3d0f780-1b4f-11e9-96ab-d26b9198ba65.png"/>
</td>
<td>
<b>Tuning Algorithm Extensions</b>
<img src="https://user-images.githubusercontent.com/44491713/51381727-e3d0f780-1b4f-11e9-96ab-d26b9198ba65.png"/>
</td>
<td>
<b>Training Service Extensions</b>
<img src="https://user-images.githubusercontent.com/44491713/51381727-e3d0f780-1b4f-11e9-96ab-d26b9198ba65.png"/>
</td>
</tr>
<tr/>
<tr valign="top">
<td>
<ul>
<li>CNTK</li>
<li>Tensorflow</li>
<li>PyTorch</li>
<li>Keras</li>
<li>...</li>
</ul>
(Python based frameworks)
</td>
<td>
<a href="docs/HowToChooseTuner.md">Tuner</a>
<ul>
<li><a href="docs/HowToChooseTuner.md#TPE">TPE</a></li>
<li><a href="docs/HowToChooseTuner.md#Random">Random Search</a></li>
<li><a href="docs/HowToChooseTuner.md#Anneal">Anneal</a></li>
<li><a href="docs/HowToChooseTuner.md#Evolution">Naive Evolution</a></li>
<li><a href="docs/HowToChooseTuner.md#SMAC">SMAC</a></li>
<li><a href="docs/HowToChooseTuner.md#Batch">Batch</a></li>
<li><a href="docs/HowToChooseTuner.md#Grid">Grid Search</a></li>
<li><a href="docs/HowToChooseTuner.md#Hyperband">Hyperband</a></li>
<li><a href="docs/HowToChooseTuner.md#NetworkMorphism">Network Morphism</a></li>
<li><a href="examples/tuners/enas_nni/README.md">ENAS</a></li>
<li><a href="docs/HowToChooseTuner.md#NetworkMorphism#MetisTuner">Metis Tuner</a></li>
</ul>
<a href="docs/HowToChooseTuner.md#assessor">Assessor</a>
<ul>
<li><a href="docs/HowToChooseTuner.md#Medianstop">Median Stop</a></li>
<li><a href="docs/HowToChooseTuner.md#Curvefitting">Curve Fitting</a></li>
</ul>
</td>
<td>
<ul>
<li><a href="docs/tutorial_1_CR_exp_local_api.md">Local Machine</a></li>
<li><a href="docs/tutorial_2_RemoteMachineMode.md">Remote Servers</a></li>
<li><a href="docs/PAIMode.md">OpenPAI</a></li>
<li><a href="docs/KubeflowMode.md">Kubeflow</a></li>
<li><a href="docs/KubeflowMode.md">FrameworkController on K8S (AKS etc.)</a></li>
</ul>
</td>
</tr>
</tbody>
</table>
## **Who should consider using NNI**
* Those who want to try different AutoML algorithms in their training code (model) at their local machine.
...@@ -35,12 +97,14 @@ We encourage researchers and students leverage these projects to accelerate the
**Install through pip**
* We currently support Linux and macOS; Ubuntu 16.04 or higher and macOS 10.14.1 are tested and supported. Simply run the following `pip install` in an environment that has `python >= 3.5`.
```bash
python3 -m pip install --upgrade nni
```
Note:
  * `--user` can be added if you want to install NNI in your home directory, which does not require any special privileges.
  * If there is any error like `Segmentation fault`, please refer to [FAQ](docs/FAQ.md)

**Install through source code**
* We support Linux (Ubuntu 16.04 or higher) and macOS (10.14.1) in our current stage.
......
trigger:
- dev-it
- master
- dev-remote-ci
jobs:
- job: 'Ubuntu_16_04'
  pool: 'NNI CI GPU'
  steps:
  - script: python3 -m pip install --upgrade pip setuptools --user
    displayName: 'Install python tools'
  - script: |
      source install.sh
    displayName: 'Install nni toolkit via source code'
  - script: |
      python3 -m pip install scikit-learn==0.20.0 --user
      python3 -m pip install torch==0.4.1 --user
      python3 -m pip install torchvision==0.2.1 --user
      python3 -m pip install keras==2.1.6 --user
      python3 -m pip install tensorflow-gpu==1.10.0 --user
    displayName: 'Install dependencies for integration tests'
  - script: |
      cd test
      source unittest.sh
...@@ -25,11 +27,19 @@ jobs:
  - script: |
      cd test
      PATH=$HOME/.local/bin:$PATH python3 naive_test.py
    displayName: 'Naive test'
  - script: |
      cd test
      PATH=$HOME/.local/bin:$PATH python3 tuner_test.py
    displayName: 'Built-in tuners / assessors tests'
  - script: |
      cd test
      PATH=$HOME/.local/bin:$PATH python3 config_test.py --ts local
    displayName: 'Examples and advanced features tests on local machine'
  - script: |
      cd test
      PATH=$HOME/.local/bin:$PATH python3 metrics_test.py
    displayName: 'Trial job metrics test'
- job: 'macOS_10_13'
  pool:
...@@ -52,8 +62,8 @@ jobs:
  - script: |
      cd test
      PATH=$HOME/Library/Python/3.7/bin:$PATH python3 naive_test.py
    displayName: 'Naive test'
  - script: |
      cd test
      PATH=$HOME/Library/Python/3.7/bin:$PATH python3 tuner_test.py
    displayName: 'Built-in tuners / assessors tests'
\ No newline at end of file
...@@ -110,3 +110,4 @@ The experiment has been running now, NNI provides WebUI for you to view experiment
* [How to run an experiment on local (with multiple GPUs)?](tutorial_1_CR_exp_local_api.md)
* [How to run an experiment on multiple machines?](tutorial_2_RemoteMachineMode.md)
* [How to run an experiment on OpenPAI?](PAIMode.md)
* [How to create a multi-phase experiment](multiPhase.md)
...@@ -244,7 +244,7 @@ _Usage_:
    optimize_mode: maximize
```
<a name="assessor"></a>
# How to use Assessor that NNI supports?
For now, NNI supports the following assessor algorithms.
......
# nnictl

## Introduction
......
## Create multi-phase experiment
Typically, each trial job gets a single set of configuration (e.g. hyper-parameters) from the tuner, runs some kind of experiment, say, training a model with those hyper-parameters, and reports its result to the tuner. Sometimes you may want to train multiple models within one trial job, either to share information between models or to save system resources by creating fewer trial jobs, for example:

1. Train multiple models sequentially in one trial job, so that later models can leverage the weights or other information of prior models, possibly with different hyper-parameters.
2. Train a large number of models on limited system resources; combining multiple models into one trial job saves the resources that a large number of trial jobs would otherwise need.
3. Any other scenario in which you would like to train multiple models with different hyper-parameters in one trial job. Be aware that if you allocate multiple GPUs to a trial job and train multiple models concurrently within it, your trial code needs to allocate the GPU resources properly.

In the above cases, you can leverage NNI multi-phase experiments to train multiple models with different hyper-parameters within each trial job.

Multi-phase experiments are experiments whose trial jobs request multiple sets of hyper-parameters from the tuner and report multiple final results to NNI.

To use a multi-phase experiment, please follow the steps below:

1. Implement nni.multi_phase.MultiPhaseTuner. For example, this [ENAS tuner](https://github.com/countif/enas_nni/blob/master/nni/examples/tuners/enas/nni_controller_ptb.py) is a multi-phase tuner that implements nni.multi_phase.MultiPhaseTuner. While implementing your MultiPhaseTuner, you may want to use the trial_job_id parameter of the generate_parameters method to generate hyper-parameters for each trial job (see the sketch after these steps).
2. Set the ```multiPhase``` field to ```true```, and configure the tuner implemented in step 1 as a customized tuner in the configuration file, for example:
```yml
...
multiPhase: true
tuner:
  codeDir: tuners/enas
  classFileName: nni_controller_ptb.py
  className: ENASTuner
  classArgs:
    say_hello: "hello"
...
```
3. Invoke the nni.get_next_parameter() API multiple times as needed in a trial, for example:
```python
for i in range(5):
    # get parameter from tuner
    tuner_param = nni.get_next_parameter()
    # consume the params
    # ...
    # report final result somewhere for the parameter retrieved above
    nni.report_final_result()
    # ...
```
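For step 1, the following is a minimal sketch of what the tuner side could look like. The import path, the `RandomMultiPhaseTuner` class name, the search-space handling, and the exact method signatures (in particular whether `trial_job_id` is a keyword argument of `generate_parameters` and `receive_trial_result`) are assumptions based on the description above, not the authoritative API; check them against the NNI SDK before use.

```python
import random

# Assumed import path; the text above refers to nni.multi_phase.MultiPhaseTuner.
from nni.multi_phase.multi_phase_tuner import MultiPhaseTuner


class RandomMultiPhaseTuner(MultiPhaseTuner):
    '''Hypothetical multi-phase tuner that keeps per-trial-job history.'''

    def __init__(self):
        self.search_space = None
        self.history = {}  # trial_job_id -> list of parameter sets already sent

    def update_search_space(self, search_space):
        self.search_space = search_space

    def generate_parameters(self, parameter_id, trial_job_id=None):
        # trial_job_id identifies which trial job is asking for its next set of
        # hyper-parameters, so phases of the same trial job can be correlated.
        params = {'learning_rate': random.choice([0.1, 0.01, 0.001])}
        self.history.setdefault(trial_job_id, []).append(params)
        return params

    def receive_trial_result(self, parameter_id, parameters, value, trial_job_id=None):
        # One final result is expected for every set of parameters handed out above.
        pass
```

Such a tuner would then be registered through the codeDir / classFileName / className fields shown in step 2.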
'''Train CIFAR10 with PyTorch.'''
from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.optim as optim
...@@ -174,6 +174,10 @@ def test(epoch):
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=200)
    args, _ = parser.parse_known_args()
    try:
        RCV_CONFIG = nni.get_next_parameter()
        #RCV_CONFIG = {'lr': 0.1, 'optimizer': 'Adam', 'model':'senet18'}
...@@ -182,7 +186,7 @@ if __name__ == '__main__':
        prepare(RCV_CONFIG)
        acc = 0.0
        best_acc = 0.0
        for epoch in range(start_epoch, start_epoch+args.epochs):
            train(epoch)
            acc, best_acc = test(epoch)
            nni.report_intermediate_result(acc)
......
"""A deep MNIST classifier using convolutional layers.""" """A deep MNIST classifier using convolutional layers."""
import argparse
import logging import logging
import math import math
import tempfile import tempfile
...@@ -180,7 +181,7 @@ def main(params): ...@@ -180,7 +181,7 @@ def main(params):
test_acc = 0.0 test_acc = 0.0
with tf.Session() as sess: with tf.Session() as sess:
sess.run(tf.global_variables_initializer()) sess.run(tf.global_variables_initializer())
"""@nni.variable(nni.choice(1, 4, 8, 16, 32), name=batch_size)""" """@nni.variable(nni.choice(16, 32), name=batch_size)"""
batch_size = params['batch_size'] batch_size = params['batch_size']
for i in range(params['batch_num']): for i in range(params['batch_num']):
batch = mnist.train.next_batch(batch_size) batch = mnist.train.next_batch(batch_size)
...@@ -210,29 +211,27 @@ def main(params):
    logger.debug('Final result is %g', test_acc)
    logger.debug('Send final result done.')

def get_params():
    ''' Get parameters from command line '''
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", type=str, default='/tmp/tensorflow/mnist/input_data', help="data directory")
    parser.add_argument("--dropout_rate", type=float, default=0.5, help="dropout rate")
    parser.add_argument("--channel_1_num", type=int, default=32)
    parser.add_argument("--channel_2_num", type=int, default=64)
    parser.add_argument("--conv_size", type=int, default=5)
    parser.add_argument("--pool_size", type=int, default=2)
    parser.add_argument("--hidden_size", type=int, default=1024)
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--batch_num", type=int, default=2000)
    parser.add_argument("--batch_size", type=int, default=32)

    args, _ = parser.parse_known_args()
    return args

if __name__ == '__main__':
    '''@nni.get_next_parameter()'''
    try:
        main(vars(get_params()))
    except Exception as exception:
        logger.exception(exception)
        raise
"""A deep MNIST classifier using convolutional layers.""" """A deep MNIST classifier using convolutional layers."""
import argparse
import logging import logging
import math import math
import tempfile import tempfile
...@@ -198,33 +199,30 @@ def main(params): ...@@ -198,33 +199,30 @@ def main(params):
logger.debug('Final result is %g', test_acc) logger.debug('Final result is %g', test_acc)
logger.debug('Send final result done.') logger.debug('Send final result done.')
def get_params():
def generate_default_params(): ''' Get parameters from command line '''
''' parser = argparse.ArgumentParser()
Generate default parameters for mnist network. parser.add_argument("--data_dir", type=str, default='/tmp/tensorflow/mnist/input_data', help="data directory")
''' parser.add_argument("--dropout_rate", type=float, default=0.5, help="dropout rate")
params = { parser.add_argument("--channel_1_num", type=int, default=32)
'data_dir': '/tmp/tensorflow/mnist/input_data', parser.add_argument("--channel_2_num", type=int, default=64)
'dropout_rate': 0.5, parser.add_argument("--conv_size", type=int, default=5)
'channel_1_num': 32, parser.add_argument("--pool_size", type=int, default=2)
'channel_2_num': 64, parser.add_argument("--hidden_size", type=int, default=1024)
'conv_size': 5, parser.add_argument("--learning_rate", type=float, default=1e-4)
'pool_size': 2, parser.add_argument("--batch_num", type=int, default=2000)
'hidden_size': 1024, parser.add_argument("--batch_size", type=int, default=32)
'learning_rate': 1e-4,
'batch_num': 2000, args, _ = parser.parse_known_args()
'batch_size': 32} return args
return params
if __name__ == '__main__': if __name__ == '__main__':
try: try:
# get parameters form tuner # get parameters form tuner
RCV_PARAMS = nni.get_next_parameter() tuner_params = nni.get_next_parameter()
logger.debug(RCV_PARAMS) logger.debug(tuner_params)
# run params = vars(get_params())
params = generate_default_params() params.update(tuner_params)
params.update(RCV_PARAMS)
main(params) main(params)
except Exception as exception: except Exception as exception:
logger.exception(exception) logger.exception(exception)
......
[overview.svg (new file): an architecture overview diagram; its text labels read "Command Line Tool / NNICTL" and "Visualized UI / NNI Board".]
...@@ -25,6 +25,7 @@ const UPDATE_SEARCH_SPACE = 'SS';
const ADD_CUSTOMIZED_TRIAL_JOB = 'AD';
const TRIAL_END = 'EN';
const TERMINATE = 'TE';
const PING = 'PI';
const INITIALIZED = 'ID';
const NEW_TRIAL_JOB = 'TR';
...@@ -39,6 +40,7 @@ const TUNER_COMMANDS: Set<string> = new Set([
    UPDATE_SEARCH_SPACE,
    ADD_CUSTOMIZED_TRIAL_JOB,
    TERMINATE,
    PING,
    INITIALIZED,
    NEW_TRIAL_JOB,
...@@ -63,6 +65,7 @@ export {
    ADD_CUSTOMIZED_TRIAL_JOB,
    TRIAL_END,
    TERMINATE,
    PING,
    INITIALIZED,
    NEW_TRIAL_JOB,
    NO_MORE_TRIAL_JOBS,
......
...@@ -35,15 +35,15 @@ import {
import {
    TrainingService, TrialJobApplicationForm, TrialJobDetail, TrialJobMetric, TrialJobStatus
} from '../common/trainingService';
import { delay, getCheckpointDir, getLogDir, getMsgDispatcherCommand, mkDirP } from '../common/utils';
import {
    ADD_CUSTOMIZED_TRIAL_JOB, INITIALIZE, INITIALIZED, KILL_TRIAL_JOB, NEW_TRIAL_JOB, NO_MORE_TRIAL_JOBS, PING,
    REPORT_METRIC_DATA, REQUEST_TRIAL_JOBS, SEND_TRIAL_JOB_PARAMETER, TERMINATE, TRIAL_END, UPDATE_SEARCH_SPACE
} from './commands';
import { createDispatcherInterface, IpcInterface } from './ipcInterface';

/**
 * NNIManager which implements Manager interface
 */
class NNIManager implements Manager {
    private trainingService: TrainingService;
...@@ -360,6 +360,16 @@ class NNIManager implements Manager {
        }
    }

    private async pingDispatcher(): Promise<void> {
        if (this.dispatcher === undefined) {
            throw new Error('Error: tuner has not been setup');
        }
        while (!['ERROR', 'STOPPING', 'STOPPED'].includes(this.status.status)) {
            await delay(1000 * 5);
            this.dispatcher.sendCommand(PING);
        }
    }

    private async requestTrialJobsStatus(): Promise<number> {
        let finishedTrialJobNum: number = 0;
        if (this.dispatcher === undefined) {
...@@ -424,7 +434,7 @@ class NNIManager implements Manager {
        if (this.dispatcher === undefined) {
            throw new Error('Error: tuner has not been setup');
        }
        let allFinishedTrialJobNum: number = this.currSubmittedTrialNum;
        let waitSubmittedToFinish: number;
        while (this.status.status !== 'STOPPING' && this.status.status !== 'STOPPED') {
            const finishedTrialJobNum: number = await this.requestTrialJobsStatus();
...@@ -536,6 +546,9 @@ class NNIManager implements Manager {
        await Promise.all([
            this.periodicallyUpdateExecDuration(),
            this.pingDispatcher().catch((err: Error) => {
                throw new NNIError('Dispatcher error', `Dispatcher error: ${err.message}`, err);
            }),
            this.trainingService.run().catch((err: Error) => {
                throw new NNIError('Training service error', `Training service error: ${err.message}`, err);
            }),
......
...@@ -108,6 +108,7 @@ export namespace ValidationSchemas {
        }),
        frameworkcontroller_config: joi.object({
            storage: joi.string().min(1),
            serviceAccountName: joi.string().min(1),
            nfs: joi.object({
                server: joi.string().min(1).required(),
                path: joi.string().min(1).required()
......
...@@ -18,8 +18,11 @@
 */
'use strict';
import * as assert from 'assert';
import { KubernetesTrialConfig, KubernetesTrialConfigTemplate, KubernetesClusterConfigAzure,
    KubernetesClusterConfigNFS, NFSConfig, KubernetesStorageKind, keyVaultConfig, AzureStorage, KubernetesClusterConfig,
    StorageConfig } from '../kubernetesConfig'
export class FrameworkAttemptCompletionPolicy {
public readonly minFailedTaskCount: number;
...@@ -57,6 +60,80 @@ export class FrameworkControllerTrialConfig extends KubernetesTrialConfig{
}
}
export class FrameworkControllerClusterConfig extends KubernetesClusterConfig {
public readonly serviceAccountName: string;
constructor(apiVersion: string, serviceAccountName: string) {
super(apiVersion);
this.serviceAccountName = serviceAccountName;
}
}
export class FrameworkControllerClusterConfigNFS extends KubernetesClusterConfigNFS {
public readonly serviceAccountName: string;
constructor(
serviceAccountName: string,
apiVersion: string,
nfs: NFSConfig,
storage?: KubernetesStorageKind
) {
super(apiVersion, nfs, storage);
this.serviceAccountName = serviceAccountName;
}
public static getInstance(jsonObject: object): FrameworkControllerClusterConfigNFS {
let kubeflowClusterConfigObjectNFS = <FrameworkControllerClusterConfigNFS>jsonObject;
assert (kubeflowClusterConfigObjectNFS !== undefined)
return new FrameworkControllerClusterConfigNFS(
kubeflowClusterConfigObjectNFS.serviceAccountName,
kubeflowClusterConfigObjectNFS.apiVersion,
kubeflowClusterConfigObjectNFS.nfs,
kubeflowClusterConfigObjectNFS.storage
);
}
}
export class FrameworkControllerClusterConfigAzure extends KubernetesClusterConfigAzure {
public readonly serviceAccountName: string;
constructor(
serviceAccountName: string,
apiVersion: string,
keyVault: keyVaultConfig,
azureStorage: AzureStorage,
storage?: KubernetesStorageKind
) {
super(apiVersion, keyVault, azureStorage,storage);
this.serviceAccountName = serviceAccountName;
}
public static getInstance(jsonObject: object): FrameworkControllerClusterConfigAzure {
let kubeflowClusterConfigObjectAzure = <FrameworkControllerClusterConfigAzure>jsonObject;
return new FrameworkControllerClusterConfigAzure(
kubeflowClusterConfigObjectAzure.serviceAccountName,
kubeflowClusterConfigObjectAzure.apiVersion,
kubeflowClusterConfigObjectAzure.keyVault,
kubeflowClusterConfigObjectAzure.azureStorage,
kubeflowClusterConfigObjectAzure.storage
);
}
}
export class FrameworkControllerClusterConfigFactory {
public static generateFrameworkControllerClusterConfig(jsonObject: object): FrameworkControllerClusterConfig {
let storageConfig = <StorageConfig>jsonObject;
if(!storageConfig) {
throw new Error("Invalid json object as a StorageConfig instance");
}
if(storageConfig.storage && storageConfig.storage === 'azureStorage') {
return FrameworkControllerClusterConfigAzure.getInstance(jsonObject);
} else if (storageConfig.storage === undefined || storageConfig.storage === 'nfs') {
return FrameworkControllerClusterConfigNFS.getInstance(jsonObject);
}
throw new Error(`Invalid json object ${jsonObject}`);
}
}
export type FrameworkControllerJobStatus = 'AttemptRunning' | 'Completed' | 'AttemptCreationPending' | 'AttemptCreationRequested' | 'AttemptPreparing' | 'AttemptCompleted';
export type FrameworkControllerJobCompleteStatus = 'Succeeded' | 'Failed';
\ No newline at end of file
...@@ -32,12 +32,13 @@ import {
TrialJobDetail, NNIManagerIpConfig
} from '../../../common/trainingService';
import { delay, generateParamFileName, getExperimentRootDir, uniqueString } from '../../../common/utils';
import { NFSConfig } from '../kubernetesConfig'
import { KubernetesTrialJobDetail } from '../kubernetesData';
import { validateCodeDir } from '../../common/util';
import { AzureStorageClientUtility } from '../azureStorageClientUtils';
import { KubernetesTrainingService } from '../kubernetesTrainingService';
import { FrameworkControllerTrialConfig, FrameworkControllerClusterConfig, FrameworkControllerClusterConfigAzure, FrameworkControllerClusterConfigNFS,
    FrameworkControllerClusterConfigFactory} from './frameworkcontrollerConfig';
import { FrameworkControllerJobRestServer } from './frameworkcontrollerJobRestServer';
import { FrameworkControllerClient } from './frameworkcontrollerApiClient';
import { FrameworkControllerJobInfoCollector } from './frameworkcontrollerJobInfoCollector';
...@@ -50,6 +51,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
private fcTrialConfig?: FrameworkControllerTrialConfig; // frameworkcontroller trial configuration
private fcJobInfoCollector: FrameworkControllerJobInfoCollector; // frameworkcontroller job info collector
private fcContainerPortMap = new Map<string, number>(); // store frameworkcontroller container port
private fcClusterConfig?: FrameworkControllerClusterConfig;
constructor() {
super();
...@@ -73,7 +75,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
}
public async submitTrialJob(form: JobApplicationForm): Promise<TrialJobDetail> {
if(!this.fcClusterConfig) {
throw new Error('frameworkcontrollerClusterConfig is not initialized');
}
if(!this.kubernetesCRDClient) {
...@@ -129,13 +131,13 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
* return: trialJobOutputUrl
*/
private async uploadCodeFiles(trialJobId: string, trialLocalTempFolder: string): Promise<string> {
if(!this.fcClusterConfig) {
throw new Error('Kubeflow Cluster config is not initialized');
}
let trialJobOutputUrl: string = '';
if(this.fcClusterConfig.storageType === 'azureStorage') {
try{
//upload local files to azure storage
await AzureStorageClientUtility.uploadDirectory(this.azureStorageClient,
...@@ -146,8 +148,8 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
this.log.error(error);
return Promise.reject(error);
}
} else if(this.fcClusterConfig.storageType === 'nfs') {
let nfsFrameworkControllerClusterConfig: FrameworkControllerClusterConfigNFS = <FrameworkControllerClusterConfigNFS>this.fcClusterConfig;
// Create work dir for current trial in NFS directory
await cpp.exec(`mkdir -p ${this.trialLocalNFSTempFolder}/nni/${getExperimentId()}/${trialJobId}`);
// Copy code files from local dir to NFS mounted dir
...@@ -170,7 +172,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
throw new Error('frameworkcontroller trial config is not initialized');
}
for(let taskRole of this.fcTrialConfig.taskRoles) {
portScript += `FB_${taskRole.name.toUpperCase()}_PORT=${this.fcContainerPortMap.get(taskRole.name)} `;
}
return `${portScript} . /mnt/frameworkbarrier/injector.sh && ${command}`;
}
...@@ -229,9 +231,9 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
case TrialConfigMetadataKey.FRAMEWORKCONTROLLER_CLUSTER_CONFIG:
let frameworkcontrollerClusterJsonObject = JSON.parse(value);
this.fcClusterConfig = FrameworkControllerClusterConfigFactory.generateFrameworkControllerClusterConfig(frameworkcontrollerClusterJsonObject);
if(this.fcClusterConfig.storageType === 'azureStorage') {
let azureFrameworkControllerClusterConfig = <FrameworkControllerClusterConfigAzure>this.fcClusterConfig;
this.azureStorageAccountName = azureFrameworkControllerClusterConfig.azureStorage.accountName;
this.azureStorageShare = azureFrameworkControllerClusterConfig.azureStorage.azureShare;
await this.createAzureStorage(
...@@ -240,8 +242,8 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
azureFrameworkControllerClusterConfig.azureStorage.accountName,
azureFrameworkControllerClusterConfig.azureStorage.azureShare
);
} else if(this.fcClusterConfig.storageType === 'nfs') {
let nfsFrameworkControllerClusterConfig = <FrameworkControllerClusterConfigNFS>this.fcClusterConfig;
await this.createNFSStorage(
nfsFrameworkControllerClusterConfig.nfs.server,
nfsFrameworkControllerClusterConfig.nfs.path
...@@ -292,7 +294,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
* @param podResources pod template
*/
private generateFrameworkControllerJobConfig(trialJobId: string, trialWorkingFolder: string, frameworkcontrollerJobName : string, podResources : any) : any {
if(!this.fcClusterConfig) {
throw new Error('frameworkcontroller Cluster config is not initialized');
}
...@@ -346,16 +348,16 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
private generateTaskRoleConfig(trialWorkingFolder: string, replicaImage: string, runScriptFile: string, podResources: any, containerPort: number): any {
if(!this.fcClusterConfig) {
throw new Error('frameworkcontroller Cluster config is not initialized');
}
if(!this.fcTrialConfig) {
throw new Error('frameworkcontroller trial config is not initialized');
}
let volumeSpecMap = new Map<string, object>();
if(this.fcClusterConfig.storageType === 'azureStorage'){
volumeSpecMap.set('nniVolumes', [
{
name: 'nni-vol',
...@@ -369,7 +371,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
emptyDir: {}
}])
}else {
let frameworkcontrollerClusterConfigNFS: FrameworkControllerClusterConfigNFS = <FrameworkControllerClusterConfigNFS> this.fcClusterConfig;
volumeSpecMap.set('nniVolumes', [
{
name: 'nni-vol',
...@@ -382,41 +384,49 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
emptyDir: {}
}])
}
let containers = [
{
name: 'framework',
image: replicaImage,
command: ["sh", `${path.join(trialWorkingFolder, runScriptFile)}`],
volumeMounts: [
{
name: 'nni-vol',
mountPath: this.CONTAINER_MOUNT_PATH
},{
name: 'frameworkbarrier-volume',
mountPath: '/mnt/frameworkbarrier'
}],
resources: podResources,
ports: [{
containerPort: containerPort
}]
}]
let initContainers = [
{
name: 'frameworkbarrier',
image: 'frameworkcontroller/frameworkbarrier',
volumeMounts: [
{
name: 'frameworkbarrier-volume',
mountPath: '/mnt/frameworkbarrier'
}]
}]
let spec: any = {
containers: containers,
initContainers: initContainers,
restartPolicy: 'OnFailure',
volumes: volumeSpecMap.get('nniVolumes'),
hostNetwork: false
};
if(this.fcClusterConfig.serviceAccountName) {
spec.serviceAccountName = this.fcClusterConfig.serviceAccountName;
}
let taskRole = {
pod: {
spec: spec
}
}
return taskRole;
......
...@@ -32,8 +32,8 @@ export type OperatorApiVersion = 'v1alpha2' | 'v1beta1';
export class KubeflowClusterConfig extends KubernetesClusterConfig {
    public readonly operator: KubeflowOperator;
    constructor(apiVersion: string, operator: KubeflowOperator) {
        super(apiVersion);
        this.operator = operator;
    }
}
......
...@@ -83,7 +83,8 @@ class MsgDispatcherBase(Recoverable):
        _logger.debug('handle request: command: [{}], data: [{}]'.format(command, data))

        if data:
            data = json_tricks.loads(data)

        command_handlers = {
            # Tuner commands:
...@@ -96,12 +97,16 @@ class MsgDispatcherBase(Recoverable):
            CommandType.ReportMetricData: self.handle_report_metric_data,
            CommandType.TrialEnd: self.handle_trial_end,
            CommandType.Ping: self.handle_ping,
        }
        if command not in command_handlers:
            raise AssertionError('Unsupported command: {}'.format(command))

        return command_handlers[command](data)

    def handle_ping(self, data):
        pass

    def handle_initialize(self, data):
        raise NotImplementedError('handle_initialize not implemented')
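The new Ping command appears to serve as a lightweight heartbeat from the NNI manager (see the pingDispatcher loop added to nnimanager.ts above), and the base class handles it as a no-op. As a rough, non-authoritative sketch, a dispatcher subclass could override the handler purely for debugging; the module path and the `LoggingDispatcher` class name below are assumptions:

```python
import logging

# Assumed module path for the base class shown in the diff above.
from nni.msg_dispatcher_base import MsgDispatcherBase

_logger = logging.getLogger(__name__)


class LoggingDispatcher(MsgDispatcherBase):
    '''Hypothetical dispatcher that only customizes the ping handler.'''

    def handle_ping(self, data):
        # The manager sends Ping periodically; data is expected to be empty.
        _logger.debug('received ping from NNI manager')
```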
......
...@@ -33,6 +33,7 @@ class CommandType(Enum):
    AddCustomizedTrialJob = b'AD'
    TrialEnd = b'EN'
    Terminate = b'TE'
    Ping = b'PI'

    # out
    Initialized = b'ID'
......
...@@ -6,7 +6,7 @@ import {
    Experiment, TableObj,
    Parameters, TrialNumber
} from '../static/interface';
import { getFinal } from '../static/function';
import SuccessTable from './overview/SuccessTable';
import Title1 from './overview/Title1';
import Progressed from './overview/Progress';
...@@ -62,17 +62,7 @@ class Overview extends React.Component<{}, OverviewState> {
            tuner: {},
            trainingServicePlatform: ''
        },
        tableData: [],
        option: {},
        noData: '',
        // accuracy
...@@ -224,7 +214,7 @@ class Overview extends React.Component<{}, OverviewState> {
                parameters: {}
            };
            const duration = (tableData[item].endTime - tableData[item].startTime) / 1000;
            const acc = getFinal(tableData[item].finalMetricData);
            // if hyperparameters is undefined, show error message; else, show parameters value
            if (tableData[item].hyperParameters) {
                const parameters = JSON.parse(tableData[item].hyperParameters[0]).parameters;
...@@ -256,16 +246,16 @@ class Overview extends React.Component<{}, OverviewState> {
        const { isTop10 } = this.state;
        if (isTop10 === true) {
            topTableData.sort((a: TableObj, b: TableObj) => {
                if (a.acc !== undefined && b.acc !== undefined) {
                    return JSON.parse(b.acc.default) - JSON.parse(a.acc.default);
                } else {
                    return NaN;
                }
            });
        } else {
            topTableData.sort((a: TableObj, b: TableObj) => {
                if (a.acc !== undefined && b.acc !== undefined) {
                    return JSON.parse(a.acc.default) - JSON.parse(b.acc.default);
                } else {
                    return NaN;
                }
...@@ -275,7 +265,7 @@ class Overview extends React.Component<{}, OverviewState> {
        let bestDefaultMetric = 0;
        if (topTableData[0] !== undefined) {
            if (topTableData[0].acc !== undefined) {
                bestDefaultMetric = JSON.parse(topTableData[0].acc.default);
            }
        }
        if (this._isMounted) {
...@@ -308,7 +298,7 @@ class Overview extends React.Component<{}, OverviewState> {
        const indexarr: Array<number> = [];
        Object.keys(sourcePoint).map(item => {
            const items = sourcePoint[item];
            accarr.push(items.acc.default);
            indexarr.push(items.sequenceId);
        });
        const accOption = {
......
...@@ -3,7 +3,7 @@ import axios from 'axios';
import { MANAGER_IP } from '../static/const';
import { Row, Col, Tabs, Input, Select, Button } from 'antd';
const Option = Select.Option;
import { TableObj, Parameters, DetailAccurPoint, TooltipForAccuracy } from '../static/interface';
import { getFinalResult, getFinal } from '../static/function';
import Accuracy from './overview/Accuracy';
import Duration from './trial-detail/Duration';
...@@ -16,8 +16,8 @@ import '../static/style/trialsDetail.scss';
interface TrialDetailState {
    accSource: object;
    accNodata: string;
    tableListSource: Array<TableObj>;
    searchResultSource: Array<TableObj>;
    isHasSearch: boolean;
    experimentStatus: string;
    entriesTable: number;
...@@ -136,7 +136,7 @@ class TrialsDetail extends React.Component<{}, TrialDetailState> {
            .then(res => {
                if (res.status === 200) {
                    const trialJobs = res.data;
                    const trialTable: Array<TableObj> = [];
                    Object.keys(trialJobs).map(item => {
                        // only succeeded trials have finalMetricData
                        let desc: Parameters = {
...@@ -189,7 +189,7 @@ class TrialsDetail extends React.Component<{}, TrialDetailState> {
                    Object.keys(searchResultSource).map(index => {
                        temp.push(searchResultSource[index].id);
                    });
                    const searchResultList: Array<TableObj> = [];
                    for (let i = 0; i < temp.length; i++) {
                        Object.keys(trialTable).map(key => {
                            const item = trialTable[key];
...@@ -221,7 +221,7 @@ class TrialsDetail extends React.Component<{}, TrialDetailState> {
            .then(res => {
                if (res.status === 200) {
                    const trialJobs = res.data;
                    const trialTable: Array<TableObj> = [];
                    Object.keys(trialJobs).map(item => {
                        // only succeeded trials have finalMetricData
                        let desc: Parameters = {
...@@ -312,7 +312,7 @@ class TrialsDetail extends React.Component<{}, TrialDetailState> {
            } else {
                window.clearInterval(this.interAllTableList);
                const { tableListSource } = this.state;
                const searchResultList: Array<TableObj> = [];
                Object.keys(tableListSource).map(key => {
                    const item = tableListSource[key];
                    if (item.sequenceId.toString() === targetValue || item.id.includes(targetValue)) {
......