Commits · 21a2bb0b0b7c81da725dfa04b940c4abdc10f4b1 · OpenDAS / nni

28 Nov, 2018 1 commit

SparkSnail authored Nov 28, 2018

Support AKS for kubeflow training service
Support nnictl set nniManagerIp

21a2bb0b

25 Nov, 2018 1 commit

Fix trialjobstate (#385) · c4d1aefe

QuanluZhang authored Nov 26, 2018

* add one more trial job status, EARLY_STOPPED

* fix datastore/nnimanager/mockeddatastore. test/webui/metrics_reader not done. USER_TO_CANCEL

* fix bug

* modifications based on Deshui's comments

* fix bug

* fix bug in remote mode

c4d1aefe

23 Nov, 2018 3 commits

Add nniManagerIp in nnictl and trainingService (#393) · c2a4ce6c

SparkSnail authored Nov 23, 2018

Add nniManagerIp in nnictl, pai TrainingService and kubeflow TrainingService.
If users set nniManagerIp, pai and kubeflow will use this IP instead of calling the getIPV4() function.
The Web UI will also use this nniManagerIp.

c2a4ce6c
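The fallback described in this commit message (use the user-supplied nniManagerIp when it is set, otherwise auto-detect an address) can be sketched as follows. This is a minimal illustration, not the NNI code: `resolve_nni_manager_ip` and `detect_local_ipv4` are hypothetical names, and the UDP-socket trick is only an assumed stand-in for whatever `getIPV4()` actually does.

```python
import socket
from typing import Optional


def detect_local_ipv4() -> str:
    """Best-effort local IPv4 detection (assumed stand-in for getIPV4())."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only picks a route,
        # which lets us read the local address chosen for that route.
        sock.connect(("10.255.255.255", 1))
        return sock.getsockname()[0]
    except OSError:
        # No usable route: fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


def resolve_nni_manager_ip(configured_ip: Optional[str] = None) -> str:
    """Prefer the user-supplied nniManagerIp; otherwise auto-detect."""
    if configured_ip and configured_ip.strip():
        return configured_ip.strip()
    return detect_local_ipv4()
```

An explicit setting is useful when the machine has several network interfaces and the auto-detected address is not the one reachable from the cluster.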

Move the call of experimentDoneCleanUp into stopExperiment() method (#390) · cb7c7ff0

fishyds authored Nov 23, 2018

* Adjust sleep position for sdk_test.py

* Exit dispatcher process if it receives the Terminate command

* Add comment for sleep change in sdk_test.py

cb7c7ff0

[Kubeflow Training Service] Explicitly set cuda_visible_devices env var (#388) · 28e26ae9

fishyds authored Nov 23, 2018

* Use different output folder for ps and worker

* Add cuda_visible_devices env var if gpuNum is 0

28e26ae9
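The idea behind setting the variable when gpuNum is 0 can be sketched like this. It is only an illustration under assumptions: `build_trial_env` is a hypothetical helper, and the empty string is an assumed value chosen here to hide all GPUs from CUDA applications; the commit message does not state which value NNI writes.

```python
from typing import Dict, Optional


def build_trial_env(gpu_num: int,
                    base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return environment variables for a trial container (hypothetical helper)."""
    env = dict(base_env or {})  # copy so the caller's dict is not mutated
    if gpu_num == 0:
        # Assumption: an empty CUDA_VISIBLE_DEVICES hides every GPU, so a
        # CPU-only trial cannot accidentally grab a device on a shared node.
        env["CUDA_VISIBLE_DEVICES"] = ""
    return env
```

When `gpu_num` is positive the sketch leaves the variable untouched, on the assumption that device assignment is handled elsewhere.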

22 Nov, 2018 1 commit

[Kubeflow training service] Update kubeflow exp job config schema to support... · e341df81

fishyds authored Nov 22, 2018

[Kubeflow training service] Update kubeflow exp job config schema to support distributed training (#387)

* Support distributed training on tf-operator, for worker and ps

* Update validation rule for kubeflow config

* small code refactor adjustment for private methods

* Use different output folder for ps and worker

e341df81

20 Nov, 2018 1 commit

[Kubeflow Training Service] V1, merge from kubeflow branch to master branch (#382) · 806afeb6

fishyds authored Nov 20, 2018

* Kubeflow TrainingService support, v1 (#373)

1. Create new Training Service: kubeflow training service, which uses 'kubectl' and the kubeflow tfjobs CRD to submit and manage jobs
2. Update nni python SDK to support new kubeflow platform
3. Update nni python SDK's get_sequence_id() implementation to read the NNI_TRIAL_SEQ_ID env variable instead of reading the .nni/sequence_id file
4. This version only supports the TensorFlow operator. Support for more operators will be added in future versions

806afeb6
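Item 3 of this commit message (reading the sequence id from the NNI_TRIAL_SEQ_ID environment variable rather than from a file) can be sketched as below. This is a minimal sketch, not the SDK's actual code: `read_trial_sequence_id` is a hypothetical name, and treating the value as an integer and raising on a missing variable are assumptions.

```python
import os


def read_trial_sequence_id() -> int:
    """Read the trial sequence id from the environment (hypothetical helper)."""
    raw = os.environ.get("NNI_TRIAL_SEQ_ID")
    if raw is None:
        # Assumption: fail loudly instead of silently defaulting.
        raise RuntimeError("NNI_TRIAL_SEQ_ID is not set")
    return int(raw)
```

An environment variable can be injected directly into a container spec, which avoids depending on a file being present in the trial's working directory.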