Commits · ff834ceab792f05b558da919954eca8033d75449 · OpenDAS / nni

20 Dec, 2018 1 commit

[V0.4.1 Release] Merge v0.4.1 branch back to Master (#509) · ff834cea

fishyds authored Dec 20, 2018

* Update nnictl.py

Fix the issue that nnictl --version via pip installation doesn't work

* Update kubeflow training service document (#494)

* Remove kubectl related document, add messages for kubeconfig
* Add design section for kubeflow training service
* Move the image files for PAI training service doc into img folder.

* Update KubeflowMode.md (#498)

Update KubeflowMode.md with small terminology changes

* [V0.4.1 bug fix] Cannot run kubeflow training service due to trial_keeper change (#503)

* Update kubeflow training service document

* Fix a bug that prevented kubeflow trial jobs from running

* upgrade version number (#499)

* [V0.4.1 bug fix] Support read K8S config from KUBECONFIG environment variable (#507)

* Add KUBECONFIG env variable support

* In main.ts, throw cached error to make sure nnictl can show the error in stderr

ff834cea
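A minimal sketch of the lookup order that #507 describes: prefer the KUBECONFIG environment variable and otherwise fall back to kubectl's conventional default file. The function name `resolve_kubeconfig_path` is hypothetical and is not taken from the NNI code base; taking only the first entry of a multi-path KUBECONFIG is a simplifying assumption.

```python
import os
from pathlib import Path


def resolve_kubeconfig_path(environ=None):
    """Return the kubeconfig path, preferring the KUBECONFIG environment variable."""
    environ = os.environ if environ is None else environ
    value = environ.get("KUBECONFIG", "")
    # KUBECONFIG may hold several paths joined by os.pathsep.
    # Assumption: this sketch simply takes the first non-empty entry.
    for candidate in value.split(os.pathsep):
        if candidate:
            return Path(candidate).expanduser()
    # Fall back to the conventional default location used by kubectl.
    return Path.home() / ".kube" / "config"
```

Passing a plain dict as `environ` keeps the function easy to exercise without touching the real process environment.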

14 Dec, 2018 1 commit

quick fix create secret in kubeflowTrainingService (#474) · 482cf1d0

SparkSnail authored Dec 14, 2018

The Kubernetes REST API does not base64-encode secret values itself, so the username is now base64-encoded before the secret is created.

482cf1d0
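To illustrate the fix in #474, here is a rough sketch of preparing a Secret body for the Kubernetes REST API, where every value under `data` must already be base64-encoded. The helper names and the secret name `demo-secret` are hypothetical; the actual NNI code is TypeScript and may be structured differently.

```python
import base64


def encode_secret_data(values):
    """Base64-encode each string value, as the Secret `data` field requires."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }


def build_secret_manifest(name, values):
    """Build a Secret manifest as a plain dict, ready to serialize as JSON."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": encode_secret_data(values),
    }


# Hypothetical usage: the secret name and username are illustrative only.
manifest = build_secret_manifest("demo-secret", {"username": "admin"})
```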

13 Dec, 2018 1 commit

[Kubeflow training service] Use Kubernetes API server to replace kubectl dependency (#472) · d8e55165

fishyds authored Dec 13, 2018

[Kubeflow training service] Use Kubernetes API server to replace kubectl dependency

d8e55165

07 Dec, 2018 1 commit

Support Kubeflow pytorch-operator (#406) · c265903e

SparkSnail authored Dec 07, 2018

1. Support pytorch-operator
2. Remove unsupported operator

c265903e

05 Dec, 2018 1 commit

[V0.4 Release] Kubeflow training service: Remove unused kubernetesServer config entry (#444) · 311d3da6

fishyds authored Dec 04, 2018

* Remove unused kubernetesServer config entry in config file and schema validation

311d3da6

30 Nov, 2018 1 commit

[Kubeflow training service] fix bug that wrongly split kube delete cmd into 2 lines (#425) · 5426cfe8

fishyds authored Nov 30, 2018

* [Kubeflow training service] fix bug that wrongly split kube delete cmd into 2 lines

* Adjust white space

5426cfe8

29 Nov, 2018 1 commit

Add codeDir file count validation for setClusterConfig (#409) · cf3d434f

fishyds authored Nov 29, 2018

* Add codeDir file count validation for setClusterConfig

* fix a small bug if find command is not installed

* Remove codeDir validation for local training service

* Remove useless import

cf3d434f
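The codeDir check in #409 could look roughly like the sketch below. It counts files with `os.walk`, so it does not depend on an external `find` command being installed. The limit of 1000 and both function names are assumptions made for illustration; the source does not state the real threshold.

```python
import os

# Hypothetical limit; the commit does not say which threshold NNI uses.
MAX_FILE_COUNT = 1000


def count_files(code_dir):
    """Count regular files under code_dir, including subdirectories."""
    total = 0
    for _root, _dirs, files in os.walk(code_dir):
        total += len(files)
    return total


def validate_code_dir(code_dir, max_files=MAX_FILE_COUNT):
    """Raise ValueError if code_dir is missing or holds too many files."""
    if not os.path.isdir(code_dir):
        raise ValueError(f"codeDir does not exist: {code_dir}")
    count = count_files(code_dir)
    if count > max_files:
        raise ValueError(
            f"codeDir contains {count} files, more than the allowed {max_files}"
        )
    return count
```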

28 Nov, 2018 1 commit

Support Azure k8s (#383) · 21a2bb0b

SparkSnail authored Nov 28, 2018

Support AKS (Azure Kubernetes Service) for the Kubeflow training service
Support setting nniManagerIp in nnictl

21a2bb0b

25 Nov, 2018 1 commit

Fix trialjobstate (#385) · c4d1aefe

QuanluZhang authored Nov 26, 2018

* add one more trial job status, EARLY_STOPPED

* fix datastore/nnimanager/mockeddatastore. test/webui/metrics_reader not done. USER_TO_CANCEL

* fix bug

* modifications based on Deshui's comments

* fix bug

* fix bug in remote mode

c4d1aefe

23 Nov, 2018 3 commits

Add nniManagerIp in nnictl and trainingService (#393) · c2a4ce6c

SparkSnail authored Nov 23, 2018

Add nniManagerIp to nnictl, the PAI training service, and the Kubeflow training service.
If users set nniManagerIp, PAI and Kubeflow use this IP instead of the result of the getIPV4() function.
The Web UI also uses this nniManagerIp.

c2a4ce6c
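The override behavior described in #393 can be sketched as follows: a configured nniManagerIp wins, and automatic detection is used only when it is absent. Both function names are hypothetical, and `detect_local_ipv4` merely stands in for the getIPV4() function named in the commit, whose real implementation is not shown in the source.

```python
import socket


def detect_local_ipv4():
    """Best-effort local IPv4 detection (a stand-in for getIPV4())."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only selects a route.
        sock.connect(("10.255.255.255", 1))
        return sock.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        sock.close()


def resolve_manager_ip(config, detect=detect_local_ipv4):
    """Prefer the user-supplied nniManagerIp; otherwise auto-detect."""
    configured = config.get("nniManagerIp")
    return configured if configured else detect()
```

Injecting `detect` as a parameter makes the fallback path easy to test without relying on the host's network setup.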

Move the call of experimentDoneCleanUp into stopExperiment() method (#390) · cb7c7ff0

fishyds authored Nov 23, 2018

* Adjust sleep position for sdk_test.py

* Exit dispatcher process on receiving the Terminate command

* Add comment for sleep change in sdk_test.py

cb7c7ff0

[Kubeflow Training Service] Explicitly set cuda_visible_devices env var (#388) · 28e26ae9

fishyds authored Nov 23, 2018

* Use different output folder for ps and worker

* Add cuda_visible_devices env var if gpuNum is 0

28e26ae9
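A small sketch of the idea behind #388: when a trial requests no GPUs, CUDA_VISIBLE_DEVICES is set explicitly so the job does not see any device on the node. The function name is hypothetical, and the empty-string value is an assumption, since the commit message does not say which value NNI writes.

```python
def build_trial_env(gpu_num, base_env=None):
    """Return environment variables for a trial container."""
    env = dict(base_env or {})
    if gpu_num == 0:
        # Assumption: an empty value is used to hide every GPU from CUDA.
        env["CUDA_VISIBLE_DEVICES"] = ""
    return env
```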

22 Nov, 2018 1 commit

[Kubeflow training service] Update kubeflow exp job config schema to support... · e341df81

fishyds authored Nov 22, 2018

[Kubeflow training service] Update kubeflow exp job config schema to support distributed training (#387)

* Support distributed training on tf-operator, for worker and ps

* Update validation rule for kubeflow config

* small code refactor adjustment for private methods

* Use different output folder for ps and worker

e341df81

20 Nov, 2018 1 commit

[Kubeflow Training Service] V1, merge from kubeflow branch to master branch (#382) · 806afeb6

fishyds authored Nov 20, 2018

* Kubeflow TrainingService support, v1 (#373)

1. Create new training service: Kubeflow training service, which uses 'kubectl' and the Kubeflow tfjobs CRD to submit and manage jobs
2. Update the NNI Python SDK to support the new Kubeflow platform
3. Update the NNI Python SDK's get_sequence_id() implementation to read the NNI_TRIAL_SEQ_ID env variable instead of the .nni/sequence_id file
4. This version supports only the TensorFlow operator; support for more operators will be added in future versions

806afeb6
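Item 3 of #373 moves the sequence ID lookup from a file to an environment variable. A minimal sketch of that approach follows; the integer return type and the error handling are assumptions, and the real SDK function may behave differently.

```python
import os


def get_sequence_id(environ=None):
    """Read the trial sequence ID from the NNI_TRIAL_SEQ_ID env variable."""
    environ = os.environ if environ is None else environ
    raw = environ.get("NNI_TRIAL_SEQ_ID")
    if raw is None:
        # Assumption: a missing variable is treated as an error in this sketch.
        raise RuntimeError("NNI_TRIAL_SEQ_ID is not set")
    return int(raw)
```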