Commits · 143c6615a18cd9dbc1d84a56cbfcbe325fb9ac58 · OpenDAS / nni

30 Jul, 2020 1 commit
- Reusable environment support GPU scheduler, add test cases and refactoring. (#2627) · 143c6615
  Chi Song authored Jul 30, 2020
  
  143c6615
24 Jul, 2020 1 commit
- Add timeout for web_channel in trial_runner (#2710) · 54fef7fa
  SparkSnail authored Jul 24, 2020
  
  54fef7fa
01 Jul, 2020 1 commit
- Support aml (#2615) · 93f96d4f
  SparkSnail authored Jul 01, 2020
  
  93f96d4f
30 Jun, 2020 1 commit

Reuse OpenPAI jobs to run multiple trials (#2521) · 0b9d6ce6

Chi Song authored Jun 30, 2020

Designed a new interface to support reusable training services. It currently applies only to OpenPAI and is disabled by default.

Replace trial_keeper.py with trial_runner.py. trial_runner holds an environment, receives commands from the NNI manager to run or stop a trial, and returns events to the NNI manager.
Add a trial dispatcher, which inherits from the original training service interface. It is used to share as much code as possible across all training services and to isolate that code from them.
Add an EnvironmentService interface to manage environments, including starting and stopping an environment and refreshing the status of environments.
Add a command channel on both the NNI manager and trial runner sides; it supports different ways of passing messages between them. The currently supported channels are file and web sockets. Supported commands from the NNI manager are start and kill a trial, and send new parameters; from the runner they are initialized (for channels that do not know which runner has connected), trial end, stdout (new type, including metrics as before), version check (new type), and GPU info (new type).
Add a storage service that wraps a storage backend, such as NFS or Azure Storage, in standard file operations.
Partially support running multiple trials in parallel on the runner side; this is not yet supported on the trial dispatcher side.
Other minor changes:

Add log_level to TS UT, so that UT can show debug level log.
Expose platform to start info.
Add RouterTrainingService to keep the original OpenPAI training service and support dynamic IoC binding.
Add more GPU info for future use, including total/free/used GPU memory and GPU type.
Make some license information consistent.
Fix async/await problems with Array.forEach, which does not actually await async callbacks.
Fix IT errors on data download, which were caused by my #2484.
Speed up some run-loop patterns by reducing sleep seconds.
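
The command set described in this commit could be modelled roughly as follows. This is only a sketch: the enum names, the string codes, and the JSON encoding are hypothetical and are not taken from the NNI source.

```python
import json
from enum import Enum


class ManagerCommand(Enum):
    """Commands sent from the NNI manager to a trial runner (hypothetical names)."""
    START_TRIAL = "start"
    KILL_TRIAL = "kill"
    SEND_PARAMETERS = "parameters"


class RunnerEvent(Enum):
    """Events reported by a trial runner to the NNI manager (hypothetical names)."""
    INITIALIZED = "initialized"
    TRIAL_END = "trial_end"
    STDOUT = "stdout"  # also carries metrics, as before
    VERSION_CHECK = "version_check"
    GPU_INFO = "gpu_info"


def encode(command, payload):
    """Serialize a command and its payload as one JSON string.

    The same text could be written to a file or sent over a web socket,
    so the channel implementation stays independent of the command set.
    """
    return json.dumps({"type": command.value, "data": payload})
```

Keeping the two directions in separate enums makes it harder to send a runner-side event from the manager by mistake.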

0b9d6ce6

19 May, 2020 1 commit
- Support Windows as remote node. (#2431) · 69cae211
  Chi Song authored May 20, 2020
  
  69cae211
18 Mar, 2020 1 commit
- Fix trialkeeper flush (#2174) · 4a07f9ed
  SparkSnail authored Mar 18, 2020
  
  4a07f9ed
17 Mar, 2020 1 commit
- Add flush in trial_keeper log (#2156) · 3c0ef842
  SparkSnail authored Mar 17, 2020
  
  3c0ef842
23 Dec, 2019 1 commit
- Support pai and paiYarn trainingservice (#1853) · 9cbbf6f8
  SparkSnail authored Dec 23, 2019
  
  9cbbf6f8
25 Nov, 2019 1 commit
- Update license header (#1757) · 587dd3af
  liuzhe-lz authored Nov 25, 2019
  
  587dd3af
04 Nov, 2019 1 commit
- Dev pylint (#1697) · eea50784
  chicm-ms authored Nov 04, 2019
```
Fix pylint errors
```
  eea50784
17 Jul, 2019 1 commit
- Fix final metrics collection when metrics data is not at beginning of line (#1293) · 04c30254
  chicm-ms authored Jul 17, 2019
```
* Fix final metrics (#1289)

* Fix final metrics with trial keeper
```
  04c30254
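
The fix in #1293 concerns metric data that does not start at the beginning of a log line. A minimal sketch of that idea, assuming a hypothetical `METRIC:` marker (the real prefix used by NNI is not shown in this log), is to scan the whole line instead of anchoring at its start:

```python
import re

# Hypothetical marker and payload format, used only for illustration.
METRIC_PATTERN = re.compile(r"METRIC:(?P<value>\{.*\})")


def find_metric(line):
    """Return the metric payload from a log line, or None if there is none."""
    # re.search scans the whole line; re.match would only succeed when
    # the marker sits at the very beginning of the line.
    found = METRIC_PATTERN.search(line)
    return found.group("value") if found else None
```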
25 Jun, 2019 1 commit
- exclude multiphase with batch and gridsearch tuner test cases (#1203) · c2179921
  chicm-ms authored Jun 25, 2019
  
  c2179921
24 Jun, 2019 2 commits
- fix trial keeper (#1199) · b83e3b3b
  chicm-ms authored Jun 24, 2019
  
  b83e3b3b
- Multiphase refactor and support OpenPAI training service. (#1138) · ac6aee81
  chicm-ms authored Jun 24, 2019
```
* Refactor multiphase interface

* Implement multiphase on PAI

* update multiphase doc
```
  ac6aee81
19 Jun, 2019 1 commit
- Remove all whitespace at end of line (#1162) · ae7a72bc
  Hongarc authored Jun 19, 2019
  
  ae7a72bc
27 Mar, 2019 1 commit
- Support showing version check error message in WebUI (#922) · 21b48d29
  SparkSnail authored Mar 27, 2019
  
  21b48d29
22 Mar, 2019 2 commits

Support remoteLoggingType (#901) · c297650a

SparkSnail authored Mar 22, 2019

If the user sets remoteLoggingType in the config file, log content will not be transmitted from trialKeeper.

c297650a

Fix version check (#906) · d10b8bca

SparkSnail authored Mar 22, 2019

Installing from source code generates version strings such as 'v0.5.2-gews11f'.
The current trialKeeper uses an exact version match, so such a version string breaks the code in the msranni/nni image, because our official image uses a clean numeric version.
Change the logic to a fuzzy match that only compares the main number of the NNI version.
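
A fuzzy match of this kind might look like the sketch below. It assumes that the "main number" means the leading major.minor pair, which the commit message does not spell out; the function names are hypothetical.

```python
import re

# Assumption: the "main number" is the leading major.minor pair.
_MAIN_VERSION = re.compile(r"^v?(\d+)\.(\d+)")


def main_version(version):
    """Extract (major, minor) from strings such as 'v0.5.2-gews11f' or '0.5.2'."""
    found = _MAIN_VERSION.match(version.strip())
    if found is None:
        raise ValueError("unrecognized version string: %r" % version)
    return int(found.group(1)), int(found.group(2))


def versions_compatible(keeper_version, service_version):
    """Compare only the main numbers, ignoring patch level and build suffix."""
    return main_version(keeper_version) == main_version(service_version)
```

With this rule, a source build such as 'v0.5.2-gews11f' is treated as compatible with a clean '0.5.2' release.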

d10b8bca

15 Mar, 2019 1 commit

Support version check of nni (#807) · d0b22fc7

SparkSnail authored Mar 15, 2019

Check the NNI version in trialKeeper to make sure the version of trialKeeper is consistent with trainingService.
Add a debug mode in the config file.

d0b22fc7

25 Feb, 2019 1 commit

Support webhdfs path in python hdfs client (#722) · 8c4c0ef2

SparkSnail authored Feb 25, 2019

trial_keeper uses port 50070 to connect to the WebHDFS server, and PAI maps port 50070 to port 5070 to reach the RESTful server. This is risky because PAI may not support this kind of mapping in later releases. The WebHDFS client in trial_keeper now uses the Pylon path (/webhdfs/api/v1) instead of port 50070; the path is passed in from trainingService.
This PR makes the following changes:

1. Use the WebHDFS path instead of port 50070 in the HDFS client.
2. Use the new HDFS package "PythonWebHDFS", which I built to support Pylon. You can try the new function with the "sparksnail/nni:dev-pai" image to test the PAI trainingService.
3. Rename some variables according to review comments.
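
The difference between addressing WebHDFS by port and by path can be sketched as below. The helper name, the `http` scheme, and the host names are assumptions for illustration; this is not the PythonWebHDFS API.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, hdfs_path, op, webhdfs_path="/webhdfs/api/v1", port=None):
    """Build a WebHDFS REST URL from a base path rather than a fixed port.

    When port is None the URL relies on the gateway path alone, which is
    the approach described for Pylon; passing a port gives the older form.
    """
    netloc = host if port is None else "%s:%d" % (host, port)
    base = "/" + webhdfs_path.strip("/")
    target = "/" + hdfs_path.lstrip("/")
    return "http://%s%s%s?%s" % (netloc, base, quote(target), urlencode({"op": op}))
```

For example, `webhdfs_url("pai-master", "/user/nni/trial.log", "OPEN")` addresses the file through the path, while `webhdfs_url("namenode", "tmp", "LISTSTATUS", webhdfs_path="/webhdfs/v1", port=50070)` uses the port-based form.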

8c4c0ef2

24 Jan, 2019 1 commit

Configurable nniManager log path and log level (#644) · d9c83c0c

chicm-ms authored Jan 24, 2019

* Pull code (#22)

* Support distributed job for frameworkcontroller (#612)

support distributed job for frameworkcontroller

* Multiphase doc (#519)

* multiPhase doc

* updates

* updates

* Add time parser for 'nnictl update duration' (#632)

Currently nnictl update duration only supports seconds; add a parser for this command to support the units {s, m, h, d}.

* fix experiment state bug (#629)

* update top README.md (#622)

* Update README.md

* update (#634)

* Integration tests refactoring (#625)

* Integration test refactoring (#21) (#616)

* Integration test refactoring (#21)

* Refactoring integration tests

* test metrics

* update azure pipeline

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* update trigger

* Integration test refactoring (#618)

* updates

* updates

* update pipeline (#619)

* update pipeline

* updates

* updates

* updates

* updates

* updates

* test pipeline (#623)

* test pipeline

* updates

* updates

* updates

* Update integration test (#624)

* Update integration test

* updates

* updates

* updates

* updates

* updates

* updates

* Revert "Pull code (#22)"

This reverts commit 62fc165ad7b2ba724eead3b99f010aa34491e2c7.

* Configurable nniManager log path

* Configure log level

* add --debug command line for nnictl

* updates
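
The time parser for 'nnictl update duration' (#632) mentioned in this commit message accepts the units {s, m, h, d}. A minimal sketch of such a parser follows; the function name is hypothetical, and rejecting a bare number without a unit is an assumption rather than documented nnictl behaviour.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text):
    """Convert strings such as '90s', '5m', '2h' or '1d' to seconds."""
    found = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if found is None:
        raise ValueError("expected <number><s|m|h|d>, got %r" % text)
    return int(found.group(1)) * _UNIT_SECONDS[found.group(2)]
```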

d9c83c0c

08 Jan, 2019 1 commit
- Fix a race condition issue in trial_keeper for reading log from pipe (#578) · 95d19478
  fishyds authored Jan 08, 2019
```
* Fix a race condition issue in trial_keeper for reading log from pipe
```
  95d19478
02 Jan, 2019 1 commit

[Logging architecture refactor] Remove unused metrics related code in nni... · 37354dff

fishyds authored Jan 02, 2019

[Logging architecture refactor] Remove unused metrics related code in nni trial_tools, support kubeflow mode for logging architecture refactor (#551)

* Remove unused metrics related code in nni trial_tools, support kubeflow mode for logging architecture refactor

37354dff

29 Dec, 2018 1 commit

NNI logging architecture improvement (#539) · cb83ac0f

fishyds authored Dec 29, 2018

* Removed unused log code, refactor to rename some class name in nni sdk and trial_tools

* Fix the regression bug that local/remote mode doesn't work

cb83ac0f

20 Dec, 2018 1 commit

[V0.4.1 Release] Merge v0.4.1 branch back to Master (#509) · ff834cea

fishyds authored Dec 20, 2018

* Update nnictl.py

Fix the issue that nnictl --version via pip installation doesn't work

* Update kubeflow training service document (#494)

* Remove kubectl related document, add messages for kubeconfig
* Add design section for kubeflow training service
* Move the image files for PAI training service doc into img folder.

* Update KubeflowMode.md (#498)

Update KubeflowMode.md, small terms change

* [V0.4.1 bug fix] Cannot run kubeflow training service due to trial_keeper change (#503)

* Update kubeflow training service document

* fix a bug that kubeflow trial job cannot run

* upgrade version number (#499)

* [V0.4.1 bug fix] Support read K8S config from KUBECONFIG environment variable (#507)

* Add KUBECONFIG env variable support

* In main.ts, throw cached error to make sure nnictl can show the error in stderr

ff834cea

17 Dec, 2018 1 commit

[PAITrainingService] Improve uploading codeDir efficiency (#479) · 9397b6f6

fishyds authored Dec 17, 2018

* [PAI training service] codeDir files upload improvement

* Create full local temp folder

* Organize the folder structure for experiment and trial files

9397b6f6

29 Nov, 2018 1 commit
- Trial keeper refactor (#411) · 2b126039
  fishyds authored Nov 29, 2018
```
* [Trial keeper refactor] refactor trial keeper stdout output
```
  2b126039
20 Nov, 2018 1 commit

[Kubeflow Training Service] V1, merge from kubeflow branch to master branch (#382) · 806afeb6

fishyds authored Nov 20, 2018

* Kubeflow TrainingService support, v1 (#373)

1. Create a new training service, the Kubeflow training service, which uses 'kubectl' and the Kubeflow tfjobs CRD to submit and manage jobs
2. Update the NNI Python SDK to support the new Kubeflow platform
3. Update the NNI Python SDK's get_sequence_id() implementation to read the NNI_TRIAL_SEQ_ID env variable instead of the .nni/sequence_id file
4. This version only supports the TensorFlow operator. Support for more operators will be added in future versions
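
The SDK change in item 3 amounts to reading the sequence ID from the environment. The sketch below illustrates the idea and is not the SDK's actual implementation; the `environ` parameter, the integer conversion, and raising `KeyError` when the variable is unset are assumptions.

```python
import os


def get_sequence_id(environ=None):
    """Return the trial sequence ID stored in NNI_TRIAL_SEQ_ID.

    Passing a mapping as `environ` makes the function easy to test;
    by default it reads the process environment.
    """
    environ = os.environ if environ is None else environ
    return int(environ["NNI_TRIAL_SEQ_ID"])
```

Reading an environment variable avoids a dependency on a file under .nni being present in the trial's working directory.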

806afeb6

12 Nov, 2018 1 commit

[PAI training service] Support running multiple PAI experiment (#348) · b1d4c129

fishyds authored Nov 12, 2018

* Change base image from devel to runtime, to reduce docker image size

* Support running multiple experiment for PAI

* Fix a bug caused by a recursive reference between paiRestServer and paiTrainingService

b1d4c129

05 Nov, 2018 1 commit
- Unify the names of Python modules · e3872ba1
  Gems Guo authored Nov 05, 2018
  
  e3872ba1