Commits · 19173aa4370e36cba96ee7049eaaa0dceda5007c · OpenDAS / nni

14 Aug, 2019 1 commit
- merge v1.0(bug bash) back to master (#1462) · 19173aa4
  Guoxin authored Aug 14, 2019
```
* squash commits in v1.0 first round bug bash
```
  19173aa4
02 Aug, 2019 1 commit
- Set gpuNum as optional (#1389) · 204b1eba
  SparkSnail authored Aug 02, 2019
  
  204b1eba
20 Jun, 2019 1 commit
- Pass tslint for training service (#1177) · 22993e5d
  demianzhang authored Jun 20, 2019
```
* fix local and remote training services tslint
```
  22993e5d
19 Jun, 2019 1 commit
- Remove all whitespace at end of line (#1162) · ae7a72bc
  Hongarc authored Jun 19, 2019
  
  ae7a72bc
03 Jun, 2019 1 commit
- Fix remote gpu scheduler bug (#1143) · 139e0a90
  SparkSnail authored Jun 03, 2019
  
  139e0a90
28 May, 2019 1 commit
- Support multi trial jobs on same GPU (#1109) · 252d35e0
  SparkSnail authored May 28, 2019
  
  252d35e0
27 May, 2019 1 commit

NNI on Windows for NNI Remote mode (#1073) · a1f92666

demianzhang authored May 27, 2019

* test python

* test python36

* debug python

* debug python

* debug

* python version

* test python

* debug

* install nni

* install nni

* test powershell

* debug python

* test

* test python

* use python

* test python

* test python

* test

* update

* test powershell

* debug python

* debug python

* debug python

* debug powershell

* debug

* debug

* debug install.ps1

* add continueOnError: true

* debug

* debug

* update

* update

* add unittest

* test node

* update

* update joi

* debug joi

* add joi

* debug joi

* Update install

* update

* update

* add unittest

* add convert command

* add example

* fix windows commands

* debug

* fix tensorflow version

* fix pipeline

* update

* add gpu logic in windows

* update

* update

* debug

* fix commands

* fix commands

* update

* update

* Fix comments

* update

* fix kill command

* fix package.json

* Update package.json

* Refactor runScript

* Fix bug

* Fix comments

* Fix execKill

* Update

* Update

* Add unittest back

* Rollback install node

* Fix gpu memory

* Update

* Rollback check process

* Update mnist-hyperband.test.yml

* Update pipelines-it-local-windows.yml

* Update uninstall.ps1

* Fix virtual environment

* Fix tar

* Fix isAlive

* change gpu index logic

* test gpu index

* fix pipeline

* add cifar10

* fix cifar10

* remove gpu in cifar10

* test mnist gpu

* update

* debug

* Fix comments

* debug

* Update install.ps1

* debug

* update gpu metrics shell

* debug

* debug

* debug

* debug

* debug

* debug sigbreak

* Preinstall node-pre-gyp

* Update Installation.md

* Update Installation.md

* Remove install node-pre-gyp

* use taskkill to stop node process

* use ctl+c event to stop process

* add sigterm signal in stop logic

* add ctl+break command

* Update isAlive

* debug sigterm

* Update pypi readme

* Update

* fix stop logic

* fix pipeline, add cifar10

* revert mnist, remove gpu

* Fix virtualenv

* Fix comments

* Update

* Update

* Fix install

* Update install.ps1

* Update install.ps1

* Fix comments

* Fix virtualenv install

* Update

* Update

* Fix comments

* Update

* Update install.ps1

* Update

* Update localTrainingService.ts

* Update

* Update

* Update

* Update

* Update

* Update util.ts

* Update utils.ts

* Fix system slash

* Update tmp dir

* Fix system slash

* Use python3 in remote

* Write tar command to file

* Update tar

* Update

* Update

* Fix stop

* Update StopSignal type

* Add removeTrialJobMetricListener

* remove Listeners

* Update listener

* Update

* Use Temp dir

* Use Temp dir

* Add remote windows pipeline

* Update pipelines-it-remote-windows.yml

* Update

* remote build wheel

* Update pipelines-it-remote-windows.yml

* debug

* debug

* Use docker source install

* Update

* Update

* Rollback remote build wheel

* Use self node and yarn

* Fix docker source install

* Rollback Makefile

* Upgrade docker pip

* Update

* Update

* Remote build wheel

* Use inline runOptions

* Hide wget output

* Add continueOnError

* Update

* Update

* Update

* Upgrade pip

* Add chmod

* Update

* debug

* Update

* Use pscp

* Update

* Download putty

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* debug

* exclude metis

* Refactor pathJoin

* Update

* debug metis

* debug metis

* Update

* Update dependency

* Fix comments

* Update

* Fix tslint

* Fix comments

* Fix comments

* add doc

* Fix comments

* Update

* Update doc

a1f92666

22 Apr, 2019 1 commit
- NNI on Windows for NNI Local mode (#937) · cfda0dae
  demianzhang authored Apr 22, 2019
  
  cfda0dae
01 Apr, 2019 1 commit
- Refactor local gpu scheduler (#943) · f05e685f
  SparkSnail authored Apr 01, 2019
  
  f05e685f
27 Mar, 2019 1 commit
- Support showing version check error message in WebUI (#922) · 21b48d29
  SparkSnail authored Mar 27, 2019
  
  21b48d29
22 Mar, 2019 1 commit

Support remoteLoggingType (#901) · c297650a

SparkSnail authored Mar 22, 2019

If the user sets remoteLoggingType in the config file, log content will not be transmitted from trialkeeper.

c297650a

15 Mar, 2019 1 commit

Support version check of nni (#807) · d0b22fc7

SparkSnail authored Mar 15, 2019

Check the nni version in trialkeeper to make sure the version of trialkeeper is consistent with trainingService.
Add a debug mode to the config file.

d0b22fc7

14 Mar, 2019 1 commit

Fix ssh connection error (#829) · de9e2842

SparkSnail authored Mar 14, 2019

An SSH client allows only a limited number of open channels per connection. If trialConcurrency is set too high, the SSH client executes commands over a single connection so frequently that it fails with Error: (SSH) Channel open failure: open failed.
Refactor the code so that each connection has a maximum trial concurrency; when the number of trials reaches that limit, a new SSH connection is created to execute trial commands.

de9e2842
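
The connection-splitting idea described in #829 can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `ConnectionPool`, `PooledConnection`, and the cap `MAX_TRIALS_PER_CONNECTION` are hypothetical names, and a plain counter stands in for a real SSH connection.

```typescript
// Hypothetical cap on trials per SSH connection; the real limit depends on the SSH server.
const MAX_TRIALS_PER_CONNECTION = 5;

// Stand-in for a real SSH connection; it only tracks how many trials are using it.
interface PooledConnection {
    id: number;
    activeTrials: number;
}

class ConnectionPool {
    private readonly connections: PooledConnection[] = [];

    // Return a connection with spare capacity, opening a new one if all are full.
    public acquire(): PooledConnection {
        for (const conn of this.connections) {
            if (conn.activeTrials < MAX_TRIALS_PER_CONNECTION) {
                conn.activeTrials += 1;
                return conn;
            }
        }
        // Every existing connection is at its limit, so "open" another one.
        const fresh: PooledConnection = { id: this.connections.length, activeTrials: 1 };
        this.connections.push(fresh);
        return fresh;
    }

    // Free one slot when a trial finishes.
    public release(conn: PooledConnection): void {
        if (conn.activeTrials > 0) {
            conn.activeTrials -= 1;
        }
    }

    // Number of connections opened so far.
    public get size(): number {
        return this.connections.length;
    }
}
```

With a cap of 5, the first five calls to `acquire()` share one connection and the sixth opens a second one, so no single connection is asked to open more channels than its limit.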

25 Feb, 2019 1 commit
- Fix a race condition bug that does not store Trial Job cancel status correctly (#707) · 9a3a75c8
  fishyds authored Feb 25, 2019
```
* Fix a race condition bug that does not store Trial Job cancel status correctly
```
  9a3a75c8
29 Jan, 2019 1 commit

Migrate remote log (#655) · 9d3d926b

SparkSnail authored Jan 29, 2019

* fix remote bug

* add document

* add document

* update

* update

* update

* update

* fix remote issue

* fix forEach

* update doc according to comments

* update

* update

* update

* remove 'any more'

* add base version for remote-log

* change launcher.py

* test

* basic version

* debug

* debug

* basic work version

* fix code

* update disable_log

* remove unused line

* add disable log in kubernetesTrainingService

* add detect frameworkcontroller

* fix comment

* update

* update

* fix kubernetesData

* debug

* debug

* debug

* fix comment

* fix conflict

* remove local temp files

* revert launcher.py

* update code by comments

* remove disableLog

* remove disable Log

* set timeout for cleanup

* fix code by comments

* update variable names

* add comments

* add delay function

* update

* update

* update by comments

* add  in remote script path

* rename variables

* update variable name

* add mkdir -p for subfolder

9d3d926b

25 Jan, 2019 1 commit

Refactoring nnimanager log (#652) · 6d591989

chicm-ms authored Jan 25, 2019

* Pull code (#22)

* Support distributed job for frameworkcontroller (#612)

support distributed job for frameworkcontroller

* Multiphase doc (#519)

* multiPhase doc

* updates

* updates

* Add time parser for 'nnictl update duration' (#632)

Currently 'nnictl update duration' supports only seconds as the unit; add a parser so this command supports {s, m, h, d}.

* fix experiment state bug (#629)

* update top README.md (#622)

* Update README.md

* update (#634)

* Integration tests refactoring (#625)

* Integration test refactoring (#21) (#616)

* Integration test refactoring (#21)

* Refactoring integration tests

* test metrics

* update azure pipeline

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* update trigger

* Integration test refactoring (#618)

* updates

* updates

* update pipeline (#619)

* update pipeline

* updates

* updates

* updates

* updates

* updates

* test pipeline (#623)

* test pipeline

* updates

* updates

* updates

* Update integration test (#624)

* Update integration test

* updates

* updates

* updates

* updates

* updates

* updates

* Revert "Pull code (#22)"

This reverts commit 62fc165ad7b2ba724eead3b99f010aa34491e2c7.

* Update nnimanager logs

* updates

* Update README.md

* Revert "Update README.md"

This reverts commit bc67061160e5d57305a6e7fb63d491d12d0e9002.

* updates

* updates

6d591989
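
One of the changes squashed into this commit (#632) adds a duration parser that accepts the units {s, m, h, d}. A minimal sketch of such a parser is shown below. It is an illustration only: nnictl itself is a Python tool, and the function name `parseDuration`, the lookup table, and the rule that a bare number means seconds are all assumptions made here.

```typescript
// Seconds per supported unit suffix.
const SECONDS_PER_UNIT: { [unit: string]: number } = { s: 1, m: 60, h: 3600, d: 86400 };

// Convert strings such as "30s", "2m", "1h" or "1d" to a number of seconds.
// Assumption: a bare number with no suffix is treated as seconds.
function parseDuration(text: string): number {
    const match = /^(\d+)([smhd])?$/.exec(text.trim());
    if (match === null) {
        throw new Error("Invalid duration: " + text);
    }
    const value = parseInt(match[1], 10);
    const unit = match[2] === undefined ? "s" : match[2];
    return value * SECONDS_PER_UNIT[unit];
}
```

For example, `parseDuration("2m")` yields 120 and `parseDuration("1d")` yields 86400, while an unsupported suffix such as `"1w"` throws an error.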

04 Jan, 2019 1 commit

Fix remote TrainingService bug, change forEach to "for of" (#564) · e3332641

SparkSnail authored Jan 04, 2019

A trial job could not be stopped on the remote machine when the experiment was stopped, because async/await does not work as expected inside forEach; see https://codeburst.io/javascript-async-await-with-foreach-b6ba62bbf404.

e3332641
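
The forEach pitfall behind #564 can be reproduced in a few lines. The sketch below is not the training service code: `stopTrialJob` is a hypothetical stand-in that simply waits briefly and records the job id, and the two wrapper functions exist only to contrast the patterns.

```typescript
// Hypothetical stand-in for stopping one remote trial job; resolves after a short delay.
async function stopTrialJob(jobId: string, stopped: string[]): Promise<void> {
    await new Promise<void>((resolve) => setTimeout(resolve, 10));
    stopped.push(jobId);
}

// Buggy pattern: forEach ignores the promises returned by the async callback,
// so this function returns before any job has actually been stopped.
async function stopAllWithForEach(jobIds: string[]): Promise<string[]> {
    const stopped: string[] = [];
    jobIds.forEach(async (id) => {
        await stopTrialJob(id, stopped);
    });
    return stopped.slice(); // snapshot of what has been stopped at return time
}

// Fixed pattern: for...of awaits each job in turn before moving on.
async function stopAllWithForOf(jobIds: string[]): Promise<string[]> {
    const stopped: string[] = [];
    for (const id of jobIds) {
        await stopTrialJob(id, stopped);
    }
    return stopped.slice();
}
```

Awaiting `stopAllWithForEach(["a", "b"])` gives an empty array because the function returns while both stops are still pending, whereas `stopAllWithForOf(["a", "b"])` resolves only after both jobs are recorded as stopped.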

03 Jan, 2019 1 commit
- fix remote issue · 91612098
  Shinai Yang (FA TALENT) authored Jan 03, 2019
  
  91612098
29 Nov, 2018 1 commit

Add codeDir file count validation for setClusterConfig (#409) · cf3d434f

fishyds authored Nov 29, 2018

* Add codeDir file count validation for setClusterConfig

* fix a small bug if find command is not installed

* Remove codeDir validation for local training service

* Remove useless import

cf3d434f

27 Nov, 2018 1 commit

mac support with local, remote & pai mode (#386) · 101b02ff

Yan Ni authored Nov 27, 2018

* update Makefile for mac support, wait for aka.ms support

* refix Makefile for colorful echo

* update Makefile with shorturl

* fix false fail on mac webui

* fix cross os remote tmpdir issue

* add readonly to RemoteMachineTrainingService.remoteOS

* fix var name for PR 386

101b02ff

25 Nov, 2018 1 commit

Fix trialjobstate (#385) · c4d1aefe

QuanluZhang authored Nov 26, 2018

* add one more trial job status, EARLY_STOPPED

* fix datastore/nnimanager/mockeddatastore. test/webui/metrics_reader not done. USER_TO_CANCEL

* fix bug

* modifications based on Deshui's comments

* fix bug

* fix bug in remote mode

c4d1aefe

20 Nov, 2018 1 commit

[Kubeflow Training Service] V1, merge from kubeflow branch to master branch (#382) · 806afeb6

fishyds authored Nov 20, 2018

* Kubeflow TrainingService support, v1 (#373)

1. Create a new Training Service: kubeflow training service, which uses 'kubectl' and the kubeflow tfjobs CRD to submit and manage jobs
2. Update nni python SDK to support the new kubeflow platform
3. Update nni python SDK's get_sequence_id() implementation to read the NNI_TRIAL_SEQ_ID env variable instead of reading the .nni/sequence_id file
4. This version only supports the TensorFlow operator. Support for more operators will be added in future versions

806afeb6

02 Nov, 2018 1 commit
- Fix sequence id issue on resuming experiment (#316) · f56f688b
  chicm-ms authored Nov 02, 2018
  
  f56f688b
16 Oct, 2018 1 commit

Add idempotent support for get_parameters() in nni sdk (#216) · 9bb479bb

fishyds authored Oct 16, 2018

* Updated based on comments

* Fix bug, make get_parameters() idempotent

* Add idempotent support for get_parameters() in LocalTrainingService

9bb479bb

12 Oct, 2018 2 commits

Add api nni.get_sequence_id() (#203) · 1388d763

chicm-ms authored Oct 12, 2018

* Pull latest code (#2)

* webui logpath and document (#135)

* Add webui document and logpath as a href

* fix tslint

* fix comments by Chengmin

* Pai training service bug fix and enhancement (#136)

* Add NNI installation scripts

* Update pai script, update NNI_out_dir

* Update NNI dir in nni sdk local.py

* Create .nni folder in nni sdk local.py

* Add check before creating .nni folder

* Fix typo for PAI_INSTALL_NNI_SHELL_FORMAT

* Improve annotation (#138)

* Improve annotation

* Minor bugfix

* Selectively install through pip (#139)

Selectively install through pip 
* update setup.py

* fix paiTrainingService bugs (#137)

* fix nnictl bug

* add hdfs host validation

* fix bugs

* fix dockerfile

* fix install.sh

* update install.sh

* fix dockerfile

* Set timeout for HDFSUtility exists function

* remove unused TODO

* fix sdk

* add optional for outputDir and dataDir

* refactor dockerfile.base

* Remove unused import in hdfsclientUtility

* Add documentation for NNI PAI mode experiment (#141)

* Add documentation for NNI PAI mode

* Fix typo based on PR comments

* Exit with subprocess return code of trial keeper

* Remove additional exit code

* Fix typo based on PR comments

* update doc for smac tuner (#140)

* Revert "Selectively install through pip (#139)" due to potential pip install issue (#142)

* Revert "Selectively install through pip (#139)"

This reverts commit 1d174836.

* Add exit code of subprocess for trial_keeper

* Update README, add link to PAImode doc

* fix bug (#147)

* Refactor nnictl and add config_pai.yml (#144)

* fix nnictl bug

* add hdfs host validation

* fix bugs

* fix dockerfile

* fix install.sh

* update install.sh

* fix dockerfile

* Set timeout for HDFSUtility exists function

* remove unused TODO

* fix sdk

* add optional for outputDir and dataDir

* refactor dockerfile.base

* Remove unused import in hdfsclientUtility

* add config_pai.yml

* refactor nnictl create logic and add colorful print

* fix nnictl stop logic

* add annotation for config_pai.yml

* add document for start experiment

* fix config.yml

* fix document

* Fix trial keeper wrongly exit issue (#152)

* Fix trial keeper bug, use actual exitcode to exit rather than 1

* Fix bug of table sort (#145)

* Update doc for PAIMode and v0.2 release notes (#153)

* Update v0.2 documentation regards to release note and PAI training service

* Update document to describe NNI docker image

* Bug fix for SQuAD example tuner. (#134)

* Update Makefile (#151)

* test

* update setup.py

* update Makefile and install.sh

* revert setup.py

* change color

* update doc

* update doc

* fix auto-completion's extra space

* update Makefile

* update webui

* Update doc image (#163)

* update doc

* trivial

* trivial

* trivial

* trivial

* trivial

* trivial

* update image

* update image size

* Update ga squad (#104)

* update readme in ga_squad

* update readme

* fix typo

* Update README.md

* Update README.md

* Update README.md

* update readme

* sklearn examples (#169)

* fix nnictl bug

* fix install.sh

* add sklearn-regression example

* add sklearn classification

* update sklearn

* update example

* remove additional code

* Update batch tuner (#158)

* update readme in ga_squad

* update readme

* fix typo

* Update README.md

* Update README.md

* Update README.md

* update readme

* update batch tuner

* Quickly fix cascading search space bug in tuner (#156)

* update readme in ga_squad

* update readme

* fix typo

* Update README.md

* Update README.md

* Update README.md

* update readme

* quickly fix cascading searchspace bug in tuner

* Add iterative search space example (#119)

* update readme in ga_squad

* update readme

* fix typo

* Update README.md

* Update README.md

* Update README.md

* update readme

* add iterative search space example

* update

* update readme

* change name

* Add api nni.get_sequence_id()

* Add sequence_id to TrialJobDetail

1388d763

Fix OpenPAI training service failed issue after multiphase training code merged (#206) · f4ee9f8a
fishyds authored Oct 12, 2018
```
* fix parameter file name issue for multi-phase training

* Updated based on comments
```
f4ee9f8a

08 Oct, 2018 1 commit

Multi-phase training service (#148) · 39085789

chicm-ms authored Oct 08, 2018

* Dev enas  - multi-phase hyper parameters support (#96)

* Multi-phase support

* Updates

* Updates

* updates

* updates

* updates

* Merge master to dev-enas (#117)

* Multi-phase support

* update document (#92)

* Edit readme.md

* updated a word

* Update GetStarted.md

* Update GetStarted.md

* refact readme, getstarted and write your trial md.

* Update README.md

* Update WriteYourTrial.md

* Update WriteYourTrial.md

* Update WriteYourTrial.md

* Update WriteYourTrial.md

* Fix nnictl bugs and add new feature (#75)

* fix nnictl bug

* fix nnictl create bug

* add experiment status logic

* add more information for nnictl

* fix Evolution Tuner bug

* refactor code

* fix code in updater.py

* fix nnictl --help

* fix classArgs bug

* update check response.status_code logic

* Updates

* remove Buffer warning (#100)

* update readme in ga_squad

* update readme

* fix typo

* Update README.md

* Update README.md

* Update README.md

* Updates

* updates

* updates

* updates

* Add support for debugging mode

* fix setup.py (#115)

* Add DAG model configuration format for SQuAD example.

* Explain config format for SQuAD QA model.

* Add more detailed introduction about the evolution algorithm.

* Merge master to dev-enas (#118)

* update document (#92)

* Edit readme.md

* updated a word

* Update GetStarted.md

* Update GetStarted.md

* refact readme, getstarted and write your trial md.

* Update README.md

* Update WriteYourTrial.md

* Update WriteYourTrial.md

* Update WriteYourTrial.md

* Update WriteYourTrial.md

* Fix nnictl bugs and add new feature (#75)

* fix nnictl bug

* fix nnictl create bug

* add experiment status logic

* add more information for nnictl

* fix Evolution Tuner bug

* refactor code

* fix code in updater.py

* fix nnictl --help

* fix classArgs bug

* update check response.status_code logic

* remove Buffer warning (#100)

* update readme in ga_squad

* update readme

* fix typo

* Update README.md

* Update README.md

* Update README.md

* Add support for debugging mode

* fix setup.py (#115)

* Add DAG model configuration format for SQuAD example.

* Explain config format for SQuAD QA model.

* Add more detailed introduction about the evolution algorithm.

* Fix install.sh and add trial log path (#109)

* fix nnictl bug

* fix nnictl create bug

* add experiment status logic

* add more information for nnictl

* fix Evolution Tuner bug

* refactor code

* fix code in updater.py

* fix nnictl --help

* fix classArgs bug

* update check response.status_code logic

* show trial log path

* update document

* fix install.sh

* set default value for maxTrialNum and maxExecDuration

* fix nnictl

* support multiPhase (#127)

* fix nnictl bug

* support multiPhase

* Fix multiphase datastore problem (#125)

* Fix multiphase datastore problem

* updates

* updates

* updates

* updates

* Pull latest code (#2)

* webui logpath and document (#135)

* Add webui document and logpath as a href

* fix tslint

* fix comments by Chengmin

* Pai training service bug fix and enhancement (#136)

* Add NNI installation scripts

* Update pai script, update NNI_out_dir

* Update NNI dir in nni sdk local.py

* Create .nni folder in nni sdk local.py

* Add check before creating .nni folder

* Fix typo for PAI_INSTALL_NNI_SHELL_FORMAT

* Improve annotation (#138)

* Improve annotation

* Minor bugfix

* Selectively install through pip (#139)

Selectively install through pip 
* update setup.py

* fix paiTrainingService bugs (#137)

* fix nnictl bug

* add hdfs host validation

* fix bugs

* fix dockerfile

* fix install.sh

* update install.sh

* fix dockerfile

* Set timeout for HDFSUtility exists function

* remove unused TODO

* fix sdk

* add optional for outputDir and dataDir

* refactor dockerfile.base

* Remove unused import in hdfsclientUtility

* Add documentation for NNI PAI mode experiment (#141)

* Add documentation for NNI PAI mode

* Fix typo based on PR comments

* Exit with subprocess return code of trial keeper

* Remove additional exit code

* Fix typo based on PR comments

* update doc for smac tuner (#140)

* Revert "Selectively install through pip (#139)" due to potential pip install issue (#142)

* Revert "Selectively install through pip (#139)"

This reverts commit 1d174836.

* Add exit code of subprocess for trial_keeper

* Update README, add link to PAImode doc

* fix bug (#147)

* Refactor nnictl and add config_pai.yml (#144)

* fix nnictl bug

* add hdfs host validation

* fix bugs

* fix dockerfile

* fix install.sh

* update install.sh

* fix dockerfile

* Set timeout for HDFSUtility exists function

* remove unused TODO

* fix sdk

* add optional for outputDir and dataDir

* refactor dockerfile.base

* Remove unused import in hdfsclientUtility

* add config_pai.yml

* refactor nnictl create logic and add colorful print

* fix nnictl stop logic

* add annotation for config_pai.yml

* add document for start experiment

* fix config.yml

* fix document

* Fix trial keeper wrongly exit issue (#152)

* Fix trial keeper bug, use actual exitcode to exit rather than 1

* Fix bug of table sort (#145)

* Update doc for PAIMode and v0.2 release notes (#153)

* Update v0.2 documentation regards to release note and PAI training service

* Update document to describe NNI docker image

* Bug fix for SQuAD example tuner. (#134)

* Update Makefile (#151)

* test

* update setup.py

* update Makefile and install.sh

* revert setup.py

* change color

* update doc

* update doc

* fix auto-completion's extra space

* update Makefile

* update webui

* Update doc image (#163)

* update doc

* trivial

* trivial

* trivial

* trivial

* trivial

* trivial

* update image

* update image size

* Update ga squad (#104)

* update readme in ga_squad

* update readme

* fix typo

* Update README.md

* Update README.md

* Update README.md

* update readme

* sklearn examples (#169)

* fix nnictl bug

* fix install.sh

* add sklearn-regression example

* add sklearn classification

* update sklearn

* update example

* remove additional code

* Update batch tuner (#158)

* update readme in ga_squad

* update readme

* fix typo

* Update README.md

* Update README.md

* Update README.md

* update readme

* update batch tuner

* Quickly fix cascading search space bug in tuner (#156)

* update readme in ga_squad

* update readme

* fix typo

* Update README.md

* Update README.md

* Update README.md

* update readme

* quickly fix cascading searchspace bug in tuner

* Add iterative search space example (#119)

* update readme in ga_squad

* update readme

* fix typo

* Update README.md

* Update README.md

* Update README.md

* update readme

* add iterative search space example

* update

* update readme

* change name

* updates

* updates

* Updates CI

* updates

39085789

27 Sep, 2018 1 commit

PAI Training Service implementation (#128) · d3506e34

fishyds authored Sep 27, 2018

* PAI Training service implementation
1. Implement PAITrainingService
2. Add trial-keeper python module, and modify setup.py to install the module
3. Add PAITrainingService rest server to collect metrics from PAI container.

d3506e34

14 Sep, 2018 1 commit

Merge latest code changes into Github Master (#54) · 3d221da9

fishyds authored Sep 14, 2018

* Merge latest code changes into Github Master

* temporary modification for travis

* temporary modification for travis

3d221da9

07 Sep, 2018 1 commit
- Merge from dogfood branch to master · 8314d6ee
  Deshui Yu authored Sep 07, 2018
  
  8314d6ee
24 Aug, 2018 1 commit
- [Code merge] Merge code from dogfood-v1 branch · 61d47a4d
  Deshui Yu authored Aug 24, 2018
  
  61d47a4d
20 Aug, 2018 1 commit
- NNI dogfood version 1 · 252f36f8
  Deshui Yu authored Aug 20, 2018
  
  252f36f8