Commits · f5d7e664a407bdac625b1d4f3764f029c4c62dd7 · OpenDAS / nni

27 May, 2019 2 commits

import finished trial data to tuner when experiment is resumed (#1107) · f5d7e664
QuanluZhang authored May 27, 2019
```
experiment resume
```
f5d7e664

NNI on Windows for NNI Remote mode (#1073) · a1f92666

demianzhang authored May 27, 2019

* test python

* test python36

* debug python

* debug python

* debug

* python version

* test python

* debug

* install nni

* install nni

* test powershell

* debug python

* test

* test python

* use python

* test python

* test python

* test

* update

* test powershell

* debug python

* debug python

* debug python

* debug powershell

* debug

* debug

* debug install.ps1

* add continueOnError: true

* debug

* debug

* update

* update

* add unittest

* test node

* update

* update joi

* debug joi

* add joi

* debug joi

* Update install

* update

* update

* add unittest

* add convert command

* add example

* fix windows commands

* debug

* fix tensorflow version

* fix pipeline

* update

* add gpu logic in windows

* update

* update

* debug

* fix commands

* fix commands

* update

* update

* Fix comments

* update

* fix kill command

* fix package.json

* Update package.json

* Refactor runScript

* Fix bug

* Fix comments

* Fix execKill

* Update

* Update

* Add unittest back

* Rollback install node

* Fix gpu memory

* Update

* Rollback check process

* Update mnist-hyperband.test.yml

* Update pipelines-it-local-windows.yml

* Update uninstall.ps1

* Fix virtual environment

* Fix tar

* Fix isAlive

* change gpu index logic

* test gpu index

* fix pipeline

* add cifar10

* fix cifar10

* remove gpu in cifar10

* test mnist gpu

* update

* debug

* Fix comments

* debug

* Update install.ps1

* debug

* update gpu metrics shell

* debug

* debug

* debug

* debug

* debug

* debug sigbreak

* Preinstall node-pre-gyp

* Update Installation.md

* Update Installation.md

* Remove install node-pre-gyp

* use taskkill to stop node process

* use ctrl+c event to stop process

* add sigterm signal in stop logic

* add ctrl+break command

* Update isAlive

* debug sigterm

* Update pypi readme

* Update

* fix stop logic

* fix pipeline, add cifar10

* revert mnist, remove gpu

* Fix virtualenv

* Fix comments

* Update

* Update

* Fix install

* Update install.ps1

* Update install.ps1

* Fix comments

* Fix virtualenv install

* Update

* Update

* Fix comments

* Update

* Update install.ps1

* Update

* Update localTrainingService.ts

* Update

* Update

* Update

* Update

* Update

* Update util.ts

* Update utils.ts

* Fix system slash

* Update tmp dir

* Fix system slash

* Use python3 in remote

* Write tar command to file

* Update tar

* Update

* Update

* Fix stop

* Update StopSignal type

* Add removeTrialJobMetricListener

* remove Listeners

* Update listener

* Update

* Use Temp dir

* Use Temp dir

* Add remote windows pipeline

* Update pipelines-it-remote-windows.yml

* Update

* remote build wheel

* Update pipelines-it-remote-windows.yml

* debug

* debug

* Use docker source install

* Update

* Update

* Rollback remote build wheel

* Use self node and yarn

* Fix docker source install

* Rollback Makefile

* Upgrade docker pip

* Update

* Update

* Remote build wheel

* Use inline runOptions

* Hide wget output

* Add continueOnError

* Update

* Update

* Update

* Upgrade pip

* Add chmod

* Update

* debug

* Update

* Use pscp

* Update

* Download putty

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* debug

* exclude metis

* Refactor pathJoin

* Update

* debug metis

* debug metis

* Update

* Update dependency

* Fix comments

* Update

* Fix tslint

* Fix comments

* Fix comments

* add doc

* Fix comments

* Update

* Update doc

a1f92666

23 May, 2019 2 commits
- Support paiTrainingService on windows (#1075) · feb6f3b8
  SparkSnail authored May 23, 2019
  
  feb6f3b8
- Support v1beta2 operator in KubeflowTrainingService (#1058) · 14a79654
  SparkSnail authored May 23, 2019
  
  14a79654
22 May, 2019 1 commit
- Add file path of dispatcher.log into dispatcher error info (#1094) · 6623dff3
  chicm-ms authored May 22, 2019
  
  6623dff3
16 May, 2019 1 commit
- Fix gpu detector in localTrainingService (#1068) · af89df8c
  SparkSnail authored May 16, 2019
  
  af89df8c
15 May, 2019 2 commits
- fix bug of state transition (#1077) · 0b864c7a
  QuanluZhang authored May 15, 2019
  
  0b864c7a
- Handle string type error (#1064) · 8818cb65
  chicm-ms authored May 15, 2019
  
  8818cb65
14 May, 2019 1 commit
- Update gpuScheduler.ts (#1043) · 34988d9b
  demianzhang authored May 14, 2019
  
  34988d9b
25 Apr, 2019 1 commit
- Check gpuIndices range for remote machine training service · a05db21b
  chicm-ms authored Apr 25, 2019
  
  a05db21b
22 Apr, 2019 2 commits
- Fix for master -- Node's API changes (#1002) · cf983800
  Zejun Lin authored Apr 22, 2019
```
* fs's API changed

* fix version

* roll back utils
```
  cf983800
- NNI on Windows for NNI Local mode (#937) · cfda0dae
  demianzhang authored Apr 22, 2019
  
  cfda0dae
19 Apr, 2019 2 commits
- Implement API for user to import data and export data of type `json` or `csv` (#980) · c8ef4141
  Zejun Lin authored Apr 19, 2019
  
  c8ef4141
- Fix local gpu issue (#1000) · 1d9b0a99
  chicm-ms authored Apr 19, 2019
  
  1d9b0a99
18 Apr, 2019 1 commit

Designated gpu devices for NNI trial jobs (#991) · ca99000d

chicm-ms authored Apr 18, 2019

* Refactoring local training service
* Designated GPU for local training service
* RemoteMachine designated GPU configuration

ca99000d

17 Apr, 2019 1 commit
- Fix paiToken update logic (#976) · 9a3c61e8
  SparkSnail authored Apr 17, 2019
  
  9a3c61e8
12 Apr, 2019 1 commit
- Add BOHB Advisor (#910) · 5aca94db
  Shufan Huang authored Apr 12, 2019
```
add BOHB Advisor
```
  5aca94db
11 Apr, 2019 3 commits
- fix not-successfully-kill issue (#968) · 130a2132
  QuanluZhang authored Apr 11, 2019
  
  130a2132
- Pai training service uses job queue for submitting jobs (#973) · 69b2e9aa
  chicm-ms authored Apr 11, 2019
```
* Use job queue for PAI training service
```
  69b2e9aa
- [PAI training service] show more error message when submitting job failed (#974) · 58b259a5
  fishyds authored Apr 11, 2019
```
* Show more error msg when submitting PAI job failed
```
  58b259a5
02 Apr, 2019 1 commit
- Add version check document in PAI, remote, kubeflow and frameworkcontroller (#947) · 29a23335
  SparkSnail authored Apr 02, 2019
  
  29a23335
01 Apr, 2019 1 commit
- Refactor local gpu scheduler (#943) · f05e685f
  SparkSnail authored Apr 01, 2019
  
  f05e685f
27 Mar, 2019 1 commit
- Support showing version check error message in WebUI (#922) · 21b48d29
  SparkSnail authored Mar 27, 2019
  
  21b48d29
26 Mar, 2019 1 commit
- Fix localTrainingService stream (#885) · bd346816
  SparkSnail authored Mar 26, 2019
  
  bd346816
25 Mar, 2019 1 commit
- Optimize query job performance (#898) · 8fd18a5a
  chicm-ms authored Mar 25, 2019
```
* Optimize job query performance
```
  8fd18a5a
22 Mar, 2019 2 commits

Route tuner and assessor commands to 2 separate queues (#891) · 63697ec5

chicm-ms authored Mar 22, 2019

1. Route tuner and assessor commands to 2 separate queues (issue #841).
2. Allow the tuner to leverage intermediate results when a trial is early stopped (issue #843).
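
The routing idea in item 1 can be sketched as follows. This is a minimal illustration, not NNI's dispatcher code: the `Dispatcher` class, the command names, and the `handled` list are all hypothetical and exist only to show two queues drained by two independent workers, so that a slow tuner command does not block assessor commands.

```python
import queue
import threading

# Hypothetical command names, used only for this illustration.
ASSESSOR_COMMANDS = {"REPORT_METRIC"}


class Dispatcher:
    """Routes each command to one of two queues, each with its own worker."""

    def __init__(self):
        self.tuner_queue = queue.Queue()
        self.assessor_queue = queue.Queue()
        self.handled = []  # (worker name, command, payload) records
        self._lock = threading.Lock()
        self._workers = [
            threading.Thread(target=self._drain, args=("tuner", self.tuner_queue)),
            threading.Thread(target=self._drain, args=("assessor", self.assessor_queue)),
        ]
        for worker in self._workers:
            worker.start()

    def enqueue(self, command, payload):
        # Assessor commands go to their own queue; everything else goes to the tuner.
        if command in ASSESSOR_COMMANDS:
            self.assessor_queue.put((command, payload))
        else:
            self.tuner_queue.put((command, payload))

    def _drain(self, name, q):
        while True:
            item = q.get()
            if item is None:  # sentinel: stop this worker
                break
            command, payload = item
            with self._lock:
                self.handled.append((name, command, payload))

    def stop(self):
        # Sentinels are queued after pending items, so pending work is finished first.
        for q in (self.tuner_queue, self.assessor_queue):
            q.put(None)
        for worker in self._workers:
            worker.join()
```

Calling `enqueue("REPORT_METRIC", ...)` lands on the assessor worker, while any other command lands on the tuner worker.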

63697ec5

Support remoteLoggingType (#901) · c297650a

SparkSnail authored Mar 22, 2019

If the user sets remoteLoggingType in the config file, log content will not be transmitted from trialkeeper.

c297650a

21 Mar, 2019 1 commit
- Wrong spell (#893) · a65532ca
  horizon365 authored Mar 21, 2019
```
Wrong spell in INFO log.
```
  a65532ca
20 Mar, 2019 2 commits
- Support setting dispatcher log level (#820) · ba8f3af3
  SparkSnail authored Mar 20, 2019
  
  ba8f3af3
- Support setting shmMB in pai config (#847) · 01d609a2
  SparkSnail authored Mar 20, 2019
  
  01d609a2
15 Mar, 2019 1 commit

Support version check of nni (#807) · d0b22fc7

SparkSnail authored Mar 15, 2019

Check the nni version in trialkeeper to make sure the version of trialkeeper is consistent with trainingService.
Add a debug mode in the config file.

d0b22fc7

14 Mar, 2019 1 commit

Fix ssh connection error (#829) · de9e2842

SparkSnail authored Mar 14, 2019

An SSH client has a maximum number of open channels per connection. If trialConcurrency is set too high, the SSH client executes commands over the same connection too frequently, and we hit the error `Error: (SSH) Channel open failure: open failed`.
This change refactors the code so that each connection has a maximum trial concurrency. When the number of trials reaches the SSH connection limit, a new SSH connection is created to execute trial commands.
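
The pooling behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual training-service code: `ConnectionPool`, `PooledConnection`, and the per-connection limit are hypothetical names, and no real SSH connection is opened.

```python
class PooledConnection:
    """Stands in for one SSH connection; tracks how many trials are using it."""

    def __init__(self, conn_id, max_trials):
        self.conn_id = conn_id
        self.max_trials = max_trials
        self.active_trials = 0

    @property
    def has_capacity(self):
        return self.active_trials < self.max_trials


class ConnectionPool:
    """Reuses a connection until it is full, then opens a new one."""

    def __init__(self, max_trials_per_connection):
        if max_trials_per_connection < 1:
            raise ValueError("max_trials_per_connection must be at least 1")
        self.max_trials = max_trials_per_connection
        self.connections = []

    def acquire(self):
        # Prefer an existing connection that still has a free slot.
        for conn in self.connections:
            if conn.has_capacity:
                conn.active_trials += 1
                return conn
        # Every connection is at its limit: open another one.
        conn = PooledConnection(len(self.connections), self.max_trials)
        conn.active_trials = 1
        self.connections.append(conn)
        return conn

    def release(self, conn):
        # Free one slot when a trial finishes.
        if conn.active_trials > 0:
            conn.active_trials -= 1
```

With a limit of 2 trials per connection, five concurrent `acquire()` calls create three connections, and a released slot is reused before any further connection is opened.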

de9e2842

13 Mar, 2019 1 commit

Fix inconsistent time format in nnimanager and dispatcher (#819) · 7d91796c

SparkSnail authored Mar 13, 2019

* fix remote bug

* add document

* add document

* fix remote issue

* fix forEach

* update doc according to comments

* remove 'any more'

* set nniManager.log and dispatcher.log time format to local time

7d91796c

25 Feb, 2019 5 commits

Local TrainingService UT (#772) · 51fbf695
SparkSnail authored Feb 25, 2019

51fbf695

Support webhdfs path in python hdfs client (#722) · 8c4c0ef2

SparkSnail authored Feb 25, 2019

trial_keeper uses port 50070 to connect to the webhdfs server, and PAI uses a mapping method to map port 50070 to port 5070 to reach the RESTful server. This approach is risky because PAI may not support this kind of mapping in later releases. The webhdfs client of trial_keeper now uses the Pylon path (/webhdfs/api/v1) instead of port 50070; the path is passed in from trainingService.
This PR makes the following changes:

1. Use the webhdfs path instead of port 50070 in the hdfs client.
2. Use the new hdfs package "PythonWebHDFS", which I built to support Pylon. You can try the new function with the "sparksnail/nni:dev-pai" image to test pai trainingService.
3. Rename some variables according to review comments.
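
The difference between the two addressing schemes can be sketched as follows. This is a hedged illustration only: `build_webhdfs_url` is a hypothetical helper, not part of PythonWebHDFS or trial_keeper, the host and file path are made up, and the assumption that the Pylon path is served over plain HTTP on the default port is mine.

```python
from urllib.parse import quote, urlencode


def build_webhdfs_url(host, hdfs_path, op, *, port=None,
                      api_prefix="/webhdfs/v1", **params):
    """Build a WebHDFS REST URL from either a port or a path prefix.

    port       -- explicit port such as 50070 (the older scheme)
    api_prefix -- path prefix; "/webhdfs/api/v1" mimics the Pylon path
    """
    netloc = f"{host}:{port}" if port is not None else host
    prefix = "/" + api_prefix.strip("/")
    path = quote("/" + hdfs_path.lstrip("/"))
    query = urlencode({"op": op, **params})
    return f"http://{netloc}{prefix}{path}?{query}"


# Older scheme: talk to the webhdfs port directly.
by_port = build_webhdfs_url("10.0.0.1", "/user/nni/trial.log", "OPEN", port=50070)

# Path-based scheme: no port, requests go through the Pylon path instead.
by_path = build_webhdfs_url("10.0.0.1", "/user/nni/trial.log", "OPEN",
                            api_prefix="/webhdfs/api/v1")
```

Passing the prefix as a parameter mirrors the description above, where the path is supplied by trainingService instead of being hard-coded in the client.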

8c4c0ef2

Support remote trialkeeper_log (#763) · b8e31971
SparkSnail authored Feb 25, 2019
```
* add trialkeeper_stdout and trialkeeper_stderr
* fix nnictl set remote nniManagerIP
```
b8e31971
Fix a race condition bug that does not store Trial Job cancel status correctly (#707) · 9a3a75c8
fishyds authored Feb 25, 2019
```
* Fix a race condition bug that does not store Trial Job cancel status correctly
```
9a3a75c8
Unit test for nnimanager (#770) · 982b30b5
demianzhang authored Feb 25, 2019
```
Unit test for nnimanager
```
982b30b5

30 Jan, 2019 1 commit
- Fix db init log and log format issue (#686) · cc95fef8
  chicm-ms authored Jan 30, 2019
  
  cc95fef8
29 Jan, 2019 1 commit

Migrate remote log (#655) · 9d3d926b

SparkSnail authored Jan 29, 2019

* fix remote bug

* add document

* add document

* update

* update

* update

* update

* fix remote issue

* fix forEach

* update doc according to comments

* update

* update

* update

* remove 'any more'

* add base version for remote-log

* change launcher.py

* test

* basic version

* debug

* debug

* basic work version

* fix code

* update disable_log

* remove unused line

* add disable log in kubernetesTrainingService

* add detect frameworkcontroller

* fix comment

* update

* update

* fix kubernetesData

* debug

* debug

* debug

* fix comment

* fix conflict

* remove local temp files

* revert launcher.py

* update code by comments

* remove disableLog

* remove disable Log

* set timeout for cleanup

* fix code by comments

* update variable names

* add comments

* add delay function

* update

* update

* update by comments

* add  in remote script path

* rename variables

* update variable name

* add mkdir -p for subfolder

9d3d926b