- 27 May, 2019 2 commits
-
-
QuanluZhang authored
experiment resume
-
demianzhang authored
* test python * test python36 * debug python * debug python * debug * python version * test python * debug * install nni * install nni * test powershell * debug python * test * test python * use python * test python * test python * test * update * test powershell * debug python * debug python * debug python * debug powershell * debug * debug * debug install.ps1 * add continueOnError: true * debug * debug * update * update * add unittest * test node * update * update joi * debug joi * add joi * debug joi * Update install * update * update * add unittest * add convert command * add example * fix windows commands * debug * fix tensorflow version * fix pipeline * update * add gpu logic in windows * update * update * debug * fix commands * fix commands * update * update * Fix comments * update * fix kill command * fix package.json * Update package.json * Refactor runScript * Fix bug * Fix comments * Fix execKill * Update * Update * Add unittest back * Rollback install node * Fix gpu memory * Update * Rollback check process * Update mnist-hyperband.test.yml * Update pipelines-it-local-windows.yml * Update uninstall.ps1 * Fix virtual environment * Fix tar * Fix isAlive * change gpu index logic * test gpu index * fix pipeline * add cifar10 * fix cifar10 * remove gpu in cifar10 * test mnist gpu * update * debug * Fix comments * debug * Update install.ps1 * debug * update gpu metrics shell * debug * debug * debug * debug * debug * debug sigbreak * Preinstall node-pre-gyp * Update Installation.md * Update Installation.md * Remove install node-pre-gyp * use taskkill to stop node process * use ctl+c event to stop process * add sigtrem signal in stop logic * add ctl+break command * Update isAlive * debug sigterm * Update pypi readme * Update * fix stop logic * fix pipeline, add cifar10 * revert mnist, remove gpu * Fix virtualenv * Fix comments * Update * Update * Fix install * Update install.ps1 * Update install.ps1 * Fix comments * Fix virtualenv install * Update * Update * Fix 
comments * Update * Update install.ps1 * Update * Update localTrainingService.ts * Update * Update * Update * Update * Update * Update util.ts * Update utils.ts * Fix system slash * Update tmp dir * Fix system slash * Use python3 in remote * Write tar command to file * Update tar * Update * Update * Fix stop * Update StopSignal type * Add removeTrialJobMetricListener * remove Listeners * Update listener * Update * Use Temp dir * Use Temp dir * Add remote windows pipeline * Update pipelines-it-remote-windows.yml * Update * remote build wheel * Update pipelines-it-remote-windows.yml * debug * debug * Use docker source install * Update * Update * Rollback remote build wheel * Use self node and yarn * Fix docker source install * Rollback Makefile * Upgrade docker pip * Update * Update * Remote build wheel * Use inline runOptions * Hide wget output * Add continueOnError * Update * Update * Update * Upgrade pip * Add chmod * Update * debug * Update * Use pscp * Update * Download putty * Update * Update * Update * Update * Update * Update * Update * Update * Update * debug * exclude metis * Refactor pathJoin * Update * debug metis * debug metis * Update * Update dependency * Fix comments * Update * Fix tslint * Fix comments * Fix comments * add doc * Fix comments * Update * Update doc
-
- 23 May, 2019 2 commits
-
-
SparkSnail authored
-
SparkSnail authored
-
- 22 May, 2019 1 commit
-
-
chicm-ms authored
-
- 16 May, 2019 1 commit
-
-
SparkSnail authored
-
- 15 May, 2019 2 commits
-
-
QuanluZhang authored
-
chicm-ms authored
-
- 14 May, 2019 1 commit
-
-
demianzhang authored
-
- 25 Apr, 2019 1 commit
-
-
chicm-ms authored
-
- 22 Apr, 2019 2 commits
-
-
Zejun Lin authored
* fs's API changed * fix version * roll back utils
-
demianzhang authored
-
- 19 Apr, 2019 2 commits
- 18 Apr, 2019 1 commit
-
-
chicm-ms authored
* Refactoring local training service * Designated GPU for local training service * RemoteMachine designated GPU configuration
-
- 17 Apr, 2019 1 commit
-
-
SparkSnail authored
-
- 12 Apr, 2019 1 commit
-
-
Shufan Huang authored
add BOHB Advisor
-
- 11 Apr, 2019 3 commits
-
-
QuanluZhang authored
-
chicm-ms authored
* Use job queue for PAI training service
-
fishyds authored
* Show more error msg when submitting PAI job failed
-
- 02 Apr, 2019 1 commit
-
-
SparkSnail authored
-
- 01 Apr, 2019 1 commit
-
-
SparkSnail authored
-
- 27 Mar, 2019 1 commit
-
-
SparkSnail authored
-
- 26 Mar, 2019 1 commit
-
-
SparkSnail authored
-
- 25 Mar, 2019 1 commit
-
-
chicm-ms authored
* Optimize job query performance
-
- 22 Mar, 2019 2 commits
-
-
chicm-ms authored
1. Route tuner and assessor commands to 2 separate queues (issue #841). 2. Allow the tuner to leverage intermediate results when a trial is early stopped (issue #843).
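The queue split described in this commit can be pictured with a minimal sketch. It is not the project's dispatcher code: the command names, the `ASSESSOR_COMMANDS` set, and the `route` helper are all hypothetical and only illustrate sending assessor-bound commands to one queue and everything else to another.

```python
import queue

# Hypothetical command names, chosen only for illustration.
ASSESSOR_COMMANDS = {"intermediate_metric"}

# One queue per consumer, so a slow tuner cannot delay assessor work.
tuner_queue = queue.Queue()
assessor_queue = queue.Queue()


def route(command, payload):
    """Put the command on the assessor queue if it is assessor-bound, else on the tuner queue."""
    target = assessor_queue if command in ASSESSOR_COMMANDS else tuner_queue
    target.put((command, payload))
    return target


route("intermediate_metric", {"trial": 1, "value": 0.42})
route("request_trial", {"count": 1})
```

With separate queues, each consumer can drain its own backlog independently of the other.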
-
SparkSnail authored
If the user sets remoteloggingType in the config file, log content will not be transmitted from trialkeeper.
-
- 21 Mar, 2019 1 commit
-
-
horizon365 authored
Fix misspelling in INFO log.
-
- 20 Mar, 2019 2 commits
-
-
SparkSnail authored
-
SparkSnail authored
-
- 15 Mar, 2019 1 commit
-
-
SparkSnail authored
Check the nni version in trialkeeper to make sure the trialkeeper version is consistent with trainingService. Add a debug mode in the config file.
-
- 14 Mar, 2019 1 commit
-
-
SparkSnail authored
An SSH client has a maximum number of open channels per connection. If trialConcurrency is set too high, the SSH client executes commands over the same connection too frequently and fails with the error "Error: (SSH) Channel open failure: open failed". This refactoring gives each connection a maximum trial concurrency; when the number of trials reaches that limit, a new SSH connection is created to execute trial commands.
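The per-connection limit described in this commit can be sketched as a small pool. This is an illustration under stated assumptions, not the project's implementation: `Connection` and `ConnectionPool` are hypothetical names, no real SSH session is opened, and the limit of 2 in the usage lines is arbitrary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Connection:
    """Stand-in for one SSH connection (hypothetical; no real SSH is opened)."""
    conn_id: int
    max_trials: int
    active_trials: int = 0

    def has_capacity(self) -> bool:
        return self.active_trials < self.max_trials


@dataclass
class ConnectionPool:
    """Reuses connections and opens a new one only when all existing ones are full."""
    max_trials_per_connection: int
    connections: List[Connection] = field(default_factory=list)

    def acquire(self) -> Connection:
        # Prefer an existing connection that still has a free slot.
        for conn in self.connections:
            if conn.has_capacity():
                conn.active_trials += 1
                return conn
        # Every connection is at its limit: create another one.
        conn = Connection(len(self.connections), self.max_trials_per_connection, active_trials=1)
        self.connections.append(conn)
        return conn

    def release(self, conn: Connection) -> None:
        # Free one slot when a trial finishes.
        if conn.active_trials > 0:
            conn.active_trials -= 1


pool = ConnectionPool(max_trials_per_connection=2)
first = pool.acquire()
second = pool.acquire()
third = pool.acquire()  # both slots on the first connection are taken, so a second connection is created
```

Capping trials per connection keeps each connection below the channel limit, at the cost of holding more connections open when concurrency is high.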
-
- 13 Mar, 2019 1 commit
-
-
SparkSnail authored
* fix remote bug * add document * add document * fix remote issue * fix forEach * update doc according to comments * remove 'any more' * set nniManager.log and dispatcher.log time format to local time
-
- 25 Feb, 2019 5 commits
-
-
SparkSnail authored
-
SparkSnail authored
trial_keeper used port 50070 to connect to the webhdfs server, and PAI mapped port 50070 to port 5070 to reach the RESTful server. This is risky because PAI may not support this kind of mapping in later releases. The webhdfs client in trial_keeper now uses the Pylon path (/webhdfs/api/v1) instead of port 50070; the path is passed in from trainingService. Changes in this PR: 1. Use the webhdfs path instead of port 50070 in the hdfs client. 2. Use the new hdfs package "PythonWebHDFS", which I built to support Pylon. You can test the PAI trainingService with the "sparksnail/nni:dev-pai" image. 3. Rename some variables according to review comments.
-
SparkSnail authored
* add trialkeeper_stdout and trialkeeper_stderr * fix nnictl set remote nniManagerIP
-
fishyds authored
* Fix a race condition bug that does not store Trial Job cancel status correctly
-
demianzhang authored
Unit test for nnimanager
-
- 30 Jan, 2019 1 commit
-
-
chicm-ms authored
-
- 29 Jan, 2019 1 commit
-
-
SparkSnail authored
* fix remote bug * add document * add document * update * update * update * update * fix remote issue * fix forEach * update doc according to comments * update * update * update * remove 'any more' * add base version for remote-log * change launcher.py * test * basic version * debug * debug * basic work version * fix code * update disable_log * remove unused line * add disable log in kubernetesTrainingService * add detect frameworkcontroller * fix comment * update * update * fix kubernetesData * debug * debug * debug * fix comment * fix conflict * remove local temp files * revert launcher.py * update code by comments * remove disableLog * remove disable Log * set timeout for cleanup * fix code by comments * update variable names * add comments * add delay function * update * update * update by comments * add in remote script path * rename variables * update variable name * add mkdir -p for subfolder
-