1. 30 May, 2019 1 commit
  2. 28 May, 2019 1 commit
  3. 27 May, 2019 2 commits
    • import finished trial data to tuner when experiment is resumed (#1107) · f5d7e664
      QuanluZhang authored
      experiment resume
    • NNI on Windows for NNI Remote mode (#1073) · a1f92666
      demianzhang authored
      * test python
      
      * test python36
      
      * debug python
      
      * debug python
      
      * debug
      
      * python version
      
      * test python
      
      * debug
      
      * install nni
      
      * install nni
      
      * test powershell
      
      * debug python
      
      * test
      
      * test python
      
      * use python
      
      * test python
      
      * test python
      
      * test
      
      * update
      
      * test powershell
      
      * debug python
      
      * debug python
      
      * debug python
      
      * debug powershell
      
      * debug
      
      * debug
      
      * debug install.ps1
      
      * add continueOnError: true
      
      * debug
      
      * debug
      
      * update
      
      * update
      
      * add unittest
      
      * test node
      
      * update
      
      * update joi
      
      * debug joi
      
      * add joi
      
      * debug joi
      
      * Update install
      
      * update
      
      * update
      
      * add unittest
      
      * add convert command
      
      * add example
      
      * fix windows commands
      
      * debug
      
      * fix tensorflow version
      
      * fix pipeline
      
      * update
      
      * add gpu logic in windows
      
      * update
      
      * update
      
      * debug
      
      * fix commands
      
      * fix commands
      
      * update
      
      * update
      
      * Fix comments
      
      * update
      
      * fix kill command
      
      * fix package.json
      
      * Update package.json
      
      * Refactor runScript
      
      * Fix bug
      
      * Fix comments
      
      * Fix execKill
      
      * Update
      
      * Update
      
      * Add unittest back
      
      * Rollback install node
      
      * Fix gpu memory
      
      * Update
      
      * Rollback check process
      
      * Update mnist-hyperband.test.yml
      
      * Update pipelines-it-local-windows.yml
      
      * Update uninstall.ps1
      
      * Fix virtual environment
      
      * Fix tar
      
      * Fix isAlive
      
      * change gpu index logic
      
      * test gpu index
      
      * fix pipeline
      
      * add cifar10
      
      * fix cifar10
      
      * remove gpu in cifar10
      
      * test mnist gpu
      
      * update
      
      * debug
      
      * Fix comments
      
      * debug
      
      * Update install.ps1
      
      * debug
      
      * update gpu metrics shell
      
      * debug
      
      * debug
      
      * debug
      
      * debug
      
      * debug
      
      * debug sigbreak
      
      * Preinstall node-pre-gyp
      
      * Update Installation.md
      
      * Update Installation.md
      
      * Remove install node-pre-gyp
      
      * use taskkill to stop node process
      
      * use ctl+c event to stop process
      
      * add sigterm signal in stop logic
      
      * add ctl+break command
      
      * Update isAlive
      
      * debug sigterm
      
      * Update pypi readme
      
      * Update
      
      * fix stop logic
      
      * fix pipeline, add cifar10
      
      * revert mnist, remove gpu
      
      * Fix virtualenv
      
      * Fix comments
      
      * Update
      
      * Update
      
      * Fix install
      
      * Update install.ps1
      
      * Update install.ps1
      
      * Fix comments
      
      * Fix virtualenv install
      
      * Update
      
      * Update
      
      * Fix comments
      
      * Update
      
      * Update install.ps1
      
      * Update
      
      * Update localTrainingService.ts
      
      * Update
      
      * Update
      
      * Update
      
      * Update
      
      * Update
      
      * Update util.ts
      
      * Update utils.ts
      
      * Fix system slash
      
      * Update tmp dir
      
      * Fix system slash
      
      * Use python3 in remote
      
      * Write tar command to file
      
      * Update tar
      
      * Update
      
      * Update
      
      * Fix stop
      
      * Update StopSignal type
      
      * Add removeTrialJobMetricListener
      
      * remove Listeners
      
      * Update listener
      
      * Update
      
      * Use Temp dir
      
      * Use Temp dir
      
      * Add remote windows pipeline
      
      * Update pipelines-it-remote-windows.yml
      
      * Update
      
      * remote build wheel
      
      * Update pipelines-it-remote-windows.yml
      
      * debug
      
      * debug
      
      * Use docker source install
      
      * Update
      
      * Update
      
      * Rollback remote build wheel
      
      * Use self node and yarn
      
      * Fix docker source install
      
      * Rollback Makefile
      
      * Upgrade docker pip
      
      * Update
      
      * Update
      
      * Remote build wheel
      
      * Use inline runOptions
      
      * Hide wget output
      
      * Add continueOnError
      
      * Update
      
      * Update
      
      * Update
      
      * Upgrade pip
      
      * Add chmod
      
      * Update
      
      * debug
      
      * Update
      
      * Use pscp
      
      * Update
      
      * Download putty
      
      * Update
      
      * Update
      
      * Update
      
      * Update
      
      * Update
      
      * Update
      
      * Update
      
      * Update
      
      * Update
      
      * debug
      
      * exclude metis
      
      * Refactor pathJoin
      
      * Update
      
      * debug metis
      
      * debug metis
      
      * Update
      
      * Update dependency
      
      * Fix comments
      
      * Update
      
      * Fix tslint
      
      * Fix comments
      
      * Fix comments
      
      * add doc
      
      * Fix comments
      
      * Update
      
      * Update doc
  4. 23 May, 2019 2 commits
  5. 22 May, 2019 1 commit
  6. 16 May, 2019 1 commit
  7. 15 May, 2019 2 commits
  8. 14 May, 2019 1 commit
  9. 25 Apr, 2019 1 commit
  10. 22 Apr, 2019 2 commits
  11. 19 Apr, 2019 2 commits
  12. 18 Apr, 2019 1 commit
  13. 17 Apr, 2019 1 commit
  14. 12 Apr, 2019 1 commit
  15. 11 Apr, 2019 3 commits
  16. 02 Apr, 2019 1 commit
  17. 01 Apr, 2019 1 commit
  18. 27 Mar, 2019 1 commit
  19. 26 Mar, 2019 1 commit
  20. 25 Mar, 2019 1 commit
  21. 22 Mar, 2019 2 commits
  22. 21 Mar, 2019 1 commit
  23. 20 Mar, 2019 2 commits
  24. 15 Mar, 2019 1 commit
    • Support version check of nni (#807) · d0b22fc7
      SparkSnail authored
      Check the NNI version in trialkeeper to make sure the trialkeeper version is consistent with trainingService.
      Add a debug mode to the config file.
  25. 14 Mar, 2019 1 commit
    • Fix ssh connection error (#829) · de9e2842
      SparkSnail authored
      An SSH client can open only a limited number of channels on one connection. If trialConcurrency is set too high, the SSH client executes commands over that connection so frequently that it fails with "Error: (SSH) Channel open failure: open failed".
      Refactor the code so that each connection has a maximum trial concurrency. When the number of trials reaches that per-connection limit, a new SSH connection is created to execute trial commands.
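The per-connection limit described above can be sketched as a small pool. This is only an illustration under stated assumptions, not the code from this commit: the names `ConnectionPool`, `PooledConnection`, and `open_connection`, and the limit of 5 trials per connection, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PooledConnection:
    """One SSH connection plus the number of trials currently using it."""
    conn: object
    active_trials: int = 0


@dataclass
class ConnectionPool:
    # Hypothetical factory that opens and returns a new SSH connection.
    open_connection: Callable[[], object]
    # Assumed per-connection limit; the real value is not given in the commit.
    max_trials_per_connection: int = 5
    connections: List[PooledConnection] = field(default_factory=list)

    def acquire(self) -> PooledConnection:
        """Reuse a connection with spare capacity, else open a new one."""
        for pooled in self.connections:
            if pooled.active_trials < self.max_trials_per_connection:
                pooled.active_trials += 1
                return pooled
        pooled = PooledConnection(conn=self.open_connection(), active_trials=1)
        self.connections.append(pooled)
        return pooled

    def release(self, pooled: PooledConnection) -> None:
        """Mark one trial on this connection as finished."""
        pooled.active_trials = max(0, pooled.active_trials - 1)
```

A trial would call `acquire()` before running its command and `release()` when it finishes, so no single connection carries more than the configured number of trials at once.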
  26. 13 Mar, 2019 1 commit
  27. 25 Feb, 2019 5 commits
    • Local TrainingService UT (#772) · 51fbf695
      SparkSnail authored
    • Support webhdfs path in python hdfs client (#722) · 8c4c0ef2
      SparkSnail authored
      trial_keeper uses port 50070 to connect to the webhdfs server, and PAI maps port 50070 to port 5070 to reach the RESTful server. This is risky because PAI may not support this kind of mapping in later releases. The webhdfs client of trial_keeper now uses the Pylon path (/webhdfs/api/v1) instead of port 50070; the path is passed in from trainingService.
      This PR makes the following changes:
      
      1. Use the webhdfs path instead of port 50070 in the hdfs client.
      2. Switch to a new hdfs package, "PythonWebHDFS", which I built to support Pylon. You can test the PAI trainingService with the "sparksnail/nni:dev-pai" image.
      3. Rename some variables according to review comments.
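The difference between the two addressing styles can be sketched as follows. This is a rough illustration, not the PythonWebHDFS API: both function names are hypothetical, the `http` scheme and the host name are assumptions, and only the port 50070 and the path /webhdfs/api/v1 come from the commit message. The `/webhdfs/v1` prefix in the port-based form is the standard WebHDFS REST prefix.

```python
def port_based_webhdfs_url(host: str, port: int = 50070) -> str:
    """Old style: address the webhdfs server directly by port."""
    return "http://{}:{}/webhdfs/v1".format(host, port)


def path_based_webhdfs_url(host: str, webhdfs_path: str = "/webhdfs/api/v1") -> str:
    """New style: address the server through a path supplied by the caller."""
    # Normalize so the path has exactly one leading slash and no trailing slash.
    normalized = "/" + webhdfs_path.strip("/")
    return "http://{}{}".format(host, normalized)
```

Because the path is a parameter, the caller (trainingService in the description above) decides where the webhdfs endpoint lives, and the client no longer depends on a fixed port mapping.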
    • Support remote trialkeeper_log (#763) · b8e31971
      SparkSnail authored
      * add trialkeeper_stdout and trialkeeper_stderr
      * fix nnictl set remote nniManagerIP
    • Fix a race condition bug that does not store Trial Job cancel status correctly (#707) · 9a3a75c8
      fishyds authored
      * Fix a race condition bug that does not store Trial Job cancel status correctly
    • Unit test for nnimanager (#770) · 982b30b5
      demianzhang authored
      Unit test for nnimanager