  1. 19 Apr, 2019 1 commit
  2. 18 Apr, 2019 1 commit
  3. 17 Apr, 2019 1 commit
  4. 11 Apr, 2019 2 commits
  5. 02 Apr, 2019 1 commit
  6. 01 Apr, 2019 1 commit
  7. 27 Mar, 2019 1 commit
  8. 26 Mar, 2019 1 commit
  9. 22 Mar, 2019 1 commit
  10. 20 Mar, 2019 1 commit
  11. 15 Mar, 2019 1 commit
    • Support version check of nni (#807) · d0b22fc7
      SparkSnail authored
      Check the nni version in trialkeeper to make sure the trialkeeper version is consistent with the trainingService version.
      Add a debug mode to the config file.
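A minimal sketch, in Python, of how a version check like the one described in #807 could work. The commit message does not show the comparison rule, so everything here is an assumption made for illustration: the function names, the rule that two versions are consistent when their major and minor parts match, and the idea that the debug flag turns a mismatch into a non-fatal result.

```python
import re


def major_minor(version: str):
    """Parse strings such as 'v0.5.2' or '0.5' into a (major, minor) tuple."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def versions_consistent(trial_keeper_version: str, training_service_version: str) -> bool:
    """Assumed rule: versions are consistent when major and minor parts match."""
    return major_minor(trial_keeper_version) == major_minor(training_service_version)


def check_version(trial_keeper_version: str, training_service_version: str, debug: bool = False) -> bool:
    """Return True on a match; on a mismatch raise, unless debug is set.

    The debug behavior is hypothetical: it only reports the mismatch
    by returning False instead of raising.
    """
    if versions_consistent(trial_keeper_version, training_service_version):
        return True
    if debug:
        return False
    raise RuntimeError(
        f"trialkeeper version {trial_keeper_version} does not match "
        f"trainingService version {training_service_version}"
    )
```

With this sketch, `check_version("v0.5.2", "0.5.1")` passes, while `check_version("0.5.2", "0.6.0")` raises unless `debug=True` is given.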
  12. 14 Mar, 2019 1 commit
    • Fix ssh connection error (#829) · de9e2842
      SparkSnail authored
      An SSH client has a maximum number of open channels per connection. If trialConcurrency is set too high, the ssh client executes commands over ssh so frequently that it fails with "Error: (SSH) Channel open failure: open failed".
      Refactor the code so that each connection has a maximum trial concurrency. When the number of trials reaches the ssh connection limit, a new ssh connection is created to execute trial commands.
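The idea in #829 (cap the number of trials per ssh connection and open a new connection once the cap is reached) can be sketched as a small pool. This is an illustration in Python, not the actual implementation: `Connection`, `ConnectionPool`, and `max_trials_per_connection` are hypothetical names, and no real ssh connection is opened.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Connection:
    """Stand-in for one ssh connection; a real client would hold a socket here."""
    conn_id: int
    trials_in_use: int = 0


@dataclass
class ConnectionPool:
    """Hands out connections so that none serves more trials than the cap."""
    max_trials_per_connection: int
    connections: List[Connection] = field(default_factory=list)

    def acquire(self) -> Connection:
        # Reuse the first connection that still has spare capacity.
        for conn in self.connections:
            if conn.trials_in_use < self.max_trials_per_connection:
                conn.trials_in_use += 1
                return conn
        # Every existing connection is at its cap: create a new one.
        conn = Connection(conn_id=len(self.connections), trials_in_use=1)
        self.connections.append(conn)
        return conn

    def release(self, conn: Connection) -> None:
        # Free one slot when a trial finishes.
        if conn.trials_in_use > 0:
            conn.trials_in_use -= 1
```

With a cap of 2, the first two calls to `acquire()` share one connection and the third call creates a second connection, so no single connection is asked to open more channels than it allows.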
  13. 25 Feb, 2019 4 commits
    • Local TrainingService UT (#772) · 51fbf695
      SparkSnail authored
    • Support webhdfs path in python hdfs client (#722) · 8c4c0ef2
      SparkSnail authored
      trial_keeper uses port 50070 to connect to the webhdfs server, and PAI maps port 50070 to port 5070 to reach the RESTful server. This is risky because PAI may not support this kind of mapping in later releases. The webhdfs client of trial_keeper now uses the Pylon path (/webhdfs/api/v1) instead of port 50070; the path is passed in from trainingService.
      This PR makes the following changes:
      
      1. Use the webhdfs path instead of port 50070 in the hdfs client.
      2. Use the new hdfs package "PythonWebHDFS", which I built to support Pylon. You can try the new function with the "sparksnail/nni:dev-pai" image when testing the pai trainingService.
      3. Rename some variables according to review comments.
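To make the difference in #722 concrete, the following Python sketch builds a WebHDFS request URL in both styles: addressed by port, and addressed by a gateway path. The helper names and the host value are hypothetical, and the sketch does not use the PythonWebHDFS package mentioned above. Only the default port 50070 and the path /webhdfs/api/v1 come from the commit message; `/webhdfs/v1` is the standard WebHDFS REST prefix.

```python
from urllib.parse import urlencode


def webhdfs_url_by_port(host: str, hdfs_path: str, op: str, port: int = 50070) -> str:
    """Port-based style: talk to the namenode's WebHDFS port directly."""
    query = urlencode({"op": op})
    return f"http://{host}:{port}/webhdfs/v1{hdfs_path}?{query}"


def webhdfs_url_by_path(host: str, hdfs_path: str, op: str,
                        webhdfs_path: str = "/webhdfs/api/v1") -> str:
    """Path-based style: go through a gateway path, with no explicit port."""
    query = urlencode({"op": op})
    base = webhdfs_path.rstrip("/")
    return f"http://{host}{base}{hdfs_path}?{query}"
```

For a hypothetical host `10.0.0.1` and HDFS path `/user/nni/out`, the first helper yields `http://10.0.0.1:50070/webhdfs/v1/user/nni/out?op=LISTSTATUS` and the second yields `http://10.0.0.1/webhdfs/api/v1/user/nni/out?op=LISTSTATUS`.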
    • Support remote trialkeeper_log (#763) · b8e31971
      SparkSnail authored
      * add trialkeeper_stdout and trialkeeper_stderr
      * fix nnictl set remote nniManagerIP
    • Fix a race condition bug that does not store Trial Job cancel status correctly (#707) · 9a3a75c8
      fishyds authored
      * Fix a race condition bug that does not store Trial Job cancel status correctly
  14. 29 Jan, 2019 1 commit
    • Migrate remote log (#655) · 9d3d926b
      SparkSnail authored
      * fix remote bug
      
      * add document
      
      * add document
      
      * update
      
      * update
      
      * update
      
      * update
      
      * fix remote issue
      
      * fix forEach
      
      * update doc according to comments
      
      * update
      
      * update
      
      * update
      
      * remove 'any more'
      
      * add base version for remote-log
      
      * change launcher.py
      
      * test
      
      * basic version
      
      * debug
      
      * debug
      
      * basic work version
      
      * fix code
      
      * update disable_log
      
      * remove unused line
      
      * add disable log in kubernetesTrainingService
      
      * add detect frameworkcontroller
      
      * fix comment
      
      * update
      
      * update
      
      * fix kubernetesData
      
      * debug
      
      * debug
      
      * debug
      
      * fix comment
      
      * fix conflict
      
      * remove local temp files
      
      * revert launcher.py
      
      * update code by comments
      
      * remove disableLog
      
      * remove disable Log
      
      * set timeout for cleanup
      
      * fix code by comments
      
      * update variable names
      
      * add comments
      
      * add delay function
      
      * update
      
      * update
      
      * update by comments
      
      * add  in remote script path
      
      * rename variables
      
      * update variable name
      
      * add mkdir -p for subfolder
  15. 25 Jan, 2019 2 commits
    • [PAI bug fixing] Fix the incorrect PAI webhdfs endpoint path (#653) · 6bc12de0
      fishyds authored
      
      * Fix PAI webhdfs api endpoint
    • Refactoring nnimanager log (#652) · 6d591989
      chicm-ms authored
      * Pull code (#22)
      
      * Support distributed job for frameworkcontroller (#612)
      
      support distributed job for frameworkcontroller
      
      * Multiphase doc (#519)
      
      * multiPhase doc
      
      * updates
      
      * updates
      
      * Add time parser for 'nnictl update duration' (#632)
      
      Currently nnictl update duration only supports seconds; add a parser for this command to support the units {s, m, h, d}.
      
      * fix experiment state bug (#629)
      
      * update top README.md (#622)
      
      * Update README.md
      
      * update (#634)
      
      * Integration tests refactoring (#625)
      
      * Integration test refactoring (#21) (#616)
      
      * Integration test refactoring (#21)
      
      * Refactoring integration tests
      
      * test metrics
      
      * update azure pipeline
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * update trigger
      
      * Integration test refactoring (#618)
      
      * updates
      
      * updates
      
      * update pipeline (#619)
      
      * update pipeline
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * test pipeline (#623)
      
      * test pipeline
      
      * updates
      
      * updates
      
      * updates
      
      * Update integration test (#624)
      
      * Update integration test
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * Revert "Pull code (#22)"
      
      This reverts commit 62fc165ad7b2ba724eead3b99f010aa34491e2c7.
      
      * Update nnimanager logs
      
      * updates
      
      * Update README.md
      
      * Revert "Update README.md"
      
      This reverts commit bc67061160e5d57305a6e7fb63d491d12d0e9002.
      
      * updates
      
      * updates
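The time parser for nnictl update duration (#632) listed in this commit accepts the units {s, m, h, d}. A minimal Python sketch of such a parser follows. The function name, the accepted syntax (a whole number followed by an optional unit letter), and the choice to treat a bare number as seconds are assumptions, not the actual nnictl code.

```python
import re

# Seconds per unit for the four units named in the commit message.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text: str) -> int:
    """Convert strings such as '90', '2m', '1h' or '3d' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd]?)", text.strip())
    if match is None:
        raise ValueError(f"invalid duration: {text!r}")
    number, unit = match.groups()
    # Assumption: a bare number keeps the old meaning of seconds.
    return int(number) * UNIT_SECONDS[unit or "s"]
```

Under these assumptions `parse_duration("2m")` gives 120 and `parse_duration("1h")` gives 3600, while an unsupported unit such as `"1w"` raises `ValueError`.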
  16. 22 Jan, 2019 1 commit
  17. 16 Jan, 2019 1 commit
  18. 04 Jan, 2019 1 commit
  19. 29 Dec, 2018 1 commit
  20. 25 Dec, 2018 1 commit
    • support frameworkcontroller training service (#484) · 36dbc0fe
      SparkSnail authored
      Add a frameworkcontroller training service based on the kubeflow training service.
      Refactor the code structure: add a kubernetes training service as the parent class, with the kubeflow training service and the frameworkcontroller training service as child classes.
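The parent/child structure described in #484 can be pictured with the Python sketch below. It only illustrates the shape of the refactoring: the class and method names are hypothetical, and the job-spec dictionaries (including the `kind` values) are simplified placeholders rather than real manifests.

```python
from abc import ABC, abstractmethod


class KubernetesTrainingService(ABC):
    """Parent class: holds the logic shared by all Kubernetes-based services."""

    def __init__(self, namespace: str = "default"):
        self.namespace = namespace
        self.trial_jobs = {}

    def submit_trial_job(self, trial_id: str, command: str) -> dict:
        # Shared flow: build a platform-specific spec, then record the job.
        spec = self.build_job_spec(trial_id, command)
        self.trial_jobs[trial_id] = spec
        return spec

    @abstractmethod
    def build_job_spec(self, trial_id: str, command: str) -> dict:
        """Each child class describes its own kind of job."""


class KubeflowTrainingService(KubernetesTrainingService):
    def build_job_spec(self, trial_id: str, command: str) -> dict:
        # Placeholder spec for a kubeflow-style job.
        return {"kind": "TFJob", "name": trial_id,
                "namespace": self.namespace, "command": command}


class FrameworkControllerTrainingService(KubernetesTrainingService):
    def build_job_spec(self, trial_id: str, command: str) -> dict:
        # Placeholder spec for a frameworkcontroller-style job.
        return {"kind": "Framework", "name": trial_id,
                "namespace": self.namespace, "command": command}
```

Both child classes inherit `submit_trial_job` unchanged and differ only in `build_job_spec`, which is the kind of sharing a common parent class makes possible.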
  21. 21 Dec, 2018 1 commit
  22. 20 Dec, 2018 1 commit
    • [V0.4.1 Release] Merge v0.4.1 branch back to Master (#509) · ff834cea
      fishyds authored
      * Update nnictl.py
      
      Fix the issue that nnictl --version via pip installation doesn't work
      
      * Update kubeflow training service document (#494)
      
      * Remove kubectl related document, add messages for kubeconfig
      * Add design section for kubeflow training service
      * Move the image files for PAI training service doc into img folder.
      
      * Update KubeflowMode.md (#498)
      
      Update KubeflowMode.md, small terms change
      
      * [V0.4.1 bug fix] Cannot run kubeflow training service due to trial_keeper change (#503)
      
      * Update kubeflow training service document
      
      * fix a bug that kubeflow trial job cannot run
      
      * upgrade version number (#499)
      
      * [V0.4.1 bug fix] Support read K8S config from KUBECONFIG environment variable (#507)
      
      * Add KUBECONFIG env variable support
      
      * In main.ts, throw cached error to make sure nnictl can show the error in stderr
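Reading the K8S config location from the KUBECONFIG environment variable (#507) usually means "use the variable if it is set, otherwise fall back to the default file". The Python sketch below shows that lookup order. The function name is hypothetical, and taking only the first entry of a multi-file KUBECONFIG value is a simplification; this is not the code from the commit.

```python
import os
from pathlib import Path


def resolve_kubeconfig_path(environ=None) -> Path:
    """Return the kubeconfig file to load.

    `environ` defaults to os.environ; passing a dict makes the
    function easy to exercise without touching the real environment.
    """
    env = os.environ if environ is None else environ
    configured = env.get("KUBECONFIG")
    if configured:
        # KUBECONFIG may list several files separated by os.pathsep;
        # this sketch simply takes the first one.
        return Path(configured.split(os.pathsep)[0]).expanduser()
    # Conventional default location used by kubectl.
    return Path.home() / ".kube" / "config"
```

For example, `resolve_kubeconfig_path({"KUBECONFIG": "/tmp/cfg"})` returns `/tmp/cfg`, and an empty mapping falls back to `~/.kube/config`.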
  23. 19 Dec, 2018 1 commit
  24. 17 Dec, 2018 1 commit
  25. 14 Dec, 2018 1 commit
  26. 13 Dec, 2018 1 commit
  27. 10 Dec, 2018 1 commit
  28. 07 Dec, 2018 2 commits
  29. 05 Dec, 2018 2 commits
    • backward compatibility for mac: job end timestamp (#451) · b256dfc6
      Yan Ni authored
      * add pycharm project files to .gitignore list
      
      * update pylintrc to conform vscode settings
      
      * fix RemoteMachineMode for wrong trainingServicePlatform
      
      * add python cache files to gitignore list
      
      * move extract scalar reward logic from dispatcher to tuner
      
      * update tuner code corresponding to last commit
      
      * update doc for receive_trial_result api change
      
      * add numpy to package whitelist of pylint
      
      * distinguish param value from return reward for tuner.extract_scalar_reward
      
      * update pylintrc
      
      * add comments to dispatcher.handle_report_metric_data
      
      * update install for mac support
      
      * fix root mode bug on Makefile
      
      * Quick fix bug: nnictl port value error (#245)
      
      * fix port bug
      
      * Dev exp stop more (#221)
      
      * Exp stop refactor (#161)
      
      * Update RemoteMachineMode.md (#63)
      
      * Remove unused classes for SQuAD QA example.
      
      * Remove more unused functions for SQuAD QA example.
      
      * Fix default dataset config.
      
      * Add Makefile README (#64)
      
      * update document (#92)
      
      * Edit readme.md
      
      * updated a word
      
      * Update GetStarted.md
      
      * Update GetStarted.md
      
      * refact readme, getstarted and write your trial md.
      
      * Update README.md
      
      * Update WriteYourTrial.md
      
      * Update WriteYourTrial.md
      
      * Update WriteYourTrial.md
      
      * Update WriteYourTrial.md
      
      * Fix nnictl bugs and add new feature (#75)
      
      * fix nnictl bug
      
      * fix nnictl create bug
      
      * add experiment status logic
      
      * add more information for nnictl
      
      * fix Evolution Tuner bug
      
      * refactor code
      
      * fix code in updater.py
      
      * fix nnictl --help
      
      * fix classArgs bug
      
      * update check response.status_code logic
      
      * remove Buffer warning (#100)
      
      * update readme in ga_squad
      
      * update readme
      
      * fix typo
      
      * Update README.md
      
      * Update README.md
      
      * Update README.md
      
      * Add support for debugging mode
      
      * fix setup.py (#115)
      
      * Add DAG model configuration format for SQuAD example.
      
      * Explain config format for SQuAD QA model.
      
      * Add more detailed introduction about the evolution algorithm.
      
      * Fix install.sh and add trial log path (#109)
      
      * fix nnictl bug
      
      * fix nnictl create bug
      
      * add experiment status logic
      
      * add more information for nnictl
      
      * fix Evolution Tuner bug
      
      * refactor code
      
      * fix code in updater.py
      
      * fix nnictl --help
      
      * fix classArgs bug
      
      * update check response.status_code logic
      
      * show trial log path
      
      * update document
      
      * fix install.sh
      
      * set default value for maxTrialNum and maxExecDuration
      
      * fix nnictl
      
      * Dev smac (#116)
      
      * support package install (#91)
      
      * fix nnictl bug
      
      * support package install
      
      * update
      
      * update package install logic
      
      * Fix package install issue (#95)
      
      * fix nnictl bug
      
      * fix package install
      
      * support SMAC as a tuner on nni (#81)
      
      * update doc
      
      * update doc
      
      * update doc
      
      * update hyperopt installation
      
      * update doc
      
      * update doc
      
      * update description in setup.py
      
      * update setup.py
      
      * modify encoding
      
      * encoding
      
      * add encoding
      
      * remove pymc3
      
      * update doc
      
      * update builtin tuner spec
      
      * support smac in sdk, fix logging issue
      
      * support smac tuner
      
      * add optimize_mode
      
      * update config in nnictl
      
      * add __init__.py
      
      * update smac
      
      * update import path
      
      * update setup.py: remove entry_point
      
      * update rest server validation
      
      * fix bug in nnictl launcher
      
      * support classArgs: optimize_mode
      
      * quick fix bug
      
      * test travis
      
      * add dependency
      
      * add dependency
      
      * add dependency
      
      * add dependency
      
      * create smac python package
      
      * fix trivial points
      
      * optimize import of tuners, modify nnictl accordingly
      
      * fix bug: incorrect algorithm_name
      
      * trivial refactor
      
      * for debug
      
      * support virtual
      
      * update doc of SMAC
      
      * update smac requirements
      
      * update requirements
      
      * change debug mode
      
      * update doc
      
      * update doc
      
      * refactor based on comments
      
      * fix comments
      
      * modify example config path to relative path and increase maxTrialNum (#94)
      
      * modify example config path to relative path and increase maxTrialNum
      
      * add document
      
      * support conda (#90) (#110)
      
      * support install from venv and travis CI
      
      * support install from venv and travis CI
      
      * support install from venv and travis CI
      
      * support conda
      
      * support conda
      
      * modify example config path to relative path and increase maxTrialNum
      
      * undo messy commit
      
      * undo messy commit
      
      * Support pip install as root (#77)
      
      * Typo on #58 (#122)
      
      * PAI Training Service implementation (#128)
      
      * PAI Training service implementation
        1. Implement PAITrainingService
        2. Add trial-keeper python module, and modify setup.py to install the module
        3. Add PAITrainingService rest server to collect metrics from PAI container.
      
      * fix datastore for multiple final result (#129)
      
      * Update NNI v0.2 release notes (#132)
      
      Update NNI v0.2 release notes
      
      * Update setup.py Makefile and documents (#130)
      
      * update makefile and setup.py
      
      * update makefile and setup.py
      
      * update document
      
      * update document
      
      * Update Makefile no travis
      
      *  update doc
      
      *  update doc
      
      * fix convert from ss to pcs (#133)
      
      * Fix bugs about webui (#131)
      
      * Fix webui bugs
      
      * Fix tslint
      
      * webui logpath and document (#135)
      
      * Add webui document and logpath as a href
      
      * fix tslint
      
      * fix comments by Chengmin
      
      * Pai training service bug fix and enhancement (#136)
      
      * Add NNI installation scripts
      
      * Update pai script, update NNI_out_dir
      
      * Update NNI dir in nni sdk local.py
      
      * Create .nni folder in nni sdk local.py
      
      * Add check before creating .nni folder
      
      * Fix typo for PAI_INSTALL_NNI_SHELL_FORMAT
      
      * Improve annotation (#138)
      
      * Improve annotation
      
      * Minor bugfix
      
      * Selectively install through pip (#139)
      
      Selectively install through pip 
      * update setup.py
      
      * fix paiTrainingService bugs (#137)
      
      * fix nnictl bug
      
      * add hdfs host validation
      
      * fix bugs
      
      * fix dockerfile
      
      * fix install.sh
      
      * update install.sh
      
      * fix dockerfile
      
      * Set timeout for HDFSUtility exists function
      
      * remove unused TODO
      
      * fix sdk
      
      * add optional for outputDir and dataDir
      
      * refactor dockerfile.base
      
      * Remove unused import in hdfsclientUtility
      
      * Add documentation for NNI PAI mode experiment (#141)
      
      * Add documentation for NNI PAI mode
      
      * Fix typo based on PR comments
      
      * Exit with subprocess return code of trial keeper
      
      * Remove additional exit code
      
      * Fix typo based on PR comments
      
      * update doc for smac tuner (#140)
      
      * Revert "Selectively install through pip (#139)" due to potential pip install issue (#142)
      
      * Revert "Selectively install through pip (#139)"
      
      This reverts commit 1d174836.
      
      * Add exit code of subprocess for trial_keeper
      
      * Update README, add link to PAImode doc
      
      * Merge branch V0.2 to Master (#143)
      
      * webui logpath and document (#135)
      
      * Add webui document and logpath as a href
      
      * fix tslint
      
      * fix comments by Chengmin
      
      * Pai training service bug fix and enhancement (#136)
      
      * Add NNI installation scripts
      
      * Update pai script, update NNI_out_dir
      
      * Update NNI dir in nni sdk local.py
      
      * Create .nni folder in nni sdk local.py
      
      * Add check before creating .nni folder
      
      * Fix typo for PAI_INSTALL_NNI_SHELL_FORMAT
      
      * Improve annotation (#138)
      
      * Improve annotation
      
      * Minor bugfix
      
      * Selectively install through pip (#139)
      
      Selectively install through pip 
      * update setup.py
      
      * fix paiTrainingService bugs (#137)
      
      * fix nnictl bug
      
      * add hdfs host validation
      
      * fix bugs
      
      * fix dockerfile
      
      * fix install.sh
      
      * update install.sh
      
      * fix dockerfile
      
      * Set timeout for HDFSUtility exists function
      
      * remove unused TODO
      
      * fix sdk
      
      * add optional for outputDir and dataDir
      
      * refactor dockerfile.base
      
      * Remove unused import in hdfsclientUtility
      
      * Add documentation for NNI PAI mode experiment (#141)
      
      * Add documentation for NNI PAI mode
      
      * Fix typo based on PR comments
      
      * Exit with subprocess return code of trial keeper
      
      * Remove additional exit code
      
      * Fix typo based on PR comments
      
      * update doc for smac tuner (#140)
      
      * Revert "Selectively install through pip (#139)" due to potential pip install issue (#142)
      
      * Revert "Selectively install through pip (#139)"
      
      This reverts commit 1d174836.
      
      * Add exit code of subprocess for trial_keeper
      
      * Update README, add link to PAImode doc
      
      * fix bug (#147)
      
      * Refactor nnictl and add config_pai.yml (#144)
      
      * fix nnictl bug
      
      * add hdfs host validation
      
      * fix bugs
      
      * fix dockerfile
      
      * fix install.sh
      
      * update install.sh
      
      * fix dockerfile
      
      * Set timeout for HDFSUtility exists function
      
      * remove unused TODO
      
      * fix sdk
      
      * add optional for outputDir and dataDir
      
      * refactor dockerfile.base
      
      * Remove unused import in hdfsclientUtility
      
      * add config_pai.yml
      
      * refactor nnictl create logic and add colorful print
      
      * fix nnictl stop logic
      
      * add annotation for config_pai.yml
      
      * add document for start experiment
      
      * fix config.yml
      
      * fix document
      
      * Fix trial keeper wrongly exit issue (#152)
      
      * Fix trial keeper bug, use actual exitcode to exit rather than 1
      
      * Fix bug of table sort (#145)
      
      * Update doc for PAIMode and v0.2 release notes (#153)
      
      * Update v0.2 documentation regards to release note and PAI training service
      
      * Update document to describe NNI docker image
      
      * fix antd (#159)
      
      * refactor experiment stopping logic
      
      * support change concurrency
      
      * remove trialJobs.ts
      
      * trivial changes
      
      * fix bugs
      
      * fix bug
      
      * support updating maxTrialNum
      
      * Modify IT scripts for supporting multiple experiments
      
      * Update ci (#175)
      
      * Update RemoteMachineMode.md (#63)
      
      * Remove unused classes for SQuAD QA example.
      
      * Remove more unused functions for SQuAD QA example.
      
      * Fix default dataset config.
      
      * Add Makefile README (#64)
      
      * update document (#92)
      
      * Edit readme.md
      
      * updated a word
      
      * Update GetStarted.md
      
      * Update GetStarted.md
      
      * refact readme, getstarted and write your trial md.
      
      * Update README.md
      
      * Update WriteYourTrial.md
      
      * Update WriteYourTrial.md
      
      * Update WriteYourTrial.md
      
      * Update WriteYourTrial.md
      
      * Fix nnictl bugs and add new feature (#75)
      
      * fix nnictl bug
      
      * fix nnictl create bug
      
      * add experiment status logic
      
      * add more information for nnictl
      
      * fix Evolution Tuner bug
      
      * refactor code
      
      * fix code in updater.py
      
      * fix nnictl --help
      
      * fix classArgs bug
      
      * update check response.status_code logic
      
      * remove Buffer warning (#100)
      
      * update readme in ga_squad
      
      * update readme
      
      * fix typo
      
      * Update README.md
      
      * Update README.md
      
      * Update README.md
      
      * Add support for debugging mode
      
      * modify CI because of refactoring exp stop
      
      * update CI for expstop
      
      * update CI for expstop
      
      * update CI for expstop
      
      * update CI for expstop
      
      * update CI for expstop
      
      * update CI for expstop
      
      * update CI for expstop
      
      * update CI for expstop
      
      * update CI for expstop
      
      * file saving
      
      * fix issues from code merge
      
      * remove $(INSTALL_PREFIX)/nni/nni_manager before install
      
      * fix indent
      
      * fix merge issue
      
      * socket close
      
      * update port
      
      * fix merge error
      
      * modify ci logic in nnimanager
      
      * fix ci
      
      * fix bug
      
      * change suspended to done
      
      * update ci (#229)
      
      * update ci
      
      * update ci
      
      * update ci (#232)
      
      * update ci
      
      * update ci
      
      * update azure-pipelines
      
      * update azure-pipelines
      
      * update ci (#233)
      
      * update ci
      
      * update ci
      
      * update azure-pipelines
      
      * update azure-pipelines
      
      * update azure-pipelines
      
      * run.py (#238)
      
      * Nnupdate ci (#239)
      
      * run.py
      
      * test ci
      
      * Nnupdate ci (#240)
      
      * run.py
      
      * test ci
      
      * test ci
      
      * Udci (#241)
      
      * run.py
      
      * test ci
      
      * test ci
      
      * test ci
      
      * update ci (#242)
      
      * run.py
      
      * test ci
      
      * test ci
      
      * test ci
      
      * update ci
      
      * revert install.sh (#244)
      
      * run.py
      
      * test ci
      
      * test ci
      
      * test ci
      
      * update ci
      
      * revert install.sh
      
      * add comments
      
      * remove assert
      
      * trivial change
      
      * trivial change
      
      * update Makefile (#246)
      
      * update Makefile
      
      * update Makefile
      
      * quick fix for ci (#248)
      
      * add update trialNum and fix bugs (#261)
      
      * Add builtin tuner to CI (#247)
      
      * update Makefile
      
      * update Makefile
      
      * add builtin-tuner test
      
      * add builtin-tuner test
      
      * refactor ci
      
      * update azure.yml
      
      * add built-in tuner test
      
      * fix bugs
      
      * Doc refactor (#258)
      
      * doc refactor
      
      * image name refactor
      
      * Refactor nnictl to support listing stopped experiments. (#256)
      
      Refactor nnictl to support listing stopped experiments.
      
      * Show experiment parameters more beautifully (#262)
      
      * fix error on example of RemoteMachineMode (#269)
      
      * add pycharm project files to .gitignore list
      
      * update pylintrc to conform vscode settings
      
      * fix RemoteMachineMode for wrong trainingServicePlatform
      
      * Update docker file to use latest nni release (#263)
      
      * fix bug about execDuration and endTime (#270)
      
      * fix bug about execDuration and endTime
      
      * modify time interval to 30 seconds
      
      * refactor based on Gems's suggestion
      
      * for triggering ci
      
      * Refactor dockerfile (#264)
      
      * refactor Dockerfile
      
      * Support nnictl tensorboard (#268)
      
      support tensorboard
      
      * Sdk update (#272)
      
      * Rename get_parameters to get_next_parameter
      
      * annotations add get_next_parameter
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * updates
      
      * add experiment log path to experiment profile (#276)
      
      * refactor extract reward from dict by tuner
      
      * update Makefile for mac support, wait for aka.ms support
      
      * refix Makefile for colorful echo
      
      * update Makefile with shorturl
      
      * fix false fail on mac webui
      
      * fix cross os remote tmpdir issue
      
      * add readonly to RemoteMachineTrainingService.remoteOS
      
      * fix var name for PR 386
      
      * cross platform package
      
      * update pypi/makefile for multiple platform support
      
      * update linux os spec
      
      * update doc for installation & pypi
      
      * update readme
      
      * job timestamp compatibility for mac
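Several items in this commit concern the tuner's scalar reward extraction (moving the logic from the dispatcher to the tuner, and distinguishing a plain value from a dict for tuner.extract_scalar_reward). The commit message does not show the code, so the Python sketch below is only one plausible reading: a reported result is either a number, or a dict that carries the number under a key. The key name `default` and the error type are assumptions.

```python
def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def extract_scalar_reward(value, scalar_key: str = "default"):
    """Return the scalar reward from a trial result.

    Accepts either a bare number, or a dict whose `scalar_key`
    entry is a number (both forms are assumptions of this sketch).
    """
    if _is_number(value):
        return value
    if isinstance(value, dict) and _is_number(value.get(scalar_key)):
        return value[scalar_key]
    raise RuntimeError(
        f"result must be a number or a dict with a numeric {scalar_key!r} entry"
    )
```

Under this reading, both `extract_scalar_reward(0.93)` and `extract_scalar_reward({"default": 0.93, "loss": 0.1})` return `0.93`, and a dict without the key is rejected.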
    • [V0.4 Release] Kubeflow training service: Remove unused kubernetesServer config entry (#444) · 311d3da6
      fishyds authored
      * Remove unused kubernetesServer config entry in config file and schema validation
  30. 30 Nov, 2018 2 commits
  31. 29 Nov, 2018 1 commit
  32. 28 Nov, 2018 1 commit