- 07 May, 2020 1 commit
-
-
Chi Song authored
Update NNI manager ESLint rules. 1. Add a rule to ignore unused args and treat other unused variables as errors. 2. Add a rule to treat a missing explicit function return type as an error. 3. Fix all resulting errors.
-
- 30 Apr, 2020 1 commit
-
-
Chi Song authored
To support Windows nodes in remote mode, this PR adds a layer of commands (osCommands) to handle the differences between Windows and Unix-like OSes. To share code, ShellExecutor is added to enrich the original SshClient class. I will implement the Windows version of the commands in the next phase. This pattern can be extended to Local or other platforms in the future, so I moved the related code to the common folder for sharing.
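The command layer described in this commit can be pictured roughly as follows. This is a minimal sketch, not NNI's actual osCommands code: the interface, the class names, the method names and the `getOsCommands` helper are all hypothetical, and the Windows variant assumes cmd.exe built-ins.

```typescript
// Hypothetical interface; the method names are illustrative, not NNI's actual osCommands API.
interface OsCommands {
    joinPath(parts: string[]): string;
    createFolder(folder: string): string;
    removeFolder(folder: string): string;
}

// Commands for Unix-like remote machines.
// Note: the simple quoting below does not escape quotes embedded in the path.
class UnixCommands implements OsCommands {
    public joinPath(parts: string[]): string {
        return parts.join('/');
    }
    public createFolder(folder: string): string {
        return `mkdir -p '${folder}'`;
    }
    public removeFolder(folder: string): string {
        return `rm -rf '${folder}'`;
    }
}

// Commands for Windows remote machines (assumes cmd.exe as the shell).
class WindowsCommands implements OsCommands {
    public joinPath(parts: string[]): string {
        return parts.join('\\');
    }
    public createFolder(folder: string): string {
        return `if not exist "${folder}" mkdir "${folder}"`;
    }
    public removeFolder(folder: string): string {
        return `rmdir /s /q "${folder}"`;
    }
}

// Callers pick an implementation once and stay OS-agnostic afterwards.
function getOsCommands(isWindows: boolean): OsCommands {
    return isWindows ? new WindowsCommands() : new UnixCommands();
}

const commands: OsCommands = getOsCommands(false);
const trialDir: string = commands.joinPath(['/tmp', 'nni', 'trial-1']);
console.log(commands.createFolder(trialDir)); // mkdir -p '/tmp/nni/trial-1'
```

Because each method only returns a command string, the same executor can send it over SSH regardless of the remote OS, which is what would let the pattern be shared with other platforms.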
-
- 26 Apr, 2020 1 commit
-
-
Chi Song authored
Add shell support for ssh connection, so that remote scripts can be started with the user's environment. Minor fixes: 1. Fix gpu_metrics_collector to support pyenv. pyenv creates one more process, so the original pgrep code always found extra processes and could not start gpu_metrics_collector. 2. Fix the NASUI failure in dev-install-node-modules by creating the subfolder every time. 3. Fix the Makefile to reduce mis-created links, and other minor issues. 4. Add node --watch for nni_manager for a better dev experience.
-
- 15 Jan, 2020 1 commit
-
-
chicm-ms authored
-
- 18 Dec, 2019 1 commit
-
-
chicm-ms authored
* Fix local system as remote machine issue #1852
-
- 11 Dec, 2019 1 commit
-
-
chicm-ms authored
* enable eslint * remove tslint
-
- 10 Dec, 2019 1 commit
-
-
chicm-ms authored
* update eslint rules * auto fix eslint * manually fix eslint (#1833)
-
- 25 Nov, 2019 1 commit
-
-
liuzhe-lz authored
-
- 22 Nov, 2019 2 commits
-
-
SparkSnail authored
-
SparkSnail authored
-
- 21 Nov, 2019 1 commit
-
-
liuzhe-lz authored
* fix gpu script permission issue * make gpu tool local to user
-
- 26 Sep, 2019 1 commit
-
-
liuzhe-lz authored
* Refactor web UI to support incremental metric loading * refactor * Remove host job * Move sequence ID to NNI manager * implement incremental loading
-
- 14 Aug, 2019 1 commit
-
-
Guoxin authored
* squash commits in v1.0 first round bug bash
-
- 12 Aug, 2019 1 commit
-
-
suiguoxin authored
-
- 02 Aug, 2019 1 commit
-
-
SparkSnail authored
-
- 20 Jun, 2019 1 commit
-
-
demianzhang authored
* fix local and remote training services tslint
-
- 19 Jun, 2019 1 commit
-
-
Hongarc authored
-
- 03 Jun, 2019 1 commit
-
-
SparkSnail authored
-
- 28 May, 2019 1 commit
-
-
SparkSnail authored
-
- 27 May, 2019 1 commit
-
-
demianzhang authored
* test python * test python36 * debug python * debug python * debug * python version * test python * debug * install nni * install nni * test powershell * debug python * test * test python * use python * test python * test python * test * update * test powershell * debug python * debug python * debug python * debug powershell * debug * debug * debug install.ps1 * add continueOnError: true * debug * debug * update * update * add unittest * test node * update * update joi * debug joi * add joi * debug joi * Update install * update * update * add unittest * add convert command * add example * fix windows commands * debug * fix tensorflow version * fix pipeline * update * add gpu logic in windows * update * update * debug * fix commands * fix commands * update * update * Fix comments * update * fix kill command * fix package.json * Update package.json * Refactor runScript * Fix bug * Fix comments * Fix execKill * Update * Update * Add unittest back * Rollback install node * Fix gpu memory * Update * Rollback check process * Update mnist-hyperband.test.yml * Update pipelines-it-local-windows.yml * Update uninstall.ps1 * Fix virtual environment * Fix tar * Fix isAlive * change gpu index logic * test gpu index * fix pipeline * add cifar10 * fix cifar10 * remove gpu in cifar10 * test mnist gpu * update * debug * Fix comments * debug * Update install.ps1 * debug * update gpu metrics shell * debug * debug * debug * debug * debug * debug sigbreak * Preinstall node-pre-gyp * Update Installation.md * Update Installation.md * Remove install node-pre-gyp * use taskkill to stop node process * use ctl+c event to stop process * add sigtrem signal in stop logic * add ctl+break command * Update isAlive * debug sigterm * Update pypi readme * Update * fix stop logic * fix pipeline, add cifar10 * revert mnist, remove gpu * Fix virtualenv * Fix comments * Update * Update * Fix install * Update install.ps1 * Update install.ps1 * Fix comments * Fix virtualenv install * Update * Update * Fix 
comments * Update * Update install.ps1 * Update * Update localTrainingService.ts * Update * Update * Update * Update * Update * Update util.ts * Update utils.ts * Fix system slash * Update tmp dir * Fix system slash * Use python3 in remote * Write tar command to file * Update tar * Update * Update * Fix stop * Update StopSignal type * Add removeTrialJobMetricListener * remove Listeners * Update listener * Update * Use Temp dir * Use Temp dir * Add remote windows pipeline * Update pipelines-it-remote-windows.yml * Update * remote build wheel * Update pipelines-it-remote-windows.yml * debug * debug * Use docker source install * Update * Update * Rollback remote build wheel * Use self node and yarn * Fix docker source install * Rollback Makefile * Upgrade docker pip * Update * Update * Remote build wheel * Use inline runOptions * Hide wget output * Add continueOnError * Update * Update * Update * Upgrade pip * Add chmod * Update * debug * Update * Use pscp * Update * Download putty * Update * Update * Update * Update * Update * Update * Update * Update * Update * debug * exclude metis * Refactor pathJoin * Update * debug metis * debug metis * Update * Update dependency * Fix comments * Update * Fix tslint * Fix comments * Fix comments * add doc * Fix comments * Update * Update doc
-
- 22 Apr, 2019 1 commit
-
-
demianzhang authored
-
- 01 Apr, 2019 1 commit
-
-
SparkSnail authored
-
- 27 Mar, 2019 1 commit
-
-
SparkSnail authored
-
- 22 Mar, 2019 1 commit
-
-
SparkSnail authored
If the user sets remoteloggingType in the config file, log content will not be transmitted from trialkeeper.
-
- 15 Mar, 2019 1 commit
-
-
SparkSnail authored
Check the NNI version in trialkeeper to make sure the version of trialkeeper is consistent with trainingService. Add a debug mode in the config file.
-
- 14 Mar, 2019 1 commit
-
-
SparkSnail authored
An SSH client has a maximum number of open channels per connection. If trialConcurrency is set too high, the SSH client executes commands over the same connection too frequently and fails with Error: (SSH) Channel open failure: open failed. Refactor the code so that each connection has a maximum trial concurrency; when the number of trials reaches the limit of an SSH connection, a new SSH connection is created to execute trial commands.
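The allocation policy this commit describes can be sketched as a small pool that caps the number of trials per connection. This is only an illustration under assumptions: `PooledConnection`, `SshConnectionPool` and the per-connection limit of 2 are hypothetical, and a real pool would hold actual SSH client handles rather than counters.

```typescript
// Hypothetical stand-in for one SSH connection and the number of trials using it.
interface PooledConnection {
    id: number;
    activeTrials: number;
}

class SshConnectionPool {
    private readonly maxTrialsPerConnection: number;
    private readonly connections: PooledConnection[] = [];

    constructor(maxTrialsPerConnection: number) {
        this.maxTrialsPerConnection = maxTrialsPerConnection;
    }

    // Reuse the first connection with spare capacity; open a new one when all are full.
    public acquire(): PooledConnection {
        for (const connection of this.connections) {
            if (connection.activeTrials < this.maxTrialsPerConnection) {
                connection.activeTrials += 1;
                return connection;
            }
        }
        const created: PooledConnection = { id: this.connections.length, activeTrials: 1 };
        this.connections.push(created);
        return created;
    }

    // Free one slot when a trial finishes.
    public release(connection: PooledConnection): void {
        if (connection.activeTrials > 0) {
            connection.activeTrials -= 1;
        }
    }

    public connectionCount(): number {
        return this.connections.length;
    }
}

const pool = new SshConnectionPool(2);
const first = pool.acquire();
const second = pool.acquire();
const third = pool.acquire(); // the first connection is full, so a second one is opened
console.log(first.id, second.id, third.id, pool.connectionCount()); // 0 0 1 2
```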
-
- 25 Feb, 2019 1 commit
-
-
fishyds authored
* Fix a race condition bug that does not store Trial Job cancel status correctly
-
- 29 Jan, 2019 1 commit
-
-
SparkSnail authored
* fix remote bug * add document * add document * update * update * update * update * fix remote issue * fix forEach * update doc according to comments * update * update * update * remove 'any more' * add base version for remote-log * change launcher.py * test * basic version * debug * debug * basic work version * fix code * update disable_log * remove unused line * add diable log in kubernetesTrainingService * add detect frameworkcontroller * fix comment * update * update * fix kubernetesData * debug * debug * debug * fix comment * fix conflict * remove local temp files * revert launcher.py * update code by comments * remove disableLog * remove disable Log * set timeout for cleanup * fix code by comments * update variable names * add comments * add delay function * update * update * update by comments * add in remote script path * rename variables * update variable name * add mkdir -p for subfolder
-
- 25 Jan, 2019 1 commit
-
-
chicm-ms authored
* Pull code (#22) * Support distributed job for frameworkcontroller (#612) support distributed job for frameworkcontroller * Multiphase doc (#519) * multiPhase doc * updates * updates * Add time parser for 'nnictl update duration' (#632) Current nnictl update duration only support seconds unit, add a parser for this command to support {s, m, h, d} * fix experiment state bug (#629) * update top README.md (#622) * Update README.md * update (#634) * Integration tests refactoring (#625) * Integration test refactoring (#21) (#616) * Integration test refactoring (#21) * Refactoring integration tests * test metrics * update azure pipeline * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * update trigger * Integration test refactoring (#618) * updates * updates * update pipeline (#619) * update pipeline * updates * updates * updates * updates * updates * test pipeline (#623) * test pipeline * updates * updates * updates * Update integration test (#624) * Update integration test * updates * updates * updates * updates * updates * updates * Revert "Pull code (#22)" This reverts commit 62fc165ad7b2ba724eead3b99f010aa34491e2c7. * Update nnimanager logs * updates * Update README.md * Revert "Update README.md" This reverts commit bc67061160e5d57305a6e7fb63d491d12d0e9002. * updates * updates
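The time parser mentioned in #632 above accepts a number followed by one of the units {s, m, h, d}. The sketch below shows the idea in TypeScript for consistency with the other examples here; nnictl itself is a Python tool, and `parseDurationSeconds` and `SECONDS_PER_UNIT` are hypothetical names, not its actual code.

```typescript
// Seconds represented by each supported unit suffix.
const SECONDS_PER_UNIT: { [unit: string]: number } = { s: 1, m: 60, h: 3600, d: 86400 };

// Convert a string such as "90m" or "2h" into a number of seconds.
// Assumption: a unit suffix is required and only whole numbers are accepted.
function parseDurationSeconds(text: string): number {
    const match = /^(\d+)([smhd])$/.exec(text.trim());
    if (match === null) {
        throw new Error(`Invalid duration "${text}": expected a number followed by s, m, h or d`);
    }
    return parseInt(match[1], 10) * SECONDS_PER_UNIT[match[2]];
}

console.log(parseDurationSeconds('90m')); // 5400
```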
-
- 04 Jan, 2019 1 commit
-
-
SparkSnail authored
Trial jobs could not be stopped on the remote machine when the experiment was stopped, because async/await does not work as expected inside forEach; see https://codeburst.io/javascript-async-await-with-foreach-b6ba62bbf404.
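The forEach pitfall behind this fix can be reproduced in a few lines. The sketch below is illustrative only: `stopTrialJob` is a hypothetical stand-in that simulates a remote call with a timer, and the function names are not taken from the NNI code base.

```typescript
// Hypothetical stand-in for stopping one remote trial job.
async function stopTrialJob(id: string, stopped: string[]): Promise<void> {
    // Simulate an asynchronous SSH round trip.
    await new Promise<void>((resolve) => setTimeout(resolve, 10));
    stopped.push(id);
}

// Buggy pattern: forEach ignores the promises returned by the async callback,
// so this function resolves before any job has actually been stopped.
async function stopAllWithForEach(ids: string[], stopped: string[]): Promise<void> {
    ids.forEach(async (id: string) => {
        await stopTrialJob(id, stopped);
    });
}

// Fixed pattern: for...of awaits each job before moving on to the next.
async function stopAllSequentially(ids: string[], stopped: string[]): Promise<void> {
    for (const id of ids) {
        await stopTrialJob(id, stopped);
    }
}

// Alternative fix: start all jobs at once and wait for every promise.
async function stopAllConcurrently(ids: string[], stopped: string[]): Promise<void> {
    await Promise.all(ids.map((id: string) => stopTrialJob(id, stopped)));
}

const stoppedJobs: string[] = [];
stopAllSequentially(['trial-1', 'trial-2'], stoppedJobs).then(() => {
    // Both jobs have been stopped by the time this promise resolves.
    console.log(stoppedJobs);
});
```

Either fixed variant guarantees that cleanup code running after the call sees every job stopped; the sequential form is easier to reason about, while `Promise.all` is faster when the jobs are independent.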
-
- 03 Jan, 2019 1 commit
-
-
Shinai Yang (FA TALENT) authored
-
- 29 Nov, 2018 1 commit
-
-
fishyds authored
* Add codeDir file count validation for setClusterConfig * fix a small bug if find command is not installed * Remove codeDir validation for local training service * Remove useless import
-
- 27 Nov, 2018 1 commit
-
-
Yan Ni authored
* update Makefile for mac support, wait for aka.ms support * refix Makefile for colorful echo * update Makefile with shorturl * fix false fail on mac webui * fix cross os remote tmpdir issue * add readonly to RemoteMachineTrainingService.remoteOS * fix var name for PR 386
-
- 25 Nov, 2018 1 commit
-
-
QuanluZhang authored
* add one more trial job status, EARLY_STOPPED * fix datastore/nnimanager/mockeddatastore. test/webui/metrics_reader not done. USER_TO_CANCEL * fix bug * modifications based on Deshui's comments * fix bug * fix bug in remote mode
-
- 20 Nov, 2018 1 commit
-
-
fishyds authored
* Kubeflow TrainingService support, v1 (#373) 1. Create a new training service, the Kubeflow training service, which uses 'kubectl' and the Kubeflow tfjobs CRD to submit and manage jobs 2. Update the NNI Python SDK to support the new Kubeflow platform 3. Update the NNI Python SDK's get_sequence_id() implementation to read the NNI_TRIAL_SEQ_ID env variable instead of the .nni/sequence_id file 4. This version only supports the TensorFlow operator. Support for more operators will be added in future versions
-
- 02 Nov, 2018 1 commit
-
-
chicm-ms authored
-
- 16 Oct, 2018 1 commit
-
-
fishyds authored
* Updated based on comments * Fix bug, make get_parameters() idempotent * Add idempotent support for get_parameters() in LocalTrainingService
-
- 12 Oct, 2018 2 commits
-
-
chicm-ms authored
* Pull latest code (#2) * webui logpath and document (#135) * Add webui document and logpath as a href * fix tslint * fix comments by Chengmin * Pai training service bug fix and enhancement (#136) * Add NNI installation scripts * Update pai script, update NNI_out_dir * Update NNI dir in nni sdk local.py * Create .nni folder in nni sdk local.py * Add check before creating .nni folder * Fix typo for PAI_INSTALL_NNI_SHELL_FORMAT * Improve annotation (#138) * Improve annotation * Minor bugfix * Selectively install through pip (#139) Selectively install through pip * update setup.py * fix paiTrainingService bugs (#137) * fix nnictl bug * add hdfs host validation * fix bugs * fix dockerfile * fix install.sh * update install.sh * fix dockerfile * Set timeout for HDFSUtility exists function * remove unused TODO * fix sdk * add optional for outputDir and dataDir * refactor dockerfile.base * Remove unused import in hdfsclientUtility * Add documentation for NNI PAI mode experiment (#141) * Add documentation for NNI PAI mode * Fix typo based on PR comments * Exit with subprocess return code of trial keeper * Remove additional exit code * Fix typo based on PR comments * update doc for smac tuner (#140) * Revert "Selectively install through pip (#139)" due to potential pip install issue (#142) * Revert "Selectively install through pip (#139)" This reverts commit 1d174836. 
* Add exit code of subprocess for trial_keeper * Update README, add link to PAImode doc * fix bug (#147) * Refactor nnictl and add config_pai.yml (#144) * fix nnictl bug * add hdfs host validation * fix bugs * fix dockerfile * fix install.sh * update install.sh * fix dockerfile * Set timeout for HDFSUtility exists function * remove unused TODO * fix sdk * add optional for outputDir and dataDir * refactor dockerfile.base * Remove unused import in hdfsclientUtility * add config_pai.yml * refactor nnictl create logic and add colorful print * fix nnictl stop logic * add annotation for config_pai.yml * add document for start experiment * fix config.yml * fix document * Fix trial keeper wrongly exit issue (#152) * Fix trial keeper bug, use actual exitcode to exit rather than 1 * Fix bug of table sort (#145) * Update doc for PAIMode and v0.2 release notes (#153) * Update v0.2 documentation regards to release note and PAI training service * Update document to describe NNI docker image * Bug fix for SQuAD example tuner. 
(#134) * Update Makefile (#151) * test * update setup.py * update Makefile and install.sh * rever setup.py * change color * update doc * update doc * fix auto-completion's extra space * update Makefile * update webui * Update doc image (#163) * update doc * trivial * trivial * trivial * trivial * trivial * trivial * update image * update image size * Update ga squad (#104) * update readme in ga_squad * update readme * fix typo * Update README.md * Update README.md * Update README.md * update readme * sklearn examples (#169) * fix nnictl bug * fix install.sh * add sklearn-regression example * add sklearn classification * update sklearn * update example * remove additional code * Update batch tuner (#158) * update readme in ga_squad * update readme * fix typo * Update README.md * Update README.md * Update README.md * update readme * update batch tuner * Quickly fix cascading search space bug in tuner (#156) * update readme in ga_squad * update readme * fix typo * Update README.md * Update README.md * Update README.md * update readme * quickly fix cascading searchspace bug in tuner * Add iterative search space example (#119) * update readme in ga_squad * update readme * fix typo * Update README.md * Update README.md * Update README.md * update readme * add iterative search space example * update * update readme * change name * Add api nni.get_sequence_id() * Add sequence_id to TrialJobDetail
-
fishyds authored
* fix parameter file name issue for multi-phase training * Updated based on comments
-
- 08 Oct, 2018 1 commit
-
-
chicm-ms authored
* Dev enas - multi-phase hyper parameters support (#96) * Multi-phase support * Updates * Updates * updates * updates * updates * Merge master to dev-enas (#117) * Multi-phase support * update document (#92) * Edit readme.md * updated a word * Update GetStarted.md * Update GetStarted.md * refact readme, getstarted and write your trial md. * Update README.md * Update WriteYourTrial.md * Update WriteYourTrial.md * Update WriteYourTrial.md * Update WriteYourTrial.md * Fix nnictl bugs and add new feature (#75) * fix nnictl bug * fix nnictl create bug * add experiment status logic * add more information for nnictl * fix Evolution Tuner bug * refactor code * fix code in updater.py * fix nnictl --help * fix classArgs bug * update check response.status_code logic * Updates * remove Buffer warning (#100) * update readme in ga_squad * update readme * fix typo * Update README.md * Update README.md * Update README.md * Updates * updates * updates * updates * Add support for debugging mode * fix setup.py (#115) * Add DAG model configuration format for SQuAD example. * Explain config format for SQuAD QA model. * Add more detailed introduction about the evolution algorithm. * Merge master to dev-enas (#118) * update document (#92) * Edit readme.md * updated a word * Update GetStarted.md * Update GetStarted.md * refact readme, getstarted and write your trial md. 
* Update README.md * Update WriteYourTrial.md * Update WriteYourTrial.md * Update WriteYourTrial.md * Update WriteYourTrial.md * Fix nnictl bugs and add new feature (#75) * fix nnictl bug * fix nnictl create bug * add experiment status logic * add more information for nnictl * fix Evolution Tuner bug * refactor code * fix code in updater.py * fix nnictl --help * fix classArgs bug * update check response.status_code logic * remove Buffer warning (#100) * update readme in ga_squad * update readme * fix typo * Update README.md * Update README.md * Update README.md * Add support for debugging mode * fix setup.py (#115) * Add DAG model configuration format for SQuAD example. * Explain config format for SQuAD QA model. * Add more detailed introduction about the evolution algorithm. * Fix install.sh add add trial log path (#109) * fix nnictl bug * fix nnictl create bug * add experiment status logic * add more information for nnictl * fix Evolution Tuner bug * refactor code * fix code in updater.py * fix nnictl --help * fix classArgs bug * update check response.status_code logic * show trial log path * update document * fix install.sh * set default vallue for maxTrialNum and maxExecDuration * fix nnictl * support multiPhase (#127) * fix nnictl bug * support multiPhase * Fix multiphase datastore problem (#125) * Fix multiphase datastore problem * updates * updates * updates * updates * Pull latest code (#2) * webui logpath and document (#135) * Add webui document and logpath as a href * fix tslint * fix comments by Chengmin * Pai training service bug fix and enhancement (#136) * Add NNI installation scripts * Update pai script, update NNI_out_dir * Update NNI dir in nni sdk local.py * Create .nni folder in nni sdk local.py * Add check before creating .nni folder * Fix typo for PAI_INSTALL_NNI_SHELL_FORMAT * Improve annotation (#138) * Improve annotation * Minor bugfix * Selectively install through pip (#139) Selectively install through pip * update setup.py * fix 
paiTrainingService bugs (#137) * fix nnictl bug * add hdfs host validation * fix bugs * fix dockerfile * fix install.sh * update install.sh * fix dockerfile * Set timeout for HDFSUtility exists function * remove unused TODO * fix sdk * add optional for outputDir and dataDir * refactor dockerfile.base * Remove unused import in hdfsclientUtility * Add documentation for NNI PAI mode experiment (#141) * Add documentation for NNI PAI mode * Fix typo based on PR comments * Exit with subprocess return code of trial keeper * Remove additional exit code * Fix typo based on PR comments * update doc for smac tuner (#140) * Revert "Selectively install through pip (#139)" due to potential pip install issue (#142) * Revert "Selectively install through pip (#139)" This reverts commit 1d174836. * Add exit code of subprocess for trial_keeper * Update README, add link to PAImode doc * fix bug (#147) * Refactor nnictl and add config_pai.yml (#144) * fix nnictl bug * add hdfs host validation * fix bugs * fix dockerfile * fix install.sh * update install.sh * fix dockerfile * Set timeout for HDFSUtility exists function * remove unused TODO * fix sdk * add optional for outputDir and dataDir * refactor dockerfile.base * Remove unused import in hdfsclientUtility * add config_pai.yml * refactor nnictl create logic and add colorful print * fix nnictl stop logic * add annotation for config_pai.yml * add document for start experiment * fix config.yml * fix document * Fix trial keeper wrongly exit issue (#152) * Fix trial keeper bug, use actual exitcode to exit rather than 1 * Fix bug of table sort (#145) * Update doc for PAIMode and v0.2 release notes (#153) * Update v0.2 documentation regards to release note and PAI training service * Update document to describe NNI docker image * Bug fix for SQuAD example tuner. 
(#134) * Update Makefile (#151) * test * update setup.py * update Makefile and install.sh * rever setup.py * change color * update doc * update doc * fix auto-completion's extra space * update Makefile * update webui * Update doc image (#163) * update doc * trivial * trivial * trivial * trivial * trivial * trivial * update image * update image size * Update ga squad (#104) * update readme in ga_squad * update readme * fix typo * Update README.md * Update README.md * Update README.md * update readme * sklearn examples (#169) * fix nnictl bug * fix install.sh * add sklearn-regression example * add sklearn classification * update sklearn * update example * remove additional code * Update batch tuner (#158) * update readme in ga_squad * update readme * fix typo * Update README.md * Update README.md * Update README.md * update readme * update batch tuner * Quickly fix cascading search space bug in tuner (#156) * update readme in ga_squad * update readme * fix typo * Update README.md * Update README.md * Update README.md * update readme * quickly fix cascading searchspace bug in tuner * Add iterative search space example (#119) * update readme in ga_squad * update readme * fix typo * Update README.md * Update README.md * Update README.md * update readme * add iterative search space example * update * update readme * change name * updates * updates * Updates CI * updates
-