- 20 Jun, 2019 1 commit
-
-
demianzhang authored
* fix local and remote training services tslint
-
- 19 Jun, 2019 2 commits
-
-
Hongarc authored
-
demianzhang authored
* Catch the error in pai training service * no retry
-
- 23 May, 2019 1 commit
-
-
SparkSnail authored
-
- 17 Apr, 2019 1 commit
-
-
SparkSnail authored
-
- 11 Apr, 2019 2 commits
- 27 Mar, 2019 1 commit
-
-
SparkSnail authored
-
- 22 Mar, 2019 1 commit
-
-
SparkSnail authored
If the user sets remoteloggingType in the config file, log content will not be transmitted from trialkeeper.
-
- 20 Mar, 2019 1 commit
-
-
SparkSnail authored
-
- 15 Mar, 2019 1 commit
-
-
SparkSnail authored
Check the NNI version in trialkeeper to make sure the trialkeeper version is consistent with trainingService. Add a debug mode in the config file.
-
- 25 Feb, 2019 2 commits
-
-
SparkSnail authored
trial_keeper uses port 50070 to connect to the webhdfs server, and PAI maps port 50070 to port 5070 to reach the RESTful server. This approach is risky because PAI may not support this kind of mapping in later releases. The webhdfs client of trial_keeper now uses the Pylon path (/webhdfs/api/v1) instead of port 50070; the path is passed in from trainingService. This PR makes these changes: 1. Use the webhdfs path instead of port 50070 in the hdfs client. 2. Use the new hdfs package "PythonWebHDFS", which I built to support Pylon. You can test PAI trainingService with the new functionality using the "sparksnail/nni:dev-pai" image. 3. Rename some variables according to review comments.
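The difference between the two addressing schemes in the commit above can be sketched as follows. This is only an illustration: the function names are hypothetical, the real client is the Python package named in the commit, and the way the gateway appends the HDFS path to the `/webhdfs/api/v1` prefix is an assumption.

```typescript
// Direct NameNode access: standard WebHDFS REST layout on an explicit port.
function directWebHdfsUrl(host: string, port: number, hdfsPath: string, op: string): string {
    return `http://${host}:${port}/webhdfs/v1${hdfsPath}?op=${op}`;
}

// Gateway access: the request goes through a path prefix, so no port is needed.
// ASSUMPTION: the gateway forwards <prefix><hdfs path>?op=... unchanged.
function gatewayWebHdfsUrl(host: string, prefix: string, hdfsPath: string, op: string): string {
    return `http://${host}${prefix}${hdfsPath}?op=${op}`;
}

// Example values are made up for illustration.
const direct: string = directWebHdfsUrl("10.0.0.1", 50070, "/user/nni", "LISTSTATUS");
const viaGateway: string = gatewayWebHdfsUrl("10.0.0.1", "/webhdfs/api/v1", "/user/nni", "LISTSTATUS");
```

With the prefix form, the client depends only on a path that trainingService hands over, not on a port mapping that the cluster may change.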
-
fishyds authored
* Fix a race condition that prevented the Trial Job cancel status from being stored correctly
-
- 25 Jan, 2019 2 commits
-
-
fishyds authored
* Fix PAI webhdfs api endpoint
-
chicm-ms authored
* Pull code (#22) * Support distributed job for frameworkcontroller (#612) support distributed job for frameworkcontroller * Multiphase doc (#519) * multiPhase doc * updates * updates * Add time parser for 'nnictl update duration' (#632) Current nnictl update duration only support seconds unit, add a parser for this command to support {s, m, h, d} * fix experiment state bug (#629) * update top README.md (#622) * Update README.md * update (#634) * Integration tests refactoring (#625) * Integration test refactoring (#21) (#616) * Integration test refactoring (#21) * Refactoring integration tests * test metrics * update azure pipeline * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * update trigger * Integration test refactoring (#618) * updates * updates * update pipeline (#619) * update pipeline * updates * updates * updates * updates * updates * test pipeline (#623) * test pipeline * updates * updates * updates * Update integration test (#624) * Update integration test * updates * updates * updates * updates * updates * updates * Revert "Pull code (#22)" This reverts commit 62fc165ad7b2ba724eead3b99f010aa34491e2c7. * Update nnimanager logs * updates * Update README.md * Revert "Update README.md" This reverts commit bc67061160e5d57305a6e7fb63d491d12d0e9002. * updates * updates
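The time parser mentioned in #632 above (units {s, m, h, d} for 'nnictl update duration') could look roughly like this. It is a sketch, not the nnictl code: `parseDuration` is a hypothetical name, and requiring a unit suffix on every value is an assumption.

```typescript
// Seconds represented by each supported unit suffix.
const SECONDS_PER_UNIT: Record<string, number> = { s: 1, m: 60, h: 3600, d: 86400 };

// Convert strings such as "90s", "30m", "2h" or "1d" into seconds.
// ASSUMPTION: the value is a non-negative integer followed by exactly one unit letter.
function parseDuration(text: string): number {
    const match = /^(\d+)([smhd])$/.exec(text.trim());
    if (match === null) {
        throw new Error(`Invalid duration: ${text}`);
    }
    return Number(match[1]) * SECONDS_PER_UNIT[match[2]];
}
```

Rejecting unknown suffixes with an error keeps a typo such as "10x" from being read silently as seconds.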
-
- 04 Jan, 2019 1 commit
-
-
SparkSnail authored
Trial jobs could not be stopped on the remote machine when the experiment was stopped, because async/await does not work as expected inside forEach; see https://codeburst.io/javascript-async-await-with-foreach-b6ba62bbf404.
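The forEach pitfall behind this fix can be reproduced in a few lines. The sketch below is not the remote training service code: `cancelJob` is a hypothetical stand-in for stopping one trial job, and the asynchronous call is simulated with an already-resolved promise.

```typescript
// Hypothetical stand-in for stopping one trial job on a remote machine.
async function cancelJob(id: string, stopped: string[]): Promise<void> {
    await Promise.resolve();   // simulates an asynchronous SSH or REST call
    stopped.push(id);
}

// Broken pattern: forEach ignores the promises returned by the async callback,
// so this function returns before any job has actually been stopped.
async function stopAllWithForEach(ids: string[]): Promise<string[]> {
    const stopped: string[] = [];
    ids.forEach(async (id: string) => {
        await cancelJob(id, stopped);
    });
    return stopped.slice();    // snapshot taken at return time
}

// Fixed pattern: for...of awaits each cancellation before moving on.
async function stopAllSequentially(ids: string[]): Promise<string[]> {
    const stopped: string[] = [];
    for (const id of ids) {
        await cancelJob(id, stopped);
    }
    return stopped.slice();
}
```

If the cancellations are independent, `await Promise.all(ids.map(id => cancelJob(id, stopped)))` is another option that also waits for every job but runs them concurrently.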
-
- 29 Dec, 2018 1 commit
-
-
fishyds authored
* Removed unused log code, renamed some classes in nni sdk and trial_tools * Fix the regression bug that local/remote mode doesn't work
-
- 19 Dec, 2018 1 commit
-
-
fishyds authored
* Small refactor: remove useless INFO log, and print valid PAI token error message
-
- 17 Dec, 2018 1 commit
-
-
fishyds authored
* [PAI training service] codeDir files upload improvement * Create full local temp folder * Organize the folder structure for experiment and trial files
-
- 10 Dec, 2018 1 commit
-
-
SparkSnail authored
Quick fix for paiTrainingService: add deferred.resolve();
-
- 07 Dec, 2018 1 commit
-
-
SparkSnail authored
Update pai token every 2 hours.
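A periodic refresh like the one described above usually comes down to a timestamp check. The following is a minimal sketch under assumptions: `tokenNeedsRefresh` and the constant name are hypothetical, and how the token is actually fetched from PAI is left out.

```typescript
// Refresh interval stated in the commit message: 2 hours, in milliseconds.
const TOKEN_REFRESH_INTERVAL_MS: number = 2 * 60 * 60 * 1000;

// Decide whether the cached token is missing or older than the interval.
// lastUpdatedMs is undefined when no token has been fetched yet.
function tokenNeedsRefresh(lastUpdatedMs: number | undefined, nowMs: number): boolean {
    return lastUpdatedMs === undefined || nowMs - lastUpdatedMs >= TOKEN_REFRESH_INTERVAL_MS;
}
```

A caller would run this check before each REST request (or on a timer) and request a new token when it returns true.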
-
- 30 Nov, 2018 1 commit
-
-
Lijiao authored
* Support to show 2 logPath * fix lint * Update trial status color
-
- 29 Nov, 2018 1 commit
-
-
fishyds authored
* Add codeDir file count validation for setClusterConfig * fix a small bug if find command is not installed * Remove codeDir validation for local training service * Remove useless import
-
- 28 Nov, 2018 1 commit
-
-
fishyds authored
* [PAI training service] Support virtual cluster config * fix a small bug to convert virtualCluster to string
-
- 25 Nov, 2018 1 commit
-
-
QuanluZhang authored
* add one more trial job status, EARLY_STOPPED * fix datastore/nnimanager/mockeddatastore. test/webui/metrics_reader not done. USER_TO_CANCEL * fix bug * modifications based on Deshui's comments * fix bug * fix bug in remote mode
-
- 23 Nov, 2018 1 commit
-
-
SparkSnail authored
Add nniManagerIp to nnictl, pai TrainingService and kubeflow TrainingService. If users set nniManagerIp, pai and kubeflow will use this IP instead of calling the getIPV4() function. The Web UI will also use this nniManagerIp.
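The fallback described above (use the configured nniManagerIp if present, otherwise detect the address) can be sketched as below. `resolveManagerIp` is a hypothetical helper, and the detection function is passed in as a parameter so the sketch does not depend on the real getIPV4() implementation.

```typescript
// Prefer the user-configured IP; fall back to automatic detection otherwise.
// detectIp stands in for getIPV4() and is injected to keep the sketch self-contained.
function resolveManagerIp(configuredIp: string | undefined, detectIp: () => string): string {
    if (configuredIp !== undefined && configuredIp.length > 0) {
        return configuredIp;
    }
    return detectIp();
}

// Example addresses are made up for illustration.
const fromConfig: string = resolveManagerIp("10.1.2.3", () => "192.168.0.5");
const detected: string = resolveManagerIp(undefined, () => "192.168.0.5");
```

An explicit setting helps when the machine has several network interfaces and the automatically detected address is not reachable from the cluster.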
-
- 20 Nov, 2018 1 commit
-
-
fishyds authored
* Kubeflow TrainingService support, v1 (#373) 1. Create new Training Service: kubeflow training service, use 'kubectl' and kubeflow tfjobs CRD to submit and manage jobs 2. Update nni python SDK to support new kubeflow platform 3. Update nni python SDK's get_sequence_id() implementation, read NNI_TRIAL_SEQ_ID env variable, instead of reading .nni/sequence_id file 4. This version only supports the TensorFlow operator. Will add support for more operators in future versions
-
- 12 Nov, 2018 1 commit
-
-
fishyds authored
* Change base image from devel to runtime, to reduce docker image size * Support running multiple experiments for PAI * Fix a bug caused by a recursive reference between paiRestServer and paiTrainingService
-
- 05 Nov, 2018 2 commits
- 02 Nov, 2018 2 commits
- 31 Oct, 2018 3 commits
- 17 Oct, 2018 1 commit
-
-
fishyds authored
Fix paiTrainingService breakage caused by the OpenPAI API upgrade
-
- 12 Oct, 2018 2 commits
-
-
chicm-ms authored
* Pull latest code (#2) * webui logpath and document (#135) * Add webui document and logpath as a href * fix tslint * fix comments by Chengmin * Pai training service bug fix and enhancement (#136) * Add NNI installation scripts * Update pai script, update NNI_out_dir * Update NNI dir in nni sdk local.py * Create .nni folder in nni sdk local.py * Add check before creating .nni folder * Fix typo for PAI_INSTALL_NNI_SHELL_FORMAT * Improve annotation (#138) * Improve annotation * Minor bugfix * Selectively install through pip (#139) Selectively install through pip * update setup.py * fix paiTrainingService bugs (#137) * fix nnictl bug * add hdfs host validation * fix bugs * fix dockerfile * fix install.sh * update install.sh * fix dockerfile * Set timeout for HDFSUtility exists function * remove unused TODO * fix sdk * add optional for outputDir and dataDir * refactor dockerfile.base * Remove unused import in hdfsclientUtility * Add documentation for NNI PAI mode experiment (#141) * Add documentation for NNI PAI mode * Fix typo based on PR comments * Exit with subprocess return code of trial keeper * Remove additional exit code * Fix typo based on PR comments * update doc for smac tuner (#140) * Revert "Selectively install through pip (#139)" due to potential pip install issue (#142) * Revert "Selectively install through pip (#139)" This reverts commit 1d174836. 
* Add exit code of subprocess for trial_keeper * Update README, add link to PAImode doc * fix bug (#147) * Refactor nnictl and add config_pai.yml (#144) * fix nnictl bug * add hdfs host validation * fix bugs * fix dockerfile * fix install.sh * update install.sh * fix dockerfile * Set timeout for HDFSUtility exists function * remove unused TODO * fix sdk * add optional for outputDir and dataDir * refactor dockerfile.base * Remove unused import in hdfsclientUtility * add config_pai.yml * refactor nnictl create logic and add colorful print * fix nnictl stop logic * add annotation for config_pai.yml * add document for start experiment * fix config.yml * fix document * Fix trial keeper wrongly exit issue (#152) * Fix trial keeper bug, use actual exitcode to exit rather than 1 * Fix bug of table sort (#145) * Update doc for PAIMode and v0.2 release notes (#153) * Update v0.2 documentation regards to release note and PAI training service * Update document to describe NNI docker image * Bug fix for SQuAD example tuner. 
(#134) * Update Makefile (#151) * test * update setup.py * update Makefile and install.sh * rever setup.py * change color * update doc * update doc * fix auto-completion's extra space * update Makefile * update webui * Update doc image (#163) * update doc * trivial * trivial * trivial * trivial * trivial * trivial * update image * update image size * Update ga squad (#104) * update readme in ga_squad * update readme * fix typo * Update README.md * Update README.md * Update README.md * update readme * sklearn examples (#169) * fix nnictl bug * fix install.sh * add sklearn-regression example * add sklearn classification * update sklearn * update example * remove additional code * Update batch tuner (#158) * update readme in ga_squad * update readme * fix typo * Update README.md * Update README.md * Update README.md * update readme * update batch tuner * Quickly fix cascading search space bug in tuner (#156) * update readme in ga_squad * update readme * fix typo * Update README.md * Update README.md * Update README.md * update readme * quickly fix cascading searchspace bug in tuner * Add iterative search space example (#119) * update readme in ga_squad * update readme * fix typo * Update README.md * Update README.md * Update README.md * update readme * add iterative search space example * update * update readme * change name * Add api nni.get_sequence_id() * Add sequence_id to TrialJobDetail
-
fishyds authored
* fix parameter file name issue for multi-phase training * Updated based on comments
-
- 09 Oct, 2018 1 commit
-
-
fishyds authored
-
- 29 Sep, 2018 1 commit
-
-
fishyds authored
* webui logpath and document (#135) * Add webui document and logpath as a href * fix tslint * fix comments by Chengmin * Pai training service bug fix and enhancement (#136) * Add NNI installation scripts * Update pai script, update NNI_out_dir * Update NNI dir in nni sdk local.py * Create .nni folder in nni sdk local.py * Add check before creating .nni folder * Fix typo for PAI_INSTALL_NNI_SHELL_FORMAT * Improve annotation (#138) * Improve annotation * Minor bugfix * Selectively install through pip (#139) Selectively install through pip * update setup.py * fix paiTrainingService bugs (#137) * fix nnictl bug * add hdfs host validation * fix bugs * fix dockerfile * fix install.sh * update install.sh * fix dockerfile * Set timeout for HDFSUtility exists function * remove unused TODO * fix sdk * add optional for outputDir and dataDir * refactor dockerfile.base * Remove unused import in hdfsclientUtility * Add documentation for NNI PAI mode experiment (#141) * Add documentation for NNI PAI mode * Fix typo based on PR comments * Exit with subprocess return code of trial keeper * Remove additional exit code * Fix typo based on PR comments * update doc for smac tuner (#140) * Revert "Selectively install through pip (#139)" due to potential pip install issue (#142) * Revert "Selectively install through pip (#139)" This reverts commit 1d174836. * Add exit code of subprocess for trial_keeper * Update README, add link to PAImode doc
-