Commits · ebca3cec44e32527a5df6f351ea9c540537610b5 · OpenDAS / nni

14 Sep, 2020 1 commit
- support annotation in python 3.8 (#2881) · ebca3cec
  J-shang authored Sep 14, 2020
```
Co-authored-by: Ning Shang <nishang@microsoft.com>
```
  ebca3cec
07 Sep, 2020 1 commit
- shut validator ipc warning (#2864) · 42879806
  liuzhe-lz authored Sep 07, 2020
  
  42879806
19 Aug, 2020 1 commit
- fix nnictl experiment delete (delete log folder in ~/.local/nnictl) (#2799) · 593d2d20
  Junwei Sun authored Aug 19, 2020
  
  593d2d20
14 Aug, 2020 1 commit
- Fix nnictl experiment delete (#2791) · bbb9137c
  SparkSnail authored Aug 14, 2020
  
  bbb9137c
13 Aug, 2020 2 commits
- Update python version requirements to 3.6 (#2790) · 07cb73b1
  chicm-ms authored Aug 13, 2020
  
  07cb73b1
- update experiment dir (#2753) · 6ec2adee
  Junwei Sun authored Aug 13, 2020
  
  6ec2adee
12 Aug, 2020 4 commits
- update nnicli (#2713) · f82ef623
  Junwei Sun authored Aug 12, 2020
  
  f82ef623
- add nnictl command to list trial results with highest/lowest metric (#2747) · 44954e0c
  Tab Zhang authored Aug 12, 2020
  
  44954e0c
- feature: export experiment results (#2706) · d654eff4
  Tab Zhang authored Aug 12, 2020
  
  d654eff4
- Support save and open experiments (#2750) · d5072a29
  SparkSnail authored Aug 12, 2020
  
  d5072a29
11 Aug, 2020 1 commit
- Enable gpu scheduler in AML mode (#2769) · 995f6259
  SparkSnail authored Aug 11, 2020
  
  995f6259
30 Jul, 2020 1 commit
- Reusable environment support GPU scheduler, add test cases and refactoring. (#2627) · 143c6615
  Chi Song authored Jul 30, 2020
  
  143c6615
24 Jul, 2020 1 commit
- Add timeout for web_channel in trial_runner (#2710) · 54fef7fa
  SparkSnail authored Jul 24, 2020
  
  54fef7fa
01 Jul, 2020 1 commit
- Support aml (#2615) · 93f96d4f
  SparkSnail authored Jul 01, 2020
  
  93f96d4f
30 Jun, 2020 1 commit

Reuse OpenPAI jobs to run multiple trials (#2521) · 0b9d6ce6

Chi Song authored Jun 30, 2020

Designed new interface to support reusable training service, currently only applies to OpenPAI, and default disabled.

Replace trial_keeper.py to trial_runner.py, trial_runner holds an environment, and receives commands from nni manager to run or stop an trial, and return events to nni manager.
Add trial dispatcher, which inherits from original trianing service interface. It uses to share as many as possible code of all training service, and isolate with training services.
Add EnvironmentService interface to manage environment, including start/stop an environment, refresh status of environments.
Add command channel on both nni manager and trial runner parts, it supports different ways to pass messages between them. Current supported channels are file, web sockets. and supported commands from nni manager are start, kill trial, send new parameters; from runner are initialized(support some channel doesn't know which runner connected), trial end, stdout ((new type), including metric like before), version check (new type), gpu info (new type).
Add storage service to wrapper a storage to standard file operations, like NFS, azure storage and so on.
Partial support run multiple trials in parallel on runner side, but not supported by trial dispatcher side.
Other minor changes,

Add log_level to TS UT, so that UT can show debug level log.
Expose platform to start info.
Add RouterTrainingService to keep origianl OpenPAI training service, and support dynamic IOC binding.
Add more GPU info for future usage, including GPU mem total/free/used, gpu type.
Make some license information consistence.
Fix async/await problems on Array.forEach, this method doesn't support async actually.
Fix IT errors on download data, which causes by my #2484 .
Accelerate some run loop pattern by reducing sleep seconds.

0b9d6ce6

29 Jun, 2020 1 commit
- Fix pai pipeline (#2607) · 92b5fa14
  SparkSnail authored Jun 29, 2020
  
  92b5fa14
23 Jun, 2020 1 commit
- Install builtin tuners (#2439) · a5764016
  chicm-ms authored Jun 23, 2020
  
  a5764016
22 Jun, 2020 1 commit
- Check eth0 in nnictl (#2566) · 76586fca
  SparkSnail authored Jun 22, 2020
  
  76586fca
12 Jun, 2020 1 commit
- Support paiStorageConfigName (#2536) · 8a60d624
  SparkSnail authored Jun 12, 2020
  
  8a60d624
05 Jun, 2020 1 commit

remove optimize_mode from curve fitting (#2471) · e75a9f5a

Chi Song authored Jun 05, 2020

others

1. fix failed curve fitting UTs, due to code changes.
1. move all SDK UTs to tests folder, so that they can be run in default tests.
1. fix some deprecated ut assert function calls.

e75a9f5a

25 May, 2020 1 commit
- Improve stablability of remote training service. (#2474) · be09f11c
  Chi Song authored May 25, 2020
  
  be09f11c
19 May, 2020 1 commit
- Support Windows as remote node. (#2431) · 69cae211
  Chi Song authored May 20, 2020
  
  69cae211
12 May, 2020 1 commit
- Fix it local windows (#2421) · 8238ddc6
  Yan Ni authored May 12, 2020
  
  8238ddc6
06 May, 2020 1 commit
- Update pai yaml merge method (#2369) · ee25377d
  SparkSnail authored May 06, 2020
  
  ee25377d
30 Apr, 2020 1 commit
- fix trial export (#2303) · 62d74565
  Yan Ni authored Apr 30, 2020
  
  62d74565
26 Apr, 2020 1 commit

fix #1578 and some improvements (#2370) · 1c6f1efa

Chi Song authored Apr 26, 2020

Add shell support for ssh connection, so that remote script can be started with user environment.

Minor fixes,

1. Fix gpu_metrics_collector to support pyenv. As pyenv will create one more process, so that original pgrep code always got extra processes, and cannot start gpu_metrics_collector.
2. Fix NASUI failure on dev-install-node-modules, to create subfolder every time.
3. Fix MakeFile to reduce mis-created links, and other minor issues.
4. Add node --watch for nni_manager for better dev experience.

1c6f1efa

30 Mar, 2020 1 commit
- add PBT tuner (#2139) · a82b4a3b
  RayMeng8 authored Mar 30, 2020
  
  a82b4a3b
27 Mar, 2020 1 commit
- Support tensorboard 2.x command (#2242) · 9efce4fe
  SparkSnail authored Mar 27, 2020
  
  9efce4fe
25 Mar, 2020 1 commit

Support nas installation (#2217) · 39f211c1

SparkSnail authored Mar 25, 2020



* check pylint for nni_cmd

* fix id error

* init

* remove comments
Co-authored-by: SparkSnail <Administrator@MININT-27KBA7M.fareast.corp.microsoft.com>

39f211c1

18 Mar, 2020 1 commit
- Fix trialkeeper flush (#2174) · 4a07f9ed
  SparkSnail authored Mar 18, 2020
  
  4a07f9ed
17 Mar, 2020 1 commit
- Add flush in trial_keeper log (#2156) · 3c0ef842
  SparkSnail authored Mar 17, 2020
  
  3c0ef842
05 Mar, 2020 1 commit
- Increase the default heapsize of node to 4GB (#2125) · 31afa426
  chicm-ms authored Mar 05, 2020
  
  31afa426
03 Mar, 2020 1 commit
- NAS visualization (#2085) · 9987014e
  Yuge Zhang authored Mar 03, 2020
  
  9987014e
02 Mar, 2020 1 commit

DLTS integration (#1945) · 134368fa

George Cheng authored Mar 02, 2020



* skeleton of dlts training service (#1844)

* Hello, DLTS!

* Revert version

* Remove fs-extra

* Add some default cluster config

* schema

* fix

* Optional cluster (default to `.default`)

Depends on DLWorkspace#837

* fix

* fix

* optimize gpu type

* No more copy

* Format

* Code clean up

* Issue fix

* Add optional fields in config

* Issue fix

* Lint

* Lint

* Validate email, password and team

* Doc

* Doc fix

* Set TMPDIR

* Use metadata instead of gpu_capacity

* Cancel paused DLTS job

* workaround lint rules

* pylint

* doc
Co-authored-by: QuanluZhang <z.quanluzhang@gmail.com>

134368fa

27 Feb, 2020 1 commit
- Handle exception and kill restserver in nnictl (#2086) · ba5d18c4
  SparkSnail authored Feb 27, 2020
  
  ba5d18c4
14 Feb, 2020 1 commit
- Fix nnictl view command (#2044) · b4ab371b
  SparkSnail authored Feb 14, 2020
  
  b4ab371b
09 Feb, 2020 1 commit
- merge from master (#2019) · e9f137f0
  QuanluZhang authored Feb 09, 2020
  
  e9f137f0
07 Feb, 2020 2 commits
- Add foreground mode in nnictl (#1956) · d7920fd2
  SparkSnail authored Feb 07, 2020
  
  d7920fd2
- Merge pai config (#1965) · 26aa1136
  SparkSnail authored Feb 07, 2020
  
  26aa1136
04 Feb, 2020 1 commit
- Change validation order in machineList (#1966) · 1c54b40c
  SparkSnail authored Feb 04, 2020
  
  1c54b40c