# Experiment config reference

A config file is required to create an experiment. The path of the config file is passed to `nnictl`.
The config file is written in YAML.
This document describes the rules for writing the config file and provides templates and examples.

- [Experiment config reference](#experiment-config-reference)
  - [Template](#template)
  - [Configuration spec](#configuration-spec)
  - [Examples](#examples)

## Template

* __Lightweight (without Annotation and Assessor)__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiPhase:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Assessor__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiPhase:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Annotation__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiPhase:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

## Configuration spec

* __authorName__
  * Description

    __authorName__ is the name of the author who creates the experiment.

    TBD: add default value

* __experimentName__
  * Description

    __experimentName__ is the name of the experiment.

    TBD: add default value

* __trialConcurrency__
  * Description

    __trialConcurrency__ specifies the maximum number of trial jobs that run simultaneously.

    Note: if the number of GPUs requested by a trial (`gpuNum`) exceeds the number of free GPUs, so that the number of simultaneously running trial jobs cannot reach __trialConcurrency__, the remaining trial jobs are put into a queue to wait for GPU allocation.

* __maxExecDuration__
  * Description

    __maxExecDuration__ specifies the maximum duration of an experiment. The time unit is one of {__s__, __m__, __h__, __d__}, meaning {_seconds_, _minutes_, _hours_, _days_}.

    Note: __maxExecDuration__ limits the duration of the experiment, not of a single trial job. When the experiment reaches the maximum duration, it does not stop, but it can no longer submit new trial jobs.
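
    As an illustrative sketch (the value is an assumption, not a recommended default), the following caps an experiment at 90 minutes:

    ```yaml
    # Hypothetical value: no new trial jobs are submitted after 90 minutes.
    # The suffix selects the unit: s, m, h or d.
    maxExecDuration: 90m
    ```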

* __versionCheck__
  * Description

    NNI checks the version of the nniManager process against the version of trialKeeper on the remote, pai and kubernetes platforms. To disable the version check, set __versionCheck__ to false.

* __debug__
  * Description

    Debug mode sets __versionCheck__ to false and __logLevel__ to `debug`.

* __maxTrialNum__
  * Description

    __maxTrialNum__ specifies the maximum number of trial jobs created by NNI, including succeeded and failed jobs.

* __trainingServicePlatform__
  * Description

    __trainingServicePlatform__ specifies the platform on which the experiment runs, one of {__local__, __remote__, __pai__, __kubeflow__}.

    * __local__ runs the experiment on the local Ubuntu machine.

    * __remote__ submits trial jobs to remote Ubuntu machines. The __machineList__ field must be filled in to set up the SSH connections to the remote machines.

    * __pai__ submits trial jobs to Microsoft's [OpenPAI](https://github.com/Microsoft/pai). For details of the pai configuration, see [PAIModeDoc](../TrainingService/PaiMode.md).

    * __kubeflow__ submits trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/). NNI supports kubeflow on standard Kubernetes and on [Azure Kubernetes Service](https://azure.microsoft.com/en-us/services/kubernetes-service/). For details, see [KubeflowDoc](../TrainingService/KubeflowMode.md).

* __searchSpacePath__
  * Description

    __searchSpacePath__ specifies the path of the search space file, which must be a valid path on the local Linux machine.

    Note: if __useAnnotation__ is set to true, the __searchSpacePath__ field should be removed.

* __useAnnotation__
  * Description

    __useAnnotation__ enables annotation, which analyzes the trial code and generates the search space from it.

    Note: if __useAnnotation__ is set to true, the __searchSpacePath__ field should be removed.
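
    The two ways of supplying a search space might look as follows. This is a minimal sketch; the path is a placeholder borrowed from the Examples section of this document.

    ```yaml
    # Variant A: search space read from a file, annotation disabled.
    searchSpacePath: /nni/search_space.json
    useAnnotation: false
    ```

    ```yaml
    # Variant B: search space generated from annotated trial code.
    # searchSpacePath is omitted entirely.
    useAnnotation: true
    ```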

* __multiPhase__
  * Description

    __multiPhase__ enables [multi-phase experiments](../AdvancedFeature/MultiPhase.md).

* __multiThread__
  * Description

    __multiThread__ enables multi-thread mode for the dispatcher. If __multiThread__ is set to `true`, the dispatcher starts a thread to process each command from NNI Manager.

* __nniManagerIp__
  * Description

    __nniManagerIp__ sets the IP address of the machine on which the NNI manager process runs. This field is optional; if it is not set, the IP address of the eth0 device is used instead.

    Note: run `ifconfig` on the NNI manager's machine to check whether the eth0 device exists. If it does not, we recommend setting __nniManagerIp__ explicitly.

* __logDir__
  * Description

    __logDir__ configures the directory in which the logs and data of the experiment are stored. The default value is `<user home directory>/nni/experiment`.

* __logLevel__
  * Description

    __logLevel__ sets the log level of the experiment. The available log levels are `trace, debug, info, warning, error, fatal`. The default value is `info`.

* __logCollection__
  * Description

    __logCollection__ sets how logs are collected on the remote, pai, kubeflow and frameworkcontroller platforms. There are two options. With `http`, the trial keeper posts log content back through HTTP requests, which may slow down log processing in trialKeeper. With `none`, the trial keeper does not post log content back and only posts job metrics. If your log content is very large, consider setting this parameter to `none`.
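
    For example, a remote experiment with very verbose trial output might disable log forwarding. A sketch, assuming the rest of the config is already in place:

    ```yaml
    trainingServicePlatform: remote
    # Only job metrics are posted back; trial log content is not.
    logCollection: none
    ```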

* __tuner__
  * Description

    __tuner__ specifies the tuner algorithm of the experiment. There are two ways to set the tuner. One is to use a tuner provided by the NNI SDK, which requires setting __builtinTunerName__ and __classArgs__. The other is to use your own tuner file, which requires setting __codeDir__, __classFileName__, __className__ and __classArgs__.
  * __builtinTunerName__ and __classArgs__
    * __builtinTunerName__

      __builtinTunerName__ specifies the name of a built-in tuner. The NNI SDK provides several tuners, introduced [here](../Tuner/BuiltinTuner.md).

    * __classArgs__

      __classArgs__ specifies the arguments of the tuner algorithm. Please refer to [this file](../Tuner/BuiltinTuner.md) for the configurable arguments of each built-in tuner.
  * __codeDir__, __classFileName__, __className__ and __classArgs__
    * __codeDir__

      __codeDir__ specifies the directory of the tuner code.
    * __classFileName__

      __classFileName__ specifies the name of the tuner file.
    * __className__

      __className__ specifies the name of the tuner class.
    * __classArgs__

      __classArgs__ specifies the arguments of the tuner algorithm.

  * __gpuIndices__

    __gpuIndices__ specifies the GPUs that can be used by the tuner process. Single or multiple GPU indices can be specified; multiple GPU indices are separated by commas, such as `1` or `0,1,3`. If the field is not set, `CUDA_VISIBLE_DEVICES` will be '' in the script, that is, no GPU is visible to the tuner.

  * __includeIntermediateResults__

    If __includeIntermediateResults__ is true, the last intermediate result of a trial that is early stopped by the assessor is sent to the tuner as its final result. The default value of __includeIntermediateResults__ is false.

  Note: only one way of specifying the tuner can be used: either `builtinTunerName` and `classArgs`, or `codeDir`, `classFileName`, `className` and `classArgs`.
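
  As a sketch of the optional tuner fields, the following built-in tuner configuration makes GPUs 0 and 1 visible to the tuner process and forwards the last intermediate result of early-stopped trials. The specific values are illustrative assumptions.

  ```yaml
  tuner:
    builtinTunerName: TPE
    classArgs:
      optimize_mode: maximize
    # Comma-separated GPU indices visible to the tuner process.
    # Quoted here so that the value is read as a string.
    gpuIndices: "0,1"
    # Send the last intermediate result of an early-stopped trial
    # to the tuner as that trial's final result.
    includeIntermediateResults: true
  ```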

* __assessor__

  * Description

    __assessor__ specifies the assessor algorithm of the experiment. There are two ways to set the assessor. One is to use an assessor provided by the NNI SDK, which requires setting __builtinAssessorName__ and __classArgs__. The other is to use your own assessor file, which requires setting __codeDir__, __classFileName__, __className__ and __classArgs__.
  * __builtinAssessorName__ and __classArgs__
    * __builtinAssessorName__

      __builtinAssessorName__ specifies the name of a built-in assessor. The NNI SDK provides several assessors, introduced [here](../Assessor/BuiltinAssessor.md).

    * __classArgs__

      __classArgs__ specifies the arguments of the assessor algorithm.

  * __codeDir__, __classFileName__, __className__ and __classArgs__

    * __codeDir__

      __codeDir__ specifies the directory of the assessor code.

    * __classFileName__

      __classFileName__ specifies the name of the assessor file.

    * __className__

      __className__ specifies the name of the assessor class.

    * __classArgs__

      __classArgs__ specifies the arguments of the assessor algorithm.

  Note: only one way of specifying the assessor can be used: either `builtinAssessorName` and `classArgs`, or `codeDir`, `classFileName`, `className` and `classArgs`. If you do not want to use an assessor, leave the assessor field empty.

* __advisor__
  * Description

    __advisor__ specifies the advisor algorithm of the experiment. There are two ways to set the advisor. One is to use an advisor provided by the NNI SDK, which requires setting __builtinAdvisorName__ and __classArgs__. The other is to use your own advisor file, which requires setting __codeDir__, __classFileName__, __className__ and __classArgs__.
  * __builtinAdvisorName__ and __classArgs__
    * __builtinAdvisorName__

      __builtinAdvisorName__ specifies the name of a built-in advisor. The NNI SDK provides [different advisors](../Tuner/BuiltinTuner.md).

    * __classArgs__

      __classArgs__ specifies the arguments of the advisor algorithm. Please refer to [this file](../Tuner/BuiltinTuner.md) for the configurable arguments of each built-in advisor.
  * __codeDir__, __classFileName__, __className__ and __classArgs__
    * __codeDir__

      __codeDir__ specifies the directory of the advisor code.
    * __classFileName__

      __classFileName__ specifies the name of the advisor file.
    * __className__

      __className__ specifies the name of the advisor class.
    * __classArgs__

      __classArgs__ specifies the arguments of the advisor algorithm.

  * __gpuIndices__

    __gpuIndices__ specifies the GPUs that can be used by the advisor process. Single or multiple GPU indices can be specified; multiple GPU indices are separated by commas, such as `1` or `0,1,3`. If the field is not set, `CUDA_VISIBLE_DEVICES` will be '' in the script, that is, no GPU is visible to the advisor.

  Note: only one way of specifying the advisor can be used: either `builtinAdvisorName` and `classArgs`, or `codeDir`, `classFileName`, `className` and `classArgs`.
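
  A minimal sketch of a built-in advisor configuration follows. The advisor name `Hyperband` and its `R` and `eta` arguments are assumptions that do not appear in this document; check the built-in advisor documentation linked above for the names and arguments your NNI version supports.

  ```yaml
  advisor:
    # Assumed built-in advisor name.
    builtinAdvisorName: Hyperband
    classArgs:
      optimize_mode: maximize
      # Assumed Hyperband arguments: maximum budget per trial
      # and the elimination ratio between rounds.
      R: 60
      eta: 3
  ```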

* __trial(local, remote)__

  * __command__

    __command__ specifies the command to run the trial process.

  * __codeDir__

    __codeDir__ specifies the directory of your own trial file.

  * __gpuNum__

    __gpuNum__ specifies the number of GPUs used to run the trial process. The default value is 0.

* __trial(pai)__

  * __command__

    __command__ specifies the command to run the trial process.

  * __codeDir__

    __codeDir__ specifies the directory of your own trial file.

  * __gpuNum__

    __gpuNum__ specifies the number of GPUs used to run the trial process. The default value is 0.

  * __cpuNum__

    __cpuNum__ specifies the number of CPUs used in the pai container.

  * __memoryMB__

    __memoryMB__ sets the memory size used in the pai container.

  * __image__

    __image__ sets the image used in pai.
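
  Put together, a pai trial section might look like the following sketch. The resource values and the image name are placeholders, not recommendations.

  ```yaml
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 1
    cpuNum: 2
    memoryMB: 8192
    # Placeholder image name; use an image available to your pai cluster.
    image: example/nni-trial:latest
  ```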

* __trial(kubeflow)__

  * __codeDir__

    __codeDir__ is the local directory that contains the code files.

  * __ps(optional)__

    __ps__ is the configuration for kubeflow's tensorflow-operator.

    * __replicas__

      __replicas__ is the number of replicas of the __ps__ role.

    * __command__

      __command__ is the script to run in the __ps__ container.

    * __gpuNum__

      __gpuNum__ sets the number of GPUs used in the __ps__ container.

    * __cpuNum__

      __cpuNum__ sets the number of CPUs used in the __ps__ container.

    * __memoryMB__

      __memoryMB__ sets the memory size of the container.

    * __image__

      __image__ sets the image used in __ps__.

  * __worker__

    __worker__ is the configuration for kubeflow's tensorflow-operator.

    * __replicas__

      __replicas__ is the number of replicas of the __worker__ role.

    * __command__

      __command__ is the script to run in the __worker__ container.

    * __gpuNum__

      __gpuNum__ sets the number of GPUs used in the __worker__ container.

    * __cpuNum__

      __cpuNum__ sets the number of CPUs used in the __worker__ container.

    * __memoryMB__

      __memoryMB__ sets the memory size of the container.

    * __image__

      __image__ sets the image used in __worker__.
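
  A kubeflow trial section with a single worker and no parameter server might be sketched as follows. The resource values and the image name are placeholders; an optional `ps` section would take the same fields as `worker`.

  ```yaml
  trial:
    codeDir: /nni/mnist
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      # Placeholder image name; use an image your cluster can pull.
      image: example/nni-trial:latest
  ```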

* __localConfig__

  __localConfig__ is applicable only if __trainingServicePlatform__ is set to `local`; otherwise the configuration file should not contain a __localConfig__ section.

  * __gpuIndices__

    __gpuIndices__ specifies the designated GPU devices for NNI. If it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified; multiple GPU indices are separated by commas, such as `1` or `0,1,3`.

  * __maxTrialNumPerGpu__

    __maxTrialNumPerGpu__ specifies the maximum number of trials that can run concurrently on one GPU device.

  * __useActiveGpu__

    __useActiveGpu__ specifies whether to use a GPU that is already occupied by another process. By default, NNI uses a GPU only if no other active process is running on it. If __useActiveGpu__ is set to true, NNI uses the GPU regardless of other processes. This field is not applicable to NNI on Windows.
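
  For instance, a local experiment restricted to two GPUs, with up to two trials sharing each GPU, could be sketched as below. All values are illustrative assumptions.

  ```yaml
  trainingServicePlatform: local
  localConfig:
    # Only GPUs 0 and 1 are used for trial jobs.
    gpuIndices: "0,1"
    # At most two trials run concurrently on each GPU.
    maxTrialNumPerGpu: 2
    # Use a GPU even if another process is already active on it.
    useActiveGpu: true
  ```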
  

* __machineList__

  __machineList__ must be set if __trainingServicePlatform__ is set to remote; otherwise it should be empty.

  * __ip__

    __ip__ is the IP address of the remote machine.

  * __port__

    __port__ is the SSH port used to connect to the machine.

    Note: if the port is left empty, the default value is 22.

  * __username__

    __username__ is the account name on the remote machine.

  * __passwd__

    __passwd__ specifies the password of the account.

  * __sshKeyPath__

    If you log in to the remote machine with an SSH key, set __sshKeyPath__ in the config file. __sshKeyPath__ is the path of the SSH key file, which must be valid.

    Note: if both passwd and sshKeyPath are set, NNI will try passwd.

  * __passphrase__

    __passphrase__ protects the SSH key. It can be left empty if the key has no passphrase.

  * __gpuIndices__

    __gpuIndices__ specifies the designated GPU devices for NNI on this remote machine. If it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified; multiple GPU indices are separated by commas, such as `1` or `0,1,3`.

  * __maxTrialNumPerGpu__

    __maxTrialNumPerGpu__ specifies the maximum number of trials that can run concurrently on one GPU device.

  * __useActiveGpu__

    __useActiveGpu__ specifies whether to use a GPU that is already occupied by another process. By default, NNI uses a GPU only if no other active process is running on it. If __useActiveGpu__ is set to true, NNI uses the GPU regardless of other processes. This field is not applicable to NNI on Windows.
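
  The per-machine GPU fields can be combined with key-based login, as in the following sketch. The address, account, key path and GPU values are placeholders.

  ```yaml
  machineList:
    - ip: 10.10.10.10
      port: 22
      username: test
      # Key-based login; passwd is omitted so that the key is used.
      sshKeyPath: /nni/sshkey
      # Only GPUs 0 and 1 on this machine are used for trial jobs.
      gpuIndices: "0,1"
      maxTrialNumPerGpu: 2
      useActiveGpu: false
  ```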

* __kubeflowConfig__

  * __operator__

    __operator__ specifies the kubeflow operator to use. NNI supports __tf-operator__ in the current version.

  * __storage__

    __storage__ specifies the storage type of kubeflow, one of {__nfs__, __azureStorage__}. This field is optional, and the default value is __nfs__. If the config uses azureStorage, this field must be set.

  * __nfs__

    * __server__

      __server__ is the host of the NFS server.

    * __path__

      __path__ is the mounted path of the NFS share.

  * __keyVault__

    To use Azure Kubernetes Service, set __keyVault__ to the key vault that stores the private key of your Azure storage account. See https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2

    * __vaultName__

      __vaultName__ is the value of `--vault-name` used in the az command.

    * __name__

      __name__ is the value of `--name` used in the az command.

  * __azureStorage__

    To use Azure Kubernetes Service, set an Azure storage account in which the code files are stored.

    * __accountName__

      __accountName__ is the name of the Azure storage account.

    * __azureShare__

      __azureShare__ is the share of the Azure file storage.

  * __uploadRetryCount__

    If uploading files to Azure storage fails, NNI retries the upload. This field specifies the number of attempts to re-upload the files.
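
  The two storage options might be sketched as follows, using only the fields described above. The server address, path, vault, key, account and share names are hypothetical, and a real deployment may require additional fields that this reference does not list.

  ```yaml
  # Variant A: NFS storage (the default storage type).
  kubeflowConfig:
    operator: tf-operator
    storage: nfs
    nfs:
      server: 10.10.10.10
      path: /var/nfs/general
  ```

  ```yaml
  # Variant B: Azure storage; storage must be set explicitly.
  kubeflowConfig:
    operator: tf-operator
    storage: azureStorage
    keyVault:
      vaultName: my-vault        # hypothetical value of --vault-name
      name: my-storage-key       # hypothetical value of --name
    azureStorage:
      accountName: mystorageaccount
      azureShare: myshare
    # Retry a failed upload up to three times.
    uploadRetryCount: 3
  ```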

* __paiConfig__

  * __userName__

    __userName__ is the user name of your pai account.

  * __password__

    __password__ is the password of the pai account.

  * __host__

    __host__ is the host of pai.
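
  A pai platform section might then be sketched as below. The credentials and host are placeholders, and the key names follow the spelling used in this reference; the exact spelling accepted may differ between NNI versions.

  ```yaml
  trainingServicePlatform: pai
  paiConfig:
    userName: your_pai_user        # placeholder
    password: your_pai_password    # placeholder
    host: 10.10.10.10              # placeholder address of the pai cluster
  ```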

## Examples

* __local mode__

  To run trial jobs on the local machine and use annotation to generate the search space, use the following config:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  #choice: true, false
  useAnnotation: true
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

  You can also add an assessor configuration. The following example reads the search space from a file instead of using annotation:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    #choice: Medianstop
    builtinAssessorName: Medianstop
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

  Or you can specify your own tuner and assessor files as follows:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    codeDir: /nni/tuner
    classFileName: mytuner.py
    className: MyTuner
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    codeDir: /nni/assessor
    classFileName: myassessor.py
    className: MyAssessor
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

* __remote mode__

  To run trial jobs on remote machines, specify the remote machine information in the following format:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: remote
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  #machineList can be empty if the platform is local
  machineList:
    - ip: 10.10.10.10
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.11
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.12
      port: 22
      username: test
      sshKeyPath: /nni/sshkey
      passphrase: qwert
  ```

* __pai mode__

  ```yaml
  authorName: test
  experimentName: nni_test1
  trialConcurrency: 1
  maxExecDuration: 500h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: pai
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution, BatchTuner
    #SMAC (SMAC should be installed through nnictl)
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 main.py
    codeDir: .
    gpuNum: 4
    cpuNum: 2
    memoryMB: 10000
    #The docker image to run NNI job on pai
    image: msranni/nni:latest
  paiConfig:
    #The username to login pai
    userName: test
    #The password to login pai
    passWord: test
    #The host of restful server of pai
    host: 10.10.10.10
  ```

* __kubeflow mode__

  Kubeflow with NFS storage:

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    nfs:
      server: 10.10.10.10
      path: /var/nfs/general
  ```

  Kubeflow with Azure storage:

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  #nniManagerIp: 10.10.10.10
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    builtinAssessorName: Medianstop
    classArgs:
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 4096
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    keyVault:
      vaultName: Contoso-Vault
      name: AzureStorageAccountKey
    azureStorage:
      accountName: storage
      azureShare: share01
  ```
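
The examples above differ mainly in which platform-specific section accompanies `trainingServicePlatform`. The following Python sketch summarizes that pattern as a quick pre-flight check on a config that has already been parsed into a dictionary. The `REQUIRED_SECTION` mapping and the `missing_sections` helper are hypothetical names introduced here for illustration and are inferred only from the examples in this section. This is not NNI's own validation logic, and `nnictl` may accept or reject configs that this sketch does not.

```python
# Hypothetical mapping inferred from the examples in this section;
# it is an assumption, not part of NNI.
REQUIRED_SECTION = {
    "local": None,
    "remote": "machineList",
    "pai": "paiConfig",
    "kubeflow": "kubeflowConfig",
}


def missing_sections(config):
    """Return top-level keys that the examples include but `config` lacks."""
    missing = [key for key in ("tuner", "trial") if key not in config]
    platform = config.get("trainingServicePlatform")
    if platform not in REQUIRED_SECTION:
        # Unknown or absent platform: nothing further can be checked.
        return missing + ["trainingServicePlatform"]
    section = REQUIRED_SECTION[platform]
    if section is not None and section not in config:
        missing.append(section)
    # The examples either enable annotation or point to a search space file.
    if not config.get("useAnnotation", False) and "searchSpacePath" not in config:
        missing.append("searchSpacePath")
    return missing


# A remote-mode config, as a dict, that forgot its machineList.
remote_config = {
    "trainingServicePlatform": "remote",
    "searchSpacePath": "/nni/search_space.json",
    "useAnnotation": False,
    "tuner": {"builtinTunerName": "TPE"},
    "trial": {"command": "python3 mnist.py", "codeDir": "/nni/mnist", "gpuNum": 0},
}
print(missing_sections(remote_config))  # ['machineList']
```

An empty list means the config has the same top-level shape as the matching example above. It says nothing about whether the values inside each section are valid.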