# Experiment config reference

A config file is required to create an experiment; pass the path of the config file to nnictl.
The config file is written in YAML and must be well-formed.
This document describes the rules for writing a config file and provides some examples and templates.

- [Experiment config reference](#Experiment-config-reference)
  - [Template](#Template)
  - [Configuration spec](#Configuration-spec)
  - [Examples](#Examples)

<a name="Template"></a>
## Template

* __Lightweight (without Annotation and Assessor)__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false
useAnnotation:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Assessor__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false
useAnnotation:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Annotation__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
#choice: true, false
useAnnotation:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

<a name="Configuration"></a>
## Configuration spec

* __authorName__
  * Description

    __authorName__ is the name of the author who creates the experiment.

    TBD: add default value

* __experimentName__
  * Description

    __experimentName__ is the name of the experiment to create.

    TBD: add default value

* __trialConcurrency__
  * Description

    __trialConcurrency__ specifies the maximum number of trial jobs that run simultaneously.

    Note: if trialGpuNum is larger than the number of free GPUs, so that the number of simultaneously running trial jobs cannot reach trialConcurrency, some trial jobs are put into a queue to wait for GPU allocation.

* __maxExecDuration__
  * Description

    __maxExecDuration__ specifies the maximum duration of an experiment. The unit of time is one of {__s__, __m__, __h__, __d__}, meaning {_seconds_, _minutes_, _hours_, _days_}.

    Note: maxExecDuration limits the duration of the experiment, not of a single trial job. When the experiment reaches the maximum duration it does not stop, but it can no longer submit new trial jobs.
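
  As a small illustration of the duration format described above, the fragment below limits the experiment to 90 minutes. The value is arbitrary, and the commented lines assume the other units are written the same way (a number followed by one unit letter).

  ```yaml
  # Illustrative value only: a number followed by one unit (s, m, h or d)
  maxExecDuration: 90m
  # maxExecDuration: 3600s   # seconds
  # maxExecDuration: 12h     # hours
  # maxExecDuration: 2d      # days
  ```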

* __versionCheck__
  * Description

    NNI checks the version of the nniManager process against the version of trialKeeper on the remote, pai and kubernetes platforms. To disable the version check, set versionCheck to false.

* __debug__
  * Description

    Debug mode sets versionCheck to false and logLevel to `debug`.
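
  A minimal sketch of these two switches, assuming both are written as top-level fields like the other entries in this list. Setting both is redundant, because debug mode already disables the version check.

  ```yaml
  # Turn on debug mode (also sets versionCheck to false and logLevel to debug)
  debug: true
  # Alternatively, disable only the version check and leave logLevel unchanged
  # versionCheck: false
  ```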

* __maxTrialNum__
  * Description

    __maxTrialNum__ specifies the maximum number of trial jobs created by NNI, including both succeeded and failed jobs.

* __trainingServicePlatform__
  * Description

    __trainingServicePlatform__ specifies the platform on which to run the experiment, one of {__local__, __remote__, __pai__, __kubeflow__}.

    * __local__ runs the experiment on the local Ubuntu machine.

    * __remote__ submits trial jobs to remote Ubuntu machines; the __machineList__ field must be filled in so that NNI can set up SSH connections to the remote machines.

    * __pai__ submits trial jobs to Microsoft's [OpenPAI](https://github.com/Microsoft/pai). For details of the pai configuration, see [PaiModeDoc](./PaiMode.md).

    * __kubeflow__ submits trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/). NNI supports kubeflow based on normal kubernetes and on [azure kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/).

* __searchSpacePath__
  * Description

    __searchSpacePath__ specifies the path of the search space file, which must be a valid path on the local Linux machine.

    Note: if useAnnotation is set to true, the searchSpacePath field should be removed.

* __useAnnotation__
  * Description

    __useAnnotation__ specifies whether to use annotation to analyze the trial code and generate the search space.

    Note: if useAnnotation is set to true, the searchSpacePath field should be removed.

* __nniManagerIp__
  * Description

    __nniManagerIp__ sets the IP address of the machine on which the NNI manager process runs. This field is optional; if it is not set, the IP of the eth0 device is used instead.

    Note: run ifconfig on the NNI manager's machine to check whether the eth0 device exists. If it does not, we recommend setting nniManagerIp explicitly.

* __logDir__
  * Description

    __logDir__ configures the directory in which the logs and data of the experiment are stored. The default value is `<user home directory>/nni/experiment`.

* __logLevel__
  * Description

    __logLevel__ sets the log level for the experiment. The available log levels are `trace`, `debug`, `info`, `warning`, `error` and `fatal`. The default value is `info`.

* __logCollection__
  * Description

    __logCollection__ sets how logs are collected on the remote, pai, kubeflow and frameworkcontroller platforms. There are two options. With `http`, trial keeper posts log content back through HTTP requests, which may slow down log processing in trialKeeper. With `none`, trial keeper does not post log content back and only posts job metrics. If your log content is very large, consider setting this parameter to `none`.
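
  The three logging fields above could be combined as in the following sketch. The values are illustrative only, and `/tmp/nni-logs` is a hypothetical path chosen for the example.

  ```yaml
  # Store experiment logs and data under a custom directory (hypothetical path)
  logDir: /tmp/nni-logs
  # One of: trace, debug, info, warning, error, fatal
  logLevel: debug
  # Post only job metrics back, not log content
  logCollection: none
  ```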

* __tuner__
  * Description

    __tuner__ specifies the tuner algorithm used in the experiment. There are two ways to set the tuner. One is to use a tuner provided by the NNI SDK, which requires setting __builtinTunerName__ and __classArgs__. The other is to use your own tuner file, which requires setting __codeDir__, __classFileName__, __className__ and __classArgs__.
  * __builtinTunerName__ and __classArgs__
    * __builtinTunerName__

      __builtinTunerName__ specifies the name of a built-in tuner. The NNI SDK provides several built-in tuners, including {__TPE__, __Random__, __Anneal__, __Evolution__, __BatchTuner__, __GridSearch__}.

    * __classArgs__

      __classArgs__ specifies the arguments of the tuner algorithm. If __builtinTunerName__ is one of {__TPE__, __Random__, __Anneal__, __Evolution__}, you should set __optimize_mode__.
  * __codeDir__, __classFileName__, __className__ and __classArgs__
    * __codeDir__

      __codeDir__ specifies the directory of the tuner code.
    * __classFileName__

      __classFileName__ specifies the name of the tuner file.
    * __className__

      __className__ specifies the name of the tuner class.
    * __classArgs__

      __classArgs__ specifies the arguments of the tuner algorithm.

  * __gpuNum__

    __gpuNum__ specifies the number of GPUs used to run the tuner process. The value of this field should be a positive number.

    Note: only one way of setting the tuner may be used. Set either {__builtinTunerName__, __classArgs__} or {__codeDir__, __classFileName__, __className__, __classArgs__}, but not both.

  * __includeIntermediateResults__

    If __includeIntermediateResults__ is true, the last intermediate result of a trial that is stopped early by the assessor is sent to the tuner as its final result. The default value of __includeIntermediateResults__ is false.
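
  A minimal sketch of a built-in tuner section that also forwards the last intermediate result of early-stopped trials. The tuner name and values are illustrative, and placing includeIntermediateResults directly under tuner is an assumption based on the nesting of this section.

  ```yaml
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    # Send the last intermediate result of an early-stopped trial as its final result
    includeIntermediateResults: true
  ```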

* __assessor__

  * Description

    __assessor__ specifies the assessor algorithm used in the experiment. There are two ways to set the assessor. One is to use an assessor provided by the NNI SDK, which requires setting __builtinAssessorName__ and __classArgs__. The other is to use your own assessor file, which requires setting __codeDir__, __classFileName__, __className__ and __classArgs__.
  * __builtinAssessorName__ and __classArgs__
    * __builtinAssessorName__

      __builtinAssessorName__ specifies the name of a built-in assessor. The NNI SDK provides one assessor, {__Medianstop__}.

    * __classArgs__

      __classArgs__ specifies the arguments of the assessor algorithm.

  * __codeDir__, __classFileName__, __className__ and __classArgs__

    * __codeDir__

      __codeDir__ specifies the directory of the assessor code.

    * __classFileName__

      __classFileName__ specifies the name of the assessor file.

    * __className__

      __className__ specifies the name of the assessor class.

    * __classArgs__

      __classArgs__ specifies the arguments of the assessor algorithm.

  * __gpuNum__

    __gpuNum__ specifies the number of GPUs used to run the assessor process. The value of this field should be a positive number.

    Note: only one way of setting the assessor may be used. Set either {__builtinAssessorName__, __classArgs__} or {__codeDir__, __classFileName__, __className__, __classArgs__}, but not both. If you do not want to use an assessor, leave the assessor field empty.

* __trial(local, remote)__

  * __command__

    __command__ specifies the command to run the trial process.

  * __codeDir__

    __codeDir__ specifies the directory of your own trial file.

  * __gpuNum__

    __gpuNum__ specifies the number of GPUs used to run the trial process. The default value is 0.

* __trial(pai)__

  * __command__

    __command__ specifies the command to run the trial process.

  * __codeDir__

    __codeDir__ specifies the directory of your own trial file.

  * __gpuNum__

    __gpuNum__ specifies the number of GPUs used to run the trial process. The default value is 0.

  * __cpuNum__

    __cpuNum__ is the number of CPUs to be used in the pai container.

  * __memoryMB__

    __memoryMB__ sets the memory size to be used in the pai container.

  * __image__

    __image__ sets the image to be used in pai.

  * __dataDir__

    __dataDir__ is the data directory in hdfs to be used.

  * __outputDir__

    __outputDir__ is the output directory in hdfs to be used in pai. The stdout and stderr files are stored in this directory after the job finishes.

* __trial(kubeflow)__

  * __codeDir__

    __codeDir__ is the local directory that contains the code files.

  * __ps(optional)__

    __ps__ is the configuration for kubeflow's tensorflow-operator.

    * __replicas__

      __replicas__ is the number of replicas of the __ps__ role.

    * __command__

      __command__ is the run script in the __ps__ container.

    * __gpuNum__

      __gpuNum__ sets the number of GPUs to be used in the __ps__ container.

    * __cpuNum__

      __cpuNum__ sets the number of CPUs to be used in the __ps__ container.

    * __memoryMB__

      __memoryMB__ sets the memory size of the container.

    * __image__

      __image__ sets the image to be used in __ps__.

  * __worker__

    __worker__ is the configuration for kubeflow's tensorflow-operator.

    * __replicas__

      __replicas__ is the number of replicas of the __worker__ role.

    * __command__

      __command__ is the run script in the __worker__ container.

    * __gpuNum__

      __gpuNum__ sets the number of GPUs to be used in the __worker__ container.

    * __cpuNum__

      __cpuNum__ sets the number of CPUs to be used in the __worker__ container.

    * __memoryMB__

      __memoryMB__ sets the memory size of the container.

    * __image__

      __image__ sets the image to be used in __worker__.
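
  The following sketch shows a kubeflow trial section that sets both the optional __ps__ role and the __worker__ role, using only the fields listed above. The replica counts, commands and resource values are placeholders, not recommendations.

  ```yaml
  trial:
    codeDir: .
    # Optional ps role
    ps:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
    # worker role
    worker:
      replicas: 2
      command: python3 mnist.py
      gpuNum: 1
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
  ```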

* __localConfig__

  __localConfig__ is applicable only if __trainingServicePlatform__ is set to `local`; otherwise the configuration file should not contain a __localConfig__ section.

  * __gpuIndices__

    __gpuIndices__ specifies the GPU devices designated for NNI. If it is set, only the specified GPU devices are used for NNI trial jobs. One or more GPU indices can be specified; multiple indices are separated by commas (,), for example `1` or `0,1,3`.

  * __maxTrialNumPerGpu__

    __maxTrialNumPerGpu__ specifies the maximum number of trials that can run concurrently on one GPU device.

  * __useActiveGpu__

    __useActiveGpu__ specifies whether to use a GPU that is already running another process. By default, NNI uses a GPU only if no other active process is running on it. If __useActiveGpu__ is set to true, NNI uses the GPU regardless of other processes. This field is not applicable to NNI on Windows.
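
  A minimal sketch of a __localConfig__ section, assuming a local machine with at least two GPUs. All values are illustrative, and the index list is quoted here so that YAML reads it as a string.

  ```yaml
  trainingServicePlatform: local
  localConfig:
    # Only GPUs 0 and 1 are used for trial jobs
    gpuIndices: "0,1"
    # Allow up to two concurrent trials per GPU
    maxTrialNumPerGpu: 2
    # Use a GPU even if another process is already running on it
    useActiveGpu: true
  ```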
  

* __machineList__

  __machineList__ must be set if __trainingServicePlatform__ is set to remote; otherwise it should be left empty.

  * __ip__

    __ip__ is the IP address of the remote machine.

  * __port__

    __port__ is the SSH port used to connect to the machine.

    Note: if port is left empty, the default value is 22.

  * __username__

    __username__ is the account on the remote machine.

  * __passwd__

    __passwd__ specifies the password of the account.

  * __sshKeyPath__

    If you log in to the remote machine with an SSH key, set __sshKeyPath__ in the config file. __sshKeyPath__ is the path of the SSH key file, which must be valid.

    Note: if passwd and sshKeyPath are both set, NNI will try passwd.

  * __passphrase__

    __passphrase__ is used to protect the SSH key. It can be left empty if you do not have a passphrase.

  * __gpuIndices__

    __gpuIndices__ specifies the GPU devices designated for NNI on this remote machine. If it is set, only the specified GPU devices are used for NNI trial jobs. One or more GPU indices can be specified; multiple indices are separated by commas (,), for example `1` or `0,1,3`.

  * __maxTrialNumPerGpu__

    __maxTrialNumPerGpu__ specifies the maximum number of trials that can run concurrently on one GPU device.

  * __useActiveGpu__

    __useActiveGpu__ specifies whether to use a GPU that is already running another process. By default, NNI uses a GPU only if no other active process is running on it. If __useActiveGpu__ is set to true, NNI uses the GPU regardless of other processes. This field is not applicable to NNI on Windows.
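
  As a sketch, one remote machine entry that logs in with an SSH key and restricts NNI to two GPUs could look like this. The address, account, key path and passphrase are placeholder values, and the GPU settings are illustrative.

  ```yaml
  machineList:
    - ip: 10.10.10.12
      port: 22
      username: test
      # Key-based login; passwd is omitted for this machine
      sshKeyPath: /nni/sshkey
      passphrase: qwert
      # Only GPUs 0 and 1 on this machine are used for trial jobs
      gpuIndices: "0,1"
      maxTrialNumPerGpu: 1
      useActiveGpu: false
  ```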

* __kubeflowConfig__

  * __operator__

    __operator__ specifies the kubeflow operator to be used. The current version of NNI supports __tf-operator__.

  * __storage__

    __storage__ specifies the storage type of kubeflow, one of {__nfs__, __azureStorage__}. This field is optional, and the default value is __nfs__. If the config uses azureStorage, this field must be set.

  * __nfs__

    __server__ is the host of the nfs server.

    __path__ is the mounted path of nfs.

  * __keyVault__

    If you use azure kubernetes service, set keyVault to store the private key of your azure storage account. Refer to https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2

    * __vaultName__

      __vaultName__ is the value of `--vault-name` used in the az command.

    * __name__

      __name__ is the value of `--name` used in the az command.

  * __azureStorage__

    If you use azure kubernetes service, set an azure storage account to store the code files.

    * __accountName__

      __accountName__ is the name of the azure storage account.

    * __azureShare__

      __azureShare__ is the share of the azure file storage.

* __paiConfig__

  * __userName__

    __userName__ is the user name of your pai account.

  * __password__

    __password__ is the password of the pai account.

  * __host__

    __host__ is the host of pai.

<a name="Examples"></a>
## Examples

* __local mode__

  To run trial jobs on the local machine and use annotation to generate the search space, use the following config:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  #choice: true, false
  useAnnotation: true
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

  To use an assessor, add the assessor configuration:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  assessor:
    #choice: Medianstop
    builtinAssessorName: Medianstop
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

  Alternatively, you can specify your own tuner and assessor files as follows:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    codeDir: /nni/tuner
    classFileName: mytuner.py
    className: MyTuner
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  assessor:
    codeDir: /nni/assessor
    classFileName: myassessor.py
    className: MyAssessor
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

* __remote mode__

  To run trial jobs on remote machines, specify the remote machine information in the following format:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: remote
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  #machineList can be empty if the platform is local
  machineList:
    - ip: 10.10.10.10
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.11
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.12
      port: 22
      username: test
      sshKeyPath: /nni/sshkey
      passphrase: qwert
  ```

* __pai mode__

  ```yaml
  authorName: test
  experimentName: nni_test1
  trialConcurrency: 1
  maxExecDuration: 500h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: pai
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution, BatchTuner
    #SMAC (SMAC should be installed through nnictl)
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 main.py
    codeDir: .
    gpuNum: 4
    cpuNum: 2
    memoryMB: 10000
    #The docker image to run NNI job on pai
    image: msranni/nni:latest
    #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
    dataDir: hdfs://10.11.12.13:9000/test
    #The hdfs directory to store output data generated by NNI, format 'hdfs://host:port/directory'
    outputDir: hdfs://10.11.12.13:9000/test
  paiConfig:
    #The username to login pai
    userName: test
    #The password to login pai
    passWord: test
    #The host of restful server of pai
    host: 10.10.10.10
  ```

* __kubeflow mode__

  kubeflow with nfs storage.

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    nfs:
      server: 10.10.10.10
      path: /var/nfs/general
  ```

  kubeflow with azure storage

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  #nniManagerIp: 10.10.10.10
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    builtinAssessorName: Medianstop
    classArgs:
      optimize_mode: maximize
    gpuNum: 0
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 4096
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    keyVault:
      vaultName: Contoso-Vault
      name: AzureStorageAccountKey
    azureStorage:
      accountName: storage
      azureShare: share01
  ```