# Experiment config reference

A config file is required to create an experiment. The path of the config file is passed to `nnictl`.
The config file is written in YAML.
This document describes the rules for writing the config file and provides some examples and templates.

- [Experiment config reference](#Experiment-config-reference)
  - [Template](#Template)
  - [Configuration spec](#Configuration-spec)
  - [Examples](#Examples)

<a name="Template"></a>
## Template

* __Lightweight (without Annotation and Assessor)__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiPhase:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Assessor__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiPhase:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Annotation__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiPhase:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

<a name="Configuration"></a>
## Configuration spec

* __authorName__
  * Description

    __authorName__ is the name of the author who creates the experiment.

    TBD: add default value

* __experimentName__
  * Description

    __experimentName__ is the name of the experiment created.

    TBD: add default value

* __trialConcurrency__
  * Description

    __trialConcurrency__ specifies the maximum number of trial jobs that run simultaneously.

    Note: if trialGpuNum is larger than the number of free GPUs, so that the number of simultaneously running trial jobs cannot reach trialConcurrency, some trial jobs are put into a queue to wait for GPU allocation.

* __maxExecDuration__
  * Description

    __maxExecDuration__ specifies the maximum duration of an experiment. The unit of time is one of {__s__, __m__, __h__, __d__}, meaning {_seconds_, _minutes_, _hours_, _days_}.

    Note: maxExecDuration limits the duration of the experiment, not of a trial job. When the experiment reaches the maximum duration, it does not stop, but it can no longer submit new trial jobs.
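
  As a small illustration of the maxExecDuration format, the fragment below combines a number with one of the unit suffixes listed above. The values are arbitrary examples, not recommendations.

  ```yaml
  # Stop submitting new trial jobs after 2 hours (illustrative value).
  maxExecDuration: 2h
  # Other values using the same unit suffixes: 30m, 1d
  ```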

* __versionCheck__
  * Description

    NNI checks the version of the nniManager process against the version of trialKeeper on the remote, pai and kubernetes platforms. To disable the version check, set versionCheck to false.

* __debug__
  * Description

    Debug mode sets versionCheck to false and logLevel to 'debug'.
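
  A minimal sketch of how the versionCheck and debug switches might appear at the top level of the config file. Either line can be used on its own, and the values are only illustrative.

  ```yaml
  # Skip the nniManager/trialKeeper version comparison.
  versionCheck: false
  # Alternatively, debug mode disables the version check and sets logLevel to 'debug'.
  # debug: true
  ```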

* __maxTrialNum__
  * Description

    __maxTrialNum__ specifies the maximum number of trial jobs created by NNI, including succeeded and failed jobs.

* __trainingServicePlatform__
  * Description

    __trainingServicePlatform__ specifies the platform on which to run the experiment, one of {__local__, __remote__, __pai__, __kubeflow__}.

    * __local__ runs an experiment on the local Ubuntu machine.

    * __remote__ submits trial jobs to remote Ubuntu machines. The __machineList__ field must be filled in to set up the SSH connection to the remote machines.

    * __pai__ submits trial jobs to Microsoft's [OpenPai](https://github.com/Microsoft/pai). For more details on pai configuration, refer to [PaiModeDoc](../TrainingService/PaiMode.md).

    * __kubeflow__ submits trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/). NNI supports kubeflow on standard kubernetes and on [azure kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/). For details, refer to [KubeflowDoc](../TrainingService/KubeflowMode.md).

* __searchSpacePath__
  * Description

    __searchSpacePath__ specifies the path of the search space file, which must be a valid path on the local Linux machine.

    Note: if useAnnotation is set to true, the searchSpacePath field should be removed.

* __useAnnotation__
  * Description

    __useAnnotation__ specifies whether to use annotation to analyze the trial code and generate the search space.

    Note: if useAnnotation is set to true, the searchSpacePath field should be removed.
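
  The relationship between searchSpacePath and useAnnotation can be sketched as two alternative fragments. The file path is the one used in the examples of this document. The second variant is commented out because the two are not meant to be combined.

  ```yaml
  # Variant 1: the search space is read from a JSON file.
  searchSpacePath: /nni/search_space.json
  useAnnotation: false

  # Variant 2: the search space is generated from annotations in the trial code,
  # so searchSpacePath is removed.
  # useAnnotation: true
  ```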

* __multiPhase__
  * Description

    __multiPhase__ enables [multi-phase experiment](../AdvancedFeature/MultiPhase.md).

* __multiThread__
  * Description

    __multiThread__ enables multi-thread mode for the dispatcher. If multiThread is set to `true`, the dispatcher starts a thread to process each command from NNI Manager.

* __nniManagerIp__
  * Description

    __nniManagerIp__ sets the IP address of the machine on which the NNI manager process runs. This field is optional. If it is not set, the IP address of the eth0 device is used instead.

    Note: run ifconfig on the NNI manager's machine to check whether the eth0 device exists. If it does not, we recommend setting nniManagerIp explicitly.
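
  For a machine without an eth0 device, setting nniManagerIp explicitly might look like the fragment below. The address is a placeholder and should be replaced with an address of the NNI manager machine that the trial machines can reach.

  ```yaml
  # Placeholder address of the machine running the NNI manager process.
  nniManagerIp: 10.10.10.10
  ```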

* __logDir__
  * Description

    __logDir__ configures the directory in which to store the logs and data of the experiment. The default value is `<user home directory>/nni/experiment`.

* __logLevel__
  * Description

    __logLevel__ sets the log level for the experiment. The available log levels are `trace`, `debug`, `info`, `warning`, `error` and `fatal`. The default value is `info`.

* __logCollection__
  * Description

    __logCollection__ sets the way logs are collected on the remote, pai, kubeflow and frameworkcontroller platforms. There are two options. With `http`, the trial keeper posts log content back through HTTP requests, which may slow down log processing in trialKeeper. With `none`, the trial keeper does not post log content back and only posts job metrics. If your log content is too large, consider setting this parameter to `none`.
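
  The three logging fields logDir, logLevel and logCollection could be combined as in the sketch below. The directory is a hypothetical path chosen for illustration, and the other values are picked from the options listed above.

  ```yaml
  # Hypothetical directory; the default is <user home directory>/nni/experiment.
  logDir: /data/nni/experiment
  # One of: trace, debug, info, warning, error, fatal.
  logLevel: warning
  # Do not post trial log content back; only job metrics are posted.
  logCollection: none
  ```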

* __tuner__
  * Description

    __tuner__ specifies the tuner algorithm used in the experiment. There are two ways to set the tuner. One is to use a tuner provided by the NNI SDK, which requires setting __builtinTunerName__ and __classArgs__. The other is to use your own tuner file, which requires setting __codeDir__, __classFileName__, __className__ and __classArgs__.
  * __builtinTunerName__ and __classArgs__
    * __builtinTunerName__

      __builtinTunerName__ specifies the name of a built-in tuner. The NNI SDK provides the tuners introduced [here](../Tuner/BuiltinTuner.md).

    * __classArgs__

      __classArgs__ specifies the arguments of the tuner algorithm. Please refer to [this file](../Tuner/BuiltinTuner.md) for the configurable arguments of each built-in tuner.
  * __codeDir__, __classFileName__, __className__ and __classArgs__
    * __codeDir__

      __codeDir__ specifies the directory of the tuner code.
    * __classFileName__

      __classFileName__ specifies the name of the tuner file.
    * __className__

      __className__ specifies the name of the tuner class.
    * __classArgs__

      __classArgs__ specifies the arguments of the tuner algorithm.

  * __gpuIndices__

    __gpuIndices__ specifies the GPUs that can be used by the tuner process. A single GPU index or multiple GPU indices can be specified. Multiple indices are separated by commas, such as `1` or `0,1,3`. If the field is not set, `CUDA_VISIBLE_DEVICES` will be '' in the script, that is, no GPU is visible to the tuner.

  * __includeIntermediateResults__

    If __includeIntermediateResults__ is true, the last intermediate result of a trial that is early stopped by the assessor is sent to the tuner as the final result. The default value of __includeIntermediateResults__ is false.

  Note: only one way can be used to specify the tuner: either `builtinTunerName` and `classArgs`, or `codeDir`, `classFileName`, `className` and `classArgs`.
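
  A sketch of a tuner section that uses a built-in tuner together with the optional gpuIndices and includeIntermediateResults fields. The GPU indices are illustrative, and quoting the comma-separated list is an assumption made here so that YAML reads it as a single string.

  ```yaml
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    # Report the last intermediate result of early-stopped trials as the final result.
    includeIntermediateResults: true
    # Make GPUs 0 and 1 visible to the tuner process (illustrative indices).
    gpuIndices: "0,1"
  ```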

* __assessor__

  * Description

    __assessor__ specifies the assessor algorithm used to run an experiment. There are two ways to set the assessor. One is to use an assessor provided by the NNI SDK, which requires setting __builtinAssessorName__ and __classArgs__. The other is to use your own assessor file, which requires setting __codeDir__, __classFileName__, __className__ and __classArgs__.
  * __builtinAssessorName__ and __classArgs__
    * __builtinAssessorName__

      __builtinAssessorName__ specifies the name of a built-in assessor. The NNI SDK provides the assessors introduced [here](../Assessor/BuiltinAssessor.md).

    * __classArgs__

      __classArgs__ specifies the arguments of the assessor algorithm.

  * __codeDir__, __classFileName__, __className__ and __classArgs__

    * __codeDir__

      __codeDir__ specifies the directory of the assessor code.

    * __classFileName__

      __classFileName__ specifies the name of the assessor file.

    * __className__

      __className__ specifies the name of the assessor class.

    * __classArgs__

      __classArgs__ specifies the arguments of the assessor algorithm.

  Note: only one way can be used to specify the assessor: either `builtinAssessorName` and `classArgs`, or `codeDir`, `classFileName`, `className` and `classArgs`. If you do not want to use an assessor, leave the assessor field empty.

* __advisor__
  * Description

    __advisor__ specifies the advisor algorithm used in the experiment. There are two ways to specify the advisor. One is to use an advisor provided by the NNI SDK, which requires setting __builtinAdvisorName__ and __classArgs__. The other is to use your own advisor file, which requires setting __codeDir__, __classFileName__, __className__ and __classArgs__.
  * __builtinAdvisorName__ and __classArgs__
    * __builtinAdvisorName__

      __builtinAdvisorName__ specifies the name of a built-in advisor. The NNI SDK provides [different advisors](../Tuner/BuiltinTuner.md).

    * __classArgs__

      __classArgs__ specifies the arguments of the advisor algorithm. Please refer to [this file](../Tuner/BuiltinTuner.md) for the configurable arguments of each built-in advisor.
  * __codeDir__, __classFileName__, __className__ and __classArgs__
    * __codeDir__

      __codeDir__ specifies the directory of the advisor code.
    * __classFileName__

      __classFileName__ specifies the name of the advisor file.
    * __className__

      __className__ specifies the name of the advisor class.
    * __classArgs__

      __classArgs__ specifies the arguments of the advisor algorithm.

  * __gpuIndices__

    __gpuIndices__ specifies the GPUs that can be used by the tuner process. A single GPU index or multiple GPU indices can be specified. Multiple indices are separated by commas, such as `1` or `0,1,3`. If the field is not set, `CUDA_VISIBLE_DEVICES` will be '' in the script, that is, no GPU is visible to the tuner.

  Note: only one way can be used to specify the advisor: either `builtinAdvisorName` and `classArgs`, or `codeDir`, `classFileName`, `className` and `classArgs`.
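
  As a sketch, an advisor section using a built-in advisor might look like the fragment below. It assumes the Hyperband advisor and its `R` and `eta` arguments, which are not described in this document. Check the built-in tuner documentation linked above for the exact names and arguments available in your NNI version.

  ```yaml
  advisor:
    # Assumed built-in advisor name; verify against your NNI version.
    builtinAdvisorName: Hyperband
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
      # Assumed Hyperband arguments with illustrative values.
      R: 60
      eta: 3
  ```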

* __trial(local, remote)__

  * __command__

    __command__ specifies the command to run the trial process.

  * __codeDir__

    __codeDir__ specifies the directory of your own trial file.

  * __gpuNum__

    __gpuNum__ specifies the number of GPUs used to run the trial process. The default value is 0.

* __trial(pai)__

  * __command__

    __command__ specifies the command to run the trial process.

  * __codeDir__

    __codeDir__ specifies the directory of your own trial file.

  * __gpuNum__

    __gpuNum__ specifies the number of GPUs used to run the trial process. The default value is 0.

  * __cpuNum__

    __cpuNum__ is the number of CPUs to be used in the pai container.

  * __memoryMB__

    __memoryMB__ sets the memory size to be used in the pai container.

  * __image__

    __image__ sets the image to be used in pai.

  * __dataDir__

    __dataDir__ is the data directory in HDFS to be used.

  * __outputDir__

    __outputDir__ is the output directory in HDFS to be used in pai. The stdout and stderr files are stored in this directory after the job finishes.
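
  A sketch of a trial section for the pai platform, using only the fields described above. The image name, HDFS addresses and resource sizes are placeholders chosen for illustration.

  ```yaml
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
    cpuNum: 1
    memoryMB: 8192
    # Placeholder Docker image; use an image available to your pai cluster.
    image: msranni/nni:latest
    # Placeholder HDFS locations.
    dataDir: hdfs://10.10.10.10:9000/nni
    outputDir: hdfs://10.10.10.10:9000/nni
  ```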

* __trial(kubeflow)__

  * __codeDir__

    __codeDir__ is the local directory where the code files are located.

  * __ps(optional)__

    __ps__ is the configuration for kubeflow's tensorflow-operator.

    * __replicas__

      __replicas__ is the number of replicas of the __ps__ role.

    * __command__

      __command__ is the script to run in the __ps__ container.

    * __gpuNum__

      __gpuNum__ sets the number of GPUs to be used in the __ps__ container.

    * __cpuNum__

      __cpuNum__ sets the number of CPUs to be used in the __ps__ container.

    * __memoryMB__

      __memoryMB__ sets the memory size of the container.

    * __image__

      __image__ sets the image to be used in __ps__.

  * __worker__

    __worker__ is the configuration for kubeflow's tensorflow-operator.

    * __replicas__

      __replicas__ is the number of replicas of the __worker__ role.

    * __command__

      __command__ is the script to run in the __worker__ container.

    * __gpuNum__

      __gpuNum__ sets the number of GPUs to be used in the __worker__ container.

    * __cpuNum__

      __cpuNum__ sets the number of CPUs to be used in the __worker__ container.

    * __memoryMB__

      __memoryMB__ sets the memory size of the container.

    * __image__

      __image__ sets the image to be used in __worker__.
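
  A sketch of a kubeflow trial section with one __ps__ replica and two __worker__ replicas, built from the fields described above. The script name, image and resource sizes are hypothetical placeholders.

  ```yaml
  trial:
    codeDir: /nni/mnist
    ps:
      replicas: 1
      # Hypothetical distributed training script.
      command: python3 dist_mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
    worker:
      replicas: 2
      command: python3 dist_mnist.py
      gpuNum: 1
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
  ```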

* __localConfig__

  __localConfig__ is applicable only if __trainingServicePlatform__ is set to `local`. Otherwise there should be no __localConfig__ section in the configuration file.
  * __gpuIndices__

    __gpuIndices__ specifies the GPU devices designated for NNI. If it is set, only the specified GPU devices are used for NNI trial jobs. A single GPU index or multiple GPU indices can be specified. Multiple indices are separated by commas, such as `1` or `0,1,3`.

  * __maxTrialNumPerGpu__

    __maxTrialNumPerGpu__ specifies the maximum number of concurrent trials on a GPU device.

  * __useActiveGpu__

    __useActiveGpu__ specifies whether to use a GPU that is already running another process. By default, NNI uses a GPU only if it has no other active process. If __useActiveGpu__ is set to true, NNI uses the GPU regardless of other processes. This field is not applicable to NNI on Windows.
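
  A sketch of a localConfig section that restricts trials to two GPUs and allows two concurrent trials per GPU. The indices and counts are illustrative, and the comma-separated list is quoted here on the assumption that it should be read as one string.

  ```yaml
  trainingServicePlatform: local
  localConfig:
    # Only GPUs 0 and 1 are used for trial jobs (illustrative indices).
    gpuIndices: "0,1"
    # At most two trials share one GPU at the same time.
    maxTrialNumPerGpu: 2
    # Keep the default: skip GPUs that already run another active process.
    useActiveGpu: false
  ```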
  

* __machineList__

  __machineList__ must be set if __trainingServicePlatform__ is set to remote. Otherwise it should be empty.

  * __ip__

    __ip__ is the IP address of the remote machine.

  * __port__

    __port__ is the SSH port used to connect to the machine.

    Note: if port is left empty, the default value is 22.
  * __username__

    __username__ is the account on the remote machine.
  * __passwd__

    __passwd__ specifies the password of the account.

  * __sshKeyPath__

    If you use an SSH key to log in to the remote machine, set __sshKeyPath__ in the config file. __sshKeyPath__ is the path of the SSH key file, which must be valid.

    Note: if passwd and sshKeyPath are both set, NNI will try passwd.

  * __passphrase__

    __passphrase__ is used to protect the SSH key. It can be empty if you don't have a passphrase.

  * __gpuIndices__

    __gpuIndices__ specifies the GPU devices designated for NNI on this remote machine. If it is set, only the specified GPU devices are used for NNI trial jobs. A single GPU index or multiple GPU indices can be specified. Multiple indices are separated by commas, such as `1` or `0,1,3`.

  * __maxTrialNumPerGpu__

    __maxTrialNumPerGpu__ specifies the maximum number of concurrent trials on a GPU device.

  * __useActiveGpu__

    __useActiveGpu__ specifies whether to use a GPU that is already running another process. By default, NNI uses a GPU only if it has no other active process. If __useActiveGpu__ is set to true, NNI uses the GPU regardless of other processes. This field is not applicable to NNI on Windows.
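
  A sketch of a machineList with two remote machines, one using password login and one using an SSH key. All addresses, account names, passwords and paths are placeholders.

  ```yaml
  trainingServicePlatform: remote
  machineList:
    # Password login; port is omitted, so the default of 22 applies.
    - ip: 10.10.10.10
      username: test
      passwd: test
    # SSH key login on a non-default port, restricted to two GPUs.
    - ip: 10.10.10.11
      port: 2222
      username: test
      sshKeyPath: /nni/sshkey
      passphrase: example_passphrase
      gpuIndices: "0,1"
      maxTrialNumPerGpu: 1
  ```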

* __kubeflowConfig__:

  * __operator__

    __operator__ specifies the kubeflow operator to be used. NNI supports __tf-operator__ in the current version.

  * __storage__

    __storage__ specifies the storage type of kubeflow, one of {__nfs__, __azureStorage__}. This field is optional, and the default value is __nfs__. If the config uses azureStorage, this field must be set.

  * __nfs__

    __server__ is the host of the NFS server.

    __path__ is the mounted path of NFS.

  * __keyVault__

    If you want to use Azure Kubernetes Service, set keyVault to store the private key of your Azure storage account. Refer to https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2

    * __vaultName__

      __vaultName__ is the value of `--vault-name` used in the az command.

    * __name__

      __name__ is the value of `--name` used in the az command.

  * __azureStorage__

    If you use Azure Kubernetes Service, set an Azure storage account to store code files.

    * __accountName__

      __accountName__ is the name of the Azure storage account.

    * __azureShare__

      __azureShare__ is the share of the Azure file storage.

  * __uploadRetryCount__

    If uploading files to Azure storage fails, NNI retries the upload. This field specifies the number of attempts to re-upload files.
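
  A sketch of a kubeflowConfig section backed by NFS, followed by a commented-out Azure storage variant. The server address, paths, vault and account names are placeholders, and only fields described above are used.

  ```yaml
  trainingServicePlatform: kubeflow
  kubeflowConfig:
    operator: tf-operator
    storage: nfs
    nfs:
      # Placeholder NFS host and exported path.
      server: 10.10.10.10
      path: /var/nfs/general

  # Azure storage variant (placeholders throughout):
  # kubeflowConfig:
  #   operator: tf-operator
  #   storage: azureStorage
  #   keyVault:
  #     vaultName: example-vault
  #     name: example-secret
  #   azureStorage:
  #     accountName: exampleaccount
  #     azureShare: example-share
  ```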

* __paiConfig__

  * __userName__

    __userName__ is the user name of your pai account.

  * __password__

    __password__ is the password of the pai account.

  * __host__

    __host__ is the host of pai.
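
  A sketch of a paiConfig section with its three fields. The account name, password and host are placeholders to be replaced with the details of your own pai cluster.

  ```yaml
  trainingServicePlatform: pai
  paiConfig:
    # Placeholder credentials and host.
    userName: your_pai_account
    password: your_pai_password
    host: 10.10.10.10
  ```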

<a name="Examples"></a>
## Examples

* __local mode__

  If you want to run trial jobs on the local machine and use annotation to generate the search space, you can use the following config:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  #choice: true, false
  useAnnotation: true
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

  You can add assessor configuration.

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    #choice: Medianstop
    builtinAssessorName: Medianstop
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

  Alternatively, you can specify your own tuner and assessor files as follows:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    codeDir: /nni/tuner
    classFileName: mytuner.py
    className: MyTuner
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    codeDir: /nni/assessor
    classFileName: myassessor.py
    className: MyAssessor
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```
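
The three local-mode examples differ in two ways: where the search space comes from (`useAnnotation: true` with no `searchSpacePath`, or `useAnnotation: false` with a `searchSpacePath`), and whether the tuner and assessor are built-in (`builtinTunerName`, `builtinAssessorName`) or custom (`codeDir`, `classFileName`, `className`). The sketch below checks those two patterns on a config that has already been loaded into a Python dict. The function names are hypothetical, and the rules are inferred from the examples above; this is not NNI's own validation logic.

```python
# Illustrative consistency checks inferred from the examples above.
# These helpers are hypothetical and are NOT part of NNI.

CUSTOM_KEYS = ("codeDir", "classFileName", "className")


def search_space_source(config):
    """Return 'annotation' or 'file' for a parsed experiment config (a dict)."""
    use_annotation = bool(config.get("useAnnotation", False))
    has_path = bool(config.get("searchSpacePath"))
    if use_annotation and has_path:
        raise ValueError("useAnnotation is true, so searchSpacePath is not expected")
    if not use_annotation and not has_path:
        raise ValueError("searchSpacePath is expected when useAnnotation is false")
    return "annotation" if use_annotation else "file"


def component_kind(section, builtin_key):
    """Classify a tuner or assessor section as 'builtin' or 'custom'.

    builtin_key is 'builtinTunerName' or 'builtinAssessorName'.
    """
    has_builtin = bool(section.get(builtin_key))
    custom_present = [key for key in CUSTOM_KEYS if section.get(key)]
    if has_builtin and custom_present:
        raise ValueError("use either %s or the custom class keys, not both" % builtin_key)
    if has_builtin:
        return "builtin"
    if len(custom_present) == len(CUSTOM_KEYS):
        return "custom"
    missing = [key for key in CUSTOM_KEYS if key not in custom_present]
    raise ValueError("custom component is missing: " + ", ".join(missing))


if __name__ == "__main__":
    # Mirrors the third local-mode example (custom tuner, search space from a file).
    config = {
        "useAnnotation": False,
        "searchSpacePath": "/nni/search_space.json",
        "tuner": {
            "codeDir": "/nni/tuner",
            "classFileName": "mytuner.py",
            "className": "MyTuner",
        },
    }
    print(search_space_source(config))
    print(component_kind(config["tuner"], "builtinTunerName"))
```

Running a check like this before starting an experiment can catch a half-edited config, for example one that switches to a custom tuner but still carries `builtinTunerName`.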

* __remote mode__

  To run trial jobs on remote machines, specify the remote machine information in the following format:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: remote
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  #machineList can be empty if the platform is local
  machineList:
    - ip: 10.10.10.10
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.11
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.12
      port: 22
      username: test
      sshKeyPath: /nni/sshkey
      passphrase: qwert
  ```
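
In the example above, every `machineList` entry has `ip`, `port`, and `username`, and then authenticates either with `passwd` or with `sshKeyPath` (optionally plus `passphrase`). A minimal sketch of a check for that pattern, assuming the config has already been parsed into Python dicts; `machine_auth_method` is a hypothetical helper, not an NNI function:

```python
# Hypothetical helper: classifies how each remote machine entry authenticates.
# The required keys are inferred from the example above, not from NNI's schema.

REQUIRED_KEYS = ("ip", "port", "username")


def machine_auth_method(machine):
    """Return 'ssh_key' or 'password' for one machineList entry (a dict)."""
    missing = [key for key in REQUIRED_KEYS if key not in machine]
    if missing:
        raise ValueError("machine entry is missing: " + ", ".join(missing))
    if machine.get("sshKeyPath"):
        return "ssh_key"
    if machine.get("passwd"):
        return "password"
    raise ValueError("machine %s needs either passwd or sshKeyPath" % machine["ip"])


if __name__ == "__main__":
    machine_list = [
        {"ip": "10.10.10.10", "port": 22, "username": "test", "passwd": "test"},
        {"ip": "10.10.10.12", "port": 22, "username": "test",
         "sshKeyPath": "/nni/sshkey", "passphrase": "qwert"},
    ]
    for machine in machine_list:
        print(machine["ip"], machine_auth_method(machine))
```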

* __pai mode__

  ```yaml
  authorName: test
  experimentName: nni_test1
  trialConcurrency: 1
  maxExecDuration: 500h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: pai
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution, BatchTuner
    #SMAC (SMAC should be installed through nnictl)
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 main.py
    codeDir: .
    gpuNum: 4
    cpuNum: 2
    memoryMB: 10000
    #The docker image to run NNI job on pai
    image: msranni/nni:latest
    #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
    dataDir: hdfs://10.11.12.13:9000/test
    #The hdfs directory to store output data generated by NNI, format 'hdfs://host:port/directory'
    outputDir: hdfs://10.11.12.13:9000/test
  paiConfig:
    #The username to login pai
    userName: test
    #The password to login pai
    passWord: test
    #The host of restful server of pai
    host: 10.10.10.10
  ```
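
The comments in the pai example give the expected format of `dataDir` and `outputDir` as `hdfs://host:port/directory`. The sketch below splits such a value into its parts with Python's standard `urllib.parse.urlsplit`, which can help spot a missing port or a wrong scheme before submitting a job. `parse_hdfs_dir` is a hypothetical helper name, and the checks reflect only the format stated in those comments.

```python
from urllib.parse import urlsplit


def parse_hdfs_dir(url):
    """Split 'hdfs://host:port/directory' into (host, port, directory).

    Hypothetical helper; it only enforces the format given in the
    example's comments and is not part of NNI.
    """
    parts = urlsplit(url)
    if parts.scheme != "hdfs":
        raise ValueError("expected an hdfs:// URL, got: %r" % url)
    if not parts.hostname or parts.port is None:
        raise ValueError("expected both host and port in: %r" % url)
    if not parts.path:
        raise ValueError("expected a directory path in: %r" % url)
    return parts.hostname, parts.port, parts.path


if __name__ == "__main__":
    host, port, directory = parse_hdfs_dir("hdfs://10.11.12.13:9000/test")
    print(host, port, directory)
```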

* __kubeflow mode__

  Kubeflow with NFS storage:

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    nfs:
      server: 10.10.10.10
      path: /var/nfs/general
  ```

  Kubeflow with Azure storage:

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  #nniManagerIp: 10.10.10.10
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    builtinAssessorName: Medianstop
    classArgs:
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 4096
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    keyVault:
      vaultName: Contoso-Vault
      name: AzureStorageAccountKey
    azureStorage:
      accountName: storage
      azureShare: share01
  ```
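
The two kubeflow examples differ only in the storage part of `kubeflowConfig`: one uses `nfs` (with `server` and `path`), the other uses `azureStorage` (with `accountName` and `azureShare`) together with `keyVault` (with `vaultName` and `name`). A minimal sketch that tells the two apart on an already-parsed config; the helper name is hypothetical, and the required keys are assumptions read off the examples rather than NNI's actual schema:

```python
# Hypothetical helper: decides which storage backend a kubeflowConfig uses.
# Required keys are inferred from the two examples above, not from NNI's schema.

STORAGE_KEYS = {
    "nfs": ("server", "path"),
    "azureStorage": ("accountName", "azureShare"),
    "keyVault": ("vaultName", "name"),
}


def _require(config, section_name):
    """Raise ValueError if a section or any of its expected keys is absent."""
    section = config.get(section_name)
    if not isinstance(section, dict):
        raise ValueError("missing section: " + section_name)
    missing = [key for key in STORAGE_KEYS[section_name] if key not in section]
    if missing:
        raise ValueError("%s is missing: %s" % (section_name, ", ".join(missing)))


def kubeflow_storage_kind(kubeflow_config):
    """Return 'nfs' or 'azure' for a kubeflowConfig dict."""
    has_nfs = "nfs" in kubeflow_config
    has_azure = "azureStorage" in kubeflow_config
    if has_nfs == has_azure:
        raise ValueError("expected exactly one of nfs or azureStorage")
    if has_nfs:
        _require(kubeflow_config, "nfs")
        return "nfs"
    _require(kubeflow_config, "azureStorage")
    _require(kubeflow_config, "keyVault")
    return "azure"


if __name__ == "__main__":
    nfs_config = {
        "operator": "tf-operator",
        "nfs": {"server": "10.10.10.10", "path": "/var/nfs/general"},
    }
    print(kubeflow_storage_kind(nfs_config))
```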