# Experiment config reference

A config file is required to create an experiment; its path is passed to nnictl.
The config file is written in YAML and must be well-formed.
This document describes the rules for writing a config file and provides some examples and templates.

* [Template](#Template) (templates of a config file)
* [Configuration spec](#Configuration) (specification of every attribute in the config file)
* [Examples](#Examples) (examples of config files)

<a name="Template"></a>
## Template

* __Lightweight (without Annotation and Assessor)__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false
useAnnotation:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Assessor__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false
useAnnotation:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Annotation__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
#choice: true, false
useAnnotation:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

<a name="Configuration"></a>
## Configuration spec

* __authorName__
  * Description

    __authorName__ is the name of the author who creates the experiment.

    TBD: add default value

* __experimentName__
  * Description

    __experimentName__ is the name of the experiment being created.

    TBD: add default value

* __trialConcurrency__
  * Description

    __trialConcurrency__ specifies the maximum number of trial jobs that run simultaneously.

    Note: if a trial's gpuNum exceeds the number of free GPUs, so that the number of simultaneously running trial jobs cannot reach trialConcurrency, the remaining trial jobs are put into a queue to wait for GPU allocation.
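
    As a rough illustration of the note above, the fragment below asks for four concurrent trials that each need one GPU. The values are made up for this sketch. On a machine with, say, only two free GPUs, two trials would run and the other two would wait in the queue.

    ```yaml
    # Illustrative values only
    trialConcurrency: 4
    trial:
      command: python3 mnist.py
      codeDir: /nni/mnist
      # each trial job requests one GPU
      gpuNum: 1
    ```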

* __maxExecDuration__
  * Description

    __maxExecDuration__ specifies the maximum duration of an experiment. The unit of time is one of {__s__, __m__, __h__, __d__}, meaning {_seconds_, _minutes_, _hours_, _days_}.

    Note: maxExecDuration limits the duration of the whole experiment, not of a single trial job. When the experiment reaches the maximum duration it does not stop, but it can no longer submit new trial jobs.
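
    For instance, the unit suffixes above could be used as follows. This is only a sketch with arbitrary durations, and a real config would contain just one `maxExecDuration` line.

    ```yaml
    # Pick one of these forms (values are illustrative)
    maxExecDuration: 90m      # 90 minutes
    # maxExecDuration: 3600s  # 3600 seconds
    # maxExecDuration: 12h    # 12 hours
    # maxExecDuration: 2d     # 2 days
    ```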

* __maxTrialNum__
  * Description

    __maxTrialNum__ specifies the maximum number of trial jobs created by NNI, including succeeded and failed jobs.

* __trainingServicePlatform__
  * Description

    __trainingServicePlatform__ specifies the platform on which to run the experiment, one of {__local__, __remote__, __pai__, __kubeflow__}.

    * __local__ runs the experiment on the local Ubuntu machine.

    * __remote__ submits trial jobs to remote Ubuntu machines; the __machineList__ field must be filled in so that NNI can set up SSH connections to the remote machines.

    * __pai__ submits trial jobs to Microsoft's [OpenPAI](https://github.com/Microsoft/pai). For more details on pai configuration, see [PAIModeDoc](./PAIMode.md).

    * __kubeflow__ submits trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/). NNI supports kubeflow on standard Kubernetes and on [Azure Kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/).

* __searchSpacePath__
  * Description

    __searchSpacePath__ specifies the path of the search space file, which should be a valid path on the local Linux machine.

    Note: if useAnnotation is set to true, the searchSpacePath field should be removed.

* __useAnnotation__
  * Description

    __useAnnotation__ specifies whether to use annotation to analyze the trial code and generate the search space.

    Note: if useAnnotation is set to true, the searchSpacePath field should be removed.
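
    The two mutually exclusive setups described in the notes above might look like this. The path is a placeholder borrowed from the Examples section of this document.

    ```yaml
    # Option 1: search space read from a file
    searchSpacePath: /nni/search_space.json
    useAnnotation: false

    # Option 2 (alternative, shown commented out): search space generated
    # from annotations in the trial code, so searchSpacePath is removed
    # useAnnotation: true
    ```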

* __nniManagerIp__
  * Description

    __nniManagerIp__ sets the IP address of the machine on which the NNI manager process runs. This field is optional; if it is not set, the IP of the eth0 device is used instead.

    Note: run ifconfig on the NNI manager's machine to check whether the eth0 device exists. If it does not, we recommend setting nniManagerIp explicitly.
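
    A minimal sketch of setting the field explicitly, for example when the machine has no eth0 device. The address is a placeholder.

    ```yaml
    # Placeholder address: replace with the IP of the machine running NNI manager
    nniManagerIp: 10.10.10.10
    ```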

* __logDir__
  * Description

    __logDir__ configures the directory in which the experiment's logs and data are stored. The default value is `<user home directory>/nni/experiment`.

* __logLevel__
  * Description

    __logLevel__ sets the log level for the experiment. Available log levels are: `trace, debug, info, warning, error, fatal`. The default value is `info`.
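
    Overriding both logging defaults could look like the sketch below. The directory is hypothetical, chosen only for illustration.

    ```yaml
    # Hypothetical directory for logs and experiment data
    logDir: /tmp/nni/experiment
    # one of: trace, debug, info, warning, error, fatal
    logLevel: debug
    ```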

* __tuner__
  * Description

    __tuner__ specifies the tuner algorithm used in the experiment. There are two ways to set the tuner. One is to use a tuner provided by the NNI SDK, which requires setting __builtinTunerName__ and __classArgs__. The other is to use your own tuner file, which requires setting __codeDir__, __classFileName__, __className__ and __classArgs__.
  * __builtinTunerName__ and __classArgs__
    * __builtinTunerName__

      __builtinTunerName__ specifies the name of a built-in tuner. The NNI SDK provides the following tuners: {__TPE__, __Random__, __Anneal__, __Evolution__, __BatchTuner__, __GridSearch__}.
    * __classArgs__

      __classArgs__ specifies the arguments of the tuner algorithm. If __builtinTunerName__ is one of {__TPE__, __Random__, __Anneal__, __Evolution__}, the user should set __optimize_mode__.
  * __codeDir__, __classFileName__, __className__ and __classArgs__
    * __codeDir__

      __codeDir__ specifies the directory of the tuner code.
    * __classFileName__

      __classFileName__ specifies the name of the tuner file.
    * __className__

      __className__ specifies the name of the tuner class.
    * __classArgs__

      __classArgs__ specifies the arguments of the tuner algorithm.
  * __gpuNum__

    __gpuNum__ specifies the number of GPUs used to run the tuner process. The value of this field should be a positive number.

    Note: only one way of setting the tuner may be used, either {builtinTunerName, classArgs} or {codeDir, classFileName, className, classArgs}, not both.
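
    As a small sketch of the built-in route, the fragment below picks a tuner that is not in the list of tuners requiring __optimize_mode__. Leaving out classArgs entirely is an assumption of this sketch and is not stated in this document.

    ```yaml
    tuner:
      # GridSearch is not among the tuners listed as needing optimize_mode,
      # so classArgs is omitted here (assumption of this sketch)
      builtinTunerName: GridSearch
    ```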

* __assessor__

  * Description

    __assessor__ specifies the assessor algorithm used in the experiment. There are two ways to set the assessor. One is to use an assessor provided by the NNI SDK, which requires setting __builtinAssessorName__ and __classArgs__. The other is to use your own assessor file, which requires setting __codeDir__, __classFileName__, __className__ and __classArgs__.
  * __builtinAssessorName__ and __classArgs__
    * __builtinAssessorName__

      __builtinAssessorName__ specifies the name of a built-in assessor. The NNI SDK provides one assessor: {__Medianstop__}.
    * __classArgs__

      __classArgs__ specifies the arguments of the assessor algorithm.

  * __codeDir__, __classFileName__, __className__ and __classArgs__

    * __codeDir__

      __codeDir__ specifies the directory of the assessor code.

    * __classFileName__

      __classFileName__ specifies the name of the assessor file.

    * __className__

      __className__ specifies the name of the assessor class.

    * __classArgs__

      __classArgs__ specifies the arguments of the assessor algorithm.

  * __gpuNum__

    __gpuNum__ specifies the number of GPUs used to run the assessor process. The value of this field should be a positive number.

    Note: only one way of setting the assessor may be used, either {builtinAssessorName, classArgs} or {codeDir, classFileName, className, classArgs}, not both. If you do not want to use an assessor, leave the assessor field empty.

* __trial(local, remote)__

  * __command__

    __command__ specifies the command used to run the trial process.

  * __codeDir__

    __codeDir__ specifies the directory containing your trial files.

  * __gpuNum__

    __gpuNum__ specifies the number of GPUs used to run the trial process. The default value is 0.

* __trial(pai)__

  * __command__

    __command__ specifies the command used to run the trial process.

  * __codeDir__

    __codeDir__ specifies the directory containing your trial files.

  * __gpuNum__

    __gpuNum__ specifies the number of GPUs used to run the trial process. The default value is 0.

  * __cpuNum__

    __cpuNum__ is the number of CPUs to be used in the pai container.

  * __memoryMB__

    __memoryMB__ sets the memory size to be used in the pai container.

  * __image__

    __image__ sets the image to be used in pai.

  * __dataDir__

    __dataDir__ is the data directory in HDFS to be used.

  * __outputDir__

    __outputDir__ is the output directory in HDFS to be used in pai. The stdout and stderr files are stored in this directory after the job finishes.

* __trial(kubeflow)__

  * __codeDir__

    __codeDir__ is the local directory that contains the code files.

  * __ps(optional)__

    __ps__ is the configuration for kubeflow's tensorflow-operator.

    * __replicas__

      __replicas__ is the number of replicas of the __ps__ role.

    * __command__

      __command__ is the script to run in the __ps__ container.

    * __gpuNum__

      __gpuNum__ sets the number of GPUs to be used in the __ps__ container.

    * __cpuNum__

      __cpuNum__ sets the number of CPUs to be used in the __ps__ container.

    * __memoryMB__

      __memoryMB__ sets the memory size of the container.

    * __image__

      __image__ sets the image to be used in __ps__.

  * __worker__

    __worker__ is the configuration for kubeflow's tensorflow-operator.

    * __replicas__

      __replicas__ is the number of replicas of the __worker__ role.

    * __command__

      __command__ is the script to run in the __worker__ container.

    * __gpuNum__

      __gpuNum__ sets the number of GPUs to be used in the __worker__ container.

    * __cpuNum__

      __cpuNum__ sets the number of CPUs to be used in the __worker__ container.

    * __memoryMB__

      __memoryMB__ sets the memory size of the container.

    * __image__

      __image__ sets the image to be used in __worker__.
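
  To illustrate how the optional __ps__ role sits next to __worker__, here is a sketch of a trial section that uses both. The script name, replica counts and resource sizes are invented for illustration; the image is the one used in the Examples section of this document.

  ```yaml
  trial:
    codeDir: .
    ps:
      replicas: 1
      command: python3 dist_mnist.py   # hypothetical script
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
    worker:
      replicas: 2
      command: python3 dist_mnist.py   # hypothetical script
      gpuNum: 1
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
  ```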

* __machineList__

  __machineList__ should be set if __trainingServicePlatform__ is set to remote; otherwise it should be empty.

  * __ip__

    __ip__ is the IP address of the remote machine.

  * __port__

    __port__ is the SSH port used to connect to the machine.

    Note: if port is left empty, the default value 22 is used.
  * __username__

    __username__ is the account on the remote machine.
  * __passwd__

    __passwd__ specifies the password of the account.

  * __sshKeyPath__

    If you use an SSH key to log in to the remote machine, set __sshKeyPath__ in the config file. __sshKeyPath__ is the path of the SSH key file, which must be valid.

    Note: if both passwd and sshKeyPath are set, NNI will try passwd.

  * __passphrase__

    __passphrase__ is used to protect the SSH key; it can be empty if the key has no passphrase.
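
  A sketch of a key-based entry that relies on the defaults described above. It assumes that omitting `port` counts as leaving it empty (so 22 would be used), and it leaves out `passphrase` because the hypothetical key is unprotected. The address, user and key path are placeholders.

  ```yaml
  machineList:
    # port omitted: assumed to fall back to the default 22
    - ip: 10.10.10.12
      username: test
      sshKeyPath: /nni/sshkey
  ```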

* __kubeflowConfig__:

  * __operator__

    __operator__ specifies the kubeflow operator to be used. NNI supports __tf-operator__ in the current version.

  * __storage__

    __storage__ specifies the storage type of kubeflow, one of {__nfs__, __azureStorage__}. This field is optional, and the default value is __nfs__. If the config uses azureStorage, this field must be set.

  * __nfs__

    __server__ is the host of the NFS server.

    __path__ is the mounted path of NFS.

  * __keyVault__

    If you want to use Azure Kubernetes Service, set keyVault to store the private key of your Azure storage account. Refer to https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2

    * __vaultName__

      __vaultName__ is the value of `--vault-name` used in the az command.

    * __name__

      __name__ is the value of `--name` used in the az command.

  * __azureStorage__

    If you use Azure Kubernetes Service, set an Azure storage account to store the code files.

    * __accountName__

      __accountName__ is the name of the Azure storage account.

    * __azureShare__

      __azureShare__ is the share of the Azure file storage.
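
  As a sketch of setting __storage__ explicitly for Azure, the fragment below combines the fields described above. The vault, key, account and share names are placeholders.

  ```yaml
  kubeflowConfig:
    operator: tf-operator
    # explicit storage type; nfs is the default when this field is omitted
    storage: azureStorage
    keyVault:
      vaultName: Contoso-Vault
      name: AzureStorageAccountKey
    azureStorage:
      accountName: storage
      azureShare: share01
  ```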

* __paiConfig__

  * __userName__

    __userName__ is the user name of your pai account.

  * __passWord__

    __passWord__ is the password of the pai account.

  * __host__

    __host__ is the host of pai.

<a name="Examples"></a>
## Examples

* __local mode__

  To run trial jobs on the local machine and use annotation to generate the search space, use the following config:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  #choice: true, false
  useAnnotation: true
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

  You can also add an assessor configuration:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  assessor:
    #choice: Medianstop
    builtinAssessorName: Medianstop
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

  Or you can specify your own tuner and assessor files as follows:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    codeDir: /nni/tuner
    classFileName: mytuner.py
    className: MyTuner
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  assessor:
    codeDir: /nni/assessor
    classFileName: myassessor.py
    className: MyAssessor
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

* __remote mode__

  To run trial jobs on remote machines, specify the remote machine information in the following format:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: remote
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  #machineList can be empty if the platform is local
  machineList:
    - ip: 10.10.10.10
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.11
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.12
      port: 22
      username: test
      sshKeyPath: /nni/sshkey
      passphrase: qwert
  ```

* __pai mode__

  ```yaml
  authorName: test
  experimentName: nni_test1
  trialConcurrency: 1
  maxExecDuration: 500h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: pai
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution, BatchTuner
    #SMAC (SMAC should be installed through nnictl)
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 main.py
    codeDir: .
    gpuNum: 4
    cpuNum: 2
    memoryMB: 10000
    #The docker image to run NNI job on pai
    image: msranni/nni:latest
    #The hdfs directory to store data on pai, format 'hdfs://host:port/directory'
    dataDir: hdfs://10.11.12.13:9000/test
    #The hdfs directory to store output data generated by NNI, format 'hdfs://host:port/directory'
    outputDir: hdfs://10.11.12.13:9000/test
  paiConfig:
    #The username to login pai
    userName: test
    #The password to login pai
    passWord: test
    #The host of restful server of pai
    host: 10.10.10.10
  ```

* __kubeflow mode__

  kubeflow with NFS storage:

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    nfs:
      server: 10.10.10.10
      path: /var/nfs/general
  ```

  kubeflow with Azure storage:

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  #nniManagerIp: 10.10.10.10
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    builtinAssessorName: Medianstop
    classArgs:
      optimize_mode: maximize
    gpuNum: 0
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 4096
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    keyVault:
      vaultName: Contoso-Vault
      name: AzureStorageAccountKey
    azureStorage:
      accountName: storage
      azureShare: share01
  ```