# Experiment Config Reference

A config file is needed when creating an experiment. The path of the config file is provided to `nnictl`.
The config file is in YAML format.
This document describes the rules for writing the config file and provides some examples and templates.

- [Experiment Config Reference](#experiment-config-reference)
  * [Template](#template)
  * [Configuration Spec](#configuration-spec)
    + [authorName](#authorname)
    + [experimentName](#experimentname)
    + [trialConcurrency](#trialconcurrency)
    + [maxExecDuration](#maxexecduration)
    + [versionCheck](#versioncheck)
    + [debug](#debug)
    + [maxTrialNum](#maxtrialnum)
    + [trainingServicePlatform](#trainingserviceplatform)
    + [searchSpacePath](#searchspacepath)
    + [useAnnotation](#useannotation)
    + [multiPhase](#multiphase)
    + [multiThread](#multithread)
    + [nniManagerIp](#nnimanagerip)
    + [logDir](#logdir)
    + [logLevel](#loglevel)
    + [logCollection](#logcollection)
    + [tuner](#tuner)
      - [builtinTunerName](#builtintunername)
      - [codeDir](#codedir)
      - [classFileName](#classfilename)
      - [className](#classname)
      - [classArgs](#classargs)
      - [gpuIndices](#gpuindices)
      - [includeIntermediateResults](#includeintermediateresults)
    + [assessor](#assessor)
      - [builtinAssessorName](#builtinassessorname)
      - [codeDir](#codedir-1)
      - [classFileName](#classfilename-1)
      - [className](#classname-1)
      - [classArgs](#classargs-1)
    + [advisor](#advisor)
      - [builtinAdvisorName](#builtinadvisorname)
      - [codeDir](#codedir-2)
      - [classFileName](#classfilename-2)
      - [className](#classname-2)
      - [classArgs](#classargs-2)
      - [gpuIndices](#gpuindices-1)
    + [trial](#trial)
    + [localConfig](#localconfig)
      - [gpuIndices](#gpuindices-2)
      - [maxTrialNumPerGpu](#maxtrialnumpergpu)
      - [useActiveGpu](#useactivegpu)
    + [machineList](#machinelist)
      - [ip](#ip)
      - [port](#port)
      - [username](#username)
      - [passwd](#passwd)
      - [sshKeyPath](#sshkeypath)
      - [passphrase](#passphrase)
      - [gpuIndices](#gpuindices-3)
      - [maxTrialNumPerGpu](#maxtrialnumpergpu-1)
      - [useActiveGpu](#useactivegpu-1)
    + [kubeflowConfig](#kubeflowconfig)
      - [operator](#operator)
      - [storage](#storage)
      - [nfs](#nfs)
      - [keyVault](#keyvault)
      - [azureStorage](#azurestorage)
      - [uploadRetryCount](#uploadretrycount)
    + [paiConfig](#paiconfig)
      - [userName](#username-1)
      - [password](#password)
      - [token](#token)
      - [host](#host)
  * [Examples](#examples)
    + [Local mode](#local-mode)
    + [Remote mode](#remote-mode)
    + [PAI mode](#pai-mode)
    + [Kubeflow mode](#kubeflow-mode)
    + [Kubeflow with azure storage](#kubeflow-with-azure-storage)

## Template

* __Light weight (without Annotation and Assessor)__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiPhase:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Assessor__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiPhase:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Annotation__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiPhase:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

## Configuration Spec

### authorName

Required. String.

The name of the author who creates the experiment.

*TBD: add default value.*

### experimentName

Required. String.

The name of the experiment created.

*TBD: add default value.*

### trialConcurrency

Required. Integer between 1 and 99999.

Specifies the maximum number of trial jobs that run simultaneously.

If the GPUs requested per trial (trialGpuNum) exceed the number of free GPUs, so that the number of simultaneously running trial jobs cannot reach __trialConcurrency__, the remaining trial jobs are queued until GPUs can be allocated.

### maxExecDuration

Optional. String. Default: 999d.

__maxExecDuration__ specifies the maximum duration of an experiment. The value is a number followed by a unit: __s__ (seconds), __m__ (minutes), __h__ (hours), or __d__ (days).

Note: __maxExecDuration__ limits the duration of the experiment, not of a single trial job. When the experiment reaches the maximum duration, it does not stop, but it can no longer submit new trial jobs.

### versionCheck

Optional. Bool. Default: true.

NNI checks the version of the nniManager process against the version of trialKeeper on the remote, pai and kubernetes platforms. To disable the version check, set __versionCheck__ to false.

### debug

Optional. Bool. Default: false.

Debug mode sets __versionCheck__ to false and __logLevel__ to `debug`.

### maxTrialNum

Optional. Integer between 1 and 99999. Default: 99999.

Specifies the max number of trial jobs created by NNI, including succeeded and failed jobs.
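
As a minimal sketch of how the limits described above fit together, the fragment below sets them side by side. The values are illustrative only, not recommendations.

```yaml
# Illustrative values only
trialConcurrency: 4    # at most 4 trial jobs run simultaneously
maxExecDuration: 2h    # number followed by a unit: s, m, h or d
maxTrialNum: 50        # counts both succeeded and failed trial jobs
```

With such a configuration, new trial jobs would stop being submitted once either the duration or the trial count limit is reached.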

### trainingServicePlatform

Required. String.

Specifies the platform to run the experiment. Supported values are __local__, __remote__, __pai__, __kubeflow__, and __frameworkcontroller__.

* __local__ runs the experiment on the local Ubuntu machine.

* __remote__ submits trial jobs to remote Ubuntu machines. The __machineList__ field must be filled in so that NNI can set up SSH connections to the remote machines.

* __pai__ submits trial jobs to Microsoft's [OpenPAI](https://github.com/Microsoft/pai). For more details on pai configuration, refer to the [Guide to PAI Mode](../TrainingService/PaiMode.md).

* __kubeflow__ submits trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/). NNI supports kubeflow on standard kubernetes and on [azure kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/). For details, refer to the [Kubeflow Docs](../TrainingService/KubeflowMode.md).

* TODO: explain frameworkcontroller.

### searchSpacePath

Optional. Path to existing file.

Specifies the path of the search space file, which should be a valid path on the local Linux machine.

__searchSpacePath__ may be omitted only when `useAnnotation=True`.

### useAnnotation

Optional. Bool. Default: false.

Uses annotations to analyze the trial code and generate the search space.

Note: if __useAnnotation__ is true, the __searchSpacePath__ field should be removed.
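
The two fields are mutually exclusive, which the following sketch illustrates. The file path is the one used in the examples later in this document; treat it as a placeholder.

```yaml
# Option 1: the search space is defined in a JSON file
searchSpacePath: /nni/search_space.json
useAnnotation: false

# Option 2: the search space is generated from annotations in the trial code
# (remove searchSpacePath in this case)
# useAnnotation: true
```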

### multiPhase

Optional. Bool. Default: false.

Enable [multi-phase experiment](../AdvancedFeature/MultiPhase.md).

### multiThread

Optional. Bool. Default: false.

Enables multi-thread mode for the dispatcher. If __multiThread__ is enabled, the dispatcher starts a thread to process each command from NNI Manager.

### nniManagerIp

Optional. String. Default: eth0 device IP.

Sets the IP address of the machine on which the NNI manager process runs. If it is not set, the IP address of the eth0 device is used.

Note: run `ifconfig` on the NNI manager's machine to check whether the eth0 device exists. If it does not, setting __nniManagerIp__ explicitly is recommended.
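
For a machine without an eth0 device, the setting might look like the sketch below. The address is a placeholder, not a value taken from NNI.

```yaml
# 10.0.0.10 is a placeholder; use an address of the machine
# on which the NNI manager process runs
nniManagerIp: 10.0.0.10
```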

### logDir

Optional. Path to a directory. Default: `<user home directory>/nni/experiment`.

Configures the directory to store logs and data of the experiment.

### logLevel

Optional. String. Default: `info`.

Sets the log level for the experiment. Available log levels are: `trace`, `debug`, `info`, `warning`, `error`, `fatal`.

### logCollection

Optional. `http` or `none`. Default: `none`.

Sets how logs are collected on the remote, pai, kubeflow and frameworkcontroller platforms. With `http`, the trial keeper posts log content back through HTTP requests, which may slow down log processing in trialKeeper. With `none`, the trial keeper does not post log content back and only posts job metrics. If your log content is large, consider setting this parameter to `none`.
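
The three logging fields can be combined as in the sketch below. The directory is a hypothetical path chosen for illustration.

```yaml
logDir: /home/user/nni-logs   # hypothetical directory
logLevel: debug               # one of: trace, debug, info, warning, error, fatal
logCollection: none           # or http
```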

### tuner

Required.

Specifies the tuner algorithm of the experiment. There are two ways to set the tuner. One is to use a tuner provided by the NNI SDK (built-in tuners), in which case you need to set __builtinTunerName__ and __classArgs__. The other is to use your own tuner file, in which case __codeDir__, __classFileName__, __className__ and __classArgs__ are needed. *Users must choose exactly one way.*

#### builtinTunerName

Required if using built-in tuners. String.

Specifies the name of a built-in tuner. The tuners provided by the NNI SDK are introduced [here](../Tuner/BuiltinTuner.md).

#### codeDir

Required if using customized tuners. Path relative to the location of config file.

Specifies the directory of tuner code.

#### classFileName

Required if using customized tuners. File path relative to __codeDir__.

Specifies the name of tuner file.

#### className

Required if using customized tuners. String.

Specifies the name of tuner class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of the tuner algorithm. Please refer to [this file](../Tuner/BuiltinTuner.md) for the configurable arguments of each built-in tuner.

#### gpuIndices

Optional. String. Default: empty.

Specifies the GPUs that can be used by the tuner process. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by commas (`,`), for example `1` or `0,1,3`. If the field is not set, no GPU will be visible to the tuner (by setting `CUDA_VISIBLE_DEVICES` to an empty string).

#### includeIntermediateResults

Optional. Bool. Default: false.

If __includeIntermediateResults__ is true, the last intermediate result of a trial that is early stopped by the assessor is sent to the tuner as its final result.
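
A built-in tuner section that uses the optional fields above could look like the following sketch. The GPU indices are illustrative, and the string is quoted so that YAML keeps it as text.

```yaml
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
  gpuIndices: "0,1"                  # GPUs 0 and 1 are visible to the tuner process
  includeIntermediateResults: true   # only has an effect when an assessor stops trials early
```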

### assessor

Specifies the assessor algorithm to run an experiment. Similar to tuners, there are two ways to set the assessor. One is to use an assessor provided by the NNI SDK, in which case users need to set __builtinAssessorName__ and __classArgs__. The other is to use your own assessor file, in which case users need to set __codeDir__, __classFileName__, __className__ and __classArgs__. *Users must choose exactly one way.*

By default, no assessor is enabled.

#### builtinAssessorName

Required if using built-in assessors. String.

Specifies the name of a built-in assessor. The assessors provided by the NNI SDK are introduced [here](../Assessor/BuiltinAssessor.md).

#### codeDir

Required if using customized assessors. Path relative to the location of config file.

Specifies the directory of assessor code.

#### classFileName

Required if using customized assessors. File path relative to __codeDir__.

Specifies the name of assessor file.

#### className

Required if using customized assessors. String.

Specifies the name of assessor class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of assessor algorithm.

### advisor

Optional.

Specifies the advisor algorithm of the experiment. Similar to tuners and assessors, there are two ways to specify the advisor. One is to use an advisor provided by the NNI SDK, in which case you need to set __builtinAdvisorName__ and __classArgs__. The other is to use your own advisor file, in which case you need to set __codeDir__, __classFileName__, __className__ and __classArgs__.

When advisor is enabled, the settings of tuners and assessors are bypassed.

#### builtinAdvisorName

Specifies the name of a built-in advisor. The NNI SDK provides [BOHB](../Tuner/BohbAdvisor.md) and [Hyperband](../Tuner/HyperbandAdvisor.md).

#### codeDir

Required if using customized advisors. Path relative to the location of config file.

Specifies the directory of advisor code.

#### classFileName

Required if using customized advisors. File path relative to __codeDir__.

Specifies the name of advisor file.

#### className

Required if using customized advisors. String.

Specifies the name of advisor class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of advisor.

#### gpuIndices

Optional. String. Default: empty.

Specifies the GPUs that can be used by the advisor process. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by commas (`,`), for example `1` or `0,1,3`. If the field is not set, no GPU will be visible to the advisor (by setting `CUDA_VISIBLE_DEVICES` to an empty string).
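
A built-in advisor section might be written as in the sketch below. The `optimize_mode` argument is an assumption carried over from the tuner templates; check the linked advisor pages for the arguments each advisor actually accepts.

```yaml
advisor:
  builtinAdvisorName: Hyperband
  classArgs:
    optimize_mode: maximize   # assumed argument, see the Hyperband page
```

Because an enabled advisor bypasses the tuner and assessor settings, those sections can be left out of such a config.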

### trial

Required. Key-value pairs.

In local and remote mode, the following keys are used.

* __command__: Required string. Specifies the command to run the trial process.

* __codeDir__: Required string. Specifies the directory of your own trial file. This directory will be automatically uploaded in remote mode.

* __gpuNum__: Optional integer. Specifies the number of GPUs to run the trial process. Default value is 0.

In PAI mode, the following keys are used.

* __command__: Required string. Specifies the command to run the trial process.

* __codeDir__: Required string. Specifies the directory of your own trial file. Files in the directory will be uploaded in PAI mode.

* __gpuNum__: Required integer. Specifies the number of GPUs to run the trial process. Default value is 0.

* __cpuNum__: Required integer. Specifies the number of CPUs to be used in the pai container.

* __memoryMB__: Required integer. Sets the memory size to be used in the pai container, in megabytes.

* __image__: Required string. Sets the image to be used in pai.

* __authFile__: Optional string. Used to provide Docker registry authentication when pulling the image in PAI requires it. [Reference](https://github.com/microsoft/pai/blob/2ea69b45faa018662bc164ed7733f6fdbb4c42b3/docs/faq.md#q-how-to-use-private-docker-registry-job-image-when-submitting-an-openpai-job).

* __shmMB__: Optional integer. Shared memory size of the container.

* __portList__: List of key-value pairs with `label`, `beginAt`, `portNumber`. See [job tutorial of PAI](https://github.com/microsoft/pai/blob/master/docs/job_tutorial.md) for details.
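
The PAI keys listed above can be sketched as a `trial` section like the one below. The command and code directory are reused from the examples later in this document; the image name, resource sizes, and port label are assumptions for illustration.

```yaml
trial:
  command: python3 mnist.py
  codeDir: /nni/mnist
  gpuNum: 0
  cpuNum: 1
  memoryMB: 8192
  image: msranni/nni:latest   # assumed image name; substitute your own
  shmMB: 1024                 # optional
  portList:                   # optional
    - label: tensorboard      # hypothetical label
      beginAt: 0
      portNumber: 1
```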

In Kubeflow mode, the following keys are used.

* __codeDir__: The local directory that contains the code files.

* __ps__: An optional configuration for kubeflow's tensorflow-operator, which includes

    * __replicas__: The replica number of __ps__ role.

    * __command__: The run script in __ps__'s container.

    * __gpuNum__: The number of GPUs to be used in the __ps__ container.

    * __cpuNum__: The number of CPUs to be used in the __ps__ container.

    * __memoryMB__: The memory size of the container.

    * __image__: The image to be used in __ps__.

* __worker__: An optional configuration for kubeflow's tensorflow-operator.

    * __replicas__: The replica number of __worker__ role.

    * __command__: The run script in __worker__'s container.

    * __gpuNum__: The number of GPUs to be used in the __worker__ container.

    * __cpuNum__: The number of CPUs to be used in the __worker__ container.

    * __memoryMB__: The memory size of the container.

    * __image__: The image to be used in __worker__.
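
The nesting of the Kubeflow keys listed above can be sketched as follows. Commands, replica counts, resource sizes, and the image name are placeholders chosen for illustration.

```yaml
trial:
  codeDir: /nni/mnist
  ps:
    replicas: 1
    command: python3 mnist.py   # placeholder command
    gpuNum: 0
    cpuNum: 1
    memoryMB: 8192
    image: msranni/nni:latest   # assumed image name
  worker:
    replicas: 2
    command: python3 mnist.py   # placeholder command
    gpuNum: 1
    cpuNum: 1
    memoryMB: 8192
    image: msranni/nni:latest   # assumed image name
```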

### localConfig

Optional in local mode. Key-value pairs.

Only applicable if __trainingServicePlatform__ is set to `local`; otherwise the configuration file should not contain a __localConfig__ section.

#### gpuIndices

Optional. String. Default: none.

Specifies the GPU devices designated for NNI. If it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified. Multiple GPU indices should be separated by commas (`,`), such as `1` or `0,1,3`. By default, all available GPUs are used.

#### maxTrialNumPerGpu

Optional. Integer. Default: 99999.

Specifies the maximum number of trials that can run concurrently on a GPU device.

#### useActiveGpu

Optional. Bool. Default: false.

Specifies whether to use a GPU that is already running another process. By default, NNI uses a GPU only if no other active process is running on it. If __useActiveGpu__ is set to true, NNI uses the GPU regardless of other processes. This field is not applicable for NNI on Windows.
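
A __localConfig__ section using these three fields might look like the sketch below. The indices and limits are illustrative.

```yaml
trainingServicePlatform: local
localConfig:
  gpuIndices: "0,1"       # only GPUs 0 and 1 are used for trial jobs
  maxTrialNumPerGpu: 2    # up to 2 concurrent trials per GPU
  useActiveGpu: false     # skip GPUs that already run another process
```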

### machineList

Required in remote mode. A list of key-value pairs with the following keys.

#### ip

Required. IP address that is accessible from the current machine.

The IP address of the remote machine.

#### port

Optional. Integer. Valid port. Default: 22.

The SSH port used to connect to the machine.

#### username

Required if authenticating with username/password. String.

The account on the remote machine.

#### passwd

Required if authenticating with username/password. String.

Specifies the password of the account.

#### sshKeyPath

Required if authenticating with an SSH key. Path to private key file.

If users log in to the remote machine with an SSH key, __sshKeyPath__ should be a valid path to an SSH key file.

*Note: if users set passwd and sshKeyPath simultaneously, NNI will try passwd first.*

#### passphrase

Optional. String.

The passphrase that protects the SSH key. It can be left empty if the key has no passphrase.

#### gpuIndices

Optional. String. Default: none.

Specifies the GPU devices designated for NNI. If it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified. Multiple GPU indices should be separated by commas (`,`), such as `1` or `0,1,3`. By default, all available GPUs are used.

#### maxTrialNumPerGpu

Optional. Integer. Default: 99999.

Specifies the maximum number of trials that can run concurrently on a GPU device.

#### useActiveGpu

Optional. Bool. Default: false.

Specifies whether to use a GPU that is already running another process. By default, NNI uses a GPU only if no other active process is running on it. If __useActiveGpu__ is set to true, NNI uses the GPU regardless of other processes. This field is not applicable for NNI on Windows.
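
The two authentication styles can be sketched in one __machineList__ as shown below. All addresses, account names, passwords, and paths are placeholders.

```yaml
trainingServicePlatform: remote
machineList:
  # Password authentication
  - ip: 10.1.1.1            # placeholder address
    port: 22
    username: bob           # placeholder account
    passwd: bob123          # placeholder password
  # SSH key authentication
  - ip: 10.1.1.2            # placeholder address
    username: bob
    sshKeyPath: /home/bob/.ssh/id_rsa
    passphrase: qwert       # omit if the key has no passphrase
    gpuIndices: "0,1"
    maxTrialNumPerGpu: 1
```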

### kubeflowConfig

#### operator

Required. String. Has to be `tf-operator` or `pytorch-operator`.

Specifies the kubeflow operator to be used. NNI supports `tf-operator` and `pytorch-operator` in the current version.

#### storage

Optional. String. Default: `nfs`.

Specifies the storage type of kubeflow, either `nfs` or `azureStorage`.

#### nfs

Required if using nfs. Key-value pairs.

* __server__ is the host of the NFS server.

* __path__ is the mounted path of NFS.

#### keyVault

Required if using azure storage. Key-value pairs.

Set __keyVault__ to store the private key of your azure storage account. Refer to https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2.

* __vaultName__ is the value of `--vault-name` used in the az command.

* __name__ is the value of `--name` used in the az command.

#### azureStorage

Required if using azure storage. Key-value pairs.

Sets the azure storage account used to store code files.

* __accountName__ is the name of the azure storage account.

* __azureShare__ is the share of the azure file storage.

#### uploadRetryCount

Required if using azure storage. Integer between 1 and 99999.

If uploading files to azure storage fails, NNI retries the upload. This field specifies the number of attempts to re-upload files.
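
The two storage variants can be sketched as follows. The NFS variant is active and the Azure variant is shown commented out. Hosts, paths, vault names, and account names are placeholders.

```yaml
trainingServicePlatform: kubeflow
kubeflowConfig:
  operator: tf-operator
  storage: nfs
  nfs:
    server: 10.10.10.10   # placeholder NFS host
    path: /var/nfs/nni    # placeholder mounted path

# Alternative with azure storage (all names are placeholders):
# kubeflowConfig:
#   operator: tf-operator
#   storage: azureStorage
#   keyVault:
#     vaultName: myVault
#     name: myStorageKey
#   azureStorage:
#     accountName: mystorageaccount
#     azureShare: myshare
#   uploadRetryCount: 3
```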

### paiConfig

#### userName

Required. String.

The user name of your pai account.

#### password

Required if using password authentication. String.

The password of the pai account.

#### token

Required if using token authentication. String.

Personal access token that can be retrieved from the PAI portal.

#### host

Required. String.

The hostname or IP address of PAI.
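
A token-based __paiConfig__ could be sketched as below. Every value is a placeholder; with password authentication, `password` would take the place of `token`.

```yaml
trainingServicePlatform: pai
paiConfig:
  userName: your_pai_account   # placeholder
  token: your_pai_token        # placeholder
  host: 10.1.1.1               # placeholder hostname or IP address of PAI
```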

## Examples

### Local mode

To run trial jobs on the local machine and use annotation to generate the search space, use the following config:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  #choice: true, false
  useAnnotation: true
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

You can also add an assessor configuration.

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    #choice: Medianstop
    builtinAssessorName: Medianstop
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

Or you could specify your own tuner and assessor files as follows:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    codeDir: /nni/tuner
    classFileName: mytuner.py
    className: MyTuner
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    codeDir: /nni/assessor
    classFileName: myassessor.py
    className: MyAssessor
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

### Remote mode

To run trial jobs on remote machines, specify the remote machine information in the following format:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: remote
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  #machineList can be empty if the platform is local
  machineList:
    - ip: 10.10.10.10
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.11
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.12
      port: 22
      username: test
      sshKeyPath: /nni/sshkey
      passphrase: qwert
  ```

### PAI mode

  ```yaml
  authorName: test
  experimentName: nni_test1
  trialConcurrency: 1
  maxExecDuration: 500h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: pai
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution, BatchTuner
    #SMAC (SMAC should be installed through nnictl)
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 main.py
    codeDir: .
    gpuNum: 4
    cpuNum: 2
    memoryMB: 10000
    #The docker image to run NNI job on pai
    image: msranni/nni:latest
  paiConfig:
    #The username to login pai
    userName: test
    #The password to login pai
    passWord: test
    #The host of restful server of pai
    host: 10.10.10.10
  ```
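The example above authenticates with a password. The `token` field described in the configuration spec can be used instead. The fragment below is a minimal sketch of that variant, assuming `token` sits alongside `userName` and `host` in `paiConfig` and replaces the password entry. The token value is a placeholder, not a real credential.

  ```yaml
  paiConfig:
    #The username to login pai
    userName: test
    #Personal access token from the PAI portal (placeholder value)
    token: your_personal_access_token
    #The host of restful server of pai
    host: 10.10.10.10
  ```

The rest of the experiment config would stay as in the password-based example.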

### Kubeflow mode

Kubeflow with NFS storage:

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    nfs:
      server: 10.10.10.10
      path: /var/nfs/general
  ```

### Kubeflow with Azure storage

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  #nniManagerIp: 10.10.10.10
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    builtinAssessorName: Medianstop
    classArgs:
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 4096
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    keyVault:
      vaultName: Contoso-Vault
      name: AzureStorageAccountKey
    azureStorage:
      accountName: storage
      azureShare: share01
  ```