# Experiment Config Reference

A config file is required to create an experiment. The path of the config file is passed to `nnictl`.
The config file is written in YAML.
This document describes the rules for writing the config file and provides some examples and templates.

- [Experiment Config Reference](#experiment-config-reference)
  * [Template](#template)
  * [Configuration Spec](#configuration-spec)
    + [authorName](#authorname)
    + [experimentName](#experimentname)
    + [trialConcurrency](#trialconcurrency)
    + [maxExecDuration](#maxexecduration)
    + [versionCheck](#versioncheck)
    + [debug](#debug)
    + [maxTrialNum](#maxtrialnum)
    + [trainingServicePlatform](#trainingserviceplatform)
    + [searchSpacePath](#searchspacepath)
    + [useAnnotation](#useannotation)
    + [multiThread](#multithread)
    + [nniManagerIp](#nnimanagerip)
    + [logDir](#logdir)
    + [logLevel](#loglevel)
    + [logCollection](#logcollection)
    + [tuner](#tuner)
      - [builtinTunerName](#builtintunername)
      - [codeDir](#codedir)
      - [classFileName](#classfilename)
      - [className](#classname)
      - [classArgs](#classargs)
      - [gpuIndices](#gpuindices)
      - [includeIntermediateResults](#includeintermediateresults)
    + [assessor](#assessor)
      - [builtinAssessorName](#builtinassessorname)
      - [codeDir](#codedir-1)
      - [classFileName](#classfilename-1)
      - [className](#classname-1)
      - [classArgs](#classargs-1)
    + [advisor](#advisor)
      - [builtinAdvisorName](#builtinadvisorname)
      - [codeDir](#codedir-2)
      - [classFileName](#classfilename-2)
      - [className](#classname-2)
      - [classArgs](#classargs-2)
      - [gpuIndices](#gpuindices-1)
    + [trial](#trial)
    + [localConfig](#localconfig)
      - [gpuIndices](#gpuindices-2)
      - [maxTrialNumPerGpu](#maxtrialnumpergpu)
      - [useActiveGpu](#useactivegpu)
    + [machineList](#machinelist)
      - [ip](#ip)
      - [port](#port)
      - [username](#username)
      - [passwd](#passwd)
      - [sshKeyPath](#sshkeypath)
      - [passphrase](#passphrase)
      - [gpuIndices](#gpuindices-3)
      - [maxTrialNumPerGpu](#maxtrialnumpergpu-1)
      - [useActiveGpu](#useactivegpu-1)
    + [kubeflowConfig](#kubeflowconfig)
      - [operator](#operator)
      - [storage](#storage)
      - [nfs](#nfs)
      - [keyVault](#keyvault)
      - [azureStorage](#azurestorage)
      - [uploadRetryCount](#uploadretrycount)
    + [paiConfig](#paiconfig)
      - [userName](#username-1)
      - [password](#password)
      - [token](#token)
      - [host](#host)
      - [reuse](#reuse)
  * [Examples](#examples)
    + [Local mode](#local-mode)
    + [Remote mode](#remote-mode)
    + [PAI mode](#pai-mode)
    + [Kubeflow mode](#kubeflow-mode)
    + [Kubeflow with azure storage](#kubeflow-with-azure-storage)

## Template

* __Light weight (without Annotation and Assessor)__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Assessor__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Annotation__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

## Configuration Spec

### authorName

Required. String.

The name of the author who creates the experiment.

*TBD: add default value.*

### experimentName

Required. String.

The name of the experiment created.

*TBD: add default value.*

### trialConcurrency

Required. Integer between 1 and 99999.

Specifies the maximum number of trial jobs that run simultaneously.

If trialGpuNum is larger than the number of free GPUs, and the number of trial jobs running simultaneously cannot reach __trialConcurrency__, some trial jobs are put into a queue to wait for GPU allocation.

### maxExecDuration

Optional. String. Default: 999d.

__maxExecDuration__ specifies the maximum duration of an experiment. The time unit is one of {__s__, __m__, __h__, __d__}, meaning {_seconds_, _minutes_, _hours_, _days_}.

Note: __maxExecDuration__ limits the duration of the experiment, not of a single trial job. When the experiment reaches the maximum duration, it does not stop, but it can no longer submit new trial jobs.
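
As a minimal sketch of how the duration string is written, the fragment below caps an experiment at two hours while allowing up to four trials at once. The values `4` and `2h` are illustrative only and are not recommendations.

```yaml
# Illustrative values; adjust to your own experiment.
trialConcurrency: 4     # at most 4 trial jobs run at the same time
maxExecDuration: 2h     # a number followed by a unit: s, m, h or d
```

After two hours the experiment stays alive but submits no further trials, as described in the note above.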

### versionCheck

Optional. Bool. Default: true.

NNI checks the version of the nniManager process against the version of trialKeeper on the remote, pai and kubernetes platforms. To disable the version check, set __versionCheck__ to false.

### debug

Optional. Bool. Default: false.

Debug mode sets __versionCheck__ to false and __logLevel__ to `debug`.
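
The relationship between these fields can be sketched as follows. This fragment only restates the description above; it assumes that setting the two fields by hand is equivalent to enabling debug mode, which this document does not state explicitly.

```yaml
# Shorthand: enable debug mode.
debug: true

# According to the description above, debug mode applies these settings:
# versionCheck: false
# logLevel: debug
```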

### maxTrialNum

Optional. Integer between 1 and 99999. Default: 99999.

Specifies the max number of trial jobs created by NNI, including succeeded and failed jobs.

### trainingServicePlatform

Required. String.

Specifies the platform on which to run the experiment. Supported values are __local__, __remote__, __pai__, __kubeflow__ and __frameworkcontroller__.

* __local__: run the experiment on the local Ubuntu machine.

* __remote__: submit trial jobs to remote Ubuntu machines. The __machineList__ field must be filled in so that NNI can set up SSH connections to the remote machines.

* __pai__: submit trial jobs to Microsoft's [OpenPAI](https://github.com/Microsoft/pai). For details of the pai configuration, refer to the [Guide to PAI Mode](../TrainingService/PaiMode.md).

* __kubeflow__: submit trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/). NNI supports kubeflow on normal kubernetes and on [azure kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/). For details, refer to the [Kubeflow Docs](../TrainingService/KubeflowMode.md).

* TODO: explain frameworkcontroller.

### searchSpacePath

Optional. Path to existing file.

Specifies the path of the search space file, which must be a valid path on the local Linux machine.

__searchSpacePath__ may be omitted only when `useAnnotation=True`.

### useAnnotation

Optional. Bool. Default: false.

Use annotation to analyze the trial code and generate the search space.

Note: if __useAnnotation__ is true, the __searchSpacePath__ field should be removed.
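
The two mutually exclusive ways of supplying a search space can be sketched as below. The path `/nni/search_space.json` is a placeholder borrowed from the examples later in this document.

```yaml
# Option 1: the search space is defined in a file.
searchSpacePath: /nni/search_space.json
useAnnotation: false

# Option 2: the search space is generated from annotations in the trial code.
# In this case searchSpacePath is left out entirely.
# useAnnotation: true
```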

### multiThread

Optional. Bool. Default: false.

Enables multi-thread mode for the dispatcher. If __multiThread__ is enabled, the dispatcher starts a thread to process each command from NNI Manager.

### nniManagerIp

Optional. String. Default: eth0 device IP.

Sets the IP address of the machine on which the NNI manager process runs. If this field is not set, the IP address of the eth0 device is used.

Note: run `ifconfig` on the NNI manager's machine to check whether the eth0 device exists. If it does not, setting __nniManagerIp__ explicitly is recommended.
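
For a machine without an eth0 device, the field could be set explicitly as in this sketch. The address `192.0.2.10` is a placeholder from the documentation address range, not a real NNI host.

```yaml
# Placeholder address; replace it with an IP of the NNI manager machine
# that the trial machines can reach.
nniManagerIp: 192.0.2.10
```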

### logDir

Optional. Path to a directory. Default: `<user home directory>/nni-experiments`.

Configures the directory in which the logs and data of the experiment are stored.

### logLevel

Optional. String. Default: `info`.

Sets the log level for the experiment. Available log levels are: `trace`, `debug`, `info`, `warning`, `error`, `fatal`.

### logCollection

Optional. `http` or `none`. Default: `none`.

Sets how logs are collected on the remote, pai, kubeflow and frameworkcontroller platforms. With `http`, trial keeper posts log content back through HTTP requests, which may slow down log processing in trialKeeper. With `none`, trial keeper does not post log content back and only posts job metrics. If your log content is very large, consider setting this parameter to `none`.
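
The three logging fields can be combined as in the following sketch. The directory `/data/nni-logs` is a hypothetical path chosen for illustration.

```yaml
# Hypothetical directory; the default is <user home directory>/nni-experiments.
logDir: /data/nni-logs
# One of: trace, debug, info, warning, error, fatal.
logLevel: info
# http posts trial log content back; none posts only job metrics.
logCollection: none
```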

### tuner

Required.

Specifies the tuner algorithm of the experiment. There are two ways to set the tuner. One is to use a tuner provided by the NNI SDK (a built-in tuner), in which case you need to set __builtinTunerName__ and __classArgs__. The other is to use your own tuner file, in which case __codeDir__, __classFileName__, __className__ and __classArgs__ are needed. *Users must choose exactly one way.*

#### builtinTunerName

Required if using built-in tuners. String.

Specifies the name of a built-in tuner. The NNI SDK provides several tuners, introduced [here](../Tuner/BuiltinTuner.md).

#### codeDir

Required if using customized tuners. Path relative to the location of config file.

Specifies the directory of the tuner code.

#### classFileName

Required if using customized tuners. File path relative to __codeDir__.

Specifies the name of the tuner file.

#### className

Required if using customized tuners. String.

Specifies the name of the tuner class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of the tuner algorithm. Please refer to [this file](../Tuner/BuiltinTuner.md) for the configurable arguments of each built-in tuner.

#### gpuIndices

Optional. String. Default: empty.

Specifies the GPUs that can be used by the tuner process. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by commas (`,`), for example `1` or `0,1,3`. If the field is not set, no GPU is visible to the tuner (`CUDA_VISIBLE_DEVICES` is set to an empty string).

#### includeIntermediateResults

Optional. Bool. Default: false.

If __includeIntermediateResults__ is true, the last intermediate result of a trial that is early stopped by the assessor is sent to the tuner as its final result.
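
A sketch of a built-in tuner that uses the two optional fields above, paired with an assessor so that early stopping can actually occur. The tuner and assessor names come from the templates in this document; the GPU indices are placeholders, and quoting them as a string is a precaution rather than a documented requirement.

```yaml
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
  # Placeholder indices: GPUs 0 and 1 are visible to the tuner process.
  gpuIndices: "0,1"
  # Early-stopped trials report their last intermediate result as final.
  includeIntermediateResults: true
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
```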

### assessor

Specifies the assessor algorithm used in the experiment. As with tuners, there are two ways to set the assessor. One is to use an assessor provided by the NNI SDK, in which case you need to set __builtinAssessorName__ and __classArgs__. The other is to use your own assessor file, in which case you need to set __codeDir__, __classFileName__, __className__ and __classArgs__. *Users must choose exactly one way.*

By default, no assessor is enabled.

#### builtinAssessorName

Required if using built-in assessors. String.

Specifies the name of a built-in assessor. The NNI SDK provides several assessors, introduced [here](../Assessor/BuiltinAssessor.md).

#### codeDir

Required if using customized assessors. Path relative to the location of config file.

Specifies the directory of the assessor code.

#### classFileName

Required if using customized assessors. File path relative to __codeDir__.

Specifies the name of the assessor file.

#### className

Required if using customized assessors. String.

Specifies the name of the assessor class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of the assessor algorithm.

### advisor

Optional.

Specifies the advisor algorithm of the experiment. As with tuners and assessors, there are two ways to specify the advisor. One is to use an advisor provided by the NNI SDK, in which case you need to set __builtinAdvisorName__ and __classArgs__. The other is to use your own advisor file, in which case you need to set __codeDir__, __classFileName__, __className__ and __classArgs__.

When advisor is enabled, the settings of tuners and assessors are bypassed.

#### builtinAdvisorName

Specifies the name of a built-in advisor. The NNI SDK provides [BOHB](../Tuner/BohbAdvisor.md) and [Hyperband](../Tuner/HyperbandAdvisor.md).

#### codeDir

Required if using customized advisors. Path relative to the location of config file.

Specifies the directory of the advisor code.

#### classFileName

Required if using customized advisors. File path relative to __codeDir__.

Specifies the name of the advisor file.

#### className

Required if using customized advisors. String.

Specifies the name of the advisor class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of the advisor.

#### gpuIndices

Optional. String. Default: empty.

Specifies the GPUs that can be used by the advisor process. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by commas (`,`), for example `1` or `0,1,3`. If the field is not set, no GPU is visible to the advisor process (`CUDA_VISIBLE_DEVICES` is set to an empty string).
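
A minimal sketch of a built-in advisor section. `BOHB` is one of the two advisor names listed above. The `optimize_mode` argument mirrors the tuner templates and is an assumption here; check the BOHB page linked above for the arguments it actually accepts.

```yaml
advisor:
  builtinAdvisorName: BOHB
  classArgs:
    # Assumed argument, mirroring the tuner templates in this document.
    optimize_mode: maximize
  # Placeholder index: only GPU 0 is visible to the advisor process.
  gpuIndices: "0"
```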

### trial

Required. Key-value pairs.

In local and remote mode, the following keys are supported.

* __command__: Required string. Specifies the command to run the trial process.

* __codeDir__: Required string. Specifies the directory of your own trial file. This directory is automatically uploaded in remote mode.

* __gpuNum__: Optional integer. Specifies the number of GPUs used to run the trial process. Default value is 0.

In PAI mode, the following keys are supported.

* __command__: Required string. Specifies the command to run the trial process.

* __codeDir__: Required string. Specifies the directory of your own trial file. Files in the directory are uploaded in PAI mode.

* __gpuNum__: Required integer. Specifies the number of GPUs used to run the trial process. Default value is 0.

* __cpuNum__: Required integer. Specifies the number of CPUs to be used in the pai container.

* __memoryMB__: Required integer. Sets the memory size to be used in the pai container, in megabytes.

* __image__: Required string. Sets the image to be used in pai.

* __authFile__: Optional string. Used to provide the Docker registry that requires authentication for image pull in PAI. [Reference](https://github.com/microsoft/pai/blob/2ea69b45faa018662bc164ed7733f6fdbb4c42b3/docs/faq.md#q-how-to-use-private-docker-registry-job-image-when-submitting-an-openpai-job).

* __shmMB__: Optional integer. Shared memory size of the container.

* __portList__: List of key-value pairs with `label`, `beginAt`, `portNumber`. See the [job tutorial of PAI](https://github.com/microsoft/pai/blob/master/docs/job_tutorial.md) for details.
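
The required PAI keys listed above can be sketched as the following trial section. The command and code directory are borrowed from the examples later in this document; the resource numbers and the image name `example/nni-trial:latest` are hypothetical placeholders.

```yaml
trial:
  command: python3 mnist.py
  codeDir: /nni/mnist
  gpuNum: 0
  cpuNum: 1
  # Memory for the pai container, in megabytes.
  memoryMB: 8192
  # Hypothetical image name; use an image available to your PAI cluster.
  image: example/nni-trial:latest
```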

In Kubeflow mode, the following keys are supported.

* __codeDir__: The local directory that contains the code files.

* __ps__: An optional configuration for kubeflow's tensorflow-operator, which includes

    * __replicas__: The number of replicas of the __ps__ role.

    * __command__: The run script in the __ps__ container.

    * __gpuNum__: The number of GPUs to be used in the __ps__ container.

    * __cpuNum__: The number of CPUs to be used in the __ps__ container.

    * __memoryMB__: The memory size of the container.

    * __image__: The image to be used in __ps__.

* __worker__: An optional configuration for kubeflow's tensorflow-operator, which includes

    * __replicas__: The number of replicas of the __worker__ role.

    * __command__: The run script in the __worker__ container.

    * __gpuNum__: The number of GPUs to be used in the __worker__ container.

    * __cpuNum__: The number of CPUs to be used in the __worker__ container.

    * __memoryMB__: The memory size of the container.

    * __image__: The image to be used in __worker__.
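
The nesting of the __ps__ and __worker__ keys can be sketched as follows. All values are placeholders: the command and code directory are borrowed from the examples later in this document, and the image name `example/nni-trial:latest` is hypothetical.

```yaml
trial:
  codeDir: /nni/mnist
  ps:
    replicas: 1
    command: python3 mnist.py
    gpuNum: 0
    cpuNum: 1
    memoryMB: 8192
    image: example/nni-trial:latest   # hypothetical image
  worker:
    replicas: 2
    command: python3 mnist.py
    gpuNum: 1
    cpuNum: 1
    memoryMB: 8192
    image: example/nni-trial:latest   # hypothetical image
```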

### localConfig

Optional in local mode. Key-value pairs.

Only applicable if __trainingServicePlatform__ is set to `local`; otherwise the configuration file should not contain a __localConfig__ section.

#### gpuIndices

Optional. String. Default: none.

Specifies the GPU devices designated for NNI. If it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified. Multiple GPU indices should be separated by commas (`,`), such as `1` or `0,1,3`. By default, all available GPUs are used.

#### maxTrialNumPerGpu

Optional. Integer. Default: 1.

Specifies the maximum number of trials that can run concurrently on a GPU device.

#### useActiveGpu

Optional. Bool. Default: false.

Specifies whether to use a GPU on which another process is running. By default, NNI uses a GPU only if no other process is active on it. If __useActiveGpu__ is set to true, NNI uses the GPU regardless of other processes. This field is not applicable to NNI on Windows.
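
A sketch of a __localConfig__ section that restricts NNI to two GPUs and lets two trials share each of them. The indices and counts are illustrative only.

```yaml
trainingServicePlatform: local
localConfig:
  # Placeholder indices: only GPUs 0 and 1 are used for trial jobs.
  gpuIndices: "0,1"
  # Up to 2 trials may run concurrently on each GPU.
  maxTrialNumPerGpu: 2
  # Use a GPU even if another process is already active on it.
  useActiveGpu: true
```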

### machineList

Required in remote mode. A list of key-value pairs with the following keys.

#### ip

Required. IP address or host name that is accessible from the current machine.

The IP address or host name of the remote machine.

#### port

Optional. Integer. Valid port. Default: 22.

The SSH port used to connect to the machine.

#### username

Required if authenticating with username/password. String.

The account on the remote machine.

#### passwd

Required if authenticating with username/password. String.

Specifies the password of the account.

#### sshKeyPath

Required if authenticating with SSH key. Path to private key file.

If users log in to the remote machine with an SSH key, __sshKeyPath__ should be a valid path to an SSH key file.

*Note: if users set passwd and sshKeyPath simultaneously, NNI will try passwd first.*

#### passphrase

Optional. String.

The passphrase that protects the SSH key. It can be left empty if the key has no passphrase.

#### gpuIndices

Optional. String. Default: none.

Specifies the GPU devices designated for NNI. If it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified. Multiple GPU indices should be separated by commas (`,`), such as `1` or `0,1,3`. By default, all available GPUs are used.

#### maxTrialNumPerGpu

Optional. Integer. Default: 1.

Specifies the maximum number of trials that can run concurrently on a GPU device.

#### useActiveGpu

Optional. Bool. Default: false.

Specifies whether to use a GPU on which another process is running. By default, NNI uses a GPU only if no other process is active on it. If __useActiveGpu__ is set to true, NNI uses the GPU regardless of other processes. This field is not applicable to NNI on Windows.
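
The two authentication styles described above can be sketched in one __machineList__. Every address, account name, secret and path below is a placeholder, and supplying `username` alongside `sshKeyPath` is an assumption, since this document only marks it as required for password authentication.

```yaml
trainingServicePlatform: remote
machineList:
  # Machine 1: username/password authentication.
  - ip: 192.0.2.11
    port: 22
    username: alice
    passwd: "<password>"
  # Machine 2: SSH key authentication, with per-machine GPU settings.
  - ip: 192.0.2.12
    username: bob
    sshKeyPath: /home/bob/.ssh/id_rsa
    passphrase: "<passphrase>"
    gpuIndices: "0,1"
    maxTrialNumPerGpu: 1
    useActiveGpu: false
```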

### kubeflowConfig

#### operator

Required. String. Has to be `tf-operator` or `pytorch-operator`.

Specifies the kubeflow operator to be used. NNI supports `tf-operator` in the current version.

#### storage

Optional. String. Default: `nfs`.

Specifies the storage type of kubeflow, including `nfs` and `azureStorage`.

#### nfs

Required if using nfs. Key-value pairs.

* __server__ is the host of the nfs server.

* __path__ is the mounted path of nfs.
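
A sketch of a __kubeflowConfig__ section backed by nfs. The server address and mounted path are placeholders.

```yaml
trainingServicePlatform: kubeflow
kubeflowConfig:
  operator: tf-operator
  storage: nfs
  nfs:
    server: 192.0.2.20        # placeholder nfs host
    path: /var/nfs/general    # placeholder mounted path
```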

#### keyVault

Required if using azure storage. Key-value pairs.

Set __keyVault__ to store the private key of your azure storage account. Refer to https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2.

* __vaultName__ is the value of `--vault-name` used in the az command.

* __name__ is the value of `--name` used in the az command.

#### azureStorage

Required if using azure storage. Key-value pairs.

Sets the azure storage account used to store code files.

* __accountName__ is the name of the azure storage account.

* __azureShare__ is the share of the azure file storage.

#### uploadRetryCount

Required if using azure storage. Integer between 1 and 99999.

If uploading files to azure storage fails, NNI retries the upload. This field specifies the number of attempts to re-upload files.
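
A sketch of a __kubeflowConfig__ section backed by azure storage. The vault, key, account and share names are hypothetical, and placing __uploadRetryCount__ directly under __kubeflowConfig__ follows the outline of this document.

```yaml
trainingServicePlatform: kubeflow
kubeflowConfig:
  operator: tf-operator
  storage: azureStorage
  keyVault:
    vaultName: my-vault          # value passed as --vault-name to az
    name: my-storage-key         # value passed as --name to az
  azureStorage:
    accountName: mystorageaccount
    azureShare: myshare
  # Number of attempts to re-upload files after a failed upload.
  uploadRetryCount: 5
```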

### paiConfig

#### userName

Required. String.

The user name of your pai account.

#### password

Required if using password authentication. String.

The password of the pai account.

#### token

Required if using token authentication. String.

Personal access token that can be retrieved from the PAI portal.

#### host

Required. String.

The hostname or IP address of PAI.

#### reuse

Optional. Bool. Default: `false`. This is an experimental feature.

If it is true, NNI reuses OpenPAI jobs to run as many trials as possible, which saves the time of creating new jobs. Users need to make sure that each trial can run independently in the same job, for example by avoiding loading checkpoints from previous trials.
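
A sketch of a __paiConfig__ section that uses token authentication. The user name, token and host are placeholders.

```yaml
trainingServicePlatform: pai
paiConfig:
  userName: alice
  # Use either token or password, depending on the authentication method.
  token: "<personal-access-token>"
  host: 192.0.2.30
  # Experimental: reuse OpenPAI jobs across trials.
  reuse: false
```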

## Examples

### Local mode

If users want to run trial jobs on the local machine and use annotation to generate the search space, they can use the following config:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  #choice: true, false
  useAnnotation: true
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

You can also add an assessor configuration.

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    #choice: Medianstop
    builtinAssessorName: Medianstop
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

Alternatively, you can specify your own tuner and assessor files as follows:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    codeDir: /nni/tuner
    classFileName: mytuner.py
    className: MyTuner
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    codeDir: /nni/assessor
    classFileName: myassessor.py
    className: MyAssessor
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

### Remote mode

To run trial jobs on remote machines, specify the remote machine information in the following format:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: remote
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  #machineList can be empty if the platform is local
  machineList:
    - ip: 10.10.10.10
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.11
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.12
      port: 22
      username: test
      sshKeyPath: /nni/sshkey
      passphrase: qwert
  ```

### PAI mode

  ```yaml
  authorName: test
  experimentName: nni_test1
  trialConcurrency: 1
  maxExecDuration: 500h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: pai
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution, BatchTuner
    #SMAC (SMAC should be installed through nnictl)
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 main.py
    codeDir: .
    gpuNum: 4
    cpuNum: 2
    memoryMB: 10000
    #The docker image to run NNI job on pai
    image: msranni/nni:latest
  paiConfig:
    #The username to login pai
    userName: test
    #The password to login pai
    passWord: test
    #The host of restful server of pai
    host: 10.10.10.10
  ```

### Kubeflow mode

  Kubeflow with NFS storage:

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    nfs:
      server: 10.10.10.10
      path: /var/nfs/general
  ```

### Kubeflow with Azure storage

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  #nniManagerIp: 10.10.10.10
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    builtinAssessorName: Medianstop
    classArgs:
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 4096
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    keyVault:
      vaultName: Contoso-Vault
      name: AzureStorageAccountKey
    azureStorage:
      accountName: storage
      azureShare: share01
  ```