# Experiment Config Reference

Creating an experiment requires a config file, whose path is passed to `nnictl`.
The config file is written in YAML.
This document describes the rules for writing the config file and provides examples and templates.

- [Experiment Config Reference](#experiment-config-reference)
  * [Template](#template)
  * [Configuration Spec](#configuration-spec)
    + [authorName](#authorname)
    + [experimentName](#experimentname)
    + [trialConcurrency](#trialconcurrency)
    + [maxExecDuration](#maxexecduration)
    + [versionCheck](#versioncheck)
    + [debug](#debug)
    + [maxTrialNum](#maxtrialnum)
    + [trainingServicePlatform](#trainingserviceplatform)
    + [searchSpacePath](#searchspacepath)
    + [useAnnotation](#useannotation)
    + [multiThread](#multithread)
    + [nniManagerIp](#nnimanagerip)
    + [logDir](#logdir)
    + [logLevel](#loglevel)
    + [logCollection](#logcollection)
    + [tuner](#tuner)
      - [builtinTunerName](#builtintunername)
      - [codeDir](#codedir)
      - [classFileName](#classfilename)
      - [className](#classname)
      - [classArgs](#classargs)
      - [gpuIndices](#gpuindices)
      - [includeIntermediateResults](#includeintermediateresults)
    + [assessor](#assessor)
      - [builtinAssessorName](#builtinassessorname)
      - [codeDir](#codedir-1)
      - [classFileName](#classfilename-1)
      - [className](#classname-1)
      - [classArgs](#classargs-1)
    + [advisor](#advisor)
      - [builtinAdvisorName](#builtinadvisorname)
      - [codeDir](#codedir-2)
      - [classFileName](#classfilename-2)
      - [className](#classname-2)
      - [classArgs](#classargs-2)
      - [gpuIndices](#gpuindices-1)
    + [trial](#trial)
    + [localConfig](#localconfig)
      - [gpuIndices](#gpuindices-2)
      - [maxTrialNumPerGpu](#maxtrialnumpergpu)
      - [useActiveGpu](#useactivegpu)
    + [machineList](#machinelist)
      - [ip](#ip)
      - [port](#port)
      - [username](#username)
      - [passwd](#passwd)
      - [sshKeyPath](#sshkeypath)
      - [passphrase](#passphrase)
      - [gpuIndices](#gpuindices-3)
      - [maxTrialNumPerGpu](#maxtrialnumpergpu-1)
      - [useActiveGpu](#useactivegpu-1)
    + [kubeflowConfig](#kubeflowconfig)
      - [operator](#operator)
      - [storage](#storage)
      - [nfs](#nfs)
      - [keyVault](#keyvault)
      - [azureStorage](#azurestorage)
      - [uploadRetryCount](#uploadretrycount)
    + [paiConfig](#paiconfig)
      - [userName](#username-1)
      - [password](#password)
      - [token](#token)
      - [host](#host)
  * [Examples](#examples)
    + [Local mode](#local-mode)
    + [Remote mode](#remote-mode)
    + [PAI mode](#pai-mode)
    + [Kubeflow mode](#kubeflow-mode)
    + [Kubeflow with azure storage](#kubeflow-with-azure-storage)

## Template

* __Light weight (without Annotation and Assessor)__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Assessor__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Annotation__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

## Configuration Spec

### authorName

Required. String.

The name of the author who creates the experiment.

*TBD: add default value.*

### experimentName

Required. String.

The name of the experiment.

*TBD: add default value.*

### trialConcurrency

Required. Integer between 1 and 99999.

Specifies the maximum number of trial jobs that run simultaneously.

If the number of GPUs requested by a trial (__gpuNum__) is larger than the number of free GPUs, so that the number of simultaneously running trial jobs cannot reach __trialConcurrency__, some trial jobs are put into a queue to wait for GPU allocation.

### maxExecDuration

Optional. String. Default: 999d.

__maxExecDuration__ specifies the maximum duration of an experiment. The value is a number followed by a unit suffix, one of {__s__, __m__, __h__, __d__}, meaning {_seconds_, _minutes_, _hours_, _days_}.

Note: __maxExecDuration__ limits the duration of the experiment, not of a single trial job. When the experiment reaches the maximum duration, it does not stop, but it can no longer submit new trial jobs.

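As a small illustration of the format described above, the fragment below caps an experiment at two hours. The value `2h` is only an example; the other values in the comment follow the same number-plus-unit pattern.

```yaml
# Illustrative value only: no new trial jobs are submitted after 2 hours.
# Other values of the same form: 3600s, 30m, 1d.
maxExecDuration: 2h
```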
### versionCheck

Optional. Bool. Default: true.

NNI checks the version of the nniManager process against the version of trialKeeper on the remote, pai and kubernetes platforms. To disable the version check, set __versionCheck__ to false.

### debug

Optional. Bool. Default: false.

Debug mode sets __versionCheck__ to false and sets __logLevel__ to `debug`.

### maxTrialNum

Optional. Integer between 1 and 99999. Default: 99999.

Specifies the maximum number of trial jobs created by NNI, including succeeded and failed jobs.

### trainingServicePlatform

Required. String.

Specifies the platform on which to run the experiment. Supported values are __local__, __remote__, __pai__, __kubeflow__ and __frameworkcontroller__.

* __local__ runs an experiment on the local Ubuntu machine.

* __remote__ submits trial jobs to remote Ubuntu machines. The __machineList__ field must be filled in to set up the SSH connections to the remote machines.

* __pai__ submits trial jobs to Microsoft's [OpenPAI](https://github.com/Microsoft/pai). For more details on pai configuration, please refer to the [Guide to PAI Mode](../TrainingService/PaiMode.md).

* __kubeflow__ submits trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/). NNI supports kubeflow on standard kubernetes and on [azure kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/). For details, please refer to the [Kubeflow Docs](../TrainingService/KubeflowMode.md).

* TODO: explain frameworkcontroller.

### searchSpacePath

Optional. Path to existing file.

Specifies the path of the search space file, which should be a valid path on the local Linux machine.

__searchSpacePath__ may be omitted only when `useAnnotation=True`.

### useAnnotation

Optional. Bool. Default: false.

Use annotation to analyze the trial code and generate the search space.

Note: if __useAnnotation__ is true, the __searchSpacePath__ field should be removed.

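The two fields are alternatives, which the following sketch tries to make concrete. The file path is a placeholder, and the second option is shown commented out so the fragment stays a single valid config.

```yaml
# Option 1: the search space comes from a file (placeholder path).
searchSpacePath: /nni/search_space.json
useAnnotation: false

# Option 2: the search space is generated from annotations in the trial
# code, so searchSpacePath is removed.
# useAnnotation: true
```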
### multiThread

Optional. Bool. Default: false.

Enables multi-thread mode for the dispatcher. If __multiThread__ is enabled, the dispatcher starts a thread to process each command from NNI Manager.

### nniManagerIp

Optional. String. Default: eth0 device IP.

Sets the IP address of the machine on which the NNI manager process runs. If it is not set, the IP address of the eth0 device is used.

Note: run `ifconfig` on the NNI manager's machine to check whether the eth0 device exists. If it does not, setting __nniManagerIp__ explicitly is recommended.

### logDir

Optional. Path to a directory. Default: `<user home directory>/nni/experiment`.

Configures the directory in which to store the logs and data of the experiment.

### logLevel

Optional. String. Default: `info`.

Sets the log level for the experiment. Available log levels are: `trace`, `debug`, `info`, `warning`, `error`, `fatal`.

### logCollection

Optional. `http` or `none`. Default: `none`.

Sets how logs are collected on the remote, pai, kubeflow and frameworkcontroller platforms. There are two options:

* `http`: trial keeper posts the log content back through HTTP requests. This may slow down log processing in trialKeeper.

* `none`: trial keeper does not post the log content back, and only posts job metrics.

If your log content is too big, consider setting this parameter to `none`.

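A minimal sketch of how these top-level fields might appear together in a config file. The IP address and directory are hypothetical placeholders, not defaults.

```yaml
# Placeholder values for illustration; adjust to your environment.
nniManagerIp: 10.0.0.5        # hypothetical address of the NNI manager machine
logDir: /home/user/nni-logs   # hypothetical directory for logs and data
logLevel: debug               # one of: trace, debug, info, warning, error, fatal
logCollection: none           # one of: http, none
```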
### tuner

Required.

Specifies the tuner algorithm of the experiment. There are two ways to set the tuner. One is to use a tuner provided by the NNI sdk (built-in tuners), in which case you need to set __builtinTunerName__ and __classArgs__. The other is to use your own tuner file, in which case __codeDir__, __classFileName__, __className__ and __classArgs__ are needed. *Users must choose exactly one way.*

#### builtinTunerName

Required if using built-in tuners. String.

Specifies the name of a built-in tuner. The tuners provided by the NNI sdk are introduced [here](../Tuner/BuiltinTuner.md).

#### codeDir

Required if using customized tuners. Path relative to the location of config file.

Specifies the directory of the tuner code.

#### classFileName

Required if using customized tuners. File path relative to __codeDir__.

Specifies the name of the tuner file.

#### className

Required if using customized tuners. String.

Specifies the name of the tuner class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of the tuner algorithm. Please refer to [this file](../Tuner/BuiltinTuner.md) for the configurable arguments of each built-in tuner.

#### gpuIndices

Optional. String. Default: empty.

Specifies the GPUs that can be used by the tuner process. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by a comma (`,`), for example `1` or `0,1,3`. If the field is not set, no GPU is visible to the tuner (`CUDA_VISIBLE_DEVICES` is set to an empty string).

#### includeIntermediateResults

Optional. Bool. Default: false.

If __includeIntermediateResults__ is true, the last intermediate result of a trial that is early stopped by the assessor is sent to the tuner as its final result.

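The sketch below shows what a customized tuner section could look like when the fields above are combined. The directory, file and class names are hypothetical, and whether `optimize_mode` applies depends on the tuner class itself.

```yaml
tuner:
  # Customized tuner: codeDir, classFileName and className replace builtinTunerName.
  codeDir: ./my_tuner            # hypothetical directory, relative to this config file
  classFileName: my_tuner.py     # hypothetical file inside codeDir
  className: MyTuner             # hypothetical class defined in my_tuner.py
  classArgs:
    optimize_mode: maximize      # assumed argument; depends on the tuner class
  gpuIndices: 0,1                # GPUs visible to the tuner process
  includeIntermediateResults: true
```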
### assessor

Specifies the assessor algorithm used to run an experiment. As with tuners, there are two ways to set the assessor. One is to use an assessor provided by the NNI sdk, in which case you need to set __builtinAssessorName__ and __classArgs__. The other is to use your own assessor file, in which case you need to set __codeDir__, __classFileName__, __className__ and __classArgs__. *Users must choose exactly one way.*

By default, no assessor is enabled.

#### builtinAssessorName

Required if using built-in assessors. String.

Specifies the name of a built-in assessor. The assessors provided by the NNI sdk are introduced [here](../Assessor/BuiltinAssessor.md).

#### codeDir

Required if using customized assessors. Path relative to the location of config file.

Specifies the directory of the assessor code.

#### classFileName

Required if using customized assessors. File path relative to __codeDir__.

Specifies the name of the assessor file.

#### className

Required if using customized assessors. String.

Specifies the name of the assessor class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of the assessor algorithm.

### advisor

Optional.

Specifies the advisor algorithm of the experiment. As with tuners and assessors, there are two ways to specify the advisor. One is to use an advisor provided by the NNI sdk, in which case you need to set __builtinAdvisorName__ and __classArgs__. The other is to use your own advisor file, in which case you need to set __codeDir__, __classFileName__, __className__ and __classArgs__.

When an advisor is enabled, the settings of tuners and assessors are bypassed.

#### builtinAdvisorName

Specifies the name of a built-in advisor. The NNI sdk provides [BOHB](../Tuner/BohbAdvisor.md) and [Hyperband](../Tuner/HyperbandAdvisor.md).

#### codeDir

Required if using customized advisors. Path relative to the location of config file.

Specifies the directory of the advisor code.

#### classFileName

Required if using customized advisors. File path relative to __codeDir__.

Specifies the name of the advisor file.

#### className

Required if using customized advisors. String.

Specifies the name of the advisor class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of the advisor.

#### gpuIndices

Optional. String. Default: empty.

Specifies the GPUs that can be used by the advisor process. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by a comma (`,`), for example `1` or `0,1,3`. If the field is not set, no GPU is visible to the advisor (`CUDA_VISIBLE_DEVICES` is set to an empty string).

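For illustration, a built-in advisor section might look like the fragment below. The `optimize_mode` argument is an assumption carried over from the tuner templates; the advisor's own page lists the arguments it actually accepts.

```yaml
advisor:
  # Built-in advisor; this document lists BOHB and Hyperband.
  builtinAdvisorName: Hyperband
  classArgs:
    # Assumed argument, mirroring the tuner templates above.
    optimize_mode: maximize
```

Because an enabled advisor bypasses the tuner and assessor settings, a config that uses this section does not need them.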
### trial

Required. Key-value pairs.

In local and remote mode, the following keys are available.

* __command__: Required string. Specifies the command to run the trial process.

* __codeDir__: Required string. Specifies the directory of your own trial file. This directory will be automatically uploaded in remote mode.

* __gpuNum__: Optional integer. Specifies the number of GPUs used to run the trial process. Default value is 0.

In PAI mode, the following keys are available.

* __command__: Required string. Specifies the command to run the trial process.

* __codeDir__: Required string. Specifies the directory of your own trial file. Files in the directory will be uploaded in PAI mode.

* __gpuNum__: Required integer. Specifies the number of GPUs used to run the trial process. Default value is 0.

* __cpuNum__: Required integer. Specifies the number of CPUs to be used in the pai container.

* __memoryMB__: Required integer. Sets the memory size to be used in the pai container, in megabytes.

* __image__: Required string. Sets the image to be used in pai.

* __authFile__: Optional string. Used when the image must be pulled from a Docker registry that requires authentication in PAI. [Reference](https://github.com/microsoft/pai/blob/2ea69b45faa018662bc164ed7733f6fdbb4c42b3/docs/faq.md#q-how-to-use-private-docker-registry-job-image-when-submitting-an-openpai-job).

* __shmMB__: Optional integer. Shared memory size of the container.

* __portList__: List of key-value pairs with `label`, `beginAt`, `portNumber`. See the [job tutorial of PAI](https://github.com/microsoft/pai/blob/master/docs/job_tutorial.md) for details.

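A rough sketch of a trial section for PAI mode, using the required keys listed above. All values, including the image name, are placeholders chosen for illustration.

```yaml
trial:
  command: python3 mnist.py
  codeDir: /nni/mnist
  gpuNum: 1
  cpuNum: 2
  memoryMB: 8192
  # Hypothetical image name; use an image available to your PAI cluster.
  image: example/nni-trial:latest
```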
In Kubeflow mode, the following keys are available.

* __codeDir__: The local directory that contains the code files.

* __ps__: An optional configuration for kubeflow's tensorflow-operator, which includes

    * __replicas__: The number of replicas of the __ps__ role.

    * __command__: The run script in the __ps__ container.

    * __gpuNum__: The number of GPUs to be used in the __ps__ container.

    * __cpuNum__: The number of CPUs to be used in the __ps__ container.

    * __memoryMB__: The memory size of the container.

    * __image__: The image to be used in __ps__.

* __worker__: An optional configuration for kubeflow's tensorflow-operator, which includes

    * __replicas__: The number of replicas of the __worker__ role.

    * __command__: The run script in the __worker__ container.

    * __gpuNum__: The number of GPUs to be used in the __worker__ container.

    * __cpuNum__: The number of CPUs to be used in the __worker__ container.

    * __memoryMB__: The memory size of the container.

    * __image__: The image to be used in __worker__.

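The nesting of the Kubeflow keys can be sketched as follows. The command, resource sizes and image name are placeholder values, not recommendations.

```yaml
trial:
  codeDir: /nni/mnist
  worker:
    replicas: 1
    command: python3 mnist.py
    gpuNum: 0
    cpuNum: 1
    memoryMB: 8192
    image: example/nni-trial:latest   # hypothetical image name
  ps:
    replicas: 1
    command: python3 mnist.py
    gpuNum: 0
    cpuNum: 1
    memoryMB: 8192
    image: example/nni-trial:latest   # hypothetical image name
```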
### localConfig

Optional in local mode. Key-value pairs.

Only applicable if __trainingServicePlatform__ is set to `local`; otherwise the configuration file should not contain a __localConfig__ section.

#### gpuIndices

Optional. String. Default: none.

Specifies the GPU devices designated for NNI. If it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified. Multiple GPU indices should be separated with a comma (`,`), such as `1` or `0,1,3`. By default, all available GPUs are used.

#### maxTrialNumPerGpu

Optional. Integer. Default: 1.

Specifies the maximum number of trials that run concurrently on a GPU device.

#### useActiveGpu

Optional. Bool. Default: false.

Specifies whether to use a GPU on which another process is running. By default, NNI uses a GPU only if it has no other active process. If __useActiveGpu__ is set to true, NNI uses the GPU regardless of other processes. This field is not applicable for NNI on Windows.

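As an example of the three fields above, this fragment restricts local trials to two GPUs and allows two trials per GPU. The indices and counts are illustrative only.

```yaml
trainingServicePlatform: local
localConfig:
  gpuIndices: 0,1          # only GPUs 0 and 1 are used for trial jobs
  maxTrialNumPerGpu: 2     # up to two concurrent trials per GPU
  useActiveGpu: false      # skip GPUs that already run another process
```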
### machineList

Required in remote mode. A list of key-value pairs with the following keys.

#### ip

Required. IP address that is accessible from the current machine.

The IP address of the remote machine.

#### port

Optional. Integer. Valid port. Default: 22.

The SSH port used to connect to the machine.

#### username

Required if authenticating with username/password. String.

The account on the remote machine.

#### passwd

Required if authenticating with username/password. String.

Specifies the password of the account.

#### sshKeyPath

Required if authenticating with an SSH key. Path to private key file.

If users log in to the remote machine with an SSH key, __sshKeyPath__ should be a valid path to an SSH key file.

*Note: if users set passwd and sshKeyPath simultaneously, NNI will try passwd first.*

#### passphrase

Optional. String.

The passphrase that protects the SSH key. It can be left empty if the key has no passphrase.

#### gpuIndices

Optional. String. Default: none.

Specifies the GPU devices designated for NNI. If it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified. Multiple GPU indices should be separated with a comma (`,`), such as `1` or `0,1,3`. By default, all available GPUs are used.

#### maxTrialNumPerGpu

Optional. Integer. Default: 99999.

Specifies the maximum number of trials that run concurrently on a GPU device.

#### useActiveGpu

Optional. Bool. Default: false.

Specifies whether to use a GPU on which another process is running. By default, NNI uses a GPU only if it has no other active process. If __useActiveGpu__ is set to true, NNI uses the GPU regardless of other processes. This field is not applicable for NNI on Windows.

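The two authentication styles described above could be written as in this sketch. Every address, account, password and key path is a made-up placeholder, and including `username` for the key-based machine is an assumption.

```yaml
machineList:
  # Machine 1: username/password authentication (placeholder values).
  - ip: 10.1.1.1
    port: 22
    username: bob
    passwd: bob123
  # Machine 2: SSH key authentication (placeholder values).
  - ip: 10.1.1.2
    username: bob                        # assumed: account used for the SSH login
    sshKeyPath: /home/bob/.ssh/id_rsa
    passphrase: qwert
    gpuIndices: 0,1
```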
### kubeflowConfig

#### operator

Required. String. Has to be `tf-operator` or `pytorch-operator`.

Specifies the kubeflow operator to be used. NNI supports `tf-operator` in the current version.

#### storage

Optional. String. Default: `nfs`.

Specifies the storage type of kubeflow, either `nfs` or `azureStorage`.

#### nfs

Required if using nfs. Key-value pairs.

* __server__ is the host of the nfs server.

* __path__ is the mounted path of nfs.

#### keyVault

Required if using azure storage. Key-value pairs.

Set __keyVault__ to store the private key of your azure storage account. Refer to https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2.

* __vaultName__ is the value of `--vault-name` used in the az command.

* __name__ is the value of `--name` used in the az command.

#### azureStorage

Required if using azure storage. Key-value pairs.

Sets the azure storage account used to store code files.

* __accountName__ is the name of the azure storage account.

* __azureShare__ is the share of the azure file storage.

#### uploadRetryCount

Required if using azure storage. Integer between 1 and 99999.

If uploading files to azure storage fails, NNI retries the upload. This field specifies the number of attempts to re-upload files.

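A minimal sketch of a __kubeflowConfig__ section that uses nfs storage. The server address and path are placeholders.

```yaml
trainingServicePlatform: kubeflow
kubeflowConfig:
  operator: tf-operator
  storage: nfs
  nfs:
    server: 10.10.10.10      # placeholder host of the nfs server
    path: /var/nfs/general   # placeholder mounted path
```

With azure storage, the `nfs` block would be replaced by the `keyVault`, `azureStorage` and `uploadRetryCount` fields described above.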
### paiConfig

#### userName

Required. String.

The user name of your pai account.

#### password

Required if using password authentication. String.

The password of the pai account.

#### token

Required if using token authentication. String.

Personal access token that can be retrieved from the PAI portal.

#### host

Required. String.

The hostname or IP address of PAI.

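For illustration, a __paiConfig__ section with password authentication might look like this. The account, password and host are placeholders.

```yaml
trainingServicePlatform: pai
paiConfig:
  userName: your_pai_account     # placeholder
  # Use either password or token, depending on the authentication method.
  password: your_pai_password    # placeholder
  host: 10.1.1.1                 # placeholder hostname or IP address of PAI
```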
## Examples

### Local mode

To run trial jobs on the local machine and use annotation to generate the search space, use the following config:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  #choice: true, false
  useAnnotation: true
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

You can also add an assessor configuration:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    #choice: Medianstop
    builtinAssessorName: Medianstop
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

Alternatively, you can specify your own tuner and assessor files as follows:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    codeDir: /nni/tuner
    classFileName: mytuner.py
    className: MyTuner
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    codeDir: /nni/assessor
    classFileName: myassessor.py
    className: MyAssessor
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

### Remote mode

To run trial jobs on remote machines, specify the remote machine information in the following format:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: remote
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  #machineList can be empty if the platform is local
  machineList:
    - ip: 10.10.10.10
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.11
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.12
      port: 22
      username: test
      sshKeyPath: /nni/sshkey
      passphrase: qwert
  ```

### PAI mode

  ```yaml
  authorName: test
  experimentName: nni_test1
  trialConcurrency: 1
  maxExecDuration: 500h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: pai
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution, BatchTuner
    #SMAC (SMAC should be installed through nnictl)
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 main.py
    codeDir: .
    gpuNum: 4
    cpuNum: 2
    memoryMB: 10000
    #The docker image to run NNI job on pai
    image: msranni/nni:latest
  paiConfig:
    #The username to login pai
    userName: test
    #The password to login pai
    passWord: test
    #The host of restful server of pai
    host: 10.10.10.10
  ```
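
The example above logs in with a password. As a hedged sketch of token authentication, the fragment below swaps `passWord` for a `token` entry. The key name `token` is assumed from the `paiConfig` field descriptions, and the value is a placeholder rather than a real token.

  ```yaml
  paiConfig:
    #The username to login pai
    userName: test
    #Assumed key name; placeholder for the personal access token from the PAI portal
    token: your_personal_access_token
    #The host of restful server of pai
    host: 10.10.10.10
  ```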

### Kubeflow mode

  Kubeflow with NFS storage:

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    nfs:
      server: 10.10.10.10
      path: /var/nfs/general
  ```

### Kubeflow with Azure storage

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  #nniManagerIp: 10.10.10.10
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    builtinAssessorName: Medianstop
    classArgs:
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 4096
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    keyVault:
      vaultName: Contoso-Vault
      name: AzureStorageAccountKey
    azureStorage:
      accountName: storage
      azureShare: share01
  ```