# Experiment Config Reference

A config file is required to create an experiment. The path of the config file is passed to `nnictl`.
The config file is written in YAML.
This document describes the rules for writing the config file and provides some examples and templates.

- [Experiment Config Reference](#experiment-config-reference)
  * [Template](#template)
  * [Configuration Spec](#configuration-spec)
    + [authorName](#authorname)
    + [experimentName](#experimentname)
    + [trialConcurrency](#trialconcurrency)
    + [maxExecDuration](#maxexecduration)
    + [versionCheck](#versioncheck)
    + [debug](#debug)
    + [maxTrialNum](#maxtrialnum)
    + [trainingServicePlatform](#trainingserviceplatform)
    + [searchSpacePath](#searchspacepath)
    + [useAnnotation](#useannotation)
    + [multiThread](#multithread)
    + [nniManagerIp](#nnimanagerip)
    + [logDir](#logdir)
    + [logLevel](#loglevel)
    + [logCollection](#logcollection)
    + [tuner](#tuner)
      - [builtinTunerName](#builtintunername)
      - [codeDir](#codedir)
      - [classFileName](#classfilename)
      - [className](#classname)
      - [classArgs](#classargs)
      - [gpuIndices](#gpuindices)
      - [includeIntermediateResults](#includeintermediateresults)
    + [assessor](#assessor)
      - [builtinAssessorName](#builtinassessorname)
      - [codeDir](#codedir-1)
      - [classFileName](#classfilename-1)
      - [className](#classname-1)
      - [classArgs](#classargs-1)
    + [advisor](#advisor)
      - [builtinAdvisorName](#builtinadvisorname)
      - [codeDir](#codedir-2)
      - [classFileName](#classfilename-2)
      - [className](#classname-2)
      - [classArgs](#classargs-2)
      - [gpuIndices](#gpuindices-1)
    + [trial](#trial)
    + [localConfig](#localconfig)
      - [gpuIndices](#gpuindices-2)
      - [maxTrialNumPerGpu](#maxtrialnumpergpu)
      - [useActiveGpu](#useactivegpu)
    + [machineList](#machinelist)
      - [ip](#ip)
      - [port](#port)
      - [username](#username)
      - [passwd](#passwd)
      - [sshKeyPath](#sshkeypath)
      - [passphrase](#passphrase)
      - [gpuIndices](#gpuindices-3)
      - [maxTrialNumPerGpu](#maxtrialnumpergpu-1)
      - [useActiveGpu](#useactivegpu-1)
      - [preCommand](#precommand)
    + [remoteConfig](#remoteconfig)
      - [reuse](#reuse)
    + [kubeflowConfig](#kubeflowconfig)
      - [operator](#operator)
      - [storage](#storage)
      - [nfs](#nfs)
      - [keyVault](#keyvault)
      - [azureStorage](#azurestorage)
      - [uploadRetryCount](#uploadretrycount)
    + [paiConfig](#paiconfig)
      - [userName](#username-1)
      - [password](#password)
      - [token](#token)
      - [host](#host)
      - [reuse](#reuse-1)
  * [Examples](#examples)
    + [Local mode](#local-mode)
    + [Remote mode](#remote-mode)
    + [PAI mode](#pai-mode)
    + [Kubeflow mode](#kubeflow-mode)
    + [Kubeflow with azure storage](#kubeflow-with-azure-storage)

## Template

* __Light weight (without Annotation and Assessor)__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Assessor__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Annotation__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

## Configuration Spec

### authorName

Required. String.

The name of the author who creates the experiment.

*TBD: add default value.*

### experimentName

Required. String.

The name of the experiment created.

*TBD: add default value.*

### trialConcurrency

Required. Integer between 1 and 99999.

Specifies the maximum number of trial jobs that run simultaneously.

If trialGpuNum exceeds the number of free GPUs, so that the number of simultaneously running trial jobs cannot reach __trialConcurrency__, some trial jobs are put into a queue to wait for GPU allocation.

### maxExecDuration

Optional. String. Default: 999d.

__maxExecDuration__ specifies the maximum duration of an experiment. The time unit is one of {__s__, __m__, __h__, __d__}, meaning {_seconds_, _minutes_, _hours_, _days_}.

Note: __maxExecDuration__ limits the duration of the experiment, not of a single trial job. When the experiment reaches the maximum duration, it does not stop, but it can no longer submit new trial jobs.

### versionCheck

Optional. Bool. Default: true.

NNI checks the version of the nniManager process and the version of trialKeeper on the remote, pai and kubernetes platforms. To disable the version check, set __versionCheck__ to false.

### debug

Optional. Bool. Default: false.

Debug mode sets __versionCheck__ to false and __logLevel__ to `debug`.
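As a small illustration, turning on debug mode is a one-line change. The commented lines only restate the two effects named in this section; they are not extra required fields.

```yaml
# Enable debug mode for the experiment.
debug: true
# According to the description above, this has the same effect on these fields as:
#   versionCheck: false
#   logLevel: debug
```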

### maxTrialNum

Optional. Integer between 1 and 99999. Default: 99999.

Specifies the maximum number of trial jobs created by NNI, including succeeded and failed jobs.
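The three limits described so far (__trialConcurrency__, __maxExecDuration__, __maxTrialNum__) can be combined as in the sketch below. The values are illustrative assumptions, not recommendations.

```yaml
# Run at most 2 trial jobs at the same time.
trialConcurrency: 2
# Stop submitting new trial jobs after 2 hours (units: s, m, h, d).
maxExecDuration: 2h
# Create at most 50 trial jobs in total, counting succeeded and failed jobs.
maxTrialNum: 50
```

With these settings the experiment stops submitting new trials as soon as either the time limit or the trial-count limit is reached.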

### trainingServicePlatform

Required. String.

Specifies the platform on which to run the experiment. Possible values are __local__, __remote__, __pai__, __kubeflow__ and __frameworkcontroller__.

* __local__ runs an experiment on the local Ubuntu machine.

* __remote__ submits trial jobs to remote Ubuntu machines. The __machineList__ field must be filled in to set up the SSH connection to the remote machines.

* __pai__ submits trial jobs to Microsoft's [OpenPAI](https://github.com/Microsoft/pai). For more details on pai configuration, please refer to [Guide to PAI Mode](../TrainingService/PaiMode.md).

* __kubeflow__ submits trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/). NNI supports kubeflow based on normal kubernetes and on [azure kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/). For details, please refer to [Kubeflow Docs](../TrainingService/KubeflowMode.md).

* TODO: explain frameworkcontroller.

### searchSpacePath

Optional. Path to existing file.

Specifies the path of the search space file, which must be a valid path on the local Linux machine.

__searchSpacePath__ may be left unset only when `useAnnotation=True`.

### useAnnotation

Optional. Bool. Default: false.

Use annotation to analyze the trial code and generate the search space.

Note: if __useAnnotation__ is true, the __searchSpacePath__ field should be removed.
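The two ways of supplying a search space are mutually exclusive. Below is a minimal sketch of both, written as two separate config fragments divided by `---`. The file path is a placeholder for your own search space file.

```yaml
# Variant 1: the search space is defined in a file.
searchSpacePath: /nni/search_space.json
useAnnotation: false
---
# Variant 2: the search space is generated from annotations in the trial code,
# so searchSpacePath is left out.
useAnnotation: true
```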

### multiThread

Optional. Bool. Default: false.

Enables multi-thread mode for the dispatcher. If __multiThread__ is enabled, the dispatcher starts a thread to process each command from NNI Manager.

### nniManagerIp

Optional. String. Default: eth0 device IP.

Sets the IP address of the machine on which the NNI manager process runs. If it is not set, the IP of the eth0 device is used instead.

Note: run `ifconfig` on the NNI manager's machine to check whether the eth0 device exists. If it does not, setting __nniManagerIp__ explicitly is recommended.

### logDir

Optional. Path to a directory. Default: `<user home directory>/nni-experiments`.

Configures the directory to store logs and data of the experiment.

### logLevel

Optional. String. Default: `info`.

Sets log level for the experiment. Available log levels are: `trace`, `debug`, `info`, `warning`, `error`, `fatal`.

### logCollection

Optional. `http` or `none`. Default: `none`.

Sets how logs are collected on the remote, pai, kubeflow and frameworkcontroller platforms. With `http`, the trial keeper posts log content back through HTTP requests, which may slow down log processing in trialKeeper. With `none`, the trial keeper does not post log content back and only posts job metrics. If your log content is too large, consider setting this parameter to `none`.
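The manager and logging fields above can be set together at the top level of the config file. This is a sketch only: the IP address and the directory are placeholders, not defaults.

```yaml
# Placeholder address of the machine running the NNI manager process.
nniManagerIp: 10.10.10.10
# Hypothetical directory for experiment logs and data.
logDir: /data/nni-experiments
# One of: trace, debug, info, warning, error, fatal.
logLevel: debug
# Only post job metrics back; do not send log content over HTTP.
logCollection: none
```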

### tuner

Required.

Specifies the tuner algorithm used in the experiment. There are two ways to set the tuner. One way is to use a tuner provided by the NNI SDK (a built-in tuner), in which case you need to set __builtinTunerName__ and __classArgs__. The other way is to use your own tuner file, in which case __codeDir__, __classFileName__, __className__ and __classArgs__ are needed. *Users must choose exactly one way.*

#### builtinTunerName

Required if using built-in tuners. String.

Specifies the name of a built-in tuner. The NNI SDK provides the tuners introduced [here](../Tuner/BuiltinTuner.md).

#### codeDir

Required if using customized tuners. Path relative to the location of config file.

Specifies the directory of tuner code.

#### classFileName

Required if using customized tuners. File path relative to __codeDir__.

Specifies the name of tuner file.

#### className

Required if using customized tuners. String.

Specifies the name of tuner class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of the tuner algorithm. Please refer to [this file](../Tuner/BuiltinTuner.md) for the configurable arguments of each built-in tuner.

#### gpuIndices

Optional. String. Default: empty.

Specifies the GPUs that can be used by the tuner process. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by comma `,`. For example, `1`, or `0,1,3`. If the field is not set, no GPU will be visible to tuner (by setting `CUDA_VISIBLE_DEVICES` to be an empty string).

#### includeIntermediateResults

Optional. Bool. Default: false.

If __includeIntermediateResults__ is true, the last intermediate result of a trial that is early stopped by the assessor is sent to the tuner as the final result.
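The two ways of configuring a tuner can be sketched as two alternative fragments, divided by `---`. In the second fragment the directory, file and class names are hypothetical, and the `classArgs` keys depend on what the customized tuner class accepts.

```yaml
# Variant 1: built-in tuner.
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
  # Make GPUs 0 and 1 visible to the tuner process.
  gpuIndices: "0,1"
  includeIntermediateResults: true
---
# Variant 2: customized tuner (all names below are hypothetical).
tuner:
  codeDir: ./my_tuner
  classFileName: my_tuner.py
  className: MyTuner
  classArgs:
    # Assumed argument; use whatever your tuner class expects.
    optimize_mode: maximize
```

Only one of the two variants should appear in a given config file.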

### assessor

Specifies the assessor algorithm used to run an experiment. Similar to tuners, there are two ways to set the assessor. One way is to use an assessor provided by the NNI SDK, in which case users need to set __builtinAssessorName__ and __classArgs__. The other way is to use your own assessor file, in which case users need to set __codeDir__, __classFileName__, __className__ and __classArgs__. *Users must choose exactly one way.*

By default, there is no assessor enabled.

#### builtinAssessorName

Required if using built-in assessors. String.

Specifies the name of a built-in assessor. The NNI SDK provides the assessors introduced [here](../Assessor/BuiltinAssessor.md).

#### codeDir

Required if using customized assessors. Path relative to the location of config file.

Specifies the directory of assessor code.

#### classFileName

Required if using customized assessors. File path relative to __codeDir__.

Specifies the name of assessor file.

#### className

Required if using customized assessors. String.

Specifies the name of assessor class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of the assessor algorithm.
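An assessor section follows the same pattern as a tuner section. The sketch below shows both variants as alternative fragments divided by `---`. The names in the second fragment are hypothetical, and its `classArgs` key is an assumption.

```yaml
# Variant 1: built-in assessor.
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
---
# Variant 2: customized assessor (all names below are hypothetical).
assessor:
  codeDir: ./my_assessor
  classFileName: my_assessor.py
  className: MyAssessor
  classArgs:
    # Assumed argument; use whatever your assessor class expects.
    optimize_mode: maximize
```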

### advisor

Optional.

Specifies the advisor algorithm used in the experiment. Similar to tuners and assessors, there are two ways to specify the advisor. One way is to use an advisor provided by the NNI SDK, in which case you need to set __builtinAdvisorName__ and __classArgs__. The other way is to use your own advisor file, in which case you need to set __codeDir__, __classFileName__, __className__ and __classArgs__.

When advisor is enabled, settings of tuners and assessors will be bypassed.

#### builtinAdvisorName

Specifies the name of a built-in advisor. NNI sdk provides [BOHB](../Tuner/BohbAdvisor.md) and [Hyperband](../Tuner/HyperbandAdvisor.md).

#### codeDir

Required if using customized advisors. Path relative to the location of config file.

Specifies the directory of advisor code.

#### classFileName

Required if using customized advisors. File path relative to __codeDir__.

Specifies the name of advisor file.

#### className

Required if using customized advisors. String.

Specifies the name of advisor class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of advisor.

#### gpuIndices

Optional. String. Default: empty.

Specifies the GPUs that can be used. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by comma `,`. For example, `1`, or `0,1,3`. If the field is not set, no GPU will be visible to tuner (by setting `CUDA_VISIBLE_DEVICES` to be an empty string).
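A minimal sketch of an advisor section using one of the built-in advisors named above. The `classArgs` key shown is an assumption made by analogy with the tuner templates; the advisor's own documentation lists the arguments it actually accepts.

```yaml
advisor:
  builtinAdvisorName: Hyperband
  classArgs:
    # Assumed argument; check the advisor documentation for the full list.
    optimize_mode: maximize
  # Make GPU 0 visible to the advisor process.
  gpuIndices: "0"
```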

### trial

Required. Key-value pairs.

In local and remote mode, the following keys are available.

* __command__: Required string. Specifies the command to run trial process.

* __codeDir__: Required string. Specifies the directory of your own trial file. This directory will be automatically uploaded in remote mode.

* __gpuNum__: Optional integer. Specifies the number of GPUs used to run the trial process. Default value is 0.

In PAI mode, the following keys are available.

* __command__: Required string. Specifies the command to run trial process.

* __codeDir__: Required string. Specifies the directory of your own trial file. Files in the directory will be uploaded in PAI mode.

* __gpuNum__: Required integer. Specifies the number of GPUs used to run the trial process. Default value is 0.

* __cpuNum__: Required integer. Specifies the number of CPUs to be used in the pai container.

* __memoryMB__: Required integer. Sets the memory size to be used in the pai container, in megabytes.

* __image__: Required string. Sets the image to be used in pai.

* __authFile__: Optional string. Used for a Docker registry that requires authentication to pull images in PAI. [Reference](https://github.com/microsoft/pai/blob/2ea69b45faa018662bc164ed7733f6fdbb4c42b3/docs/faq.md#q-how-to-use-private-docker-registry-job-image-when-submitting-an-openpai-job).

* __shmMB__: Optional integer. Shared memory size of the container.

* __portList__: List of key-value pairs with `label`, `beginAt`, `portNumber`. See [job tutorial of PAI](https://github.com/microsoft/pai/blob/master/docs/job_tutorial.md) for details.
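The PAI keys listed above could be filled in as in the following sketch. The command and code directory are placeholders, the resource values are illustrative, and the image name stands for an image available to your PAI cluster.

```yaml
trial:
  command: python3 mnist.py
  codeDir: /nni/mnist
  gpuNum: 0
  cpuNum: 1
  # Memory size of the pai container, in megabytes.
  memoryMB: 8192
  # Placeholder image name.
  image: your_registry/your_image:tag
  # Optional: shared memory size of the container (illustrative value).
  shmMB: 1024
```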

In Kubeflow mode, the following keys are available.

* __codeDir__: The local directory that contains the code files.

* __ps__: An optional configuration for kubeflow's tensorflow-operator, which includes:

    * __replicas__: The replica number of __ps__ role.

    * __command__: The run script in __ps__'s container.

    * __gpuNum__: The gpu number to be used in __ps__ container.

    * __cpuNum__: The cpu number to be used in __ps__ container.

    * __memoryMB__: The memory size of the container.

    * __image__: The image to be used in __ps__.

* __worker__: An optional configuration for kubeflow's tensorflow-operator, which includes:

    * __replicas__: The replica number of __worker__ role.

    * __command__: The run script in __worker__'s container.

    * __gpuNum__: The gpu number to be used in __worker__ container.

    * __cpuNum__: The cpu number to be used in __worker__ container.

    * __memoryMB__: The memory size of the container.

    * __image__: The image to be used in __worker__.
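The nested structure of the Kubeflow keys above may be easier to read as a sketch. The command, code directory and image name are placeholders, and the replica and resource numbers are illustrative only.

```yaml
trial:
  codeDir: /nni/mnist
  ps:
    replicas: 1
    command: python3 mnist.py
    gpuNum: 0
    cpuNum: 1
    memoryMB: 8192
    # Placeholder image name.
    image: your_registry/your_image:tag
  worker:
    replicas: 1
    command: python3 mnist.py
    gpuNum: 1
    cpuNum: 1
    memoryMB: 8192
    # Placeholder image name.
    image: your_registry/your_image:tag
```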

### localConfig

Optional in local mode. Key-value pairs.

Only applicable if __trainingServicePlatform__ is set to `local`; otherwise the configuration file should not contain a __localConfig__ section.

#### gpuIndices

Optional. String. Default: none.

Specifies the GPU devices designated for NNI. If it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified. Multiple GPU indices should be separated with comma (`,`), such as `1` or `0,1,3`. By default, all available GPUs are used.

#### maxTrialNumPerGpu

Optional. Integer. Default: 1.

Specifies the maximum number of trials that can run concurrently on a GPU device.

#### useActiveGpu

Optional. Bool. Default: false.

Specifies whether to use a GPU that is already running another process. By default, NNI uses a GPU only if there is no other active process on it. If __useActiveGpu__ is set to true, NNI uses the GPU regardless of other processes. This field is not applicable for NNI on Windows.
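A sketch of a __localConfig__ section that restricts trials to two GPUs and lets two trials share each of them. The indices and counts are illustrative assumptions.

```yaml
trainingServicePlatform: local
localConfig:
  # Only GPUs 0 and 1 are used for trial jobs.
  gpuIndices: "0,1"
  # Up to 2 trials may run concurrently on each GPU.
  maxTrialNumPerGpu: 2
  # Skip GPUs that already have another active process (the default).
  useActiveGpu: false
```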

### machineList

Required in remote mode. A list of key-value pairs with the following keys.

#### ip

Required. IP address or host name that is accessible from the current machine.

The IP address or host name of the remote machine.

#### port

Optional. Integer. Valid port. Default: 22.

The SSH port used to connect to the machine.

#### username

Required if authenticating with username/password. String.

The account on the remote machine.

#### passwd

Required if authenticating with username/password. String.

Specifies the password of the account.

#### sshKeyPath

Required if authenticating with an SSH key. Path to private key file.

If users log in to the remote machine with an SSH key, __sshKeyPath__ should be a valid path to an SSH key file.

*Note: if users set passwd and sshKeyPath simultaneously, NNI will try passwd first.*

#### passphrase

Optional. String.

The passphrase that protects the SSH key. It can be left empty if the key has no passphrase.

#### gpuIndices

Optional. String. Default: none.

Specifies the GPU devices designated for NNI. If it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified. Multiple GPU indices should be separated with comma (`,`), such as `1` or `0,1,3`. By default, all available GPUs are used.

#### maxTrialNumPerGpu

Optional. Integer. Default: 1.

Specifies the maximum number of trials that can run concurrently on a GPU device.

#### useActiveGpu

Optional. Bool. Default: false.

Specifies whether to use a GPU that is already running another process. By default, NNI uses a GPU only if there is no other active process on it. If __useActiveGpu__ is set to true, NNI uses the GPU regardless of other processes. This field is not applicable for NNI on Windows.
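The two authentication styles described above can be sketched as a __machineList__ with two entries. All addresses, account names, passwords and paths are placeholders.

```yaml
trainingServicePlatform: remote
machineList:
  # Machine 1: username/password authentication on the default SSH port.
  - ip: 10.10.10.10
    port: 22
    username: your_user
    passwd: your_password
  # Machine 2: SSH key authentication, restricted to GPU 0.
  - ip: 10.10.10.11
    username: your_user
    sshKeyPath: /home/your_user/.ssh/id_rsa
    passphrase: your_passphrase
    gpuIndices: "0"
    maxTrialNumPerGpu: 1
    useActiveGpu: false
```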

#### preCommand

Optional. String.

Specifies the pre-command that is executed before the remote machine executes other commands. Users can configure the experiment environment on the remote machine by setting __preCommand__. If multiple commands need to be executed, connect them with `&&`, such as `preCommand: command1 && command2 && ...`.

__Note__: Because __preCommand__ is executed before other commands each time, setting a __preCommand__ that makes changes to the system, e.g. `mkdir` or `touch`, is strongly discouraged.
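A sketch of a __preCommand__ that only adjusts environment variables, so running it repeatedly leaves the system unchanged. The address, credentials, path and variable name are placeholders.

```yaml
machineList:
  - ip: 10.10.10.10
    username: your_user
    passwd: your_password
    # Two commands chained with &&; neither creates or modifies files.
    preCommand: export PATH=/path/to/your/python/bin:$PATH && export MY_EXPERIMENT_FLAG=1
```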

### remoteConfig

Optional field in remote mode. Per-machine information is set in the `machineList` field, while global configuration for remote mode is set in this field.

#### reuse

Optional. Bool. Default: `false`. This is an experimental feature.

If it is true, NNI reuses remote jobs to run as many trials as possible, which saves the time of creating new jobs. Users need to make sure that each trial can run independently in the same job, for example by avoiding loading checkpoints from previous trials.
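A minimal sketch of enabling job reuse in remote mode. It shows only the fields discussed in this section; a complete config also needs __machineList__ and the other required fields.

```yaml
trainingServicePlatform: remote
remoteConfig:
  # Experimental: run multiple trials inside the same remote job.
  reuse: true
```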

### kubeflowConfig

#### operator

Required. String. Has to be `tf-operator` or `pytorch-operator`.

Specifies the kubeflow operator to be used. NNI supports `tf-operator` in the current version.

#### storage

Optional. String. Default: `nfs`.

Specifies the storage type of kubeflow, including `nfs` and `azureStorage`.

#### nfs

Required if using nfs. Key-value pairs.

* __server__ is the host of nfs server.

* __path__ is the mounted path of nfs.

#### keyVault

Required if using azure storage. Key-value pairs.

Set __keyVault__ to store the private key of your azure storage account. Refer to https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2.

* __vaultName__ is the value of `--vault-name` used in az command.

* __name__ is the value of `--name` used in az command.

#### azureStorage

Required if using azure storage. Key-value pairs.

Set azure storage account to store code files.

* __accountName__ is the name of azure storage account.

* __azureShare__ is the share of the azure file storage.

#### uploadRetryCount

Required if using azure storage. Integer between 1 and 99999.

If uploading files to azure storage fails, NNI retries the upload. This field specifies the number of attempts to re-upload files.
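The two storage options can be sketched as alternative __kubeflowConfig__ fragments, divided by `---`. Server addresses, paths, vault, secret, account and share names are all placeholders, and the retry count is an illustrative value.

```yaml
# Variant 1: nfs storage.
kubeflowConfig:
  operator: tf-operator
  storage: nfs
  nfs:
    server: 10.10.10.10
    path: /var/nfs/general
---
# Variant 2: azure storage, with the account key kept in a key vault.
kubeflowConfig:
  operator: tf-operator
  storage: azureStorage
  keyVault:
    vaultName: your_vault_name
    name: your_secret_name
  azureStorage:
    accountName: your_storage_account
    azureShare: your_share_name
  # Retry a failed upload up to 5 times.
  uploadRetryCount: 5
```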

### paiConfig

#### userName

Required. String.

The user name of your pai account.

#### password

Required if using password authentication. String.

The password of the pai account.

#### token

Required if using token authentication. String.

Personal access token that can be retrieved from PAI portal.

#### host

Required. String.

The hostname or IP address of PAI.

#### reuse

Optional. Bool. Default: `false`. This is an experimental feature.

If it is true, NNI reuses OpenPAI jobs to run as many trials as possible, which saves the time of creating new jobs. Users need to make sure that each trial can run independently in the same job, for example by avoiding loading checkpoints from previous trials.
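A sketch of a __paiConfig__ section using token authentication. The account name, token and host are placeholders. With password authentication, `token` would be replaced by `password`.

```yaml
trainingServicePlatform: pai
paiConfig:
  userName: your_pai_account
  # Personal access token retrieved from the PAI portal.
  token: your_personal_access_token
  host: 10.10.10.10
  # Experimental: keep the default unless trials are independent of each other.
  reuse: false
```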

## Examples

### Local mode

To run trial jobs on the local machine and use annotation to generate the search space, use the following config:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  #choice: true, false
  useAnnotation: true
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

You can also add an assessor configuration:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    #choice: Medianstop
    builtinAssessorName: Medianstop
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

Or you can specify your own tuner and assessor files as follows:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    codeDir: /nni/tuner
    classFileName: mytuner.py
    className: MyTuner
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    codeDir: /nni/assessor
    classFileName: myassessor.py
    className: MyAssessor
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

### Remote mode

To run trial jobs on remote machines, specify the remote machine information in the following format:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: remote
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  #machineList can be empty if the platform is local
  machineList:
    - ip: 10.10.10.10
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.11
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.12
      port: 22
      username: test
      sshKeyPath: /nni/sshkey
      passphrase: qwert
      # Pre-command will be executed before the remote machine executes other commands.
      # Below is an example of specifying python environment.
      # If you want to execute multiple commands, please use "&&" to connect them.
      # preCommand: source ${replace_to_absolute_path_recommended_here}/bin/activate
      # preCommand: source ${replace_to_conda_path}/bin/activate ${replace_to_conda_env_name}
      preCommand: export PATH=${replace_to_python_environment_path_in_your_remote_machine}:$PATH
  ```
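
The comments in the config above note that multiple pre-commands can be chained with `&&`. A sketch of a single machine entry doing so is given here; the virtualenv and `bin` paths are hypothetical and should be replaced with absolute paths that exist on your remote machine.

  ```yaml
  machineList:
    - ip: 10.10.10.10
      port: 22
      username: test
      passwd: test
      # Hypothetical paths: activate a virtualenv, then prepend a directory to PATH
      preCommand: source /home/test/venv/bin/activate && export PATH=/home/test/bin:$PATH
  ```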

### PAI mode

  ```yaml
  authorName: test
  experimentName: nni_test1
  trialConcurrency: 1
  maxExecDuration: 500h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: pai
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution, BatchTuner
    #SMAC (SMAC should be installed through nnictl)
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 main.py
    codeDir: .
    gpuNum: 4
    cpuNum: 2
    memoryMB: 10000
    #The docker image to run NNI job on pai
    image: msranni/nni:latest
  paiConfig:
    #The username to login pai
    userName: test
    #The password to login pai
    passWord: test
    #The host of restful server of pai
    host: 10.10.10.10
  ```

### Kubeflow mode

Kubeflow with NFS storage:

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    nfs:
      server: 10.10.10.10
      path: /var/nfs/general
  ```

### Kubeflow with Azure storage

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  #nniManagerIp: 10.10.10.10
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    builtinAssessorName: Medianstop
    classArgs:
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 4096
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    keyVault:
      vaultName: Contoso-Vault
      name: AzureStorageAccountKey
    azureStorage:
      accountName: storage
      azureShare: share01
  ```