# Experiment Config Reference

A config file is required to create an experiment. The path of the config file is passed to `nnictl`.
The config file is in YAML format.
This document describes the rules for writing the config file and provides templates and examples.

- [Experiment Config Reference](#experiment-config-reference)
  * [Template](#template)
  * [Configuration Spec](#configuration-spec)
    + [authorName](#authorname)
    + [experimentName](#experimentname)
    + [trialConcurrency](#trialconcurrency)
    + [maxExecDuration](#maxexecduration)
    + [versionCheck](#versioncheck)
    + [debug](#debug)
    + [maxTrialNum](#maxtrialnum)
    + [trainingServicePlatform](#trainingserviceplatform)
    + [searchSpacePath](#searchspacepath)
    + [useAnnotation](#useannotation)
    + [multiThread](#multithread)
    + [nniManagerIp](#nnimanagerip)
    + [logDir](#logdir)
    + [logLevel](#loglevel)
    + [logCollection](#logcollection)
    + [tuner](#tuner)
      - [builtinTunerName](#builtintunername)
      - [codeDir](#codedir)
      - [classFileName](#classfilename)
      - [className](#classname)
      - [classArgs](#classargs)
      - [gpuIndices](#gpuindices)
      - [includeIntermediateResults](#includeintermediateresults)
    + [assessor](#assessor)
      - [builtinAssessorName](#builtinassessorname)
      - [codeDir](#codedir-1)
      - [classFileName](#classfilename-1)
      - [className](#classname-1)
      - [classArgs](#classargs-1)
    + [advisor](#advisor)
      - [builtinAdvisorName](#builtinadvisorname)
      - [codeDir](#codedir-2)
      - [classFileName](#classfilename-2)
      - [className](#classname-2)
      - [classArgs](#classargs-2)
      - [gpuIndices](#gpuindices-1)
    + [trial](#trial)
    + [localConfig](#localconfig)
      - [gpuIndices](#gpuindices-2)
      - [maxTrialNumPerGpu](#maxtrialnumpergpu)
      - [useActiveGpu](#useactivegpu)
    + [machineList](#machinelist)
      - [ip](#ip)
      - [port](#port)
      - [username](#username)
      - [passwd](#passwd)
      - [sshKeyPath](#sshkeypath)
      - [passphrase](#passphrase)
      - [gpuIndices](#gpuindices-3)
      - [maxTrialNumPerGpu](#maxtrialnumpergpu-1)
      - [useActiveGpu](#useactivegpu-1)
      - [preCommand](#precommand)
    + [remoteConfig](#remoteconfig)
      - [reuse](#reuse)
    + [kubeflowConfig](#kubeflowconfig)
      - [operator](#operator)
      - [storage](#storage)
      - [nfs](#nfs)
      - [keyVault](#keyvault)
      - [azureStorage](#azurestorage)
      - [uploadRetryCount](#uploadretrycount)
    + [paiConfig](#paiconfig)
      - [userName](#username-1)
      - [password](#password)
      - [token](#token)
      - [host](#host)
      - [reuse](#reuse-1)
  * [Examples](#examples)
    + [Local mode](#local-mode)
    + [Remote mode](#remote-mode)
    + [PAI mode](#pai-mode)
    + [Kubeflow mode](#kubeflow-mode)
    + [Kubeflow with azure storage](#kubeflow-with-azure-storage)

## Template

* __Lightweight (without Annotation and Assessor)__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Assessor__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

* __Use Annotation__

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuIndices:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

## Configuration Spec

### authorName

Required. String.

The name of the author who creates the experiment.

*TBD: add default value.*

### experimentName

Required. String.

The name of the experiment.

*TBD: add default value.*

### trialConcurrency

Required. Integer between 1 and 99999.

Specifies the maximum number of trial jobs that run simultaneously.

If trialGpuNum is larger than the number of free GPUs, so that the number of trial jobs running simultaneously cannot reach __trialConcurrency__, some trial jobs are put into a queue to wait for GPU allocation.

### maxExecDuration

Optional. String. Default: 999d.

__maxExecDuration__ specifies the maximum duration of an experiment. The unit of time is one of {__s__, __m__, __h__, __d__}, meaning {_seconds_, _minutes_, _hours_, _days_}.

Note: __maxExecDuration__ limits the duration of the experiment, not of a trial job. When the experiment reaches the maximum duration, it does not stop, but it can no longer submit new trial jobs.

### versionCheck

Optional. Bool. Default: true.

NNI checks the version of the nniManager process and the version of trialKeeper on the remote, pai and kubernetes platforms. To disable the version check, set __versionCheck__ to false.

### debug

Optional. Bool. Default: false.

Debug mode sets __versionCheck__ to false and __logLevel__ to `debug`.

### maxTrialNum

Optional. Integer between 1 and 99999. Default: 99999.

Specifies the maximum number of trial jobs created by NNI, including succeeded and failed jobs.
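
As a rough illustration of how __trialConcurrency__, __maxExecDuration__ and __maxTrialNum__ fit together, the fragment below bounds an experiment in three ways. The values are arbitrary placeholders, not recommendations.

```yaml
# At most 4 trial jobs run at the same time.
trialConcurrency: 4
# After 2 hours no new trial jobs are submitted; the experiment itself does not stop.
maxExecDuration: 2h
# At most 50 trial jobs are created, counting both succeeded and failed ones.
maxTrialNum: 50
```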

### trainingServicePlatform

Required. String.

Specifies the platform on which to run the experiment: __local__, __remote__, __pai__, __kubeflow__, __adl__ or __frameworkcontroller__.

* __local__ runs the experiment on the local Ubuntu machine.

* __remote__ submits trial jobs to remote Ubuntu machines. The __machineList__ field must be filled in to set up the SSH connection to the remote machines.

* __pai__ submits trial jobs to Microsoft's [OpenPAI](https://github.com/Microsoft/pai). For more details on pai configuration, refer to the [Guide to PAI Mode](../TrainingService/PaiMode.md).

* __kubeflow__ submits trial jobs to [kubeflow](https://www.kubeflow.org/docs/about/kubeflow/). NNI supports kubeflow on standard kubernetes and on [azure kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/). For details, refer to the [Kubeflow Docs](../TrainingService/KubeflowMode.md).

* __adl__ submits trial jobs to [AdaptDL](https://github.com/petuum/adaptdl). NNI supports AdaptDL on Kubernetes clusters. For details, refer to the [AdaptDL Docs](../TrainingService/AdaptDLMode.md).

* TODO: explain frameworkcontroller.

### searchSpacePath

Optional. Path to existing file.

Specifies the path of the search space file, which should be a valid path on the local Linux machine.

The only case in which __searchSpacePath__ may be omitted is when `useAnnotation=True`.

### useAnnotation

Optional. Bool. Default: false.

Use annotation to analyze the trial code and generate the search space.

Note: if __useAnnotation__ is true, the __searchSpacePath__ field should be removed.
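
Since __searchSpacePath__ and __useAnnotation__ exclude each other, a config sets one or the other. A minimal sketch of both variants, written as two separate YAML documents; the file path is a placeholder:

```yaml
# Variant 1: the search space is read from a file.
searchSpacePath: /nni/search_space.json
useAnnotation: false
---
# Variant 2: the search space is generated from annotations in the trial code,
# so searchSpacePath is left out.
useAnnotation: true
```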

### multiThread

Optional. Bool. Default: false.

Enables multi-thread mode for the dispatcher. If __multiThread__ is enabled, the dispatcher starts a thread to process each command from NNI Manager.

### nniManagerIp

Optional. String. Default: eth0 device IP.

Sets the IP address of the machine on which the NNI manager process runs. If it is not set, the IP of the eth0 device is used instead.

Note: run `ifconfig` on the NNI manager's machine to check whether the eth0 device exists. If it does not, setting __nniManagerIp__ explicitly is recommended.

### logDir

Optional. Path to a directory. Default: `<user home directory>/nni-experiments`.

Configures the directory in which the logs and data of the experiment are stored.

### logLevel

Optional. String. Default: `info`.

Sets the log level for the experiment. Available log levels are: `trace`, `debug`, `info`, `warning`, `error`, `fatal`.

### logCollection

Optional. `http` or `none`. Default: `none`.

Sets how logs are collected on the remote, pai, kubeflow and frameworkcontroller platforms. There are two options:

* `http`: trial keeper posts log content back through HTTP requests. This may slow down log processing in trialKeeper.

* `none`: trial keeper does not post log content back and only posts job metrics.

If your log content is too big, consider setting this parameter to `none`.
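
The three logging fields are top-level keys of the config file. A small sketch, assuming a custom log directory (the path is a placeholder):

```yaml
# Store experiment logs and data under a custom directory.
logDir: /data/nni-experiments
# More verbose than the default `info`.
logLevel: debug
# Have trial keeper post log content back over HTTP (the default is `none`).
logCollection: http
```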

### tuner

Required.

Specifies the tuner algorithm of the experiment. There are two ways to set the tuner. One is to use a tuner provided by the NNI SDK (a built-in tuner), in which case you need to set __builtinTunerName__ and __classArgs__. The other is to use your own tuner file, in which case __codeDir__, __classFileName__, __className__ and __classArgs__ are needed. *Users must choose exactly one way.*

#### builtinTunerName

Required if using built-in tuners. String.

Specifies the name of a built-in tuner. The NNI SDK provides several tuners, introduced [here](../Tuner/BuiltinTuner.md).

#### codeDir

Required if using customized tuners. Path relative to the location of config file.

Specifies the directory of the tuner code.

#### classFileName

Required if using customized tuners. File path relative to __codeDir__.

Specifies the name of the tuner file.

#### className

Required if using customized tuners. String.

Specifies the name of the tuner class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of the tuner algorithm. Refer to [this file](../Tuner/BuiltinTuner.md) for the configurable arguments of each built-in tuner.
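
For a customized tuner, the section might look like the sketch below. The directory, file name, class name and the `optimize_mode` argument are hypothetical placeholders for your own tuner; only the key names come from this reference.

```yaml
tuner:
  # Directory of the tuner code, relative to the location of this config file.
  codeDir: ./my_tuner
  # Tuner file, relative to codeDir (hypothetical name).
  classFileName: my_tuner.py
  # Tuner class defined in that file (hypothetical name).
  className: MyTuner
  # Arguments of the tuner algorithm; which keys are valid depends on your class.
  classArgs:
    optimize_mode: maximize
```

Note that __builtinTunerName__ is absent here, because exactly one of the two ways may be used.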

#### gpuIndices

Optional. String. Default: empty.

Specifies the GPUs that can be used by the tuner process. Single or multiple GPU indices can be specified; multiple GPU indices are separated by a comma (`,`), for example `1` or `0,1,3`. If the field is not set, no GPU is visible to the tuner (`CUDA_VISIBLE_DEVICES` is set to an empty string).

#### includeIntermediateResults

Optional. Bool. Default: false.

If __includeIntermediateResults__ is true, the last intermediate result of a trial that is early stopped by the assessor is sent to the tuner as its final result.
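
To show where these two optional keys sit, here is a sketch of a built-in tuner that may use two GPUs and receives the last intermediate result of early-stopped trials. The GPU indices are placeholders.

```yaml
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
  # Only GPUs 0 and 1 are visible to the tuner process.
  gpuIndices: 0,1
  # Send the last intermediate result of early-stopped trials as their final result.
  includeIntermediateResults: true
```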

### assessor

Specifies the assessor algorithm used in the experiment. As with tuners, there are two ways to set the assessor. One is to use an assessor provided by the NNI SDK, in which case you need to set __builtinAssessorName__ and __classArgs__. The other is to use your own assessor file, in which case you need to set __codeDir__, __classFileName__, __className__ and __classArgs__. *Users must choose exactly one way.*

By default, no assessor is enabled.

#### builtinAssessorName

Required if using built-in assessors. String.

Specifies the name of a built-in assessor. The NNI SDK provides several assessors, introduced [here](../Assessor/BuiltinAssessor.md).

#### codeDir

Required if using customized assessors. Path relative to the location of config file.

Specifies the directory of the assessor code.

#### classFileName

Required if using customized assessors. File path relative to __codeDir__.

Specifies the name of the assessor file.

#### className

Required if using customized assessors. String.

Specifies the name of the assessor class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of the assessor algorithm.

### advisor

Optional.

Specifies the advisor algorithm of the experiment. As with tuners and assessors, there are two ways to specify the advisor. One is to use an advisor provided by the NNI SDK, in which case you need to set __builtinAdvisorName__ and __classArgs__. The other is to use your own advisor file, in which case you need to set __codeDir__, __classFileName__, __className__ and __classArgs__.

When an advisor is enabled, the settings of tuners and assessors are bypassed.

#### builtinAdvisorName

Specifies the name of a built-in advisor. The NNI SDK provides [BOHB](../Tuner/BohbAdvisor.md) and [Hyperband](../Tuner/HyperbandAdvisor.md).

#### codeDir

Required if using customized advisors. Path relative to the location of config file.

Specifies the directory of the advisor code.

#### classFileName

Required if using customized advisors. File path relative to __codeDir__.

Specifies the name of the advisor file.

#### className

Required if using customized advisors. String.

Specifies the name of the advisor class.

#### classArgs

Optional. Key-value pairs. Default: empty.

Specifies the arguments of the advisor.

#### gpuIndices

Optional. String. Default: empty.

Specifies the GPUs that can be used by the advisor process. Single or multiple GPU indices can be specified; multiple GPU indices are separated by a comma (`,`), for example `1` or `0,1,3`. If the field is not set, no GPU is visible to the advisor (`CUDA_VISIBLE_DEVICES` is set to an empty string).
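
A minimal sketch of an advisor section using one of the built-in advisors named above. The `optimize_mode` argument is an assumption carried over from the tuner templates; check the advisor's own documentation for the arguments it accepts.

```yaml
# With an advisor enabled, tuner and assessor settings in the same file are bypassed.
advisor:
  builtinAdvisorName: BOHB
  classArgs:
    # Assumed argument, mirroring the tuner templates.
    optimize_mode: maximize
```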

### trial

Required. Key-value pairs.

In local and remote mode, the following keys are used.

* __command__: Required string. Specifies the command to run the trial process.

* __codeDir__: Required string. Specifies the directory of your own trial file. This directory will be automatically uploaded in remote mode.

* __gpuNum__: Optional integer. Specifies the number of GPUs used to run the trial process. Default value is 0.

In PAI mode, the following keys are used.

* __command__: Required string. Specifies the command to run the trial process.

* __codeDir__: Required string. Specifies the directory of your own trial file. Files in the directory will be uploaded in PAI mode.

* __gpuNum__: Required integer. Specifies the number of GPUs used to run the trial process. Default value is 0.

* __cpuNum__: Required integer. Specifies the number of CPUs to be used in the pai container.

* __memoryMB__: Required integer. Sets the memory size to be used in the pai container, in megabytes.

* __image__: Required string. Sets the image to be used in pai.

* __authFile__: Optional string. Used to provide a Docker registry that requires authentication for image pull in PAI. [Reference](https://github.com/microsoft/pai/blob/2ea69b45faa018662bc164ed7733f6fdbb4c42b3/docs/faq.md#q-how-to-use-private-docker-registry-job-image-when-submitting-an-openpai-job).

* __shmMB__: Optional integer. Shared memory size of the container.

* __portList__: List of key-value pairs with `label`, `beginAt`, `portNumber`. See the [job tutorial of PAI](https://github.com/microsoft/pai/blob/master/docs/job_tutorial.md) for details.
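
Put together, a trial section for PAI mode could look like the following sketch. The command and code directory are borrowed from the local-mode examples in this document; the resource numbers and the image name are placeholders to replace with values that suit your cluster.

```yaml
trial:
  command: python3 mnist.py
  codeDir: /nni/mnist
  # Resources requested for the pai container (placeholder values).
  gpuNum: 1
  cpuNum: 2
  memoryMB: 8192
  # Placeholder image name; use an image your OpenPAI cluster can pull.
  image: msranni/nni:latest
```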

In Kubeflow mode, the following keys are used.

* __codeDir__: The local directory that contains the code files.

* __ps__: An optional configuration for kubeflow's tensorflow-operator, which includes

    * __replicas__: The number of replicas of the __ps__ role.

    * __command__: The run script in the __ps__ container.

    * __gpuNum__: The number of GPUs to be used in the __ps__ container.

    * __cpuNum__: The number of CPUs to be used in the __ps__ container.

    * __memoryMB__: The memory size of the container.

    * __image__: The image to be used in __ps__.

* __worker__: An optional configuration for kubeflow's tensorflow-operator, which includes

    * __replicas__: The number of replicas of the __worker__ role.

    * __command__: The run script in the __worker__ container.

    * __gpuNum__: The number of GPUs to be used in the __worker__ container.

    * __cpuNum__: The number of CPUs to be used in the __worker__ container.

    * __memoryMB__: The memory size of the container.

    * __image__: The image to be used in __worker__.
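
The nesting of the __ps__ and __worker__ roles can be sketched as follows. All values (paths, commands, resource sizes, image name) are illustrative placeholders.

```yaml
trial:
  codeDir: /nni/mnist
  # Parameter-server role for tensorflow-operator.
  ps:
    replicas: 1
    command: python3 mnist.py
    gpuNum: 0
    cpuNum: 1
    memoryMB: 8192
    image: msranni/nni:latest
  # Worker role for tensorflow-operator.
  worker:
    replicas: 2
    command: python3 mnist.py
    gpuNum: 1
    cpuNum: 1
    memoryMB: 8192
    image: msranni/nni:latest
```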

### localConfig

Optional in local mode. Key-value pairs.

Only applicable if __trainingServicePlatform__ is set to `local`; otherwise the configuration file should not contain a __localConfig__ section.

#### gpuIndices

Optional. String. Default: none.

Specifies the GPU devices designated for NNI. If it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified; multiple GPU indices are separated by a comma (`,`), such as `1` or `0,1,3`. By default, all available GPUs are used.

#### maxTrialNumPerGpu

Optional. Integer. Default: 1.

Specifies the maximum number of trials that can run concurrently on one GPU device.

#### useActiveGpu

Optional. Bool. Default: false.

Specifies whether to use a GPU on which another process is running. By default, NNI uses a GPU only if no other process is active on it. If __useActiveGpu__ is set to true, NNI uses the GPU regardless of other processes. This field is not applicable to NNI on Windows.
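
A sketch of a __localConfig__ section that combines the three keys; the GPU indices and the per-GPU limit are placeholders.

```yaml
trainingServicePlatform: local
localConfig:
  # Restrict NNI trial jobs to GPUs 0 and 1.
  gpuIndices: 0,1
  # Allow up to 2 trials to run concurrently on one GPU.
  maxTrialNumPerGpu: 2
  # Also use GPUs that already have another active process.
  useActiveGpu: true
```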

### machineList

Required in remote mode. A list of key-value pairs with the following keys.

#### ip

Required. IP address or host name that is accessible from the current machine.

The IP address or host name of the remote machine.

#### port

Optional. Integer. Valid port. Default: 22.

The SSH port used to connect to the machine.

#### username

Required if authenticating with username/password. String.

The account on the remote machine.

#### passwd

Required if authenticating with username/password. String.

Specifies the password of the account.

#### sshKeyPath

Required if authenticating with an SSH key. Path to private key file.

If users log in to the remote machine with an SSH key, __sshKeyPath__ should be a valid path to an SSH key file.

*Note: if users set passwd and sshKeyPath simultaneously, NNI will try passwd first.*

#### passphrase

Optional. String.

The passphrase that protects the SSH key. It can be left empty if the key has no passphrase.

#### gpuIndices

Optional. String. Default: none.

Specifies the GPU devices designated for NNI. If it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified; multiple GPU indices are separated by a comma (`,`), such as `1` or `0,1,3`. By default, all available GPUs are used.

#### maxTrialNumPerGpu

Optional. Integer. Default: 1.

Specifies the maximum number of trials that can run concurrently on one GPU device.

#### useActiveGpu

Optional. Bool. Default: false.

Specifies whether to use a GPU on which another process is running. By default, NNI uses a GPU only if no other process is active on it. If __useActiveGpu__ is set to true, NNI uses the GPU regardless of other processes. This field is not applicable to NNI on Windows.

#### preCommand

Optional. String.

Specifies the pre-command that is executed before the remote machine executes other commands. Users can set up the experiment environment on the remote machine with __preCommand__. If multiple commands need to be executed, connect them with `&&`, such as `preCommand: command1 && command2 && ...`.

__Note__: Because __preCommand__ is executed before other commands every time, setting a __preCommand__ that makes changes to the system, e.g. `mkdir` or `touch`, is strongly discouraged.
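
The sketch below shows a __machineList__ with two entries, one per authentication method. Addresses, account names, credentials, key path and the pre-command are all made-up placeholders.

```yaml
trainingServicePlatform: remote
machineList:
  # Machine 1: username/password authentication on the default SSH port.
  - ip: 10.1.1.1
    port: 22
    username: bob
    passwd: bob123
  # Machine 2: SSH key authentication, restricted to two GPUs.
  - ip: 10.1.1.2
    port: 22
    username: bob
    sshKeyPath: /home/bob/.ssh/id_rsa
    passphrase: qwert
    gpuIndices: 0,1
    maxTrialNumPerGpu: 2
    # Runs before every other command, so it should not change the system.
    preCommand: "export PATH=/usr/local/cuda/bin:$PATH"
```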

### remoteConfig

Optional field in remote mode. Per-machine information is set in the `machineList` field, while global configuration for remote mode is set in this field.

#### reuse

Optional. Bool. Default: `false`. It's an experimental feature.

If true, NNI reuses remote jobs to run as many trials as possible, which saves the time of creating new jobs. Users need to make sure that each trial can run independently within the same job, for example by avoiding loading checkpoints from previous trials.
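
A short sketch of how __remoteConfig__ sits next to __machineList__ rather than inside it. The machine entry is a placeholder.

```yaml
trainingServicePlatform: remote
machineList:
  - ip: 10.1.1.1
    username: bob
    passwd: bob123
# Global settings for remote mode.
remoteConfig:
  # Experimental: reuse remote jobs across trials.
  reuse: true
```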

### kubeflowConfig

#### operator

Required. String. Has to be `tf-operator` or `pytorch-operator`.

Specifies the kubeflow operator to be used. NNI supports `tf-operator` in the current version.

#### storage

Optional. String. Default: `nfs`.

Specifies the storage type of kubeflow, either `nfs` or `azureStorage`.

#### nfs

Required if using nfs. Key-value pairs.

* __server__ is the host of the nfs server.

* __path__ is the mounted path of nfs.
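
A minimal sketch of a __kubeflowConfig__ section backed by nfs. The server address and path are placeholders.

```yaml
trainingServicePlatform: kubeflow
kubeflowConfig:
  operator: tf-operator
  storage: nfs
  nfs:
    # Host of the nfs server.
    server: 10.10.10.10
    # Mounted path of nfs.
    path: /var/nfs/general
```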

#### keyVault

Required if using azure storage. Key-value pairs.

Set __keyVault__ to store the private key of your azure storage account. Refer to https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2.

* __vaultName__ is the value of `--vault-name` used in the az command.

* __name__ is the value of `--name` used in the az command.

#### azureStorage

Required if using azure storage. Key-value pairs.

Sets the azure storage account used to store code files.

* __accountName__ is the name of the azure storage account.

* __azureShare__ is the share of the azure file storage.

#### uploadRetryCount

Required if using azure storage. Integer between 1 and 99999.

If uploading files to azure storage fails, NNI retries the upload. This field specifies the number of attempts to re-upload files.
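
For the azure storage variant, the three azure-related fields appear together under __kubeflowConfig__. In this sketch the vault, key, account and share names are hypothetical, and the retry count is an arbitrary value.

```yaml
trainingServicePlatform: kubeflow
kubeflowConfig:
  operator: tf-operator
  storage: azureStorage
  keyVault:
    # Values of --vault-name and --name used in the az command.
    vaultName: myVault
    name: myStorageKey
  azureStorage:
    accountName: mystorageaccount
    azureShare: myshare
  # Number of attempts to re-upload files if an upload fails.
  uploadRetryCount: 5
```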

### paiConfig

#### userName

Required. String.

The user name of your pai account.

#### password

Required if using password authentication. String.

The password of the pai account.

#### token

Required if using token authentication. String.

Personal access token that can be retrieved from the PAI portal.

#### host

Required. String.

The hostname or IP address of PAI.

#### reuse

Optional. Bool. Default: `false`. It's an experimental feature.

If true, NNI reuses OpenPAI jobs to run as many trials as possible, which saves the time of creating new jobs. Users need to make sure that each trial can run independently within the same job, for example by avoiding loading checkpoints from previous trials.
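
A sketch of a __paiConfig__ section using token authentication. The account name, token and host are placeholders; with password authentication, __password__ would take the place of __token__.

```yaml
trainingServicePlatform: pai
paiConfig:
  userName: your_pai_account
  # Personal access token retrieved from the PAI portal.
  token: your_pai_token
  host: 10.1.1.1
  # Experimental: reuse OpenPAI jobs across trials.
  reuse: true
```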

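A hedged sketch of enabling job reuse follows. It assumes `reuse` is set inside `paiConfig`, as the heading level suggests; the other values are placeholders.

```yaml
paiConfig:
  userName: test
  token: your_personal_access_token
  host: 10.10.10.10
  # Experimental: let trials share OpenPAI jobs instead of creating a new job per trial.
  reuse: true
```
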
## Examples

### Local mode

To run trial jobs on the local machine and use annotation to generate the search space, use a config like the following:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  #choice: true, false
  useAnnotation: true
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

You can also add an assessor configuration:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    #choice: Medianstop
    builtinAssessorName: Medianstop
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

Or you can specify your own tuner and assessor files as follows:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    codeDir: /nni/tuner
    classFileName: mytuner.py
    className: MyTuner
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    codeDir: /nni/assessor
    classFileName: myassessor.py
    className: MyAssessor
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

### Remote mode

To run trial jobs on remote machines, specify the remote machine information in the following format:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: remote
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  #machineList can be empty if the platform is local
  machineList:
    - ip: 10.10.10.10
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.11
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.12
      port: 22
      username: test
      sshKeyPath: /nni/sshkey
      passphrase: qwert
      # Pre-command will be executed before the remote machine executes other commands.
      # Below is an example of specifying python environment.
      # If you want to execute multiple commands, please use "&&" to connect them.
      # preCommand: source ${replace_to_absolute_path_recommended_here}/bin/activate
      # preCommand: source ${replace_to_conda_path}/bin/activate ${replace_to_conda_env_name}
      preCommand: export PATH=${replace_to_python_environment_path_in_your_remote_machine}:$PATH
  ```

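The comments in the remote example note that multiple pre-commands can be joined with `&&`. A minimal sketch of that is shown below; the virtualenv and tool directory paths are hypothetical and need to be replaced with real paths on the remote machine.

```yaml
machineList:
  - ip: 10.10.10.10
    port: 22
    username: test
    passwd: test
    # Hypothetical paths: activate a virtualenv, then prepend a tool directory to PATH.
    preCommand: source /home/test/venv/bin/activate && export PATH=/home/test/bin:$PATH
```
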
### PAI mode

  ```yaml
  authorName: test
  experimentName: nni_test1
  trialConcurrency: 1
  maxExecDuration: 500h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: pai
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution, BatchTuner
    #SMAC (SMAC should be installed through nnictl)
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 main.py
    codeDir: .
    gpuNum: 4
    cpuNum: 2
    memoryMB: 10000
    #The docker image to run NNI job on pai
    image: msranni/nni:latest
  paiConfig:
    #The username to login pai
    userName: test
    #The password to login pai
    passWord: test
    #The host of restful server of pai
    host: 10.10.10.10
  ```

### Kubeflow mode

Kubeflow with NFS storage:

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    nfs:
      server: 10.10.10.10
      path: /var/nfs/general
  ```

### Kubeflow with Azure storage

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  #nniManagerIp: 10.10.10.10
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    builtinAssessorName: Medianstop
    classArgs:
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 4096
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    keyVault:
      vaultName: Contoso-Vault
      name: AzureStorageAccountKey
    azureStorage:
      accountName: storage
      azureShare: share01
  ```