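# Experiment config reference

A configuration file is required to create an Experiment; its path is passed to the `nnictl` command. The file is written in YAML and must be well-formed. This document describes the contents of the configuration file and provides some examples and templates.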

- [Experiment config reference](#experiment-config-reference)
  - [Template](#template)
  - [Configuration spec](#configuration-spec)
  - [Examples](#examples)

<a name="Template"></a>

## Template

- **Lightweight (without Annotation and Assessor)**

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false
useAnnotation:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

- **Use Assessor**

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false
useAnnotation:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

- **Use Annotation**

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
#choice: true, false
useAnnotation:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

<a name="Configuration"></a>

## Configuration spec

- **authorName**

  - Description

    **authorName** is the name of the author who creates the Experiment. TBD: add a default value.

- **experimentName**

  - Description

    **experimentName** is the name of the Experiment. TBD: add a default value.

- **trialConcurrency**

  - Description

    **trialConcurrency** specifies the maximum number of Trial jobs that run concurrently.

    Note: if the Trial's **gpuNum** is larger than the number of free GPUs and the number of concurrently running Trial jobs has not yet reached **trialConcurrency**, Trial jobs are put into a queue to wait for GPU allocation.

- **maxExecDuration**

  - Description

    **maxExecDuration** specifies the maximum duration of the Experiment. The time unit is one of {**s**, **m**, **h**, **d**}, which stand for {*seconds*, *minutes*, *hours*, *days*}.

    Note: maxExecDuration limits the duration of the Experiment, not of an individual Trial job. When the Experiment reaches the maximum duration it does not stop, but it no longer submits new Trial jobs.
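
    As an illustration, the duration limit might be combined with the other Experiment-level limits described in this section as follows. This is only a sketch; the values are arbitrary examples, not recommendations.

    ```yaml
    # At most 3 Trial jobs run at the same time.
    trialConcurrency: 3
    # No new Trial jobs are started after 2 hours (units: s, m, h, d).
    maxExecDuration: 2h
    # At most 20 Trial jobs are created, counting succeeded and failed ones.
    maxTrialNum: 20
    ```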

- **debug**

  - Description

    NNI checks the version of the NNIManager process against the version of the trialKeeper process in remote, pai and Kubernetes modes. To disable this version check, set debug to true.

- **maxTrialNum**

  - Description

    **maxTrialNum** specifies the maximum number of Trial jobs, counting both succeeded and failed jobs.

- **trainingServicePlatform**

  - Description

    **trainingServicePlatform** specifies the platform that runs the Experiment, one of {**local**, **remote**, **pai**, **kubeflow**}.

    - **local** runs the Experiment on the local Ubuntu machine.

    - **remote** submits Trial jobs to remote Ubuntu machines; **machineList** must be set to provide their SSH connection information.

    - **pai** submits Trial jobs to [OpenPAI](https://github.com/Microsoft/pai), Microsoft's open-source platform. For more details on OpenPAI configuration, see [pai mode](./PaiMode.md).

    - **kubeflow** submits Trial jobs to [Kubeflow](https://www.kubeflow.org/docs/about/kubeflow/). NNI supports Kubeflow running on standard Kubernetes as well as on [Azure Kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/).

- **searchSpacePath**

  - Description

    **searchSpacePath** specifies the path of the search space file, which must be on the local machine where nnictl runs.

    Note: if useAnnotation is set to true, the searchSpacePath field must be removed.

- **useAnnotation**

  - Description

    **useAnnotation** specifies whether annotations are used to analyze the Trial code and generate the search space.

    Note: if useAnnotation is set to true, the searchSpacePath field must be removed.
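
    The two settings are mutually exclusive, which can be sketched as two alternative fragments. They are separated by `---` here only for display and are not meant to be placed in the same file; the path is an example value.

    ```yaml
    # Option 1: the search space comes from a file, so annotations are disabled.
    searchSpacePath: /nni/search_space.json
    useAnnotation: false
    ---
    # Option 2: the search space is generated from annotations in the Trial code,
    # so searchSpacePath is left out entirely.
    useAnnotation: true
    ```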

- **nniManagerIp**

  - Description

    **nniManagerIp** sets the IP address of the machine on which the NNI manager process runs. This field is optional; if it is not set, the IP address of eth0 is used.

    Note: run ifconfig on the NNI manager's machine to check whether eth0 exists. If it does not, setting nniManagerIp explicitly is recommended.
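
    For instance, on a machine without an eth0 interface the field might be set explicitly as sketched below. The address is a placeholder for the address of your own NNI manager machine.

    ```yaml
    # Placeholder address of the machine that runs the NNI manager.
    nniManagerIp: 10.10.10.10
    ```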

- **logDir**

  - Description

    **logDir** configures the directory in which the Experiment's logs and data are stored. The default value is `<user home directory>/nni/experiment`.

- **logLevel**

  - Description

    **logLevel** sets the log level of the Experiment. The available log levels are `trace, debug, info, warning, error, fatal`. The default value is `info`.

- **logCollection**

  - Description

    **logCollection** sets how logs are collected on the remote, pai, kubeflow and frameworkcontroller platforms. There are two options. With `http`, the Trial sends its log content back through HTTP POST requests, which slows down trialKeeper. With `none`, the Trial does not send its logs back and reports only the job metrics. If the logs are large, consider setting this parameter to `none`.
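
    Taken together, the three logging fields might be set as in the following sketch. The directory is a hypothetical example; leaving a field out keeps its default.

    ```yaml
    # Hypothetical directory for logs and data.
    logDir: /home/test/nni/experiment
    # One of: trace, debug, info, warning, error, fatal.
    logLevel: debug
    # Report only job metrics and do not send the Trial logs back.
    logCollection: none
    ```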

- **Tuner**

  - Description

    **tuner** specifies the Tuner algorithm of the Experiment. There are two ways to set it. One is to use a Tuner provided by the NNI SDK, which requires **builtinTunerName** and **classArgs**. The other is to use a user-defined Tuner, which requires **codeDir**, **classFileName**, **className** and **classArgs**.

  - **builtinTunerName** and **classArgs**

    - **builtinTunerName**

      **builtinTunerName** specifies the name of a built-in Tuner. The NNI SDK provides several, such as {**TPE**, **Random**, **Anneal**, **Evolution**, **BatchTuner**, **GridSearch**}.

    - **classArgs**

      **classArgs** specifies the arguments of the Tuner algorithm. If **builtinTunerName** is one of {**TPE**, **Random**, **Anneal**, **Evolution**}, **optimize_mode** must be set.
  
  - **codeDir**, **classFileName**, **className** and **classArgs**

    - **codeDir**

      **codeDir** specifies the directory of the Tuner code.

    - **classFileName**

      **classFileName** specifies the name of the Tuner file.

    - **className**

      **className** specifies the name of the Tuner class.

    - **classArgs**

      **classArgs** specifies the arguments of the Tuner algorithm.

  - **gpuNum**

    **gpuNum** specifies the number of GPUs used to run the Tuner process. The value of this field must be a positive integer.

    Note: only one way of specifying the Tuner may be used, either {builtinTunerName, classArgs} or {codeDir, classFileName, className, classArgs}; the two cannot be combined.
        
  
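  - **includeIntermediateResults**

    If **includeIntermediateResults** is true, the last intermediate result of a Trial that is early-stopped by the Assessor is sent to the Tuner as its final result. The default value of **includeIntermediateResults** is false.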
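
    A minimal sketch of this option follows. It assumes that the field is placed inside the `tuner` section, as its position in this list suggests, and that an Assessor is configured so that Trials can be stopped early.

    ```yaml
    tuner:
      builtinTunerName: TPE
      classArgs:
        optimize_mode: maximize
      # Forward the last intermediate result of early-stopped Trials to the Tuner.
      includeIntermediateResults: true
    assessor:
      builtinAssessorName: Medianstop
      classArgs:
        optimize_mode: maximize
    ```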
        

- **Assessor**

  - Description

    **assessor** specifies the Assessor algorithm of the Experiment. There are two ways to set it. One is to use an Assessor provided by the NNI SDK, which requires **builtinAssessorName** and **classArgs**. The other is to use a user-defined Assessor, which requires **codeDir**, **classFileName**, **className** and **classArgs**.

  - **builtinAssessorName** and **classArgs**

    - **builtinAssessorName**

      **builtinAssessorName** specifies the name of a built-in Assessor. The NNI SDK provides built-in Assessors such as {**Medianstop**}.

    - **classArgs**

      **classArgs** specifies the arguments of the Assessor algorithm.
  
  - **codeDir**, **classFileName**, **className** and **classArgs**

    - **codeDir**

      **codeDir** specifies the directory of the Assessor code.

    - **classFileName**

      **classFileName** specifies the name of the Assessor file.

    - **className**

      **className** specifies the name of the Assessor class.

    - **classArgs**

      **classArgs** specifies the arguments of the Assessor algorithm.

  - **gpuNum**

    **gpuNum** specifies the number of GPUs used to run the Assessor process. The value of this field must be a positive integer.

    Note: only one way of specifying the Assessor may be used, either {builtinAssessorName, classArgs} or {codeDir, classFileName, className, classArgs}; the two cannot be combined. If no Assessor is needed, leave this field empty.

- **trial (local, remote)**

  - **command**

    **command** specifies the command line used to run the Trial process.

  - **codeDir**

    **codeDir** specifies the directory of the Trial code files.

  - **gpuNum**

    **gpuNum** specifies the number of GPUs used to run the Trial process. The default value is 0.

- **trial (pai)**

  - **command**

    **command** specifies the command line used to run the Trial process.

  - **codeDir**

    **codeDir** specifies the directory of the Trial code files.

  - **gpuNum**

    **gpuNum** specifies the number of GPUs used to run the Trial process. The default value is 0.

  - **cpuNum**

    **cpuNum** specifies the number of CPUs used in the OpenPAI container.

  - **memoryMB**

    **memoryMB** specifies the amount of memory, in MB, used in the OpenPAI container.

  - **image**

    **image** specifies the Docker image used in OpenPAI.

  - **dataDir**

    **dataDir** is the data directory in HDFS.

  - **outputDir**

    **outputDir** is the output directory in HDFS. In OpenPAI, the stdout and stderr files are stored in this directory after the job finishes.

- **trial (kubeflow)**

  - **codeDir**

    **codeDir** is the local directory that contains the code files.

  - **ps (optional)**

    **ps** is the configuration for Kubeflow's tensorflow-operator.

    - **replicas**

      **replicas** is the number of replicas of the **ps** role.

    - **command**

      **command** is the command run in the **ps** container.

    - **gpuNum**

      **gpuNum** is the number of GPUs used in the **ps** container.

    - **cpuNum**

      **cpuNum** is the number of CPUs used in the **ps** container.

    - **memoryMB**

      **memoryMB** is the amount of memory, in MB, used in the container.

    - **image**

      **image** is the Docker image used by **ps**.
  
  - **worker**

    **worker** is the configuration for Kubeflow's tensorflow-operator.

    - **replicas**

      **replicas** is the number of replicas of the **worker** role.

    - **command**

      **command** is the command run in the **worker** container.

    - **gpuNum**

      **gpuNum** is the number of GPUs used in the **worker** container.

    - **cpuNum**

      **cpuNum** is the number of CPUs used in the **worker** container.

    - **memoryMB**

      **memoryMB** is the amount of memory, in MB, used in the container.

    - **image**

      **image** is the Docker image used by **worker**.
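
  As a sketch of how the two roles might be combined, the fragment below declares one **ps** replica and two **worker** replicas. The replica counts and resource values are illustrative, and the Trial script is assumed to support distributed training.

  ```yaml
  trial:
    codeDir: .
    # Optional parameter-server role.
    ps:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
    # Worker role that runs the training.
    worker:
      replicas: 2
      command: python3 mnist.py
      gpuNum: 1
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
  ```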

- **localConfig**

  **localConfig** applies only when **trainingServicePlatform** is set to `local`; otherwise the configuration file must not contain a **localConfig** section.

  - **gpuIndices**

    **gpuIndices** specifies which GPUs to use. When it is set, only the specified GPUs are used to run Trial jobs. One or more GPU indices may be given; multiple indices are separated by commas (,), such as `1` or `0,1,3`.
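
    For example, restricting local Trial jobs to the first two GPUs might look like the sketch below. The value is quoted as a precaution so that YAML reads the comma-separated list as a single string.

    ```yaml
    trainingServicePlatform: local
    localConfig:
      # Only GPUs 0 and 1 are used to run Trial jobs.
      gpuIndices: "0,1"
    ```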

- **machineList**

  **machineList** must be set if **trainingServicePlatform** is remote; otherwise it should be left empty.

  - **ip**

    **ip** is the IP address of the remote machine.

  - **port**

    **port** is the SSH port used to connect to the remote machine.

    Note: if port is left empty, the default value 22 is used.

  - **username**

    **username** is the account name on the remote machine.

  - **passwd**

    **passwd** specifies the password of the account.

  - **sshKeyPath**

    To log in to the remote machine with an SSH key, set **sshKeyPath**. **sshKeyPath** must be a valid path to an SSH key file.

    Note: if both passwd and sshKeyPath are set, NNI uses passwd.

  - **passphrase**

    **passphrase** is used to protect the SSH key; it can be left empty if the key has none.

  - **gpuIndices**

    **gpuIndices** specifies which GPUs to use. When it is set, only the specified GPUs on the remote machine are used to run Trial jobs. One or more GPU indices may be given; multiple indices are separated by commas (,), such as `1` or `0,1,3`.
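
  A single remote machine that is reached with an SSH key and limited to two GPUs might be described as in this sketch. The address, account, key path and passphrase are placeholder values.

  ```yaml
  trainingServicePlatform: remote
  machineList:
    - ip: 10.10.10.12
      port: 22
      username: test
      # Key-based login; passwd is left out so that the key is used.
      sshKeyPath: /nni/sshkey
      passphrase: qwert
      # Only GPUs 0 and 1 of this machine are used to run Trial jobs.
      gpuIndices: "0,1"
  ```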

- **kubeflowConfig**:

  - **operator**

    **operator** specifies the Kubeflow operator to use. The current version of NNI supports **tf-operator**.

  - **storage**

    **storage** specifies the storage type used with Kubeflow, one of {**nfs**, **azureStorage**}. This field is optional and defaults to **nfs**. It must be set when azureStorage is used.

  - **nfs**

    **server** is the address of the NFS server.

    **path** is the mounted path of the NFS share.

  - **keyVault**

    Users of Azure Kubernetes Service must set keyVault to use the private key of the Azure storage account. See https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2

    - **vaultName**

      **vaultName** is the value of `--vault-name` in the az command.

    - **name**

      **name** is the value of `--name` in the az command.

  - **azureStorage**

    Users of Azure Kubernetes Service must set up an Azure storage account to store the code files.

    - **accountName**

      **accountName** is the name of the Azure storage account.

    - **azureShare**

      **azureShare** is the share of the Azure file storage.
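
  Since **storage** defaults to **nfs**, a configuration that uses Azure storage would presumably state the storage type explicitly. The sketch below assumes that **storage** sits directly under **kubeflowConfig**; the vault, key and account names are placeholders.

  ```yaml
  kubeflowConfig:
    operator: tf-operator
    # Must be set when Azure storage is used; the default is nfs.
    storage: azureStorage
    keyVault:
      vaultName: Contoso-Vault
      name: AzureStorageAccountKey
    azureStorage:
      accountName: storage
      azureShare: share01
  ```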

- **paiConfig**

  - **userName**

    **userName** is the user name of the OpenPAI account.

  - **passWord**

    **passWord** is the password of the OpenPAI account.

  - **host**

    **host** is the host address of OpenPAI.

<a name="Examples"></a>

## Examples

- **Local mode**

  To run Trial jobs on the local machine and use annotations to generate the search space, use a configuration like the following:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  #choice: true, false
  useAnnotation: true
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```
  
  To use an Assessor, add an assessor section to the configuration:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  assessor:
    #choice: Medianstop
    builtinAssessorName: Medianstop
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```
  
  Alternatively, a user-defined Tuner and Assessor can be specified:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    codeDir: /nni/tuner
    classFileName: mytuner.py
    className: MyTuner
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  assessor:
    codeDir: /nni/assessor
    classFileName: myassessor.py
    className: MyAssessor
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

- **Remote mode**

  To run Trial jobs on remote machines, add the machine information to the configuration:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: remote
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  #machineList can be empty if the platform is local
  machineList:
    - ip: 10.10.10.10
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.11
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.12
      port: 22
      username: test
      sshKeyPath: /nni/sshkey
      passphrase: qwert
  ```

- **pai mode**

  ```yaml
  authorName: test
  experimentName: nni_test1
  trialConcurrency: 1
  maxExecDuration: 500h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: pai
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution, BatchTuner
    #SMAC (SMAC must be installed separately with nnictl package)
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 main.py
    codeDir: .
    gpuNum: 4
    cpuNum: 2
    memoryMB: 10000
    # Docker image used to run NNI jobs on OpenPAI
    image: msranni/nni:latest
    # HDFS directory on OpenPAI for storing data, e.g. 'hdfs://host:port/directory'
    dataDir: hdfs://10.11.12.13:9000/test
    # HDFS directory on OpenPAI for storing output, e.g. 'hdfs://host:port/directory'
    outputDir: hdfs://10.11.12.13:9000/test
  paiConfig:
    # OpenPAI user name
    userName: test
    # OpenPAI password
    passWord: test
    # IP address of the OpenPAI host
    host: 10.10.10.10
  ```

- **Kubeflow mode**

  Kubeflow with NFS storage:

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    nfs:
      server: 10.10.10.10
      path: /var/nfs/general
  ```
  
  Kubeflow with Azure storage:

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  #nniManagerIp: 10.10.10.10
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    builtinAssessorName: Medianstop
    classArgs:
      optimize_mode: maximize
    gpuNum: 0
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 4096
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    keyVault:
      vaultName: Contoso-Vault
      name: AzureStorageAccountKey
    azureStorage:
      accountName: storage
      azureShare: share01
  ```