# Experiment Config Reference

When creating an Experiment, the path of a config file must be given to the `nnictl` command. The config file is written in YAML and must be well-formed. This document describes the content of the config file and provides some examples and templates.

- [Experiment Config Reference](#Experiment-config-reference)
  - [Template](#Template)
  - [Configuration spec](#Configuration)
  - [Examples](#Examples)

<a name="Template"></a>

## Template

- **Light weight (without Annotation and Assessor)**

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiPhase:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

- **Use Assessor**

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiPhase:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

- **Use Annotation**

```yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiPhase:
#choice: true, false, default: false
multiThread:
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
assessor:
  #choice: Medianstop
  builtinAssessorName:
  classArgs:
    #choice: maximize, minimize
    optimize_mode:
  gpuNum:
trial:
  command:
  codeDir:
  gpuNum:
#machineList can be empty if the platform is local
machineList:
  - ip:
    port:
    username:
    passwd:
```

<a name="Configuration"></a>

## Configuration spec

- **authorName**

  - Description

    **authorName** is the name of the author who creates the Experiment.

    TBD: add default value

- **experimentName**

  - Description

    **experimentName** is the name of the Experiment being created.

    TBD: add default value

- **trialConcurrency**

  - Description

    **trialConcurrency** specifies the max number of Trial jobs that run simultaneously.

    Note: if trialGpuNum is larger than the number of free GPUs, and the number of Trial jobs running simultaneously has not reached trialConcurrency, Trial jobs are put into a queue to wait for GPU allocation.

- **maxExecDuration**

  - Description

    **maxExecDuration** specifies the max duration of the Experiment. The time unit is one of {**s**, **m**, **h**, **d**}, which stand for {*seconds*, *minutes*, *hours*, *days*}.

    Note: maxExecDuration sets the run time of the Experiment, not of a Trial. When the Experiment reaches the max duration, it does not stop, but it no longer starts new Trial jobs.

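As a minimal sketch of how the duration units read, the fragment below caps a hypothetical Experiment at 90 minutes and lets at most two Trials run at once. The values are illustrative assumptions, not recommendations; only the field names and the unit suffixes come from the descriptions above.

```yaml
# Illustrative values only
trialConcurrency: 2      # at most two Trial jobs run at the same time
maxExecDuration: 90m     # 90 minutes; the suffix is one of s, m, h, d
```

With these settings, Trials that are already running when the 90 minutes elapse keep running, but no new Trial is started.
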
- **versionCheck**

  - Description

    NNI checks the version of the NNIManager process against the version of the trialKeeper process in remote, pai and Kubernetes modes. To disable the version check, set versionCheck to false.

- **debug**

  - Description

    Debug mode sets versionCheck to False and sets logLevel to 'debug'.

- **maxTrialNum**

  - Description

    **maxTrialNum** specifies the max number of Trial jobs created by NNI, counting both succeeded and failed jobs.

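The relationship between **debug**, **versionCheck** and **logLevel** can be sketched as two YAML documents that, going by the description of **debug** above, should configure the same behavior. This is a reading of the text rather than a tested equivalence, and the `maxTrialNum` value is an arbitrary example.

```yaml
# Document 1: turn on debug mode
debug: true
maxTrialNum: 10          # counts both succeeded and failed Trials
---
# Document 2: the settings that debug mode is described as applying
versionCheck: false
logLevel: debug
maxTrialNum: 10
```
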
- **trainingServicePlatform**

  - Description

    **trainingServicePlatform** specifies the platform on which the Experiment runs, one of {**local**, **remote**, **pai**, **kubeflow**}.

    - **local** runs the Experiment on the local Ubuntu machine.

    - **remote** submits Trial jobs to remote Ubuntu machines; **machineList** must be set to provide the SSH connection information of the remote machines.

    - **pai** submits Trial jobs to [OpenPAI](https://github.com/Microsoft/pai), Microsoft's open-source platform. For more about the OpenPAI configuration, refer to [pai mode](../TrainingService/PaiMode.md).

    - **kubeflow** submits Trial jobs to [Kubeflow](https://www.kubeflow.org/docs/about/kubeflow/). NNI supports Kubeflow on plain Kubernetes and on [Azure Kubernetes](https://azure.microsoft.com/en-us/services/kubernetes-service/). For details, refer to the [Kubeflow docs](../TrainingService/KubeflowMode.md).

- **searchSpacePath**

  - Description

    **searchSpacePath** specifies the path of the search space file. The file must be on the local machine where nnictl runs.

    Note: if useAnnotation=True is set, the searchSpacePath field must be removed.

- **useAnnotation**

  - Description

    **useAnnotation** specifies whether to use Annotation to analyze the Trial code and generate the search space.

    Note: if useAnnotation=True is set, the searchSpacePath field must be removed.

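The two ways of supplying a search space are mutually exclusive. The sketch below shows both as separate YAML documents; the path `/nni/search_space.json` is borrowed from the examples at the end of this document and stands in for any valid local path.

```yaml
# Option 1: the search space comes from a JSON file on the machine running nnictl
searchSpacePath: /nni/search_space.json
useAnnotation: false
---
# Option 2: the search space is generated from Annotation in the Trial code,
# so searchSpacePath is left out entirely
useAnnotation: true
```
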
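- **multiPhase**

  - Description

    **multiPhase** enables [multi-phase Experiments](../AdvancedFeature/MultiPhase.md).

- **multiThread**

  - Description

    **multiThread** enables the multi-thread mode of the Dispatcher. If multiThread is set to `true`, the Dispatcher starts a thread for each command it receives from the NNI Manager.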

- **nniManagerIp**

  - Description

    **nniManagerIp** sets the IP address of the machine on which the NNI Manager runs. This field is optional; if it is not set, the IP address of eth0 is used.

    Note: run ifconfig on the NNI Manager's machine to check whether eth0 exists. If it does not, setting nniManagerIp explicitly is recommended.

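If the NNI Manager machine has no eth0 device, the address can be given explicitly. A minimal sketch, in which the address is a placeholder and not a real host:

```yaml
# 10.10.10.10 is a placeholder; substitute an address of the NNI Manager machine
nniManagerIp: 10.10.10.10
```
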
- **logDir**

  - Description

    **logDir** configures the directory in which logs and data of the Experiment are stored. The default value is `<user home directory>/nni/experiment`.

- **logLevel**

  - Description

    **logLevel** sets the log level of the Experiment. The supported levels are `trace, debug, info, warning, error, fatal`. The default value is `info`.

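A short sketch of the two logging fields described above. The directory is a hypothetical path chosen for illustration; the level is one of the six values listed for **logLevel**.

```yaml
# Hypothetical directory, used here only as an example
logDir: /tmp/nni/experiment
# one of: trace, debug, info, warning, error, fatal
logLevel: debug
```
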
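- **logCollection**

  - Description

    **logCollection** sets how logs are collected on the remote, pai, kubeflow and frameworkcontroller platforms. Two settings are supported. With `http`, the Trial sends its logs back through HTTP POST requests, which slows down trialKeeper. With `none`, the Trial does not send logs back and only reports job metrics. If the logs are large, consider setting this field to `none`.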

- **Tuner**

  - Description

    **tuner** specifies the Tuner algorithm of the Experiment. There are two ways to set a Tuner. One way is to use a Tuner provided by the SDK, which requires **builtinTunerName** and **classArgs**. The other way is to use a user-defined Tuner, which requires **codeDir**, **classFileName**, **className** and **classArgs**.

  - **builtinTunerName** and **classArgs**

    - **builtinTunerName**

      **builtinTunerName** specifies the name of a built-in Tuner. The NNI SDK provides several Tuners, such as {**TPE**, **Random**, **Anneal**, **Evolution**, **BatchTuner**, **GridSearch**}.

    - **classArgs**

      **classArgs** specifies the arguments of the Tuner algorithm. If **builtinTunerName** is one of {**TPE**, **Random**, **Anneal**, **Evolution**}, **optimize_mode** must be set.

  - **codeDir**, **classFileName**, **className** and **classArgs**

    - **codeDir**

      **codeDir** specifies the directory of the Tuner code.

    - **classFileName**

      **classFileName** specifies the file name of the Tuner.

    - **className**

      **className** specifies the class name of the Tuner.

    - **classArgs**

      **classArgs** specifies the arguments of the Tuner algorithm.

  - **gpuNum**

    **gpuNum** specifies the number of GPUs used to run the Tuner process. The value of this field must be a positive integer. If the field is not set, NNI does not add `CUDA_VISIBLE_DEVICES` to the script (that is, it does not use `CUDA_VISIBLE_DEVICES` to control which GPUs are visible in the Trial) and does not manage GPU resources.

    Note: only one way may be used to specify the Tuner, for example either {tunerName, optimizationMode} or {tunerCommand, tunerCwd}, but not both.

  - **includeIntermediateResults**

    If **includeIntermediateResults** is true, the last intermediate result of a Trial that was early-stopped by the Assessor is sent to the Tuner as the final result. The default value of **includeIntermediateResults** is false.

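A sketch of a built-in Tuner with **includeIntermediateResults** turned on. The field is placed under `tuner` because this document lists it under the Tuner entry; that placement, and the pairing with a Medianstop Assessor, are assumptions for illustration.

```yaml
tuner:
  builtinTunerName: TPE
  classArgs:
    # required for TPE, Random, Anneal and Evolution
    optimize_mode: maximize
  # assumed placement: listed under the Tuner entry in this document
  includeIntermediateResults: true
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
```
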
- **Assessor**

  - Description

    **assessor** specifies the Assessor algorithm of the Experiment. There are two ways to set an Assessor. One way is to use an Assessor provided by the SDK, which requires **builtinAssessorName** and **classArgs**. The other way is to use a user-defined Assessor, which requires **codeDir**, **classFileName**, **className** and **classArgs**.

  - **builtinAssessorName** and **classArgs**

    - **builtinAssessorName**

      **builtinAssessorName** specifies the name of a built-in Assessor. The built-in Assessors of NNI include {**Medianstop**, etc.}.

    - **classArgs**

      **classArgs** specifies the arguments of the Assessor algorithm.

  - **codeDir**, **classFileName**, **className** and **classArgs**

    - **codeDir**

      **codeDir** specifies the directory of the Assessor code.

    - **classFileName**

      **classFileName** specifies the file name of the Assessor.

    - **className**

      **className** specifies the class name of the Assessor.

    - **classArgs**

      **classArgs** specifies the arguments of the Assessor algorithm.

  - **gpuNum**

    **gpuNum** specifies the number of GPUs used to run the Assessor process. The value of this field must be a positive integer.

    Note: only one way may be used to specify the Assessor, for example either {assessorName, optimizationMode} or {assessorCommand, assessorCwd}, but not both. If no Assessor is needed, leave this field empty.

- **trial (local, remote)**

  - **command**

    **command** specifies the command line that runs the Trial process.

  - **codeDir**

    **codeDir** specifies the directory of the Trial code files.

  - **gpuNum**

    **gpuNum** specifies the number of GPUs used to run the Trial process. The default value is 0.

- **trial (pai)**

  - **command**

    **command** specifies the command line that runs the Trial process.

  - **codeDir**

    **codeDir** specifies the directory of the Trial code files.

  - **gpuNum**

    **gpuNum** specifies the number of GPUs used to run the Trial process. The default value is 0.

  - **cpuNum**

    **cpuNum** specifies the number of CPUs used in the OpenPAI container.

  - **memoryMB**

    **memoryMB** specifies the amount of memory used in the OpenPAI container.

  - **image**

    **image** specifies the docker image used on OpenPAI.

  - **dataDir**

    **dataDir** is the data directory used in HDFS.

  - **outputDir**

    **outputDir** is the output directory used in HDFS. On OpenPAI, the stdout and stderr files are stored in this directory after the job finishes.

- **trial (kubeflow)**

  - **codeDir**

    **codeDir** specifies the local path of the code files.

  - **ps (optional)**

    **ps** is the configuration for Kubeflow's Tensorflow-operator.

    - **replicas**

      **replicas** is the number of replicas of the **ps** role.

    - **command**

      **command** is the script command run in the **ps** container.

    - **gpuNum**

      **gpuNum** is the number of GPUs used in the **ps** container.

    - **cpuNum**

      **cpuNum** is the number of CPUs used in the **ps** container.

    - **memoryMB**

      **memoryMB** specifies the amount of memory used in the container.

    - **image**

      **image** sets the docker image used by **ps**.

  - **worker**

    **worker** is the configuration for Kubeflow's Tensorflow-operator.

    - **replicas**

      **replicas** is the number of replicas of the **worker** role.

    - **command**

      **command** is the script command run in the **worker** container.

    - **gpuNum**

      **gpuNum** is the number of GPUs used in the **worker** container.

    - **cpuNum**

      **cpuNum** is the number of CPUs used in the **worker** container.

    - **memoryMB**

      **memoryMB** specifies the amount of memory used in the container.

    - **image**

      **image** sets the docker image used by **worker**.

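The Kubeflow examples at the end of this document only configure a **worker**. As a sketch of how the optional **ps** role might sit next to it, the fragment below fills in both roles with the fields listed above. The replica counts, resource sizes and the reuse of `python3 mnist.py` for both roles are assumptions made for illustration.

```yaml
trial:
  codeDir: .
  ps:
    replicas: 1
    command: python3 mnist.py   # hypothetical; a real job may use a different script for ps
    gpuNum: 0
    cpuNum: 1
    memoryMB: 8192
    image: msranni/nni:latest
  worker:
    replicas: 2                 # illustrative value
    command: python3 mnist.py
    gpuNum: 0
    cpuNum: 1
    memoryMB: 8192
    image: msranni/nni:latest
```
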
- **localConfig**

  **localConfig** applies only when **trainingServicePlatform** is set to `local`; otherwise the config file should not contain a **localConfig** section.

  - **gpuIndices**

    **gpuIndices** designates specific GPUs. When it is set, only the specified GPUs are used to run Trial jobs. One or more GPU indices can be given; multiple indices are separated by commas (,), for example `1` or `0,1,3`.

  - **maxTrialNumPerGpu**

    **maxTrialNumPerGpu** specifies the max number of concurrent Trials on each GPU device.

  - **useActiveGpu**

    **useActiveGpu** specifies whether NNI uses a GPU that already has other processes on it. By default, NNI only uses idle GPUs with no other process; if **useActiveGpu** is set to true, NNI uses all GPUs. This field does not apply to NNI on Windows.

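A minimal sketch of a **localConfig** section that combines the three fields above. The GPU indices and the per-GPU limit are illustrative values for a hypothetical machine.

```yaml
trainingServicePlatform: local
localConfig:
  gpuIndices: 0,1          # only GPUs 0 and 1 are used for Trial jobs
  maxTrialNumPerGpu: 2     # up to two concurrent Trials on each GPU
  useActiveGpu: false      # default behavior: skip GPUs that have other processes
```
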
- **machineList**

  **machineList** must be set if **trainingServicePlatform** is remote; otherwise it should be left empty.

  - **ip**

    **ip** is the IP address of the remote machine.

  - **port**

    **port** is the SSH port used to connect to the remote machine.

    Note: if port is left empty, the default value 22 is used.

  - **username**

    **username** is the account name on the remote machine.

  - **passwd**

    **passwd** specifies the password of the account.

  - **sshKeyPath**

    To log in to the remote machine with an SSH key, set **sshKeyPath**. **sshKeyPath** must be a valid path to an SSH key file.

    Note: if both passwd and sshKeyPath are set, NNI uses passwd.

  - **passphrase**

    **passphrase** protects the SSH key; it can be left empty if the key has no passphrase.

  - **gpuIndices**

    **gpuIndices** designates specific GPUs. When it is set, only the specified GPUs on the remote machine are used to run Trial jobs. One or more GPU indices can be given; multiple indices are separated by commas (,), for example `1` or `0,1,3`.

  - **maxTrialNumPerGpu**

    **maxTrialNumPerGpu** specifies the max number of concurrent Trials on each GPU device.

  - **useActiveGpu**

    **useActiveGpu** specifies whether NNI uses a GPU that already has other processes on it. By default, NNI only uses idle GPUs with no other process; if **useActiveGpu** is set to true, NNI uses all GPUs. This field does not apply to NNI on Windows.

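A sketch of a single **machineList** entry that logs in with an SSH key and restricts which GPUs are used. The address, account and key values are the placeholders from the remote-mode example at the end of this document; the GPU settings are illustrative.

```yaml
trainingServicePlatform: remote
machineList:
  - ip: 10.10.10.12
    port: 22                 # may be left empty, in which case 22 is used
    username: test
    sshKeyPath: /nni/sshkey  # passwd is omitted, so the key is used
    passphrase: qwert
    gpuIndices: 0,1,3
    maxTrialNumPerGpu: 1
    useActiveGpu: false
```
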
- **kubeflowConfig**

  - **operator**

    **operator** specifies the Kubeflow operator to use. The current version of NNI supports **tf-operator**.

  - **storage**

    **storage** specifies the storage type of Kubeflow, one of {**nfs**, **azureStorage**}. This field is optional and defaults to **nfs**. If azureStorage is used, this field must be filled in.

  - **nfs**

    **server** is the address of the NFS server.

    **path** is the mounted path of NFS.

  - **keyVault**

    If Azure Kubernetes Service is used, keyVault must be set so that the private key of the Azure storage account can be used. Reference: https://docs.microsoft.com/zh-cn/azure/key-vault/key-vault-manage-with-cli2

    - **vaultName**

      **vaultName** is the value of `--vault-name` in the az command.

    - **name**

      **name** is the value of `--name` in the az command.

  - **azureStorage**

    If Azure Kubernetes Service is used, an Azure storage account must be set to store the code files.

    - **accountName**

      **accountName** is the name of the Azure storage account.

    - **azureShare**

      **azureShare** is the share of the Azure file storage.

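Going by the description of **storage** above, a configuration backed by Azure storage would name the storage type explicitly. The fragment below is a sketch under that reading; the Azure example at the end of this document omits **storage**, so whether the field is strictly required is not confirmed here. The vault and account names are the placeholders used in that example.

```yaml
kubeflowConfig:
  operator: tf-operator
  # stated explicitly, following the description of the storage field
  storage: azureStorage
  keyVault:
    vaultName: Contoso-Vault
    name: AzureStorageAccountKey
  azureStorage:
    accountName: storage
    azureShare: share01
```
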
- **paiConfig**

  - **userName**

    **userName** is the user name of the OpenPAI account.

  - **password**

    **password** is the password of the OpenPAI user.

  - **host**

    **host** is the host address of OpenPAI.

<a name="Examples"></a>

## Examples

- **Local mode**

  To run Trial jobs on the local machine and use Annotation to generate the search space, the following config can be used:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  #choice: true, false
  useAnnotation: true
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```
  
  To add an Assessor to the config:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  assessor:
    #choice: Medianstop
    builtinAssessorName: Medianstop
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```
  
  Alternatively, a user-defined Tuner and Assessor can be specified:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: local
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    codeDir: /nni/tuner
    classFileName: mytuner.py
    className: MyTuner
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  assessor:
    codeDir: /nni/assessor
    classFileName: myassessor.py
    className: MyAssessor
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  ```

- **Remote mode**

  To run Trial jobs on remote machines, the machine information must be added:

  ```yaml
  authorName: test
  experimentName: test_experiment
  trialConcurrency: 3
  maxExecDuration: 1h
  maxTrialNum: 10
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: remote
  searchSpacePath: /nni/search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
    gpuNum: 0
  trial:
    command: python3 mnist.py
    codeDir: /nni/mnist
    gpuNum: 0
  # machineList can be empty for a local Experiment
  machineList:
    - ip: 10.10.10.10
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.11
      port: 22
      username: test
      passwd: test
    - ip: 10.10.10.12
      port: 22
      username: test
      sshKeyPath: /nni/sshkey
      passphrase: qwert
  ```

- **pai mode**

  ```yaml
  authorName: test
  experimentName: nni_test1
  trialConcurrency: 1
  maxExecDuration: 500h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: pai
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution, BatchTuner
    #SMAC (SMAC must be installed separately through nnictl package)
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    command: python3 main.py
    codeDir: .
    gpuNum: 4
    cpuNum: 2
    memoryMB: 10000
    # The docker image used to run NNI jobs on OpenPAI
    image: msranni/nni:latest
    # The HDFS directory on OpenPAI that stores data, in the form 'hdfs://host:port/directory'
    dataDir: hdfs://10.11.12.13:9000/test
    # The HDFS directory on OpenPAI that stores output, in the form 'hdfs://host:port/directory'
    outputDir: hdfs://10.11.12.13:9000/test
  paiConfig:
    # OpenPAI user name
    userName: test
    # OpenPAI password
    passWord: test
    # OpenPAI server IP
    host: 10.10.10.10
  ```

- **Kubeflow mode**

  Using NFS storage:

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    nfs:
      server: 10.10.10.10
      path: /var/nfs/general
  ```
  
  Using Azure storage:

  ```yaml
  authorName: default
  experimentName: example_mni
  trialConcurrency: 1
  maxExecDuration: 1h
  maxTrialNum: 1
  #choice: local, remote, pai, kubeflow
  trainingServicePlatform: kubeflow
  searchSpacePath: search_space.json
  #choice: true, false
  useAnnotation: false
  #nniManagerIp: 10.10.10.10
  tuner:
    #choice: TPE, Random, Anneal, Evolution
    builtinTunerName: TPE
    classArgs:
      #choice: maximize, minimize
      optimize_mode: maximize
  assessor:
    builtinAssessorName: Medianstop
    classArgs:
      optimize_mode: maximize
    gpuNum: 0
  trial:
    codeDir: .
    worker:
      replicas: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 4096
      image: msranni/nni:latest
  kubeflowConfig:
    operator: tf-operator
    keyVault:
      vaultName: Contoso-Vault
      name: AzureStorageAccountKey
    azureStorage:
      accountName: storage
      azureShare: share01
  ```