===========================
Experiment Config Reference
===========================

A config file is required to create an experiment. This document describes the rules for writing a config file and provides some examples.

.. Note::

    1. This document lists field names in ``camelCase``. When these fields are set through the NNI Python APIs (e.g., ``nni.experiment``), the field names should be converted to ``snake_case``.

    2. In this document, field types are written as `Python type hints <https://docs.python.org/3.10/library/typing.html>`_. Therefore JSON objects are called ``dict`` and arrays are called ``list``.

    .. _path: 

    3. Some fields take a path to a file or directory. Unless otherwise noted, both absolute and relative paths are supported, and ``~`` will be expanded to the home directory.

       - When written in the YAML file, relative paths are relative to the directory containing that file.
       - When assigned in Python code, relative paths are relative to the current working directory.
       - All relative paths are converted to absolute paths when a YAML file is loaded into a Python class and when a Python class is saved to a YAML file.

    4. Setting a field to ``None`` or ``null`` is equivalent to not setting the field.
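
    As a small sketch of rules 3 and 4, assume the config file is saved as ``/home/alice/exp/config.yml`` (a hypothetical location chosen only for illustration):

    .. code-block:: yaml

        # Hypothetical file location: /home/alice/exp/config.yml
        trialCodeDirectory: code                      # relative to the file's directory, i.e. /home/alice/exp/code
        experimentWorkingDirectory: ~/nni-experiments # "~" is expanded to the home directory
        experimentName: null                          # equivalent to leaving the field out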

.. contents:: Contents
   :local:
   :depth: 3
 

Examples
========

Local Mode
^^^^^^^^^^

.. code-block:: yaml

    experimentName: MNIST
    searchSpaceFile: search_space.json
    trialCommand: python mnist.py
    trialCodeDirectory: .
    trialGpuNumber: 1
    trialConcurrency: 2
    maxExperimentDuration: 24h
    maxTrialNumber: 100
    tuner:
      name: TPE
      classArgs:
        optimize_mode: maximize
    trainingService:
      platform: local
      useActiveGpu: True

Local Mode (Inline Search Space)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: yaml

    searchSpace:
      batch_size:
        _type: choice
        _value: [16, 32, 64]
      learning_rate:
        _type: loguniform
        _value: [0.0001, 0.1]
    trialCommand: python mnist.py
    trialGpuNumber: 1
    trialConcurrency: 2
    tuner:
      name: TPE
      classArgs:
        optimize_mode: maximize
    trainingService:
      platform: local
      useActiveGpu: True

Remote Mode
^^^^^^^^^^^

.. code-block:: yaml

    experimentName: MNIST
    searchSpaceFile: search_space.json
    trialCommand: python mnist.py
    trialCodeDirectory: .
    trialGpuNumber: 1
    trialConcurrency: 2
    maxExperimentDuration: 24h
    maxTrialNumber: 100
    tuner:
      name: TPE
      classArgs:
        optimize_mode: maximize
    trainingService:
      platform: remote
      machineList:
        - host: 11.22.33.44
          user: alice
          password: xxxxx
        - host: my.domain.com
          user: bob
          sshKeyFile: ~/.ssh/id_rsa

Reference
=========

ExperimentConfig
^^^^^^^^^^^^^^^^

experimentName
--------------

Mnemonic name of the experiment, which will be shown in WebUI and nnictl.

type: ``Optional[str]``


searchSpaceFile
---------------

Path_ to the JSON file containing the search space.

type: ``Optional[str]``

The search space format is determined by the tuner. The common format for built-in tuners is documented `here <../Tutorial/SearchSpaceSpec.rst>`__.

Mutually exclusive with `searchSpace`_.


searchSpace
-----------

Search space object.

type: ``Optional[JSON]``

The format is determined by the tuner. The common format for built-in tuners is documented `here <../Tutorial/SearchSpaceSpec.rst>`__.

Note that ``None`` means "no such field", so an empty search space should be written as ``{}``.

Mutually exclusive with `searchSpaceFile`_.


trialCommand
------------

Command to launch a trial.

type: ``str``

The command will be executed in bash on Linux and macOS, and in PowerShell on Windows.

Note that you should use ``python3`` on Linux and macOS, and ``python`` on Windows.
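
As a minimal sketch, assuming a Linux or macOS host and a trial script named ``mnist.py`` (the script name is borrowed from the examples at the top of this document):

.. code-block:: yaml

    # Linux / macOS: the command is executed in bash
    trialCommand: python3 mnist.py

On Windows, where the command is executed in PowerShell, the same field would read ``trialCommand: python mnist.py``.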


trialCodeDirectory
------------------

`Path`_ to the directory containing trial source files.

type: ``str``

default: ``"."``

All files in this directory will be sent to the training machine, except those excluded by the ``.nniignore`` file.
(See :ref:`nniignore <nniignore>` for details.)


trialConcurrency
----------------

Specify how many trials should be run concurrently.

type: ``int``

The real concurrency also depends on hardware resources and may be less than this value.


trialGpuNumber
--------------

Number of GPUs used by each trial.

type: ``Optional[int]``

The meaning of this field may differ slightly between training services,
especially when it is set to ``0`` or ``None``.
See the `training service documentation <../training_services.rst>`__ for details.

In local mode, setting the field to ``0`` prevents trials from accessing any GPU (by setting ``CUDA_VISIBLE_DEVICES`` to an empty value).
When it is set to ``None``, trials are created and scheduled as if they did not use a GPU,
but they can still use all GPU resources if they want.
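
The two local-mode cases described above can be sketched as follows (only one of the two ``trialGpuNumber`` lines would be used at a time):

.. code-block:: yaml

    trialGpuNumber: 0        # trials cannot access any GPU
    # trialGpuNumber: null   # same as omitting the field: scheduled as if GPU-free, GPUs remain usable
    trainingService:
      platform: local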


maxExperimentDuration
---------------------

Limit the duration of this experiment if specified.

type: ``Optional[str]``

format: ``number + s|m|h|d``

examples: ``"10m"``, ``"0.5h"``

When time runs out, the experiment will stop creating trials but continue to serve WebUI.


maxTrialNumber
--------------

Limit the number of trials to create if specified.

type: ``Optional[int]``

When the budget runs out, the experiment will stop creating trials but continue to serve WebUI.


maxTrialDuration
---------------------

Limit the duration of each trial job if specified.

type: ``Optional[str]``

format: ``number + s|m|h|d``

examples: ``"10m"``, ``"0.5h"``

When time runs out, the current trial job will stop.
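
The three budget fields can be combined. The values below are arbitrary and only illustrate the ``number + s|m|h|d`` format:

.. code-block:: yaml

    maxExperimentDuration: 12h   # stop creating trials after 12 hours
    maxTrialNumber: 50           # stop creating trials once 50 trials have been created
    maxTrialDuration: 30m        # stop a trial job that runs longer than 30 minutes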


nniManagerIp
------------

IP of the current machine, used by training machines to access NNI manager. Not used in local mode.

type: ``Optional[str]``

If not specified, the IPv4 address of ``eth0`` will be used.

Except for the local mode, it is highly recommended to set this field manually.
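
A sketch for remote mode is shown below. The address ``10.0.0.5`` is a hypothetical value standing in for an IP of the current machine that the training machines can reach; the machine entry reuses the placeholder values from the examples above:

.. code-block:: yaml

    nniManagerIp: 10.0.0.5          # hypothetical address of the machine running NNI manager
    trainingService:
      platform: remote
      machineList:
        - host: 11.22.33.44
          user: alice
          sshKeyFile: ~/.ssh/id_rsa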


useAnnotation
-------------

Enable `annotation <../Tutorial/AnnotationSpec.rst>`__.

type: ``bool``

default: ``False``

When using annotation, `searchSpace`_ and `searchSpaceFile`_ should not be specified manually.


debug
-----

Enable debug mode.

type: ``bool``

default: ``False``

When enabled, logging will be more verbose and some internal validation will be loosened.


logLevel
--------

Set the log level of the whole system.

type: ``Optional[str]``

values: ``"trace"``, ``"debug"``, ``"info"``, ``"warning"``, ``"error"``, ``"fatal"``

Defaults to ``"debug"`` when the `debug`_ option is enabled, and to ``"info"`` otherwise.

Most modules of NNI will be affected by this value, including NNI manager, tuner, training service, etc.

The exception is trial, whose logging level is directly managed by trial code.

For Python modules, "trace" acts as logging level 0 and "fatal" acts as ``logging.CRITICAL``.
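
For illustration, an explicit log level might be set like this (the chosen value is arbitrary):

.. code-block:: yaml

    debug: false
    logLevel: warning   # applies to NNI manager, tuner, training service, etc.; trial logging is managed by trial code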


experimentWorkingDirectory
--------------------------

Specify the :ref:`directory <path>` in which to place logs, checkpoints, metadata, and other run-time data.

type: ``Optional[str]``

Defaults to ``~/nni-experiments``.

NNI will create a subdirectory named by experiment ID, so it is safe to use the same directory for multiple experiments.


tunerGpuIndices
---------------

Limit the GPUs visible to tuner, assessor, and advisor.

type: ``Optional[list[int] | str | int]``

This will be the ``CUDA_VISIBLE_DEVICES`` environment variable of the tuner process.

Because tuner, assessor, and advisor run in the same process, this option will affect them all.
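
The accepted types can be sketched as below. The list and integer forms follow directly from the type above; the comma-separated string form is an assumption based on the usual ``CUDA_VISIBLE_DEVICES`` syntax:

.. code-block:: yaml

    tunerGpuIndices: [0, 1]     # list form
    # tunerGpuIndices: 0        # single integer form
    # tunerGpuIndices: "0,1"    # string form (assumed to be comma-separated)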


tuner
-----

Specify the tuner. 

type: Optional `AlgorithmConfig`_

The built-in tuners can be found `here <../builtin_tuner.rst>`__ and you can follow `this tutorial <../Tuner/CustomizeTuner.rst>`__ to customize a new tuner.


assessor
--------

Specify the assessor. 

type: Optional `AlgorithmConfig`_

The built-in assessors can be found `here <../builtin_assessor.rst>`__ and you can follow `this tutorial <../Assessor/CustomizeAssessor.rst>`__ to customize a new assessor.


advisor
-------

Specify the advisor. 

type: Optional `AlgorithmConfig`_

NNI provides two built-in advisors: `BOHB <../Tuner/BohbAdvisor.rst>`__ and `Hyperband <../Tuner/HyperbandAdvisor.rst>`__, and you can follow `this tutorial <../Tuner/CustomizeAdvisor.rst>`__ to customize a new advisor.


trainingService
---------------

Specify the `training service <../TrainingService/Overview.rst>`__.

type: `TrainingServiceConfig`_


sharedStorage
-------------

Configure the shared storage. Detailed usage can be found `here <../Tutorial/HowToUseSharedStorage.rst>`__.

type: Optional `SharedStorageConfig`_


AlgorithmConfig
^^^^^^^^^^^^^^^

``AlgorithmConfig`` describes a tuner / assessor / advisor algorithm.

For customized algorithms, there are two ways to describe them:

  1. `Register the algorithm <../Tutorial/InstallCustomizedAlgos.rst>`__ to use it like a built-in one. (preferred)

  2. Specify code directory and class name directly.


name
----

Name of the built-in or registered algorithm.

type: ``str`` for built-in and registered algorithms, ``None`` for other customized algorithms.


className
---------

Qualified class name of a customized algorithm that is not registered.

type: ``None`` for built-in and registered algorithms, ``str`` for other customized algorithms.

example: ``"my_tuner.MyTuner"``


codeDirectory
-------------

`Path`_ to the directory containing the customized algorithm class.

type: ``None`` for built-in and registered algorithms, ``str`` for other customized algorithms.


classArgs
---------

Keyword arguments passed to the algorithm class's constructor.

type: ``Optional[dict[str, Any]]``

See the algorithm's documentation for supported values.
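
The two ways of describing an algorithm can be sketched as follows. The first form uses the built-in ``TPE`` tuner from the examples above:

.. code-block:: yaml

    # Built-in or registered algorithm: only ``name`` is set
    tuner:
      name: TPE
      classArgs:
        optimize_mode: maximize

The second form is for a customized algorithm that is not registered. The directory ``./tuners`` and the keyword argument are hypothetical; ``classArgs`` must match whatever the class's constructor actually accepts:

.. code-block:: yaml

    # Unregistered customized algorithm: ``className`` and ``codeDirectory`` are set instead of ``name``
    tuner:
      className: my_tuner.MyTuner
      codeDirectory: ./tuners        # hypothetical directory containing my_tuner.py
      classArgs:
        optimize_mode: maximize      # hypothetical constructor argument of MyTuner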


TrainingServiceConfig
^^^^^^^^^^^^^^^^^^^^^

One of the following:

- `LocalConfig`_
- `RemoteConfig`_
- :ref:`OpenpaiConfig <openpai-class>`
- `AmlConfig`_
- `DlcConfig`_
- `HybridConfig`_

For the `Kubeflow <../TrainingService/KubeflowMode.rst>`_, `FrameworkController <../TrainingService/FrameworkControllerMode.rst>`_, and `AdaptDL <../TrainingService/AdaptDLMode.rst>`_ training platforms, using the `v1 config schema <../Tutorial/ExperimentConfig.rst>`_ is recommended for now.


LocalConfig
-----------

Detailed usage can be found `here <../TrainingService/LocalMode.rst>`__.

platform
""""""""

Constant string ``"local"``.


useActiveGpu
""""""""""""

Specify whether NNI should submit trials to GPUs occupied by other tasks.

type: ``Optional[bool]``

Must be set when `trialGpuNumber`_ is greater than zero.

The following processes can make a GPU "active":

  - non-NNI CUDA programs
  - graphical desktop
  - trials submitted by other NNI instances, if more than one NNI experiment is running at the same time
  - other users' CUDA programs, if you are using a shared server
  
If you are using a graphical OS like Windows 10 or Ubuntu desktop, set this field to ``True``; otherwise, the GUI will prevent NNI from launching any trial.

When you create multiple NNI experiments and ``useActiveGpu`` is set to ``True``, they will submit multiple trials to the same GPU(s) simultaneously.


maxTrialNumberPerGpu
""""""""""""""""""""

Specify how many trials can share one GPU.

type: ``int``

default: ``1``


gpuIndices
""""""""""

Limit the GPUs visible to trial processes.

type: ``Optional[list[int] | str | int]``

If `trialGpuNumber`_ is less than the length of this value, only a subset will be visible to each trial.

This will be used as the ``CUDA_VISIBLE_DEVICES`` environment variable.
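
Putting the ``LocalConfig`` fields together, a sketch with arbitrary illustrative values could look like this:

.. code-block:: yaml

    trialGpuNumber: 1
    trainingService:
      platform: local
      useActiveGpu: false        # do not submit trials to GPUs occupied by other tasks
      maxTrialNumberPerGpu: 2    # up to two trials may share one GPU
      gpuIndices: [0, 1]         # trialGpuNumber (1) is less than 2, so each trial sees only a subset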


RemoteConfig
------------

Detailed usage can be found `here <../TrainingService/RemoteMachineMode.rst>`__.

platform
""""""""

Constant string ``"remote"``.


machineList
"""""""""""

List of training machines.

type: list of `RemoteMachineConfig`_


reuseMode
"""""""""

Enable `reuse mode <../TrainingService/Overview.rst#training-service-under-reuse-mode>`__.

type: ``bool``


RemoteMachineConfig
"""""""""""""""""""

host
****

IP or hostname (domain name) of the machine.

type: ``str``


port
****

SSH service port.

type: ``int``

default: ``22``


user
****

Login user name.

type: ``str``


password
********

Login password.

type: ``Optional[str]``

If not specified, `sshKeyFile`_ will be used instead.


sshKeyFile
**********

`Path`_ to the SSH key file (identity file).

type: ``Optional[str]``

Only used when `password`_ is not specified.


sshPassphrase
*************

Passphrase of the SSH identity file.

type: ``Optional[str]``


useActiveGpu
************

Specify whether NNI should submit trials to GPUs occupied by other tasks.

type: ``bool``

default: ``False``

Must be set when `trialGpuNumber`_ is greater than zero.

The following processes can make a GPU "active":

  - non-NNI CUDA programs
  - graphical desktop
  - trials submitted by other NNI instances, if more than one NNI experiment is running at the same time
  - other users' CUDA programs, if you are using a shared server
  
If your remote machine runs a graphical OS like Ubuntu desktop, set this field to ``True``; otherwise, the GUI will prevent NNI from launching any trial.

When you create multiple NNI experiments and ``useActiveGpu`` is set to ``True``, they will submit multiple trials to the same GPU(s) simultaneously.


maxTrialNumberPerGpu
********************

Specify how many trials can share one GPU.

type: ``int``

default: ``1``


gpuIndices
**********

Limit the GPUs visible to trial processes.

type: ``Optional[list[int] | str | int]``

If `trialGpuNumber`_ is less than the length of this value, only a subset will be visible to each trial.

This will be used as the ``CUDA_VISIBLE_DEVICES`` environment variable.


pythonPath
**********

Specify a Python environment.

type: ``Optional[str]``

This path will be inserted at the front of PATH. Here are some examples: 

    - (linux) pythonPath: ``/opt/python3.7/bin``
    - (windows) pythonPath: ``C:/Python37``

If you are using Anaconda, the setup differs slightly. On Windows, you also have to add the environment's ``Scripts`` and ``Library/bin`` directories, separated by ``;``. For example:

    - (linux anaconda) pythonPath: ``/home/yourname/anaconda3/envs/myenv/bin/``
    - (windows anaconda) pythonPath: ``C:/Users/yourname/.conda/envs/myenv;C:/Users/yourname/.conda/envs/myenv/Scripts;C:/Users/yourname/.conda/envs/myenv/Library/bin``

This is useful if preparation steps vary between machines.
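
A fuller ``machineList`` entry might look like the sketch below. The host and user are the placeholders from the remote-mode example above; the remaining values are illustrative assumptions:

.. code-block:: yaml

    trainingService:
      platform: remote
      reuseMode: true
      machineList:
        - host: my.domain.com
          port: 22                         # default SSH port
          user: bob
          sshKeyFile: ~/.ssh/id_rsa        # used because no password is given
          sshPassphrase: xxxxx             # placeholder; only needed if the key is protected
          useActiveGpu: false
          maxTrialNumberPerGpu: 1
          gpuIndices: [0]
          pythonPath: /opt/python3.7/bin   # inserted at the front of PATH on this machine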

.. _openpai-class:

OpenpaiConfig
-------------

Detailed usage can be found `here <../TrainingService/PaiMode.rst>`__.

platform
""""""""

Constant string ``"openpai"``.


host
""""

Hostname of OpenPAI service.

type: ``str``

This may include an ``https://`` or ``http://`` prefix.

HTTPS will be used by default.


username
""""""""

OpenPAI user name.

type: ``str``


token
"""""

OpenPAI user token.

type: ``str``

This can be found on your OpenPAI user settings page.


trialCpuNumber
""""""""""""""

Specify the number of CPUs each trial uses in the OpenPAI container.

type: ``int``


trialMemorySize
"""""""""""""""

Specify the amount of memory each trial uses in the OpenPAI container.

type: ``str``

format: ``number + tb|gb|mb|kb``

examples: ``"8gb"``, ``"8192mb"``


storageConfigName
"""""""""""""""""

Specify the storage name used in OpenPAI.

type: ``str``


dockerImage
"""""""""""

Name and tag of the Docker image used to run the trials.

type: ``str``

default: ``"msranni/nni:latest"``


localStorageMountPoint
""""""""""""""""""""""

:ref:`Mount point <path>` of storage service (typically NFS) on the local machine.

type: ``str``


containerStorageMountPoint
""""""""""""""""""""""""""

Mount point of storage service (typically NFS) in the Docker container.

type: ``str``

This must be an absolute path.


reuseMode
"""""""""

Enable `reuse mode <../TrainingService/Overview.rst#training-service-under-reuse-mode>`__.

type: ``bool``

default: ``False``


openpaiConfig
"""""""""""""

Embedded OpenPAI config file.

type: ``Optional[JSON]``


openpaiConfigFile
"""""""""""""""""

`Path`_ to OpenPAI config file.

type: ``Optional[str]``

An example can be found `here <https://github.com/microsoft/pai/blob/master/docs/manual/cluster-user/examples/hello-world-job.yaml>`__.
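
Taken together, the ``OpenpaiConfig`` fields could be filled in as in this sketch. The host name, storage name, mount points, and resource sizes are hypothetical placeholders, not values from a real cluster:

.. code-block:: yaml

    trainingService:
      platform: openpai
      host: https://pai.example.com              # hypothetical OpenPAI host
      username: alice
      token: xxxxx                               # placeholder for the token from the user settings page
      trialCpuNumber: 4
      trialMemorySize: 8gb
      storageConfigName: my-storage              # hypothetical storage name
      dockerImage: msranni/nni:latest
      localStorageMountPoint: /mnt/nfs/nni       # hypothetical mount point on the local machine
      containerStorageMountPoint: /mnt/nfs/nni   # must be an absolute path
      reuseMode: false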


AmlConfig
---------

Detailed usage can be found `here <../TrainingService/AMLMode.rst>`__.


platform
""""""""

Constant string ``"aml"``.


dockerImage
"""""""""""

Name and tag of the Docker image used to run the trials.

type: ``str``

default: ``"msranni/nni:latest"``


subscriptionId
""""""""""""""

Azure subscription ID.

type: ``str``


resourceGroup
"""""""""""""

Azure resource group name.

type: ``str``


workspaceName
"""""""""""""

Azure workspace name.

type: ``str``


computeTarget
"""""""""""""

AML compute cluster name.

type: ``str``
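
The AML fields above might be combined as in the following minimal sketch. It is not a tested configuration: it assumes the section sits under a top-level ``trainingService`` key, and every angle-bracketed value is a placeholder to be replaced with your own Azure details.

.. code-block:: yaml

    # Hypothetical AML training service section; all <...> values are placeholders.
    trainingService:
      platform: aml
      dockerImage: msranni/nni:latest          # the documented default
      subscriptionId: <your-subscription-id>
      resourceGroup: <your-resource-group>
      workspaceName: <your-workspace-name>
      computeTarget: <your-compute-cluster>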


DlcConfig
---------

Detailed usage can be found `here <../TrainingService/DlcMode.rst>`__.


platform
""""""""

Constant string ``"dlc"``.


type
""""

Job spec type.

type: ``str``

default: ``"worker"``


image
"""""

Name and tag of the Docker image used to run the trials.

type: ``str``


jobType
"""""""

PAI-DLC training job type, ``"TFJob"`` or ``"PyTorchJob"``.

type: ``str``


podCount
""""""""

Number of pods used to run a single training job.

type: ``str``


ecsSpec
"""""""

Training server config spec string.

type: ``str``


region
""""""

The region where the PAI-DLC public cluster is located.

type: ``str``


nasDataSourceId
"""""""""""""""

The NAS data source ID configured on the PAI-DLC side.

type: ``str``



accessKeyId
"""""""""""

The ``accessKeyId`` of your cloud account.

type: ``str``



accessKeySecret
"""""""""""""""

The ``accessKeySecret`` of your cloud account.

type: ``str``



localStorageMountPoint
""""""""""""""""""""""

The mount point of the NAS on the PAI-DSW server. The default is ``/home/admin/workspace/``.

type: ``str``


containerStorageMountPoint
""""""""""""""""""""""""""

The mount point of the NAS on the PAI-DLC side. The default is ``/root/data/``.

type: ``str``
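
As a rough illustration, the DLC fields above could be assembled as shown below. This is a sketch under assumptions, not a verified configuration: the top-level ``trainingService`` key is assumed, and all angle-bracketed values are placeholders. The two mount points simply repeat the defaults stated above.

.. code-block:: yaml

    # Hypothetical PAI-DLC training service section; all <...> values are placeholders.
    trainingService:
      platform: dlc
      type: worker                             # the documented default
      image: <image-name>:<tag>
      jobType: TFJob                           # or PyTorchJob
      podCount: 1                              # documented above as str; quote it if your NNI version requires a string
      ecsSpec: <ecs-spec-string>
      region: <region>
      nasDataSourceId: <nas-data-source-id>
      accessKeyId: <access-key-id>
      accessKeySecret: <access-key-secret>
      localStorageMountPoint: /home/admin/workspace/
      containerStorageMountPoint: /root/data/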


HybridConfig
------------

Currently, only `LocalConfig`_, `RemoteConfig`_, :ref:`OpenpaiConfig <openpai-class>` and `AmlConfig`_ are supported. Detailed usage can be found `here <../TrainingService/HybridMode.rst>`__.

type: list of `TrainingServiceConfig`_
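
Because the type is a list, a hybrid setup would be written as a YAML sequence with one entry per training service. The sketch below is illustrative only: it assumes the list is placed under a top-level ``trainingService`` key, pairs a local service with an AML one, and uses placeholders for all Azure values.

.. code-block:: yaml

    # Hypothetical hybrid section: one list entry per training service.
    trainingService:
      - platform: local
      - platform: aml
        dockerImage: msranni/nni:latest
        subscriptionId: <your-subscription-id>
        resourceGroup: <your-resource-group>
        workspaceName: <your-workspace-name>
        computeTarget: <your-compute-cluster>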


SharedStorageConfig
^^^^^^^^^^^^^^^^^^^

Detailed usage can be found `here <../Tutorial/HowToUseSharedStorage.rst>`__.


nfsConfig
---------

storageType
"""""""""""

Constant string ``"NFS"``.


localMountPoint
"""""""""""""""

The path where the storage has been or will be mounted on the local machine.

type: ``str``

If the path does not exist, it will be created automatically. An absolute path is recommended, e.g. ``/tmp/nni-shared-storage``.


remoteMountPoint
""""""""""""""""

The path where the storage will be mounted on the remote machine.

type: ``str``

If the path does not exist, it will be created automatically. A relative path is recommended, e.g. ``./nni-shared-storage``.


localMounted
""""""""""""

Specifies who mounts the shared storage on the local machine, and whether it is mounted at all.

type: ``str``

values: ``"usermount"``, ``"nnimount"``, ``"nomount"``

- ``usermount``: the user has already mounted this storage on ``localMountPoint``.
- ``nnimount``: NNI will try to mount this storage on ``localMountPoint``.
- ``nomount``: the storage will not be mounted on the local machine. Support for partial storages is planned for the future.


nfsServer
"""""""""

NFS server host.

type: ``str``


exportedDirectory
"""""""""""""""""

Exported directory of the NFS server. Details can be found `here <https://www.ibm.com/docs/en/aix/7.2?topic=system-nfs-exporting-mounting>`__.

type: ``str``
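
Put together, an NFS shared storage section might look like the sketch below. It is an illustration rather than a tested configuration: the top-level ``sharedStorage`` key is assumed, the mount points reuse the example paths recommended above, and the server host and exported directory are placeholders.

.. code-block:: yaml

    # Hypothetical NFS shared storage section; all <...> values are placeholders.
    sharedStorage:
      storageType: NFS
      localMountPoint: /tmp/nni-shared-storage     # absolute path, as recommended
      remoteMountPoint: ./nni-shared-storage       # relative path, as recommended
      localMounted: nnimount                       # let NNI try to mount the storage locally
      nfsServer: <nfs-server-host>
      exportedDirectory: <exported-directory>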


azureBlobConfig
---------------

storageType
"""""""""""

Constant string ``"AzureBlob"``.


localMountPoint
"""""""""""""""

The path where the storage has been or will be mounted on the local machine.

type: ``str``

If the path does not exist, it will be created automatically. An absolute path is recommended, e.g. ``/tmp/nni-shared-storage``.


remoteMountPoint
""""""""""""""""

The path where the storage will be mounted on the remote machine.

type: ``str``

If the path does not exist, it will be created automatically. A relative path is recommended, e.g. ``./nni-shared-storage``.

Note that the directory must be empty when using AzureBlob. 


localMounted
""""""""""""

Specifies who mounts the shared storage on the local machine, and whether it is mounted at all.

type: ``str``

values: ``"usermount"``, ``"nnimount"``, ``"nomount"``

- ``usermount``: the user has already mounted this storage on ``localMountPoint``.
- ``nnimount``: NNI will try to mount this storage on ``localMountPoint``.
- ``nomount``: the storage will not be mounted on the local machine. Support for partial storages is planned for the future.


storageAccountName
""""""""""""""""""

Azure storage account name.

type: ``str``


storageAccountKey
"""""""""""""""""

Azure storage account key.

type: ``Optional[str]``


containerName
"""""""""""""

AzureBlob container name.

type: ``str``
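
For comparison with the NFS case, an AzureBlob shared storage section might be sketched as follows. The top-level ``sharedStorage`` key is an assumption, and the account name, key, and container name are placeholders. Note that ``storageAccountKey`` is typed ``Optional[str]`` above, so it appears here only for illustration.

.. code-block:: yaml

    # Hypothetical AzureBlob shared storage section; all <...> values are placeholders.
    sharedStorage:
      storageType: AzureBlob
      localMountPoint: /tmp/nni-shared-storage     # absolute path, as recommended
      remoteMountPoint: ./nni-shared-storage       # relative path; the directory must be empty
      localMounted: nnimount                       # let NNI try to mount the storage locally
      storageAccountName: <storage-account-name>
      storageAccountKey: <storage-account-key>
      containerName: <container-name>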