AdaptDL Training Service
========================

NNI supports running experiments on `AdaptDL <https://github.com/petuum/adaptdl>`__, a resource-adaptive deep learning training and scheduling framework. With the AdaptDL training service, your trial program runs as an AdaptDL job in a Kubernetes cluster.
AdaptDL aims to make distributed deep learning easy and efficient in dynamic-resource environments such as shared clusters and the cloud.

Prerequisite
------------

Before starting to use the NNI AdaptDL training service, you should have a Kubernetes cluster, either on-premises or on `Azure Kubernetes Service (AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , and an Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is set up to connect to your Kubernetes cluster.

#. A **Kubernetes** cluster using Kubernetes 1.14 or later with storage. Follow this guideline to set up Kubernetes `on Azure <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , or `on-premise <https://kubernetes.io/docs/setup/>`__ with `cephfs <https://kubernetes.io/docs/concepts/storage/storage-classes/#ceph-rbd>`__\ , or `microk8s with storage add-on enabled <https://microk8s.io/docs/addons>`__.
#. Use Helm to install the **AdaptDL Scheduler** on your Kubernetes cluster. Follow this `guideline <https://adaptdl.readthedocs.io/en/latest/installation/install-adaptdl.html>`__ to set up the AdaptDL scheduler.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager uses ``$(HOME)/.kube/config`` as the kubeconfig file's path. You can also specify another kubeconfig file by setting the **KUBECONFIG** environment variable (see the sketch after this list). Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. If your NNI trial job needs GPU resources, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure the **NVIDIA device plugin for Kubernetes**.
#. (Optional) Prepare an **NFS server** and export a general purpose mount as external storage.
#. Install **NNI**.
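
For example, steps 3 and 6 above might look as follows on your dev machine. This is a minimal sketch; the kubeconfig path ``~/kubeconfigs/aks.yaml`` is a hypothetical example:

..  code-block:: bash

    # Point NNI at a kubeconfig other than the default $(HOME)/.kube/config.
    export KUBECONFIG=~/kubeconfigs/aks.yaml

    # Install NNI from PyPI.
    python3 -m pip install --upgrade nni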

Verify the Prerequisites
^^^^^^^^^^^^^^^^^^^^^^^^

..  code-block:: bash

    nnictl --version
    # Expected: <version_number>

..  code-block:: bash

    kubectl version
    # Expected that the kubectl client version matches the server version.

..  code-block:: bash

    kubectl api-versions | grep adaptdl
    # Expected: adaptdl.petuum.com/v1
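
Optionally, you may also check that the AdaptDL scheduler pods themselves are healthy. This sketch assumes the scheduler was installed into the ``adaptdl`` namespace, as in the AdaptDL installation guideline:

..  code-block:: bash

    kubectl get pods -n adaptdl
    # Expected: the AdaptDL scheduler pods are in Running status.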

Usage
-----

We have a CIFAR10 example that fully leverages the AdaptDL scheduler in the :githublink:`examples/trials/cifar10_pytorch` folder (:githublink:`main_adl.py <examples/trials/cifar10_pytorch/main_adl.py>` and :githublink:`config_adl.yaml <examples/trials/cifar10_pytorch/config_adl.yaml>`).
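
For instance, a sketch of launching this example, assuming you run it from the root of the NNI repository:

..  code-block:: bash

    nnictl create --config examples/trials/cifar10_pytorch/config_adl.yaml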

Here is a template configuration specification to use AdaptDL as a training service.

..  code-block:: yaml

    authorName: default
    experimentName: minimal_adl

    trainingServicePlatform: adl
    nniManagerIp: 10.1.10.11
    logCollection: http

    tuner:
      builtinTunerName: GridSearch
    searchSpacePath: search_space.json

    trialConcurrency: 2
    maxTrialNum: 2

    trial:
      adaptive: false # optional.
      image: <image_tag>
      imagePullSecrets:  # optional
        - name: stagingsecret
      codeDir: .
      command: python main.py
      gpuNum: 1
      cpuNum: 1  # optional
      memorySize: 8Gi  # optional
      nfs: # optional
        server: 10.20.41.55
        path: /
        containerMountPath: /nfs
      checkpoint: # optional
        storageClass: dfs
        storageSize: 1Gi

..  warning::
    This configuration is written following the specification of the `legacy experiment configuration <https://nni.readthedocs.io/en/v2.6/Tutorial/ExperimentConfig.html>`__. It is still supported, and will be updated to the latest version in a future release.
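
For reference, the ``search_space.json`` referenced above could contain a minimal grid for the ``GridSearch`` tuner. This is a hypothetical example; the parameter name ``lr`` is an assumption and should match whatever your trial code reads:

..  code-block:: json

    {
        "lr": {"_type": "choice", "_value": [0.01, 0.1]}
    }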

The following explains the configuration fields of the AdaptDL training service.

* **trainingServicePlatform**\ : Choose ``adl`` to use the Kubernetes cluster with AdaptDL scheduler.
* **nniManagerIp**\ : *Required* by the ``adl`` training service to get the correct info and metrics back from the cluster.
  It is the IP address of the machine with the NNI manager (NNICTL) that launches the NNI experiment.
* **logCollection**\ : *Recommended* to set to ``http``. It will collect the trial logs on the cluster back to your machine via HTTP.
* **tuner**\ : It supports the Tuun tuner and all NNI built-in tuners (except for the checkpoint feature of the NNI PBT tuners).
* **trial**\ : It defines the specs of an ``adl`` trial.

  * **namespace**\ : (*Optional*\ ) Kubernetes namespace in which to launch the trials. Defaults to the ``default`` namespace.
  * **adaptive**\ : (*Optional*\ ) Boolean for the AdaptDL trainer. When ``true``\ , the job is preemptible and adaptive.
  * **image**\ : Docker image for the trial.
  * **imagePullSecrets**\ : (*Optional*\ ) If you are using a private registry,
    you need to provide the secret to successfully pull the image.
  * **codeDir**\ : the working directory of the container. ``.`` means the default working directory defined by the image.
  * **command**\ : the bash command to start the trial.
  * **gpuNum**\ : the number of GPUs requested for this trial. It must be a non-negative integer.
  * **cpuNum**\ : (*Optional*\ ) the number of CPUs requested for this trial. It must be a non-negative integer.
  * **memorySize**\ : (*Optional*\ ) the size of memory requested for this trial. It must follow the Kubernetes
    `default format <https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory>`__.
  * **nfs**\ : (*Optional*\ ) mounts external storage. For more information about using NFS, please check the NFS Storage paragraph below.
  * **checkpoint**\ : (*Optional*\ ) storage settings for model checkpoints.

    * **storageClass**\ : check the `Kubernetes storage documentation <https://kubernetes.io/docs/concepts/storage/storage-classes/>`__ for how to use the appropriate ``storageClass``.
    * **storageSize**\ : this value should be large enough to fit your model's checkpoints, otherwise it could cause a "disk quota exceeded" error.
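
To choose an appropriate ``storageClass``, you can list the classes registered on your cluster (the ``dfs`` class in the template above is only an example):

..  code-block:: bash

    kubectl get storageclass
    # Pick a NAME from this list for trial.checkpoint.storageClass.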

More Features
-------------

NFS Storage
^^^^^^^^^^^

As you may have noticed in the above configuration spec,
an *optional* section is available to configure NFS external storage. It can be omitted when no external storage is required, for example when a Docker image with the code and data inside is sufficient.

Note that the ``adl`` training service does NOT mount the NFS onto the local dev machine; you may manually mount it locally to manage the filesystem, copy data or code, etc.
The ``adl`` training service can then mount it into Kubernetes for every trial, given the proper configuration:


* **server**\ : NFS server address, e.g. an IP address or domain
* **path**\ : NFS server export path, i.e. the absolute path in the NFS that can be mounted to trials
* **containerMountPath**\ : the absolute path in the container at which to mount the NFS **path** above,
  so that every trial has access to the NFS.
  In the trial containers, you can access the NFS with this path.

Use cases:

* If your training trials depend on a large dataset, you may want to download it onto the NFS first,
  and mount it so that it can be shared across multiple trials.
* The storage for containers is ephemeral, and the trial containers are deleted after a trial's lifecycle is over.
  So if you want to export your trained models,
  you may mount the NFS to the trial to persist and export them.

In short, there is no restriction on how a trial reads from or writes to the NFS storage, so you may use it flexibly as per your needs.
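
For example, staging a dataset onto the NFS from your dev machine might look like the following sketch. The server address and export path are the hypothetical values from the template configuration above, and ``cifar10-data`` is a placeholder directory:

..  code-block:: bash

    # Mount the NFS export on the dev machine (requires an NFS client installed).
    sudo mkdir -p /mnt/nfs
    sudo mount -t nfs 10.20.41.55:/ /mnt/nfs

    # Copy the dataset; trials will see it under /nfs/datasets in their containers.
    sudo mkdir -p /mnt/nfs/datasets
    cp -r ./cifar10-data /mnt/nfs/datasets/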

Monitor via Log Stream
^^^^^^^^^^^^^^^^^^^^^^

Follow the log streaming of a certain trial:

.. code-block:: bash

   nnictl log trial --trial_id=TRIAL_ID

Or, for a trial of an experiment other than the current default one:

.. code-block:: bash

   nnictl log trial EXPERIMENT_ID --trial_id=TRIAL_ID

Note that *after* a trial is done and its pod has been deleted,
its logs can no longer be retrieved via this command.
However, you may still be able to access past trial logs
using the following approach.

Monitor via TensorBoard
^^^^^^^^^^^^^^^^^^^^^^^

In the context of NNI, an experiment has multiple trials.
For easy comparison across trials during a model tuning process,
we support TensorBoard integration. Here each experiment has
an independent TensorBoard logging directory, and thus its own dashboard.

You can only use TensorBoard while the monitored experiment is running.
In other words, monitoring stopped experiments is not supported.

In the trial container you may have access to two environment variables:


* ``ADAPTDL_TENSORBOARD_LOGDIR``\ : the TensorBoard logging directory for the current experiment,
* ``NNI_TRIAL_JOB_ID``\ : the ``trial`` job id for the current trial.

It is recommended to join them to form the logging directory for the trial,
for example in Python:

.. code-block:: python

   import os
   tensorboard_logdir = os.path.join(
       os.getenv("ADAPTDL_TENSORBOARD_LOGDIR"),
       os.getenv("NNI_TRIAL_JOB_ID")
   )
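
With PyTorch, for instance, you could then log metrics under this directory. This is a sketch, assuming ``torch`` is installed in your trial image; it continues from the snippet above:

.. code-block:: python

   from torch.utils.tensorboard import SummaryWriter

   writer = SummaryWriter(log_dir=tensorboard_logdir)
   for epoch in range(10):
       # Replace the dummy value with your real validation metric.
       writer.add_scalar("val/accuracy", 0.1 * epoch, epoch)
   writer.close()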

If an experiment is stopped, the data logged here
(under the directory defined by *the above environment variables*, monitored with the following commands)
will be lost. To persist the logged data, you can use external storage (e.g. mount an NFS)
to export it and view the TensorBoard locally.

With the above setting, you can monitor the experiment easily
via TensorBoard by

.. code-block:: bash

   nnictl tensorboard start

If multiple experiments are running at the same time, you may use

.. code-block:: bash

   nnictl tensorboard start EXPERIMENT_ID

It will provide you the web URL to access the TensorBoard.

Note that you have the flexibility to set up the local ``--port``
for the TensorBoard.
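
For example, to serve the dashboard on a specific local port:

.. code-block:: bash

   nnictl tensorboard start --port 6007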