DLCMode.rst 4.2 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
**Run an Experiment on Aliyun PAI-DSW + PAI-DLC**
===================================================

NNI supports running an experiment on `PAI-DSW <https://help.aliyun.com/document_detail/194831.html>`__ , submit trials to `PAI-DLC <https://help.aliyun.com/document_detail/165137.html>`__ called dlc mode.

PAI-DSW server performs the role to submit a job while PAI-DLC is where the training job runs.

Setup environment
-----------------

Step 1. Install NNI, follow the install guide `here <../Tutorial/QuickStart.rst>`__.

Step 2. Create PAI-DSW server following this `link <https://help.aliyun.com/document_detail/163684.html?section-2cw-lsi-es9#title-ji9-re9-88x>`__. Note as the training service will be run on PAI-DLC, it won't cost many resources to run and you may just need a PAI-DSW server with CPU.

Step 3. Open PAI-DLC `here <https://pai-dlc.console.aliyun.com/#/guide>`__, select the same region as your PAI-DSW server. Move to ``dataset configuration`` and mount the same NAS disk as the PAI-DSW server does. (Note currently only PAI-DLC public-cluster is supported.)

Step 4. Open your PAI-DSW server command line, download and install PAI-DLC python SDK to submit DLC tasks, refer to `this link <https://help.aliyun.com/document_detail/203290.html>`__. Skip this step if SDK is already installed.


.. code-block:: bash

   wget https://sdk-portal-cluster-prod.oss-cn-zhangjiakou.aliyuncs.com/downloads/u-3536038a-3de7-4f2e-9379-0cb309d29355-python-pai-dlc.zip
   unzip u-3536038a-3de7-4f2e-9379-0cb309d29355-python-pai-dlc.zip
   pip install ./pai-dlc-20201203  # pai-dlc-20201203 refer to unzipped sdk file name, replace it accordingly.


Run an experiment
-----------------

Use ``examples/trials/mnist-pytorch`` as an example. The NNI config YAML file's content is like:

.. code-block:: yaml

  # working directory on DSW, please provie FULL path
  experimentWorkingDirectory: /home/admin/workspace/{your_working_dir}
  searchSpaceFile: search_space.json
  # the command on trial runner(or, DLC container), be aware of data_dir
  trialCommand: python mnist.py --data_dir /root/data/{your_data_dir}
  trialConcurrency: 1  # NOTE: please provide number <= 3 due to DLC system limit.
  maxTrialNumber: 10
  tuner:
    name: TPE
    classArgs:
      optimize_mode: maximize
  # ref: https://help.aliyun.com/document_detail/203290.html?spm=a2c4g.11186623.6.727.6f9b5db6bzJh4x
  trainingService:
    platform: dlc
    type: Worker
    image: registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/pytorch-training:1.6.0-gpu-py37-cu101-ubuntu18.04
    jobType: PyTorchJob                             # choices: [TFJob, PyTorchJob]
    podCount: 1
    ecsSpec: ecs.c6.large
    region: cn-hangzhou
    accessKeyId: ${your_ak_id}
    accessKeySecret: ${your_ak_key}
56
57
    nasDataSourceId: ${your_nas_data_source_id}     # NAS datasource ID, e.g., datat56by9n1xt0a
    ossDataSourceId: ${your_oss_data_source_id}     # OSS datasource ID, in case your data is on oss
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
    localStorageMountPoint: /home/admin/workspace/  # default NAS path on DSW
    containerStorageMountPoint: /root/data/         # default NAS path on DLC container, change it according your setting

Note: You should set ``platform: dlc`` in NNI config YAML file if you want to start experiment in dlc mode.

Compared with `LocalMode <LocalMode.rst>`__ training service configuration in dlc mode have these additional keys like ``type/image/jobType/podCount/ecsSpec/region/nasDataSourceId/accessKeyId/accessKeySecret``, for detailed explanation ref to this `link <https://help.aliyun.com/document_detail/203111.html#h2-url-3>`__.

Also, as dlc mode requires DSW/DLC to mount the same NAS disk to share information, there are two extra keys related to this: ``localStorageMountPoint`` and ``containerStorageMountPoint``.

Run the following commands to start the example experiment:

.. code-block:: bash

   git clone -b ${NNI_VERSION} https://github.com/microsoft/nni
   cd nni/examples/trials/mnist-pytorch

   # modify config_dlc.yml ...

   nnictl create --config config_dlc.yml

Replace ``${NNI_VERSION}`` with a released version name or branch name, e.g., ``v2.3``.

Monitor your job
----------------

To monitor your job on DLC, you need to visit `DLC  <https://pai-dlc.console.aliyun.com/#/jobs>`__ to check job status.