PAI-DLC Training Service
========================

NNI supports running an experiment on `PAI-DSW `__ and submitting trials to `PAI-DLC `__, a deep learning container service based on Alibaba ACK.

The PAI-DSW server submits the jobs, while PAI-DLC is where the training jobs run.

Prerequisite
------------

Step 1. Install NNI by following the :doc:`install guide `.

Step 2. Create a PAI-DSW server following this `link `__. Because the trials run on PAI-DLC, the PAI-DSW server itself needs few resources; a CPU-only server is usually sufficient.

Step 3. Open PAI-DLC `here `__ and select the same region as your PAI-DSW server. Go to ``dataset configuration`` and mount the same NAS disk that the PAI-DSW server uses. (Note: currently only the PAI-DLC public cluster is supported.)

Step 4. Open the command line of your PAI-DSW server, then download and install the PAI-DLC Python SDK, which is used to submit DLC tasks; refer to `this link `__. Skip this step if the SDK is already installed.

.. code-block:: bash

   wget https://sdk-portal-cluster-prod.oss-cn-zhangjiakou.aliyuncs.com/downloads/u-3536038a-3de7-4f2e-9379-0cb309d29355-python-pai-dlc.zip
   unzip u-3536038a-3de7-4f2e-9379-0cb309d29355-python-pai-dlc.zip
   # pai-dlc-20201203 is the name of the unzipped SDK directory; replace it accordingly.
   pip install ./pai-dlc-20201203

Usage
-----

This section uses ``examples/trials/mnist-pytorch`` as an example. The NNI config YAML file looks like this:

.. code-block:: yaml

   # working directory on DSW, please provide the FULL path
   experimentWorkingDirectory: /home/admin/workspace/{your_working_dir}
   searchSpaceFile: search_space.json
   # the command run on the trial runner (i.e., the DLC container); be aware of data_dir
   trialCommand: python mnist.py --data_dir /root/data/{your_data_dir}
   trialConcurrency: 1  # NOTE: please provide a number <= 3 due to the DLC system limit.
   maxTrialNumber: 10
   tuner:
     name: TPE
     classArgs:
       optimize_mode: maximize
   # ref: https://help.aliyun.com/document_detail/203290.html?spm=a2c4g.11186623.6.727.6f9b5db6bzJh4x
   trainingService:
     platform: dlc
     type: Worker
     image: registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/pytorch-training:1.6.0-gpu-py37-cu101-ubuntu18.04
     jobType: PyTorchJob                             # choices: [TFJob, PyTorchJob]
     podCount: 1
     ecsSpec: ecs.c6.large
     region: cn-hangzhou
     accessKeyId: ${your_ak_id}
     accessKeySecret: ${your_ak_key}
     nasDataSourceId: ${your_nas_data_source_id}     # NAS datasource ID, e.g., datat56by9n1xt0a
     ossDataSourceId: ${your_oss_data_source_id}     # OSS datasource ID, in case your data is on OSS
     localStorageMountPoint: /home/admin/workspace/  # default NAS path on DSW
     containerStorageMountPoint: /root/data/         # default NAS path on the DLC container; change it according to your setting

Note: Set ``platform: dlc`` in the NNI config YAML file to start the experiment in dlc mode.

Compared with :doc:`local`, the training service configuration in dlc mode has additional keys such as ``type/image/jobType/podCount/ecsSpec/region/nasDataSourceId/accessKeyId/accessKeySecret``; for a detailed explanation, refer to this `link `__.

Also, because dlc mode requires DSW and DLC to mount the same NAS disk to share information, there are two extra keys related to this: ``localStorageMountPoint`` and ``containerStorageMountPoint``.

Run the following commands to start the example experiment:

.. code-block:: bash

   git clone -b ${NNI_VERSION} https://github.com/microsoft/nni
   cd nni/examples/trials/mnist-pytorch

   # modify config_dlc.yml ...

   nnictl create --config config_dlc.yml

Replace ``${NNI_VERSION}`` with a released version name or branch name, e.g., ``v2.3``.

Monitor your job
^^^^^^^^^^^^^^^^

To monitor your job on DLC, visit `DLC `__ and check the job status there.
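Returning to the configuration in the Usage section: the relationship between ``localStorageMountPoint`` and ``containerStorageMountPoint`` can be sketched in Python. The sketch assumes that, since DSW and DLC mount the same NAS disk, a file at some relative path under the DSW mount point appears at the same relative path under the container mount point. ``to_container_path`` is a hypothetical helper written for illustration; it is not part of NNI or the PAI-DLC SDK, and the default mount points are simply the values from the example config.

.. code-block:: python

   from pathlib import PurePosixPath


   def to_container_path(local_path,
                         local_mount="/home/admin/workspace/",
                         container_mount="/root/data/"):
       """Hypothetical helper: map a path on DSW to the assumed path in the DLC container.

       Assumes both mount points expose the same NAS directory tree.
       Raises ValueError if local_path is not under local_mount.
       """
       # Path of the file relative to the NAS mount point on DSW.
       relative = PurePosixPath(local_path).relative_to(PurePosixPath(local_mount))
       # Same relative path, re-rooted at the container mount point.
       return str(PurePosixPath(container_mount) / relative)


   # A data directory placed under the DSW mount point ...
   print(to_container_path("/home/admin/workspace/my_data"))

Under this assumption, the printed path is the value to pass as ``--data_dir`` in ``trialCommand``, because the trial command runs inside the DLC container rather than on DSW.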