Commit e9c21fd3 authored by Weidan Kong, committed by GitHub

HPO: Alibaba DSW+DLC support (#4055)

parent 5d0251fc
**Run an Experiment on Aliyun PAI-DSW + PAI-DLC**
===================================================
NNI supports running an experiment on `PAI-DSW <https://help.aliyun.com/document_detail/194831.html>`__ and submitting its trials to `PAI-DLC <https://help.aliyun.com/document_detail/165137.html>`__. This is called dlc mode.
The PAI-DSW server submits the jobs, while PAI-DLC is where the training jobs run.
Setup environment
-----------------
Step 1. Install NNI by following the install guide `here <../Tutorial/QuickStart.rst>`__.
Step 2. Create a PAI-DSW server following this `link <https://help.aliyun.com/document_detail/163684.html?section-2cw-lsi-es9#title-ji9-re9-88x>`__. Because the training itself runs on PAI-DLC, the PAI-DSW server needs few resources; a CPU-only server is usually enough.
Step 3. Open PAI-DLC `here <https://pai-dlc.console.aliyun.com/#/guide>`__ and select the same region as your PAI-DSW server. Go to ``dataset configuration`` and mount the same NAS disk that the PAI-DSW server mounts. (Note: currently only the PAI-DLC public cluster is supported.)
Step 4. Open a command line on your PAI-DSW server, then download and install the PAI-DLC Python SDK, which is used to submit DLC tasks; refer to `this link <https://help.aliyun.com/document_detail/203290.html>`__. Skip this step if the SDK is already installed.
.. code-block:: bash

   wget https://sdk-portal-cluster-prod.oss-cn-zhangjiakou.aliyuncs.com/downloads/u-3536038a-3de7-4f2e-9379-0cb309d29355-python-pai-dlc.zip
   unzip u-3536038a-3de7-4f2e-9379-0cb309d29355-python-pai-dlc.zip
   pip install ./pai-dlc-20201203  # pai-dlc-20201203 is the name of the unzipped SDK directory; replace it accordingly.

Run an experiment
-----------------
Take ``examples/trials/mnist-pytorch`` as an example. The NNI config YAML file looks like this:
.. code-block:: yaml

   # working directory on DSW, please provide the FULL path
   experimentWorkingDirectory: /home/admin/workspace/{your_working_dir}
   searchSpaceFile: search_space.json
   # the command on trial runner (or, DLC container), be aware of data_dir
   trialCommand: python mnist.py --data_dir /root/data/{your_data_dir}
   trialConcurrency: 1  # NOTE: please provide number <= 3 due to DLC system limit.
   maxTrialNumber: 10
   tuner:
     name: TPE
     classArgs:
       optimize_mode: maximize
   # ref: https://help.aliyun.com/document_detail/203290.html?spm=a2c4g.11186623.6.727.6f9b5db6bzJh4x
   trainingService:
     platform: dlc
     type: Worker
     image: registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/pytorch-training:1.6.0-gpu-py37-cu101-ubuntu18.04
     jobType: PyTorchJob  # choices: [TFJob, PyTorchJob]
     podCount: 1
     ecsSpec: ecs.c6.large
     region: cn-hangzhou
     accessKeyId: ${your_ak_id}
     accessKeySecret: ${your_ak_key}
     nasDataSourceId: ${your_nas_data_source_id}  # NAS datasource ID, e.g., datat56by9n1xt0a
     localStorageMountPoint: /home/admin/workspace/  # default NAS path on DSW
     containerStorageMountPoint: /root/data/  # default NAS path on DLC container, change it according to your setting

Note: Set ``platform: dlc`` in the NNI config YAML file to start an experiment in dlc mode.
Compared with `LocalMode <LocalMode.rst>`__, the training service configuration in dlc mode has these additional keys: ``type/image/jobType/podCount/ecsSpec/region/nasDataSourceId/accessKeyId/accessKeySecret``. For a detailed explanation, refer to this `link <https://help.aliyun.com/document_detail/203111.html#h2-url-3>`__.
Also, because dlc mode requires DSW and DLC to mount the same NAS disk to share information, there are two extra keys related to this: ``localStorageMountPoint`` and ``containerStorageMountPoint``.
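Since both mount points refer to the same NAS disk, a file written under ``localStorageMountPoint`` on DSW should appear under ``containerStorageMountPoint`` inside the DLC container. The following is only a minimal sketch of that path translation; the helper name ``to_container_path`` is hypothetical and not part of NNI, and the default values are the ones from the example config above.

```python
from pathlib import PurePosixPath


def to_container_path(local_path: str,
                      local_mount: str = "/home/admin/workspace/",
                      container_mount: str = "/root/data/") -> str:
    """Translate a path under the DSW NAS mount into the matching DLC container path.

    Hypothetical helper for illustration; raises ValueError if local_path
    is not located under local_mount.
    """
    relative = PurePosixPath(local_path).relative_to(PurePosixPath(local_mount))
    return str(PurePosixPath(container_mount) / relative)


print(to_container_path("/home/admin/workspace/mnist/data"))
```

This is why the example ``trialCommand`` passes a ``--data_dir`` below ``/root/data/``: the trial runs in the container and sees the NAS through the container mount point.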
Run the following commands to start the example experiment:
.. code-block:: bash

   git clone -b ${NNI_VERSION} https://github.com/microsoft/nni
   cd nni/examples/trials/mnist-pytorch
   # modify config_dlc.yml ...
   nnictl create --config config_dlc.yml

Replace ``${NNI_VERSION}`` with a released version name or branch name, e.g., ``v2.3``.
Monitor your job
----------------
To monitor your job, visit `DLC <https://pai-dlc.console.aliyun.com/#/jobs>`__ and check the job status.
......@@ -6,7 +6,7 @@ What is Training Service?
NNI training service is designed to allow users to focus on AutoML itself, agnostic to the underlying computing infrastructure where the trials are actually run. When migrating from one cluster to another (e.g., local machine to Kubeflow), users only need to tweak several configurations, and the experiment can be easily scaled.
Users can use training service provided by NNI, to run trial jobs on `local machine <./LocalMode.rst>`__\ , `remote machines <./RemoteMachineMode.rst>`__\ , and on clusters like `PAI <./PaiMode.rst>`__\ , `Kubeflow <./KubeflowMode.rst>`__\ , `AdaptDL <./AdaptDLMode.rst>`__\ , `FrameworkController <./FrameworkControllerMode.rst>`__\ , `DLTS <./DLTSMode.rst>`__ and `AML <./AMLMode.rst>`__. These are called *built-in training services*.
Users can use training service provided by NNI, to run trial jobs on `local machine <./LocalMode.rst>`__\ , `remote machines <./RemoteMachineMode.rst>`__\ , and on clusters like `PAI <./PaiMode.rst>`__\ , `Kubeflow <./KubeflowMode.rst>`__\ , `AdaptDL <./AdaptDLMode.rst>`__\ , `FrameworkController <./FrameworkControllerMode.rst>`__\ , `DLTS <./DLTSMode.rst>`__, `AML <./AMLMode.rst>`__ and `DLC <./DLCMode.rst>`__. These are called *built-in training services*.
If the computing resource customers try to use is not listed above, NNI provides interface that allows users to build their own training service easily. Please refer to `how to implement training service <./HowToImplementTrainingService.rst>`__ for details.
......@@ -44,6 +44,8 @@ Built-in Training Services
- NNI supports running experiment using `DLTS <https://github.com/microsoft/DLWorkspace.git>`__\ , which is an open source toolkit, developed by Microsoft, that allows AI scientists to spin up an AI cluster in turn-key fashion.
* - `AML <./AMLMode.rst>`__
- NNI supports running an experiment on `AML <https://azure.microsoft.com/en-us/services/machine-learning/>`__ , called aml mode.
* - `DLC <./DLCMode.rst>`__
- NNI supports running an experiment on `PAI-DLC <https://help.aliyun.com/document_detail/165137.html>`__ , called dlc mode.
What does Training Service do?
......@@ -77,4 +79,4 @@ When reuse mode is enabled, a cluster, such as a remote machine or a computer in
In the reuse mode, user needs to make sure each trial can run independently in the same job (e.g., avoid loading checkpoints from previous trials).
.. note:: Currently, only `Local <./LocalMode.rst>`__, `Remote <./RemoteMachineMode.rst>`__, `OpenPAI <./PaiMode.rst>`__ and `AML <./AMLMode.rst>`__ training services support resue mode. For Remote and OpenPAI training platforms, you can enable reuse mode according to `here <../reference/experiment_config.rst>`__ manually. AML is implemented under reuse mode, so the default mode is reuse mode, no need to manually enable.
.. note:: Currently, only `Local <./LocalMode.rst>`__, `Remote <./RemoteMachineMode.rst>`__, `OpenPAI <./PaiMode.rst>`__, `AML <./AMLMode.rst>`__ and `DLC <./DLCMode.rst>`__ training services support reuse mode. For Remote and OpenPAI training platforms, you can enable reuse mode manually according to `here <../reference/experiment_config.rst>`__. AML is implemented under reuse mode, so reuse mode is its default and does not need to be enabled manually.
......@@ -409,6 +409,7 @@ One of the following:
- `RemoteConfig`_
- :ref:`OpenpaiConfig <openpai-class>`
- `AmlConfig`_
- `DlcConfig`_
- `HybridConfig`_
For `Kubeflow <../TrainingService/KubeflowMode.rst>`_, `FrameworkController <../TrainingService/FrameworkControllerMode.rst>`_, and `AdaptDL <../TrainingService/AdaptDLMode.rst>`_ training platforms, it is suggested to use `v1 config schema <../Tutorial/ExperimentConfig.rst>`_ for now.
......@@ -797,6 +798,111 @@ AML compute cluster name.
type: ``str``
DlcConfig
---------
Detailed usage can be found `here <../TrainingService/DLCMode.rst>`__.
platform
""""""""
Constant string ``"dlc"``.
type
""""
Job spec type.
type: ``str``
default: ``"Worker"``
image
"""""
Name and tag of docker image to run the trials.
type: ``str``
jobType
"""""""
PAI-DLC training job type, ``"TFJob"`` or ``"PyTorchJob"``.
type: ``str``
podCount
""""""""
Pod count to run a single training job.
type: ``int``
ecsSpec
"""""""
Training server config spec string, e.g., ``ecs.c6.large``.
type: ``str``
region
""""""
The region where the PAI-DLC public cluster is located.
type: ``str``
nasDataSourceId
"""""""""""""""
The NAS data source ID configured on the PAI-DLC side.
type: ``str``
accessKeyId
"""""""""""
The accessKeyId of your cloud account.
type: ``str``
accessKeySecret
"""""""""""""""
The accessKeySecret of your cloud account.
type: ``str``
localStorageMountPoint
""""""""""""""""""""""
The mount point of the NAS on the PAI-DSW server; the default is ``/home/admin/workspace/``.
type: ``str``
containerStorageMountPoint
""""""""""""""""""""""""""
The mount point of the NAS on the PAI-DLC side; the default is ``/root/data/``.
type: ``str``
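To make the field list above easier to check against a config file, here is a minimal sketch of a sanity check over the ``trainingService`` section once it has been loaded into a dictionary. It is not NNI's own validation: the function name ``check_dlc_section`` is hypothetical, and which keys are treated as required is an assumption of this sketch.

```python
# Keys and types taken from the DlcConfig field list; treating all of them
# as required is an assumption made for this sketch.
EXPECTED_TYPES = {
    "image": str,
    "podCount": int,
    "ecsSpec": str,
    "region": str,
    "nasDataSourceId": str,
    "accessKeyId": str,
    "accessKeySecret": str,
    "localStorageMountPoint": str,
    "containerStorageMountPoint": str,
}


def check_dlc_section(section: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    if section.get("platform") != "dlc":
        problems.append("platform must be 'dlc'")
    for key, expected in EXPECTED_TYPES.items():
        if key not in section:
            problems.append(f"missing key: {key}")
        elif not isinstance(section[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    if section.get("jobType", "TFJob") not in ("TFJob", "PyTorchJob"):
        problems.append("jobType must be 'TFJob' or 'PyTorchJob'")
    return problems


example = {
    "platform": "dlc",
    "type": "Worker",
    "image": "registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/pytorch-training:1.6.0-gpu-py37-cu101-ubuntu18.04",
    "jobType": "PyTorchJob",
    "podCount": 1,
    "ecsSpec": "ecs.c6.large",
    "region": "cn-hangzhou",
    "nasDataSourceId": "datat56by9n1xt0a",
    "accessKeyId": "placeholder-id",
    "accessKeySecret": "placeholder-secret",
    "localStorageMountPoint": "/home/admin/workspace/",
    "containerStorageMountPoint": "/root/data/",
}
print(check_dlc_section(example))
```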
HybridConfig
------------
......
......@@ -11,4 +11,5 @@ Introduction to NNI Training Services
FrameworkController<./TrainingService/FrameworkControllerMode>
DLTS<./TrainingService/DLTSMode>
AML<./TrainingService/AMLMode>
PAI-DLC<./TrainingService/DLCMode>
Hybrid<./TrainingService/HybridMode>
# working directory on DSW, please provide the FULL path
searchSpaceFile: search_space.json
# the command on trial runner (or, DLC container), be aware of data_dir
trialCommand: python mnist.py --data_dir /root/data/{your_data_dir}
trialConcurrency: 1  # NOTE: please provide number <= 3 due to DLC system limit.
maxTrialNumber: 10
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
# ref: https://help.aliyun.com/document_detail/203290.html?spm=a2c4g.11186623.6.727.6f9b5db6bzJh4x
trainingService:
  platform: dlc
  type: Worker
  image: registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/pytorch-training:1.6.0-gpu-py37-cu101-ubuntu18.04
  jobType: PyTorchJob  # choices: [TFJob, PyTorchJob]
  podCount: 1
  ecsSpec: ecs.c6.large
  region: cn-hangzhou
  accessKeyId: ${your_ak_id}
  accessKeySecret: ${your_ak_key}
  nasDataSourceId: ${your_nas_data_source_id}  # NAS datasource ID, e.g., datat56by9n1xt0a
  localStorageMountPoint: /home/admin/workspace/  # default NAS path on DSW, MUST provide full path.
  containerStorageMountPoint: /root/data/  # default NAS path on DLC container, change it according to your setting
......@@ -9,4 +9,5 @@ from .aml import *
from .kubeflow import *
from .frameworkcontroller import *
from .adl import *
from .dlc import *
from .shared_storage import *
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

from dataclasses import dataclass

from .common import TrainingServiceConfig

__all__ = ['DlcConfig']

@dataclass(init=False)
class DlcConfig(TrainingServiceConfig):
    platform: str = 'dlc'
    type: str = 'Worker'
    image: str  # 'registry-vpc.{region}.aliyuncs.com/pai-dlc/tensorflow-training:1.15.0-cpu-py36-ubuntu18.04',
    job_type: str = 'TFJob'
    pod_count: int
    ecs_spec: str  # e.g., 'ecs.c6.large'
    region: str
    nas_data_source_id: str
    access_key_id: str
    access_key_secret: str
    local_storage_mount_point: str
    container_storage_mount_point: str

    _validation_rules = {
        'platform': lambda value: (value == 'dlc', 'cannot be modified')
    }
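Each entry in ``_validation_rules`` maps a field name to a callable that returns an ``(ok, message)`` pair. How the NNI base class consumes these rules is not shown in this diff, so the following is only a sketch of one plausible way to evaluate them; ``run_rules`` is a hypothetical helper, not an NNI function.

```python
def run_rules(config: dict, rules: dict) -> list:
    """Apply each rule to the matching field and collect messages for failed checks.

    Hypothetical evaluator for illustration; a rule returns (ok, message).
    """
    errors = []
    for field, rule in rules.items():
        ok, message = rule(config.get(field))
        if not ok:
            errors.append(f"{field} {message}")
    return errors


# Same rule shape as in DlcConfig._validation_rules.
rules = {'platform': lambda value: (value == 'dlc', 'cannot be modified')}

print(run_rules({'platform': 'dlc'}, rules))
print(run_rules({'platform': 'aml'}, rules))
```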
......@@ -68,6 +68,12 @@ def _inverse_cluster_metadata(platform: str, metadata_config: list) -> dict:
inverse_config['amlConfig'] = kv['value']
elif kv['key'] == 'trial_config':
inverse_config['trial'] = kv['value']
elif platform == 'dlc':
for kv in metadata_config:
if kv['key'] == 'dlc_config':
inverse_config['dlcConfig'] = kv['value']
elif kv['key'] == 'trial_config':
inverse_config['trial'] = kv['value']
elif platform == 'adl':
for kv in metadata_config:
if kv['key'] == 'adl_config':
......
......@@ -9,6 +9,7 @@ _builtin_training_services = [
'remote',
'openpai', 'pai',
'aml',
'dlc',
'kubeflow',
'frameworkcontroller',
'adl',
......
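A note on list entries like the ones in this hunk: Python joins adjacent string literals, so an entry without a trailing comma silently merges with the next one instead of raising an error. The short demonstration below uses a made-up list, not the real ``_builtin_training_services``.

```python
# Illustrative list only; the second entry has no trailing comma.
services = [
    'aml',
    'dlc'
    'kubeflow',
]

# 'dlc' and 'kubeflow' have been concatenated into a single entry.
print(services)
print('dlc' in services)
```

Because the result is still a valid list, a membership test such as ``'dlc' in services`` fails without any syntax error pointing at the cause.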
......@@ -73,6 +73,25 @@ export interface AmlConfig extends TrainingServiceConfig {
maxTrialNumberPerGpu: number;
}
/* Alibaba PAI DLC */
export interface DlcConfig extends TrainingServiceConfig {
platform: 'dlc';
type: string;
image: string;
jobType: string;
podCount: number;
ecsSpec: string;
region: string;
nasDataSourceId: string;
accessKeyId: string;
accessKeySecret: string;
localStorageMountPoint: string;
containerStorageMountPoint: string;
}
/* Kubeflow */
// FIXME: merge with shared storage config
export interface KubeflowStorageConfig {
storageType: string;
maxTrialNumberPerGpu?: number;
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

import os
import sys
import time
import json
from argparse import ArgumentParser

# ref: https://help.aliyun.com/document_detail/203290.html?spm=a2c4g.11186623.6.727.6f9b5db6bzJh4x
from alibabacloud_pai_dlc20201203.client import Client
from alibabacloud_tea_openapi.models import Config
from alibabacloud_pai_dlc20201203.models import CreateJobRequest, DataSourceItem, JobSpec

if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument('--type', help='the type of job spec')
    parser.add_argument('--image', help='the docker image of job')
    parser.add_argument('--job_type', choices=['TFJob', 'PyTorchJob'], help='the job type')
    parser.add_argument('--pod_count', type=int, default=1, help='pod count')
    parser.add_argument('--ecs_spec', help='ecs spec')
    parser.add_argument('--region', help='region')
    parser.add_argument('--nas_data_source_id', help='nas data_source_id of DLC dataset configuration')
    parser.add_argument('--access_key_id', help='access_key_id')
    parser.add_argument('--access_key_secret', help='access_key_secret')
    parser.add_argument('--experiment_name', help='the experiment name')
    parser.add_argument('--user_command', help='user command')
    args = parser.parse_args()

    # init client
    client = Client(
        Config(
            access_key_id=args.access_key_id,
            access_key_secret=args.access_key_secret,
            region_id=args.region,
            endpoint=f'pai-dlc.{args.region}.aliyuncs.com'
        )
    )

    # NAS data source shared between DSW and the DLC container
    nas_1 = DataSourceItem(
        data_source_type='nas',
        data_source_id=args.nas_data_source_id,
    )

    # job spec
    spec = JobSpec(
        type=args.type,
        image=args.image,
        pod_count=args.pod_count,
        ecs_spec=args.ecs_spec,
    )

    req = CreateJobRequest(
        display_name=args.experiment_name,
        job_type=args.job_type,
        job_specs=[spec],
        data_sources=[nas_1],
        user_command=args.user_command
    )

    # DLC submit
    response = client.create_job(req)
    job_id = response.body.job_id
    print('job id: ' + job_id)

    # answer commands sent by the NNI manager over stdin
    while True:
        line = sys.stdin.readline().rstrip()
        if line == 'update_status':
            print('status:' + client.get_job(job_id).body.status)
        elif line == 'tracking_url':
            # TODO: 1. get this url by api? 2. change this url in private dlc mode.
            print('tracking_url:' + f'https://pai-dlc.console.aliyun.com/#/jobs/detail?jobId={job_id}&regionId={args.region}')
        elif line == 'stop':
            client.stop_job(job_id)
            sys.exit(0)
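The script above talks to the NNI manager through a small line-based protocol: one command per line on stdin, one ``{head}:{content}`` reply per line on stdout. The sketch below reproduces that loop offline so it can be tried without the Alibaba SDK. ``FakeDlcClient`` and ``serve`` are hypothetical names introduced for illustration, and the fake status value is made up.

```python
import io

TRACKING_URL = 'https://pai-dlc.console.aliyun.com/#/jobs/detail?jobId={job_id}&regionId={region}'


class FakeDlcClient:
    """Hypothetical stand-in for the real SDK client, so the sketch runs offline."""

    def __init__(self):
        self.stopped = False

    def get_status(self, job_id):
        return 'Running'  # made-up status for the demo

    def stop_job(self, job_id):
        self.stopped = True


def serve(client, job_id, region, stdin, stdout):
    """Answer one-word commands read from stdin until 'stop' or end of input."""
    for raw in stdin:
        line = raw.rstrip()
        if line == 'update_status':
            print('status:' + client.get_status(job_id), file=stdout)
        elif line == 'tracking_url':
            print('tracking_url:' + TRACKING_URL.format(job_id=job_id, region=region), file=stdout)
        elif line == 'stop':
            client.stop_job(job_id)
            break


fake = FakeDlcClient()
out = io.StringIO()
commands = io.StringIO('update_status\ntracking_url\nstop\nupdate_status\n')
serve(fake, 'dlc123', 'cn-hangzhou', commands, out)
replies = out.getvalue().splitlines()
print(replies)
```

One design difference is deliberate: iterating over the input stream ends the loop when stdin is closed, whereas a ``while True`` loop around ``readline()`` keeps spinning on empty reads after end of input.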
......@@ -63,7 +63,7 @@ async function initContainer(foreground: boolean, platformMode: string, logFileN
function usage(): void {
console.info('usage: node main.js --port <port> --mode \
<local/remote/pai/kubeflow/frameworkcontroller/aml/adl/hybrid> --start_mode <new/resume> --experiment_id <id> --foreground <true/false>');
<local/remote/pai/kubeflow/frameworkcontroller/aml/adl/hybrid/dlc> --start_mode <new/resume> --experiment_id <id> --foreground <true/false>');
}
const strPort: string = parseArg(['--port', '-p']);
......
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.
'use strict';
import { Deferred } from 'ts-deferred';
import { PythonShell } from 'python-shell';
import { getLogger, Logger } from '../../../common/log';
export class DlcClient {
private log: Logger;
public type: string;
public image: string;
public jobType: string;
public podCount: number;
public ecsSpec: string;
public region: string;
// e.g., data1e6vg1tu0zi7, to generate it, go to 'Dataset Config' page of DLC
// create a NAS data and copy the 'DataSet ConfigurationID'
public nasDataSourceId: string;
public accessKeyId: string;
public accessKeySecret: string;
public experimentId: string;
public environmentId: string;
public userCommand: string;
public pythonShellClient: undefined | PythonShell;
constructor(
type: string,
image: string,
jobType: string,
podCount: number,
experimentId: string,
environmentId: string,
ecsSpec: string,
region: string,
nasDataSourceId: string,
accessKeyId: string,
accessKeySecret: string,
userCommand: string,
) {
this.log = getLogger('DlcClient');
this.type = type;
this.image = image;
this.jobType = jobType;
this.podCount = podCount;
this.ecsSpec = ecsSpec;
this.region = region;
this.nasDataSourceId = nasDataSourceId;
this.accessKeyId = accessKeyId;
this.accessKeySecret = accessKeySecret;
this.experimentId = experimentId;
this.environmentId = environmentId;
this.userCommand = userCommand;
}
public submit(): Promise<string> {
const deferred: Deferred<string> = new Deferred<string>();
this.pythonShellClient = new PythonShell('dlcUtil.py', {
scriptPath: './config/dlc',
pythonPath: 'python3',
pythonOptions: ['-u'], // get print results in real-time
args: [
'--type', this.type,
'--image', this.image,
'--job_type', this.jobType,
'--pod_count', String(this.podCount),
'--ecs_spec', this.ecsSpec,
'--region', this.region,
'--nas_data_source_id', this.nasDataSourceId,
'--access_key_id', this.accessKeyId,
'--access_key_secret', this.accessKeySecret,
'--experiment_name', `nni_exp_${this.experimentId}_env_${this.environmentId}`,
'--user_command', this.userCommand,
]
});
this.log.debug(this.pythonShellClient.command);
this.pythonShellClient.on('message', function (envId: any) {
// received a message sent from the Python script (a simple "print" statement)
deferred.resolve(envId);
});
this.monitorError(this.pythonShellClient, deferred);
return deferred.promise;
}
public stop(): void {
if (this.pythonShellClient === undefined) {
throw Error('python shell client not initialized!');
}
this.pythonShellClient.send('stop');
}
public getTrackingUrl(): Promise<string> {
const deferred: Deferred<string> = new Deferred<string>();
if (this.pythonShellClient === undefined) {
throw Error('python shell client not initialized!');
}
this.pythonShellClient.send('tracking_url');
this.pythonShellClient.on('message', (status: any) => {
const trackingUrl = this.parseContent('tracking_url', status);
if (trackingUrl !== '') {
deferred.resolve(trackingUrl);
}
});
this.monitorError(this.pythonShellClient, deferred);
return deferred.promise;
}
public updateStatus(oldStatus: string): Promise<string> {
const deferred: Deferred<string> = new Deferred<string>();
if (this.pythonShellClient === undefined) {
throw Error('python shell client not initialized!');
}
this.pythonShellClient.send('update_status');
this.pythonShellClient.on('message', (status: any) => {
let newStatus = this.parseContent('status', status);
if (newStatus === '') {
newStatus = oldStatus;
}
deferred.resolve(newStatus);
});
this.monitorError(this.pythonShellClient, deferred);
return deferred.promise;
}
public sendCommand(message: string): void {
if (this.pythonShellClient === undefined) {
throw Error('python shell client not initialized!');
}
this.log.debug(`command:${message}`);
this.pythonShellClient.send(`command:${message}`);
}
public receiveCommand(): Promise<string> {
const deferred: Deferred<string> = new Deferred<string>();
if (this.pythonShellClient === undefined) {
throw Error('python shell client not initialized!');
}
this.pythonShellClient.send('receive');
this.pythonShellClient.on('message', (command: any) => {
const message = this.parseContent('receive', command);
if (message !== '') {
deferred.resolve(JSON.parse(message))
}
});
this.monitorError(this.pythonShellClient, deferred);
return deferred.promise;
}
// Monitor error information in dlc python shell client
private monitorError(pythonShellClient: PythonShell, deferred: Deferred<any>): void {
pythonShellClient.on('error', function (error: any) {
deferred.reject(error);
});
pythonShellClient.on('close', function (error: any) {
deferred.reject(error);
});
}
// Parse command content, command format is {head}:{content}
public parseContent(head: string, command: string): string {
const items = command.split(':');
if (items[0] === head) {
return command.slice(head.length + 1);
}
return '';
}
}
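The ``{head}:{content}`` convention used by ``parseContent`` can be illustrated with a small Python equivalent. This is a sketch that mirrors the TypeScript logic above; the function name ``parse_content`` is only a transliteration and does not exist in NNI.

```python
def parse_content(head: str, command: str) -> str:
    """Return the content of a '{head}:{content}' message, or '' if the head differs."""
    items = command.split(':')
    if items[0] == head:
        # Slice instead of using items[1] so colons inside the content survive.
        return command[len(head) + 1:]
    return ''


print(parse_content('status', 'status:Running'))
print(parse_content('tracking_url', 'tracking_url:https://example.invalid/#/jobs?jobId=1'))
print(repr(parse_content('status', 'job id: dlc123')))
```

Slicing after the head rather than taking the second split item matters for replies such as tracking URLs, which contain further colons.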
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.
'use strict';
import { TrialConfig } from '../../common/trialConfig';
import { EnvironmentInformation } from '../environment';
import { DlcClient } from '../dlc/dlcClient';
export class DlcClusterConfig {
public readonly type: string;
public readonly image: string;
public readonly podCount: number;
public readonly ecsSpec: string;
constructor(type: string, image: string, podCount: number, ecsSpec: string) {
this.type = type;
this.image = image;
this.podCount = podCount;
this.ecsSpec = ecsSpec;
}
}
export class DlcTrialConfig extends TrialConfig {
public readonly image: string;
public readonly command: string;
public readonly codeDir: string;
constructor(codeDir: string, command: string, image: string) {
super("", codeDir, 0);
this.codeDir = codeDir;
this.command = command;
this.image = image;
}
}
export class DlcEnvironmentInformation extends EnvironmentInformation {
public dlcClient?: DlcClient;
public currentMessageIndex: number = -1;
}
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.
'use strict';
import * as fs from 'fs';
import * as path from 'path';
import * as component from '../../../common/component';
import { getLogger, Logger } from '../../../common/log';
import { ExperimentConfig, DlcConfig, flattenConfig } from '../../../common/experimentConfig';
import { ExperimentStartupInfo } from '../../../common/experimentStartupInfo';
import { DlcClient } from '../dlc/dlcClient';
import { DlcEnvironmentInformation } from '../dlc/dlcConfig';
import { EnvironmentInformation, EnvironmentService } from '../environment';
import { EventEmitter } from "events";
import { FileCommandChannel } from '../channels/fileCommandChannel';
import { MountedStorageService } from '../storages/mountedStorageService';
import { Scope } from 'typescript-ioc';
import { StorageService } from '../storageService';
interface FlattenDlcConfig extends ExperimentConfig, DlcConfig { }
/**
* Collect DLC job info from the DLC cluster and update DLC job status locally
*/
@component.Singleton
export class DlcEnvironmentService extends EnvironmentService {
private readonly log: Logger = getLogger('dlcEnvironmentService');
private experimentId: string;
private config: FlattenDlcConfig;
constructor(config: ExperimentConfig, info: ExperimentStartupInfo) {
super();
this.experimentId = info.experimentId;
this.config = flattenConfig(config, 'dlc');
component.Container.bind(StorageService).to(MountedStorageService).scope(Scope.Singleton);
const storageService = component.get<StorageService>(StorageService)
const remoteRoot = storageService.joinPath(this.config.localStorageMountPoint, 'nni-experiments', this.experimentId);
const localRoot = storageService.joinPath(this.config.localStorageMountPoint, 'nni-experiments');
storageService.initialize(localRoot, remoteRoot);
}
public get hasStorageService(): boolean {
return true;
}
public initCommandChannel(eventEmitter: EventEmitter): void {
this.commandChannel = new FileCommandChannel(eventEmitter);
}
public createEnvironmentInformation(envId: string, envName: string): EnvironmentInformation {
return new DlcEnvironmentInformation(envId, envName);
}
public get getName(): string {
return 'dlc';
}
public async refreshEnvironmentsStatus(environments: EnvironmentInformation[]): Promise<void> {
environments.forEach(async (environment) => {
const dlcClient = (environment as DlcEnvironmentInformation).dlcClient;
if (!dlcClient) {
return Promise.reject('DLC client not initialized!');
}
const newStatus = await dlcClient.updateStatus(environment.status);
switch (newStatus.toUpperCase()) {
case 'CREATING':
case 'CREATED':
case 'WAITING':
case 'QUEUED':
environment.setStatus('WAITING');
break;
case 'RUNNING':
environment.setStatus('RUNNING');
break;
case 'COMPLETED':
case 'SUCCEEDED':
environment.setStatus('SUCCEEDED');
break;
case 'FAILED':
environment.setStatus('FAILED');
return Promise.reject(`DLC: job ${environment.envId} is failed!`);
case 'STOPPED':
case 'STOPPING':
environment.setStatus('USER_CANCELED');
break;
default:
environment.setStatus('UNKNOWN');
}
});
}
public async startEnvironment(environment: EnvironmentInformation): Promise<void> {
const dlcEnvironment: DlcEnvironmentInformation = environment as DlcEnvironmentInformation;
const environmentRoot = path.join(this.config.containerStorageMountPoint, `/nni-experiments/${this.experimentId}`);
const localRoot = path.join(this.config.localStorageMountPoint, `/nni-experiments/${this.experimentId}`);
dlcEnvironment.workingFolder = `${localRoot}/envs/${environment.id}`;
dlcEnvironment.runnerWorkingFolder = `${environmentRoot}/envs/${environment.id}`;
// environment id dir and command dir, folder created on DLC side can't be accessed on DSW.
if (!fs.existsSync(`${dlcEnvironment.workingFolder}/commands`)) {
await fs.promises.mkdir(`${dlcEnvironment.workingFolder}/commands`, {recursive: true});
}
environment.command = `cd ${environmentRoot} && ${environment.command} 1>${environment.runnerWorkingFolder}/trialrunner_stdout 2>${environment.runnerWorkingFolder}/trialrunner_stderr`;
const dlcClient = new DlcClient(
this.config.type,
this.config.image,
this.config.jobType,
this.config.podCount,
this.experimentId,
environment.id,
this.config.ecsSpec,
this.config.region,
this.config.nasDataSourceId,
this.config.accessKeyId,
this.config.accessKeySecret,
environment.command,
);
dlcEnvironment.id = await dlcClient.submit();
this.log.debug('dlc: before getTrackingUrl');
dlcEnvironment.trackingUrl = await dlcClient.getTrackingUrl();
this.log.debug(`dlc trackingUrl: ${dlcEnvironment.trackingUrl}`);
dlcEnvironment.dlcClient = dlcClient;
}
public async stopEnvironment(environment: EnvironmentInformation): Promise<void> {
const dlcEnvironment: DlcEnvironmentInformation = environment as DlcEnvironmentInformation;
const dlcClient = dlcEnvironment.dlcClient;
if (!dlcClient) {
throw new Error('DLC client not initialized!');
}
dlcClient.stop();
}
}
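The ``switch`` statement in ``refreshEnvironmentsStatus`` is a many-to-one mapping from DLC job states to NNI environment states. As a compact restatement, the same table can be sketched as a Python dictionary; the names ``DLC_TO_NNI_STATUS`` and ``to_nni_status`` are hypothetical, and the sketch omits the rejection that the TypeScript code raises for failed jobs.

```python
# Status pairs copied from the switch statement in refreshEnvironmentsStatus.
DLC_TO_NNI_STATUS = {
    'CREATING': 'WAITING',
    'CREATED': 'WAITING',
    'WAITING': 'WAITING',
    'QUEUED': 'WAITING',
    'RUNNING': 'RUNNING',
    'COMPLETED': 'SUCCEEDED',
    'SUCCEEDED': 'SUCCEEDED',
    'FAILED': 'FAILED',
    'STOPPED': 'USER_CANCELED',
    'STOPPING': 'USER_CANCELED',
}


def to_nni_status(dlc_status: str) -> str:
    """Map a DLC job status (case-insensitive) to an NNI environment status."""
    return DLC_TO_NNI_STATUS.get(dlc_status.upper(), 'UNKNOWN')


print(to_nni_status('Queued'))
print(to_nni_status('Stopping'))
print(to_nni_status('Restarting'))
```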
......@@ -8,6 +8,7 @@ import { ExperimentConfig } from '../../../common/experimentConfig';
import { ExperimentStartupInfo } from '../../../common/experimentStartupInfo';
import { getCustomEnvironmentServiceConfig } from '../../../common/nniConfig';
import { importModule } from '../../../common/utils';
import { DlcEnvironmentService } from './dlcEnvironmentService';
export async function createEnvironmentService(name: string, config: ExperimentConfig): Promise<EnvironmentService> {
const info = ExperimentStartupInfo.getInstance();
......@@ -23,6 +24,8 @@ export async function createEnvironmentService(name: string, config: ExperimentC
return new OpenPaiEnvironmentService(config, info);
case 'kubeflow':
return new KubeflowEnvironmentService(config, info);
case 'dlc':
return new DlcEnvironmentService(config, info);
}
const esConfig = await getCustomEnvironmentServiceConfig(name);
......