Unverified Commit 134368fa authored by George Cheng, committed by GitHub

DLTS integration (#1945)



* skeleton of dlts training service (#1844)

* Hello, DLTS!

* Revert version

* Remove fs-extra

* Add some default cluster config

* schema

* fix

* Optional cluster (default to `.default`)

Depends on DLWorkspace#837

* fix

* fix

* optimize gpu type

* No more copy

* Format

* Code clean up

* Issue fix

* Add optional fields in config

* Issue fix

* Lint

* Lint

* Validate email, password and team

* Doc

* Doc fix

* Set TMPDIR

* Use metadata instead of gpu_capacity

* Cancel paused DLTS job

* workaround lint rules

* pylint

* doc
Co-authored-by: QuanluZhang <z.quanluzhang@gmail.com>
parent 03cea2b4
**Run an Experiment on DLTS**
===
NNI supports running an experiment on [DLTS](https://github.com/microsoft/DLWorkspace.git), called dlts mode. Before starting to use NNI dlts mode, you should have an account that can access a DLTS dashboard.
## Setup Environment
Step 1. Choose a cluster from the DLTS dashboard and ask your administrator for the cluster dashboard URL.
![Choose Cluster](../../img/dlts-step1.png)
Step 2. Prepare an NNI config YAML file like the following:
```yaml
# Set this field to "dlts"
trainingServicePlatform: dlts
authorName: your_name
experimentName: auto_mnist
trialConcurrency: 2
maxExecDuration: 3h
maxTrialNum: 100
searchSpacePath: search_space.json
useAnnotation: false
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 1
  image: msranni/nni
# Configuration to access DLTS
dltsConfig:
  dashboard: # Ask administrator for the cluster dashboard URL
```
Remember to fill the cluster dashboard URL into the last line.
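Before submitting, the two DLTS-specific settings above can be sanity-checked on a parsed config. A minimal sketch (the helper name is ours, not part of NNI):

```python
def check_dlts_config(config):
    """Verify the DLTS-specific settings in a parsed NNI config dict:
    the platform must be 'dlts' and dltsConfig.dashboard must be set."""
    if config.get("trainingServicePlatform") != "dlts":
        raise ValueError("trainingServicePlatform must be 'dlts'")
    dashboard = (config.get("dltsConfig") or {}).get("dashboard")
    if not dashboard:
        raise ValueError("dltsConfig.dashboard must be filled in")
    return dashboard
```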
Step 3. Open your working directory on the cluster and paste the NNI config as well as the related code into a directory.
![Copy Config](../../img/dlts-step3.png)
Step 4. Submit an NNI manager job to the specified cluster.
![Submit Job](../../img/dlts-step4.png)
Step 5. Go to the Endpoints tab of the newly created job and click the Port 40000 link to view the trials' information.
![View NNI WebUI](../../img/dlts-step5.png)
...@@ -9,3 +9,4 @@ Introduction to NNI Training Services
OpenPAI Yarn Mode<./TrainingService/PaiYarnMode>
Kubeflow<./TrainingService/KubeflowMode>
FrameworkController<./TrainingService/FrameworkControllerMode>
DLTS<./TrainingService/DLTSMode>
debug: true
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, dlts
trainingServicePlatform: dlts
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 1
  #The docker image to run nni job on dlts
  image: msranni/nni:latest
dltsConfig:
  dashboard: http://azure-eastus-p40-dev1-infra01.eastus.cloudapp.azure.com/
  # The following fields are all optional and could be retrieved from environment
  # variables if running in a DLTS job container.
  # cluster: .default
  # team: platform
  # email: example@microsoft.com
  # password: # Paste from DLTS dashboard
...@@ -26,6 +26,7 @@ import { PAIYarnTrainingService } from './training_service/pai/paiYarn/paiYarnTr
import {
    RemoteMachineTrainingService
} from './training_service/remote_machine/remoteMachineTrainingService';
import { DLTSTrainingService } from './training_service/dlts/dltsTrainingService';
function initStartupInfo(
    startExpMode: string, resumeExperimentId: string, basePort: number,
...@@ -60,6 +61,10 @@ async function initContainer(foreground: boolean, platformMode: string, logFileN
        Container.bind(TrainingService)
            .to(FrameworkControllerTrainingService)
            .scope(Scope.Singleton);
    } else if (platformMode === 'dlts') {
        Container.bind(TrainingService)
            .to(DLTSTrainingService)
            .scope(Scope.Singleton);
    } else {
        throw new Error(`Error: unsupported mode: ${platformMode}`);
    }
...@@ -108,7 +113,7 @@ const foreground: boolean = foregroundArg.toLowerCase() === 'true' ? true : fals
const port: number = parseInt(strPort, 10);
const mode: string = parseArg(['--mode', '-m']);
if (!['local', 'remote', 'pai', 'kubeflow', 'frameworkcontroller', 'paiYarn', 'dlts'].includes(mode)) {
    console.log(`FATAL: unknown mode: ${mode}`);
    usage();
    process.exit(1);
...
...@@ -140,6 +140,15 @@ export namespace ValidationSchemas {
        }),
        uploadRetryCount: joi.number().min(1)
    }),
    dlts_config: joi.object({ // eslint-disable-line @typescript-eslint/camelcase
        dashboard: joi.string().min(1),
        cluster: joi.string().min(1),
        team: joi.string().min(1),
        email: joi.string().min(1),
        password: joi.string().min(1)
    }),
    nni_manager_ip: joi.object({ // eslint-disable-line @typescript-eslint/camelcase
        nniManagerIp: joi.string().min(1)
    })
...
...@@ -18,6 +18,7 @@ export enum TrialConfigMetadataKey {
    KUBEFLOW_CLUSTER_CONFIG = 'kubeflow_config',
    NNI_MANAGER_IP = 'nni_manager_ip',
    FRAMEWORKCONTROLLER_CLUSTER_CONFIG = 'frameworkcontroller_config',
    DLTS_CLUSTER_CONFIG = 'dlts_config',
    VERSION_CHECK = 'version_check',
    LOG_COLLECTION = 'log_collection'
}
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.
export interface DLTSClusterConfig {
    dashboard: string;
    cluster: string;
    team: string;
    email: string;
    password: string;
    gpuType?: string;
}
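The example config notes that the optional fields (`cluster`, `team`, `email`, `password`) can be retrieved from environment variables when running inside a DLTS job container. A sketch of that fallback pattern; the environment variable names here are illustrative assumptions, not the ones the training service actually reads:

```python
import os

# Hypothetical env-var names for the optional DLTS cluster config fields;
# the real variables set inside a DLTS job container may differ.
ENV_FALLBACKS = {
    "cluster": "DLTS_CLUSTER",
    "team": "DLTS_TEAM",
    "email": "DLTS_EMAIL",
    "password": "DLTS_PASSWORD",
}

def resolve_cluster_config(config, environ=None):
    """Fill missing optional fields from the environment.

    'dashboard' stays required; 'cluster' defaults to '.default' as the
    commit message describes."""
    environ = os.environ if environ is None else environ
    resolved = dict(config)
    for field, var in ENV_FALLBACKS.items():
        if not resolved.get(field) and var in environ:
            resolved[field] = environ[var]
    resolved.setdefault("cluster", ".default")
    return resolved
```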
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.
export const DLTS_TRIAL_COMMAND_FORMAT: string =
`export NNI_PLATFORM=dlts NNI_SYS_DIR={0} NNI_OUTPUT_DIR={1} NNI_TRIAL_JOB_ID={2} NNI_EXP_ID={3} NNI_TRIAL_SEQ_ID={4} MULTI_PHASE={5} \
&& cd $NNI_SYS_DIR && sh install_nni.sh \
&& cd '{6}' && python3 -m nni_trial_tool.trial_keeper --trial_command '{7}' \
--nnimanager_ip '{8}' --nnimanager_port '{9}' --nni_manager_version '{10}' --log_collection '{11}'`;
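The `{0}`..`{11}` slots above are filled by the TypeScript training service at submission time. Python's positional `str.format` mirrors that substitution, so the assembled trial command can be sketched like this (all argument values below are illustrative):

```python
# Same template as DLTS_TRIAL_COMMAND_FORMAT, with {0}..{11} positional slots.
DLTS_TRIAL_COMMAND_FORMAT = (
    "export NNI_PLATFORM=dlts NNI_SYS_DIR={0} NNI_OUTPUT_DIR={1} "
    "NNI_TRIAL_JOB_ID={2} NNI_EXP_ID={3} NNI_TRIAL_SEQ_ID={4} MULTI_PHASE={5} "
    "&& cd $NNI_SYS_DIR && sh install_nni.sh "
    "&& cd '{6}' && python3 -m nni_trial_tool.trial_keeper --trial_command '{7}' "
    "--nnimanager_ip '{8}' --nnimanager_port '{9}' "
    "--nni_manager_version '{10}' --log_collection '{11}'"
)

# Illustrative values only; the real ones come from the experiment context.
command = DLTS_TRIAL_COMMAND_FORMAT.format(
    "/nni/trials/abc", "/nni/trials/abc/output", "abc", "exp1", "0", "false",
    "/nni/trials/abc", "python3 mnist.py", "10.0.0.1", "8081", "v1.4", "none",
)
```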
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.
import { DLTSClusterConfig } from "./dltsClusterConfig";
export class DLTSJobConfig {
    public readonly team: string;
    public readonly userName: string;
    public readonly vcName: string;
    public readonly gpuType: string;
    public readonly jobType = "training";
    public readonly jobtrainingtype = "RegularJob";
    public readonly ssh = false;
    public readonly ipython = false;
    public readonly tensorboard = false;
    public readonly workPath = '';
    public readonly enableworkpath = true;
    public readonly dataPath = '';
    public readonly enabledatapath = false;
    public readonly jobPath = '';
    public readonly enablejobpath = true;
    public readonly mountpoints = [];
    public readonly env = [{ name: 'TMPDIR', value: '$HOME/tmp' }];
    public readonly hostNetwork = false;
    public readonly useGPUTopology = false;
    public readonly isPrivileged = false;
    public readonly hostIPC = false;
    public readonly preemptionAllowed = "False";

    public constructor(
        clusterConfig: DLTSClusterConfig,
        public readonly jobName: string,
        public readonly resourcegpu: number,
        public readonly image: string,
        public readonly cmd: string,
        public readonly interactivePorts: number[],
    ) {
        if (clusterConfig.gpuType === undefined) {
            throw Error('GPU type not fetched');
        }
        this.vcName = this.team = clusterConfig.team;
        this.gpuType = clusterConfig.gpuType;
        this.userName = clusterConfig.email;
    }
}
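Serialized, a `DLTSJobConfig` instance becomes the JSON payload submitted to the DLTS job API. A Python sketch of the same construction logic (field names mirror the class above; the function itself is ours):

```python
def build_dlts_job_payload(cluster_config, job_name, gpu_count, image, cmd):
    """Sketch of the payload DLTSJobConfig serializes to. Mirrors the
    constructor's invariants: gpuType must already be fetched, and both
    'team' and 'vcName' come from the cluster config's team."""
    if cluster_config.get("gpuType") is None:
        raise ValueError("GPU type not fetched")
    return {
        "jobName": job_name,
        "resourcegpu": gpu_count,
        "image": image,
        "cmd": cmd,
        "jobType": "training",
        "jobtrainingtype": "RegularJob",
        "team": cluster_config["team"],
        "vcName": cluster_config["team"],
        "userName": cluster_config["email"],
        "gpuType": cluster_config["gpuType"],
        # TMPDIR is redirected to the user's home, as set in the commit.
        "env": [{"name": "TMPDIR", "value": "$HOME/tmp"}],
        "preemptionAllowed": "False",
    }
```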
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.
'use strict';
import { Request, Response, Router } from 'express';
import { Inject } from 'typescript-ioc';
import * as component from '../../common/component';
import { ClusterJobRestServer } from '../common/clusterJobRestServer';
import { DLTSTrainingService } from './dltsTrainingService';
export interface ParameterFileMeta {
    readonly experimentId: string;
    readonly trialId: string;
    readonly filePath: string;
}

/**
 * DLTS training service REST server; provides a REST API to support DLTS job metrics updates.
 */
@component.Singleton
export class DLTSJobRestServer extends ClusterJobRestServer {
    private parameterFileMetaList: ParameterFileMeta[] = [];

    @Inject
    private readonly dltsTrainingService: DLTSTrainingService;

    /**
     * Constructor to provide NNIRestServer's own REST properties, e.g. port.
     */
    constructor() {
        super();
        this.dltsTrainingService = component.get(DLTSTrainingService);
    }

    // tslint:disable-next-line:no-any
    protected handleTrialMetrics(jobId: string, metrics: any[]): void {
        // Split the metrics array into single metrics, then emit each one.
        // Warning: if the metrics are not split, the behavior is undefined.
        for (const singleMetric of metrics) {
            this.dltsTrainingService.MetricsEmitter.emit('metric', {
                id: jobId,
                data: singleMetric
            });
        }
    }

    protected createRestHandler(): Router {
        const router: Router = super.createRestHandler();
        router.post(`/parameter-file-meta`, (req: Request, res: Response) => {
            try {
                this.log.info(`POST /parameter-file-meta, body is ${JSON.stringify(req.body)}`);
                this.parameterFileMetaList.push(req.body);
                res.send();
            } catch (err) {
                this.log.error(`POST parameter-file-meta error: ${err}`);
                res.status(500);
                res.send(err.message);
            }
        });
        router.get(`/parameter-file-meta`, (req: Request, res: Response) => {
            try {
                this.log.info(`GET /parameter-file-meta`);
                res.send(this.parameterFileMetaList);
            } catch (err) {
                this.log.error(`GET parameter-file-meta error: ${err}`);
                res.status(500);
                res.send(err.message);
            }
        });
        return router;
    }
}
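A client of the `/parameter-file-meta` route posts a JSON body matching the `ParameterFileMeta` interface. A small Python sketch of building that request (the base URL is hypothetical; the real one depends on where the cluster REST server listens):

```python
import json

def parameter_file_meta_request(base_url, experiment_id, trial_id, file_path):
    """Build the URL and JSON body for POST /parameter-file-meta.

    The body fields mirror the ParameterFileMeta interface; sending the
    request (e.g. via urllib or requests) is left to the caller."""
    body = {
        "experimentId": experiment_id,
        "trialId": trial_id,
        "filePath": file_path,
    }
    return f"{base_url}/parameter-file-meta", json.dumps(body)
```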
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.
import { TrialConfig } from "training_service/common/trialConfig";
export class DLTSTrialConfig extends TrialConfig {
    public constructor(
        command: string,
        codeDir: string,
        gpuNum: number,
        public readonly image: string
    ) {
        super(command, codeDir, gpuNum);
    }
}
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.
import {
TrialJobDetail,
TrialJobStatus,
TrialJobApplicationForm
} from "../../common/trainingService";
export class DLTSTrialJobDetail implements TrialJobDetail {
    public startTime?: number;
    public endTime?: number;
    public tags?: string[];
    public url?: string;
    public isEarlyStopped?: boolean;
    // DLTS-specific fields
    public dltsJobId?: string;
    public dltsPaused: boolean = false;

    public constructor (
        public id: string,
        public status: TrialJobStatus,
        public submitTime: number,
        public workingDirectory: string,
        public form: TrialJobApplicationForm,
        // DLTS-specific field
        public dltsJobName: string,
    ) {}
}
...@@ -9,7 +9,7 @@ if trial_env_vars.NNI_PLATFORM is None:
    from .standalone import *
elif trial_env_vars.NNI_PLATFORM == 'unittest':
    from .test import *
elif trial_env_vars.NNI_PLATFORM in ('local', 'remote', 'pai', 'kubeflow', 'frameworkcontroller', 'paiYarn', 'dlts'):
    from .local import *
else:
    raise RuntimeError('Unknown platform %s' % trial_env_vars.NNI_PLATFORM)
...@@ -32,7 +32,8 @@ common_schema = {
    'trialConcurrency': setNumberRange('trialConcurrency', int, 1, 99999),
    Optional('maxExecDuration'): And(Regex(r'^[1-9][0-9]*[s|m|h|d]$', error='ERROR: maxExecDuration format is [digit]{s,m,h,d}')),
    Optional('maxTrialNum'): setNumberRange('maxTrialNum', int, 1, 99999),
    'trainingServicePlatform': setChoice(
        'trainingServicePlatform', 'remote', 'local', 'pai', 'kubeflow', 'frameworkcontroller', 'paiYarn', 'dlts'),
    Optional('searchSpacePath'): And(os.path.exists, error=SCHEMA_PATH_ERROR % 'searchSpacePath'),
    Optional('multiPhase'): setType('multiPhase', bool),
    Optional('multiThread'): setType('multiThread', bool),
...@@ -297,6 +298,27 @@ pai_config_schema = {
    })
}
dlts_trial_schema = {
    'trial': {
        'command': setType('command', str),
        'codeDir': setPathCheck('codeDir'),
        'gpuNum': setNumberRange('gpuNum', int, 0, 99999),
        'image': setType('image', str),
    }
}

dlts_config_schema = {
    'dltsConfig': {
        'dashboard': setType('dashboard', str),
        Optional('cluster'): setType('cluster', str),
        Optional('team'): setType('team', str),
        Optional('email'): setType('email', str),
        Optional('password'): setType('password', str),
    }
}
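`DLTS_CONFIG_SCHEMA` below is assembled by merging the shared schema with the DLTS-specific fragments via `**` dict unpacking. A toy illustration of that composition pattern, with plain dicts standing in for the real `schema` objects:

```python
# Toy stand-ins for the real schema fragments; the point is the dict-merge
# composition used to build the per-platform schemas.
common_schema = {"authorName": str, "trialConcurrency": int}
dlts_trial_schema = {
    "trial": {"command": str, "codeDir": str, "gpuNum": int, "image": str}
}
dlts_config_schema = {"dltsConfig": {"dashboard": str}}

# Later-listed dicts win on key conflicts, so platform-specific fragments
# can extend or override the shared base.
DLTS_CONFIG_SCHEMA = {**common_schema, **dlts_trial_schema, **dlts_config_schema}
```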
kubeflow_trial_schema = {
    'trial': {
        'codeDir': setPathCheck('codeDir'),
...@@ -438,6 +460,8 @@ PAI_CONFIG_SCHEMA = Schema({**common_schema, **pai_trial_schema, **pai_config_sc
PAI_YARN_CONFIG_SCHEMA = Schema({**common_schema, **pai_yarn_trial_schema, **pai_yarn_config_schema})
DLTS_CONFIG_SCHEMA = Schema({**common_schema, **dlts_trial_schema, **dlts_config_schema})
KUBEFLOW_CONFIG_SCHEMA = Schema({**common_schema, **kubeflow_trial_schema, **kubeflow_config_schema})
FRAMEWORKCONTROLLER_CONFIG_SCHEMA = Schema({**common_schema, **frameworkcontroller_trial_schema, **frameworkcontroller_config_schema})
...@@ -289,6 +289,25 @@ def set_frameworkcontroller_config(experiment_config, port, config_file_name):
    #set trial_config
    return set_trial_config(experiment_config, port, config_file_name), err_message
def set_dlts_config(experiment_config, port, config_file_name):
    '''set dlts configuration'''
    dlts_config_data = dict()
    dlts_config_data['dlts_config'] = experiment_config['dltsConfig']
    response = rest_put(cluster_metadata_url(port), json.dumps(dlts_config_data), REST_TIME_OUT)
    err_message = None
    if not response or not response.status_code == 200:
        if response is not None:
            err_message = response.text
            _, stderr_full_path = get_log_path(config_file_name)
            with open(stderr_full_path, 'a+') as fout:
                fout.write(json.dumps(json.loads(err_message), indent=4, sort_keys=True, separators=(',', ':')))
        return False, err_message
    result, message = setNNIManagerIp(experiment_config, port, config_file_name)
    if not result:
        return result, message
    #set trial_config
    return set_trial_config(experiment_config, port, config_file_name), err_message
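The body that `set_dlts_config` PUTs to the cluster-metadata endpoint is simply the experiment's `dltsConfig` keyed under `dlts_config` (matching `TrialConfigMetadataKey.DLTS_CLUSTER_CONFIG`). A standalone sketch of that payload construction:

```python
import json

def dlts_metadata_payload(experiment_config):
    """Build the JSON body set_dlts_config sends to the NNI manager's
    cluster-metadata endpoint: the dltsConfig section keyed as
    'dlts_config', the name the manager's validation schema expects."""
    return json.dumps({"dlts_config": experiment_config["dltsConfig"]})
```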
def set_experiment(experiment_config, mode, port, config_file_name): def set_experiment(experiment_config, mode, port, config_file_name):
'''Call startExperiment (rest POST /experiment) with yaml file content''' '''Call startExperiment (rest POST /experiment) with yaml file content'''
request_data = dict() request_data = dict()
...@@ -389,6 +408,8 @@ def set_platform_config(platform, experiment_config, port, config_file_name, res ...@@ -389,6 +408,8 @@ def set_platform_config(platform, experiment_config, port, config_file_name, res
config_result, err_msg = set_kubeflow_config(experiment_config, port, config_file_name) config_result, err_msg = set_kubeflow_config(experiment_config, port, config_file_name)
elif platform == 'frameworkcontroller': elif platform == 'frameworkcontroller':
config_result, err_msg = set_frameworkcontroller_config(experiment_config, port, config_file_name) config_result, err_msg = set_frameworkcontroller_config(experiment_config, port, config_file_name)
elif platform == 'dlts':
config_result, err_msg = set_dlts_config(experiment_config, port, config_file_name)
else: else:
raise Exception(ERROR_INFO % 'Unsupported platform!') raise Exception(ERROR_INFO % 'Unsupported platform!')
exit(1) exit(1)
......