Unverified Commit 6ff24a5e authored by SparkSnail's avatar SparkSnail Committed by GitHub

Merge pull request #143 from Microsoft/master

merge master
parents 5e777d2f c1e6098d
...@@ -82,10 +82,10 @@ NNI (Neural Network Intelligence) is a toolkit for automated machine learning (AutoML)
## **Usage Scenarios**
* Try out different AutoML algorithms to train models on your local machine.
* Speed up AutoML in distributed environments (e.g., remote GPU workstations and cloud servers).
* Customize AutoML algorithms, or compare different AutoML algorithms.
* Support AutoML in your own machine learning platform.
## Related Projects
...@@ -93,7 +93,7 @@ NNI (Neural Network Intelligence) is a toolkit for automated machine learning (AutoML)
* [OpenPAI](https://github.com/Microsoft/pai): an open-source platform that provides complete AI model training and resource management capabilities; it is easy to extend and supports on-premise, cloud, and hybrid environments at any scale.
* [FrameworkController](https://github.com/Microsoft/frameworkcontroller): an open-source general-purpose Kubernetes Pod controller that orchestrates all kinds of applications on Kubernetes through a single controller.
* [MMdnn](https://github.com/Microsoft/MMdnn): a comprehensive, cross-framework solution for converting, visualizing, and diagnosing deep neural network models. The "MM" in MMdnn stands for model management, and "dnn" is an abbreviation of deep neural network. We encourage researchers and students to leverage these projects to accelerate AI development and research.
## **Installation and Verification**
...
...@@ -149,6 +149,11 @@ machineList:
Note: The maxExecDuration spec sets the duration of an experiment, not of a trial job. If the experiment reaches the maximum duration, it will not stop, but it can no longer submit new trial jobs.
* __debug__
* Description
NNI checks that the version of the nniManager process matches the version of the trialKeeper on the remote, pai, and kubernetes platforms. If you want to disable this version check, you can set debug to true.
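The check described above can be sketched in Python; this is an illustration only, under the assumption that two components are treated as compatible when their major.minor version parts match (NNI's actual comparison logic may differ):

```python
# Illustrative sketch only: compare the nniManager version against the
# trialKeeper version, ignoring the patch component. The rule used here
# (match on major.minor) is an assumption, not NNI's documented behavior.
def versions_compatible(manager_version: str, keeper_version: str) -> bool:
    major_minor = lambda v: ".".join(v.split(".")[:2])
    return major_minor(manager_version) == major_minor(keeper_version)
```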
* __maxTrialNum__
* Description
...
...@@ -35,7 +35,7 @@ Note:
If you start a docker container using NNI's official image `msranni/nni`, you can start NNI experiments directly with the `nnictl` command. The official image contains NNI's runtime environment as well as basic Python and deep-learning framework environments.
If you use your own docker image, you may need to install the NNI package first; please refer to [Installation](Installation.md).
If you want to run NNI's official examples, you may need to clone the NNI repo from GitHub using
```
git clone https://github.com/Microsoft/nni.git
```
then you can enter `nni/examples/trials` to start an experiment.
After you prepare NNI's environment, you can start a new experiment using the `nnictl` command; please refer to [QuickStart](QuickStart.md)
## Using docker in remote platform
NNI supports starting experiments via the [remoteTrainingService](RemoteMachineMode.md) and running trial jobs on remote machines. Since docker can start an independent Ubuntu system as an SSH server, a docker container can be used as the remote machine in NNI's remote mode.
### Step 1: Setting docker environment
...@@ -78,7 +78,7 @@ If you use your own docker image as remote server, please make sure that this im
### Step 3: Run NNI experiments
You can set your config file to the remote platform and set the `machineList` configuration to connect to your docker SSH server; please refer to [RemoteMachineMode](RemoteMachineMode.md). Note that you should set the correct `port`, `username`, and `passwd` or `sshKeyPath` of your host machine.
`port:` The host machine's port, mapped to docker's SSH port.
...@@ -88,4 +88,4 @@ You could set your config file as remote platform, and setting the `machineList`
`sshKeyPath:` The path of the private key of the docker container.
After configuring the config file, you can start an experiment; please refer to [QuickStart](QuickStart.md)
\ No newline at end of file
...@@ -45,6 +45,12 @@ nnictl support commands:
|------|------|------|------|
|--config, -c| True| |YAML configuration file of the experiment|
|--port, -p|False| |the port of the RESTful server|
|--debug, -d|False||set debug mode|
Note:
```
Debug mode will disable the version check function in TrialKeeper.
```
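A hedged sketch of how such a flag could be wired up with argparse; the option names mirror the table above, the mapping from debug mode to disabling the version check follows the note, and everything else is hypothetical (this is not nnictl's actual implementation):

```python
import argparse

# Hypothetical wiring, not nnictl's actual implementation.
parser = argparse.ArgumentParser(prog="nnictl-create-sketch")
parser.add_argument("--config", "-c", required=True, help="YAML configuration file")
parser.add_argument("--port", "-p", type=int, help="port of the RESTful server")
parser.add_argument("--debug", "-d", action="store_true", help="set debug mode")

args = parser.parse_args(["--config", "config.yml", "--debug"])
# Per the note above: debug mode disables the version check in TrialKeeper.
version_check = not args.debug
```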
<a name="resume"></a>
* __nnictl resume__
...@@ -65,6 +71,7 @@ nnictl support commands:
|------|------|------ |------|
|id| False| |The id of the experiment you want to resume|
|--port, -p| False| |Rest port of the experiment you want to resume|
|--debug, -d|False||set debug mode|
<a name="stop"></a>
* __nnictl stop__
...
...@@ -92,7 +92,7 @@ with tf.Session() as sess:
batch_size = 128
for i in range(10000):
batch = mnist.train.next_batch(batch_size)
+ """@nni.variable(nni.choice(1, 5), name=dropout_rate)""" + """@nni.variable(nni.choice(0.1, 0.5), name=dropout_rate)"""
dropout_rate = 0.5
mnist_network.train_step.run(feed_dict={mnist_network.images: batch[0],
mnist_network.labels: batch[1],
...
...@@ -36,6 +36,7 @@ interface ExperimentParams {
trainingServicePlatform: string;
multiPhase?: boolean;
multiThread?: boolean;
versionCheck?: boolean;
tuner?: {
className: string;
builtinTunerName?: string;
...
...@@ -345,6 +345,19 @@ function countFilesRecursively(directory: string, timeoutMilliSeconds?: number):
});
}
/**
* get the version of current package
*/
async function getVersion(): Promise<string> {
const deferred : Deferred<string> = new Deferred<string>();
import(path.join(__dirname, '..', 'package.json')).then((pkg)=>{
deferred.resolve(pkg.version);
}).catch((error)=>{
deferred.reject(error);
});
return deferred.promise;
}
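A Python analogue of what the new getVersion helper does (reading the version field from a package.json-style file; the path handling below is illustrative):

```python
import json

# Sketch: load the "version" field from a package.json-like file, as the
# TypeScript getVersion() above does via a dynamic import of package.json.
def get_version(package_json_path: str) -> str:
    with open(package_json_path, encoding="utf-8") as f:
        return json.load(f)["version"]
```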
export {countFilesRecursively, getRemoteTmpDir, generateParamFileName, getMsgDispatcherCommand, getCheckpointDir,
getLogDir, getExperimentRootDir, getJobCancelStatus, getDefaultDatabaseDir, getIPV4Address,
mkDirP, delay, prepareUnitTest, parseArg, cleanupUnitTest, uniqueString, randomSelect, getVersion };
...@@ -127,7 +127,11 @@ class NNIManager implements Manager {
if (expParams.multiPhase && this.trainingService.isMultiPhaseJobSupported) {
this.trainingService.setClusterMetadata('multiPhase', expParams.multiPhase.toString());
}
// Set up versionCheck config
if (expParams.versionCheck !== undefined) {
this.trainingService.setClusterMetadata('version_check', expParams.versionCheck.toString());
}
const dispatcherCommand: string = getMsgDispatcherCommand(expParams.tuner, expParams.assessor, expParams.advisor,
expParams.multiPhase, expParams.multiThread);
this.log.debug(`dispatcher command: ${dispatcherCommand}`);
...@@ -162,6 +166,11 @@ class NNIManager implements Manager {
this.trainingService.setClusterMetadata('multiPhase', expParams.multiPhase.toString());
}
// Set up versionCheck config
if (expParams.versionCheck !== undefined) {
this.trainingService.setClusterMetadata('version_check', expParams.versionCheck.toString());
}
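Since setClusterMetadata transports every value as a string, the boolean flag is serialized with toString() and parsed back by the training services with `value === 'true' || value === 'True'`. A rough Python sketch of that round trip:

```python
# Sketch of the string round trip used for the versionCheck flag.
def serialize_bool(flag: bool) -> str:
    # JavaScript's Boolean.toString() yields "true" / "false".
    return "true" if flag else "false"

def parse_bool_metadata(value: str) -> bool:
    # Mirrors the TypeScript check: value === 'true' || value === 'True'.
    return value in ("true", "True")
```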
const dispatcherCommand: string = getMsgDispatcherCommand(expParams.tuner, expParams.assessor, expParams.advisor,
expParams.multiPhase, expParams.multiThread);
this.log.debug(`dispatcher command: ${dispatcherCommand}`);
...
...@@ -30,6 +30,7 @@ import { getLogger, Logger } from '../common/log';
import { ExperimentProfile, Manager, TrialJobStatistics} from '../common/manager';
import { ValidationSchemas } from './restValidationSchemas';
import { NNIRestServer } from './nniRestServer';
import { getVersion } from '../common/utils';
const expressJoi = require('express-joi-validator');
...@@ -104,8 +105,8 @@ class NNIRestHandler {
private version(router: Router): void {
router.get('/version', async (req: Request, res: Response) => {
const version = await getVersion();
res.send(version);
});
}
...
...@@ -139,6 +139,7 @@ export namespace ValidationSchemas {
maxExecDuration: joi.number().min(0).required(),
multiPhase: joi.boolean(),
multiThread: joi.boolean(),
versionCheck: joi.boolean(),
advisor: joi.object({
builtinAdvisorName: joi.string().valid('Hyperband'),
codeDir: joi.string(),
...
...@@ -31,5 +31,6 @@ export enum TrialConfigMetadataKey {
PAI_CLUSTER_CONFIG = 'pai_config',
KUBEFLOW_CLUSTER_CONFIG = 'kubeflow_config',
NNI_MANAGER_IP = 'nni_manager_ip',
FRAMEWORKCONTROLLER_CLUSTER_CONFIG = 'frameworkcontroller_config',
VERSION_CHECK = 'version_check'
}
...@@ -191,7 +191,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
await cpp.exec(`mkdir -p ${trialLocalTempFolder}`);
for(let taskRole of this.fcTrialConfig.taskRoles) {
const runScriptContent: string = await this.generateRunScript('frameworkcontroller', trialJobId, trialWorkingFolder,
this.generateCommandScript(taskRole.command), curTrialSequenceId.toString(), taskRole.name, taskRole.gpuNum);
await fs.promises.writeFile(path.join(trialLocalTempFolder, `run_${taskRole.name}.sh`), runScriptContent, { encoding: 'utf8' });
}
...@@ -267,6 +267,9 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
return Promise.reject(new Error(error));
}
break;
case TrialConfigMetadataKey.VERSION_CHECK:
this.versionCheck = (value === 'true' || value === 'True');
break;
default:
break;
}
...
...@@ -188,7 +188,7 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber
// Write worker file content run_worker.sh to local tmp folders
if(kubeflowTrialConfig.worker) {
const workerRunScriptContent: string = await this.generateRunScript('kubeflow', trialJobId, trialWorkingFolder,
kubeflowTrialConfig.worker.command, curTrialSequenceId.toString(), 'worker', kubeflowTrialConfig.worker.gpuNum);
await fs.promises.writeFile(path.join(trialLocalTempFolder, 'run_worker.sh'), workerRunScriptContent, { encoding: 'utf8' });
...@@ -197,7 +197,7 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber
if(this.kubeflowClusterConfig.operator === 'tf-operator') {
let tensorflowTrialConfig: KubeflowTrialConfigTensorflow = <KubeflowTrialConfigTensorflow>this.kubeflowTrialConfig;
if(tensorflowTrialConfig.ps){
const psRunScriptContent: string = await this.generateRunScript('kubeflow', trialJobId, trialWorkingFolder,
tensorflowTrialConfig.ps.command, curTrialSequenceId.toString(), 'ps', tensorflowTrialConfig.ps.gpuNum);
await fs.promises.writeFile(path.join(trialLocalTempFolder, 'run_ps.sh'), psRunScriptContent, { encoding: 'utf8' });
}
...@@ -205,7 +205,7 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber
else if(this.kubeflowClusterConfig.operator === 'pytorch-operator') {
let pytorchTrialConfig: KubeflowTrialConfigPytorch = <KubeflowTrialConfigPytorch>this.kubeflowTrialConfig;
if(pytorchTrialConfig.master){
const masterRunScriptContent: string = await this.generateRunScript('kubeflow', trialJobId, trialWorkingFolder,
pytorchTrialConfig.master.command, curTrialSequenceId.toString(), 'master', pytorchTrialConfig.master.gpuNum);
await fs.promises.writeFile(path.join(trialLocalTempFolder, 'run_master.sh'), masterRunScriptContent, { encoding: 'utf8' });
}
...@@ -317,6 +317,9 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber
return Promise.reject(new Error(error));
}
break;
case TrialConfigMetadataKey.VERSION_CHECK:
this.versionCheck = (value === 'true' || value === 'True');
break;
default:
break;
}
...
...@@ -71,5 +71,5 @@ mkdir -p $NNI_OUTPUT_DIR
cp -rT $NNI_CODE_DIR $NNI_SYS_DIR
cd $NNI_SYS_DIR
sh install_nni.sh
python3 -m nni_trial_tool.trial_keeper --trial_command '{8}' --nnimanager_ip {9} --nnimanager_port {10} --version '{11}'`
+ `1>$NNI_OUTPUT_DIR/trialkeeper_stdout 2>$NNI_OUTPUT_DIR/trialkeeper_stderr`
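The script template above fills numbered placeholders ({8} trial command, {9} manager IP, {10} manager port, {11} version). Python's str.format with positional indices behaves much like the String.Format call that consumes this template; a simplified stand-in (placeholders renumbered from 0, values illustrative):

```python
# Simplified stand-in for the template above; placeholder numbering is
# renumbered from 0 for brevity, and the argument values are made up.
TRIAL_KEEPER_CMD = ("python3 -m nni_trial_tool.trial_keeper --trial_command '{0}' "
                    "--nnimanager_ip {1} --nnimanager_port {2} --version '{3}'")

cmd = TRIAL_KEEPER_CMD.format("python3 mnist.py", "10.0.0.1", 8081, "0.5.0")
```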
...@@ -25,7 +25,7 @@ import * as path from 'path';
import { EventEmitter } from 'events';
import { getExperimentId, getInitTrialSequenceId } from '../../common/experimentStartupInfo';
import { getLogger, Logger } from '../../common/log';
import { getExperimentRootDir, uniqueString, getJobCancelStatus, getIPV4Address, getVersion } from '../../common/utils';
import {
TrialJobDetail, TrialJobMetric, NNIManagerIpConfig
} from '../../common/trainingService';
...@@ -61,6 +61,7 @@ abstract class KubernetesTrainingService {
protected kubernetesCRDClient?: KubernetesCRDClient;
protected kubernetesJobRestServer?: KubernetesJobRestServer;
protected kubernetesClusterConfig?: KubernetesClusterConfig;
protected versionCheck?: boolean = true;
constructor() {
this.log = getLogger();
...@@ -179,8 +180,8 @@ abstract class KubernetesTrainingService {
* @param command
* @param trialSequenceId sequence id
*/
protected async generateRunScript(platform: string, trialJobId: string, trialWorkingFolder: string,
command: string, trialSequenceId: string, roleName: string, gpuNum: number): Promise<string> {
let nvidia_script: string = '';
// Nvidia device plugin for K8S has a known issue that requesting zero GPUs allocates all GPUs
// Refer https://github.com/NVIDIA/k8s-device-plugin/issues/61
...@@ -189,6 +190,7 @@ abstract class KubernetesTrainingService {
nvidia_script = `export CUDA_VISIBLE_DEVICES='0'`;
}
const nniManagerIp = this.nniManagerIpConfig?this.nniManagerIpConfig.nniManagerIp:getIPV4Address();
const version = this.versionCheck? await getVersion(): '';
const runScript: string = String.Format(
KubernetesScriptFormat,
platform,
...@@ -201,9 +203,10 @@ abstract class KubernetesTrainingService {
nvidia_script,
command,
nniManagerIp,
this.kubernetesRestServerPort,
version
);
return Promise.resolve(runScript);
}
protected async createNFSStorage(nfsServer: string, nfsPath: string): Promise<void> {
await cpp.exec(`mkdir -p ${this.trialLocalNFSTempFolder}`);
...
...@@ -64,7 +64,7 @@ export const PAI_TRIAL_COMMAND_FORMAT: string =
`export NNI_PLATFORM=pai NNI_SYS_DIR={0} NNI_OUTPUT_DIR={1} NNI_TRIAL_JOB_ID={2} NNI_EXP_ID={3} NNI_TRIAL_SEQ_ID={4}
&& cd $NNI_SYS_DIR && sh install_nni.sh
&& python3 -m nni_trial_tool.trial_keeper --trial_command '{5}' --nnimanager_ip '{6}' --nnimanager_port '{7}'
--pai_hdfs_output_dir '{8}' --pai_hdfs_host '{9}' --pai_user_name {10} --nni_hdfs_exp_dir '{11}' --webhdfs_path '/webhdfs/api/v1' --version '{12}'`;
export const PAI_OUTPUT_DIR_FORMAT: string =
`hdfs://{0}:9000/`;
...
...@@ -39,7 +39,7 @@ import {
TrialJobDetail, TrialJobMetric, NNIManagerIpConfig
} from '../../common/trainingService';
import { delay, generateParamFileName,
getExperimentRootDir, getIPV4Address, uniqueString, getVersion } from '../../common/utils';
import { PAIJobRestServer } from './paiJobRestServer'
import { PAITrialJobDetail, PAI_TRIAL_COMMAND_FORMAT, PAI_OUTPUT_DIR_FORMAT, PAI_LOG_PATH_FORMAT } from './paiData';
import { PAIJobInfoCollector } from './paiJobInfoCollector';
...@@ -75,6 +75,7 @@ class PAITrainingService implements TrainingService {
private paiRestServerPort?: number;
private nniManagerIpConfig?: NNIManagerIpConfig;
private copyExpCodeDirPromise?: Promise<void>;
private versionCheck?: boolean = true;
constructor() {
this.log = getLogger();
...@@ -211,6 +212,7 @@ class PAITrainingService implements TrainingService {
hdfsLogPath);
this.trialJobsMap.set(trialJobId, trialJobDetail);
const nniManagerIp = this.nniManagerIpConfig?this.nniManagerIpConfig.nniManagerIp:getIPV4Address();
const version = this.versionCheck? await getVersion(): '';
const nniPaiTrialCommand : string = String.Format(
PAI_TRIAL_COMMAND_FORMAT,
// PAI will copy job's codeDir into /root directory
...@@ -225,7 +227,8 @@ class PAITrainingService implements TrainingService {
hdfsOutputDir,
this.hdfsOutputHost,
this.paiClusterConfig.userName,
HDFSClientUtility.getHdfsExpCodeDir(this.paiClusterConfig.userName),
version
).replace(/\r\n|\n|\r/gm, '');
console.log(`nniPAItrial command is ${nniPaiTrialCommand.trim()}`);
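The `.replace(/\r\n|\n|\r/gm, '')` call flattens the multi-line command template into a single line before it is handed to PAI; the same transformation expressed in Python:

```python
import re

# Strip every CR/LF so a multi-line template becomes one command line,
# mirroring the TypeScript .replace(/\r\n|\n|\r/gm, '') above.
def flatten_command(command: str) -> str:
    return re.sub(r"\r\n|\n|\r", "", command)
```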
...@@ -434,6 +437,9 @@ class PAITrainingService implements TrainingService {
deferred.resolve();
break;
case TrialConfigMetadataKey.VERSION_CHECK:
this.versionCheck = (value === 'true' || value === 'True');
break;
default:
//Reject for unknown keys
throw new Error(`Unknown key: ${key}`);
...
...@@ -250,8 +250,8 @@ export NNI_PLATFORM=remote NNI_SYS_DIR={0} NNI_OUTPUT_DIR={1} NNI_TRIAL_JOB_ID={
cd $NNI_SYS_DIR
sh install_nni.sh
echo $$ >{6}
python3 -m nni_trial_tool.trial_keeper --trial_command '{7}' --nnimanager_ip '{8}' --nnimanager_port '{9}' --version '{10}' 1>$NNI_OUTPUT_DIR/trialkeeper_stdout 2>$NNI_OUTPUT_DIR/trialkeeper_stderr
echo $? \`date +%s%3N\` >{11}`;
export const HOST_JOB_SHELL_FORMAT: string =
`#!/bin/bash
...
...@@ -51,7 +51,7 @@ import { SSHClientUtility } from './sshClientUtility';
import { validateCodeDir } from '../common/util';
import { RemoteMachineJobRestServer } from './remoteMachineJobRestServer';
import { CONTAINER_INSTALL_NNI_SHELL_FORMAT } from '../common/containerJobData';
import { mkDirP, getVersion } from '../../common/utils';
/**
* Training Service implementation for Remote Machine (Linux)
...@@ -76,6 +76,7 @@ class RemoteMachineTrainingService implements TrainingService {
private remoteRestServerPort?: number;
private readonly remoteOS: string;
private nniManagerIpConfig?: NNIManagerIpConfig;
private versionCheck: boolean = true;
constructor(@component.Inject timer: ObservableTimer) {
this.remoteOS = 'linux';
...@@ -372,6 +373,9 @@ class RemoteMachineTrainingService implements TrainingService {
case TrialConfigMetadataKey.MULTI_PHASE:
this.isMultiPhase = (value === 'true' || value === 'True');
break;
case TrialConfigMetadataKey.VERSION_CHECK:
this.versionCheck = (value === 'true' || value === 'True');
break;
default:
//Reject for unknown keys
throw new Error(`Unknown key: ${key}`);
...@@ -580,6 +584,7 @@ class RemoteMachineTrainingService implements TrainingService {
const restServer: RemoteMachineJobRestServer = component.get(RemoteMachineJobRestServer);
this.remoteRestServerPort = restServer.clusterRestServerPort;
}
const version = this.versionCheck? await getVersion(): '';
const runScriptTrialContent: string = String.Format(
REMOTEMACHINE_TRIAL_COMMAND_FORMAT,
trialWorkingFolder,
...@@ -592,6 +597,7 @@ class RemoteMachineTrainingService implements TrainingService {
command,
nniManagerIp,
this.remoteRestServerPort,
version,
path.join(trialWorkingFolder, '.nni', 'code') path.join(trialWorkingFolder, '.nni', 'code')
)
...
...@@ -40,6 +40,8 @@ def update_training_service_config(args):
config[args.ts]['trial']['dataDir'] = args.data_dir
if args.output_dir is not None:
config[args.ts]['trial']['outputDir'] = args.output_dir
if args.vc is not None:
config[args.ts]['trial']['virtualCluster'] = args.vc
elif args.ts == 'kubeflow':
if args.nfs_server is not None:
config[args.ts]['kubeflowConfig']['nfs']['server'] = args.nfs_server
...@@ -78,6 +80,7 @@ if __name__ == '__main__':
parser.add_argument("--pai_host", type=str)
parser.add_argument("--data_dir", type=str)
parser.add_argument("--output_dir", type=str)
parser.add_argument("--vc", type=str)
# args for kubeflow
parser.add_argument("--nfs_server", type=str)
parser.add_argument("--nfs_path", type=str)
...