"...user_guide/git@developer.sourcefind.cn:wangsen/mineru.git" did not exist on "845a3ff067f9ae4c2f7d9461c0ffef07cc0389d8"
Unverified commit 6ff24a5e authored by SparkSnail, committed by GitHub

Merge pull request #143 from Microsoft/master

merge master
parents 5e777d2f c1e6098d
......@@ -82,10 +82,10 @@ NNI (Neural Network Intelligence) is a toolkit for automated machine learning (AutoML)
## **Use cases**
* Run Trials of different AutoML algorithms locally to train models.
* Try out different automated machine learning (AutoML) algorithms on your local machine to train models.
* Speed up automated machine learning in distributed environments (e.g. remote GPU workstations and cloud servers).
* Customize automated machine learning algorithms, or compare different AutoML algorithms.
* Support automated machine learning in your own machine learning platform.
* Support automated machine learning in machine learning platforms.
## Related projects
......@@ -93,7 +93,7 @@ NNI (Neural Network Intelligence) is a toolkit for automated machine learning (AutoML)
* [OpenPAI](https://github.com/Microsoft/pai): an open-source platform that provides complete AI model training and resource management capabilities, is easy to extend, and supports on-premise, cloud, and hybrid deployments of any scale.
* [FrameworkController](https://github.com/Microsoft/frameworkcontroller): an open-source, general-purpose Kubernetes pod controller that orchestrates all kinds of applications on Kubernetes with a single controller.
* [MMdnn](https://github.com/Microsoft/MMdnn): a complete, cross-framework solution for converting, visualizing, and diagnosing deep neural network models. The "MM" in MMdnn stands for model management and "dnn" stands for deep neural network. We encourage researchers and students to use these projects to accelerate AI development and research.
## **Installation and verification**
......
......@@ -149,6 +149,11 @@ machineList:
Note: The maxExecDuration setting limits the duration of the whole experiment, not of a single trial job. When the experiment reaches its maximum duration, it does not stop, but it can no longer submit new trial jobs.
* __debug__
* Description
NNI checks that the version of the nniManager process matches the version of the trialKeeper process on the remote, pai and kubernetes platforms. To disable this version check, set debug to true (see the config sketch below).
* __maxTrialNum__
* Description
......
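To make the maxExecDuration and debug entries above concrete, here is a minimal sketch of the relevant part of an experiment config file. The surrounding fields and values are illustrative, and it is assumed that the debug switch is accepted at the top level of the config, as the entry above describes:
```
authorName: default
experimentName: example_mnist
trialConcurrency: 1
# time budget for the whole experiment, not for a single trial job;
# when it is reached, running trials continue but no new trial jobs are submitted
maxExecDuration: 1h
maxTrialNum: 10
# assumed top-level switch: true disables the nniManager/trialKeeper version check
debug: true
trainingServicePlatform: remote
```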
......@@ -35,7 +35,7 @@ Note:
If you start a container from NNI's official image `msranni/nni`, you can launch NNI experiments directly with the `nnictl` command. The official image ships with NNI's runtime environment as well as basic Python and deep learning framework environments.
If you start your own docker image, you may need to install the NNI package first; see the [installation guide](https://github.com/Microsoft/nni/blob/master/docs/en_US/Installation.md).
If you start your own docker image, you may need to install the NNI package first; see the [installation guide](Installation.md).
If you want to run NNI's official examples, you may need to clone the NNI repo from GitHub using
```
......@@ -43,11 +43,11 @@ git clone https://github.com/Microsoft/nni.git
```
then you could enter `nni/examples/trials` to start an experiment.
After preparing the NNI environment, you can start a new experiment with the `nnictl` command; see the [quick start guide](https://github.com/Microsoft/nni/blob/master/docs/en_US/QuickStart.md).
After preparing the NNI environment, you can start a new experiment with the `nnictl` command; see the [quick start guide](QuickStart.md).
## Using docker in the remote platform
NNI supports starting experiments with the [remoteTrainingService](https://github.com/Microsoft/nni/blob/master/docs/en_US/RemoteMachineMode.md) and running trial jobs on remote machines. Since docker can run an independent Ubuntu system with an SSH server, a docker container can be used as the remote machine in NNI's remote mode.
NNI supports starting experiments with the [remoteTrainingService](RemoteMachineMode.md) and running trial jobs on remote machines. Since docker can run an independent Ubuntu system with an SSH server, a docker container can be used as the remote machine in NNI's remote mode.
### Step 1: Set up the docker environment
......@@ -78,7 +78,7 @@ If you use your own docker image as remote server, please make sure that this im
### Step 3: Run NNI experiments
You can set the training platform to remote in your config file and configure `machineList` to connect to your docker SSH server, as sketched below; see the [remote machine mode guide](https://github.com/Microsoft/nni/blob/master/docs/en_US/RemoteMachineMode.md). Note that you should set the correct `port`, `username`, and `passwd` or `sshKeyPath` of your host machine.
You can set the training platform to remote in your config file and configure `machineList` to connect to your docker SSH server, as sketched below; see the [remote machine mode guide](RemoteMachineMode.md). Note that you should set the correct `port`, `username`, and `passwd` or `sshKeyPath` of your host machine.
`port:` The host machine's port, mapping to docker's SSH port.
......@@ -88,4 +88,4 @@ You could set your config file as remote platform, and setting the `machineList`
`sshKeyPath:` The path to the private key used to log in to the docker container.
After the config file is ready, you can start an experiment; see the [quick start guide](https://github.com/Microsoft/nni/blob/master/docs/en_US/QuickStart.md).
\ No newline at end of file
After the config file is ready, you can start an experiment; see the [quick start guide](QuickStart.md).
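As a concrete illustration of the fields above, a `machineList` entry pointing at a docker container whose SSH port is published on the host might look like the sketch below. The IP, port, and credentials are placeholders, and `passwd` and `sshKeyPath` are alternatives, so only one of them is needed:
```
trainingServicePlatform: remote
machineList:
  - ip: 10.10.10.10        # host machine running the docker container
    port: 2222             # host port mapped to the container's SSH port 22
    username: root         # user inside the docker container
    passwd: your_password  # or use sshKeyPath instead of passwd
    # sshKeyPath: ~/.ssh/id_rsa
```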
......@@ -45,6 +45,12 @@ nnictl support commands:
|------|------|------|------|
|--config, -c| True| |YAML configure file of the experiment|
|--port, -p|False| |the port of restful server|
|--debug, -d|False||set debug mode|
Note:
```
Debug mode disables the version check function in trialKeeper.
```
<a name="resume"></a>
* __nnictl resume__
......@@ -65,6 +71,7 @@ nnictl support commands:
|------|------|------ |------|
|id| False| |The id of the experiment you want to resume|
|--port, -p| False| |Rest port of the experiment you want to resume|
|--debug, -d|False||set debug mode|
<a name="stop"></a>
* __nnictl stop__
......
......@@ -92,7 +92,7 @@ with tf.Session() as sess:
batch_size = 128
for i in range(10000):
batch = mnist.train.next_batch(batch_size)
+ """@nni.variable(nni.choice(1, 5), name=dropout_rate)"""
+ """@nni.variable(nni.choice(0.1, 0.5), name=dropout_rate)"""
dropout_rate = 0.5
mnist_network.train_step.run(feed_dict={mnist_network.images: batch[0],
mnist_network.labels: batch[1],
......
......@@ -36,6 +36,7 @@ interface ExperimentParams {
trainingServicePlatform: string;
multiPhase?: boolean;
multiThread?: boolean;
versionCheck?: boolean;
tuner?: {
className: string;
builtinTunerName?: string;
......
......@@ -345,6 +345,19 @@ function countFilesRecursively(directory: string, timeoutMilliSeconds?: number):
});
}
/**
* Get the version of the current package
*/
async function getVersion(): Promise<string> {
const deferred : Deferred<string> = new Deferred<string>();
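// Read the version field from the package.json one level above the compiled sources;
// the dynamic import keeps this helper asynchronous instead of doing a synchronous fs read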
import(path.join(__dirname, '..', 'package.json')).then((pkg)=>{
deferred.resolve(pkg.version);
}).catch((error)=>{
deferred.reject(error);
});
return deferred.promise;
}
export {countFilesRecursively, getRemoteTmpDir, generateParamFileName, getMsgDispatcherCommand, getCheckpointDir,
getLogDir, getExperimentRootDir, getJobCancelStatus, getDefaultDatabaseDir, getIPV4Address,
mkDirP, delay, prepareUnitTest, parseArg, cleanupUnitTest, uniqueString, randomSelect };
mkDirP, delay, prepareUnitTest, parseArg, cleanupUnitTest, uniqueString, randomSelect, getVersion };
......@@ -127,6 +127,10 @@ class NNIManager implements Manager {
if (expParams.multiPhase && this.trainingService.isMultiPhaseJobSupported) {
this.trainingService.setClusterMetadata('multiPhase', expParams.multiPhase.toString());
}
// Set up versionCheck config
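// When versionCheck is not specified, no metadata is sent and the training service keeps its default (check enabled)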
if (expParams.versionCheck !== undefined) {
this.trainingService.setClusterMetadata('version_check', expParams.versionCheck.toString());
}
const dispatcherCommand: string = getMsgDispatcherCommand(expParams.tuner, expParams.assessor, expParams.advisor,
expParams.multiPhase, expParams.multiThread);
......@@ -162,6 +166,11 @@ class NNIManager implements Manager {
this.trainingService.setClusterMetadata('multiPhase', expParams.multiPhase.toString());
}
// Set up versionCheck config
if (expParams.versionCheck !== undefined) {
this.trainingService.setClusterMetadata('version_check', expParams.versionCheck.toString());
}
const dispatcherCommand: string = getMsgDispatcherCommand(expParams.tuner, expParams.assessor, expParams.advisor,
expParams.multiPhase, expParams.multiThread);
this.log.debug(`dispatcher command: ${dispatcherCommand}`);
......
......@@ -30,6 +30,7 @@ import { getLogger, Logger } from '../common/log';
import { ExperimentProfile, Manager, TrialJobStatistics} from '../common/manager';
import { ValidationSchemas } from './restValidationSchemas';
import { NNIRestServer } from './nniRestServer';
import { getVersion } from '../common/utils';
const expressJoi = require('express-joi-validator');
......@@ -104,8 +105,8 @@ class NNIRestHandler {
private version(router: Router): void {
router.get('/version', async (req: Request, res: Response) => {
const pkg = await import(path.join(__dirname, '..', 'package.json'));
res.send(pkg.version);
const version = await getVersion();
res.send(version);
});
}
......
......@@ -139,6 +139,7 @@ export namespace ValidationSchemas {
maxExecDuration: joi.number().min(0).required(),
multiPhase: joi.boolean(),
multiThread: joi.boolean(),
versionCheck: joi.boolean(),
advisor: joi.object({
builtinAdvisorName: joi.string().valid('Hyperband'),
codeDir: joi.string(),
......
......@@ -31,5 +31,6 @@ export enum TrialConfigMetadataKey {
PAI_CLUSTER_CONFIG = 'pai_config',
KUBEFLOW_CLUSTER_CONFIG = 'kubeflow_config',
NNI_MANAGER_IP = 'nni_manager_ip',
FRAMEWORKCONTROLLER_CLUSTER_CONFIG = 'frameworkcontroller_config'
FRAMEWORKCONTROLLER_CLUSTER_CONFIG = 'frameworkcontroller_config',
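// Toggles the nniManager/trialKeeper version check in the training services (enabled by default)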
VERSION_CHECK = 'version_check'
}
......@@ -191,7 +191,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
await cpp.exec(`mkdir -p ${trialLocalTempFolder}`);
for(let taskRole of this.fcTrialConfig.taskRoles) {
const runScriptContent: string = this.generateRunScript('frameworkcontroller', trialJobId, trialWorkingFolder,
const runScriptContent: string = await this.generateRunScript('frameworkcontroller', trialJobId, trialWorkingFolder,
this.generateCommandScript(taskRole.command), curTrialSequenceId.toString(), taskRole.name, taskRole.gpuNum);
await fs.promises.writeFile(path.join(trialLocalTempFolder, `run_${taskRole.name}.sh`), runScriptContent, { encoding: 'utf8' });
}
......@@ -267,6 +267,9 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
return Promise.reject(new Error(error));
}
break;
case TrialConfigMetadataKey.VERSION_CHECK:
this.versionCheck = (value === 'true' || value === 'True');
break;
default:
break;
}
......
......@@ -188,7 +188,7 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber
// Write worker file content run_worker.sh to local tmp folders
if(kubeflowTrialConfig.worker) {
const workerRunScriptContent: string = this.generateRunScript('kubeflow', trialJobId, trialWorkingFolder,
const workerRunScriptContent: string = await this.generateRunScript('kubeflow', trialJobId, trialWorkingFolder,
kubeflowTrialConfig.worker.command, curTrialSequenceId.toString(), 'worker', kubeflowTrialConfig.worker.gpuNum);
await fs.promises.writeFile(path.join(trialLocalTempFolder, 'run_worker.sh'), workerRunScriptContent, { encoding: 'utf8' });
......@@ -197,7 +197,7 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber
if(this.kubeflowClusterConfig.operator === 'tf-operator') {
let tensorflowTrialConfig: KubeflowTrialConfigTensorflow = <KubeflowTrialConfigTensorflow>this.kubeflowTrialConfig;
if(tensorflowTrialConfig.ps){
const psRunScriptContent: string = this.generateRunScript('kubeflow', trialJobId, trialWorkingFolder,
const psRunScriptContent: string = await this.generateRunScript('kubeflow', trialJobId, trialWorkingFolder,
tensorflowTrialConfig.ps.command, curTrialSequenceId.toString(), 'ps', tensorflowTrialConfig.ps.gpuNum);
await fs.promises.writeFile(path.join(trialLocalTempFolder, 'run_ps.sh'), psRunScriptContent, { encoding: 'utf8' });
}
......@@ -205,7 +205,7 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber
else if(this.kubeflowClusterConfig.operator === 'pytorch-operator') {
let pytorchTrialConfig: KubeflowTrialConfigPytorch = <KubeflowTrialConfigPytorch>this.kubeflowTrialConfig;
if(pytorchTrialConfig.master){
const masterRunScriptContent: string = this.generateRunScript('kubeflow', trialJobId, trialWorkingFolder,
const masterRunScriptContent: string = await this.generateRunScript('kubeflow', trialJobId, trialWorkingFolder,
pytorchTrialConfig.master.command, curTrialSequenceId.toString(), 'master', pytorchTrialConfig.master.gpuNum);
await fs.promises.writeFile(path.join(trialLocalTempFolder, 'run_master.sh'), masterRunScriptContent, { encoding: 'utf8' });
}
......@@ -317,6 +317,9 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber
return Promise.reject(new Error(error));
}
break;
case TrialConfigMetadataKey.VERSION_CHECK:
this.versionCheck = (value === 'true' || value === 'True');
break;
default:
break;
}
......
......@@ -71,5 +71,5 @@ mkdir -p $NNI_OUTPUT_DIR
cp -rT $NNI_CODE_DIR $NNI_SYS_DIR
cd $NNI_SYS_DIR
sh install_nni.sh
python3 -m nni_trial_tool.trial_keeper --trial_command '{8}' --nnimanager_ip {9} --nnimanager_port {10} `
python3 -m nni_trial_tool.trial_keeper --trial_command '{8}' --nnimanager_ip {9} --nnimanager_port {10} --version '{11}'`
+ `1>$NNI_OUTPUT_DIR/trialkeeper_stdout 2>$NNI_OUTPUT_DIR/trialkeeper_stderr`
......@@ -25,7 +25,7 @@ import * as path from 'path';
import { EventEmitter } from 'events';
import { getExperimentId, getInitTrialSequenceId } from '../../common/experimentStartupInfo';
import { getLogger, Logger } from '../../common/log';
import { getExperimentRootDir, uniqueString, getJobCancelStatus, getIPV4Address } from '../../common/utils';
import { getExperimentRootDir, uniqueString, getJobCancelStatus, getIPV4Address, getVersion } from '../../common/utils';
import {
TrialJobDetail, TrialJobMetric, NNIManagerIpConfig
} from '../../common/trainingService';
......@@ -61,6 +61,7 @@ abstract class KubernetesTrainingService {
protected kubernetesCRDClient?: KubernetesCRDClient;
protected kubernetesJobRestServer?: KubernetesJobRestServer;
protected kubernetesClusterConfig?: KubernetesClusterConfig;
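// Version check is enabled by default; it can be turned off via the 'version_check' metadata (e.g. when the experiment runs in debug mode)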
protected versionCheck?: boolean = true;
constructor() {
this.log = getLogger();
......@@ -179,8 +180,8 @@ abstract class KubernetesTrainingService {
* @param command
* @param trialSequenceId sequence id
*/
protected generateRunScript(platform: string, trialJobId: string, trialWorkingFolder: string,
command: string, trialSequenceId: string, roleName: string, gpuNum: number): string {
protected async generateRunScript(platform: string, trialJobId: string, trialWorkingFolder: string,
command: string, trialSequenceId: string, roleName: string, gpuNum: number): Promise<string> {
let nvidia_script: string = '';
// Nvidia device plugin for K8S has a known issue that requesting zero GPUs allocates all GPUs
// Refer https://github.com/NVIDIA/k8s-device-plugin/issues/61
......@@ -189,6 +190,7 @@ abstract class KubernetesTrainingService {
nvidia_script = `export CUDA_VISIBLE_DEVICES='0'`;
}
const nniManagerIp = this.nniManagerIpConfig?this.nniManagerIpConfig.nniManagerIp:getIPV4Address();
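// An empty version string is passed when the check is disabled, so trial_keeper is expected to skip the comparison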
const version = this.versionCheck? await getVersion(): '';
const runScript: string = String.Format(
KubernetesScriptFormat,
platform,
......@@ -201,9 +203,10 @@ abstract class KubernetesTrainingService {
nvidia_script,
command,
nniManagerIp,
this.kubernetesRestServerPort
this.kubernetesRestServerPort,
version
);
return runScript;
return Promise.resolve(runScript);
}
protected async createNFSStorage(nfsServer: string, nfsPath: string): Promise<void> {
await cpp.exec(`mkdir -p ${this.trialLocalNFSTempFolder}`);
......
......@@ -64,7 +64,7 @@ export const PAI_TRIAL_COMMAND_FORMAT: string =
`export NNI_PLATFORM=pai NNI_SYS_DIR={0} NNI_OUTPUT_DIR={1} NNI_TRIAL_JOB_ID={2} NNI_EXP_ID={3} NNI_TRIAL_SEQ_ID={4}
&& cd $NNI_SYS_DIR && sh install_nni.sh
&& python3 -m nni_trial_tool.trial_keeper --trial_command '{5}' --nnimanager_ip '{6}' --nnimanager_port '{7}'
--pai_hdfs_output_dir '{8}' --pai_hdfs_host '{9}' --pai_user_name {10} --nni_hdfs_exp_dir '{11}' --webhdfs_path '/webhdfs/api/v1'`;
--pai_hdfs_output_dir '{8}' --pai_hdfs_host '{9}' --pai_user_name {10} --nni_hdfs_exp_dir '{11}' --webhdfs_path '/webhdfs/api/v1' --version '{12}'`;
export const PAI_OUTPUT_DIR_FORMAT: string =
`hdfs://{0}:9000/`;
......
......@@ -39,7 +39,7 @@ import {
TrialJobDetail, TrialJobMetric, NNIManagerIpConfig
} from '../../common/trainingService';
import { delay, generateParamFileName,
getExperimentRootDir, getIPV4Address, uniqueString } from '../../common/utils';
getExperimentRootDir, getIPV4Address, uniqueString, getVersion } from '../../common/utils';
import { PAIJobRestServer } from './paiJobRestServer'
import { PAITrialJobDetail, PAI_TRIAL_COMMAND_FORMAT, PAI_OUTPUT_DIR_FORMAT, PAI_LOG_PATH_FORMAT } from './paiData';
import { PAIJobInfoCollector } from './paiJobInfoCollector';
......@@ -75,6 +75,7 @@ class PAITrainingService implements TrainingService {
private paiRestServerPort?: number;
private nniManagerIpConfig?: NNIManagerIpConfig;
private copyExpCodeDirPromise?: Promise<void>;
private versionCheck?: boolean = true;
constructor() {
this.log = getLogger();
......@@ -211,6 +212,7 @@ class PAITrainingService implements TrainingService {
hdfsLogPath);
this.trialJobsMap.set(trialJobId, trialJobDetail);
const nniManagerIp = this.nniManagerIpConfig?this.nniManagerIpConfig.nniManagerIp:getIPV4Address();
const version = this.versionCheck? await getVersion(): '';
const nniPaiTrialCommand : string = String.Format(
PAI_TRIAL_COMMAND_FORMAT,
// PAI will copy job's codeDir into /root directory
......@@ -225,7 +227,8 @@ class PAITrainingService implements TrainingService {
hdfsOutputDir,
this.hdfsOutputHost,
this.paiClusterConfig.userName,
HDFSClientUtility.getHdfsExpCodeDir(this.paiClusterConfig.userName)
HDFSClientUtility.getHdfsExpCodeDir(this.paiClusterConfig.userName),
version
).replace(/\r\n|\n|\r/gm, '');
console.log(`nniPAItrial command is ${nniPaiTrialCommand.trim()}`);
......@@ -434,6 +437,9 @@ class PAITrainingService implements TrainingService {
deferred.resolve();
break;
case TrialConfigMetadataKey.VERSION_CHECK:
this.versionCheck = (value === 'true' || value === 'True');
break;
default:
//Reject for unknown keys
throw new Error(`Unknown key: ${key}`);
......
......@@ -250,8 +250,8 @@ export NNI_PLATFORM=remote NNI_SYS_DIR={0} NNI_OUTPUT_DIR={1} NNI_TRIAL_JOB_ID={
cd $NNI_SYS_DIR
sh install_nni.sh
echo $$ >{6}
python3 -m nni_trial_tool.trial_keeper --trial_command '{7}' --nnimanager_ip '{8}' --nnimanager_port '{9}' 1>$NNI_OUTPUT_DIR/trialkeeper_stdout 2>$NNI_OUTPUT_DIR/trialkeeper_stderr
echo $? \`date +%s%3N\` >{10}`;
python3 -m nni_trial_tool.trial_keeper --trial_command '{7}' --nnimanager_ip '{8}' --nnimanager_port '{9}' --version '{10}' 1>$NNI_OUTPUT_DIR/trialkeeper_stdout 2>$NNI_OUTPUT_DIR/trialkeeper_stderr
echo $? \`date +%s%3N\` >{11}`;
export const HOST_JOB_SHELL_FORMAT: string =
`#!/bin/bash
......
......@@ -51,7 +51,7 @@ import { SSHClientUtility } from './sshClientUtility';
import { validateCodeDir } from '../common/util';
import { RemoteMachineJobRestServer } from './remoteMachineJobRestServer';
import { CONTAINER_INSTALL_NNI_SHELL_FORMAT } from '../common/containerJobData';
import { mkDirP } from '../../common/utils';
import { mkDirP, getVersion } from '../../common/utils';
/**
* Training Service implementation for Remote Machine (Linux)
......@@ -76,6 +76,7 @@ class RemoteMachineTrainingService implements TrainingService {
private remoteRestServerPort?: number;
private readonly remoteOS: string;
private nniManagerIpConfig?: NNIManagerIpConfig;
private versionCheck: boolean = true;
constructor(@component.Inject timer: ObservableTimer) {
this.remoteOS = 'linux';
......@@ -372,6 +373,9 @@ class RemoteMachineTrainingService implements TrainingService {
case TrialConfigMetadataKey.MULTI_PHASE:
this.isMultiPhase = (value === 'true' || value === 'True');
break;
case TrialConfigMetadataKey.VERSION_CHECK:
this.versionCheck = (value === 'true' || value === 'True');
break;
default:
//Reject for unknown keys
throw new Error(`Unknown key: ${key}`);
......@@ -580,6 +584,7 @@ class RemoteMachineTrainingService implements TrainingService {
const restServer: RemoteMachineJobRestServer = component.get(RemoteMachineJobRestServer);
this.remoteRestServerPort = restServer.clusterRestServerPort;
}
const version = this.versionCheck? await getVersion(): '';
const runScriptTrialContent: string = String.Format(
REMOTEMACHINE_TRIAL_COMMAND_FORMAT,
trialWorkingFolder,
......@@ -592,6 +597,7 @@ class RemoteMachineTrainingService implements TrainingService {
command,
nniManagerIp,
this.remoteRestServerPort,
version,
path.join(trialWorkingFolder, '.nni', 'code')
)
......
......@@ -40,6 +40,8 @@ def update_training_service_config(args):
config[args.ts]['trial']['dataDir'] = args.data_dir
if args.output_dir is not None:
config[args.ts]['trial']['outputDir'] = args.output_dir
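# Optional OpenPAI virtual cluster; only written into the generated config when --vc is given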
if args.vc is not None:
config[args.ts]['trial']['virtualCluster'] = args.vc
elif args.ts == 'kubeflow':
if args.nfs_server is not None:
config[args.ts]['kubeflowConfig']['nfs']['server'] = args.nfs_server
......@@ -78,6 +80,7 @@ if __name__ == '__main__':
parser.add_argument("--pai_host", type=str)
parser.add_argument("--data_dir", type=str)
parser.add_argument("--output_dir", type=str)
parser.add_argument("--vc", type=str)
# args for kubeflow
parser.add_argument("--nfs_server", type=str)
parser.add_argument("--nfs_path", type=str)
......