Unverified Commit 40bae6e2 authored by SparkSnail, committed by GitHub

Merge pull request #172 from microsoft/master

merge master
parents c7ca4510 d8e1c4af
@@ -100,6 +100,8 @@ Targeting at openness and advancing state-of-art technology, [Microsoft Research
* [OpenPAI](https://github.com/Microsoft/pai) : an open source platform that provides complete AI model training and resource management capabilities; it is easy to extend and supports on-premises, cloud, and hybrid environments at various scales.
* [FrameworkController](https://github.com/Microsoft/frameworkcontroller) : an open source general-purpose Kubernetes Pod Controller that orchestrates all kinds of applications on Kubernetes with a single controller.
* [MMdnn](https://github.com/Microsoft/MMdnn) : a comprehensive, cross-framework solution to convert, visualize and diagnose deep neural network models. The "MM" in MMdnn stands for model management and "dnn" is an acronym for deep neural network.
* [SPTAG](https://github.com/Microsoft/SPTAG) : Space Partition Tree And Graph (SPTAG) is an open source library for large-scale approximate nearest neighbor search over vectors.
We encourage researchers and students to leverage these projects to accelerate AI development and research.
## **Install & Verify**
...
######################
Research Blog
######################
.. toctree::
:maxdepth: 2
Hyperparameter Optimization Comparison<HpoComparison>
Neural Architecture Search Comparison<NasComparison>
@@ -2,7 +2,7 @@
NNI provides state-of-the-art tuning algorithms as built-in tuners and makes them easy to use. Below is a brief summary of NNI's current built-in tuners:
Note: Click a **tuner's name** for a detailed description of the algorithm, and click the corresponding **Usage** for the tuner's installation requirements, suggested scenarios and a usage example. Here is an [article](./CommunitySharings/HPOComparison.md) comparing different tuners on several problems.
Currently we support the following algorithms:
...
# Automatically tuning SVD with NNI
In this tutorial, we first introduce the GitHub repo [Recommenders](https://github.com/Microsoft/Recommenders), a repository of examples and best practices for building recommendation systems, provided as Jupyter notebooks. It covers many models that are popular and widely deployed in recommendation systems. To provide a complete end-to-end experience, each example is presented as five key tasks, as shown below:
- [Prepare Data](https://github.com/Microsoft/Recommenders/blob/master/notebooks/01_prepare_data/README.md): Preparing and loading data for each recommender algorithm
- [Model](https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/README.md): Building models using various classical and deep learning recommender algorithms such as Alternating Least Squares ([ALS](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/recommendation.html#ALS)) or eXtreme Deep Factorization Machines ([xDeepFM](https://arxiv.org/abs/1803.05170)).
- [Evaluate](https://github.com/Microsoft/Recommenders/blob/master/notebooks/03_evaluate/README.md): Evaluating algorithms with offline metrics
- [Model Select and Optimize](https://github.com/Microsoft/Recommenders/blob/master/notebooks/04_model_select_and_optimize/README.md): Tuning and optimizing hyperparameters for recommender models
- [Operationalize](https://github.com/Microsoft/Recommenders/blob/master/notebooks/05_operationalize/README.md): Operationalizing models in a production environment on Azure
The fourth task, tuning and optimizing the model's hyperparameters, is where NNI can help. As a concrete example of NNI tuning a model from Recommenders, let's demonstrate with the [SVD](https://github.com/Microsoft/Recommenders/blob/master/notebooks/02_model/surprise_svd_deep_dive.ipynb) model and the Movielens100k data. This model has more than 10 hyperparameters to tune.
[This Jupyter notebook](https://github.com/Microsoft/Recommenders/blob/master/notebooks/04_model_select_and_optimize/nni_surprise_svd.ipynb) provided by Recommenders is a detailed step-by-step tutorial for this example. It uses NNI's different built-in tuning algorithms, including `Annealing`, `SMAC`, `Random Search`, `TPE`, `Hyperband`, `Metis` and `Evolution`, and finally compares their results. Please go through the notebook to learn how to tune the SVD model with NNI; you can then use NNI to tune other models in Recommenders.
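To give a flavor of what the notebook sets up, here is a minimal sketch of an NNI search space for SVD-style hyperparameters, written as a TypeScript literal to match the code elsewhere in this changeset. The `_type`/`_value` structure is NNI's search space format; the parameter names follow Surprise's SVD signature, and the ranges are illustrative assumptions rather than the notebook's actual values:

```typescript
// Hypothetical search space for Surprise's SVD model (values illustrative).
// NNI reads the same structure from a search_space.json file.
const searchSpace = {
    n_factors: { _type: 'choice', _value: [50, 100, 150, 200] },    // number of latent factors
    n_epochs:  { _type: 'choice', _value: [10, 20, 30] },           // SGD passes over the ratings
    lr_all:    { _type: 'loguniform', _value: [0.0005, 0.1] },      // learning rate for all parameters
    reg_all:   { _type: 'uniform', _value: [0.01, 0.5] }            // regularization for all parameters
};

console.log(JSON.stringify(searchSpace, null, 4));
```

Each tuner listed above draws trial configurations from a space like this and receives the trial's metric (for example RMSE on a validation split) as feedback.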
######################
Community Sharings
######################
In addition to the official tutorials and examples, we encourage community contributors to share their AutoML practices, especially their experience using NNI.
.. toctree::
:maxdepth: 2
NNI Practice Sharing<nni_practice_sharing>
Neural Architecture Search Comparison<CommunitySharings/NasComparison>
Hyper-parameter Tuning Algorithm Comparison<CommunitySharings/HpoComparison>
@@ -19,4 +19,4 @@ Contents
FAQ
Contribution<contribution>
Changelog<Release>
Community Sharings<community_sharings>
#################
Tutorials
#################
Sharing the practice of leveraging NNI to tune models and systems.
.. toctree::
:maxdepth: 2
Tuning SVD of Recommenders on NNI<CommunitySharings/NniPracticeSharing/RecommendersSvd>
@@ -7,6 +7,7 @@ import logging
import math
import tempfile
import time
import argparse
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
@@ -20,7 +21,7 @@ class MnistNetwork(object):
    def __init__(self, params, feature_size = 784):
        config = []
        for i in range(4):
            config.append(params['layer'+str(i)])
        self.config = config
        self.feature_size = feature_size
...
@@ -357,13 +357,18 @@ function countFilesRecursively(directory: string, timeoutMilliSeconds?: number):
    });
    let fileCount: number = -1;
    let cmd: string;
    if (process.platform === "win32") {
        cmd = `powershell "Get-ChildItem -Path ${directory} -Recurse -File | Measure-Object | %{$_.Count}"`;
    } else {
        cmd = `find ${directory} -type f | wc -l`;
    }
    cpp.exec(cmd).then((result) => {
        if (result.stdout && parseInt(result.stdout)) {
            fileCount = parseInt(result.stdout);
        }
        deferred.resolve(fileCount);
    });
    return Promise.race([deferred.promise, delayTimeout]).finally(() => {
        clearTimeout(timeoutId);
    });
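For context, the pattern this hunk implements, reduced to a self-contained sketch: pick a platform-appropriate shell command, then race its result against a timeout. The helper name and the default timeout here are assumptions:

```typescript
import * as cp from 'child_process';
import { promisify } from 'util';

const exec = promisify(cp.exec);

// Count regular files under `directory`, giving up after `timeoutMs`.
async function countFiles(directory: string, timeoutMs: number = 5000): Promise<number> {
    const cmd: string = process.platform === 'win32'
        ? `powershell "Get-ChildItem -Path ${directory} -Recurse -File | Measure-Object | %{$_.Count}"`
        : `find ${directory} -type f | wc -l`;
    // Resolve to -1 when the command output is not a number.
    const count: Promise<number> = exec(cmd).then(({ stdout }) => parseInt(stdout, 10) || -1);
    const timeout: Promise<number> = new Promise((_resolve, reject) =>
        setTimeout(() => reject(new Error(`countFiles timed out after ${timeoutMs}ms`)), timeoutMs));
    return Promise.race([count, timeout]);
}
```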
@@ -459,6 +464,16 @@ function getNewLine(): string{
    }
}
/**
 * Join path segments with '/' instead of '\' on all platforms
 * @param paths path segments to join
 */
function unixPathJoin(...paths: any[]): string {
    const dir: string = paths.filter((path: any) => path !== '').join('/');
    if (dir === '') return '.';
    return dir;
}
export {countFilesRecursively, getRemoteTmpDir, generateParamFileName, getMsgDispatcherCommand, getCheckpointDir,
    getLogDir, getExperimentRootDir, getJobCancelStatus, getDefaultDatabaseDir, getIPV4Address, unixPathJoin,
    mkDirP, delay, prepareUnitTest, parseArg, cleanupUnitTest, uniqueString, randomSelect, getLogLevel, getVersion, getCmdPy, getTunerProc, isAlive, killPid, getNewLine };
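A quick note on why this helper exists: `path.join` emits backslashes on Windows, while the remote HDFS/PAI paths built later in this changeset must stay forward-slashed. Expected behavior:

```typescript
unixPathJoin('nni', 'experiments', 'exp1');  // => 'nni/experiments/exp1'
unixPathJoin('', 'trials', 'abc');           // => 'trials/abc' (empty segments are dropped)
unixPathJoin();                              // => '.'
```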
{
    "kind": "CustomResourceDefinition",
    "spec": {
        "scope": "Namespaced",
        "version": "v1beta2",
        "group": "kubeflow.org",
        "names": {
            "kind": "PyTorchJob",
            "plural": "pytorchjobs",
            "singular": "pytorchjob"
        }
    },
    "apiVersion": "apiextensions.k8s.io/v1beta1",
    "metadata": {
        "name": "pytorchjobs.kubeflow.org"
    }
}
{
    "kind": "CustomResourceDefinition",
    "spec": {
        "scope": "Namespaced",
        "version": "v1beta2",
        "group": "kubeflow.org",
        "names": {
            "kind": "TFJob",
            "plural": "tfjobs",
            "singular": "tfjob"
        }
    },
    "apiVersion": "apiextensions.k8s.io/v1beta1",
    "metadata": {
        "name": "tfjobs.kubeflow.org"
    }
}
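These manifests mirror what the TypeScript clients below load at runtime: read the JSON, register it with the Kubernetes client library, then address the resource by group and version. A minimal sketch, assuming the godaddy `kubernetes-client` package used by this codebase and a reachable kubeconfig:

```typescript
import * as fs from 'fs';
const { Client, config } = require('kubernetes-client');

// Register the PyTorchJob CRD so the client can build paths like
// /apis/kubeflow.org/v1beta2/namespaces/default/pytorchjobs
const client = new Client({ config: config.fromKubeconfig(), version: '1.9' });
const crdSchema = JSON.parse(fs.readFileSync('./config/kubeflow/pytorchjob-crd-v1beta2.json', 'utf8'));
client.addCustomResourceDefinition(crdSchema);

// List jobs through the dynamically generated API surface.
client.apis['kubeflow.org'].v1beta2.namespaces('default').pytorchjobs.get()
    .then((jobs: any) => console.log(jobs.body));
```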
@@ -25,6 +25,7 @@ import { EventEmitter } from 'events';
import { Readable, Writable } from 'stream';
import { NNIError } from '../common/errors';
import { getLogger, Logger } from '../common/log';
import { getLogDir } from '../common/utils';
import * as CommandType from './commands';
const ipcOutgoingFd: number = 3;
@@ -106,7 +107,10 @@ class IpcInterface {
            this.logger.warning('Commands jammed in buffer!');
        }
    } catch (err) {
        throw NNIError.FromError(
            err,
            `Dispatcher error, please check the dispatcher log file for more information: ${getLogDir()}/dispatcher.log`
        );
    }
}
...
@@ -152,7 +152,16 @@ mkDirP(getLogDir())
    console.error(`Failed to create log dir: ${err.stack}`);
});
function getStopSignal(): any {
    if (process.platform === "win32") {
        return 'SIGBREAK';
    } else {
        return 'SIGTERM';
    }
}
process.on(getStopSignal(), async () => {
    const log: Logger = getLogger();
    let hasError: boolean = false;
    try {
...
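Windows cannot deliver SIGTERM to a running Node.js process, so the manager listens for SIGBREAK there instead. The same pattern as a standalone sketch, with `cleanUp` as a hypothetical stand-in for the teardown logic:

```typescript
// Pick the strongest stop signal the platform can actually deliver.
const stopSignal: NodeJS.Signals = process.platform === 'win32' ? 'SIGBREAK' : 'SIGTERM';

process.on(stopSignal, async () => {
    try {
        await cleanUp();   // hypothetical teardown: flush logs, stop trials, close connections
    } finally {
        process.exit(0);
    }
});

async function cleanUp(): Promise<void> {
    // persist state, release resources, etc.
}
```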
@@ -29,20 +29,35 @@ abstract class KubeflowOperatorClient extends KubernetesCRDClient{
     */
    public static generateOperatorClient(kubeflowOperator: KubeflowOperator,
                                         operatorApiVersion: string): KubernetesCRDClient {
        switch(kubeflowOperator) {
            case 'tf-operator': {
                switch(operatorApiVersion) {
                    case 'v1alpha2': {
                        return new TFOperatorClientV1Alpha2();
                    }
                    case 'v1beta1': {
                        return new TFOperatorClientV1Beta1();
                    }
                    case 'v1beta2': {
                        return new TFOperatorClientV1Beta2();
                    }
                }
                break;
            }
            case 'pytorch-operator': {
                switch(operatorApiVersion) {
                    case 'v1alpha2': {
                        return new PyTorchOperatorClientV1Alpha2();
                    }
                    case 'v1beta1': {
                        return new PyTorchOperatorClientV1Beta1();
                    }
                    case 'v1beta2': {
                        return new PyTorchOperatorClientV1Beta2();
                    }
                }
            }
        }
        throw new Error(`Invalid operator ${kubeflowOperator} or apiVersion ${operatorApiVersion}`);
    }
}
@@ -85,7 +100,26 @@ class TFOperatorClientV1Beta1 extends KubernetesCRDClient {
    }
}
class TFOperatorClientV1Beta2 extends KubernetesCRDClient {
    /**
     * constructor, to initialize tfjob CRD definition
     */
    public constructor() {
        super();
        this.crdSchema = JSON.parse(fs.readFileSync('./config/kubeflow/tfjob-crd-v1beta2.json', 'utf8'));
        this.client.addCustomResourceDefinition(this.crdSchema);
    }
    protected get operator(): any {
        return this.client.apis["kubeflow.org"].v1beta2.namespaces('default').tfjobs;
    }
    public get containerName(): string {
        return 'tensorflow';
    }
}
class PyTorchOperatorClientV1Alpha2 extends KubeflowOperatorClient {
    /**
     * constructor, to initialize pytorchjob CRD definition
     */
@@ -104,7 +138,7 @@ class PytorchOperatorClientV1Alpha2 extends KubeflowOperatorClient {
    }
}
class PyTorchOperatorClientV1Beta1 extends KubernetesCRDClient {
    /**
     * constructor, to initialize pytorchjob CRD definition
     */
@@ -123,5 +157,24 @@ class PytorchOperatorClientV1Beta1 extends KubernetesCRDClient {
    }
}
class PyTorchOperatorClientV1Beta2 extends KubernetesCRDClient {
    /**
     * constructor, to initialize pytorchjob CRD definition
     */
    public constructor() {
        super();
        this.crdSchema = JSON.parse(fs.readFileSync('./config/kubeflow/pytorchjob-crd-v1beta2.json', 'utf8'));
        this.client.addCustomResourceDefinition(this.crdSchema);
    }
    protected get operator(): any {
        return this.client.apis["kubeflow.org"].v1beta2.namespaces('default').pytorchjobs;
    }
    public get containerName(): string {
        return 'pytorch';
    }
}
export { KubeflowOperatorClient, GeneralK8sClient };
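Resolving a client through the extended factory is unchanged; only the accepted version strings grow:

```typescript
// Returns a PyTorchOperatorClientV1Beta2; unknown combinations throw.
const crdClient: KubernetesCRDClient =
    KubeflowOperatorClient.generateOperatorClient('pytorch-operator', 'v1beta2');
console.log(crdClient.containerName);  // 'pytorch'
```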
@@ -28,7 +28,7 @@ import { MethodNotImplementedError } from '../../../common/errors';
export type KubeflowOperator = 'tf-operator' | 'pytorch-operator';
export type DistTrainRole = 'worker' | 'ps' | 'master';
export type KubeflowJobStatus = 'Created' | 'Running' | 'Failed' | 'Succeeded';
export type OperatorApiVersion = 'v1alpha2' | 'v1beta1' | 'v1beta2';
export class KubeflowClusterConfig extends KubernetesClusterConfig {
    public readonly operator: KubeflowOperator;
...
@@ -22,6 +22,7 @@ import * as fs from 'fs';
import { Deferred } from 'ts-deferred';
import { getExperimentId } from '../../common/experimentStartupInfo';
import { getLogger } from '../../common/log';
import { unixPathJoin } from '../../common/utils';
/**
 * HDFS client utility, including copy file/directory
@@ -32,7 +33,7 @@ export namespace HDFSClientUtility {
 * @param hdfsUserName HDFS user name
 */
function hdfsExpRootDir(hdfsUserName: string): string {
    return '/' + unixPathJoin(hdfsUserName, 'nni', 'experiments', getExperimentId());
}
/**
@@ -40,7 +41,7 @@ export namespace HDFSClientUtility {
 * @param hdfsUserName HDFS user name
 */
export function getHdfsExpCodeDir(hdfsUserName: string): string {
    return unixPathJoin(hdfsExpRootDir(hdfsUserName), 'codeDir');
}
/**
@@ -49,7 +50,9 @@ export namespace HDFSClientUtility {
 * @param hdfsUserName HDFS user name
 * @param trialId NNI trial ID
 */
export function getHdfsTrialWorkDir(hdfsUserName: string, trialId: string): string {
    const root: string = hdfsExpRootDir(hdfsUserName);
    return unixPathJoin(root, 'trials', trialId);
}
/**
...
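With `unixPathJoin`, the generated HDFS paths come out identical on every host platform. For example (the experiment ID below is made up):

```typescript
// Assuming getExperimentId() returns 'GvVS3Bcp':
HDFSClientUtility.getHdfsExpCodeDir('alice');
// => '/alice/nni/experiments/GvVS3Bcp/codeDir'
HDFSClientUtility.getHdfsTrialWorkDir('alice', 'Abc1');
// => '/alice/nni/experiments/GvVS3Bcp/trials/Abc1'
```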
@@ -40,7 +40,8 @@ import { delay, generateParamFileName,
    getExperimentRootDir, getIPV4Address, getVersion, uniqueString } from '../../common/utils';
import { CONTAINER_INSTALL_NNI_SHELL_FORMAT } from '../common/containerJobData';
import { TrialConfigMetadataKey } from '../common/trialConfigMetadataKey';
import { validateCodeDir, execMkdir } from '../common/util';
import { unixPathJoin } from '../../common/utils';
import { HDFSClientUtility } from './hdfsClientUtility';
import { NNIPAITrialConfig, PAIClusterConfig, PAIJobConfig, PAITaskRole } from './paiConfig';
import { PAI_LOG_PATH_FORMAT, PAI_OUTPUT_DIR_FORMAT, PAI_TRIAL_COMMAND_FORMAT, PAITrialJobDetail } from './paiData';
@@ -406,12 +407,12 @@ class PAITrainingService implements TrainingService {
        }
        // Step 1. Prepare PAI job configuration
        const hdfsOutputDir: string = unixPathJoin(this.hdfsBaseDir, this.experimentId, trialJobId);
        const hdfsCodeDir: string = HDFSClientUtility.getHdfsTrialWorkDir(this.paiClusterConfig.userName, trialJobId);
        const trialLocalTempFolder: string = path.join(getExperimentRootDir(), 'trials-local', trialJobId);
        // Create tmp trial working folder locally.
        await execMkdir(trialLocalTempFolder);
        const runScriptContent: string = CONTAINER_INSTALL_NNI_SHELL_FORMAT;
        // Write NNI installation file to local tmp files
...
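`execMkdir` is imported from `../common/util` and its body is not part of this diff; a plausible cross-platform sketch of such a helper (the actual implementation may differ):

```typescript
import * as cpp from 'child-process-promise';

// Create a directory, including parents, with the platform's native command.
export async function execMkdir(directory: string): Promise<void> {
    if (process.platform === 'win32') {
        await cpp.exec(`powershell.exe New-Item -Path "${directory}" -ItemType "directory" -Force`);
    } else {
        await cpp.exec(`mkdir -p ${directory}`);
    }
}
```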
@@ -192,17 +192,21 @@ class Overview extends React.Component<{}, OverviewState> {
            method: 'GET'
        })
        .then(res => {
            if (res.status === 200) {
                const errors = res.data.errors;
                if (errors.length !== 0) {
                    if (this._isMounted) {
                        this.setState({
                            status: res.data.status,
                            errorStr: res.data.errors[0]
                        });
                    }
                } else {
                    if (this._isMounted) {
                        this.setState({
                            status: res.data.status,
                        });
                    }
                }
            }
        });
@@ -254,7 +258,8 @@ class Overview extends React.Component<{}, OverviewState> {
            case 'SUCCEEDED':
                profile.succTrial += 1;
                const desJobDetail: Parameters = {
                    parameters: {},
                    intermediate: []
                };
                const duration = (tableData[item].endTime - tableData[item].startTime) / 1000;
                const acc = getFinal(tableData[item].finalMetricData);
...
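The change narrows the `_isMounted` check so it guards only the `setState` calls, the part React actually warns about when an update arrives after unmount. The general shape of the pattern, with a hypothetical endpoint:

```typescript
import * as React from 'react';
import axios from 'axios';

class StatusBadge extends React.Component<{}, { status: string }> {
    // Tracks whether the component is still mounted; React warns if
    // setState fires after unmount, e.g. when a fetch resolves late.
    private _isMounted: boolean = false;

    public state = { status: 'UNKNOWN' };

    public componentDidMount(): void {
        this._isMounted = true;
        axios.get('/api/v1/nni/check-status')    // hypothetical endpoint
            .then(res => {
                if (this._isMounted) {           // guard only the setState call
                    this.setState({ status: res.data.status });
                }
            });
    }

    public componentWillUnmount(): void {
        this._isMounted = false;
    }

    public render(): React.ReactNode {
        return React.createElement('span', null, this.state.status);
    }
}
```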