Commit af198888 authored by hao-howard-zhang, committed by GitHub

Support Custom Kubernetes Namespace in AdaptDL Mode (#3176)

parent c1e926b9
@@ -52,6 +52,7 @@ trialConcurrency: 2
 maxTrialNum: 2
 trial:
+  namespace: <k8s_namespace>
   adaptive: false # optional.
   image: <image_tag>
   imagePullSecrets: # optional
@@ -66,7 +67,7 @@ trial:
     path: /
     containerMountPath: /nfs
   checkpoint: # optional
-    storageClass: microk8s-hostpath
+    storageClass: dfs
     storageSize: 1Gi
 ```
@@ -79,6 +80,7 @@ IP address of the machine with NNI manager (NNICTL) that launches NNI experiment
 * **logCollection**: *Recommended* to set as `http`. It will collect the trial logs on cluster back to your machine via http.
 * **tuner**: It supports the Tuun tuner and all NNI built-in tuners (only except for the checkpoint feature of the NNI PBT tuners).
 * **trial**: It defines the specs of an `adl` trial.
+    * **namespace**: (*Optional*) Kubernetes namespace to launch the trials. Default to `default` namespace.
     * **adaptive**: (*Optional*) Boolean for AdaptDL trainer. While `true`, it the job is preemptible and adaptive.
     * **image**: Docker image for the trial
     * **imagePullSecret**: (*Optional*) If you are using a private registry,
@@ -90,7 +92,10 @@ IP address of the machine with NNI manager (NNICTL) that launches NNI experiment
     * **memorySize**: (*Optional*) the size of memory requested for this trial. It must follow the Kubernetes
       [default format](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory).
     * **nfs**: (*Optional*) mounting external storage. For more information about using NFS please check the below paragraph.
-    * **checkpoint** (*Optional*) [storage settings](https://kubernetes.io/docs/concepts/storage/storage-classes/) for AdaptDL internal checkpoints. You can keep it optional if you are not dev users.
+    * **checkpoint**: (*Optional*) storage settings for model checkpoints.
+        * **storageClass**: check [Kubernetes storage documentation](https://kubernetes.io/docs/concepts/storage/storage-classes/) for how to use the appropriate `storageClass`.
+        * **storageSize**: this value should be large enough to fit your model's checkpoints, or it could cause disk quota exceeded error.
 ### NFS Storage
......
@@ -72,7 +72,7 @@ Here is a template configuration specification to use AdaptDL as a training serv
     path: /
     containerMountPath: /nfs
   checkpoint: # optional
-    storageClass: microk8s-hostpath
+    storageClass: dfs
     storageSize: 1Gi
 Those configs not mentioned below, are following the
@@ -86,6 +86,7 @@ Those configs not mentioned below, are following the
 * **tuner**\ : It supports the Tuun tuner and all NNI built-in tuners (only except for the checkpoint feature of the NNI PBT tuners).
 * **trial**\ : It defines the specs of an ``adl`` trial.
+    * **namespace**\: (*Optional*\ ) Kubernetes namespace to launch the trials. Default to ``default`` namespace.
     * **adaptive**\ : (*Optional*\ ) Boolean for AdaptDL trainer. While ``true``\ , it the job is preemptible and adaptive.
     * **image**\ : Docker image for the trial
     * **imagePullSecret**\ : (*Optional*\ ) If you are using a private registry,
@@ -97,7 +98,10 @@ Those configs not mentioned below, are following the
     * **memorySize**\ : (*Optional*\ ) the size of memory requested for this trial. It must follow the Kubernetes
       `default format <https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory>`__.
     * **nfs**\ : (*Optional*\ ) mounting external storage. For more information about using NFS please check the below paragraph.
-    * **checkpoint** (*Optional*\ ) `storage settings <https://kubernetes.io/docs/concepts/storage/storage-classes/>`__ for AdaptDL internal checkpoints. You can keep it optional if you are not dev users.
+    * **checkpoint** (*Optional*\ ) storage settings for model checkpoints.
+        * **storageClass**\ : check `Kubernetes storage documentation <https://kubernetes.io/docs/concepts/storage/storage-classes/>`__ for how to use the appropriate ``storageClass``.
+        * **storageSize**\ : this value should be large enough to fit your model's checkpoints, or it could cause "disk quota exceeded" error.
 NFS Storage
 ^^^^^^^^^^^
......
# Dockerfile for building AdaptDL-enabled CIFAR10 image
# Set docker build context to current folder
FROM pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime
RUN pip install nni adaptdl tensorboard
COPY ./ /cifar10
@@ -17,10 +17,11 @@ tuner:
   #choice: maximize, minimize
   optimize_mode: maximize
 trial:
+  namespace: default
   command: python3 /cifar10/main_adl.py
   codeDir: /cifar10
   gpuNum: 1
-  image: {replace_with_the_image_that_has_adaptdl_installed}
+  image: {image_built_by_adl.Dockerfile}
   # optional
   imagePullSecrets:
     - name: {secret}
......
@@ -268,6 +268,7 @@ adl_trial_schema = {
     'command': setType('command', str),
     'gpuNum': setNumberRange('gpuNum', int, 0, 99999),
     'image': setType('image', str),
+    Optional('namespace'): setType('namespace', str),
     Optional('imagePullSecrets'): [{
         'name': setType('name', str)
     }],
......
@@ -101,6 +101,7 @@ export namespace ValidationSchemas {
     name: joi.string().min(1).required()
 }),
 // ############## adl ###############
+namespace: joi.string(),
 adaptive: joi.boolean(),
 checkpoint: joi.object({
     storageClass: joi.string().min(1).required(),
......
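The two schema changes above only require `namespace` to be a string. Kubernetes itself restricts namespace names to RFC 1123 DNS labels: at most 63 characters, lowercase alphanumerics or `-`, starting and ending with an alphanumeric. A minimal sketch of a stricter pre-check follows; the helper name `isValidNamespaceName` is hypothetical and not part of this commit.

```typescript
// Hypothetical helper, not part of the commit: checks a namespace name against
// the RFC 1123 DNS label rules that Kubernetes applies to namespace names.
const DNS_LABEL = /^[a-z0-9]([-a-z0-9]*[a-z0-9])?$/;

function isValidNamespaceName(name: string): boolean {
    // The regex rejects the empty string; the length check enforces the 63-character limit.
    return name.length <= 63 && DNS_LABEL.test(name);
}

console.log(isValidNamespaceName('default'));
console.log(isValidNamespaceName('My_Namespace'));
```

A check like this would fail fast at config validation time rather than when the first trial is submitted to the cluster.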
@@ -13,14 +13,17 @@ class AdlClientV1 extends KubernetesCRDClient {
     /**
      * constructor, to initialize adl CRD definition
      */
-    public constructor() {
+    protected readonly namespace: string;
+    public constructor(namespace: string) {
         super();
+        this.namespace = namespace;
         this.crdSchema = JSON.parse(fs.readFileSync('./config/adl/adaptdl-crd-v1.json', 'utf8'));
         this.client.addCustomResourceDefinition(this.crdSchema);
     }
     protected get operator(): any {
-        return this.client.apis['adaptdl.petuum.com'].v1.namespaces('default').adaptdljobs;
+        return this.client.apis['adaptdl.petuum.com'].v1.namespaces(this.namespace).adaptdljobs;
     }
     public get containerName(): string {
@@ -29,7 +32,7 @@ class AdlClientV1 extends KubernetesCRDClient {
     public async getKubernetesPods(jobName: string): Promise<any> {
         let result: Promise<any>;
-        const response = await this.client.api.v1.namespaces('default').pods
+        const response = await this.client.api.v1.namespaces(this.namespace).pods
             .get({ qs: { labelSelector: `adaptdl/job=${jobName}` } });
         if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
             result = Promise.resolve(response.body);
@@ -47,8 +50,8 @@ class AdlClientFactory {
     /**
      * Factory method to generate operator client
      */
-    public static createClient(): KubernetesCRDClient {
-        return new AdlClientV1();
+    public static createClient(namespace: string): KubernetesCRDClient {
+        return new AdlClientV1(namespace);
     }
 }
......
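The client change above replaces the hard-coded `'default'` with a namespace passed in at construction. As a rough illustration of what that selects on the API server, the sketch below builds the standard Kubernetes REST paths for namespaced resources. The class `NamespacedPaths` is a hypothetical stand-in that makes no API calls; it is not the `kubernetes-client` library or the commit's `AdlClientV1`.

```typescript
// Illustrative stand-in for a namespaced client; it only builds REST paths.
class NamespacedPaths {
    private readonly namespace: string;

    public constructor(namespace: string) {
        this.namespace = namespace;
    }

    // Namespaced custom resources live under /apis/<group>/<version>/namespaces/<ns>/<plural>.
    public get adaptdlJobs(): string {
        return `/apis/adaptdl.petuum.com/v1/namespaces/${this.namespace}/adaptdljobs`;
    }

    // Core resources such as pods live under /api/v1/namespaces/<ns>/<plural>.
    public get pods(): string {
        return `/api/v1/namespaces/${this.namespace}/pods`;
    }
}

const paths = new NamespacedPaths('nni-trials');
console.log(paths.adaptdlJobs);
console.log(paths.pods);
```

Because the namespace is fixed per instance, every request issued through one client targets the same namespace, which is why the factory now needs it as an argument.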
@@ -58,6 +58,8 @@ export class AdlTrialConfig extends KubernetesTrialConfig {
     public readonly image: string;
+    public readonly namespace?: string;
     public readonly imagePullSecrets?: ImagePullSecretConfig[];
     public readonly nfs?: NFSConfig;
@@ -72,7 +74,8 @@ export class AdlTrialConfig extends KubernetesTrialConfig {
     constructor(codeDir: string,
                 command: string, gpuNum: number,
-                image: string, imagePullSecrets?: ImagePullSecretConfig[],
+                image: string, namespace?: string,
+                imagePullSecrets?: ImagePullSecretConfig[],
                 nfs?: NFSConfig, checkpoint?: CheckpointConfig,
                 cpuNum?: number, memorySize?: string,
                 adaptive?: boolean
@@ -81,6 +84,7 @@ export class AdlTrialConfig extends KubernetesTrialConfig {
         this.command = command;
         this.gpuNum = gpuNum;
         this.image = image;
+        this.namespace = namespace;
         this.imagePullSecrets = imagePullSecrets;
         this.nfs = nfs;
         this.checkpoint = checkpoint;
......
@@ -16,21 +16,21 @@ export class AdlJobInfoCollector extends KubernetesJobInfoCollector {
         super(jobMap);
     }
-    protected async retrieveSingleTrialJobInfo(kubernetesCRDClient: AdlClientV1 | undefined,
+    protected async retrieveSingleTrialJobInfo(adlClient: AdlClientV1 | undefined,
                                                kubernetesTrialJob: KubernetesTrialJobDetail): Promise<void> {
         if (!this.statusesNeedToCheck.includes(kubernetesTrialJob.status)) {
             return Promise.resolve();
         }
-        if (kubernetesCRDClient === undefined) {
-            return Promise.reject('kubernetesCRDClient is undefined');
+        if (adlClient === undefined) {
+            return Promise.reject('AdlClient is undefined');
         }
         let kubernetesJobInfo: any;
         let kubernetesPodsInfo: any;
         try {
-            kubernetesJobInfo = await kubernetesCRDClient.getKubernetesJob(kubernetesTrialJob.kubernetesJobName);
-            kubernetesPodsInfo = await kubernetesCRDClient.getKubernetesPods(kubernetesTrialJob.kubernetesJobName);
+            kubernetesJobInfo = await adlClient.getKubernetesJob(kubernetesTrialJob.kubernetesJobName);
+            kubernetesPodsInfo = await adlClient.getKubernetesPods(kubernetesTrialJob.kubernetesJobName);
         } catch (error) {
             // Notice: it maynot be a 'real' error since cancel trial job can also cause getKubernetesJob failed.
             this.log.error(`Get job ${kubernetesTrialJob.kubernetesJobName} info failed, error is ${error}`);
......
@@ -39,7 +39,6 @@ class AdlTrainingService extends KubernetesTrainingService implements Kubernetes
         super();
         this.adlJobInfoCollector = new AdlJobInfoCollector(this.trialJobsMap);
         this.experimentId = getExperimentId();
-        this.kubernetesCRDClient = AdlClientFactory.createClient();
         this.configmapTemplateStr = fs.readFileSync(
             './config/adl/adaptdl-nni-configmap-template.json', 'utf8');
         this.jobTemplateStr = fs.readFileSync('./config/adl/adaptdljob-template.json', 'utf8');
@@ -294,15 +293,34 @@ python3 -m nni.tools.trial_tool.trial_keeper --trial_command '{8}' \
         return Promise.resolve(runScript);
     }
+    public async cleanUp(): Promise<void> {
+        super.cleanUp();
+        // Delete Tensorboard deployment
+        try {
+            await this.genericK8sClient.deleteDeployment("adaptdl-tensorboard-" + this.experimentId.toLowerCase());
+            this.log.info('tensorboard deployment deleted');
+        } catch (error) {
+            this.log.error(`tensorboard deployment deletion failed: ${error.message}`);
+        }
+    }
     public async setClusterMetadata(key: string, value: string): Promise<void> {
         this.log.info('SetCluster ' + key + ', ' +value);
         switch (key) {
             case TrialConfigMetadataKey.NNI_MANAGER_IP:
                 this.nniManagerIpConfig = <NNIManagerIpConfig>JSON.parse(value);
                 break;
-            case TrialConfigMetadataKey.TRIAL_CONFIG:
+            case TrialConfigMetadataKey.TRIAL_CONFIG: {
                 this.adlTrialConfig = <AdlTrialConfig>JSON.parse(value);
+                let namespace: string = 'default';
+                if (this.adlTrialConfig.namespace !== undefined) {
+                    namespace = this.adlTrialConfig.namespace;
+                }
+                this.genericK8sClient.setNamespace = namespace;
+                this.kubernetesCRDClient = AdlClientFactory.createClient(namespace);
                 break;
+            }
             case TrialConfigMetadataKey.VERSION_CHECK:
                 this.versionCheck = (value === 'true' || value === 'True');
                 break;
......
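The `TRIAL_CONFIG` branch above falls back to `'default'` when the parsed config has no `namespace`. The same fallback can be sketched with the nullish coalescing operator. The interface `TrialConfigLike` and the function `resolveNamespace` are hypothetical names used only for illustration.

```typescript
// Hypothetical minimal shape of the parsed trial config; only the field used here.
interface TrialConfigLike {
    namespace?: string | null;
}

// Returns the configured namespace, or 'default' when it is missing.
function resolveNamespace(config: TrialConfigLike): string {
    // `??` falls back on both undefined and null.
    return config.namespace ?? 'default';
}

console.log(resolveNamespace(JSON.parse('{}')));
console.log(resolveNamespace(JSON.parse('{"namespace": "nni-trials"}')));
```

One behavioural difference from the diff's `!== undefined` check: JSON can carry an explicit `null`, which `??` maps to `'default'` but the explicit check would pass through as-is. Whether a `null` namespace can reach this code in NNI is not something the diff shows.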
@@ -8,17 +8,22 @@ import { Client1_10, config } from 'kubernetes-client';
 import { getLogger, Logger } from '../../common/log';
 /**
- * Generict Kubernetes client, target version >= 1.9
+ * Generic Kubernetes client, target version >= 1.9
  */
 class GeneralK8sClient {
     protected readonly client: any;
     protected readonly log: Logger = getLogger();
+    protected namespace: string = 'default';
     constructor() {
         this.client = new Client1_10({ config: config.fromKubeconfig(), version: '1.9'});
         this.client.loadSpec();
     }
+    public set setNamespace(namespace: string) {
+        this.namespace = namespace;
+    }
     private matchStorageClass(response: any): string {
         const adlSupportedProvisioners: RegExp[] = [
             new RegExp("microk8s.io/hostpath"),
@@ -60,7 +65,8 @@ class GeneralK8sClient {
     public async createDeployment(deploymentManifest: any): Promise<string> {
         let result: Promise<string>;
-        const response: any = await this.client.apis.apps.v1.namespaces('default').deployments.post({ body: deploymentManifest })
+        const response: any = await this.client.apis.apps.v1.namespaces(this.namespace)
+            .deployments.post({ body: deploymentManifest })
         if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
             result = Promise.resolve(response.body.metadata.uid);
         } else {
@@ -72,7 +78,7 @@ class GeneralK8sClient {
     public async deleteDeployment(deploymentName: string): Promise<boolean> {
         let result: Promise<boolean>;
         // TODO: change this hard coded deployment name after demo
-        const response: any = await this.client.apis.apps.v1.namespaces('default')
+        const response: any = await this.client.apis.apps.v1.namespaces(this.namespace)
             .deployment(deploymentName).delete();
         if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
             result = Promise.resolve(true);
@@ -84,7 +90,7 @@ class GeneralK8sClient {
     public async createConfigMap(configMapManifest: any): Promise<boolean> {
         let result: Promise<boolean>;
-        const response: any = await this.client.api.v1.namespaces('default')
+        const response: any = await this.client.api.v1.namespaces(this.namespace)
             .configmaps.post({body: configMapManifest});
         if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
             result = Promise.resolve(true);
@@ -97,7 +103,7 @@ class GeneralK8sClient {
     public async createPersistentVolumeClaim(pvcManifest: any): Promise<boolean> {
         let result: Promise<boolean>;
-        const response: any = await this.client.api.v1.namespaces('default')
+        const response: any = await this.client.api.v1.namespaces(this.namespace)
             .persistentvolumeclaims.post({body: pvcManifest});
         if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
             result = Promise.resolve(true);
@@ -109,8 +115,8 @@ class GeneralK8sClient {
     public async createSecret(secretManifest: any): Promise<boolean> {
         let result: Promise<boolean>;
-        const response: any = await this.client.api.v1.namespaces('default').secrets
-            .post({body: secretManifest});
+        const response: any = await this.client.api.v1.namespaces(this.namespace)
+            .secrets.post({body: secretManifest});
         if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
             result = Promise.resolve(true);
         } else {
......
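Note that `setNamespace` in the hunk above is declared with the `set` keyword, so it is a property accessor: callers assign to it (`client.setNamespace = namespace`, as the training service does) rather than call it. A minimal sketch of that behaviour follows, using a hypothetical `NamespaceHolder` class and a `currentNamespace` getter that are not part of the commit.

```typescript
// Hypothetical class mirroring only the accessor pattern from the diff.
class NamespaceHolder {
    protected namespace: string = 'default';

    // Accessor form, as in the diff: invoked by assignment, not by a call.
    public set setNamespace(namespace: string) {
        this.namespace = namespace;
    }

    // Added here only so the sketch can observe the stored value.
    public get currentNamespace(): string {
        return this.namespace;
    }
}

const holder = new NamespaceHolder();
console.log(holder.currentNamespace);
holder.setNamespace = 'nni-trials'; // assignment triggers the setter
console.log(holder.currentNamespace);
```

Writing `holder.setNamespace('nni-trials')` would not compile, because an accessor is not callable; an ordinary method would be the alternative design if call syntax were preferred.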
@@ -209,13 +209,6 @@ abstract class KubernetesTrainingService {
             return Promise.reject(error);
         }
-        try {
-            await this.genericK8sClient.deleteDeployment("adaptdl-tensorboard-" + getExperimentId().toLowerCase())
-            this.log.info('tensorboard deployment deleted')
-        } catch (error) {
-            this.log.error(`tensorboard deployment deletion failed: ${error.message}`)
-        }
         return Promise.resolve();
     }
......
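The commit moves the TensorBoard deployment deletion out of the shared base class and into an ADL-specific `cleanUp()` override. In the diff the override calls `super.cleanUp()` without `await`. The sketch below shows the same override shape with the base call awaited, so the base cleanup finishes (and any rejection surfaces) before the ADL-specific step runs. Whether that ordering matters in NNI is an assumption, and all class and method names here are hypothetical.

```typescript
// Hypothetical base service; records what happened instead of touching a cluster.
class BaseService {
    public readonly events: string[] = [];

    public async cleanUp(): Promise<void> {
        await Promise.resolve(); // stands in for asynchronous base cleanup work
        this.events.push('base cleanup');
    }
}

// Hypothetical subclass with a mode-specific cleanup step after the base cleanup.
class AdlLikeService extends BaseService {
    public async cleanUp(): Promise<void> {
        await super.cleanUp(); // awaited, unlike the call in the diff
        try {
            await this.deleteTensorboard();
            this.events.push('tensorboard deleted');
        } catch (error) {
            // A failed deletion is logged, not rethrown, matching the diff's behaviour.
            const message = error instanceof Error ? error.message : String(error);
            this.events.push(`deletion failed: ${message}`);
        }
    }

    // Simulates a failing deletion so the catch branch is exercised.
    protected async deleteTensorboard(): Promise<void> {
        throw new Error('not found');
    }
}

const service = new AdlLikeService();
service.cleanUp().then(() => console.log(service.events.join(' -> ')));
```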