Commit af198888, authored by hao-howard-zhang, committed by GitHub

Support Custom Kubernetes Namespace in AdaptDL Mode (#3176)

parent c1e926b9
......@@ -52,6 +52,7 @@ trialConcurrency: 2
maxTrialNum: 2
trial:
namespace: <k8s_namespace>
adaptive: false # optional.
image: <image_tag>
imagePullSecrets: # optional
......@@ -66,7 +67,7 @@ trial:
path: /
containerMountPath: /nfs
checkpoint: # optional
storageClass: microk8s-hostpath
storageClass: dfs
storageSize: 1Gi
```
......@@ -79,6 +80,7 @@ IP address of the machine with NNI manager (NNICTL) that launches NNI experiment
* **logCollection**: *Recommended* to set as `http`. It will collect the trial logs on cluster back to your machine via http.
* **tuner**: It supports the Tuun tuner and all NNI built-in tuners (only except for the checkpoint feature of the NNI PBT tuners).
* **trial**: It defines the specs of an `adl` trial.
* **namespace**: (*Optional*) Kubernetes namespace to launch the trials. Default to `default` namespace.
* **adaptive**: (*Optional*) Boolean for AdaptDL trainer. When `true`, the job is preemptible and adaptive.
* **image**: Docker image for the trial
* **imagePullSecret**: (*Optional*) If you are using a private registry,
......@@ -90,7 +92,10 @@ IP address of the machine with NNI manager (NNICTL) that launches NNI experiment
* **memorySize**: (*Optional*) the size of memory requested for this trial. It must follow the Kubernetes
[default format](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory).
* **nfs**: (*Optional*) mounting external storage. For more information about using NFS please check the below paragraph.
* **checkpoint** (*Optional*) [storage settings](https://kubernetes.io/docs/concepts/storage/storage-classes/) for AdaptDL internal checkpoints. You can keep it optional if you are not dev users.
* **checkpoint**: (*Optional*) storage settings for model checkpoints.
* **storageClass**: check [Kubernetes storage documentation](https://kubernetes.io/docs/concepts/storage/storage-classes/) for how to use the appropriate `storageClass`.
* **storageSize**: this value should be large enough to fit your model's checkpoints, or it could cause disk quota exceeded error.
### NFS Storage
......
......@@ -72,7 +72,7 @@ Here is a template configuration specification to use AdaptDL as a training serv
path: /
containerMountPath: /nfs
checkpoint: # optional
storageClass: microk8s-hostpath
storageClass: dfs
storageSize: 1Gi
Those configs not mentioned below, are following the
......@@ -86,6 +86,7 @@ Those configs not mentioned below, are following the
* **tuner**\ : It supports the Tuun tuner and all NNI built-in tuners (only except for the checkpoint feature of the NNI PBT tuners).
* **trial**\ : It defines the specs of an ``adl`` trial.
* **namespace**\: (*Optional*\ ) Kubernetes namespace to launch the trials. Default to ``default`` namespace.
* **adaptive**\ : (*Optional*\ ) Boolean for AdaptDL trainer. When ``true``\ , the job is preemptible and adaptive.
* **image**\ : Docker image for the trial
* **imagePullSecret**\ : (*Optional*\ ) If you are using a private registry,
......@@ -97,7 +98,10 @@ Those configs not mentioned below, are following the
* **memorySize**\ : (*Optional*\ ) the size of memory requested for this trial. It must follow the Kubernetes
`default format <https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory>`__.
* **nfs**\ : (*Optional*\ ) mounting external storage. For more information about using NFS please check the below paragraph.
* **checkpoint** (*Optional*\ ) `storage settings <https://kubernetes.io/docs/concepts/storage/storage-classes/>`__ for AdaptDL internal checkpoints. You can keep it optional if you are not dev users.
* **checkpoint** (*Optional*\ ) storage settings for model checkpoints.
* **storageClass**\ : check `Kubernetes storage documentation <https://kubernetes.io/docs/concepts/storage/storage-classes/>`__ for how to use the appropriate ``storageClass``.
* **storageSize**\ : this value should be large enough to fit your model's checkpoints, or it could cause "disk quota exceeded" error.
NFS Storage
^^^^^^^^^^^
......
# Dockerfile for building AdaptDL-enabled CIFAR10 image
# Set docker build context to current folder
FROM pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime
RUN pip install nni adaptdl tensorboard
COPY ./ /cifar10
......@@ -17,10 +17,11 @@ tuner:
#choice: maximize, minimize
optimize_mode: maximize
trial:
namespace: default
command: python3 /cifar10/main_adl.py
codeDir: /cifar10
gpuNum: 1
image: {replace_with_the_image_that_has_adaptdl_installed}
image: {image_built_by_adl.Dockerfile}
# optional
imagePullSecrets:
- name: {secret}
......
......@@ -268,6 +268,7 @@ adl_trial_schema = {
'command': setType('command', str),
'gpuNum': setNumberRange('gpuNum', int, 0, 99999),
'image': setType('image', str),
Optional('namespace'): setType('namespace', str),
Optional('imagePullSecrets'): [{
'name': setType('name', str)
}],
......
......@@ -101,6 +101,7 @@ export namespace ValidationSchemas {
name: joi.string().min(1).required()
}),
// ############## adl ###############
namespace: joi.string(),
adaptive: joi.boolean(),
checkpoint: joi.object({
storageClass: joi.string().min(1).required(),
......
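Both schema changes above accept any string as `namespace`. Kubernetes itself requires namespace names to be DNS-1123 labels: at most 63 characters, lowercase alphanumerics or `-`, starting and ending with an alphanumeric. A stricter pre-check could catch bad values before they reach the API server. The sketch below is only an illustration; `isValidNamespace` and `DNS1123_LABEL` are hypothetical names and are not part of this commit.

```typescript
// Hypothetical helper, NOT part of this commit: checks a namespace name
// against the Kubernetes DNS-1123 label rules.
const DNS1123_LABEL = /^[a-z0-9]([-a-z0-9]*[a-z0-9])?$/;

function isValidNamespace(name: string): boolean {
    // The regex rejects the empty string, uppercase letters,
    // and a leading or trailing '-'; the length check enforces the 63-char cap.
    return name.length <= 63 && DNS1123_LABEL.test(name);
}
```

With such a check, a value like `NNI_Trials` would be rejected at config validation time rather than surfacing later as an API error during trial submission.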
......@@ -13,14 +13,17 @@ class AdlClientV1 extends KubernetesCRDClient {
/**
* constructor, to initialize adl CRD definition
*/
public constructor() {
protected readonly namespace: string;
public constructor(namespace: string) {
super();
this.namespace = namespace;
this.crdSchema = JSON.parse(fs.readFileSync('./config/adl/adaptdl-crd-v1.json', 'utf8'));
this.client.addCustomResourceDefinition(this.crdSchema);
}
protected get operator(): any {
return this.client.apis['adaptdl.petuum.com'].v1.namespaces('default').adaptdljobs;
return this.client.apis['adaptdl.petuum.com'].v1.namespaces(this.namespace).adaptdljobs;
}
public get containerName(): string {
......@@ -29,7 +32,7 @@ class AdlClientV1 extends KubernetesCRDClient {
public async getKubernetesPods(jobName: string): Promise<any> {
let result: Promise<any>;
const response = await this.client.api.v1.namespaces('default').pods
const response = await this.client.api.v1.namespaces(this.namespace).pods
.get({ qs: { labelSelector: `adaptdl/job=${jobName}` } });
if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
result = Promise.resolve(response.body);
......@@ -47,8 +50,8 @@ class AdlClientFactory {
/**
* Factory method to generate operator client
*/
public static createClient(): KubernetesCRDClient {
return new AdlClientV1();
public static createClient(namespace: string): KubernetesCRDClient {
return new AdlClientV1(namespace);
}
}
......
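To make the effect of threading `namespace` through `AdlClientV1` concrete, the sketch below spells out the standard Kubernetes REST paths that the two fluent chains in this file are expected to resolve to. The function names are hypothetical and the mapping from the `kubernetes-client` chain to a path is an assumption based on the Kubernetes API URL layout, not something shown in the diff.

```typescript
// Illustrative only: function names are hypothetical, and the paths follow
// the standard Kubernetes layout for namespaced resources.

// Custom resource: /apis/<group>/<version>/namespaces/<ns>/<plural>
function adaptdlJobsPath(namespace: string): string {
    return `/apis/adaptdl.petuum.com/v1/namespaces/${encodeURIComponent(namespace)}/adaptdljobs`;
}

// Core resource with a label selector, as used by getKubernetesPods.
function trialPodsPath(namespace: string, jobName: string): string {
    const selector = encodeURIComponent(`adaptdl/job=${jobName}`);
    return `/api/v1/namespaces/${encodeURIComponent(namespace)}/pods?labelSelector=${selector}`;
}
```

Before this change the namespace segment was always `default`, so jobs and pods created elsewhere would not have been found by these lookups.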
......@@ -58,6 +58,8 @@ export class AdlTrialConfig extends KubernetesTrialConfig {
public readonly image: string;
public readonly namespace?: string;
public readonly imagePullSecrets?: ImagePullSecretConfig[];
public readonly nfs?: NFSConfig;
......@@ -72,7 +74,8 @@ export class AdlTrialConfig extends KubernetesTrialConfig {
constructor(codeDir: string,
command: string, gpuNum: number,
image: string, imagePullSecrets?: ImagePullSecretConfig[],
image: string, namespace?: string,
imagePullSecrets?: ImagePullSecretConfig[],
nfs?: NFSConfig, checkpoint?: CheckpointConfig,
cpuNum?: number, memorySize?: string,
adaptive?: boolean
......@@ -81,6 +84,7 @@ export class AdlTrialConfig extends KubernetesTrialConfig {
this.command = command;
this.gpuNum = gpuNum;
this.image = image;
this.namespace = namespace;
this.imagePullSecrets = imagePullSecrets;
this.nfs = nfs;
this.checkpoint = checkpoint;
......
......@@ -16,21 +16,21 @@ export class AdlJobInfoCollector extends KubernetesJobInfoCollector {
super(jobMap);
}
protected async retrieveSingleTrialJobInfo(kubernetesCRDClient: AdlClientV1 | undefined,
protected async retrieveSingleTrialJobInfo(adlClient: AdlClientV1 | undefined,
kubernetesTrialJob: KubernetesTrialJobDetail): Promise<void> {
if (!this.statusesNeedToCheck.includes(kubernetesTrialJob.status)) {
return Promise.resolve();
}
if (kubernetesCRDClient === undefined) {
return Promise.reject('kubernetesCRDClient is undefined');
if (adlClient === undefined) {
return Promise.reject('AdlClient is undefined');
}
let kubernetesJobInfo: any;
let kubernetesPodsInfo: any;
try {
kubernetesJobInfo = await kubernetesCRDClient.getKubernetesJob(kubernetesTrialJob.kubernetesJobName);
kubernetesPodsInfo = await kubernetesCRDClient.getKubernetesPods(kubernetesTrialJob.kubernetesJobName);
kubernetesJobInfo = await adlClient.getKubernetesJob(kubernetesTrialJob.kubernetesJobName);
kubernetesPodsInfo = await adlClient.getKubernetesPods(kubernetesTrialJob.kubernetesJobName);
} catch (error) {
// Note: this may not be a 'real' error, since cancelling a trial job can also cause getKubernetesJob to fail.
this.log.error(`Get job ${kubernetesTrialJob.kubernetesJobName} info failed, error is ${error}`);
......
......@@ -39,7 +39,6 @@ class AdlTrainingService extends KubernetesTrainingService implements Kubernetes
super();
this.adlJobInfoCollector = new AdlJobInfoCollector(this.trialJobsMap);
this.experimentId = getExperimentId();
this.kubernetesCRDClient = AdlClientFactory.createClient();
this.configmapTemplateStr = fs.readFileSync(
'./config/adl/adaptdl-nni-configmap-template.json', 'utf8');
this.jobTemplateStr = fs.readFileSync('./config/adl/adaptdljob-template.json', 'utf8');
......@@ -294,15 +293,34 @@ python3 -m nni.tools.trial_tool.trial_keeper --trial_command '{8}' \
return Promise.resolve(runScript);
}
public async cleanUp(): Promise<void> {
super.cleanUp();
// Delete Tensorboard deployment
try {
await this.genericK8sClient.deleteDeployment("adaptdl-tensorboard-" + this.experimentId.toLowerCase());
this.log.info('tensorboard deployment deleted');
} catch (error) {
this.log.error(`tensorboard deployment deletion failed: ${error.message}`);
}
}
public async setClusterMetadata(key: string, value: string): Promise<void> {
this.log.info('SetCluster ' + key + ', ' +value);
switch (key) {
case TrialConfigMetadataKey.NNI_MANAGER_IP:
this.nniManagerIpConfig = <NNIManagerIpConfig>JSON.parse(value);
break;
case TrialConfigMetadataKey.TRIAL_CONFIG:
case TrialConfigMetadataKey.TRIAL_CONFIG: {
this.adlTrialConfig = <AdlTrialConfig>JSON.parse(value);
let namespace: string = 'default';
if (this.adlTrialConfig.namespace !== undefined) {
namespace = this.adlTrialConfig.namespace;
}
this.genericK8sClient.setNamespace = namespace;
this.kubernetesCRDClient = AdlClientFactory.createClient(namespace);
break;
}
case TrialConfigMetadataKey.VERSION_CHECK:
this.versionCheck = (value === 'true' || value === 'True');
break;
......
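The `TRIAL_CONFIG` branch above falls back to `default` when the parsed config carries no `namespace`. A minimal standalone sketch of that fallback follows; `TrialConfigLike` and `resolveNamespace` are hypothetical names used only for illustration.

```typescript
// Hypothetical, trimmed-down shape: only the field this sketch needs.
interface TrialConfigLike {
    namespace?: string;
}

// Mirrors the fallback in setClusterMetadata: use the configured namespace
// when present, otherwise 'default'. An empty string is passed through
// unchanged, which matches the `!== undefined` check in the diff.
function resolveNamespace(config: TrialConfigLike): string {
    return config.namespace !== undefined ? config.namespace : 'default';
}

// JSON.parse of a config without the key leaves `namespace` undefined.
const parsedWithoutNamespace: TrialConfigLike = JSON.parse('{"image": "example"}');
const resolved: string = resolveNamespace(parsedWithoutNamespace);
```

Because the CRD client is now created in this branch rather than in the constructor, `kubernetesCRDClient` is not available until the trial config has been received.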
......@@ -8,17 +8,22 @@ import { Client1_10, config } from 'kubernetes-client';
import { getLogger, Logger } from '../../common/log';
/**
* Generict Kubernetes client, target version >= 1.9
* Generic Kubernetes client, target version >= 1.9
*/
class GeneralK8sClient {
protected readonly client: any;
protected readonly log: Logger = getLogger();
protected namespace: string = 'default';
constructor() {
this.client = new Client1_10({ config: config.fromKubeconfig(), version: '1.9'});
this.client.loadSpec();
}
public set setNamespace(namespace: string) {
this.namespace = namespace;
}
private matchStorageClass(response: any): string {
const adlSupportedProvisioners: RegExp[] = [
new RegExp("microk8s.io/hostpath"),
......@@ -60,7 +65,8 @@ class GeneralK8sClient {
public async createDeployment(deploymentManifest: any): Promise<string> {
let result: Promise<string>;
const response: any = await this.client.apis.apps.v1.namespaces('default').deployments.post({ body: deploymentManifest })
const response: any = await this.client.apis.apps.v1.namespaces(this.namespace)
.deployments.post({ body: deploymentManifest })
if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
result = Promise.resolve(response.body.metadata.uid);
} else {
......@@ -72,7 +78,7 @@ class GeneralK8sClient {
public async deleteDeployment(deploymentName: string): Promise<boolean> {
let result: Promise<boolean>;
// TODO: change this hard coded deployment name after demo
const response: any = await this.client.apis.apps.v1.namespaces('default')
const response: any = await this.client.apis.apps.v1.namespaces(this.namespace)
.deployment(deploymentName).delete();
if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
result = Promise.resolve(true);
......@@ -84,7 +90,7 @@ class GeneralK8sClient {
public async createConfigMap(configMapManifest: any): Promise<boolean> {
let result: Promise<boolean>;
const response: any = await this.client.api.v1.namespaces('default')
const response: any = await this.client.api.v1.namespaces(this.namespace)
.configmaps.post({body: configMapManifest});
if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
result = Promise.resolve(true);
......@@ -97,7 +103,7 @@ class GeneralK8sClient {
public async createPersistentVolumeClaim(pvcManifest: any): Promise<boolean> {
let result: Promise<boolean>;
const response: any = await this.client.api.v1.namespaces('default')
const response: any = await this.client.api.v1.namespaces(this.namespace)
.persistentvolumeclaims.post({body: pvcManifest});
if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
result = Promise.resolve(true);
......@@ -109,8 +115,8 @@ class GeneralK8sClient {
public async createSecret(secretManifest: any): Promise<boolean> {
let result: Promise<boolean>;
const response: any = await this.client.api.v1.namespaces('default').secrets
.post({body: secretManifest});
const response: any = await this.client.api.v1.namespaces(this.namespace)
.secrets.post({body: secretManifest});
if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
result = Promise.resolve(true);
} else {
......
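One detail worth noting in `GeneralK8sClient`: `setNamespace` is declared with the `set` keyword, so it is a property accessor rather than a method. That is why the training service writes `this.genericK8sClient.setNamespace = namespace` instead of calling it. The sketch below reproduces the pattern on a hypothetical stand-in class; `NamespacedClient` and `currentNamespace` are not in the diff.

```typescript
// Hypothetical stand-in for GeneralK8sClient, reduced to the accessor pattern.
class NamespacedClient {
    protected namespace: string = 'default';

    // Declared as a setter accessor, so callers assign to it.
    public set setNamespace(namespace: string) {
        this.namespace = namespace;
    }

    // Helper added for this sketch so the stored value can be observed.
    public currentNamespace(): string {
        return this.namespace;
    }
}

const sketchClient = new NamespacedClient();
const before: string = sketchClient.currentNamespace();
sketchClient.setNamespace = 'nni-trials'; // assignment, not a call
const after: string = sketchClient.currentNamespace();
```

A plain method such as `setNamespace(ns)` or a setter named `namespace` would read more conventionally, but that is a naming choice and does not change behavior.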
......@@ -209,13 +209,6 @@ abstract class KubernetesTrainingService {
return Promise.reject(error);
}
try {
await this.genericK8sClient.deleteDeployment("adaptdl-tensorboard-" + getExperimentId().toLowerCase())
this.log.info('tensorboard deployment deleted')
} catch (error) {
this.log.error(`tensorboard deployment deletion failed: ${error.message}`)
}
return Promise.resolve();
}
......