Unverified commit fbffbc7c, authored by Markus Bauer, committed by GitHub

[WIP] Enable optional Pod Spec for FrameworkController platform (#3379)

parent 38c9a734
...@@ -11,7 +11,6 @@ ...@@ -11,7 +11,6 @@
/ts/nni_manager/metrics.json /ts/nni_manager/metrics.json
/ts/nni_manager/trial_jobs.json /ts/nni_manager/trial_jobs.json
# Logs # Logs
logs logs
*.log *.log
......
...@@ -28,6 +28,16 @@ Prerequisite for Azure Kubernetes Service ...@@ -28,6 +28,16 @@ Prerequisite for Azure Kubernetes Service
#. Follow the `guideline <https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal>`__ to create azure file storage account. If you use Azure Kubernetes Service, NNI need Azure Storage Service to store code files and the output files. #. Follow the `guideline <https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal>`__ to create azure file storage account. If you use Azure Kubernetes Service, NNI need Azure Storage Service to store code files and the output files.
#. To access Azure storage service, NNI need the access key of the storage account, and NNI uses `Azure Key Vault <https://azure.microsoft.com/en-us/services/key-vault/>`__ Service to protect your private key. Set up Azure Key Vault Service, add a secret to Key Vault to store the access key of Azure storage account. Follow this `guideline <https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli>`__ to store the access key. #. To access Azure storage service, NNI need the access key of the storage account, and NNI uses `Azure Key Vault <https://azure.microsoft.com/en-us/services/key-vault/>`__ Service to protect your private key. Set up Azure Key Vault Service, add a secret to Key Vault to store the access key of Azure storage account. Follow this `guideline <https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli>`__ to store the access key.
Prerequisite for PVC storage mode
-----------------------------------------
To use persistent volume claims instead of NFS or Azure storage, the related storage must
be created manually in the namespace in which your trials will later run. This restriction exists
because persistent volume claims are hard to recycle and can therefore quickly disrupt a cluster's
storage management. Persistent volume claims can be created with, for example, kubectl. Please refer
to the official Kubernetes documentation for `further information <https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims>`__.
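As a rough illustration of the manual step described above, the following sketch builds a PersistentVolumeClaim manifest as JSON, which kubectl accepts alongside YAML. The claim name ``nni-storage`` and the namespace ``default`` are taken from the example template in this change; the helper name ``build_pvc_manifest``, the requested size and the access mode are assumptions and need to be adapted to your cluster.

```python
import json


def build_pvc_manifest(name, namespace, storage="1Gi", access_mode="ReadWriteMany"):
    """Return a PersistentVolumeClaim manifest as a plain dict.

    The storage size and access mode defaults are illustrative assumptions,
    not values required by NNI.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": [access_mode],
            "resources": {"requests": {"storage": storage}},
        },
    }


# The claim must live in the namespace in which the trials will run.
manifest = build_pvc_manifest("nni-storage", "default")

# Print the manifest; saved to a file such as pvc.json, it could be
# applied with `kubectl apply -f pvc.json`.
print(json.dumps(manifest, indent=2))
```

The claim name used here has to match the ``claimName`` referenced in the pod's ``volumes`` section of the frameworkcontroller template.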
Setup FrameworkController Setup FrameworkController
------------------------- -------------------------
...@@ -116,6 +126,37 @@ Trial configuration in frameworkcontroller mode have the following configuration ...@@ -116,6 +126,37 @@ Trial configuration in frameworkcontroller mode have the following configuration
* image: the docker image used to create pod and run the program. * image: the docker image used to create pod and run the program.
* frameworkAttemptCompletionPolicy: the policy to run framework, please refer the `user-manual <https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md#frameworkattemptcompletionpolicy>`__ to get the specific information. Users could use the policy to control the pod, for example, if ps does not stop, only worker stops, The completion policy could helps stop ps. * frameworkAttemptCompletionPolicy: the policy to run framework, please refer the `user-manual <https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md#frameworkattemptcompletionpolicy>`__ to get the specific information. Users could use the policy to control the pod, for example, if ps does not stop, only worker stops, The completion policy could helps stop ps.
NNI also lets you include a customized frameworkcontroller template, similar
to the aforementioned tensorflow example. A valid configuration may then look like:
.. code-block:: yaml
experimentName: example_mnist_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 2
logLevel: trace
trainingServicePlatform: frameworkcontroller
searchSpacePath: search_space.json
tuner:
builtinTunerName: TPE
classArgs:
optimize_mode: maximize
assessor:
builtinAssessorName: Medianstop
classArgs:
optimize_mode: maximize
trial:
codeDir: .
frameworkcontrollerConfig:
configPath: fc_template.yml
storage: pvc
namespace: twin-pipelines
pvc:
path: /mnt/data
Note that this example uses a persistent volume claim, which must be created manually in the specified namespace beforehand. See the mnist-pytorch example (:githublink: `<examples/trials/mnist-pytorch>`__) for a more detailed config (:githublink: `<examples/trials/mnist-pytorch/config_frameworkcontroller_custom.yml>`__) and frameworkcontroller template (:githublink: `<examples/trials/fc_template.yml>`__).
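Before starting an experiment with a custom template, it can help to check the template for the conditions that the config validation added in this change looks for: at least one taskRole, a task on the first taskRole, unique taskRole names, and a name in the metadata. The sketch below is a simplified stand-alone approximation that works on an already parsed template; the function name ``check_fc_template`` and its message texts are hypothetical and not part of NNI.

```python
def check_fc_template(template):
    """Return a list of problems found in a parsed frameworkcontroller template.

    `template` is the dict obtained from loading the template YAML.
    An empty list means none of the checked problems were found.
    """
    problems = []
    spec = template.get("spec") or {}
    task_roles = spec.get("taskRoles") or []

    if not task_roles:
        problems.append("no taskRoles specified")
    elif not task_roles[0].get("task"):
        problems.append("no task specified for the first taskRole")

    names = [role.get("name") for role in task_roles]
    if any(name is None for name in names):
        problems.append("a taskRole is missing its name")
    if len(names) > len(set(names)):
        problems.append("duplicate taskRole names")

    if not (template.get("metadata") or {}).get("name"):
        problems.append("no name in metadata")
    return problems


# Minimal template shaped like the fc_template.yml example in this change.
template = {
    "metadata": {"name": "pytorchcpu", "namespace": "default"},
    "spec": {"taskRoles": [{"name": "worker", "task": {"pod": {"spec": {}}}}]},
}
print(check_fc_template(template))
```

Collecting all problems in a list, rather than raising on the first one, is a design choice of this sketch; the validator in the change raises a ``SchemaError`` at the first failed check.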
How to run example How to run example
------------------ ------------------
......
authorName: default
experimentName: example_mnist_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
logLevel: trace
#choice: local, remote, pai, kubeflow, frameworkcontroller
trainingServicePlatform: frameworkcontroller
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
assessor:
builtinAssessorName: Medianstop
classArgs:
optimize_mode: maximize
trial:
codeDir: .
frameworkcontrollerConfig:
configPath: fc_template.yml
storage: pvc
namespace: "default"
pvc:
path: "/tmp/mount"
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
name: pytorchcpu
namespace: default
spec:
executionType: Start
retryPolicy:
fancyRetryPolicy: true
maxRetryCount: 2
taskRoles:
- name: worker
taskNumber: 1
frameworkAttemptCompletionPolicy:
minFailedTaskCount: 1
minSucceededTaskCount: 3
task:
retryPolicy:
fancyRetryPolicy: false
maxRetryCount: 0
podGracefulDeletionTimeoutSec: 1800
pod:
spec:
restartPolicy: Never
hostNetwork: false
containers:
- name: mnist-pytorch
image: msranni/nni:latest
command: ["python", "mnist.py"]
ports:
- containerPort: 5001
volumeMounts:
- name: frameworkbarrier-volume
mountPath: /mnt/frameworkbarrier
- name: data-volume
mountPath: /tmp/mount
serviceAccountName: frameworkbarrier
initContainers:
- name: frameworkbarrier
image: frameworkcontroller/frameworkbarrier
volumeMounts:
- name: frameworkbarrier-volume
mountPath: /mnt/frameworkbarrier
volumes:
- name: frameworkbarrier-volume
emptyDir: {}
- name: data-volume
persistentVolumeClaim:
claimName: nni-storage
...@@ -4,11 +4,17 @@ ...@@ -4,11 +4,17 @@
import json import json
import logging import logging
import os import os
import netifaces import netifaces
from schema import Schema, And, Optional, Regex, Or, SchemaError from nni.tools.package_utils import (
from nni.tools.package_utils import create_validator_instance, get_all_builtin_names, get_registered_algo_meta create_validator_instance,
from .constants import SCHEMA_TYPE_ERROR, SCHEMA_RANGE_ERROR, SCHEMA_PATH_ERROR get_all_builtin_names,
get_registered_algo_meta,
)
from schema import And, Optional, Or, Regex, Schema, SchemaError
from .common_utils import get_yml_content, print_warning from .common_utils import get_yml_content, print_warning
from .constants import SCHEMA_PATH_ERROR, SCHEMA_RANGE_ERROR, SCHEMA_TYPE_ERROR
def setType(key, valueType): def setType(key, valueType):
...@@ -183,9 +189,9 @@ pai_yarn_trial_schema = { ...@@ -183,9 +189,9 @@ pai_yarn_trial_schema = {
Optional('virtualCluster'): setType('virtualCluster', str), Optional('virtualCluster'): setType('virtualCluster', str),
Optional('nasMode'): setChoice('nasMode', 'classic_mode', 'enas_mode', 'oneshot_mode', 'darts_mode'), Optional('nasMode'): setChoice('nasMode', 'classic_mode', 'enas_mode', 'oneshot_mode', 'darts_mode'),
Optional('portList'): [{ Optional('portList'): [{
"label": setType('label', str), 'label': setType('label', str),
"beginAt": setType('beginAt', int), 'beginAt': setType('beginAt', int),
"portNumber": setType('portNumber', int) 'portNumber': setType('portNumber', int)
}] }]
} }
} }
...@@ -376,7 +382,7 @@ kubeflow_config_schema = { ...@@ -376,7 +382,7 @@ kubeflow_config_schema = {
frameworkcontroller_trial_schema = { frameworkcontroller_trial_schema = {
'trial': { 'trial': {
'codeDir': setPathCheck('codeDir'), 'codeDir': setPathCheck('codeDir'),
'taskRoles': [{ Optional('taskRoles'): [{
'name': setType('name', str), 'name': setType('name', str),
'taskNum': setType('taskNum', int), 'taskNum': setType('taskNum', int),
'frameworkAttemptCompletionPolicy': { 'frameworkAttemptCompletionPolicy': {
...@@ -395,14 +401,22 @@ frameworkcontroller_trial_schema = { ...@@ -395,14 +401,22 @@ frameworkcontroller_trial_schema = {
frameworkcontroller_config_schema = { frameworkcontroller_config_schema = {
'frameworkcontrollerConfig': Or({ 'frameworkcontrollerConfig': Or({
Optional('storage'): setChoice('storage', 'nfs', 'azureStorage'), Optional('storage'): setChoice('storage', 'nfs', 'azureStorage', 'pvc'),
Optional('serviceAccountName'): setType('serviceAccountName', str), Optional('serviceAccountName'): setType('serviceAccountName', str),
'nfs': { 'nfs': {
'server': setType('server', str), 'server': setType('server', str),
'path': setType('path', str) 'path': setType('path', str)
} },
Optional('namespace'): setType('namespace', str),
Optional('configPath'): setType('configPath', str),
}, { }, {
Optional('storage'): setChoice('storage', 'nfs', 'azureStorage'), Optional('storage'): setChoice('storage', 'nfs', 'azureStorage', 'pvc'),
Optional('serviceAccountName'): setType('serviceAccountName', str),
'configPath': setType('configPath', str),
'pvc': {'path': setType('path', str)},
Optional('namespace'): setType('namespace', str),
}, {
Optional('storage'): setChoice('storage', 'nfs', 'azureStorage', 'pvc'),
Optional('serviceAccountName'): setType('serviceAccountName', str), Optional('serviceAccountName'): setType('serviceAccountName', str),
'keyVault': { 'keyVault': {
'vaultName': And(Regex('([0-9]|[a-z]|[A-Z]|-){1,127}'), 'vaultName': And(Regex('([0-9]|[a-z]|[A-Z]|-){1,127}'),
...@@ -416,7 +430,9 @@ frameworkcontroller_config_schema = { ...@@ -416,7 +430,9 @@ frameworkcontroller_config_schema = {
'azureShare': And(Regex('([0-9]|[a-z]|[A-Z]|-){3,63}'), 'azureShare': And(Regex('([0-9]|[a-z]|[A-Z]|-){3,63}'),
error='ERROR: azureShare format error, azureShare support using (0-9|a-z|A-Z|-)') error='ERROR: azureShare format error, azureShare support using (0-9|a-z|A-Z|-)')
}, },
Optional('uploadRetryCount'): setNumberRange('uploadRetryCount', int, 1, 99999) Optional('uploadRetryCount'): setNumberRange('uploadRetryCount', int, 1, 99999),
Optional('namespace'): setType('namespace', str),
Optional('configPath'): setType('configPath', str),
}) })
} }
...@@ -479,6 +495,7 @@ class NNIConfigSchema: ...@@ -479,6 +495,7 @@ class NNIConfigSchema:
self.validate_kubeflow_operators(experiment_config) self.validate_kubeflow_operators(experiment_config)
self.validate_eth0_device(experiment_config) self.validate_eth0_device(experiment_config)
self.validate_hybrid_platforms(experiment_config) self.validate_hybrid_platforms(experiment_config)
self.validate_frameworkcontroller_trial_config(experiment_config)
def validate_tuner_adivosr_assessor(self, experiment_config): def validate_tuner_adivosr_assessor(self, experiment_config):
if experiment_config.get('advisor'): if experiment_config.get('advisor'):
...@@ -588,7 +605,7 @@ class NNIConfigSchema: ...@@ -588,7 +605,7 @@ class NNIConfigSchema:
and not experiment_config.get('nniManagerIp') \ and not experiment_config.get('nniManagerIp') \
and 'eth0' not in netifaces.interfaces(): and 'eth0' not in netifaces.interfaces():
raise SchemaError('This machine does not contain eth0 network device, please set nniManagerIp in config file!') raise SchemaError('This machine does not contain eth0 network device, please set nniManagerIp in config file!')
def validate_hybrid_platforms(self, experiment_config): def validate_hybrid_platforms(self, experiment_config):
required_config_name_map = { required_config_name_map = {
'remote': 'machineList', 'remote': 'machineList',
...@@ -600,4 +617,25 @@ class NNIConfigSchema: ...@@ -600,4 +617,25 @@ class NNIConfigSchema:
config_name = required_config_name_map.get(platform) config_name = required_config_name_map.get(platform)
if config_name and not experiment_config.get(config_name): if config_name and not experiment_config.get(config_name):
raise SchemaError('Need to set {0} for {1} in hybrid mode!'.format(config_name, platform)) raise SchemaError('Need to set {0} for {1} in hybrid mode!'.format(config_name, platform))
\ No newline at end of file def validate_frameworkcontroller_trial_config(self, experiment_config):
if experiment_config.get('trainingServicePlatform') == 'frameworkcontroller':
if not experiment_config.get('trial').get('taskRoles'):
if not experiment_config.get('frameworkcontrollerConfig').get('configPath'):
raise SchemaError("""If no taskRoles are specified a valid custom frameworkcontroller config should
be set using the configPath attribute in frameworkcontrollerConfig!""")
config_content = get_yml_content(experiment_config.get('frameworkcontrollerConfig').get('configPath'))
if not config_content.get('spec').get('taskRoles') or not len(config_content.get('spec').get('taskRoles')):
raise SchemaError('Invalid frameworkcontroller config! No taskRoles were specified!')
if not config_content.get('spec').get('taskRoles')[0].get('task'):
raise SchemaError('Invalid frameworkcontroller config! No task was specified for taskRole!')
names = []
for taskRole in config_content.get('spec').get('taskRoles'):
if "name" not in taskRole:
raise SchemaError('Invalid frameworkcontroller config! Name is missing for taskRole!')
names.append(taskRole.get("name"))
if len(names) > len(set(names)):
raise SchemaError('Invalid frameworkcontroller config! Duplicate taskrole names!')
if not config_content.get('metadata').get('name'):
raise SchemaError('Invalid frameworkcontroller config! No experiment name was specified!')
...@@ -100,6 +100,11 @@ def parse_path(experiment_config, config_path): ...@@ -100,6 +100,11 @@ def parse_path(experiment_config, config_path):
if experiment_config['trial'].get('paiConfigPath'): if experiment_config['trial'].get('paiConfigPath'):
parse_relative_path(root_path, experiment_config['trial'], 'paiConfigPath') parse_relative_path(root_path, experiment_config['trial'], 'paiConfigPath')
# For frameworkcontroller a custom configuration path may be specified
if experiment_config.get('frameworkcontrollerConfig'):
if experiment_config['frameworkcontrollerConfig'].get('configPath'):
parse_relative_path(root_path, experiment_config['frameworkcontrollerConfig'], 'configPath')
def set_default_values(experiment_config): def set_default_values(experiment_config):
if experiment_config.get('maxExecDuration') is None: if experiment_config.get('maxExecDuration') is None:
experiment_config['maxExecDuration'] = '999d' experiment_config['maxExecDuration'] = '999d'
......
...@@ -63,14 +63,14 @@ export namespace ValidationSchemas { ...@@ -63,14 +63,14 @@ export namespace ValidationSchemas {
command: joi.string().min(1).required() command: joi.string().min(1).required()
}), }),
ps: joi.object({ ps: joi.object({
replicas: joi.number().min(1).required(), replicas: joi.number().min(1).required(),
image: joi.string().min(1), image: joi.string().min(1),
privateRegistryAuthPath: joi.string().min(1), privateRegistryAuthPath: joi.string().min(1),
outputDir: joi.string(), outputDir: joi.string(),
cpuNum: joi.number().min(1), cpuNum: joi.number().min(1),
memoryMB: joi.number().min(100), memoryMB: joi.number().min(100),
gpuNum: joi.number().min(0).required(), gpuNum: joi.number().min(0).required(),
command: joi.string().min(1).required() command: joi.string().min(1).required()
}), }),
master: joi.object({ master: joi.object({
replicas: joi.number().min(1).required(), replicas: joi.number().min(1).required(),
...@@ -152,6 +152,10 @@ export namespace ValidationSchemas { ...@@ -152,6 +152,10 @@ export namespace ValidationSchemas {
frameworkcontroller_config: joi.object({ // eslint-disable-line @typescript-eslint/camelcase frameworkcontroller_config: joi.object({ // eslint-disable-line @typescript-eslint/camelcase
storage: joi.string().min(1), storage: joi.string().min(1),
serviceAccountName: joi.string().min(1), serviceAccountName: joi.string().min(1),
pvc: joi.object({
path: joi.string().min(1).required()
}),
configPath: joi.string().min(1),
nfs: joi.object({ nfs: joi.object({
server: joi.string().min(1).required(), server: joi.string().min(1).required(),
path: joi.string().min(1).required() path: joi.string().min(1).required()
...@@ -164,14 +168,15 @@ export namespace ValidationSchemas { ...@@ -164,14 +168,15 @@ export namespace ValidationSchemas {
accountName: joi.string().regex(/^([0-9]|[a-z]|[A-Z]|-){3,31}$/), accountName: joi.string().regex(/^([0-9]|[a-z]|[A-Z]|-){3,31}$/),
azureShare: joi.string().regex(/^([0-9]|[a-z]|[A-Z]|-){3,63}$/) azureShare: joi.string().regex(/^([0-9]|[a-z]|[A-Z]|-){3,63}$/)
}), }),
uploadRetryCount: joi.number().min(1) uploadRetryCount: joi.number().min(1),
namespace: joi.string().min(1)
}), }),
dlts_config: joi.object({ // eslint-disable-line @typescript-eslint/camelcase dlts_config: joi.object({ // eslint-disable-line @typescript-eslint/camelcase
dashboard: joi.string().min(1), dashboard: joi.string().min(1),
cluster: joi.string().min(1), cluster: joi.string().min(1),
team: joi.string().min(1), team: joi.string().min(1),
email: joi.string().min(1), email: joi.string().min(1),
password: joi.string().min(1) password: joi.string().min(1)
}), }),
......
...@@ -4,7 +4,7 @@ ...@@ -4,7 +4,7 @@
'use strict'; 'use strict';
import * as fs from 'fs'; import * as fs from 'fs';
import { GeneralK8sClient, KubernetesCRDClient } from '../kubernetesApiClient'; import {GeneralK8sClient, KubernetesCRDClient} from '../kubernetesApiClient';
/** /**
* FrameworkController ClientV1 * FrameworkController ClientV1
...@@ -13,14 +13,16 @@ class FrameworkControllerClientV1 extends KubernetesCRDClient { ...@@ -13,14 +13,16 @@ class FrameworkControllerClientV1 extends KubernetesCRDClient {
/** /**
* constructor, to initialize frameworkcontroller CRD definition * constructor, to initialize frameworkcontroller CRD definition
*/ */
public constructor() { public namespace: string;
public constructor(namespace?: string) {
super(); super();
this.namespace = namespace ? namespace : "default"
this.crdSchema = JSON.parse(fs.readFileSync('./config/frameworkcontroller/frameworkcontrollerjob-crd-v1.json', 'utf8')); this.crdSchema = JSON.parse(fs.readFileSync('./config/frameworkcontroller/frameworkcontrollerjob-crd-v1.json', 'utf8'));
this.client.addCustomResourceDefinition(this.crdSchema); this.client.addCustomResourceDefinition(this.crdSchema);
} }
protected get operator(): any { protected get operator(): any {
return this.client.apis['frameworkcontroller.microsoft.com'].v1.namespaces('default').frameworks; return this.client.apis['frameworkcontroller.microsoft.com'].v1.namespaces(this.namespace).frameworks;
} }
public get containerName(): string { public get containerName(): string {
...@@ -35,9 +37,9 @@ class FrameworkControllerClientFactory { ...@@ -35,9 +37,9 @@ class FrameworkControllerClientFactory {
/** /**
* Factory method to generate operator client * Factory method to generate operator client
*/ */
public static createClient(): KubernetesCRDClient { public static createClient(namespace?: string): KubernetesCRDClient {
return new FrameworkControllerClientV1(); return new FrameworkControllerClientV1(namespace);
} }
} }
export { FrameworkControllerClientFactory, GeneralK8sClient }; export {FrameworkControllerClientFactory, GeneralK8sClient};
...@@ -5,8 +5,10 @@ ...@@ -5,8 +5,10 @@
import * as assert from 'assert'; import * as assert from 'assert';
import { AzureStorage, KeyVaultConfig, KubernetesClusterConfig, KubernetesClusterConfigAzure, KubernetesClusterConfigNFS, import {
KubernetesStorageKind, KubernetesTrialConfig, KubernetesTrialConfigTemplate, NFSConfig, StorageConfig AzureStorage, KeyVaultConfig, KubernetesClusterConfig, KubernetesClusterConfigAzure, KubernetesClusterConfigNFS,
KubernetesStorageKind, KubernetesTrialConfig, KubernetesTrialConfigTemplate, NFSConfig, StorageConfig, KubernetesClusterConfigPVC,
PVCConfig,
} from '../kubernetesConfig'; } from '../kubernetesConfig';
export class FrameworkAttemptCompletionPolicy { export class FrameworkAttemptCompletionPolicy {
...@@ -26,8 +28,8 @@ export class FrameworkControllerTrialConfigTemplate extends KubernetesTrialConfi ...@@ -26,8 +28,8 @@ export class FrameworkControllerTrialConfigTemplate extends KubernetesTrialConfi
public readonly name: string; public readonly name: string;
public readonly taskNum: number; public readonly taskNum: number;
constructor(taskNum: number, command: string, gpuNum: number, constructor(taskNum: number, command: string, gpuNum: number,
cpuNum: number, memoryMB: number, image: string, cpuNum: number, memoryMB: number, image: string,
frameworkAttemptCompletionPolicy: FrameworkAttemptCompletionPolicy, privateRegistryFilePath?: string | undefined) { frameworkAttemptCompletionPolicy: FrameworkAttemptCompletionPolicy, privateRegistryFilePath?: string | undefined) {
super(command, gpuNum, cpuNum, memoryMB, image, privateRegistryFilePath); super(command, gpuNum, cpuNum, memoryMB, image, privateRegistryFilePath);
this.frameworkAttemptCompletionPolicy = frameworkAttemptCompletionPolicy; this.frameworkAttemptCompletionPolicy = frameworkAttemptCompletionPolicy;
this.name = name; this.name = name;
...@@ -47,60 +49,97 @@ export class FrameworkControllerTrialConfig extends KubernetesTrialConfig { ...@@ -47,60 +49,97 @@ export class FrameworkControllerTrialConfig extends KubernetesTrialConfig {
export class FrameworkControllerClusterConfig extends KubernetesClusterConfig { export class FrameworkControllerClusterConfig extends KubernetesClusterConfig {
public readonly serviceAccountName: string; public readonly serviceAccountName: string;
constructor(apiVersion: string, serviceAccountName: string) { constructor(apiVersion: string, serviceAccountName: string, configPath?: string, namespace?: string) {
super(apiVersion); super(apiVersion, undefined, namespace);
this.serviceAccountName = serviceAccountName; this.serviceAccountName = serviceAccountName;
} }
} }
export class FrameworkControllerClusterConfigPVC extends KubernetesClusterConfigPVC {
public readonly serviceAccountName: string;
public readonly configPath: string;
constructor(serviceAccountName: string, apiVersion: string, pvc: PVCConfig, configPath: string,
storage?: KubernetesStorageKind, namespace?: string) {
super(apiVersion, pvc, storage, namespace);
this.serviceAccountName = serviceAccountName;
this.configPath = configPath
}
public static getInstance(jsonObject: object): FrameworkControllerClusterConfigPVC {
const kubernetesClusterConfigObjectPVC: FrameworkControllerClusterConfigPVC = <FrameworkControllerClusterConfigPVC>jsonObject;
assert(kubernetesClusterConfigObjectPVC !== undefined);
return new FrameworkControllerClusterConfigPVC(
kubernetesClusterConfigObjectPVC.serviceAccountName,
kubernetesClusterConfigObjectPVC.apiVersion,
kubernetesClusterConfigObjectPVC.pvc,
kubernetesClusterConfigObjectPVC.configPath,
kubernetesClusterConfigObjectPVC.storage,
kubernetesClusterConfigObjectPVC.namespace
);
}
}
export class FrameworkControllerClusterConfigNFS extends KubernetesClusterConfigNFS { export class FrameworkControllerClusterConfigNFS extends KubernetesClusterConfigNFS {
public readonly serviceAccountName: string; public readonly serviceAccountName: string;
public readonly configPath?: string;
constructor( constructor(
serviceAccountName: string, serviceAccountName: string,
apiVersion: string, apiVersion: string,
nfs: NFSConfig, nfs: NFSConfig,
storage?: KubernetesStorageKind storage?: KubernetesStorageKind,
) { namespace?: string,
super(apiVersion, nfs, storage); configPath?: string
) {
super(apiVersion, nfs, storage, namespace);
this.serviceAccountName = serviceAccountName; this.serviceAccountName = serviceAccountName;
this.configPath = configPath
} }
public static getInstance(jsonObject: object): FrameworkControllerClusterConfigNFS { public static getInstance(jsonObject: object): FrameworkControllerClusterConfigNFS {
const kubeflowClusterConfigObjectNFS: FrameworkControllerClusterConfigNFS = <FrameworkControllerClusterConfigNFS>jsonObject; const kubernetesClusterConfigObjectNFS: FrameworkControllerClusterConfigNFS = <FrameworkControllerClusterConfigNFS>jsonObject;
assert (kubeflowClusterConfigObjectNFS !== undefined); assert(kubernetesClusterConfigObjectNFS !== undefined);
return new FrameworkControllerClusterConfigNFS( return new FrameworkControllerClusterConfigNFS(
kubeflowClusterConfigObjectNFS.serviceAccountName, kubernetesClusterConfigObjectNFS.serviceAccountName,
kubeflowClusterConfigObjectNFS.apiVersion, kubernetesClusterConfigObjectNFS.apiVersion,
kubeflowClusterConfigObjectNFS.nfs, kubernetesClusterConfigObjectNFS.nfs,
kubeflowClusterConfigObjectNFS.storage kubernetesClusterConfigObjectNFS.storage,
kubernetesClusterConfigObjectNFS.namespace
); );
} }
} }
export class FrameworkControllerClusterConfigAzure extends KubernetesClusterConfigAzure { export class FrameworkControllerClusterConfigAzure extends KubernetesClusterConfigAzure {
public readonly serviceAccountName: string; public readonly serviceAccountName: string;
public readonly configPath?: string;
constructor( constructor(
serviceAccountName: string, serviceAccountName: string,
apiVersion: string, apiVersion: string,
keyVault: KeyVaultConfig, keyVault: KeyVaultConfig,
azureStorage: AzureStorage, azureStorage: AzureStorage,
storage?: KubernetesStorageKind storage?: KubernetesStorageKind,
) { uploadRetryCount?: number,
super(apiVersion, keyVault, azureStorage, storage); namespace?: string,
configPath?: string
) {
super(apiVersion, keyVault, azureStorage, storage, uploadRetryCount, namespace);
this.serviceAccountName = serviceAccountName; this.serviceAccountName = serviceAccountName;
this.configPath = configPath
} }
public static getInstance(jsonObject: object): FrameworkControllerClusterConfigAzure { public static getInstance(jsonObject: object): FrameworkControllerClusterConfigAzure {
const kubeflowClusterConfigObjectAzure: FrameworkControllerClusterConfigAzure = <FrameworkControllerClusterConfigAzure>jsonObject; const kubernetesClusterConfigObjectAzure: FrameworkControllerClusterConfigAzure = <FrameworkControllerClusterConfigAzure>jsonObject;
return new FrameworkControllerClusterConfigAzure( return new FrameworkControllerClusterConfigAzure(
kubeflowClusterConfigObjectAzure.serviceAccountName, kubernetesClusterConfigObjectAzure.serviceAccountName,
kubeflowClusterConfigObjectAzure.apiVersion, kubernetesClusterConfigObjectAzure.apiVersion,
kubeflowClusterConfigObjectAzure.keyVault, kubernetesClusterConfigObjectAzure.keyVault,
kubeflowClusterConfigObjectAzure.azureStorage, kubernetesClusterConfigObjectAzure.azureStorage,
kubeflowClusterConfigObjectAzure.storage kubernetesClusterConfigObjectAzure.storage,
kubernetesClusterConfigObjectAzure.uploadRetryCount,
kubernetesClusterConfigObjectAzure.namespace
); );
} }
} }
...@@ -108,20 +147,22 @@ export class FrameworkControllerClusterConfigAzure extends KubernetesClusterConf ...@@ -108,20 +147,22 @@ export class FrameworkControllerClusterConfigAzure extends KubernetesClusterConf
export class FrameworkControllerClusterConfigFactory { export class FrameworkControllerClusterConfigFactory {
public static generateFrameworkControllerClusterConfig(jsonObject: object): FrameworkControllerClusterConfig { public static generateFrameworkControllerClusterConfig(jsonObject: object): FrameworkControllerClusterConfig {
const storageConfig: StorageConfig = <StorageConfig>jsonObject; const storageConfig: StorageConfig = <StorageConfig>jsonObject;
if (storageConfig === undefined) { if (storageConfig === undefined) {
throw new Error('Invalid json object as a StorageConfig instance'); throw new Error('Invalid json object as a StorageConfig instance');
} }
if (storageConfig.storage !== undefined && storageConfig.storage === 'azureStorage') { if (storageConfig.storage !== undefined && storageConfig.storage === 'azureStorage') {
return FrameworkControllerClusterConfigAzure.getInstance(jsonObject); return FrameworkControllerClusterConfigAzure.getInstance(jsonObject);
} else if (storageConfig.storage === undefined || storageConfig.storage === 'nfs') { } else if (storageConfig.storage === undefined || storageConfig.storage === 'nfs') {
return FrameworkControllerClusterConfigNFS.getInstance(jsonObject); return FrameworkControllerClusterConfigNFS.getInstance(jsonObject);
} } else if (storageConfig.storage !== undefined && storageConfig.storage === 'pvc') {
throw new Error(`Invalid json object ${jsonObject}`); return FrameworkControllerClusterConfigPVC.getInstance(jsonObject);
}
throw new Error(`Invalid json object ${jsonObject}`);
} }
} }
export type FrameworkControllerJobStatus = export type FrameworkControllerJobStatus =
'AttemptRunning' | 'Completed' | 'AttemptCreationPending' | 'AttemptCreationRequested' | 'AttemptPreparing' | 'AttemptCompleted'; 'AttemptRunning' | 'Completed' | 'AttemptCreationPending' | 'AttemptCreationRequested' | 'AttemptPreparing' | 'AttemptCompleted';
export type FrameworkControllerJobCompleteStatus = 'Succeeded' | 'Failed'; export type FrameworkControllerJobCompleteStatus = 'Succeeded' | 'Failed';
...@@ -8,22 +8,31 @@ import * as cpp from 'child-process-promise'; ...@@ -8,22 +8,31 @@ import * as cpp from 'child-process-promise';
import * as fs from 'fs'; import * as fs from 'fs';
import * as path from 'path'; import * as path from 'path';
import * as component from '../../../common/component'; import * as component from '../../../common/component';
import { getExperimentId } from '../../../common/experimentStartupInfo'; import {getExperimentId} from '../../../common/experimentStartupInfo';
import { import {
NNIManagerIpConfig, TrialJobApplicationForm, TrialJobDetail, TrialJobStatus NNIManagerIpConfig, TrialJobApplicationForm, TrialJobDetail, TrialJobStatus
} from '../../../common/trainingService'; } from '../../../common/trainingService';
import { delay, generateParamFileName, getExperimentRootDir, uniqueString } from '../../../common/utils'; import {delay, generateParamFileName, getExperimentRootDir, uniqueString} from '../../../common/utils';
import { CONTAINER_INSTALL_NNI_SHELL_FORMAT } from '../../common/containerJobData'; import {CONTAINER_INSTALL_NNI_SHELL_FORMAT} from '../../common/containerJobData';
import { TrialConfigMetadataKey } from '../../common/trialConfigMetadataKey'; import {TrialConfigMetadataKey} from '../../common/trialConfigMetadataKey';
import { validateCodeDir } from '../../common/util'; import {validateCodeDir} from '../../common/util';
import { NFSConfig } from '../kubernetesConfig'; import {NFSConfig} from '../kubernetesConfig';
import { KubernetesTrialJobDetail } from '../kubernetesData'; import {KubernetesTrialJobDetail} from '../kubernetesData';
import { KubernetesTrainingService } from '../kubernetesTrainingService'; import {KubernetesTrainingService} from '../kubernetesTrainingService';
import { FrameworkControllerClientFactory } from './frameworkcontrollerApiClient'; import {FrameworkControllerClientFactory} from './frameworkcontrollerApiClient';
import { FrameworkControllerClusterConfig, FrameworkControllerClusterConfigAzure, FrameworkControllerClusterConfigFactory, import {
FrameworkControllerClusterConfigNFS, FrameworkControllerTrialConfig} from './frameworkcontrollerConfig'; FrameworkControllerClusterConfig,
import { FrameworkControllerJobInfoCollector } from './frameworkcontrollerJobInfoCollector'; FrameworkControllerClusterConfigAzure,
import { FrameworkControllerJobRestServer } from './frameworkcontrollerJobRestServer'; FrameworkControllerClusterConfigFactory,
FrameworkControllerClusterConfigNFS,
FrameworkControllerTrialConfig,
FrameworkControllerTrialConfigTemplate,
FrameworkControllerClusterConfigPVC,
} from './frameworkcontrollerConfig';
import {FrameworkControllerJobInfoCollector} from './frameworkcontrollerJobInfoCollector';
import {FrameworkControllerJobRestServer} from './frameworkcontrollerJobRestServer';
const yaml = require('js-yaml');
/** /**
* Training Service implementation for frameworkcontroller * Training Service implementation for frameworkcontroller
...@@ -31,6 +40,7 @@ import { FrameworkControllerJobRestServer } from './frameworkcontrollerJobRestSe ...@@ -31,6 +40,7 @@ import { FrameworkControllerJobRestServer } from './frameworkcontrollerJobRestSe
@component.Singleton @component.Singleton
class FrameworkControllerTrainingService extends KubernetesTrainingService implements KubernetesTrainingService { class FrameworkControllerTrainingService extends KubernetesTrainingService implements KubernetesTrainingService {
private fcTrialConfig?: FrameworkControllerTrialConfig; // frameworkcontroller trial configuration private fcTrialConfig?: FrameworkControllerTrialConfig; // frameworkcontroller trial configuration
private fcTemplate: any = undefined; // custom frameworkcontroller template
private readonly fcJobInfoCollector: FrameworkControllerJobInfoCollector; // frameworkcontroller job info collector private readonly fcJobInfoCollector: FrameworkControllerJobInfoCollector; // frameworkcontroller job info collector
private readonly fcContainerPortMap: Map<string, number> = new Map<string, number>(); // store frameworkcontroller container port private readonly fcContainerPortMap: Map<string, number> = new Map<string, number>(); // store frameworkcontroller container port
private fcClusterConfig?: FrameworkControllerClusterConfig; private fcClusterConfig?: FrameworkControllerClusterConfig;
...@@ -59,8 +69,41 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple ...@@ -59,8 +69,41 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
} }
} }
} }
private parseCustomTaskRoles(customTaskRoles: any[]): FrameworkControllerTrialConfigTemplate[] {
const taskRoles: FrameworkControllerTrialConfigTemplate[] = []
customTaskRoles.forEach((x) => {
if (x.task === undefined ||
x.task.pod === undefined ||
x.task.pod.spec === undefined ||
x.task.pod.spec.containers === undefined) {
throw new Error('invalid custom frameworkcontroller configuration')
}
if (x.task.pod.spec.containers.length > 1) {
throw new Error('custom config may only define one non-init container for tasks')
}
const defaultAttempt = {
minFailedTaskCount: 1,
minSucceededTaskCount: -1
}
const trialConfig = <FrameworkControllerTrialConfigTemplate>{
name: x.name,
taskNum: x.taskNumber ? x.taskNumber : 1,
command: x.task.pod.spec.containers[0].command.join(" "),
gpuNum: x.task.gpuNum ? x.task.gpuNum : 0,
cpuNum: x.task.cpuNum ? x.task.cpuNum : 1,
memoryMB: x.task.memoryMB ? x.task.memoryMB : 8192,
image: x.task.pod.spec.containers[0].image,
frameworkAttemptCompletionPolicy: x.task.frameworkAttemptCompletionPolicy ?
x.task.frameworkAttemptCompletionPolicy :
defaultAttempt
}
taskRoles.push(trialConfig)
})
return taskRoles
}
public async submitTrialJob(form: TrialJobApplicationForm): Promise<TrialJobDetail> { public async submitTrialJob(form: TrialJobApplicationForm): Promise<TrialJobDetail> {
let configTaskRoles: any = undefined;
if (this.fcClusterConfig === undefined) { if (this.fcClusterConfig === undefined) {
throw new Error('frameworkcontrollerClusterConfig is not initialized'); throw new Error('frameworkcontrollerClusterConfig is not initialized');
} }
...@@ -68,6 +111,19 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple ...@@ -68,6 +111,19 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
throw new Error('kubernetesCRDClient is undefined'); throw new Error('kubernetesCRDClient is undefined');
} }
if (this.fcTemplate === undefined) {
if (this.fcTrialConfig === undefined) {
throw new Error(
'neither trialConfig nor fcTemplate is initialized'
);
}
configTaskRoles = this.fcTrialConfig.taskRoles;
} else {
configTaskRoles = this.parseCustomTaskRoles(this.fcTemplate.spec.taskRoles)
}
const namespace = this.fcClusterConfig.namespace ? this.fcClusterConfig.namespace : "default";
this.genericK8sClient.setNamespace = namespace;
if (this.kubernetesRestServerPort === undefined) { if (this.kubernetesRestServerPort === undefined) {
const restServer: FrameworkControllerJobRestServer = component.get(FrameworkControllerJobRestServer); const restServer: FrameworkControllerJobRestServer = component.get(FrameworkControllerJobRestServer);
this.kubernetesRestServerPort = restServer.clusterRestServerPort; this.kubernetesRestServerPort = restServer.clusterRestServerPort;
...@@ -82,10 +138,24 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple ...@@ -82,10 +138,24 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
// Set trial's NFS working folder // Set trial's NFS working folder
const trialWorkingFolder: string = path.join(this.CONTAINER_MOUNT_PATH, 'nni', getExperimentId(), trialJobId); const trialWorkingFolder: string = path.join(this.CONTAINER_MOUNT_PATH, 'nni', getExperimentId(), trialJobId);
const trialLocalTempFolder: string = path.join(getExperimentRootDir(), 'trials-local', trialJobId); const trialLocalTempFolder: string = path.join(getExperimentRootDir(), 'trials-local', trialJobId);
const frameworkcontrollerJobName: string = `nniexp${this.experimentId}trial${trialJobId}`.toLowerCase(); let frameworkcontrollerJobName: string = `nniexp${this.experimentId}trial${trialJobId}`.toLowerCase();
// Create frameworkcontroller job based on generated frameworkcontroller job resource config
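// JSON.parse(JSON.stringify(undefined)) throws a SyntaxError, so only clone when a custom template was loaded
let frameworkcontrollerJobConfig: any = this.fcTemplate !== undefined ? JSON.parse(JSON.stringify(this.fcTemplate)) : undefined;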
if (this.fcTemplate !== undefined) {
// add a custom name extension to the job name and apply it to the custom template
frameworkcontrollerJobName += "xx" + this.fcTemplate.metadata.name;
// Process custom task roles commands
configTaskRoles.forEach((x: any, i: number) => {
const scriptName = path.join(trialWorkingFolder, "run_" + x.name + ".sh")
frameworkcontrollerJobConfig.spec.taskRoles[i].task.pod.spec.containers[0].command = ["sh", scriptName]
})
}
//Generate the port used for taskRole //Generate the port used for taskRole
this.generateContainerPort(); this.generateContainerPort(configTaskRoles);
await this.prepareRunScript(trialLocalTempFolder, trialJobId, trialWorkingFolder, form); await this.prepareRunScript(trialLocalTempFolder, trialJobId, trialWorkingFolder, form, configTaskRoles);
//wait upload of script files to finish //wait upload of script files to finish
const trialJobOutputUrl: string = await this.uploadFolder(trialLocalTempFolder, `nni/${getExperimentId()}/${trialJobId}`); const trialJobOutputUrl: string = await this.uploadFolder(trialLocalTempFolder, `nni/${getExperimentId()}/${trialJobId}`);
...@@ -106,9 +176,18 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple ...@@ -106,9 +176,18 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
// Set trial job detail until create frameworkcontroller job successfully // Set trial job detail until create frameworkcontroller job successfully
this.trialJobsMap.set(trialJobId, trialJobDetail); this.trialJobsMap.set(trialJobId, trialJobDetail);
// Create frameworkcontroller job based on generated frameworkcontroller job resource config if (this.fcTemplate !== undefined) {
const frameworkcontrollerJobConfig: any = await this.prepareFrameworkControllerConfig( frameworkcontrollerJobConfig = {
trialJobId, trialWorkingFolder, frameworkcontrollerJobName); ...frameworkcontrollerJobConfig,
metadata: {...this.fcTemplate.metadata, name: frameworkcontrollerJobName}
};
} else {
frameworkcontrollerJobConfig = await this.prepareFrameworkControllerConfig(
trialJobId,
trialWorkingFolder,
frameworkcontrollerJobName
);
}
await this.kubernetesCRDClient.createKubernetesJob(frameworkcontrollerJobConfig); await this.kubernetesCRDClient.createKubernetesJob(frameworkcontrollerJobConfig);
// Set trial job detail until create frameworkcontroller job successfully // Set trial job detail until create frameworkcontroller job successfully
...@@ -124,26 +203,60 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple ...@@ -124,26 +203,60 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
break; break;
case TrialConfigMetadataKey.FRAMEWORKCONTROLLER_CLUSTER_CONFIG: { case TrialConfigMetadataKey.FRAMEWORKCONTROLLER_CLUSTER_CONFIG: {
const frameworkcontrollerClusterJsonObject: any = JSON.parse(value); const frameworkcontrollerClusterJsonObject: any = JSON.parse(value);
let namespace: string | undefined;
this.fcClusterConfig = FrameworkControllerClusterConfigFactory this.fcClusterConfig = FrameworkControllerClusterConfigFactory
.generateFrameworkControllerClusterConfig(frameworkcontrollerClusterJsonObject); .generateFrameworkControllerClusterConfig(frameworkcontrollerClusterJsonObject);
if (this.fcClusterConfig.storageType === 'azureStorage') { if (this.fcClusterConfig.storageType === 'azureStorage') {
const azureFrameworkControllerClusterConfig: FrameworkControllerClusterConfigAzure = const azureFrameworkControllerClusterConfig: FrameworkControllerClusterConfigAzure =
<FrameworkControllerClusterConfigAzure>this.fcClusterConfig; <FrameworkControllerClusterConfigAzure>this.fcClusterConfig;
this.azureStorageAccountName = azureFrameworkControllerClusterConfig.azureStorage.accountName; this.azureStorageAccountName = azureFrameworkControllerClusterConfig.azureStorage.accountName;
this.azureStorageShare = azureFrameworkControllerClusterConfig.azureStorage.azureShare; this.azureStorageShare = azureFrameworkControllerClusterConfig.azureStorage.azureShare;
if (azureFrameworkControllerClusterConfig.configPath !== undefined) {
this.fcTemplate = yaml.safeLoad(
fs.readFileSync(
azureFrameworkControllerClusterConfig.configPath,
'utf8'
)
);
}
await this.createAzureStorage( await this.createAzureStorage(
azureFrameworkControllerClusterConfig.keyVault.vaultName, azureFrameworkControllerClusterConfig.keyVault.vaultName,
azureFrameworkControllerClusterConfig.keyVault.name azureFrameworkControllerClusterConfig.keyVault.name
); );
namespace = azureFrameworkControllerClusterConfig.namespace;
} else if (this.fcClusterConfig.storageType === 'nfs') { } else if (this.fcClusterConfig.storageType === 'nfs') {
const nfsFrameworkControllerClusterConfig: FrameworkControllerClusterConfigNFS = const nfsFrameworkControllerClusterConfig: FrameworkControllerClusterConfigNFS =
<FrameworkControllerClusterConfigNFS>this.fcClusterConfig; <FrameworkControllerClusterConfigNFS>this.fcClusterConfig;
if (nfsFrameworkControllerClusterConfig.configPath !== undefined) {
this.fcTemplate = yaml.safeLoad(
fs.readFileSync(
nfsFrameworkControllerClusterConfig.configPath,
'utf8'
)
);
}
await this.createNFSStorage( await this.createNFSStorage(
nfsFrameworkControllerClusterConfig.nfs.server, nfsFrameworkControllerClusterConfig.nfs.server,
nfsFrameworkControllerClusterConfig.nfs.path nfsFrameworkControllerClusterConfig.nfs.path
); );
namespace = nfsFrameworkControllerClusterConfig.namespace
} else if (this.fcClusterConfig.storageType === 'pvc') {
const pvcFrameworkControllerClusterConfig: FrameworkControllerClusterConfigPVC =
<FrameworkControllerClusterConfigPVC>this.fcClusterConfig;
this.fcTemplate = yaml.safeLoad(
fs.readFileSync(
pvcFrameworkControllerClusterConfig.configPath,
'utf8'
)
);
await this.createPVCStorage(
pvcFrameworkControllerClusterConfig.pvc.path
);
namespace = pvcFrameworkControllerClusterConfig.namespace;
} }
this.kubernetesCRDClient = FrameworkControllerClientFactory.createClient(); namespace = namespace ? namespace : "default";
this.kubernetesCRDClient = FrameworkControllerClientFactory.createClient(namespace);
break; break;
} }
case TrialConfigMetadataKey.TRIAL_CONFIG: { case TrialConfigMetadataKey.TRIAL_CONFIG: {
...@@ -186,9 +299,11 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple ...@@ -186,9 +299,11 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
throw new Error('frameworkcontroller Cluster config is not initialized'); throw new Error('frameworkcontroller Cluster config is not initialized');
} }
assert(this.fcClusterConfig.storage === undefined assert(this.fcClusterConfig.storage === undefined ||
|| this.fcClusterConfig.storage === 'azureStorage' this.fcClusterConfig.storage === 'azureStorage' ||
|| this.fcClusterConfig.storage === 'nfs'); this.fcClusterConfig.storage === 'nfs' ||
this.fcClusterConfig.storage === 'pvc'
);
if (this.fcClusterConfig.storage === 'azureStorage') { if (this.fcClusterConfig.storage === 'azureStorage') {
if (this.azureStorageClient === undefined) { if (this.azureStorageClient === undefined) {
...@@ -197,11 +312,15 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple ...@@ -197,11 +312,15 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
const fcClusterConfigAzure: FrameworkControllerClusterConfigAzure = <FrameworkControllerClusterConfigAzure>this.fcClusterConfig; const fcClusterConfigAzure: FrameworkControllerClusterConfigAzure = <FrameworkControllerClusterConfigAzure>this.fcClusterConfig;
return await this.uploadFolderToAzureStorage(srcDirectory, destDirectory, fcClusterConfigAzure.uploadRetryCount); return await this.uploadFolderToAzureStorage(srcDirectory, destDirectory, fcClusterConfigAzure.uploadRetryCount);
} else if (this.fcClusterConfig.storage === 'nfs' || this.fcClusterConfig.storage === undefined) { } else if (this.fcClusterConfig.storage === 'nfs' || this.fcClusterConfig.storage === undefined) {
await cpp.exec(`mkdir -p ${this.trialLocalNFSTempFolder}/${destDirectory}`); await cpp.exec(`mkdir -p ${this.trialLocalTempFolder}/${destDirectory}`);
await cpp.exec(`cp -r ${srcDirectory}/* ${this.trialLocalNFSTempFolder}/${destDirectory}/.`); await cpp.exec(`cp -r ${srcDirectory}/* ${this.trialLocalTempFolder}/${destDirectory}/.`);
const fcClusterConfigNFS: FrameworkControllerClusterConfigNFS = <FrameworkControllerClusterConfigNFS>this.fcClusterConfig; const fcClusterConfigNFS: FrameworkControllerClusterConfigNFS = <FrameworkControllerClusterConfigNFS>this.fcClusterConfig;
const nfsConfig: NFSConfig = fcClusterConfigNFS.nfs; const nfsConfig: NFSConfig = fcClusterConfigNFS.nfs;
return `nfs://${nfsConfig.server}:${destDirectory}`; return `nfs://${nfsConfig.server}:${destDirectory}`;
} else if (this.fcClusterConfig.storage === 'pvc') {
await cpp.exec(`mkdir -p ${this.trialLocalTempFolder}/${destDirectory}`);
await cpp.exec(`cp -r ${srcDirectory}/* ${this.trialLocalTempFolder}/${destDirectory}/.`);
return `${this.trialLocalTempFolder}/${destDirectory}`;
} }
return ''; return '';
} }
...@@ -211,48 +330,51 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple ...@@ -211,48 +330,51 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
* expose port and execute injector.sh before executing user's command * expose port and execute injector.sh before executing user's command
* @param command * @param command
*/ */
private generateCommandScript(command: string): string { private generateCommandScript(taskRoles: FrameworkControllerTrialConfigTemplate[], command: string): string {
let portScript: string = ''; let portScript: string = '';
if (this.fcTrialConfig === undefined) { for (const taskRole of taskRoles) {
throw new Error('frameworkcontroller trial config is not initialized'); portScript += `FB_${taskRole.name.toUpperCase()}_PORT=${this.fcContainerPortMap.get(
} taskRole.name
for (const taskRole of this.fcTrialConfig.taskRoles) { )} `;
portScript += `FB_${taskRole.name.toUpperCase()}_PORT=${this.fcContainerPortMap.get(taskRole.name)} `;
} }
return `${portScript} . /mnt/frameworkbarrier/injector.sh && ${command}`; return `${portScript} . /mnt/frameworkbarrier/injector.sh && ${command}`;
} }
private async prepareRunScript(trialLocalTempFolder: string, trialJobId: string, private async prepareRunScript(trialLocalTempFolder: string, trialJobId: string,
trialWorkingFolder: string, form: TrialJobApplicationForm): Promise<void> { trialWorkingFolder: string, form: TrialJobApplicationForm,
if (this.fcTrialConfig === undefined) { configTaskRoles: FrameworkControllerTrialConfigTemplate[]
throw new Error('frameworkcontroller trial config is not initialized'); ): Promise<void> {
if (configTaskRoles === undefined) {
throw new Error(
'neither frameworkcontroller trial config nor template is initialized'
);
} }
await cpp.exec(`mkdir -p ${trialLocalTempFolder}`); await cpp.exec(`mkdir -p ${trialLocalTempFolder}`);
const installScriptContent: string = CONTAINER_INSTALL_NNI_SHELL_FORMAT; const installScriptContent: string = CONTAINER_INSTALL_NNI_SHELL_FORMAT;
// Write NNI installation file to local tmp files // Write NNI installation file to local tmp files
await fs.promises.writeFile(path.join(trialLocalTempFolder, 'install_nni.sh'), installScriptContent, { encoding: 'utf8' }); await fs.promises.writeFile(path.join(trialLocalTempFolder, 'install_nni.sh'), installScriptContent, {encoding: 'utf8'});
// Create tmp trial working folder locally. // Create tmp trial working folder locally.
for (const taskRole of this.fcTrialConfig.taskRoles) { for (const taskRole of configTaskRoles) {
const runScriptContent: string = const runScriptContent: string =
await this.generateRunScript('frameworkcontroller', trialJobId, trialWorkingFolder, await this.generateRunScript('frameworkcontroller', trialJobId, trialWorkingFolder,
this.generateCommandScript(taskRole.command), form.sequenceId.toString(), this.generateCommandScript(configTaskRoles, taskRole.command), form.sequenceId.toString(),
taskRole.name, taskRole.gpuNum); taskRole.name, taskRole.gpuNum ? taskRole.gpuNum : 0);
await fs.promises.writeFile(path.join(trialLocalTempFolder, `run_${taskRole.name}.sh`), runScriptContent, { encoding: 'utf8' }); await fs.promises.writeFile(path.join(trialLocalTempFolder, `run_${taskRole.name}.sh`), runScriptContent, {encoding: 'utf8'});
} }
// Write file content ( parameter.cfg ) to local tmp folders // Write file content ( parameter.cfg ) to local tmp folders
if (form !== undefined) { if (form !== undefined) {
await fs.promises.writeFile(path.join(trialLocalTempFolder, generateParamFileName(form.hyperParameters)), await fs.promises.writeFile(path.join(trialLocalTempFolder, generateParamFileName(form.hyperParameters)),
form.hyperParameters.value, { encoding: 'utf8' }); form.hyperParameters.value, {encoding: 'utf8'});
} }
} }
private async prepareFrameworkControllerConfig(trialJobId: string, trialWorkingFolder: string, frameworkcontrollerJobName: string): private async prepareFrameworkControllerConfig(trialJobId: string, trialWorkingFolder: string, frameworkcontrollerJobName: string):
Promise<any> { Promise<any> {
if (this.fcTrialConfig === undefined) { if (this.fcTrialConfig === undefined) {
throw new Error('frameworkcontroller trial config is not initialized'); throw new Error('frameworkcontroller trial config is not initialized');
...@@ -267,19 +389,19 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple ...@@ -267,19 +389,19 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
} }
// Generate frameworkcontroller job resource config object // Generate frameworkcontroller job resource config object
const frameworkcontrollerJobConfig: any = const frameworkcontrollerJobConfig: any =
await this.generateFrameworkControllerJobConfig(trialJobId, trialWorkingFolder, frameworkcontrollerJobName, podResources); await this.generateFrameworkControllerJobConfig(trialJobId, trialWorkingFolder, frameworkcontrollerJobName, podResources);
return Promise.resolve(frameworkcontrollerJobConfig); return Promise.resolve(frameworkcontrollerJobConfig);
} }
private generateContainerPort(): void { private generateContainerPort(taskRoles: FrameworkControllerTrialConfigTemplate[]): void {
if (this.fcTrialConfig === undefined) { if (taskRoles === undefined) {
throw new Error('frameworkcontroller trial config is not initialized'); throw new Error('frameworkcontroller trial config is not initialized');
} }
let port: number = 4000; //The default port used in container let port: number = 4000; //The default port used in container
for (const index of this.fcTrialConfig.taskRoles.keys()) { for (const index of taskRoles.keys()) {
this.fcContainerPortMap.set(this.fcTrialConfig.taskRoles[index].name, port); this.fcContainerPortMap.set(taskRoles[index].name, port);
port += 1; port += 1;
} }
} }
...@@ -292,7 +414,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple ...@@ -292,7 +414,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
* @param podResources pod template * @param podResources pod template
*/ */
private async generateFrameworkControllerJobConfig(trialJobId: string, trialWorkingFolder: string, private async generateFrameworkControllerJobConfig(trialJobId: string, trialWorkingFolder: string,
frameworkcontrollerJobName: string, podResources: any): Promise<any> { frameworkcontrollerJobName: string, podResources: any): Promise<any> {
if (this.fcClusterConfig === undefined) { if (this.fcClusterConfig === undefined) {
throw new Error('frameworkcontroller Cluster config is not initialized'); throw new Error('frameworkcontroller Cluster config is not initialized');
} }
...@@ -307,7 +429,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple ...@@ -307,7 +429,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
if (containerPort === undefined) { if (containerPort === undefined) {
throw new Error('Container port is not initialized'); throw new Error('Container port is not initialized');
} }
const taskRole: any = this.generateTaskRoleConfig( const taskRole: any = this.generateTaskRoleConfig(
trialWorkingFolder, trialWorkingFolder,
this.fcTrialConfig.taskRoles[index].image, this.fcTrialConfig.taskRoles[index].image,
...@@ -332,7 +454,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple ...@@ -332,7 +454,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
kind: 'Framework', kind: 'Framework',
metadata: { metadata: {
name: frameworkcontrollerJobName, name: frameworkcontrollerJobName,
namespace: 'default', namespace: this.fcClusterConfig.namespace ? this.fcClusterConfig.namespace : "default",
labels: { labels: {
app: this.NNI_KUBERNETES_TRIAL_LABEL, app: this.NNI_KUBERNETES_TRIAL_LABEL,
expId: getExperimentId(), expId: getExperimentId(),
...@@ -346,8 +468,8 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple ...@@ -346,8 +468,8 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
}); });
} }
private generateTaskRoleConfig(trialWorkingFolder: string, replicaImage: string, runScriptFile: string, private generateTaskRoleConfig(trialWorkingFolder: string, replicaImage: string, runScriptFile: string,
podResources: any, containerPort: number, privateRegistrySecretName: string | undefined): any { podResources: any, containerPort: number, privateRegistrySecretName: string | undefined): any {
if (this.fcClusterConfig === undefined) { if (this.fcClusterConfig === undefined) {
throw new Error('frameworkcontroller Cluster config is not initialized'); throw new Error('frameworkcontroller Cluster config is not initialized');
} }
...@@ -359,31 +481,31 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple ...@@ -359,31 +481,31 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
const volumeSpecMap: Map<string, object> = new Map<string, object>(); const volumeSpecMap: Map<string, object> = new Map<string, object>();
if (this.fcClusterConfig.storageType === 'azureStorage') { if (this.fcClusterConfig.storageType === 'azureStorage') {
volumeSpecMap.set('nniVolumes', [ volumeSpecMap.set('nniVolumes', [
{ {
name: 'nni-vol', name: 'nni-vol',
azureFile: { azureFile: {
secretName: `${this.azureStorageSecretName}`, secretName: `${this.azureStorageSecretName}`,
shareName: `${this.azureStorageShare}`, shareName: `${this.azureStorageShare}`,
readonly: false readonly: false
} }
}, { }, {
name: 'frameworkbarrier-volume', name: 'frameworkbarrier-volume',
emptyDir: {} emptyDir: {}
}]); }]);
} else { } else {
const frameworkcontrollerClusterConfigNFS: FrameworkControllerClusterConfigNFS = const frameworkcontrollerClusterConfigNFS: FrameworkControllerClusterConfigNFS =
<FrameworkControllerClusterConfigNFS> this.fcClusterConfig; <FrameworkControllerClusterConfigNFS>this.fcClusterConfig;
volumeSpecMap.set('nniVolumes', [ volumeSpecMap.set('nniVolumes', [
{ {
name: 'nni-vol', name: 'nni-vol',
nfs: { nfs: {
server: `${frameworkcontrollerClusterConfigNFS.nfs.server}`, server: `${frameworkcontrollerClusterConfigNFS.nfs.server}`,
path: `${frameworkcontrollerClusterConfigNFS.nfs.path}` path: `${frameworkcontrollerClusterConfigNFS.nfs.path}`
} }
}, { }, {
name: 'frameworkbarrier-volume', name: 'frameworkbarrier-volume',
emptyDir: {} emptyDir: {}
}]); }]);
} }
const containers: any = [ const containers: any = [
...@@ -392,30 +514,30 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple ...@@ -392,30 +514,30 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
image: replicaImage, image: replicaImage,
command: ['sh', `${path.join(trialWorkingFolder, runScriptFile)}`], command: ['sh', `${path.join(trialWorkingFolder, runScriptFile)}`],
volumeMounts: [ volumeMounts: [
{ {
name: 'nni-vol', name: 'nni-vol',
mountPath: this.CONTAINER_MOUNT_PATH mountPath: this.CONTAINER_MOUNT_PATH
}, { }, {
name: 'frameworkbarrier-volume', name: 'frameworkbarrier-volume',
mountPath: '/mnt/frameworkbarrier' mountPath: '/mnt/frameworkbarrier'
}], }],
resources: podResources, resources: podResources,
ports: [{ ports: [{
containerPort: containerPort containerPort: containerPort
}] }]
}]; }];
const initContainers: any = [ const initContainers: any = [
{ {
name: 'frameworkbarrier', name: 'frameworkbarrier',
image: 'frameworkcontroller/frameworkbarrier', image: 'frameworkcontroller/frameworkbarrier',
volumeMounts: [ volumeMounts: [
{ {
name: 'frameworkbarrier-volume', name: 'frameworkbarrier-volume',
mountPath: '/mnt/frameworkbarrier' mountPath: '/mnt/frameworkbarrier'
}] }]
}]; }];
const spec: any = { const spec: any = {
containers: containers, containers: containers,
initContainers: initContainers, initContainers: initContainers,
...@@ -423,12 +545,12 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple ...@@ -423,12 +545,12 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
volumes: volumeSpecMap.get('nniVolumes'), volumes: volumeSpecMap.get('nniVolumes'),
hostNetwork: false hostNetwork: false
}; };
if(privateRegistrySecretName) { if (privateRegistrySecretName) {
spec.imagePullSecrets = [ spec.imagePullSecrets = [
{ {
name: privateRegistrySecretName name: privateRegistrySecretName
} }
] ]
} }
if (this.fcClusterConfig.serviceAccountName !== undefined) { if (this.fcClusterConfig.serviceAccountName !== undefined) {
...@@ -443,4 +565,4 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple ...@@ -443,4 +565,4 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
} }
} }
export { FrameworkControllerTrainingService }; export {FrameworkControllerTrainingService};
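The `parseCustomTaskRoles` change in this file maps each task role of a custom FrameworkController template onto the fields NNI needs to generate run scripts. A minimal, self-contained sketch of that mapping is given here for review purposes. The `CustomTaskRole` and `ParsedTaskRole` interfaces and the `parseTaskRoles` function are hypothetical, trimmed-down stand-ins rather than the types from `frameworkcontrollerConfig`. The completion policy is left out, the empty-container check is an addition the diff does not have, and `??` is used where the diff uses a truthiness test, so an explicit `0` is kept here instead of being replaced by the default.

```typescript
// Hypothetical, simplified shapes; not the real NNI config types.
interface CustomTaskRole {
    name: string;
    taskNumber?: number;
    task?: {
        gpuNum?: number;
        cpuNum?: number;
        memoryMB?: number;
        pod?: { spec?: { containers?: { image: string; command: string[] }[] } };
    };
}

interface ParsedTaskRole {
    name: string;
    taskNum: number;
    command: string;
    gpuNum: number;
    cpuNum: number;
    memoryMB: number;
    image: string;
}

function parseTaskRoles(roles: CustomTaskRole[]): ParsedTaskRole[] {
    return roles.map((role) => {
        const containers = role.task?.pod?.spec?.containers;
        // Assumption: an empty container list is rejected too (the diff only checks for undefined).
        if (containers === undefined || containers.length === 0) {
            throw new Error('invalid custom frameworkcontroller configuration');
        }
        if (containers.length > 1) {
            throw new Error('custom config may only define one non-init container for tasks');
        }
        const container = containers[0];
        return {
            name: role.name,
            taskNum: role.taskNumber ?? 1,
            // The container command array is flattened into one shell command string.
            command: container.command.join(' '),
            gpuNum: role.task?.gpuNum ?? 0,
            cpuNum: role.task?.cpuNum ?? 1,
            memoryMB: role.task?.memoryMB ?? 8192,
            image: container.image
        };
    });
}
```

Using `map` with a returned value, as in this sketch, avoids the separate accumulator array; whether that fits the surrounding code style is a judgment call for the PR author.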
...@@ -202,8 +202,8 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber ...@@ -202,8 +202,8 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber
const azureKubeflowClusterConfig: KubeflowClusterConfigAzure = <KubeflowClusterConfigAzure>this.kubeflowClusterConfig; const azureKubeflowClusterConfig: KubeflowClusterConfigAzure = <KubeflowClusterConfigAzure>this.kubeflowClusterConfig;
return await this.uploadFolderToAzureStorage(srcDirectory, destDirectory, azureKubeflowClusterConfig.uploadRetryCount); return await this.uploadFolderToAzureStorage(srcDirectory, destDirectory, azureKubeflowClusterConfig.uploadRetryCount);
} else if (this.kubeflowClusterConfig.storage === 'nfs' || this.kubeflowClusterConfig.storage === undefined) { } else if (this.kubeflowClusterConfig.storage === 'nfs' || this.kubeflowClusterConfig.storage === undefined) {
await cpp.exec(`mkdir -p ${this.trialLocalNFSTempFolder}/${destDirectory}`); await cpp.exec(`mkdir -p ${this.trialLocalTempFolder}/${destDirectory}`);
await cpp.exec(`cp -r ${srcDirectory}/* ${this.trialLocalNFSTempFolder}/${destDirectory}/.`); await cpp.exec(`cp -r ${srcDirectory}/* ${this.trialLocalTempFolder}/${destDirectory}/.`);
const nfsKubeflowClusterConfig: KubeflowClusterConfigNFS = <KubeflowClusterConfigNFS>this.kubeflowClusterConfig; const nfsKubeflowClusterConfig: KubeflowClusterConfigNFS = <KubeflowClusterConfigNFS>this.kubeflowClusterConfig;
const nfsConfig: NFSConfig = nfsKubeflowClusterConfig.nfs; const nfsConfig: NFSConfig = nfsKubeflowClusterConfig.nfs;
return `nfs://${nfsConfig.server}:${destDirectory}`; return `nfs://${nfsConfig.server}:${destDirectory}`;
...@@ -426,7 +426,7 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber ...@@ -426,7 +426,7 @@ class KubeflowTrainingService extends KubernetesTrainingService implements Kuber
}]); }]);
} }
// The config spec for container field // The config spec for container field
const containersSpecMap: Map<string, object> = new Map<string, object>(); const containersSpecMap: Map<string, object> = new Map<string, object>();
containersSpecMap.set('containers', [ containersSpecMap.set('containers', [
{ {
// Kubeflow tensorflow operator requires that containers' name must be tensorflow // Kubeflow tensorflow operator requires that containers' name must be tensorflow
......
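The namespace handling in this change relies on TypeScript accessors: `setNamespace` in `GeneralK8sClient` is declared with `set`, so `submitTrialJob` assigns to it (`this.genericK8sClient.setNamespace = namespace`) instead of calling it, and the added `getNamespace` accessor is read like a property. A small sketch of that behaviour, assuming a hypothetical `NamespacedClient` class and `resolveNamespace` helper that only mirror the relevant lines of the diff:

```typescript
// Hypothetical stand-in for the namespace part of GeneralK8sClient.
class NamespacedClient {
    protected namespace: string = 'default';

    // Accessors behave like properties: callers assign and read, they do not call.
    public set setNamespace(namespace: string) {
        this.namespace = namespace;
    }

    public get getNamespace(): string {
        return this.namespace;
    }
}

// Mirrors the fallback used in the diff: a missing or empty namespace becomes "default".
function resolveNamespace(configured: string | undefined): string {
    return configured ? configured : 'default';
}

const client = new NamespacedClient();
client.setNamespace = resolveNamespace('nni-trials'); // assignment, not client.setNamespace(...)
console.log(client.getNamespace);                     // property read, not client.getNamespace()
```

Because the accessor names start with `set` and `get`, they are easy to mistake for methods; calling `client.getNamespace()` would fail to compile since the value is a string.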
...@@ -4,8 +4,8 @@ ...@@ -4,8 +4,8 @@
'use strict'; 'use strict';
// eslint-disable-next-line @typescript-eslint/camelcase // eslint-disable-next-line @typescript-eslint/camelcase
import { Client1_10, config } from 'kubernetes-client'; import {Client1_10, config} from 'kubernetes-client';
import { getLogger, Logger } from '../../common/log'; import {getLogger, Logger} from '../../common/log';
/** /**
* Generic Kubernetes client, target version >= 1.9 * Generic Kubernetes client, target version >= 1.9
...@@ -16,13 +16,16 @@ class GeneralK8sClient { ...@@ -16,13 +16,16 @@ class GeneralK8sClient {
protected namespace: string = 'default'; protected namespace: string = 'default';
constructor() { constructor() {
this.client = new Client1_10({ config: config.fromKubeconfig(), version: '1.9'}); this.client = new Client1_10({config: config.fromKubeconfig(), version: '1.9'});
this.client.loadSpec(); this.client.loadSpec();
} }
public set setNamespace(namespace: string) { public set setNamespace(namespace: string) {
this.namespace = namespace; this.namespace = namespace;
} }
public get getNamespace(): string {
return this.namespace;
}
private matchStorageClass(response: any): string { private matchStorageClass(response: any): string {
const adlSupportedProvisioners: RegExp[] = [ const adlSupportedProvisioners: RegExp[] = [
...@@ -32,7 +35,7 @@ class GeneralK8sClient { ...@@ -32,7 +35,7 @@ class GeneralK8sClient {
new RegExp("\\b" + "efs" + "\\b") new RegExp("\\b" + "efs" + "\\b")
] ]
const templateLen = adlSupportedProvisioners.length, const templateLen = adlSupportedProvisioners.length,
responseLen = response.items.length responseLen = response.items.length
let i = 0, let i = 0,
j = 0; j = 0;
for (; i < responseLen; i++) { for (; i < responseLen; i++) {
...@@ -66,7 +69,7 @@ class GeneralK8sClient { ...@@ -66,7 +69,7 @@ class GeneralK8sClient {
public async createDeployment(deploymentManifest: any): Promise<string> { public async createDeployment(deploymentManifest: any): Promise<string> {
let result: Promise<string>; let result: Promise<string>;
const response: any = await this.client.apis.apps.v1.namespaces(this.namespace) const response: any = await this.client.apis.apps.v1.namespaces(this.namespace)
.deployments.post({ body: deploymentManifest }) .deployments.post({body: deploymentManifest})
if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) { if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
result = Promise.resolve(response.body.metadata.uid); result = Promise.resolve(response.body.metadata.uid);
} else { } else {
...@@ -79,7 +82,7 @@ class GeneralK8sClient { ...@@ -79,7 +82,7 @@ class GeneralK8sClient {
let result: Promise<boolean>; let result: Promise<boolean>;
// TODO: change this hard coded deployment name after demo // TODO: change this hard coded deployment name after demo
const response: any = await this.client.apis.apps.v1.namespaces(this.namespace) const response: any = await this.client.apis.apps.v1.namespaces(this.namespace)
.deployment(deploymentName).delete(); .deployment(deploymentName).delete();
if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) { if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
result = Promise.resolve(true); result = Promise.resolve(true);
} else { } else {
...@@ -91,7 +94,7 @@ class GeneralK8sClient { ...@@ -91,7 +94,7 @@ class GeneralK8sClient {
public async createConfigMap(configMapManifest: any): Promise<boolean> { public async createConfigMap(configMapManifest: any): Promise<boolean> {
let result: Promise<boolean>; let result: Promise<boolean>;
const response: any = await this.client.api.v1.namespaces(this.namespace) const response: any = await this.client.api.v1.namespaces(this.namespace)
.configmaps.post({body: configMapManifest}); .configmaps.post({body: configMapManifest});
if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) { if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
result = Promise.resolve(true); result = Promise.resolve(true);
} else { } else {
...@@ -104,7 +107,7 @@ class GeneralK8sClient { ...@@ -104,7 +107,7 @@ class GeneralK8sClient {
public async createPersistentVolumeClaim(pvcManifest: any): Promise<boolean> { public async createPersistentVolumeClaim(pvcManifest: any): Promise<boolean> {
let result: Promise<boolean>; let result: Promise<boolean>;
const response: any = await this.client.api.v1.namespaces(this.namespace) const response: any = await this.client.api.v1.namespaces(this.namespace)
.persistentvolumeclaims.post({body: pvcManifest}); .persistentvolumeclaims.post({body: pvcManifest});
if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) { if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
result = Promise.resolve(true); result = Promise.resolve(true);
} else { } else {
...@@ -116,7 +119,7 @@ class GeneralK8sClient { ...@@ -116,7 +119,7 @@ class GeneralK8sClient {
public async createSecret(secretManifest: any): Promise<boolean> { public async createSecret(secretManifest: any): Promise<boolean> {
let result: Promise<boolean>; let result: Promise<boolean>;
const response: any = await this.client.api.v1.namespaces(this.namespace) const response: any = await this.client.api.v1.namespaces(this.namespace)
.secrets.post({body: secretManifest}); .secrets.post({body: secretManifest});
if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) { if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
result = Promise.resolve(true); result = Promise.resolve(true);
} else { } else {
...@@ -136,7 +139,7 @@ abstract class KubernetesCRDClient { ...@@ -136,7 +139,7 @@ abstract class KubernetesCRDClient {
protected crdSchema: any; protected crdSchema: any;
constructor() { constructor() {
this.client = new Client1_10({ config: config.fromKubeconfig() }); this.client = new Client1_10({config: config.fromKubeconfig()});
this.client.loadSpec(); this.client.loadSpec();
} }
...@@ -181,7 +184,7 @@ abstract class KubernetesCRDClient { ...@@ -181,7 +184,7 @@ abstract class KubernetesCRDClient {
public async getKubernetesJob(kubeflowJobName: string): Promise<any> { public async getKubernetesJob(kubeflowJobName: string): Promise<any> {
let result: Promise<any>; let result: Promise<any>;
const response: any = await this.operator(kubeflowJobName) const response: any = await this.operator(kubeflowJobName)
.get(); .get();
if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) { if (response.statusCode && (response.statusCode >= 200 && response.statusCode <= 299)) {
result = Promise.resolve(response.body); result = Promise.resolve(response.body);
} else { } else {
...@@ -195,16 +198,16 @@ abstract class KubernetesCRDClient { ...@@ -195,16 +198,16 @@ abstract class KubernetesCRDClient {
let result: Promise<boolean>; let result: Promise<boolean>;
// construct match query from labels for deleting tfjob // construct match query from labels for deleting tfjob
const matchQuery: string = Array.from(labels.keys()) const matchQuery: string = Array.from(labels.keys())
.map((labelKey: string) => `${labelKey}=${labels.get(labelKey)}`) .map((labelKey: string) => `${labelKey}=${labels.get(labelKey)}`)
.join(','); .join(',');
try { try {
const deleteResult: any = await this.operator() const deleteResult: any = await this.operator()
.delete({ .delete({
qs: { qs: {
labelSelector: matchQuery, labelSelector: matchQuery,
propagationPolicy: 'Background' propagationPolicy: 'Background'
} }
}); });
if (deleteResult.statusCode && deleteResult.statusCode >= 200 && deleteResult.statusCode <= 299) { if (deleteResult.statusCode && deleteResult.statusCode >= 200 && deleteResult.statusCode <= 299) {
result = Promise.resolve(true); result = Promise.resolve(true);
} else { } else {
...@@ -219,4 +222,4 @@ abstract class KubernetesCRDClient { ...@@ -219,4 +222,4 @@ abstract class KubernetesCRDClient {
} }
} }
export { KubernetesCRDClient, GeneralK8sClient }; export {KubernetesCRDClient, GeneralK8sClient};
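Review note on the new `getNamespace` accessor: because it is declared with `get` (and `setNamespace` with `set`), callers read and assign it as a property rather than calling it as a method. The following is a minimal standalone sketch of that behaviour. `NamespacedClient` is a hypothetical stand-in for `GeneralK8sClient` with the Kubernetes client removed, and `'nni-trials'` is a made-up namespace.

```typescript
// Hypothetical stand-in for GeneralK8sClient; only the namespace accessors are modelled.
class NamespacedClient {
    protected namespace: string = 'default';

    // Declared as a setter, so it is used as `client.setNamespace = value`.
    public set setNamespace(namespace: string) {
        this.namespace = namespace;
    }

    // Declared as a getter, so it is used as `client.getNamespace` (no parentheses).
    public get getNamespace(): string {
        return this.namespace;
    }
}

const client = new NamespacedClient();
const namespaceBefore: string = client.getNamespace;
client.setNamespace = 'nni-trials';
const namespaceAfter: string = client.getNamespace;
console.log(namespaceBefore, namespaceAfter);
```

Writing `client.getNamespace()` would fail to compile here, since the accessor yields a string rather than a function.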
...@@ -3,16 +3,18 @@ ...@@ -3,16 +3,18 @@
'use strict'; 'use strict';
export type KubernetesStorageKind = 'nfs' | 'azureStorage'; export type KubernetesStorageKind = 'nfs' | 'azureStorage' | 'pvc';
import { MethodNotImplementedError } from '../../common/errors'; import {MethodNotImplementedError} from '../../common/errors';
export abstract class KubernetesClusterConfig { export abstract class KubernetesClusterConfig {
public readonly storage?: KubernetesStorageKind; public readonly storage?: KubernetesStorageKind;
public readonly apiVersion: string; public readonly apiVersion: string;
public readonly namespace?: string;
constructor(apiVersion: string, storage?: KubernetesStorageKind) { constructor(apiVersion: string, storage?: KubernetesStorageKind, namespace?: string) {
this.storage = storage; this.storage = storage;
this.apiVersion = apiVersion; this.apiVersion = apiVersion;
this.namespace = namespace
} }
public get storageType(): KubernetesStorageKind { public get storageType(): KubernetesStorageKind {
...@@ -32,11 +34,12 @@ export class KubernetesClusterConfigNFS extends KubernetesClusterConfig { ...@@ -32,11 +34,12 @@ export class KubernetesClusterConfigNFS extends KubernetesClusterConfig {
public readonly nfs: NFSConfig; public readonly nfs: NFSConfig;
constructor( constructor(
apiVersion: string, apiVersion: string,
nfs: NFSConfig, nfs: NFSConfig,
storage?: KubernetesStorageKind storage?: KubernetesStorageKind,
) { namespace?: string
super(apiVersion, storage); ) {
super(apiVersion, storage, namespace);
this.nfs = nfs; this.nfs = nfs;
} }
...@@ -50,7 +53,8 @@ export class KubernetesClusterConfigNFS extends KubernetesClusterConfig { ...@@ -50,7 +53,8 @@ export class KubernetesClusterConfigNFS extends KubernetesClusterConfig {
return new KubernetesClusterConfigNFS( return new KubernetesClusterConfigNFS(
kubernetesClusterConfigObjectNFS.apiVersion, kubernetesClusterConfigObjectNFS.apiVersion,
kubernetesClusterConfigObjectNFS.nfs, kubernetesClusterConfigObjectNFS.nfs,
kubernetesClusterConfigObjectNFS.storage kubernetesClusterConfigObjectNFS.storage,
kubernetesClusterConfigObjectNFS.namespace
); );
} }
} }
...@@ -61,13 +65,15 @@ export class KubernetesClusterConfigAzure extends KubernetesClusterConfig { ...@@ -61,13 +65,15 @@ export class KubernetesClusterConfigAzure extends KubernetesClusterConfig {
public readonly uploadRetryCount: number | undefined; public readonly uploadRetryCount: number | undefined;
constructor( constructor(
apiVersion: string, apiVersion: string,
keyVault: KeyVaultConfig, keyVault: KeyVaultConfig,
azureStorage: AzureStorage, azureStorage: AzureStorage,
storage?: KubernetesStorageKind, storage?: KubernetesStorageKind,
uploadRetryCount?: number uploadRetryCount?: number,
) { namespace?: string,
super(apiVersion, storage);
) {
super(apiVersion, storage, namespace);
this.keyVault = keyVault; this.keyVault = keyVault;
this.azureStorage = azureStorage; this.azureStorage = azureStorage;
this.uploadRetryCount = uploadRetryCount; this.uploadRetryCount = uploadRetryCount;
...@@ -85,24 +91,54 @@ export class KubernetesClusterConfigAzure extends KubernetesClusterConfig { ...@@ -85,24 +91,54 @@ export class KubernetesClusterConfigAzure extends KubernetesClusterConfig {
kubernetesClusterConfigObjectAzure.keyVault, kubernetesClusterConfigObjectAzure.keyVault,
kubernetesClusterConfigObjectAzure.azureStorage, kubernetesClusterConfigObjectAzure.azureStorage,
kubernetesClusterConfigObjectAzure.storage, kubernetesClusterConfigObjectAzure.storage,
kubernetesClusterConfigObjectAzure.uploadRetryCount kubernetesClusterConfigObjectAzure.uploadRetryCount,
kubernetesClusterConfigObjectAzure.namespace
); );
} }
} }
export class KubernetesClusterConfigFactory { export class KubernetesClusterConfigPVC extends KubernetesClusterConfig {
public readonly pvc: PVCConfig;
constructor(
apiVersion: string,
pvc: PVCConfig,
storage?: KubernetesStorageKind,
namespace?: string,
) {
super(apiVersion, storage, namespace);
this.pvc = pvc;
}
public get storageType(): KubernetesStorageKind {
return 'pvc';
}
public static getInstance(jsonObject: object): KubernetesClusterConfigPVC {
const kubernetesClusterConfigObjectPVC: KubernetesClusterConfigPVC =
<KubernetesClusterConfigPVC>jsonObject;
return new KubernetesClusterConfigPVC(
kubernetesClusterConfigObjectPVC.apiVersion,
kubernetesClusterConfigObjectPVC.pvc,
kubernetesClusterConfigObjectPVC.storage,
kubernetesClusterConfigObjectPVC.namespace
);
}
}
export class KubernetesClusterConfigFactory {
public static generateKubernetesClusterConfig(jsonObject: object): KubernetesClusterConfig { public static generateKubernetesClusterConfig(jsonObject: object): KubernetesClusterConfig {
const storageConfig: StorageConfig = <StorageConfig>jsonObject; const storageConfig: StorageConfig = <StorageConfig>jsonObject;
switch (storageConfig.storage) { switch (storageConfig.storage) {
case 'azureStorage': case 'azureStorage':
return KubernetesClusterConfigAzure.getInstance(jsonObject); return KubernetesClusterConfigAzure.getInstance(jsonObject);
case 'pvc':
return KubernetesClusterConfigPVC.getInstance(jsonObject);
case 'nfs': case 'nfs':
case undefined: case undefined:
return KubernetesClusterConfigNFS.getInstance(jsonObject); return KubernetesClusterConfigNFS.getInstance(jsonObject);
default: default:
throw new Error(`Invalid json object ${jsonObject}`); throw new Error(`Invalid json object ${jsonObject}`);
} }
} }
} }
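Review note on the factory change: the switch now routes `'pvc'` to the new PVC config class, while a missing `storage` field still falls back to NFS. The sketch below reproduces only that dispatch in isolation. `RawClusterConfig` and `resolveStorageKind` are hypothetical names, and the field values are made up. It also uses `JSON.stringify` in the error message, since interpolating an object directly into a template literal prints `[object Object]`.

```typescript
// Storage kinds accepted after this change.
type StorageKind = 'nfs' | 'azureStorage' | 'pvc';

// Hypothetical, simplified shape of the parsed cluster config JSON.
interface RawClusterConfig {
    apiVersion: string;
    storage?: string;
    namespace?: string;
    pvc?: { path: string };
}

// Mirrors the factory's switch: an undefined storage field falls back to 'nfs'.
function resolveStorageKind(raw: RawClusterConfig): StorageKind {
    switch (raw.storage) {
        case 'azureStorage':
            return 'azureStorage';
        case 'pvc':
            return 'pvc';
        case 'nfs':
        case undefined:
            return 'nfs';
        default:
            // Serialize the object so the message shows its actual content.
            throw new Error(`Invalid cluster config: ${JSON.stringify(raw)}`);
    }
}

const pvcKind: StorageKind = resolveStorageKind({
    apiVersion: 'v1',
    storage: 'pvc',
    namespace: 'nni-trials',
    pvc: { path: '/mnt/nni' }
});
const defaultKind: StorageKind = resolveStorageKind({ apiVersion: 'v1' });
console.log(pvcKind, defaultKind);
```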
...@@ -121,6 +157,18 @@ export class NFSConfig { ...@@ -121,6 +157,18 @@ export class NFSConfig {
} }
} }
/**
* PVC configuration to store Kubernetes job related files
*/
export class PVCConfig {
// Path of the mounted pvc
public readonly path: string;
constructor(path: string) {
this.path = path;
}
}
/** /**
* KeyVault configuration to store the key of Azure Storage Service * KeyVault configuration to store the key of Azure Storage Service
* Refer https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2 * Refer https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2
...@@ -175,7 +223,7 @@ export class KubernetesTrialConfigTemplate { ...@@ -175,7 +223,7 @@ export class KubernetesTrialConfigTemplate {
public readonly gpuNum: number; public readonly gpuNum: number;
constructor(command: string, gpuNum: number, constructor(command: string, gpuNum: number,
cpuNum: number, memoryMB: number, image: string, privateRegistryAuthPath?: string) { cpuNum: number, memoryMB: number, image: string, privateRegistryAuthPath?: string) {
this.command = command; this.command = command;
this.gpuNum = gpuNum; this.gpuNum = gpuNum;
this.cpuNum = cpuNum; this.cpuNum = cpuNum;
......
...@@ -7,21 +7,21 @@ import * as cpp from 'child-process-promise'; ...@@ -7,21 +7,21 @@ import * as cpp from 'child-process-promise';
import * as path from 'path'; import * as path from 'path';
import * as azureStorage from 'azure-storage'; import * as azureStorage from 'azure-storage';
import { EventEmitter } from 'events'; import {EventEmitter} from 'events';
import { Base64 } from 'js-base64'; import {Base64} from 'js-base64';
import { String } from 'typescript-string-operations'; import {String} from 'typescript-string-operations';
import { getExperimentId } from '../../common/experimentStartupInfo'; import {getExperimentId} from '../../common/experimentStartupInfo';
import { getLogger, Logger } from '../../common/log'; import {getLogger, Logger} from '../../common/log';
import { MethodNotImplementedError } from '../../common/errors'; import {MethodNotImplementedError} from '../../common/errors';
import { import {
NNIManagerIpConfig, TrialJobDetail, TrialJobMetric, LogType NNIManagerIpConfig, TrialJobDetail, TrialJobMetric, LogType
} from '../../common/trainingService'; } from '../../common/trainingService';
import { delay, getExperimentRootDir, getIPV4Address, getJobCancelStatus, getVersion, uniqueString } from '../../common/utils'; import {delay, getExperimentRootDir, getIPV4Address, getJobCancelStatus, getVersion, uniqueString} from '../../common/utils';
import { AzureStorageClientUtility } from './azureStorageClientUtils'; import {AzureStorageClientUtility} from './azureStorageClientUtils';
import { GeneralK8sClient, KubernetesCRDClient } from './kubernetesApiClient'; import {GeneralK8sClient, KubernetesCRDClient} from './kubernetesApiClient';
import { KubernetesClusterConfig } from './kubernetesConfig'; import {KubernetesClusterConfig} from './kubernetesConfig';
import { kubernetesScriptFormat, KubernetesTrialJobDetail } from './kubernetesData'; import {kubernetesScriptFormat, KubernetesTrialJobDetail} from './kubernetesData';
import { KubernetesJobRestServer } from './kubernetesJobRestServer'; import {KubernetesJobRestServer} from './kubernetesJobRestServer';
const fs = require('fs'); const fs = require('fs');
...@@ -34,7 +34,7 @@ abstract class KubernetesTrainingService { ...@@ -34,7 +34,7 @@ abstract class KubernetesTrainingService {
protected readonly metricsEmitter: EventEmitter; protected readonly metricsEmitter: EventEmitter;
protected readonly trialJobsMap: Map<string, KubernetesTrialJobDetail>; protected readonly trialJobsMap: Map<string, KubernetesTrialJobDetail>;
// experiment root dir in NFS // experiment root dir in NFS
protected readonly trialLocalNFSTempFolder: string; protected readonly trialLocalTempFolder: string;
protected stopping: boolean = false; protected stopping: boolean = false;
protected experimentId!: string; protected experimentId!: string;
protected kubernetesRestServerPort?: number; protected kubernetesRestServerPort?: number;
...@@ -57,7 +57,7 @@ abstract class KubernetesTrainingService { ...@@ -57,7 +57,7 @@ abstract class KubernetesTrainingService {
this.log = getLogger(); this.log = getLogger();
this.metricsEmitter = new EventEmitter(); this.metricsEmitter = new EventEmitter();
this.trialJobsMap = new Map<string, KubernetesTrialJobDetail>(); this.trialJobsMap = new Map<string, KubernetesTrialJobDetail>();
this.trialLocalNFSTempFolder = path.join(getExperimentRootDir(), 'trials-nfs-tmp'); this.trialLocalTempFolder = path.join(getExperimentRootDir(), 'trials-nfs-tmp');
this.experimentId = getExperimentId(); this.experimentId = getExperimentId();
this.CONTAINER_MOUNT_PATH = '/tmp/mount'; this.CONTAINER_MOUNT_PATH = '/tmp/mount';
this.expContainerCodeFolder = path.join(this.CONTAINER_MOUNT_PATH, 'nni', this.experimentId, 'nni-code'); this.expContainerCodeFolder = path.join(this.CONTAINER_MOUNT_PATH, 'nni', this.experimentId, 'nni-code');
...@@ -124,7 +124,7 @@ abstract class KubernetesTrainingService { ...@@ -124,7 +124,7 @@ abstract class KubernetesTrainingService {
} }
public async cancelTrialJob(trialJobId: string, isEarlyStopped: boolean = false): Promise<void> { public async cancelTrialJob(trialJobId: string, isEarlyStopped: boolean = false): Promise<void> {
const trialJobDetail: KubernetesTrialJobDetail | undefined = this.trialJobsMap.get(trialJobId); const trialJobDetail: KubernetesTrialJobDetail | undefined = this.trialJobsMap.get(trialJobId);
if (trialJobDetail === undefined) { if (trialJobDetail === undefined) {
const errorMessage: string = `CancelTrialJob: trial job id ${trialJobId} not found`; const errorMessage: string = `CancelTrialJob: trial job id ${trialJobId} not found`;
this.log.error(errorMessage); this.log.error(errorMessage);
...@@ -168,7 +168,7 @@ abstract class KubernetesTrainingService { ...@@ -168,7 +168,7 @@ abstract class KubernetesTrainingService {
try { try {
await this.cancelTrialJob(trialJobId); await this.cancelTrialJob(trialJobId);
} catch (error) { } catch (error) {
// DONT throw error during cleanup // DONT throw error during cleanup
} }
kubernetesTrialJob.status = 'SYS_CANCELED'; kubernetesTrialJob.status = 'SYS_CANCELED';
} }
...@@ -191,9 +191,9 @@ abstract class KubernetesTrainingService { ...@@ -191,9 +191,9 @@ abstract class KubernetesTrainingService {
// Unmount NFS // Unmount NFS
try { try {
await cpp.exec(`sudo umount ${this.trialLocalNFSTempFolder}`); await cpp.exec(`sudo umount ${this.trialLocalTempFolder}`);
} catch (error) { } catch (error) {
this.log.error(`Unmount ${this.trialLocalNFSTempFolder} failed, error is ${error}`); this.log.error(`Unmount ${this.trialLocalTempFolder} failed, error is ${error}`);
} }
// Stop kubernetes rest server // Stop kubernetes rest server
...@@ -230,14 +230,16 @@ abstract class KubernetesTrainingService { ...@@ -230,14 +230,16 @@ abstract class KubernetesTrainingService {
await AzureStorageClientUtility.createShare(this.azureStorageClient, this.azureStorageShare); await AzureStorageClientUtility.createShare(this.azureStorageClient, this.azureStorageShare);
// create storage secret // create storage secret
this.azureStorageSecretName = String.Format('nni-secret-{0}', uniqueString(8) this.azureStorageSecretName = String.Format('nni-secret-{0}', uniqueString(8)
.toLowerCase()); .toLowerCase());
const namespace = this.genericK8sClient.getNamespace ? this.genericK8sClient.getNamespace : "default"
await this.genericK8sClient.createSecret( await this.genericK8sClient.createSecret(
{ {
apiVersion: 'v1', apiVersion: 'v1',
kind: 'Secret', kind: 'Secret',
metadata: { metadata: {
name: this.azureStorageSecretName, name: this.azureStorageSecretName,
namespace: 'default', namespace: namespace,
labels: { labels: {
app: this.NNI_KUBERNETES_TRIAL_LABEL, app: this.NNI_KUBERNETES_TRIAL_LABEL,
expId: getExperimentId() expId: getExperimentId()
...@@ -267,7 +269,7 @@ abstract class KubernetesTrainingService { ...@@ -267,7 +269,7 @@ abstract class KubernetesTrainingService {
* @param trialSequenceId sequence id * @param trialSequenceId sequence id
*/ */
protected async generateRunScript(platform: string, trialJobId: string, trialWorkingFolder: string, protected async generateRunScript(platform: string, trialJobId: string, trialWorkingFolder: string,
command: string, trialSequenceId: string, roleName: string, gpuNum: number): Promise<string> { command: string, trialSequenceId: string, roleName: string, gpuNum: number): Promise<string> {
let nvidiaScript: string = ''; let nvidiaScript: string = '';
// Nvidia device plugin for K8S has a known issue that requesting zero GPUs allocates all GPUs // Nvidia device plugin for K8S has a known issue that requesting zero GPUs allocates all GPUs
// Refer https://github.com/NVIDIA/k8s-device-plugin/issues/61 // Refer https://github.com/NVIDIA/k8s-device-plugin/issues/61
...@@ -297,11 +299,11 @@ abstract class KubernetesTrainingService { ...@@ -297,11 +299,11 @@ abstract class KubernetesTrainingService {
return Promise.resolve(runScript); return Promise.resolve(runScript);
} }
protected async createNFSStorage(nfsServer: string, nfsPath: string): Promise<void> { protected async createNFSStorage(nfsServer: string, nfsPath: string): Promise<void> {
await cpp.exec(`mkdir -p ${this.trialLocalNFSTempFolder}`); await cpp.exec(`mkdir -p ${this.trialLocalTempFolder}`);
try { try {
await cpp.exec(`sudo mount ${nfsServer}:${nfsPath} ${this.trialLocalNFSTempFolder}`); await cpp.exec(`sudo mount ${nfsServer}:${nfsPath} ${this.trialLocalTempFolder}`);
} catch (error) { } catch (error) {
const mountError: string = `Mount NFS ${nfsServer}:${nfsPath} to ${this.trialLocalNFSTempFolder} failed, error is ${error}`; const mountError: string = `Mount NFS ${nfsServer}:${nfsPath} to ${this.trialLocalTempFolder} failed, error is ${error}`;
this.log.error(mountError); this.log.error(mountError);
return Promise.reject(mountError); return Promise.reject(mountError);
...@@ -309,21 +311,35 @@ abstract class KubernetesTrainingService { ...@@ -309,21 +311,35 @@ abstract class KubernetesTrainingService {
return Promise.resolve(); return Promise.resolve();
} }
protected async createPVCStorage(pvcPath: string): Promise<void> {
try {
await cpp.exec(`mkdir -p ${pvcPath}`);
await cpp.exec(`sudo ln -s ${pvcPath} ${this.trialLocalTempFolder}`);
} catch (error) {
const linkError: string = `Linking ${pvcPath} to ${this.trialLocalTempFolder} failed, error is ${error}`;
this.log.error(linkError);
return Promise.reject(linkError);
}
return Promise.resolve();
}
protected async createRegistrySecret(filePath: string | undefined): Promise<string | undefined> { protected async createRegistrySecret(filePath: string | undefined): Promise<string | undefined> {
if(filePath === undefined || filePath === '') { if (filePath === undefined || filePath === '') {
return undefined; return undefined;
} }
const body = fs.readFileSync(filePath).toString('base64'); const body = fs.readFileSync(filePath).toString('base64');
const registrySecretName = String.Format('nni-secret-{0}', uniqueString(8) const registrySecretName = String.Format('nni-secret-{0}', uniqueString(8)
.toLowerCase()); .toLowerCase());
const namespace = this.genericK8sClient.getNamespace ? this.genericK8sClient.getNamespace : "default"
await this.genericK8sClient.createSecret( await this.genericK8sClient.createSecret(
{ {
apiVersion: 'v1', apiVersion: 'v1',
kind: 'Secret', kind: 'Secret',
metadata: { metadata: {
name: registrySecretName, name: registrySecretName,
namespace: 'default', namespace: namespace,
labels: { labels: {
app: this.NNI_KUBERNETES_TRIAL_LABEL, app: this.NNI_KUBERNETES_TRIAL_LABEL,
expId: getExperimentId() expId: getExperimentId()
...@@ -337,7 +353,7 @@ abstract class KubernetesTrainingService { ...@@ -337,7 +353,7 @@ abstract class KubernetesTrainingService {
); );
return registrySecretName; return registrySecretName;
} }
/** /**
* upload local directory to azureStorage * upload local directory to azureStorage
* @param srcDirectory the source directory of local folder * @param srcDirectory the source directory of local folder
...@@ -349,7 +365,7 @@ abstract class KubernetesTrainingService { ...@@ -349,7 +365,7 @@ abstract class KubernetesTrainingService {
throw new Error('azureStorageClient is not initialized'); throw new Error('azureStorageClient is not initialized');
} }
let retryCount: number = 1; let retryCount: number = 1;
if(uploadRetryCount) { if (uploadRetryCount) {
retryCount = uploadRetryCount; retryCount = uploadRetryCount;
} }
let uploadSuccess: boolean = false; let uploadSuccess: boolean = false;
...@@ -358,7 +374,7 @@ abstract class KubernetesTrainingService { ...@@ -358,7 +374,7 @@ abstract class KubernetesTrainingService {
do { do {
uploadSuccess = await AzureStorageClientUtility.uploadDirectory( uploadSuccess = await AzureStorageClientUtility.uploadDirectory(
this.azureStorageClient, this.azureStorageClient,
`${destDirectory}`, `${destDirectory}`,
this.azureStorageShare, this.azureStorageShare,
`${srcDirectory}`); `${srcDirectory}`);
if (!uploadSuccess) { if (!uploadSuccess) {
...@@ -378,4 +394,4 @@ abstract class KubernetesTrainingService { ...@@ -378,4 +394,4 @@ abstract class KubernetesTrainingService {
return Promise.resolve(folderUriInAzure); return Promise.resolve(folderUriInAzure);
} }
} }
export { KubernetesTrainingService }; export {KubernetesTrainingService};
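Review note on the secret namespace change: both `createSecret` call sites now take the namespace from the generic client and fall back to `"default"` when the value is falsy. A minimal sketch of that fallback, limited to the `metadata` block, is shown here. `buildSecretMetadata`, the secret name, and the label values are hypothetical; the rest of the manifest is elided in the diff and is not reproduced.

```typescript
// Hypothetical shape covering only the metadata fields visible in the diff.
interface SecretMetadata {
    name: string;
    namespace: string;
    labels: { [key: string]: string };
}

function buildSecretMetadata(
    name: string,
    clientNamespace: string | undefined,
    labels: { [key: string]: string }
): SecretMetadata {
    // Same ternary as in the diff: undefined and '' both fall back to 'default'.
    const namespace: string = clientNamespace ? clientNamespace : 'default';
    return { name, namespace, labels };
}

// Made-up label values for illustration only.
const exampleLabels = { app: 'example-trial-label', expId: 'exp123' };
const customMeta = buildSecretMetadata('nni-secret-abcdefgh', 'nni-trials', exampleLabels);
const fallbackMeta = buildSecretMetadata('nni-secret-abcdefgh', '', exampleLabels);
console.log(customMeta.namespace, fallbackMeta.namespace);
```

Because the client initializes its namespace to `'default'`, the fallback branch should only matter if the namespace was explicitly set to an empty value.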