Commit 971e5055 authored by SparkSnail, committed by GitHub

Add kubeflow and frameworkcontroller hybrid example (#4343)

parent 357ec6ef
......@@ -6,7 +6,7 @@ Run NNI on hybrid mode means that NNI will run trials jobs in multiple kinds of
Setup environment
-----------------
NNI has supported `local <./LocalMode.rst>`__\ , `remote <./RemoteMachineMode.rst>`__\ , `PAI <./PaiMode.rst>`__\ , and `AML <./AMLMode.rst>`__ for hybrid training service. Before starting an experiment using these mode, users should setup the corresponding environment for the platforms. More details about the environment setup could be found in the corresponding docs.
NNI supports `local <./LocalMode.rst>`__\ , `remote <./RemoteMachineMode.rst>`__\ , `PAI <./PaiMode.rst>`__\ , `AML <./AMLMode.rst>`__\ , `Kubeflow <./KubeflowMode.rst>`__\ , and `FrameworkController <./FrameworkControllerMode.rst>`__ for the hybrid training service. Before starting an experiment in any of these modes, users should set up the corresponding environment for each platform. More details about environment setup can be found in the corresponding docs.
Run an experiment
-----------------
......@@ -36,4 +36,4 @@ Use ``examples/trials/mnist-tfv1`` as an example. The NNI config YAML file's con
- platform: local
To use hybrid training services, users should set the training service configurations as a list in the `trainingService` field.
Currently, hybrid support setting `local`, `remote`, `pai` and `aml` training services.
Currently, hybrid mode supports the `local`, `remote`, `pai`, `aml`, `kubeflow`, and `frameworkcontroller` training services.
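As a reading aid for the sentence above, here is a minimal sketch of how a hybrid `trainingService` list might be checked against the platforms this change documents. The helper `check_hybrid_platforms` and the constant `SUPPORTED_HYBRID_PLATFORMS` are hypothetical names introduced for illustration; they are not part of NNI, which performs its own validation.

```python
# Hypothetical helper, not part of NNI: checks that every entry of a hybrid
# `trainingService` list names one of the platforms listed in the docs above.
SUPPORTED_HYBRID_PLATFORMS = {
    "local", "remote", "pai", "aml", "kubeflow", "frameworkcontroller",
}


def check_hybrid_platforms(training_services):
    """Return the platform names in order, raising ValueError on bad input."""
    if not isinstance(training_services, list):
        raise ValueError("hybrid mode expects `trainingService` to be a list")
    platforms = []
    for entry in training_services:
        platform = entry.get("platform")
        if platform not in SUPPORTED_HYBRID_PLATFORMS:
            raise ValueError(f"unsupported hybrid platform: {platform!r}")
        platforms.append(platform)
    return platforms


# Illustrative input only; real entries carry the per-platform fields.
example = [
    {"platform": "local"},
    {"platform": "kubeflow", "reuseMode": True},
    {"platform": "frameworkcontroller", "reuseMode": True},
]
print(check_hybrid_platforms(example))
```

A single mapping (rather than a list) is rejected here because the docs describe hybrid mode as taking a list of training service configurations.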
......@@ -21,3 +21,41 @@ trainingService:
    resourceGroup: ${your resource group}
    workspaceName: ${your workspace name}
    computeTarget: ${your compute target}
  - platform: kubeflow
    reuseMode: true
    worker:
      command:
      code_directory:
      dockerImage: msranni/nni
      cpuNumber:
      gpuNumber:
      memorySize:
      replicas:
    operator: tf-operator
    storage:
      storageType:
      azureAccount:
      azureShare:
      keyVaultName:
      keyVaultKey:
    apiVersion: v1
  - platform: frameworkcontroller
    reuseMode: true
    serviceAccountName:
    taskRoles:
      - name: worker
        dockerImage: 'msranni/nni:latest'
        taskNumber:
        command:
        gpuNumber:
        cpuNumber:
        memorySize:
        framework_attempt_completion_policy:
          min_failed_task_count: 1
          minSucceedTaskCount: 1
    storage:
      storageType:
      azureAccount:
      azureShare:
      keyVaultName:
      keyVaultKey:
......@@ -56,7 +56,7 @@ jobs:
cd test
python3 nni_test/nnitest/generate_ts_config.py \
--ts frameworkcontroller \
--keyvault_vaultname $(keyvault_vaultname) \
--keyvault_vaultname $(keyvault_vaultname) \
--keyvault_name $(keyvault_name) \
--azs_account $(azs_account) \
--azs_share $(azs_share) \
......@@ -64,5 +64,5 @@ jobs:
--nni_manager_ip $(manager_ip) \
--reuse_mode True \
--config_version v2
python3 nni_test/nnitest/run_tests.py --config config/integration_tests.yml --ts frameworkcontroller --reuse_mode True --exclude multi-phase,multi-thread
python3 nni_test/nnitest/run_tests.py --config config/integration_tests_config_v2.yml --ts frameworkcontroller --reuse_mode True --exclude multi-phase,multi-thread
displayName: Integration test (reuse mode)
......@@ -68,7 +68,7 @@ jobs:
az login --service-principal -u $(client_id) -p $(client_secret) --tenant $(tenant_id)
python3 nni_test/nnitest/generate_ts_config.py \
--ts kubeflow \
--keyvault_vaultname $(keyvault_vaultname) \
--keyvault_vaultname $(keyvault_vaultname) \
--keyvault_name $(keyvault_name) \
--azs_account $(azs_account) \
--azs_share $(azs_share) \
......@@ -76,5 +76,5 @@ jobs:
--nni_manager_ip $(manager_ip) \
--reuse_mode True \
--config_version v2
python3 nni_test/nnitest/run_tests.py --config config/integration_tests.yml --ts kubeflow --reuse_mode True --exclude multi-phase,multi-thread
python3 nni_test/nnitest/run_tests.py --config config/integration_tests_config_v2.yml --ts kubeflow --reuse_mode True --exclude multi-phase,multi-thread
displayName: Integration test (reuse mode)
......@@ -13,7 +13,7 @@ hybrid:
workspaceName:
computeTarget:
kubeflow:
trialGpuNumber: 0
trialGpuNumber: 1
trialConcurrency: 2
maxTrialNumber: 2
nniManagerIp:
......@@ -37,7 +37,7 @@ kubeflow:
keyVaultKey:
apiVersion: v1
frameworkcontroller:
trialGpuNumber: 0
trialGpuNumber: 1
trialConcurrency: 2
maxTrialNumber: 2
nniManagerIp:
......
......@@ -55,8 +55,8 @@ def update_training_service_config(args):
config[args.ts]['trainingService']['worker']['dockerImage'] = args.nni_docker_image
config[args.ts]['trainingService']['storage']['azureAccount'] = args.azs_account
config[args.ts]['trainingService']['storage']['azureShare'] = args.azs_share
config[args.ts]['trainingService']['storage']['keyVaultName'] = args.keyvault_name
config[args.ts]['trainingService']['storage']['keyVaultKey'] = args.keyvault_vaultname
config[args.ts]['trainingService']['storage']['keyVaultName'] = args.keyvault_vaultname
config[args.ts]['trainingService']['storage']['keyVaultKey'] = args.keyvault_name
config[args.ts]['nni_manager_ip'] = args.nni_manager_ip
dump_yml_content(TRAINING_SERVICE_FILE_V2, config)
elif args.ts == 'frameworkcontroller' and args.reuse_mode == 'False':
......@@ -79,8 +79,8 @@ def update_training_service_config(args):
config[args.ts]['trainingService']['taskRoles'][0]['dockerImage'] = args.nni_docker_image
config[args.ts]['trainingService']['storage']['azureAccount'] = args.azs_account
config[args.ts]['trainingService']['storage']['azureShare'] = args.azs_share
config[args.ts]['trainingService']['storage']['keyVaultName'] = args.keyvault_name
config[args.ts]['trainingService']['storage']['keyVaultKey'] = args.keyvault_vaultname
config[args.ts]['trainingService']['storage']['keyVaultName'] = args.keyvault_vaultname
config[args.ts]['trainingService']['storage']['keyVaultKey'] = args.keyvault_name
config[args.ts]['nni_manager_ip'] = args.nni_manager_ip
dump_yml_content(TRAINING_SERVICE_FILE_V2, config)
elif args.ts == 'remote':
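The two hunks above swap which command-line argument feeds `keyVaultName` and which feeds `keyVaultKey`. The sketch below isolates that corrected mapping so it can be checked on its own. The function `apply_storage_args` and all argument values are hypothetical placeholders; only the attribute names and config keys come from the diff.

```python
from argparse import Namespace


def apply_storage_args(storage, args):
    """Copy Azure storage settings from parsed CLI args into a `storage` dict.

    Hypothetical helper mirroring the corrected assignments in the hunks:
    `--keyvault_vaultname` feeds `keyVaultName` and `--keyvault_name`
    feeds `keyVaultKey`.
    """
    storage["azureAccount"] = args.azs_account
    storage["azureShare"] = args.azs_share
    storage["keyVaultName"] = args.keyvault_vaultname
    storage["keyVaultKey"] = args.keyvault_name
    return storage


# Placeholder values standing in for the pipeline variables.
args = Namespace(
    azs_account="example-account",
    azs_share="example-share",
    keyvault_vaultname="example-vault",
    keyvault_name="example-key",
)
storage = apply_storage_args({}, args)
print(storage["keyVaultName"], storage["keyVaultKey"])
```

Keeping the four assignments in one helper would also avoid repeating them in both the kubeflow and frameworkcontroller branches, though the commit itself keeps them inline.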
......
......@@ -25,7 +25,6 @@ it_variables = {}
def update_training_service_config(config, training_service, config_file_path, nni_source_dir, reuse_mode='False'):
it_ts_config = get_yml_content(os.path.join('config', 'training_service.yml'))
# hack for kubeflow trial config
if training_service == 'kubeflow' and reuse_mode == 'False':
it_ts_config[training_service]['trial']['worker']['command'] = config['trial']['command']
......@@ -34,16 +33,18 @@ def update_training_service_config(config, training_service, config_file_path, n
config['trial'].pop('gpuNum')
elif training_service == 'kubeflow' and reuse_mode == 'True':
it_ts_config = get_yml_content(os.path.join('config', 'training_service_v2.yml'))
it_ts_config['trainingService']['worker']['command'] = config['trialCommand']
print(it_ts_config)
it_ts_config[training_service]['trainingService']['worker']['command'] = config['trialCommand']
it_ts_config[training_service]['trainingService']['worker']['code_directory'] = config['trialCodeDirectory']
if training_service == 'frameworkcontroller' and reuse_mode == 'False':
if training_service == 'frameworkcontroller' and reuse_mode == 'False':
it_ts_config[training_service]['trial']['taskRoles'][0]['command'] = config['trial']['command']
config['trial'].pop('command')
if 'gpuNum' in config['trial']:
config['trial'].pop('gpuNum')
elif training_service == 'frameworkcontroller' and reuse_mode == 'True':
elif training_service == 'frameworkcontroller' and reuse_mode == 'True':
it_ts_config = get_yml_content(os.path.join('config', 'training_service_v2.yml'))
it_ts_config['trainingService']['taskRoles'][0]['command'] = config['trialCommand']
it_ts_config[training_service]['trainingService']['taskRoles'][0]['command'] = config['trialCommand']
if training_service == 'adl':
# hack for adl trial config, codeDir in adl mode refers to path in container
......@@ -74,7 +75,7 @@ def update_training_service_config(config, training_service, config_file_path, n
if training_service == 'hybrid':
it_ts_config = get_yml_content(os.path.join('config', 'training_service_v2.yml'))
else:
elif reuse_mode != 'True':
deep_update(config, it_ts_config['all'])
deep_update(config, it_ts_config[training_service])
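The reuse-mode changes above index `training_service_v2.yml` by training service name before reaching `trainingService`, then merge the selected section into the test config. A minimal sketch of that lookup and merge follows. The `deep_update` body here is an assumed recursive-merge implementation written for illustration (the real helper lives in the NNI test utilities and may differ), and the dictionaries and the trial command are placeholders, not the contents of the actual YAML files.

```python
import copy


def deep_update(base, override):
    """Recursively merge `override` into `base` in place (assumed semantics)."""
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], value)
        else:
            # Copy so later edits to `override` do not leak into `base`.
            base[key] = copy.deepcopy(value)
    return base


# Placeholder stand-in for config/training_service_v2.yml: one top-level
# section per training service, each holding a `trainingService` mapping.
it_ts_config = {
    "kubeflow": {"trainingService": {"worker": {"command": None, "replicas": 1}}},
    "frameworkcontroller": {"trainingService": {"taskRoles": [{"command": None}]}},
}
config = {"trialCommand": "python3 mnist.py", "trialConcurrency": 2}

training_service = "kubeflow"
# Index by training service first, as the updated lines do.
worker = it_ts_config[training_service]["trainingService"]["worker"]
worker["command"] = config["trialCommand"]
deep_update(config, it_ts_config[training_service])
print(config["trainingService"]["worker"])
```

With the pre-change lookup (`it_ts_config['trainingService']`), this placeholder layout would raise `KeyError`, which is consistent with why the extra `[training_service]` index was added.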
......