**Run an Experiment on FrameworkController**
===
NNI supports running experiments on [FrameworkController](https://github.com/Microsoft/frameworkcontroller); this is called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, so you don't need to install Kubeflow operators for a specific deep learning framework, such as tf-operator or pytorch-operator. You can use FrameworkController as the training service to run NNI experiments.

## Prerequisite for on-premises Kubernetes Service
1. A **Kubernetes** cluster running Kubernetes 1.8 or later. Follow this [guideline](https://kubernetes.io/docs/setup/) to set up Kubernetes.
2. Prepare a **kubeconfig** file, which NNI uses to interact with your Kubernetes API server. By default, NNI manager uses `$(HOME)/.kube/config` as the kubeconfig file's path. You can also specify another kubeconfig file by setting the **KUBECONFIG** environment variable. Refer to this [guideline](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
3. If your NNI trial jobs need GPU resources, follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure the **Nvidia device plugin for Kubernetes**.
4. Prepare an **NFS server** and export a general-purpose mount, or use **Azure File Storage**. We recommend exporting your NFS server path with the `root_squash` option; otherwise permission issues may arise when NNI copies files to NFS. Refer to this [page](https://linux.die.net/man/5/exports) to learn what the `root_squash` option is.
5. Install an **NFS client** on the machine where you install NNI and run nnictl to create experiments. Run this command to install the NFSv4 client:
    ```
    apt-get install nfs-common 
    ```     

6. Install **NNI** by following the install guide [here](GetStarted.md).
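
The kubeconfig lookup described in step 2 can be sketched in a few lines of Python. This is only an illustration of the stated rule (use `KUBECONFIG` if it is set, otherwise fall back to `~/.kube/config`), not NNI's actual implementation; the function name `resolve_kubeconfig` is hypothetical, and the sketch assumes `KUBECONFIG` holds a single path.

```python
import os
from pathlib import Path


def resolve_kubeconfig(env=None):
    """Return the kubeconfig path implied by the rule in step 2.

    Hypothetical helper for illustration; NNI may resolve the path differently.
    """
    env = os.environ if env is None else env
    override = env.get("KUBECONFIG")
    if override:
        # Assumption: KUBECONFIG names exactly one file.
        return Path(override).expanduser()
    # Default location mentioned in the prerequisites.
    return Path.home() / ".kube" / "config"


if __name__ == "__main__":
    path = resolve_kubeconfig()
    print(f"kubeconfig: {path} (exists: {path.is_file()})")
```

Running it before creating an experiment is a quick way to confirm which file would be picked up on the current machine.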

## Prerequisite for Azure Kubernetes Service
1. NNI supports FrameworkController on Azure Kubernetes Service (AKS). Follow the [guideline](https://azure.microsoft.com/en-us/services/kubernetes-service/) to set up Azure Kubernetes Service.
2. Install [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) and __kubectl__. Use `az login` to set your Azure account, then connect the kubectl client to AKS by following this [guideline](https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster).
3. Follow the [guideline](https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal) to create an Azure file storage account. If you use Azure Kubernetes Service, NNI needs Azure Storage Service to store code files and output files.
4. To access Azure Storage Service, NNI needs the access key of the storage account, and NNI uses the [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) Service to protect your private key. Set up Azure Key Vault Service and add a secret to Key Vault to store the access key of the Azure storage account. Follow this [guideline](https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli) to store the access key.


## Set up FrameworkController
Follow the [guideline](https://github.com/Microsoft/frameworkcontroller/tree/master/example/run) to set up FrameworkController in the Kubernetes cluster. NNI supports FrameworkController in statefulset mode.

## Design
Please refer to the design of the [kubeflow training service](./KubeflowMode.md); the frameworkcontroller training service pipeline is similar.

## Example

The frameworkcontroller config file format is:
```
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 10h
maxTrialNum: 100
#choice: local, remote, pai, kubeflow, frameworkcontroller
trainingServicePlatform: frameworkcontroller
searchSpacePath: ~/nni/examples/trials/mnist/search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
  gpuNum: 0
trial:
  codeDir: ~/nni/examples/trials/mnist
  taskRoles:
    - name: worker
      taskNum: 1
      command: python3 mnist.py
      gpuNum: 1
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 1
        minSucceededTaskCount: 1
frameworkcontrollerConfig:
  storage: nfs
  nfs:
    server: {your_nfs_server}
    path: {your_nfs_server_exported_path}
```
If you use Azure Kubernetes Service, you should set `frameworkcontrollerConfig` in your config YAML file as follows:
```
frameworkcontrollerConfig:
  storage: azureStorage
  keyVault:
    vaultName: {your_vault_name}
    name: {your_secret_name}
  azureStorage:
    accountName: {your_storage_account_name}
    azureShare: {your_azure_share_name}
```
Note: You should explicitly set `trainingServicePlatform: frameworkcontroller` in the NNI config YAML file if you want to start an experiment in frameworkcontroller mode.
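
As a rough pre-flight check, the two storage layouts shown above can be validated before handing the file to NNI. The sketch below works on an already-parsed config (a plain `dict`) and only knows the keys that appear in this document; the function name `check_storage_config` is hypothetical, and whether NNI accepts other storage types or extra keys is not covered here.

```python
# Keys required for each storage type, taken from the two examples above.
REQUIRED_BY_STORAGE = {
    "nfs": {"nfs": ("server", "path")},
    "azureStorage": {
        "keyVault": ("vaultName", "name"),
        "azureStorage": ("accountName", "azureShare"),
    },
}


def check_storage_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if config.get("trainingServicePlatform") != "frameworkcontroller":
        problems.append("trainingServicePlatform must be 'frameworkcontroller'")

    fc = config.get("frameworkcontrollerConfig") or {}
    storage = fc.get("storage")
    required = REQUIRED_BY_STORAGE.get(storage)
    if required is None:
        problems.append(f"unknown storage type: {storage!r}")
        return problems

    for section, keys in required.items():
        block = fc.get(section) or {}
        for key in keys:
            if not block.get(key):
                problems.append(
                    f"frameworkcontrollerConfig.{section}.{key} is missing"
                )
    return problems


if __name__ == "__main__":
    example = {
        "trainingServicePlatform": "frameworkcontroller",
        "frameworkcontrollerConfig": {
            "storage": "nfs",
            "nfs": {"server": "10.0.0.1", "path": "/exports"},  # placeholder values
        },
    }
    print(check_storage_config(example) or "config looks consistent")
```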

The trial's config format for NNI frameworkcontroller mode is a simplified version of FrameworkController's official config. You can refer to the [tensorflow example of frameworkcontroller](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for a deeper understanding.

Trial configuration in frameworkcontroller mode has the following configuration keys:
* taskRoles: you can set multiple task roles in the config file; each task role is a basic unit to process in the Kubernetes cluster.
   * name: the name of the task role, such as "worker", "ps", or "master".
   * taskNum: the number of replicas of the task role.
   * command: the user's command to run in the container.
   * gpuNum: the number of GPU devices used in the container.
   * cpuNum: the number of CPU devices used in the container.
   * memoryMB: the memory limit of the container, in MB.
   * image: the docker image used to create the pod and run the program.
   * frameworkAttemptCompletionPolicy: the policy for running the framework; please refer to the [user-manual](https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md#frameworkattemptcompletionpolicy) for the specific information. Users can use the policy to control the pods. For example, if the ps does not stop and only the worker stops, the completion policy can help stop the ps.
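
The key list above can be turned into a small sanity check for a parsed `taskRoles` section. This is a minimal sketch, not NNI's own validation: it assumes every key except `frameworkAttemptCompletionPolicy` is required and that `taskNum` must be a positive integer, and the name `check_task_roles` is hypothetical.

```python
# Assumption: these keys are mandatory for every task role;
# frameworkAttemptCompletionPolicy is treated as optional here.
REQUIRED_KEYS = (
    "name", "taskNum", "command", "gpuNum", "cpuNum", "memoryMB", "image",
)


def check_task_roles(task_roles):
    """Return a list of problems found in a list of task-role dicts."""
    problems = []
    for index, role in enumerate(task_roles):
        label = role.get("name", f"#{index}")
        for key in REQUIRED_KEYS:
            if key not in role:
                problems.append(f"taskRole {label}: missing '{key}'")
        task_num = role.get("taskNum")
        if task_num is not None and (
            not isinstance(task_num, int) or task_num < 1
        ):
            problems.append(f"taskRole {label}: taskNum must be an integer >= 1")
    return problems


if __name__ == "__main__":
    worker = {
        "name": "worker",
        "taskNum": 1,
        "command": "python3 mnist.py",
        "gpuNum": 1,
        "cpuNum": 1,
        "memoryMB": 8192,
        "image": "msranni/nni:latest",
    }
    print(check_task_roles([worker]) or "taskRoles look complete")
```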

## How to run example
After you prepare a config file, you can run your experiment with nnictl. Starting an experiment on frameworkcontroller is similar to starting one on kubeflow; please refer to the [document](./KubeflowMode.md) for more information.
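
For scripted launches, the nnictl invocation can be assembled from Python. The sketch below only builds the argument list for `nnictl create` with its `--config` and `--port` options; the helper name `nnictl_create_command` and the config file name are hypothetical, and actually running the command is left commented out because it needs a working NNI installation and cluster.

```python
import shlex


def nnictl_create_command(config_path, port=None):
    """Build the argv for creating an experiment from a config file."""
    argv = ["nnictl", "create", "--config", str(config_path)]
    if port is not None:
        # Optional port for the NNI web UI / REST server.
        argv += ["--port", str(port)]
    return argv


if __name__ == "__main__":
    # Placeholder file name; point this at your own config file.
    argv = nnictl_create_command("config_frameworkcontroller.yml")
    print(" ".join(shlex.quote(arg) for arg in argv))
    # To launch for real (requires nnictl on PATH):
    # import subprocess
    # subprocess.run(argv, check=True)
```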