FrameworkControllerMode.md 6.67 KB
Newer Older
1
2
# Run an Experiment on FrameworkController

3
===
4
NNI supports running experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.
5
6

## Prerequisite for on-premises Kubernetes Service
7

8
1. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this [guideline](https://kubernetes.io/docs/setup/) to set up Kubernetes
9
2. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer this [guideline]( https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
10
3. If your NNI trial job needs GPU resource, you should follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure **Nvidia device plugin for Kubernetes**.
11
4. Prepare a **NFS server** and export a general purpose mount (we recommend to map your NFS server path in `root_squash option`, otherwise permission issue may raise when NNI copies files to NFS. Refer this [page](https://linux.die.net/man/5/exports) to learn what root_squash option is), or **Azure File Storage**.
12
5. Install **NFS client** on the machine where you install NNI and run nnictl to create experiment. Run this command to install NFSv4 client:
13
14

    ```bash
15
16
    apt-get install nfs-common
    ```
17

xuehui's avatar
xuehui committed
18
6. Install **NNI**, follow the install guide [here](../Tutorial/QuickStart.md).
19
20

## Prerequisite for Azure Kubernetes Service
21
22

1. NNI support Kubeflow based on Azure Kubernetes Service, follow the [guideline](https://azure.microsoft.com/en-us/services/kubernetes-service/) to set up Azure Kubernetes Service.
23
2. Install [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) and __kubectl__.  Use `az login` to set azure account, and connect kubectl client to AKS, refer this [guideline](https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster).
24
25
3. Follow the [guideline](https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal) to create azure file storage account. If you use Azure Kubernetes Service, NNI need Azure Storage Service to store code files and the output files.
4. To access Azure storage service, NNI need the access key of the storage account, and NNI uses [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) Service to protect your private key. Set up Azure Key Vault Service, add a secret to Key Vault to store the access key of Azure storage account. Follow this [guideline](https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli) to store the access key.
26

27
## Setup FrameworkController
28

29
Follow the [guideline](https://github.com/Microsoft/frameworkcontroller/tree/master/example/run) to set up FrameworkController in the Kubernetes cluster, NNI supports FrameworkController by the stateful set mode.
30
31

## Design
32

xuehui's avatar
xuehui committed
33
Please refer the design of [Kubeflow training service](KubeflowMode.md), FrameworkController training service pipeline is similar.
34
35
36

## Example

37
38
39
The FrameworkController config file format is:

```yaml
40
41
42
43
44
45
46
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 10h
maxTrialNum: 100
#choice: local, remote, pai, kubeflow, frameworkcontroller
trainingServicePlatform: frameworkcontroller
47
searchSpacePath: ~/nni/examples/trials/mnist-tfv1/search_space.json
48
49
50
51
52
53
54
55
56
57
58
59
60
61
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
  gpuNum: 0
trial:
62
  codeDir: ~/nni/examples/trials/mnist-tfv1
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
  taskRoles:
    - name: worker
      taskNum: 1
      command: python3 mnist.py
      gpuNum: 1
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 1
        minSucceededTaskCount: 1
frameworkcontrollerConfig:
  storage: nfs
  nfs:
    server: {your_nfs_server}
    path: {your_nfs_server_exported_path}
```
80

81
If you use Azure Kubernetes Service, you should  set `frameworkcontrollerConfig` in your config YAML file as follows:
82
83

```yaml
84
85
86
87
88
89
90
91
92
frameworkcontrollerConfig:
  storage: azureStorage
  keyVault:
    vaultName: {your_vault_name}
    name: {your_secert_name}
  azureStorage:
    accountName: {your_storage_account_name}
    azureShare: {your_azure_share_name}
```
93

94
Note: You should explicitly set `trainingServicePlatform: frameworkcontroller` in NNI config YAML file if you want to start experiment in frameworkcontrollerConfig mode.
95

96
97
The trial's config format for NNI frameworkcontroller mode is a simple version of FrameworkController's official config, you could refer the [Tensorflow example of FrameworkController](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for deep understanding.

98
Trial configuration in frameworkcontroller mode have the following configuration keys:
99
100
101
102
103
104
105
106
107
108

* taskRoles: you could set multiple task roles in config file, and each task role is a basic unit to process in Kubernetes cluster.
  * name: the name of task role specified, like "worker", "ps", "master".
  * taskNum: the replica number of the task role.
  * command: the users' command to be used in the container.
  * gpuNum: the number of gpu device used in container.
  * cpuNum: the number of cpu device used in container.
  * memoryMB: the memory limitaion to be specified in container.
  * image: the docker image used to create pod and run the program.
  * frameworkAttemptCompletionPolicy: the policy to run framework, please refer the [user-manual](https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md#frameworkattemptcompletionpolicy) to get the specific information. Users could use the policy to control the pod, for example, if ps does not stop, only worker stops, The completion policy could helps stop ps.
109
110

## How to run example
111

xuehui's avatar
xuehui committed
112
After you prepare a config file, you could run your experiment by nnictl. The way to start an experiment on FrameworkController is similar to Kubeflow, please refer the [document](KubeflowMode.md) for more information.
113
114

## version check
115
116

NNI support version check feature in since version 0.6, [refer](PaiMode.md)