# Run an Experiment on Kubeflow


NNI supports running experiments on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or on [Azure Kubernetes Service (AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), and an Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is set up to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as a Kubeflow job in the Kubernetes cluster.

## Prerequisite for on-premises Kubernetes Service

1. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this [guideline](https://kubernetes.io/docs/setup/) to set up Kubernetes.
2. Download, set up, and deploy **Kubeflow** to your Kubernetes cluster. Follow this [guideline](https://www.kubeflow.org/docs/started/getting-started/) to set up Kubeflow.
3. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager uses `$(HOME)/.kube/config` as the kubeconfig file path. You can also specify another kubeconfig file by setting the **KUBECONFIG** environment variable. Refer to this [guideline](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
4. If your NNI trial job needs GPU resources, follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure the **Nvidia device plugin for Kubernetes**.
5. Prepare an **NFS server** and export a general purpose mount (we recommend configuring the `root_squash` option of your NFS export appropriately; otherwise permission issues may arise when NNI copies files to NFS. Refer to this [page](https://linux.die.net/man/5/exports) to learn what the root_squash option is), or **Azure File Storage**.
6. Install the **NFS client** on the machine where you install NNI and run nnictl to create the experiment. Run this command to install the NFSv4 client:
    ```bash
    apt-get install nfs-common
    ```
7. Install **NNI**, following the install guide [here](../Tutorial/QuickStart.md).
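
As an illustration of step 5 above, an `/etc/exports` entry on the NFS server that exports a general purpose mount might look like the following. This is only a sketch: the export path and the `no_root_squash` choice are assumptions you should adapt to your own security requirements.

```
/var/nfs/nni    *(rw,sync,no_subtree_check,no_root_squash)
```

After editing `/etc/exports`, run `exportfs -ra` on the NFS server to apply the change.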

## Prerequisite for Azure Kubernetes Service

1. NNI supports Kubeflow based on Azure Kubernetes Service. Follow the [guideline](https://azure.microsoft.com/en-us/services/kubernetes-service/) to set up Azure Kubernetes Service.
2. Install [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) and __kubectl__. Use `az login` to set your Azure account, and connect the kubectl client to AKS; refer to this [guideline](https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster).
3. Deploy Kubeflow on Azure Kubernetes Service, following the [guideline](https://www.kubeflow.org/docs/started/getting-started/).
4. Follow the [guideline](https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal) to create an Azure file storage account. If you use Azure Kubernetes Service, NNI needs an Azure storage service to store code files and output files.
5. To access the Azure storage service, NNI needs the access key of the storage account, and NNI uses the [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) service to protect your private key. Set up Azure Key Vault, then add a secret to the Key Vault to store the access key of the Azure storage account. Follow this [guideline](https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli) to store the access key.

## Design

![](../../img/kubeflow_training_design.png)

Kubeflow training service instantiates a Kubernetes rest client to interact with your Kubernetes cluster's API server.

For each trial, we will upload all the files in your local codeDir path (configured in nni_config.yml), together with NNI generated files like parameter.cfg, into a storage volume. Right now we support two kinds of storage volumes: [NFS](https://en.wikipedia.org/wiki/Network_File_System) and [Azure File Storage](https://azure.microsoft.com/en-us/services/storage/files/); you should configure the storage volume in the NNI config YAML file. After the files are prepared, Kubeflow training service will call the Kubernetes REST API to create Kubeflow jobs ([tf-operator](https://github.com/kubeflow/tf-operator) jobs or [pytorch-operator](https://github.com/kubeflow/pytorch-operator) jobs) in Kubernetes, and mount your storage volume into the job's pod. Output files of the Kubeflow job, like stdout, stderr, trial.log, or model files, will also be copied back to the storage volume. NNI shows the storage volume's URL for each trial in the WebUI, to allow users to browse the log files and the job's output files.

## Supported operator

NNI only supports the tf-operator and pytorch-operator of Kubeflow; other operators are not tested.
Users can set the operator type in the config file.
The setting for tf-operator:

```yaml
kubeflowConfig:
  operator: tf-operator
```

The setting for pytorch-operator:

```yaml
kubeflowConfig:
  operator: pytorch-operator
```

Users who want to use tf-operator can set `ps` and `worker` in the trial config; users who want to use pytorch-operator can set `master` and `worker` in the trial config.
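
For instance, the role sections under `trial` differ by operator; a schematic tf-operator fragment is sketched below (all values are placeholders, not a recommended configuration):

```yaml
trial:
  codeDir: .
  ps:        # tf-operator only; use `master` instead for pytorch-operator
    replicas: 1
    command: python3 dist_mnist.py
    gpuNum: 0
    cpuNum: 1
    memoryMB: 2048
    image: msranni/nni:latest
  worker:    # used by both operators
    replicas: 2
    command: python3 dist_mnist.py
    gpuNum: 1
    cpuNum: 1
    memoryMB: 2048
    image: msranni/nni:latest
```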

## Supported storage type

NNI supports NFS and Azure Storage to store code and output files. Users can set the storage type in the config file, along with the corresponding settings.

The setting for NFS storage is as follows:

```yaml
kubeflowConfig:
  storage: nfs
  nfs:
    # Your NFS server IP, like 10.10.10.10
    server: {your_nfs_server_ip}
    # Your NFS server export path, like /var/nfs/nni
    path: {your_nfs_server_export_path}
```

If you use Azure storage, you should set `kubeflowConfig` in your config YAML file as follows:

```yaml
kubeflowConfig:
  storage: azureStorage
  keyVault:
    vaultName: {your_vault_name}
    name: {your_secret_name}
  azureStorage:
    accountName: {your_storage_account_name}
    azureShare: {your_azure_share_name}
```

## Run an experiment

Use `examples/trials/mnist-tfv1` as an example. This is a TensorFlow job that uses the tf-operator of Kubeflow. The NNI config YAML file's content is as follows:

```yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 2
maxExecDuration: 1h
maxTrialNum: 20
#choice: local, remote, pai, kubeflow
trainingServicePlatform: kubeflow
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
  gpuNum: 0
trial:
  codeDir: .
  worker:
    replicas: 2
    command: python3 dist_mnist.py
    gpuNum: 1
    cpuNum: 1
    memoryMB: 8196
    image: msranni/nni:latest
  ps:
    replicas: 1
    command: python3 dist_mnist.py
    gpuNum: 0
    cpuNum: 1
    memoryMB: 8196
    image: msranni/nni:latest
kubeflowConfig:
  operator: tf-operator
  apiVersion: v1alpha2
  storage: nfs
  nfs:
    # Your NFS server IP, like 10.10.10.10
    server: {your_nfs_server_ip}
    # Your NFS server export path, like /var/nfs/nni
    path: {your_nfs_server_export_path}
```

Note: You should explicitly set `trainingServicePlatform: kubeflow` in the NNI config YAML file if you want to start an experiment in kubeflow mode.
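
The `searchSpacePath` in the config above points to a `search_space.json` file in the code directory. A minimal hypothetical example in NNI's search space format is shown below; the parameter names are illustrative only, not necessarily those consumed by `dist_mnist.py`:

```json
{
  "learning_rate": {"_type": "choice", "_value": [0.0001, 0.001, 0.01]},
  "batch_size": {"_type": "choice", "_value": [16, 32, 64]}
}
```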

If you want to run PyTorch jobs, you could set your config file as follows:

```yaml
authorName: default
experimentName: example_mnist_distributed_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: kubeflow
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: minimize
trial:
  codeDir: .
  master:
    replicas: 1
    command: python3 dist_mnist.py
    gpuNum: 1
    cpuNum: 1
    memoryMB: 2048
    image: msranni/nni:latest
  worker:
    replicas: 1
    command: python3 dist_mnist.py
    gpuNum: 0
    cpuNum: 1
    memoryMB: 2048
    image: msranni/nni:latest
kubeflowConfig:
  operator: pytorch-operator
  apiVersion: v1alpha2
  nfs:
    # Your NFS server IP, like 10.10.10.10
    server: {your_nfs_server_ip}
    # Your NFS server export path, like /var/nfs/nni
    path: {your_nfs_server_export_path}
```

Trial configuration in kubeflow mode has the following configuration keys:

* codeDir
  * The code directory, where you put your training code and config files
* worker (required). This config section is used to configure the TensorFlow worker role
  * replicas
    * Required key. Should be a positive number, depending on how many replicas you want to run for the TensorFlow worker role.
  * command
    * Required key. The command to launch your trial job, like ```python mnist.py```
  * memoryMB
    * Required key. Should be a positive number based on your trial program's memory requirement
  * cpuNum
  * gpuNum
  * image
    * Required key. In kubeflow mode, your trial program will be scheduled by Kubernetes to run in a [Pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/). This key is used to specify the Docker image used to create the pod in which your trial program will run.
    * We have already built a Docker image [msranni/nni](https://hub.docker.com/r/msranni/nni/) on [Docker Hub](https://hub.docker.com/). It contains NNI python packages, Node modules and javascript artifact files required to start an experiment, and all of NNI's dependencies. The Dockerfile used to build this image can be found [here](https://github.com/Microsoft/nni/tree/master/deployment/docker/Dockerfile). You can either use this image directly in your config file, or build your own image based on it.
  * privateRegistryAuthPath
    * Optional field. Specifies the `config.json` file path that holds an authorization token of the Docker registry, used to pull images from a private registry. [Refer](https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/).
  * apiVersion
    * Required key. The API version of your Kubeflow.
* ps (optional). This config section is used to configure the TensorFlow parameter server role.
* master (optional). This config section is used to configure the PyTorch master role.

Once you have finished filling in the NNI experiment config file, save it (for example, as exp_kubeflow.yml), then run the following command

```bash
nnictl create --config exp_kubeflow.yml
```

to start the experiment in kubeflow mode. NNI will create a Kubeflow tfjob or pytorchjob for each trial, and the job name format is like `nni_exp_{experiment_id}_trial_{trial_id}`.
You can see the Kubeflow tfjob created by NNI in your Kubernetes dashboard.
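
As a sketch of the naming scheme, the job name for a hypothetical experiment ID and trial ID can be derived as follows (the IDs below are made up; real ones are generated by NNI):

```shell
# Hypothetical IDs; NNI generates the real ones
EXP_ID=AbCdEf12
TRIAL_ID=gHiJk
printf 'nni_exp_%s_trial_%s\n' "$EXP_ID" "$TRIAL_ID"
# prints: nni_exp_AbCdEf12_trial_gHiJk
```

You can use this name pattern to locate a trial's job in the Kubernetes dashboard.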

Notice: In kubeflow mode, NNIManager starts a rest server and listens on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is `8080`, the rest server will listen on `8081` to receive metrics from trial jobs running in Kubernetes. So you should open TCP port `8081` in your firewall rules to allow incoming traffic.
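
The port relationship described above can be sketched as follows, assuming the default `8080` WebUI port:

```shell
WEBUI_PORT=8080
# The metrics rest server always listens on the WebUI port plus one
REST_PORT=$((WEBUI_PORT + 1))
echo "$REST_PORT"
# prints: 8081
```

You would then open that port in your firewall, e.g. `sudo ufw allow 8081/tcp` on an Ubuntu machine using ufw (an illustration; use whatever firewall tooling your environment provides).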

Once a trial job is completed, you can go to the NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.

## Version check

NNI has supported the version check feature since version 0.6; [refer to pai mode](PaiMode.md) for details.

If you encounter any problems when using NNI in kubeflow mode, please create an issue on the [NNI GitHub repo](https://github.com/Microsoft/nni).