**Run an Experiment on FrameworkController**
===
NNI supports running experiments with [FrameworkController](https://github.com/Microsoft/frameworkcontroller); this is called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, and you can use it as a training service to run your NNI experiments.

## Set up Kubernetes Service and kubeconfig
FrameworkController mode has the same prerequisites as kubeflow mode, except that you do not need to install Kubeflow. Please refer to the [kubeflow mode document](./KubeflowMode.md) to set up your Kubernetes cluster and the other prerequisites for NNI.

## Set up FrameworkController
Follow the [guideline](https://github.com/Microsoft/frameworkcontroller/tree/master/example/run) to set up FrameworkController in the Kubernetes cluster. NNI supports FrameworkController in statefulset mode.

## Design
Please refer to the design of the [kubeflow training service](./KubeflowMode.md); the frameworkcontroller training service pipeline is similar to it.

## Example

The frameworkcontroller config file format is:
```
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 10h
maxTrialNum: 100
#choice: local, remote, pai, kubeflow, frameworkcontroller
trainingServicePlatform: frameworkcontroller
searchSpacePath: ~/nni/examples/trials/mnist/search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
  gpuNum: 0
trial:
  codeDir: ~/nni/examples/trials/mnist
  taskRoles:
    - name: worker
      taskNum: 1
      command: python3 mnist.py
      gpuNum: 1
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 1
        minSucceededTaskCount: 1
frameworkcontrollerConfig:
  storage: nfs
  nfs:
    server: {your_nfs_server}
    path: {your_nfs_server_exported_path}
```
If you use Azure Kubernetes Service, you should set `frameworkcontrollerConfig` in your config YAML file as follows:
```
frameworkcontrollerConfig:
  storage: azureStorage
  keyVault:
    vaultName: {your_vault_name}
    name: {your_secret_name}
  azureStorage:
    accountName: {your_storage_account_name}
    azureShare: {your_azure_share_name}
```
Note: You should explicitly set `trainingServicePlatform: frameworkcontroller` in the NNI config YAML file if you want to start an experiment in frameworkcontroller mode.

The trial config format for NNI frameworkcontroller mode is a simplified version of FrameworkController's official config. You can refer to the [tensorflow example of frameworkcontroller](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for a deeper understanding.  
Trial configuration in frameworkcontroller mode has the following configuration keys:
* taskRoles: you can set multiple task roles in the config file; each task role is a basic unit processed in the Kubernetes cluster.
   * name: the name of the task role, such as "worker", "ps", or "master".
   * taskNum: the number of replicas of the task role.
   * command: the user's command to run in the container.
   * gpuNum: the number of GPU devices used in the container.
   * cpuNum: the number of CPU devices used in the container.
   * memoryMB: the memory limit of the container, in MB.
   * image: the Docker image used to create the pod and run the program.
   * frameworkAttemptCompletionPolicy: the policy for running the framework; please refer to the [user manual](https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md) for details.
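
As an illustration of multiple task roles, the following is a minimal sketch of a `trial` section with one `ps` replica and two `worker` replicas. The script name `dist_train.py`, the `codeDir` placeholder, and the resource numbers are assumptions made for illustration and are not taken from an NNI example. The completion-policy values simply repeat those in the config above; a real distributed job may need different values, so check the FrameworkController user manual before relying on them.

```yaml
trial:
  # Placeholder path; point this at your own trial code.
  codeDir: {your_trial_code_dir}
  taskRoles:
    # Parameter-server role: one replica, no GPU (assumed values).
    - name: ps
      taskNum: 1
      command: python3 dist_train.py   # hypothetical script name
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 1
        minSucceededTaskCount: 1
    # Worker role: two replicas, one GPU each (assumed values).
    - name: worker
      taskNum: 2
      command: python3 dist_train.py   # hypothetical script name
      gpuNum: 1
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 1
        minSucceededTaskCount: 1
```

Each entry under `taskRoles` carries its own command, resources, and image, so roles can be sized independently.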

## How to run example
After you prepare a config file, you can run your experiment with nnictl, for example `nnictl create --config {path_to_your_config_file}`. Starting an experiment on frameworkcontroller is similar to starting one on kubeflow; please refer to the [kubeflow mode document](./KubeflowMode.md) for more information.