NNI supports running experiments using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, and you have to set up a Kubernetes cluster before using frameworkcontroller mode.
## Set up Kubernetes Service and kubeconfig
FrameworkController mode has the same prerequisites as kubeflow mode, except that you don't need to install Kubeflow. Please refer to the [document](./KubeflowMode.md) to set up your Kubernetes cluster and the other prerequisites for NNI.
## Set up FrameworkController
Follow the [guideline](https://github.com/Microsoft/frameworkcontroller/tree/master/example/run) to set up FrameworkController in the Kubernetes cluster. NNI supports FrameworkController deployed in statefulset mode.
## Design
Please refer to the design of the [kubeflow training service](./KubeflowMode.md); the frameworkcontroller training service pipeline is similar to that of the kubeflow training service.
If you use Azure Kubernetes Service, you should set `frameworkcontrollerConfig` in your config YAML file as follows:
```yaml
frameworkcontrollerConfig:
  storage: azureStorage
  keyVault:
    vaultName: {your_vault_name}
    name: {your_secret_name}
  azureStorage:
    accountName: {your_storage_account_name}
    azureShare: {your_azure_share_name}
```
Note: You should explicitly set `trainingServicePlatform: frameworkcontroller` in the NNI config YAML file if you want to start an experiment in frameworkcontroller mode.
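For orientation, here is a minimal sketch of an experiment config skeleton with the platform set to frameworkcontroller. The experiment-level values (author name, tuner, durations, and so on) are illustrative placeholders rather than required settings:

```yaml
authorName: default
experimentName: example_frameworkcontroller
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
# explicitly select the frameworkcontroller training service
trainingServicePlatform: frameworkcontroller
searchSpacePath: search_space.json
useAnnotation: false
tuner:
  builtinTunerName: TPE
trial:
  codeDir: .
  # taskRoles go here; see the trial configuration keys below
frameworkcontrollerConfig:
  storage: azureStorage
  keyVault:
    vaultName: {your_vault_name}
    name: {your_secret_name}
  azureStorage:
    accountName: {your_storage_account_name}
    azureShare: {your_azure_share_name}
```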
The trial's config format for NNI frameworkcontroller mode is a simplified version of FrameworkController's official config. You could refer to the [tensorflow example](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for a deeper understanding.
Trial configuration in frameworkcontroller mode has the following configuration keys (a sketch of the `trial` section is shown after the list):
* taskRoles: you could set multiple task roles in the config file, and each task role is a basic unit to process in the Kubernetes cluster.
    * name: the name of the task role, for example "worker", "ps", or "master".
    * taskNum: the replica number of the task role.
    * command: the user's command to be run in the container.
    * gpuNum: the number of GPU devices used by the container.
    * cpuNum: the number of CPU devices used by the container.
    * memoryMB: the memory limitation to be specified for the container.
    * image: the docker image used to create the pod and run the program.
    * frameworkAttemptCompletionPolicy: the policy to run the framework; please refer to the [user-manual](https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md) for specific information.
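Putting these keys together, a minimal sketch of the `trial` section could look like the following. The values (image name, command, resource sizes) are illustrative, and the completion-policy fields (`minFailedTaskCount`, `minSucceededTaskCount`) are explained in the FrameworkController user-manual linked above:

```yaml
trial:
  codeDir: .
  taskRoles:
    - name: worker
      taskNum: 1
      command: python3 mnist.py
      gpuNum: 1
      cpuNum: 2
      memoryMB: 8192
      image: msranni/nni
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 1
        minSucceededTaskCount: 1
```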
## How to run an example
After you prepare a config file, you could run your experiment with nnictl. The way to start an experiment in frameworkcontroller mode is similar to that in kubeflow mode; please refer to the [document](./KubeflowMode.md) for more information.
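For example, assuming the config file is saved as `exp_frameworkcontroller.yml` (the file name is just an example), the experiment can be started with:

```bash
nnictl create --config exp_frameworkcontroller.yml
```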