Commit ff834cea authored by fishyds, committed by Yan Ni

[V0.4.1 Release] Merge v0.4.1 branch back to Master (#509)

* Update nnictl.py

Fix the issue that `nnictl --version` doesn't work when NNI is installed via pip

* Update kubeflow training service document (#494)

* Remove kubectl related document, add messages for kubeconfig
* Add design section for kubeflow training service
* Move the image files for PAI training service doc into img folder.

* Update KubeflowMode.md (#498)

Update KubeflowMode.md, small terms change

* [V0.4.1 bug fix] Cannot run kubeflow training service due to trial_keeper change (#503)

* Update kubeflow training service document

* Fix a bug that kubeflow trial jobs cannot run

* upgrade version number (#499)

* [V0.4.1 bug fix] Support read K8S config from KUBECONFIG environment variable (#507)

* Add KUBECONFIG env variable support

* In main.ts, throw caught error to make sure nnictl can show the error in stderr
parent 102faea1
@@ -39,7 +39,7 @@ The tool dispatches and runs trial jobs generated by tuning algorithms to search
 * We support Linux (Ubuntu 16.04 or higher), MacOS (10.14.1) in our current stage.
 * Run the following commands in an environment that has `python >= 3.5`, `git` and `wget`.
 ```bash
-git clone -b v0.4 https://github.com/Microsoft/nni.git
+git clone -b v0.4.1 https://github.com/Microsoft/nni.git
 cd nni
 source install.sh
 ```
@@ -51,7 +51,7 @@ For the system requirements of NNI, please refer to [Install NNI](docs/Installat
 The following example is an experiment built on TensorFlow. Make sure you have **TensorFlow installed** before running it.
 * Download the examples via clone the source code.
 ```bash
-git clone -b v0.4 https://github.com/Microsoft/nni.git
+git clone -b v0.4.1 https://github.com/Microsoft/nni.git
 ```
 * Run the mnist example.
 ```bash
......
@@ -18,7 +18,7 @@
 * __Install NNI through source code__
-git clone -b v0.4 https://github.com/Microsoft/nni.git
+git clone -b v0.4.1 https://github.com/Microsoft/nni.git
 cd nni
 source install.sh
......
@@ -18,7 +18,7 @@ Currently we only support installation on Linux & Mac.
 * __Install NNI through source code__
-git clone -b v0.4 https://github.com/Microsoft/nni.git
+git clone -b v0.4.1 https://github.com/Microsoft/nni.git
 cd nni
 source install.sh
......
 **Run an Experiment on Kubeflow**
 ===
-Now NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a kubernetes cluster, either on-prem or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), a Ubuntu machine on which [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) is installed and configured to connect to your kubernetes cluster. If you are not familiar with kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a goot start. In kubeflow mode, your trial program will run as kubeflow job in kubernetes cluster.
+Now NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a kubernetes cluster, either on-prem or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), and an Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is set up to connect to your kubernetes cluster. If you are not familiar with kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as a kubeflow job in your kubernetes cluster.
 ## Prerequisite for on-premises Kubernetes Service
 1. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this [guideline](https://kubernetes.io/docs/setup/) to set up Kubernetes
 2. Download, set up, and deploy **Kubeflow** to your Kubernetes cluster. Follow this [guideline](https://www.kubeflow.org/docs/started/getting-started/) to set up Kubeflow
-3. Install **kubectl**, and configure to connect to your Kubernetes API server. Follow this [guideline](https://kubernetes.io/docs/tasks/tools/install-kubectl/) to install kubectl on Ubuntu
+3. Prepare a **kubeconfig** file, which will be used by NNI to interact with your kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as the kubeconfig file's path. You can also specify another kubeconfig file by setting the **KUBECONFIG** environment variable. Refer to this [guideline](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
 4. If your NNI trial job needs GPU resource, you should follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure the **Nvidia device plugin for Kubernetes**.
-5. Install **NFS server** and export a general purpose mount (we recommend to map your NFS server path in `root_squash option`, otherwise permission issue may raise when nni copy files to NFS. Refer this [page](https://linux.die.net/man/5/exports) to learn what root_squash option is), or **Azure File Storage**.
+5. Prepare an **NFS server** and export a general purpose mount (we recommend mapping your NFS server path with the `root_squash` option, otherwise permission issues may arise when nni copies files to NFS. Refer to this [page](https://linux.die.net/man/5/exports) to learn what the root_squash option is), or **Azure File Storage**.
 6. Install the **NFS client** on the machine where you install NNI and run nnictl to create experiments. Run this command to install the NFSv4 client:
 ```
 apt-get install nfs-common
@@ -17,13 +17,16 @@ Now NNI supports running experiment on [Kubeflow](https://github.com/kubeflow/ku
 ## Prerequisite for Azure Kubernetes Service
 1. NNI supports kubeflow based on Azure Kubernetes Service; follow the [guideline](https://azure.microsoft.com/en-us/services/kubernetes-service/) to set up Azure Kubernetes Service.
-2. Install [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) and __kubectl__. Use `az login` to set azure account, and connect kubectl client to AKS, [refer](https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster).
+2. Install [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) and __kubectl__. Use `az login` to set the azure account, and connect the kubectl client to AKS; refer to this [guideline](https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster).
 3. Deploy kubeflow on Azure Kubernetes Service, following the [guideline](https://www.kubeflow.org/docs/started/getting-started/).
 4. Follow the [guideline](https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal) to create an azure file storage account. If you use Azure Kubernetes Service, nni needs the Azure Storage Service to store code files and output files.
 5. To access the Azure storage service, nni needs the access key of the storage account, and nni uses the [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) Service to protect your private key. Set up the Azure Key Vault Service and add a secret to Key Vault to store the access key of the Azure storage account. Follow this [guideline](https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli) to store the access key.
 ## Design
-TODO
+![](./img/kubeflow_training_design.png)
+Kubeflow training service instantiates a kubernetes rest client to interact with your K8s cluster's API server.
+For each trial, it uploads all the files in your local codeDir path (configured in nni_config.yaml), together with NNI generated files like parameter.cfg, onto a storage volume. Right now we support two kinds of storage volumes: [nfs](https://en.wikipedia.org/wiki/Network_File_System) and [azure file storage](https://azure.microsoft.com/en-us/services/storage/files/); you should configure the storage volume in the nni config yaml file. After the files are prepared, Kubeflow training service calls the K8S rest API to create a kubeflow job ([tf-operator](https://github.com/kubeflow/tf-operator) job or [pytorch-operator](https://github.com/kubeflow/pytorch-operator) job) in K8S, and mounts your storage volume into the job's pod. Output files of the kubeflow job, like stdout, stderr, trial.log or model files, are also copied back to the storage volume. NNI shows the storage volume's URL for each trial in the WebUI, to allow users to browse the log files and the job's output files.
 ## Run an experiment
 Use `examples/trials/mnist` as an example. The nni config yaml file's content is like:
......
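The Design paragraphs above describe mounting the configured storage volume (NFS or Azure File) into the kubeflow job's pod so that trial code and output files are shared. As a rough illustration of what such a mount looks like at the Kubernetes pod-spec level, here is a minimal TypeScript sketch; the field names follow the standard Kubernetes pod spec, while the server address, image, and mount path are hypothetical, and the exact spec NNI generates may differ.

```typescript
// Illustrative only: a simplified pod-spec fragment with an NFS storage volume
// mounted into a trial container. Values are hypothetical placeholders; NNI's
// generated kubeflow job spec may differ from this shape.
interface NfsPodSpecSketch {
    volumes: { name: string; nfs: { server: string; path: string } }[];
    containers: {
        name: string;
        image: string;
        volumeMounts: { name: string; mountPath: string }[];
    }[];
}

const trialPodSketch: NfsPodSpecSketch = {
    volumes: [
        // the NFS export you configured for the experiment
        { name: 'nni-storage', nfs: { server: '10.0.0.4', path: '/exports/nni' } }
    ],
    containers: [
        {
            name: 'trial',
            image: 'your-trial-image:latest',   // hypothetical image name
            // trial code, parameter.cfg and output files live under this mount
            volumeMounts: [{ name: 'nni-storage', mountPath: '/tmp/mount' }]
        }
    ]
};

console.log(JSON.stringify(trialPodSketch, null, 2));
```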
@@ -60,17 +60,17 @@ nnictl create --config exp_pai.yaml
 ```
 to start the experiment in pai mode. NNI will create an OpenPAI job for each trial, and the job name format is something like `nni_exp_{experiment_id}_trial_{trial_id}`.
 You can see the pai jobs created by NNI in your OpenPAI cluster's web portal, like:
-![](./nni_pai_joblist.jpg)
+![](./img/nni_pai_joblist.jpg)
 Notice: In pai mode, NNIManager will start a rest server and listen on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is `8080`, the rest server will listen on `8081` to receive metrics from running trial jobs. So you should `enable 8081` TCP port in your firewall rule to allow incoming traffic.
 Once a trial job is completed, you can go to NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.
 Expand a trial's information in the trial list view and click the logPath link like:
-![](./nni_webui_joblist.jpg)
+![](./img/nni_webui_joblist.jpg)
 And you will be redirected to the HDFS web portal to browse the output files of that trial in HDFS:
-![](./nni_trial_hdfs_output.jpg)
+![](./img/nni_trial_hdfs_output.jpg)
 You can see there are three files in the output folder: stderr, stdout, and trial.log
......
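The PAI mode notes above give the job-name format and the rule that the metrics rest server listens on the WebUI port plus 1. A tiny TypeScript snippet to make those two conventions concrete; the IDs and port below are hypothetical examples, not values produced by NNI.

```typescript
// Illustrative only: the naming and port conventions described in the text above.
const experimentId: string = 'GExjoLRS';   // hypothetical experiment id
const trialId: string = 'hT3dE';           // hypothetical trial id
const webuiPort: number = 8080;

// job name format: nni_exp_{experiment_id}_trial_{trial_id}
const paiJobName: string = `nni_exp_${experimentId}_trial_${trialId}`;

// the rest server receiving trial metrics listens on the WebUI port + 1,
// so this TCP port must be allowed through the firewall
const metricsPort: number = webuiPort + 1;

console.log(paiJobName);    // nni_exp_GExjoLRS_trial_hT3dE
console.log(metricsPort);   // 8081
```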
@@ -109,6 +109,7 @@ mkDirP(getLogDir()).then(async () => {
         log.info(`Rest server listening on: ${restServer.endPoint}`);
     } catch (err) {
         log.error(`${err.stack}`);
+        throw err;
     }
 }).catch((err: Error) => {
     console.error(`Failed to create log dir: ${err.stack}`);
......
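The one-line change above rethrows the error after logging it, so the failure propagates to the trailing `.catch()` and is printed on stderr, which is what allows nnictl to show the error to the user. A self-contained TypeScript sketch of that pattern, using hypothetical stand-in functions rather than NNI's actual code:

```typescript
// Minimal sketch of the log-then-rethrow pattern introduced above: without the
// rethrow, the outer .catch() never fires and the failure is silently swallowed.
async function startRestServer(): Promise<void> {
    // hypothetical stand-in for the real server start-up
    throw new Error('port already in use');
}

async function main(): Promise<void> {
    try {
        await startRestServer();
        console.log('Rest server started');
    } catch (err) {
        console.log(`internal log: ${(err as Error).stack}`);
        throw err;   // rethrow so the caller's .catch() also sees the failure
    }
}

main().catch((err: Error) => {
    console.error(`Failed to start: ${err.stack}`);   // surfaces on stderr
    process.exit(1);
});
```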
@@ -36,7 +36,7 @@ class GeneralK8sClient {
     protected readonly log: Logger = getLogger();
     constructor() {
-        this.client = new K8SClient({ config: K8SConfig.fromKubeconfig(path.join(os.homedir(), '.kube', 'config')), version: '1.9'});
+        this.client = new K8SClient({ config: K8SConfig.fromKubeconfig(), version: '1.9'});
         this.client.loadSpec();
     }
@@ -58,7 +58,7 @@ abstract class KubeflowOperatorClient {
     protected crdSchema: any;
     constructor() {
-        this.client = new K8SClient({ config: K8SConfig.fromKubeconfig(path.join(os.homedir(), '.kube', 'config'))});
+        this.client = new K8SClient({ config: K8SConfig.fromKubeconfig() });
         this.client.loadSpec();
     }
......
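The two constructor changes above stop hard-coding `$HOME/.kube/config`, so that, per the commit message and prerequisite 3 of the Kubeflow document above, the kubeconfig path can also come from the `KUBECONFIG` environment variable. A minimal TypeScript sketch of that documented resolution order; the helper name is hypothetical and this is not NNI's implementation:

```typescript
// Illustrative sketch: prefer the KUBECONFIG environment variable, otherwise
// fall back to the default $HOME/.kube/config path.
import * as os from 'os';
import * as path from 'path';

function resolveKubeconfigPath(): string {   // hypothetical helper name
    const fromEnv: string | undefined = process.env.KUBECONFIG;
    if (fromEnv !== undefined && fromEnv.length > 0) {
        return fromEnv;
    }
    return path.join(os.homedir(), '.kube', 'config');
}

// e.g. run `export KUBECONFIG=/path/to/other/config` before starting nnictl
console.log(`kubeconfig in use: ${resolveKubeconfigPath()}`);
```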
@@ -30,7 +30,7 @@ from .tensorboard_utils import *
 def nni_info(*args):
     if args[0].version:
-        print(pkg_resources.get_distribution('nnictl').version)
+        print(pkg_resources.get_distribution('nni').version)
     else:
         print('please run "nnictl {positional argument} --help" to see nnictl guidance')
......
@@ -43,13 +43,13 @@ def main_loop(args):
     stdout_file = open(STDOUT_FULL_PATH, 'a+')
     stderr_file = open(STDERR_FULL_PATH, 'a+')
-    try:
-        hdfs_client = HdfsClient(hosts='{0}:{1}'.format(args.pai_hdfs_host, '50070'), user_name=args.pai_user_name, timeout=5)
-    except Exception as e:
-        nni_log(LogType.Error, 'Create HDFS client error: ' + str(e))
-        raise e
-    copyHdfsDirectoryToLocal(args.nni_hdfs_exp_dir, os.getcwd(), hdfs_client)
+    if args.pai_hdfs_host is not None and args.nni_hdfs_exp_dir is not None:
+        try:
+            hdfs_client = HdfsClient(hosts='{0}:{1}'.format(args.pai_hdfs_host, '50070'), user_name=args.pai_user_name, timeout=5)
+        except Exception as e:
+            nni_log(LogType.Error, 'Create HDFS client error: ' + str(e))
+            raise e
+        copyHdfsDirectoryToLocal(args.nni_hdfs_exp_dir, os.getcwd(), hdfs_client)
     # Notice: We don't appoint env, which means subprocess wil inherit current environment and that is expected behavior
     process = Popen(args.trial_command, shell = True, stdout = stdout_file, stderr = stderr_file)
@@ -62,7 +62,7 @@ def main_loop(args):
     if retCode is not None:
         nni_log(LogType.Info, 'subprocess terminated. Exit code is {}. Quit'.format(retCode))
-        if NNI_PLATFORM == 'pai':
+        if args.pai_hdfs_output_dir is not None:
             # Copy local directory to hdfs for OpenPAI
             nni_local_output_dir = os.environ['NNI_OUTPUT_DIR']
             try:
......