**Run an Experiment on Azure Machine Learning**
===
NNI supports running an experiment on [AML](https://azure.microsoft.com/en-us/services/machine-learning/), called aml mode.
## Setup environment
Step 1. Install NNI, follow the install guide [here](../Tutorial/QuickStart.md).
Step 2. Create an Azure account/subscription using this [link](https://azure.microsoft.com/en-us/free/services/machine-learning/). If you already have an Azure account/subscription, skip this step.
Step 3. Install the Azure CLI on your machine, follow the install guide [here](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest).
Step 4. Authenticate to your Azure subscription from the CLI. To authenticate interactively, open a command line or terminal and use the following command:
```
az login
```
Step 5. Log into your Azure account with a web browser and create a Machine Learning resource. You will need to choose a resource group and specify a workspace name. Then download `config.json`, which will be used later.
![](../../img/aml_workspace.png)
Step 6. Create an AML cluster as the computeTarget.
![](../../img/aml_cluster.png)
Step 7. Open a command line and install the AML packages:
```
python3 -m pip install azureml
python3 -m pip install azureml-sdk
```
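You can optionally verify that the SDK is installed and can reach your workspace. A minimal sketch, assuming the `config.json` downloaded in Step 5 is placed in the current directory:
```python
# Optional check that the AML SDK can load your workspace (assumes ./config.json from Step 5).
from azureml.core import Workspace

ws = Workspace.from_config()  # reads config.json from the current directory by default
print(ws.name, ws.resource_group, ws.subscription_id)
```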
## Run an experiment
Use `examples/trials/mnist-tfv1` as an example. The NNI config YAML file's content is as follows:
```yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
trainingServicePlatform: aml
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  image: msranni/nni
  gpuNum: 1
amlConfig:
  subscriptionId: ${replace_to_your_subscriptionId}
  resourceGroup: ${replace_to_your_resourceGroup}
  workspaceName: ${replace_to_your_workspaceName}
  computeTarget: ${replace_to_your_computeTarget}
```
Note: You should set `trainingServicePlatform: aml` in the NNI config YAML file if you want to start an experiment in aml mode.
Compared with [LocalMode](LocalMode.md), the trial configuration in aml mode has these additional keys:
* image
* required key. The Docker image name used in the job. NNI supports the image `msranni/nni` for running aml jobs.
```
Note: This image is built on a CUDA environment and may not be suitable for CPU clusters in AML.
```
amlConfig:
* subscriptionId
* required key, the subscriptionId of your account
* resourceGroup
* required key, the resourceGroup of your account
* workspaceName
* required key, the workspaceName of your account
* computeTarget
* required key, the compute cluster name you want to use in your AML workspace ([refer](https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target)). See Step 6.
* maxTrialNumPerGpu
* optional key, default 1. Used to specify the maximum number of concurrent trials on one GPU device.
* useActiveGpu
* optional key, default false. Used to specify whether to use a GPU when another process is already running on it. By default, NNI will use a GPU only if there is no other active process on it.
The required amlConfig values can be found in the `config.json` downloaded in Step 5.
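For example, a small sketch that reads the downloaded file and prints the values to copy into `amlConfig` (it assumes the usual key names found in an AML workspace `config.json`):
```python
import json

# Assumes the standard AML workspace config.json key names (an assumption, verify against your file).
with open("config.json") as f:
    cfg = json.load(f)

print("subscriptionId:", cfg["subscription_id"])
print("resourceGroup: ", cfg["resource_group"])
print("workspaceName: ", cfg["workspace_name"])
```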
Run the following commands to start the example experiment:
```
git clone -b ${NNI_VERSION} https://github.com/microsoft/nni
cd nni/examples/trials/mnist-tfv1
# modify config_aml.yml ...
nnictl create --config config_aml.yml
```
Replace `${NNI_VERSION}` with a released version name or branch name, e.g., `v1.9`.
# Run an Experiment on AdaptDL
NNI now supports running an experiment on [AdaptDL](https://github.com/petuum/adaptdl). Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), and an Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is set up to connect to your Kubernetes cluster. In AdaptDL mode, your trial program will run as an AdaptDL job in the Kubernetes cluster.
AdaptDL aims to make distributed deep learning easy and efficient in dynamic-resource environments such as shared clusters and the cloud.
## Prerequisite for Kubernetes Service
1. A **Kubernetes** cluster using Kubernetes 1.14 or later with storage. Follow this guideline to set up Kubernetes [on Azure](https://azure.microsoft.com/en-us/services/kubernetes-service/), or [on-premise](https://kubernetes.io/docs/setup/) with [cephfs](https://kubernetes.io/docs/concepts/storage/storage-classes/#ceph-rbd), or [microk8s with storage add-on enabled](https://microk8s.io/docs/addons).
2. Install the **AdaptDL scheduler** to your Kubernetes cluster with Helm. Follow this [guideline](https://adaptdl.readthedocs.io/en/latest/installation/install-adaptdl.html) to set up the AdaptDL scheduler.
3. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable (a small sketch of this lookup order appears after this list). Refer to this [guideline](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
4. If your NNI trial job needs GPU resource, you should follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure **Nvidia device plugin for Kubernetes**.
5. (Optional) Prepare an **NFS server** and export a general purpose mount as external storage.
6. Install **NNI**, follow the install guide [here](../Tutorial/QuickStart.md).
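As mentioned in item 3 above, a minimal sketch of the kubeconfig lookup order (illustrative only, not NNI's actual code):
```python
import os

# Illustrative only: KUBECONFIG takes precedence, otherwise fall back to ~/.kube/config.
kubeconfig = os.environ.get(
    "KUBECONFIG",
    os.path.join(os.path.expanduser("~"), ".kube", "config"),
)
print("kubeconfig path:", kubeconfig)
```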
### Verify Prerequisites
```bash
nnictl --version
# Expected: <version_number>
```
```bash
kubectl version
# Expected that the kubectl client version matches the server version.
```
```bash
kubectl api-versions | grep adaptdl
# Expected: adaptdl.petuum.com/v1
```
## Run an experiment
We have a CIFAR10 example that fully leverages the AdaptDL scheduler in the `examples/trials/cifar10_pytorch` folder (`main_adl.py` and `config_adl.yaml`).
Here is a template configuration specification to use AdaptDL as a training service.
```yaml
authorName: default
experimentName: minimal_adl
trainingServicePlatform: adl
nniManagerIp: 10.1.10.11
logCollection: http
tuner:
  builtinTunerName: GridSearch
searchSpacePath: search_space.json
trialConcurrency: 2
maxTrialNum: 2
trial:
  namespace: <k8s_namespace>
  adaptive: false # optional.
  image: <image_tag>
  imagePullSecrets: # optional
    - name: stagingsecret
  codeDir: .
  command: python main.py
  gpuNum: 1
  cpuNum: 1 # optional
  memorySize: 8Gi # optional
  nfs: # optional
    server: 10.20.41.55
    path: /
    containerMountPath: /nfs
  checkpoint: # optional
    storageClass: dfs
    storageSize: 1Gi
```
Configurations not mentioned below follow the
[default specs defined in the NNI doc](https://nni.readthedocs.io/en/latest/Tutorial/ExperimentConfig.html#configuration-spec).
* **trainingServicePlatform**: Choose `adl` to use the Kubernetes cluster with AdaptDL scheduler.
* **nniManagerIp**: *Required* for the `adl` training service to get the correct info and metrics back from the cluster.
It is the IP address of the machine with the NNI manager (NNICTL) that launches the NNI experiment.
* **logCollection**: *Recommended* to set as `http`. It will collect the trial logs on the cluster back to your machine via http.
* **tuner**: It supports the Tuun tuner and all NNI built-in tuners (except for the checkpoint feature of the NNI PBT tuners).
* **trial**: It defines the specs of an `adl` trial.
* **namespace**: (*Optional*) Kubernetes namespace to launch the trials. Defaults to the `default` namespace.
* **adaptive**: (*Optional*) Boolean for the AdaptDL trainer. When `true`, the job is preemptible and adaptive.
* **image**: Docker image for the trial.
* **imagePullSecrets**: (*Optional*) If you are using a private registry,
you need to provide the secret to successfully pull the image.
* **codeDir**: the working directory of the container. `.` means the default working directory defined by the image.
* **command**: the bash command to start the trial
* **gpuNum**: the number of GPUs requested for this trial. It must be a non-negative integer.
* **cpuNum**: (*Optional*) the number of CPUs requested for this trial. It must be a non-negative integer.
* **memorySize**: (*Optional*) the size of memory requested for this trial. It must follow the Kubernetes
[default format](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory).
* **nfs**: (*Optional*) mounting external storage. For more information about using NFS, please check the paragraph below.
* **checkpoint**: (*Optional*) storage settings for model checkpoints.
* **storageClass**: check the [Kubernetes storage documentation](https://kubernetes.io/docs/concepts/storage/storage-classes/) for how to use the appropriate `storageClass`.
* **storageSize**: this value should be large enough to fit your model's checkpoints, or it could cause a "disk quota exceeded" error.
### NFS Storage
As you may have noticed in the above configuration spec,
an *optional* section is available to configure NFS external storage. It is optional when no external storage is required, for example when the Docker image already contains the code and data.
Note that the `adl` training service does NOT mount the NFS to your local dev machine; you can mount it locally yourself to manage the filesystem, copy data or code, etc.
The `adl` training service can then mount it into Kubernetes for every trial, with the proper configurations:
* **server**: NFS server address, e.g. IP address or domain
* **path**: NFS server export path, i.e. the absolute path in NFS that can be mounted to trials
* **containerMountPath**: In container absolute path to mount the NFS **path** above,
so that every trial will have the access to the NFS.
In the trial containers, you can access the NFS with this path.
Use cases:
* If your training trials depend on a large dataset, you may want to download it onto the NFS first,
and mount it so that it can be shared across multiple trials.
* The storage for containers is ephemeral, and the trial containers will be deleted after a trial's lifecycle is over.
So if you want to export your trained models,
you may mount the NFS to the trial to persist and export them.
In short, there is no restriction on how a trial reads from or writes to the NFS storage, so you may use it flexibly as per your needs.
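As one hedged illustration of the model-export use case, assuming the `containerMountPath: /nfs` from the template above and a PyTorch model (both are assumptions, not requirements):
```python
import os
import torch  # assumes PyTorch is available in the trial image

def export_checkpoint(model):
    # /nfs is the containerMountPath configured above; each trial writes under its own job id.
    trial_id = os.getenv("NNI_TRIAL_JOB_ID", "local-debug")
    out_dir = os.path.join("/nfs", "models", trial_id)
    os.makedirs(out_dir, exist_ok=True)
    torch.save(model.state_dict(), os.path.join(out_dir, "model.pt"))
```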
## Monitor via Log Stream
Follow the log streaming of a certain trial:
```bash
nnictl log trial --trial_id=<trial_id>
```
If multiple experiments are running at the same time, specify the experiment explicitly:
```bash
nnictl log trial <experiment_id> --trial_id=<trial_id>
```
Note that *after* a trial has finished and its pod has been deleted,
logs can no longer be retrieved via this command.
However, you may still be able to access past trial logs
using the following approach.
## Monitor via TensorBoard
In the context of NNI, an experiment has multiple trials.
For easy comparison across trials for a model tuning process,
we support TensorBoard integration. Each experiment has
an independent TensorBoard logging directory, and thus its own dashboard.
You can only use the TensorBoard while the monitored experiment is running.
In other words, it is not supported to monitor stopped experiments.
In the trial container you may have access to two environment variables:
* `ADAPTDL_TENSORBOARD_LOGDIR`: the TensorBoard logging directory for the current experiment,
* `NNI_TRIAL_JOB_ID`: the `trial` job id for the current trial.
It is recommended to join them as the logging directory for each trial,
for example in Python:
```python
import os
tensorboard_logdir = os.path.join(
    os.getenv("ADAPTDL_TENSORBOARD_LOGDIR"),
    os.getenv("NNI_TRIAL_JOB_ID")
)
```
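For instance, a minimal sketch of writing into that directory with PyTorch's `SummaryWriter` (the metric name and value below are placeholders, and PyTorch is assumed to be available, as in the CIFAR10 example):
```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir=tensorboard_logdir)
writer.add_scalar("accuracy", 0.9, global_step=1)  # placeholder metric
writer.close()
```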
If an experiment is stopped, the data logged here
(defined by *the above envs* for monitoring with the following commands)
will be lost. To persist the logged data, you can use external storage (e.g. mount an NFS)
to export it and view TensorBoard locally.
With the above setting, you can monitor the experiment easily
via TensorBoard by
```bash
nnictl tensorboard start
```
If multiple experiments are running at the same time, you may use
```bash
nnictl tensorboard start <experiment_id>
```
It will provide you with the web URL to access TensorBoard.
Note that you have the flexibility to set the local `--port`
for TensorBoard.
**Run an Experiment on DLTS**
===
NNI supports running an experiment on [DLTS](https://github.com/microsoft/DLWorkspace.git), called dlts mode. Before starting to use NNI dlts mode, you should have an account to access the DLTS dashboard.
## Setup Environment
Step 1. Choose a cluster from the DLTS dashboard; ask your administrator for the cluster dashboard URL.
![Choose Cluster](../../img/dlts-step1.png)
Step 2. Prepare an NNI config YAML like the following:
```yaml
# Set this field to "dlts"
trainingServicePlatform: dlts
authorName: your_name
experimentName: auto_mnist
trialConcurrency: 2
maxExecDuration: 3h
maxTrialNum: 100
searchSpacePath: search_space.json
useAnnotation: false
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 1
  image: msranni/nni
# Configuration to access DLTS
dltsConfig:
  dashboard: # Ask administrator for the cluster dashboard URL
```
Remember to fill in the cluster dashboard URL on the last line.
Step 3. Open your working directory on the cluster and copy the NNI config as well as related code into a directory.
![Copy Config](../../img/dlts-step3.png)
Step 4. Submit an NNI manager job to the specified cluster.
![Submit Job](../../img/dlts-step4.png)
Step 5. Go to the Endpoints tab of the newly created job and click the Port 40000 link to check the trial's information.
![View NNI WebUI](../../img/dlts-step5.png)
# Run an Experiment on FrameworkController
NNI supports running an experiment using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, so you don't need to install Kubeflow or a framework-specific operator such as tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiments.
## Prerequisite for on-premises Kubernetes Service
1. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this [guideline](https://kubernetes.io/docs/setup/) to set up Kubernetes.
2. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this [guideline](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
3. If your NNI trial job needs GPU resources, you should follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure the **Nvidia device plugin for Kubernetes**.
4. Prepare an **NFS server** and export a general purpose mount (we recommend mapping your NFS server path with the `root_squash` option, otherwise permission issues may arise when NNI copies files to NFS; refer to this [page](https://linux.die.net/man/5/exports) to learn what the root_squash option is), or **Azure File Storage**.
5. Install the **NFS client** on the machine where you install NNI and run nnictl to create experiments. Run this command to install the NFSv4 client:
```bash
apt-get install nfs-common
```
6. Install **NNI**, follow the install guide [here](../Tutorial/QuickStart.md).
## Prerequisite for Azure Kubernetes Service
1. NNI supports FrameworkController based on Azure Kubernetes Service; follow the [guideline](https://azure.microsoft.com/en-us/services/kubernetes-service/) to set up Azure Kubernetes Service.
2. Install the [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) and __kubectl__. Use `az login` to set your Azure account, and connect the kubectl client to AKS; refer to this [guideline](https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster).
3. Follow the [guideline](https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal) to create an Azure file storage account. If you use Azure Kubernetes Service, NNI needs the Azure Storage Service to store code files and output files.
4. To access the Azure storage service, NNI needs the access key of the storage account, and NNI uses the [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) service to protect your private key. Set up Azure Key Vault, and add a secret to the Key Vault to store the access key of the Azure storage account. Follow this [guideline](https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli) to store the access key.
## Setup FrameworkController
Follow the [guideline](https://github.com/Microsoft/frameworkcontroller/tree/master/example/run) to set up FrameworkController in the Kubernetes cluster; NNI supports FrameworkController in statefulset mode. If your cluster enforces authorization, you need to create a service account with granted permissions for FrameworkController, and then pass the name of the FrameworkController service account to the NNI experiment config ([refer](https://github.com/Microsoft/frameworkcontroller/tree/master/example/run#run-by-kubernetes-statefulset)).
## Design
Please refer to the design of the [Kubeflow training service](KubeflowMode.md); the FrameworkController training service pipeline is similar.
## Example
The FrameworkController config file format is:
```yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 10h
maxTrialNum: 100
#choice: local, remote, pai, kubeflow, frameworkcontroller
trainingServicePlatform: frameworkcontroller
searchSpacePath: ~/nni/examples/trials/mnist-tfv1/search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
trial:
  codeDir: ~/nni/examples/trials/mnist-tfv1
  taskRoles:
    - name: worker
      taskNum: 1
      command: python3 mnist.py
      gpuNum: 1
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 1
        minSucceededTaskCount: 1
frameworkcontrollerConfig:
  storage: nfs
  nfs:
    server: {your_nfs_server}
    path: {your_nfs_server_exported_path}
```
If you use Azure Kubernetes Service, you should set `frameworkcontrollerConfig` in your config YAML file as follows:
```yaml
frameworkcontrollerConfig:
  storage: azureStorage
  serviceAccountName: {your_frameworkcontroller_service_account_name}
  keyVault:
    vaultName: {your_vault_name}
    name: {your_secret_name}
  azureStorage:
    accountName: {your_storage_account_name}
    azureShare: {your_azure_share_name}
```
Note: You should explicitly set `trainingServicePlatform: frameworkcontroller` in the NNI config YAML file if you want to start an experiment in frameworkcontroller mode.
The trial's config format for NNI frameworkcontroller mode is a simplified version of FrameworkController's official config; you can refer to the [Tensorflow example of FrameworkController](https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml) for a deeper understanding.
The trial configuration in frameworkcontroller mode has the following configuration keys:
* taskRoles: you can set multiple task roles in the config file, and each task role is a basic unit to process in the Kubernetes cluster.
* name: the name of the task role, like "worker", "ps", "master".
* taskNum: the replica number of the task role.
* command: the user's command to be run in the container.
* gpuNum: the number of GPU devices used in the container.
* cpuNum: the number of CPU devices used in the container.
* memoryMB: the memory limitation specified for the container.
* image: the Docker image used to create the pod and run the program.
* frameworkAttemptCompletionPolicy: the policy for running the framework; please refer to the [user-manual](https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md#frameworkattemptcompletionpolicy) for specific information. Users can use this policy to control the pods; for example, if only the worker stops while the ps does not, the completion policy can help stop the ps.
## How to run the example
After you prepare a config file, you can run your experiment with nnictl. The way to start an experiment on FrameworkController is similar to Kubeflow; please refer to the [document](KubeflowMode.md) for more information.
## Version check
NNI has supported the version check feature since version 0.6, [refer](PaiMode.md).
# How to Implement Training Service in NNI
## Overview
TrainingService is a module related to platform management and job scheduling in NNI. TrainingService is designed to be easy to implement: we define an abstract class TrainingService as the parent class of all kinds of TrainingService, and users just need to inherit the parent class and complete their own child class if they want to implement a customized TrainingService.
## System architecture
![](../../img/NNIDesign.jpg)
The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of the system, in charge of calling TrainingService to manage trial jobs and of the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module to manage trial jobs; it communicates with the NNIManager module, and has a different implementation for each training platform. For the time being, NNI supports the [local platform](LocalMode.md), [remote platform](RemoteMachineMode.md), [PAI platform](PaiMode.md), [Kubeflow platform](KubeflowMode.md) and [FrameworkController platform](FrameworkControllerMode.md).
In this document, we introduce the brief design of TrainingService. If users want to add a new TrainingService instance, they just need to complete a child class that implements TrainingService, and don't need to understand the code details of NNIManager, Dispatcher or other modules.
## Folder structure of code
NNI's folder structure is shown below:
```
nni
|- deployment
|- docs
|- examples
|- src
| |- nni_manager
| | |- common
| | |- config
| | |- core
| | |- coverage
| | |- dist
| | |- rest_server
| | |- training_service
| | | |- common
| | | |- kubernetes
| | | |- local
| | | |- pai
| | | |- remote_machine
| | | |- test
| |- sdk
| |- webui
|- test
|- tools
| |-nni_annotation
| |-nni_cmd
| |-nni_gpu_tool
| |-nni_trial_tool
```
The `nni/src/` folder stores most of NNI's source code. The code in this folder is related to NNIManager, TrainingService, SDK, WebUI and other modules. Users can find the abstract class of TrainingService in the `nni/src/nni_manager/common/trainingService.ts` file, and they should put their own implemented TrainingService in the `nni/src/nni_manager/training_service` folder. If users have implemented their own TrainingService code, they should also add unit tests for the code and place them in the `nni/src/nni_manager/training_service/test` folder.
## Function annotation of TrainingService
```
abstract class TrainingService {
    public abstract listTrialJobs(): Promise<TrialJobDetail[]>;
    public abstract getTrialJob(trialJobId: string): Promise<TrialJobDetail>;
    public abstract addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
    public abstract removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
    public abstract submitTrialJob(form: JobApplicationForm): Promise<TrialJobDetail>;
    public abstract updateTrialJob(trialJobId: string, form: JobApplicationForm): Promise<TrialJobDetail>;
    public abstract get isMultiPhaseJobSupported(): boolean;
    public abstract cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean): Promise<void>;
    public abstract setClusterMetadata(key: string, value: string): Promise<void>;
    public abstract getClusterMetadata(key: string): Promise<string>;
    public abstract cleanUp(): Promise<void>;
    public abstract run(): Promise<void>;
}
```
The parent class of TrainingService has a few abstract functions; users need to inherit the parent class and implement all of these abstract functions.
__setClusterMetadata(key: string, value: string)__
ClusterMetadata is data related to platform details. For example, the ClusterMetadata defined for the remote machine training service is:
```
export class RemoteMachineMeta {
    public readonly ip : string;
    public readonly port : number;
    public readonly username : string;
    public readonly passwd?: string;
    public readonly sshKeyPath?: string;
    public readonly passphrase?: string;
    public gpuSummary : GPUSummary | undefined;
    /* GPU Reservation info, the key is GPU index, the value is the job id which reserves this GPU*/
    public gpuReservation : Map<number, string>;
    constructor(ip : string, port : number, username : string, passwd : string,
        sshKeyPath : string, passphrase : string) {
        this.ip = ip;
        this.port = port;
        this.username = username;
        this.passwd = passwd;
        this.sshKeyPath = sshKeyPath;
        this.passphrase = passphrase;
        this.gpuReservation = new Map<number, string>();
    }
}
```
The metadata includes the host address, the username, and other configuration related to the platform. Users need to define their own metadata format and set the metadata instance in this function. This function is called before the experiment is started to set the configuration of remote machines.
__getClusterMetadata(key: string)__
This function returns the metadata value according to the key; it can be left empty if users don't need to use it.
__submitTrialJob(form: JobApplicationForm)__
SubmitTrialJob is a function to submit new trial jobs; users should generate a job instance of the TrialJobDetail type. TrialJobDetail is defined as follows:
```
interface TrialJobDetail {
    readonly id: string;
    readonly status: TrialJobStatus;
    readonly submitTime: number;
    readonly startTime?: number;
    readonly endTime?: number;
    readonly tags?: string[];
    readonly url?: string;
    readonly workingDirectory: string;
    readonly form: JobApplicationForm;
    readonly sequenceId: number;
    isEarlyStopped?: boolean;
}
```
Depending on the implementation, users can put the job detail into a job queue, and keep fetching jobs from the queue to prepare and run them. Alternatively, they can finish the preparing and running process in this function, and return the job detail after submission.
__cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean)__
If this function is called, the trial started by the platform should be canceled. Different platforms have different methods to cancel a running job; this function should be implemented according to the specific platform.
__updateTrialJob(trialJobId: string, form: JobApplicationForm)__
This function is called to update the trial job's status. The trial job's status should be detected according to the specific platform and updated to `RUNNING`, `SUCCEEDED`, `FAILED`, etc.
__getTrialJob(trialJobId: string)__
This function returns a trialJob detail instance according to trialJobId.
__listTrialJobs()__
Users should put the detail information of all trial jobs into a list, and return the list.
__addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)__
NNI holds an EventEmitter to get job metrics; whenever a new job metric is detected, the EventEmitter is triggered. Users should start the EventEmitter in this function.
__removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)__
Close the EventEmitter.
__run()__
The run() function is the main loop in TrainingService. Users can set up a while loop here to execute their logic, and finish when the experiment is stopped.
__cleanUp()__
This function is called to clean up the environment when an experiment is stopped. Users should do the platform-related cleaning operations in this function.
## TrialKeeper tool
NNI offers a TrialKeeper tool to help maintain trial jobs. Users can find the source code in `nni/tools/nni_trial_tool`. If users want to run trial jobs on a cloud platform, this tool is a good choice to help keep trials running on the platform.
The running architecture of TrialKeeper is shown below:
![](../../img/trialkeeper.jpg)
When users submit a trial job to a cloud platform, they should wrap their trial command into TrialKeeper, and start a TrialKeeper process on the cloud platform. Notice that TrialKeeper uses a RESTful server to communicate with TrainingService, so users should start a RESTful server on the local machine to receive metrics sent from TrialKeeper. The source code of the RESTful server can be found in `nni/src/nni_manager/training_service/common/clusterJobRestServer.ts`.
## Reference
For more information about how to debug, please [refer](../Tutorial/HowToDebug.md).
For the guideline on how to contribute, please [refer](../Tutorial/Contributing.md).
# Run an Experiment on Kubeflow
NNI supports running an experiment on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), and an Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is set up to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as a Kubeflow job in the Kubernetes cluster.
## Prerequisite for on-premises Kubernetes Service
1. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this [guideline](https://kubernetes.io/docs/setup/) to set up Kubernetes.
2. Download, set up, and deploy **Kubeflow** to your Kubernetes cluster. Follow this [guideline](https://www.kubeflow.org/docs/started/getting-started/) to set up Kubeflow.
3. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this [guideline](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig) to learn more about kubeconfig.
4. If your NNI trial job needs GPU resources, you should follow this [guideline](https://github.com/NVIDIA/k8s-device-plugin) to configure the **Nvidia device plugin for Kubernetes**.
5. Prepare an **NFS server** and export a general purpose mount (we recommend mapping your NFS server path with the `root_squash` option, otherwise permission issues may arise when NNI copies files to NFS; refer to this [page](https://linux.die.net/man/5/exports) to learn what the root_squash option is), or **Azure File Storage**.
6. Install the **NFS client** on the machine where you install NNI and run nnictl to create experiments. Run this command to install the NFSv4 client:
```
apt-get install nfs-common
```
7. Install **NNI**, follow the install guide [here](../Tutorial/QuickStart.md).
## Prerequisite for Azure Kubernetes Service
1. NNI supports Kubeflow based on Azure Kubernetes Service; follow the [guideline](https://azure.microsoft.com/en-us/services/kubernetes-service/) to set up Azure Kubernetes Service.
2. Install the [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) and __kubectl__. Use `az login` to set your Azure account, and connect the kubectl client to AKS; refer to this [guideline](https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster).
3. Deploy Kubeflow on Azure Kubernetes Service; follow this [guideline](https://www.kubeflow.org/docs/started/getting-started/).
4. Follow the [guideline](https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal) to create an Azure file storage account. If you use Azure Kubernetes Service, NNI needs the Azure Storage Service to store code files and output files.
5. To access the Azure storage service, NNI needs the access key of the storage account, and NNI uses the [Azure Key Vault](https://azure.microsoft.com/en-us/services/key-vault/) service to protect your private key. Set up Azure Key Vault, and add a secret to the Key Vault to store the access key of the Azure storage account. Follow this [guideline](https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli) to store the access key.
## Design
![](../../img/kubeflow_training_design.png)
The Kubeflow training service instantiates a Kubernetes REST client to interact with your K8s cluster's API server.
For each trial, we will upload all the files in your local codeDir path (configured in nni_config.yml), together with NNI generated files like parameter.cfg, into a storage volume. Right now we support two kinds of storage volumes: [nfs](https://en.wikipedia.org/wiki/Network_File_System) and [azure file storage](https://azure.microsoft.com/en-us/services/storage/files/); you should configure the storage volume in the NNI config YAML file. After the files are prepared, the Kubeflow training service will call the K8S REST API to create Kubeflow jobs ([tf-operator](https://github.com/kubeflow/tf-operator) jobs or [pytorch-operator](https://github.com/kubeflow/pytorch-operator) jobs) in K8S, and mount your storage volume into the job's pod. Output files of the Kubeflow job, like stdout, stderr, trial.log or model files, will also be copied back to the storage volume. NNI will show the storage volume's URL for each trial in the WebUI, to allow users to browse the log files and the job's output files.
## Supported operator
NNI only supports Kubeflow's tf-operator and pytorch-operator; other operators have not been tested.
Users can set the operator type in the config file.
The setting of tf-operator:
```yaml
kubeflowConfig:
  operator: tf-operator
```
The setting of pytorch-operator:
```yaml
kubeflowConfig:
  operator: pytorch-operator
```
If users want to use tf-operator, they can set `ps` and `worker` in the trial config. If users want to use pytorch-operator, they can set `master` and `worker` in the trial config.
## Supported storage type
NNI supports NFS and Azure Storage for storing code and output files; users can set the storage type and the corresponding config in the config file.
The setting for NFS storage is as follows:
```yaml
kubeflowConfig:
  storage: nfs
  nfs:
    # Your NFS server IP, like 10.10.10.10
    server: {your_nfs_server_ip}
    # Your NFS server export path, like /var/nfs/nni
    path: {your_nfs_server_export_path}
```
If you use Azure storage, you should set `kubeflowConfig` in your config YAML file as follows:
```yaml
kubeflowConfig:
  storage: azureStorage
  keyVault:
    vaultName: {your_vault_name}
    name: {your_secret_name}
  azureStorage:
    accountName: {your_storage_account_name}
    azureShare: {your_azure_share_name}
```
## Run an experiment
Use `examples/trials/mnist-tfv1` as an example. This is a TensorFlow job that uses Kubeflow's tf-operator. The NNI config YAML file's content is as follows:
```yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 2
maxExecDuration: 1h
maxTrialNum: 20
#choice: local, remote, pai, kubeflow
trainingServicePlatform: kubeflow
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
trial:
  codeDir: .
  worker:
    replicas: 2
    command: python3 dist_mnist.py
    gpuNum: 1
    cpuNum: 1
    memoryMB: 8196
    image: msranni/nni:latest
  ps:
    replicas: 1
    command: python3 dist_mnist.py
    gpuNum: 0
    cpuNum: 1
    memoryMB: 8196
    image: msranni/nni:latest
kubeflowConfig:
  operator: tf-operator
  apiVersion: v1alpha2
  storage: nfs
  nfs:
    # Your NFS server IP, like 10.10.10.10
    server: {your_nfs_server_ip}
    # Your NFS server export path, like /var/nfs/nni
    path: {your_nfs_server_export_path}
```
Note: You should explicitly set `trainingServicePlatform: kubeflow` in the NNI config YAML file if you want to start an experiment in kubeflow mode.
If you want to run PyTorch jobs, you can set your config file as follows:
```yaml
authorName: default
experimentName: example_mnist_distributed_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: kubeflow
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: minimize
trial:
  codeDir: .
  master:
    replicas: 1
    command: python3 dist_mnist.py
    gpuNum: 1
    cpuNum: 1
    memoryMB: 2048
    image: msranni/nni:latest
  worker:
    replicas: 1
    command: python3 dist_mnist.py
    gpuNum: 0
    cpuNum: 1
    memoryMB: 2048
    image: msranni/nni:latest
kubeflowConfig:
  operator: pytorch-operator
  apiVersion: v1alpha2
  nfs:
    # Your NFS server IP, like 10.10.10.10
    server: {your_nfs_server_ip}
    # Your NFS server export path, like /var/nfs/nni
    path: {your_nfs_server_export_path}
```
The trial configuration in kubeflow mode has the following configuration keys:
* codeDir
* The code directory, where you put the training code and config files.
* worker (required). This config section is used to configure the TensorFlow worker role.
* replicas
* Required key. Should be a positive number, depending on how many replicas you want to run for the TensorFlow worker role.
* command
* Required key. The command to launch your trial job, like ```python mnist.py```.
* memoryMB
* Required key. Should be a positive number based on your trial program's memory requirement.
* cpuNum
* gpuNum
* image
* Required key. In kubeflow mode, your trial program will be scheduled by Kubernetes to run in a [Pod](https://kubernetes.io/docs/concepts/workloads/pods/pod/). This key is used to specify the Docker image used to create the pod where your trial program will run.
* We have already built a Docker image [msranni/nni](https://hub.docker.com/r/msranni/nni/) on [Docker Hub](https://hub.docker.com/). It contains NNI python packages, Node modules and javascript artifact files required to start experiments, and all of NNI's dependencies. The Dockerfile used to build this image can be found [here](https://github.com/Microsoft/nni/tree/v1.9/deployment/docker/Dockerfile). You can either use this image directly in your config file, or build your own image based on it.
* privateRegistryAuthPath
* Optional field. Specify the path to a `config.json` file that holds an authorization token for the Docker registry, used to pull images from a private registry. [Refer](https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/).
* apiVersion
* Required key. The API version of your Kubeflow.
* ps (optional). This config section is used to configure the TensorFlow parameter server role.
* master (optional). This config section is used to configure the PyTorch master role.
Once you have completed the NNI experiment config file and saved it (for example, as exp_kubeflow.yml), run the following command
```bash
nnictl create --config exp_kubeflow.yml
```
to start the experiment in kubeflow mode. NNI will create a Kubeflow tfjob or pytorchjob for each trial, and the job name format is something like `nni_exp_{experiment_id}_trial_{trial_id}`.
You can see the Kubeflow tfjob created by NNI in your Kubernetes dashboard.
Notice: In kubeflow mode, NNIManager will start a REST server and listen on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is `8080`, the REST server will listen on `8081` to receive metrics from trial jobs running in Kubernetes. So you should enable TCP port `8081` in your firewall rules to allow incoming traffic.
Once a trial job is completed, you can go to the NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.
## Version check
NNI has supported the version check feature since version 0.6, [refer](PaiMode.md).
If you have any problems when using NNI in kubeflow mode, please create an issue on the [NNI GitHub repo](https://github.com/Microsoft/nni).
**Tutorial: Create and Run an Experiment on local with NNI API**
===
In this tutorial, we will use the example in `~/examples/trials/mnist-tfv1` to explain how to create and run an experiment locally with the NNI API.
>Before you start
You have an implementation of an MNIST classifier using convolutional layers; the Python code is in `mnist_before.py`.
>Step 1 - Update model codes
To enable NNI API, make the following changes:
~~~~
1.1 Declare NNI API
Include `import nni` in your trial code to use NNI APIs.
1.2 Get predefined parameters
Use the following code snippet:
RECEIVED_PARAMS = nni.get_next_parameter()
to get hyper-parameters' values assigned by tuner. `RECEIVED_PARAMS` is an object, for example:
{"conv_size": 2, "hidden_size": 124, "learning_rate": 0.0307, "dropout_rate": 0.2029}
1.3 Report NNI results
Use the API:
`nni.report_intermediate_result(accuracy)`
to send `accuracy` to assessor.
Use the API:
`nni.report_final_result(accuracy)`
to send `accuracy` to tuner.
~~~~
We have made these changes and saved them in `mnist.py`.
**NOTE**:
~~~~
accuracy - The `accuracy` could be any python object, but if you use NNI built-in tuner/assessor, `accuracy` should be a numerical variable (e.g. float, int).
assessor - The assessor will decide which trial should early stop based on the history performance of trial (intermediate result of one trial).
tuner - The tuner will generate next parameters/architecture based on the explore history (final result of all trials).
~~~~
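Putting Step 1 together, below is a minimal sketch of a complete trial script; the training loop and parameter names are hypothetical placeholders for your real model code:
```python
import nni

def train(params):
    # Hypothetical training loop; replace with your real MNIST model code.
    accuracy = 0.0
    for epoch in range(10):
        accuracy = min(1.0, accuracy + params["learning_rate"])
        nni.report_intermediate_result(accuracy)  # intermediate result, used by the assessor
    return accuracy

if __name__ == "__main__":
    RECEIVED_PARAMS = nni.get_next_parameter()  # e.g. {"learning_rate": 0.03, ...}
    final_accuracy = train(RECEIVED_PARAMS)
    nni.report_final_result(final_accuracy)  # final result, used by the tuner
```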
>Step 2 - Define SearchSpace
The hyper-parameters used in `Step 1.2 - Get predefined parameters` are defined in a `search_space.json` file like the one below:
```
{
"dropout_rate":{"_type":"uniform","_value":[0.1,0.5]},
"conv_size":{"_type":"choice","_value":[2,3,5,7]},
"hidden_size":{"_type":"choice","_value":[124, 512, 1024]},
"learning_rate":{"_type":"uniform","_value":[0.0001, 0.1]}
}
```
Refer to [define search space](../Tutorial/SearchSpaceSpec.md) to learn more about search space.
>Step 3 - Define Experiment
>>3.1 enable NNI API mode
To enable NNI API mode, you need to set useAnnotation to *false* and provide the path of the SearchSpace file (the one you just defined in Step 2):
```
useAnnotation: false
searchSpacePath: /path/to/your/search_space.json
```
To run an experiment in NNI, you only need:
* Provide a runnable trial
* Provide or choose a tuner
* Provide a YAML experiment configure file
* (optional) Provide or choose an assessor
**Prepare trial**:
>A set of examples can be found in ~/nni/examples after your installation; run `ls ~/nni/examples/trials` to see all the trial examples.
Let's use a simple trial example provided by NNI, e.g. mnist. You can simply execute the following command to run the NNI mnist example:

    python ~/nni/examples/trials/mnist-annotation/mnist.py

This command will be filled in the YAML configuration file below. Please refer to [here](../TrialExample/Trials.md) for how to write your own trial.
**Prepare tuner**: NNI supports several popular automl algorithms, including Random Search, Tree of Parzen Estimators (TPE), Evolution algorithm, etc. Users can write their own tuner (refer to [here](../Tuner/CustomizeTuner.md)), but for simplicity, here we choose a tuner provided by NNI as below:

    tuner:
      builtinTunerName: TPE
      classArgs:
        optimize_mode: maximize

*builtinTunerName* is used to specify a tuner in NNI, *classArgs* are the arguments passed to the tuner (the spec of builtin tuners can be found [here](../Tuner/BuiltinTuner.md)), and *optimize_mode* indicates whether you want to maximize or minimize your trial's result.
**Prepare configuration file**: Since you already know which trial code you are going to run and which tuner you are going to use, it is time to prepare the YAML configuration file. NNI provides a demo configuration file for each trial example; run `cat ~/nni/examples/trials/mnist-annotation/config.yml` to see it. Its content is basically shown below:
```yaml
authorName: your_name
experimentName: auto_mnist
# how many trials could be concurrently running
trialConcurrency: 1
# maximum experiment running duration
maxExecDuration: 3h
# empty means never stop
maxTrialNum: 100
# choice: local, remote
trainingServicePlatform: local
# search space file
searchSpacePath: search_space.json
# choice: true, false
useAnnotation: true
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  command: python mnist.py
  codeDir: ~/nni/examples/trials/mnist-annotation
  gpuNum: 0
```
Here *useAnnotation* is true because this trial example uses our python annotation (refer to [here](../Tutorial/AnnotationSpec.md) for details). For the trial, we should provide the *command* which runs the trial, and the *codeDir* where the trial code is located. The command will be executed in this directory. We should also specify how many GPUs a trial requires.
With all these steps done, we can run the experiment with the following command:

    nnictl create --config ~/nni/examples/trials/mnist-annotation/config.yml

You can refer to [here](../Tutorial/Nnictl.md) for more usage of the *nnictl* command line tool.
## View experiment results
The experiment is now running. Besides *nnictl*, NNI also provides a WebUI for you to view the experiment's progress, control your experiment, and use other appealing features.
## Using multiple local GPUs to speed up search
The following steps assume that you have 4 NVIDIA GPUs installed on your local machine and [tensorflow with GPU support](https://www.tensorflow.org/install/gpu) installed. The demo enables 4 concurrent trial jobs, and each trial job uses 1 GPU.
**Prepare configuration file**: NNI provides a demo configuration file for the setting above; run `cat ~/nni/examples/trials/mnist-annotation/config_gpu.yml` to see it. The trialConcurrency and gpuNum are different from the basic configuration file:
```
...
# how many trials could be concurrently running
trialConcurrency: 4
...
trial:
  command: python mnist.py
  codeDir: ~/nni/examples/trials/mnist-annotation
  gpuNum: 1
```
We can run the experiment with the following command:

    nnictl create --config ~/nni/examples/trials/mnist-annotation/config_gpu.yml

You can use the *nnictl* command line tool or the WebUI to trace the training progress. The *nvidia-smi* command line tool can also help you monitor the GPU usage during training.
# Training Service
## What is Training Service?
NNI training service is designed to allow users to focus on AutoML itself, agnostic to the underlying computing infrastructure where the trials are actually run. When migrating from one cluster to another (e.g., local machine to Kubeflow), users only need to tweak several configurations, and the experiment can be easily scaled.
Users can use the training services provided by NNI to run trial jobs on the [local machine](./LocalMode.md), on [remote machines](./RemoteMachineMode.md), and on clusters like [PAI](./PaiMode.md), [Kubeflow](./KubeflowMode.md), [AdaptDL](./AdaptDLMode.md), [FrameworkController](./FrameworkControllerMode.md), [DLTS](./DLTSMode.md) and [AML](./AMLMode.md). These are called *built-in training services*.
If the computing resource you want to use is not listed above, NNI provides an interface that allows users to build their own training service easily. Please refer to "[how to implement training service](./HowToImplementTrainingService)" for details.
## How to use Training Service?
A training service needs to be chosen and configured properly in the experiment configuration YAML file. Users can refer to the document of each training service for how to write the configuration. Also, the [reference](../Tutorial/ExperimentConfig) provides more details on the specification of the experiment configuration file.
Next, users should prepare the code directory, which is specified as `codeDir` in the config file. Please note that in non-local mode, the code directory will be uploaded to the remote machine or cluster before the experiment. Therefore, we limit the number of files to 2000 and the total size to 300MB. If the code directory contains too many files, users can choose which files and subfolders should be excluded by adding a `.nniignore` file that works like a `.gitignore` file. For more details on how to write this file, see [this example](https://github.com/Microsoft/nni/tree/v1.9/examples/trials/mnist-tfv1/.nniignore) and the [git documentation](https://git-scm.com/docs/gitignore#_pattern_format).
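As a quick sanity check against these limits before launching, a small sketch (the directory path is a placeholder, and the counts do not account for `.nniignore` exclusions):
```python
import os

def check_code_dir(code_dir="."):
    # Count files and total size to compare against the 2000-file / 300MB limits above.
    n_files, total_bytes = 0, 0
    for root, _, files in os.walk(code_dir):
        for name in files:
            n_files += 1
            total_bytes += os.path.getsize(os.path.join(root, name))
    print(f"{n_files} files, {total_bytes / (1024 ** 2):.1f} MB")

check_code_dir()
```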
In case users intend to use large files in their experiment (like large-scale datasets) and they are not using local mode, they can either: 1) download the data before each trial launches by putting the download into the trial command; or 2) use a shared storage that is accessible to worker nodes. Usually, training platforms are equipped with shared storage, and NNI allows users to easily use them. Refer to the docs of each built-in training service for details.
## Built-in Training Services
|TrainingService|Brief Introduction|
|---|---|
|[__Local__](./LocalMode.md)|NNI supports running an experiment on the local machine, called local mode. Local mode means that NNI runs the trial jobs and the nniManager process on the same machine, and supports the GPU scheduling function for trial jobs.|
|[__Remote__](./RemoteMachineMode.md)|NNI supports running an experiment on multiple machines through the SSH channel, called remote mode. NNI assumes that you have access to those machines and have already set up the environment for running deep learning training code. NNI will submit the trial jobs to the remote machines, and schedule suitable machines with enough GPU resources if specified.|
|[__PAI__](./PaiMode.md)|NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka PAI), called PAI mode. Before starting to use NNI PAI mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In PAI mode, your trial program will run in PAI's container created by Docker.|
|[__Kubeflow__](./KubeflowMode.md)|NNI supports running experiments on [Kubeflow](https://github.com/kubeflow/kubeflow), called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or [Azure Kubernetes Service(AKS)](https://azure.microsoft.com/en-us/services/kubernetes-service/), and an Ubuntu machine on which [kubeconfig](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/) is set up to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, [here](https://kubernetes.io/docs/tutorials/kubernetes-basics/) is a good start. In kubeflow mode, your trial program will run as a Kubeflow job in the Kubernetes cluster.|
|[__AdaptDL__](./AdaptDLMode.md)|NNI supports running experiments on [AdaptDL](https://github.com/petuum/adaptdl), called AdaptDL mode. Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster.|
|[__FrameworkController__](./FrameworkControllerMode.md)|NNI supports running experiments using [FrameworkController](https://github.com/Microsoft/frameworkcontroller), called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, so you don't need to install Kubeflow or a framework-specific operator such as tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiments.|
|[__DLTS__](./DLTSMode.md)|NNI supports running experiment using [DLTS](https://github.com/microsoft/DLWorkspace.git), which is an open source toolkit, developed by Microsoft, that allows AI scientists to spin up an AI cluster in turn-key fashion.|
|[__AML__](./AMLMode.md)|NNI supports running an experiment on [AML](https://azure.microsoft.com/en-us/services/machine-learning/), called aml mode.|
## What does Training Service do?
<p align="center">
<img src="https://user-images.githubusercontent.com/23273522/51816536-ed055580-2301-11e9-8ad8-605a79ee1b9a.png" alt="drawing" width="700"/>
</p>
According to the architecture shown in [Overview](../Overview), the training service (platform) is responsible for three things: 1) initiating a new trial; 2) collecting metrics and communicating with NNI core (NNI manager); 3) monitoring trial job status. To demonstrate in detail how training service works, we show the workflow of training service from the very beginning to the moment when the first trial succeeds.
Step 1. **Validate config and prepare the training platform.** The training service will first check whether the training platform the user specifies is valid (e.g., is there anything wrong with authentication). After that, the training service will start to prepare for the experiment by making the code directory (`codeDir`) accessible to the training platform.
```eval_rst
.. Note:: Different training services have different ways to handle ``codeDir``. For example, local training service directly runs trials in ``codeDir``. Remote training service packs ``codeDir`` into a zip and uploads it to each machine. K8S-based training services copy ``codeDir`` onto a shared storage, which is either provided by training platform itself, or configured by users in config file.
```
Step 2. **Submit the first trial.** To initiate a trial, usually (in non-reuse mode), NNI copies a few more files (including parameters, launch script, etc.) onto the training platform. After that, NNI launches the trial through a subprocess, SSH, RESTful API, etc.
```eval_rst
.. Warning:: The working directory of the trial command has exactly the same content as ``codeDir``, but can have a different path (even on different machines). Local mode is the only training service that shares one ``codeDir`` across all trials. Other training services copy ``codeDir`` from the shared copy prepared in step 1, and each trial has an independent working directory. We strongly advise users not to rely on the shared behavior in local mode, as it will make your experiments difficult to scale to other training services.
```
Step 3. **Collect metrics.** NNI then monitors the status of the trial, updates the recorded status (e.g., from `WAITING` to `RUNNING`, `RUNNING` to `SUCCEEDED`), and collects the metrics. Currently, most training services are implemented in an "active" way, i.e., the training service calls the RESTful API on the NNI manager to update the metrics. Note that this usually requires the machine that runs the NNI manager to be at least accessible to the worker nodes.
**Run an Experiment on OpenPAI**
===
NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai), called pai mode. Before starting to use NNI pai mode, you should have an account to access an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have an OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in a PAI Docker container.
[toc]
## Setup environment
**Step 1. Install NNI, follow the install guide [here](../Tutorial/QuickStart.md).**
**Step 2. Get token.**
Open the web portal of OpenPAI, and click the `My profile` button at the top right.
<img src="../../img/pai_profile.jpg" style="zoom: 80%;" />
Click the `copy` button on the page to copy a JWT token.
<img src="../../img/pai_token.jpg" style="zoom:67%;" />
**Step 3. Mount NFS storage to local machine.**
Click the `Submit job` button in the web portal.
<img src="../../img/pai_job_submission_page.jpg" style="zoom: 50%;" />
Find the data management section on the job submission page.
<img src="../../img/pai_data_management_page.jpg" style="zoom: 33%;" />
`Preview container paths` shows the NFS host and path that OpenPAI provides. You need to mount the corresponding host and path to your local machine first; then NNI can use OpenPAI's NFS storage.
For example, use the following command:
```bash
sudo mount -t nfs4 gcr-openpai-infra02:/pai/data /local/mnt
```
Then the `/data` folder in the container will be mounted to the `/local/mnt` folder on your local machine.
You could use the following configuration in your NNI's config file:
```yaml
nniManagerNFSMountPath: /local/mnt
```
**Step 4. Get OpenPAI's storage config name and nniManagerNFSMountPath.**

The `Team share storage` field is the storage configuration used to specify storage values in OpenPAI. You can get the `paiStorageConfigName` and `containerNFSMountPath` fields from `Team share storage`, for example:
```yaml
paiStorageConfigName: confignfs-data
containerNFSMountPath: /mnt/confignfs-data
```
## Run an experiment
Use `examples/trials/mnist-annotation` as an example. The NNI config YAML file's content is like:
```yaml
authorName: your_name
experimentName: auto_mnist
# how many trials could be concurrently running
trialConcurrency: 2
# maximum experiment running duration
maxExecDuration: 3h
# empty means never stop
maxTrialNum: 100
# choice: local, remote, pai
trainingServicePlatform: pai
# search space file
searchSpacePath: search_space.json
# choice: true, false
useAnnotation: true
tuner:
builtinTunerName: TPE
classArgs:
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: ~/nni/examples/trials/mnist-annotation
gpuNum: 0
cpuNum: 1
memoryMB: 8196
image: msranni/nni:latest
virtualCluster: default
nniManagerNFSMountPath: /local/mnt
containerNFSMountPath: /mnt/confignfs-data
paiStorageConfigName: confignfs-data
# Configuration to access OpenPAI Cluster
paiConfig:
userName: your_pai_nni_user
token: your_pai_token
host: 10.1.1.1
# optional, experimental feature.
reuse: true
```
Note: You should set `trainingServicePlatform: pai` in the NNI config YAML file if you want to start the experiment in pai mode. The `host` field in the configuration file is the URI of OpenPAI's job submission page, such as `10.10.5.1`. The default protocol in NNI is `http`; if your OpenPAI cluster has HTTPS enabled, use the URI in the `https://10.10.5.1` format.
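For example, if your OpenPAI cluster has HTTPS enabled, only the `host` value of the `paiConfig` section above changes (the other values below are placeholders):

```yaml
paiConfig:
  userName: your_pai_nni_user
  token: your_pai_token
  host: https://10.10.5.1
```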
### Trial configurations
Compared with [LocalMode](LocalMode.md) and [RemoteMachineMode](RemoteMachineMode.md), `trial` configuration in pai mode has the following additional keys:
* cpuNum
Optional key. Should be positive number based on your trial program's CPU requirement. If it is not set in trial configuration, it should be set in the config file specified in `paiConfigPath` field.
* memoryMB
Optional key. Should be positive number based on your trial program's memory requirement. If it is not set in trial configuration, it should be set in the config file specified in `paiConfigPath` field.
* image
Optional key. In pai mode, your trial program will be scheduled by OpenPAI to run in [Docker container](https://www.docker.com/). This key is used to specify the Docker image used to create the container in which your trial will run.
    We have already built a Docker image [msranni/nni](https://hub.docker.com/r/msranni/nni/) on [Docker Hub](https://hub.docker.com/). It contains the NNI Python packages, the Node modules and JavaScript artifact files required to start an experiment, and all of NNI's dependencies. The Dockerfile used to build this image can be found [here](https://github.com/Microsoft/nni/tree/v1.9/deployment/docker/Dockerfile). You can either use this image directly in your config file or build your own image based on it. If this key is not set in the trial configuration, it should be set in the config file specified in the `paiConfigPath` field.
* virtualCluster
Optional key. Set the virtualCluster of OpenPAI. If omitted, the job will run on default virtual cluster.
* nniManagerNFSMountPath
Required key. Set the mount path in your nniManager machine.
* containerNFSMountPath
Required key. Set the mount path in your container used in OpenPAI.
* paiStorageConfigName:
Optional key. Set the storage name used in OpenPAI. If it is not set in trial configuration, it should be set in the config file specified in `paiConfigPath` field.
* command
Optional key. Set the commands used in OpenPAI container.
* paiConfigPath
Optional key. Set the file path of OpenPAI job configuration, the file is in yaml format.
If users set `paiConfigPath` in NNI's configuration file, there is no need to specify the fields `command`, `paiStorageConfigName`, `virtualCluster`, `image`, `memoryMB`, `cpuNum`, and `gpuNum` in the `trial` configuration. These fields will take their values from the config file specified by `paiConfigPath`.
Note:
1. The job name in OpenPAI's configuration file will be replaced by a new job name created by NNI, with the format `nni_exp_${experimentId}_trial_${trialJobId}`.
2. If users set multiple taskRoles in OpenPAI's configuration file, NNI will wrap all of these taskRoles and start multiple tasks in one trial job. Users should ensure that only one taskRole reports metrics to NNI; otherwise there might be conflict errors.
### OpenPAI configurations
`paiConfig` includes OpenPAI-specific configurations:
* userName
Required key. User name of OpenPAI platform.
* token
Required key. Authentication key of OpenPAI platform.
* host
Required key. The host of the OpenPAI platform, i.e., the URI of OpenPAI's job submission page, such as `10.10.5.1`. The default protocol in NNI is `http`; if your OpenPAI cluster has HTTPS enabled, use the URI in the `https://10.10.5.1` format.
* reuse (experimental feature)
Optional key, default is false. If it's true, NNI will reuse OpenPAI jobs to run as many trials as possible. This saves the time of creating new jobs. Users need to make sure each trial can run independently in the same job, for example, by avoiding loading checkpoints from previous trials.
Once you have filled in the NNI experiment config file and saved it (for example, as exp_pai.yml), run the following command
```bash
nnictl create --config exp_pai.yml
```
to start the experiment in pai mode. NNI will create an OpenPAI job for each trial, with a job name like `nni_exp_{experiment_id}_trial_{trial_id}`.
You can see jobs created by NNI in the OpenPAI cluster's web portal, like:
![](../../img/nni_pai_joblist.jpg)
Notice: In pai mode, NNIManager starts a REST server that listens on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is `8080`, the REST server listens on `8081` to receive metrics from trial jobs running on OpenPAI. So you should open TCP port `8081` in your firewall rules to allow incoming traffic.
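For example, on an Ubuntu machine that manages its firewall with `ufw`, the port could be opened with a command like the following (adapt it to whatever firewall you actually use):

```bash
# Allow incoming metric traffic to the NNI REST server (WebUI port + 1).
sudo ufw allow 8081/tcp
```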
Once a trial job is completed, you can go to the NNI WebUI's overview page (e.g., http://localhost:8080/oview) to check the trial's information.
Expand a trial information in trial list view, click the logPath link like:
<img src="../../img/nni_webui_joblist.jpg" style="zoom: 30%;" />
You will be redirected to the HDFS web portal to browse the output files of that trial in HDFS:
<img src="../../img/nni_trial_hdfs_output.jpg" style="zoom: 80%;" />
You can see there are three files in the output folder: stderr, stdout, and trial.log.
## Data management

Before using NNI to start your experiment, you should set the corresponding data mount path on your nniManager machine. OpenPAI has its own storage (NFS, AzureBlob, ...), and the storage used in OpenPAI is mounted to the container when it starts a job. Set the `paiStorageConfigName` field to choose a storage configuration in OpenPAI. Then mount that storage to your nniManager machine and set the `nniManagerNFSMountPath` field in the configuration file; NNI will generate bash files and copy the data in `codeDir` to the `nniManagerNFSMountPath` folder, then start a trial job. The data in `nniManagerNFSMountPath` is synced to the OpenPAI storage and mounted into OpenPAI's container. The data path in the container is set by `containerNFSMountPath`; NNI enters this folder first and then runs scripts to start the trial job.
## Version check

NNI has supported a version check feature since version 0.6. It is a policy to ensure that the NNIManager version is consistent with the trialKeeper version, and to avoid errors caused by version incompatibility.

Check policy:

1. NNIManager before v0.6 can run any version of trialKeeper; trialKeeper supports backward compatibility.
2. Since version 0.6, the NNIManager version should be the same as the trialKeeper version. For example, if the NNIManager version is 0.6, the trialKeeper version should be 0.6 too.
3. Note that the version check only checks the first two digits of the version. For example, NNIManager v0.6.1 can use trialKeeper v0.6 or trialKeeper v0.6.2, but cannot use trialKeeper v0.5.1 or trialKeeper v0.7.

If you cannot run your experiment and want to know whether it is caused by the version check, you can check the WebUI; there will be an error message about the version check.
<img src="../../img/version_check.png" style="zoom: 80%;" />
**Run an Experiment on OpenpaiYarn**
===
The original `pai` mode has been renamed to `paiYarn` mode, which targets the distributed training platform based on Yarn.
## Setup environment
Install NNI, follow the install guide [here](../Tutorial/QuickStart.md).
## Run an experiment
Use `examples/trials/mnist-tfv1` as an example. The NNI config YAML file's content is like:
```yaml
authorName: your_name
experimentName: auto_mnist
# how many trials could be concurrently running
trialConcurrency: 2
# maximum experiment running duration
maxExecDuration: 3h
# empty means never stop
maxTrialNum: 100
# choice: local, remote, pai, paiYarn
trainingServicePlatform: paiYarn
# search space file
searchSpacePath: search_space.json
# choice: true, false
useAnnotation: false
tuner:
builtinTunerName: TPE
classArgs:
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: ~/nni/examples/trials/mnist-tfv1
gpuNum: 0
cpuNum: 1
memoryMB: 8196
image: msranni/nni:latest
# Configuration to access OpenpaiYarn Cluster
paiYarnConfig:
userName: your_paiYarn_nni_user
passWord: your_paiYarn_password
host: 10.1.1.1
```
Note: You should set `trainingServicePlatform: paiYarn` in the NNI config YAML file if you want to start the experiment in paiYarn mode.

Compared with [LocalMode](LocalMode.md) and [RemoteMachineMode](RemoteMachineMode.md), the trial configuration in paiYarn mode has these additional keys:
* cpuNum
* Required key. Should be positive number based on your trial program's CPU requirement
* memoryMB
* Required key. Should be positive number based on your trial program's memory requirement
* image
* Required key. In paiYarn mode, your trial program will be scheduled by OpenpaiYarn to run in [Docker container](https://www.docker.com/). This key is used to specify the Docker image used to create the container in which your trial will run.
    * We have already built a Docker image [msranni/nni](https://hub.docker.com/r/msranni/nni/) on [Docker Hub](https://hub.docker.com/). It contains the NNI Python packages, the Node modules and JavaScript artifact files required to start an experiment, and all of NNI's dependencies. The Dockerfile used to build this image can be found [here](https://github.com/Microsoft/nni/tree/v1.9/deployment/docker/Dockerfile). You can either use this image directly in your config file or build your own image based on it.
* virtualCluster
* Optional key. Set the virtualCluster of OpenpaiYarn. If omitted, the job will run on default virtual cluster.
* shmMB
    * Optional key. Set the shmMB configuration of OpenpaiYarn; it sets the shared memory for one task in the task role.
* authFile
    * Optional key. Set the auth file path for a private registry when using paiYarn mode ([refer](https://github.com/microsoft/paiYarn/blob/2ea69b45faa018662bc164ed7733f6fdbb4c42b3/docs/faq.md#q-how-to-use-private-docker-registry-job-image-when-submitting-an-openpaiYarn-job)). You can prepare the authFile and simply provide its local path; NNI will upload this file to HDFS for you.
* portList
    * Optional key. Set the portList configuration of OpenpaiYarn; it specifies a list of ports used in the container ([refer](https://github.com/microsoft/paiYarn/blob/b2324866d0280a2d22958717ea6025740f71b9f0/docs/job_tutorial.md#specification)).
The config schema in NNI is shown below:
```
portList:
- label: test
beginAt: 8080
portNumber: 2
```
Let's say you want to launch TensorBoard in the mnist example using one of these ports. The first step is to write a wrapper script `launch_paiYarn.sh` around `mnist.py`.
```bash
export TENSORBOARD_PORT=paiYarn_PORT_LIST_${paiYarn_CURRENT_TASK_ROLE_NAME}_0_tensorboard
tensorboard --logdir . --port ${!TENSORBOARD_PORT} &
python3 mnist.py
```
The portList section of the config file should then be filled in as follows:
```yaml
trial:
command: bash launch_paiYarn.sh
portList:
- label: tensorboard
beginAt: 0
portNumber: 1
```
NNI supports two kinds of authorization methods in paiYarn: password and paiYarn token ([refer](https://github.com/microsoft/paiYarn/blob/b6bd2ab1c8890f91b7ac5859743274d2aa923c22/docs/rest-server/API.md#2-authentication)). The authorization is configured in the `paiYarnConfig` field.
For password authorization, the `paiYarnConfig` schema is:
```
paiYarnConfig:
userName: your_paiYarn_nni_user
passWord: your_paiYarn_password
host: 10.1.1.1
```
For paiYarn token authorization, the `paiYarnConfig` schema is:
```
paiYarnConfig:
userName: your_paiYarn_nni_user
token: your_paiYarn_token
host: 10.1.1.1
```
Once you have filled in the NNI experiment config file and saved it (for example, as exp_paiYarn.yml), run the following command
```
nnictl create --config exp_paiYarn.yml
```
to start the experiment in paiYarn mode. NNI will create an OpenpaiYarn job for each trial, with a job name like `nni_exp_{experiment_id}_trial_{trial_id}`.
You can see jobs created by NNI in the OpenpaiYarn cluster's web portal, like:
![](../../img/nni_pai_joblist.jpg)
Notice: In paiYarn mode, NNIManager starts a REST server that listens on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is `8080`, the REST server listens on `8081` to receive metrics from trial jobs running on OpenpaiYarn. So you should open TCP port `8081` in your firewall rules to allow incoming traffic.

Once a trial job is completed, you can go to the NNI WebUI's overview page (e.g., http://localhost:8080/oview) to check the trial's information.
Expand a trial information in trial list view, click the logPath link like:
![](../../img/nni_webui_joblist.jpg)
You will be redirected to the HDFS web portal to browse the output files of that trial in HDFS:

![](../../img/nni_trial_hdfs_output.jpg)

You can see there are three files in the output folder: stderr, stdout, and trial.log.
## Data management

If your training data is not too large, it can be put into `codeDir` and NNI will upload the data to HDFS, or you can build your own Docker image with the data. If you have a large dataset, it is not appropriate to put the data in `codeDir`; instead you can follow the [guidance](https://github.com/microsoft/paiYarn/blob/master/docs/user/storage.md) to mount the data folder in the container.

If you also want to save other trial output into HDFS, such as model files, you can use the environment variable `NNI_OUTPUT_DIR` in your trial code to save your own output files. The NNI SDK will copy all the files in `NNI_OUTPUT_DIR` from the trial's container to HDFS; the target path is `hdfs://host:port/{username}/nni/{experiments}/{experimentId}/trials/{trialId}/nnioutput`.
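For example, a trial could write extra output into `NNI_OUTPUT_DIR` like the minimal sketch below (the file name and content are arbitrary placeholders):

```python
import json
import os

# NNI sets NNI_OUTPUT_DIR inside the trial container; fall back to "." for local debugging.
output_dir = os.environ.get('NNI_OUTPUT_DIR', '.')

# Anything written here (e.g. model files) is copied from the trial's container to HDFS by the NNI SDK.
with open(os.path.join(output_dir, 'extra_output.json'), 'w') as f:
    json.dump({'note': 'model files or other artifacts can be saved here'}, f)
```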
## Version check

NNI has supported a version check feature since version 0.6. It is a policy to ensure that the NNIManager version is consistent with the trialKeeper version, and to avoid errors caused by version incompatibility.

Check policy:

1. NNIManager before v0.6 can run any version of trialKeeper; trialKeeper supports backward compatibility.
2. Since version 0.6, the NNIManager version should be the same as the trialKeeper version. For example, if the NNIManager version is 0.6, the trialKeeper version should be 0.6 too.
3. Note that the version check only checks the first two digits of the version. For example, NNIManager v0.6.1 can use trialKeeper v0.6 or trialKeeper v0.6.2, but cannot use trialKeeper v0.5.1 or trialKeeper v0.7.

If you cannot run your experiment and want to know whether it is caused by the version check, you can check the WebUI; there will be an error message about the version check.
![](../../img/version_check.png)
# Run an Experiment on Remote Machines
NNI can run one experiment on multiple remote machines through SSH, called `remote` mode. It's like a lightweight training platform. In this mode, NNI can be started from your computer, and dispatch trials to remote machines in parallel.
The supported operating systems of remote machines are `Linux`, `Windows 10`, and `Windows Server 2019`.
## Requirements
* Make sure the default environment of the remote machines meets the requirements of your trial code. If the default environment does not meet the requirements, a setup script can be added to the `command` field of the NNI config.
* Make sure remote machines can be accessed through SSH from the machine which runs `nnictl` command. It supports both password and key authentication of SSH. For advanced usages, please refer to [machineList part of configuration](../Tutorial/ExperimentConfig.md).
* Make sure the NNI version on each machine is consistent.
* Make sure the trial command is compatible with the remote OSes if you want to use remote Linux and Windows machines together. For example, the default Python 3.x executable is called `python3` on Linux and `python` on Windows.
### Linux
* Follow [installation](../Tutorial/InstallationLinux.md) to install NNI on the remote machine.
### Windows
* Follow [installation](../Tutorial/InstallationWin.md) to install NNI on the remote machine.
* Install and start `OpenSSH Server`.
1. Open `Settings` app on Windows.
2. Click `Apps`, then click `Optional features`.
3. Click `Add a feature`, search and select `OpenSSH Server`, and then click `Install`.
4. Once it's installed, run below command to start and set to automatic start.
```bat
sc config sshd start=auto
net start sshd
```
* Make sure the remote account is an administrator, so that it can stop running trials.
* Make sure there is no welcome message beyond the default, since extra output causes the ssh2 library in Node.js to fail. For example, if you're using a Data Science VM on Azure, you need to remove the extra echo commands in `C:\dsvm\tools\setup\welcome.bat`.

    Output like the following when opening a new command window is OK.
```text
Microsoft Windows [Version 10.0.17763.1192]
(c) 2018 Microsoft Corporation. All rights reserved.
(py37_default) C:\Users\AzureUser>
```
## Run an experiment
For example, suppose there are three machines which can be logged into with a username and password.
| IP | Username | Password |
| -------- | -------- | -------- |
| 10.1.1.1 | bob | bob123 |
| 10.1.1.2 | bob | bob123 |
| 10.1.1.3 | bob | bob123 |
Install and run NNI on one of those three machines or on another machine that has network access to them.
Use `examples/trials/mnist-annotation` as the example. Below is content of `examples/trials/mnist-annotation/config_remote.yml`:
```yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: remote
# search space file
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: true
tuner:
#choice: TPE, Random, Anneal, Evolution, BatchTuner
#SMAC (SMAC should be installed through nnictl)
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: .
gpuNum: 0
#machineList can be empty if the platform is local
machineList:
- ip: 10.1.1.1
username: bob
passwd: bob123
#port can be skip if using default ssh port 22
#port: 22
- ip: 10.1.1.2
username: bob
passwd: bob123
- ip: 10.1.1.3
username: bob
passwd: bob123
```
Files in `codeDir` will be uploaded to remote machines automatically. You can run below command on Windows, Linux, or macOS to spawn trials on remote Linux machines:
```bash
nnictl create --config examples/trials/mnist-annotation/config_remote.yml
```
### Configure python environment
By default, commands and scripts are executed in the default environment on the remote machine. If there are multiple Python virtual environments on your remote machine and you want to run experiments in a specific one, use __preCommand__ to specify a Python environment on your remote machine.
Use `examples/trials/mnist-tfv2` as the example. Below is content of `examples/trials/mnist-tfv2/config_remote.yml`:
```yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: remote
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner
#SMAC (SMAC should be installed through nnictl)
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: .
gpuNum: 0
#machineList can be empty if the platform is local
machineList:
- ip: ${replace_to_your_remote_machine_ip}
username: ${replace_to_your_remote_machine_username}
sshKeyPath: ${replace_to_your_remote_machine_sshKeyPath}
# Pre-command will be executed before the remote machine executes other commands.
# Below is an example of specifying python environment.
# If you want to execute multiple commands, please use "&&" to connect them.
# preCommand: source ${replace_to_absolute_path_recommended_here}/bin/activate
# preCommand: source ${replace_to_conda_path}/bin/activate ${replace_to_conda_env_name}
preCommand: export PATH=${replace_to_python_environment_path_in_your_remote_machine}:$PATH
```
The __preCommand__ will be executed before the remote machine executes other commands. So you can configure python environment path like this:
```yaml
# Linux remote machine
preCommand: export PATH=${replace_to_python_environment_path_in_your_remote_machine}:$PATH
# Windows remote machine
preCommand: set path=${replace_to_python_environment_path_in_your_remote_machine};%path%
```
Or if you want to activate the `virtualenv` environment:
```yaml
# Linux remote machine
preCommand: source ${replace_to_absolute_path_recommended_here}/bin/activate
# Windows remote machine
preCommand: ${replace_to_absolute_path_recommended_here}\\scripts\\activate
```
Or if you want to activate the `conda` environment:
```yaml
# Linux remote machine
preCommand: source ${replace_to_conda_path}/bin/activate ${replace_to_conda_env_name}
# Windows remote machine
preCommand: call activate ${replace_to_conda_env_name}
```
If you want multiple commands to be executed, you can use `&&` to connect these commands:
```yaml
preCommand: command1 && command2 && command3
```
__Note__: Because __preCommand__ is executed before other commands each time, it is strongly discouraged to set a __preCommand__ that makes changes to the system, e.g. `mkdir` or `touch`.
# CIFAR-10 examples
## Overview
[CIFAR-10][3] classification is a common benchmark problem in machine learning. The CIFAR-10 dataset is a collection of images and one of the most widely used datasets for machine learning research; it contains 60,000 32x32 color images in 10 different classes. Thus, we use CIFAR-10 classification as an example to introduce NNI usage.
### **Goals**
The choice of model optimizer directly affects the final performance metrics. The goal of this tutorial is to **tune a better-performing optimizer** to train a relatively small convolutional neural network (CNN) for recognizing images.

In this example, we have selected the following common deep learning optimizers:
> "SGD", "Adadelta", "Adagrad", "Adam", "Adamax"
### **Experimental**
#### Preparations
This example requires PyTorch. The PyTorch install package should be chosen based on your Python version and CUDA version.

For example, for an environment with python==3.5 and cuda==8.0, use the following commands to install [PyTorch][2]:
```bash
python3 -m pip install http://download.pytorch.org/whl/cu80/torch-0.4.1-cp35-cp35m-linux_x86_64.whl
python3 -m pip install torchvision
```
#### CIFAR-10 with NNI
**Search Space**
As stated in the goals, we aim to find the best `optimizer` for training the CIFAR-10 classifier. When using different optimizers, we also need to adjust the `learning rate` and `network structure` accordingly, so we chose these three parameters as hyperparameters and wrote the following search space.
```json
{
"lr":{"_type":"choice", "_value":[0.1, 0.01, 0.001, 0.0001]},
"optimizer":{"_type":"choice", "_value":["SGD", "Adadelta", "Adagrad", "Adam", "Adamax"]},
"model":{"_type":"choice", "_value":["vgg", "resnet18", "googlenet", "densenet121", "mobilenet", "dpn92", "senet18"]}
}
```
*Implemented code directory: [search_space.json][8]*
**Trial**
The trial code trains the CNN with each set of hyperparameters; pay particular attention to the following points, which are specific to NNI (see the sketch after the code directory reference below):
* Use `nni.get_next_parameter()` to get next training hyperparameter set.
* Use `nni.report_intermediate_result(acc)` to report the intermediate result after finishing each epoch.
* Use `nni.report_final_result(acc)` to report the final result before the trial ends.
*Implemented code directory: [main.py][9]*
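Putting these calls together, the trial's training loop has roughly the following shape (a simplified sketch with stubbed-out training code, not the actual main.py):

```python
import random

import nni

# Get the hyperparameters chosen by the tuner, e.g. {'lr': 0.1, 'optimizer': 'SGD', 'model': 'resnet18'}.
params = nni.get_next_parameter()

acc = 0.0
for epoch in range(10):
    # ... train one epoch and evaluate with the received hyperparameters ...
    acc = random.random()                  # stand-in for the real validation accuracy
    nni.report_intermediate_result(acc)    # report after each epoch

nni.report_final_result(acc)               # report once before the trial ends
```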
You can also adapt your previous code directly; refer to [How to define a trial][5] for how to modify it.
**Config**
Here is the example of running this experiment locally (with multiple GPUs):
code directory: [examples/trials/cifar10_pytorch/config.yml][6]
Here is the example of running this experiment on OpenPAI:
code directory: [examples/trials/cifar10_pytorch/config_pai.yml][7]
*The complete examples we have implemented: [examples/trials/cifar10_pytorch/][1]*
#### Launch the experiment
We are ready for the experiment; now **run the config.yml file from your command line to start the experiment**.
```bash
nnictl create --config nni/examples/trials/cifar10_pytorch/config.yml
```
[1]: https://github.com/Microsoft/nni/tree/v1.9/examples/trials/cifar10_pytorch
[2]: https://pytorch.org/
[3]: https://www.cs.toronto.edu/~kriz/cifar.html
[4]: https://github.com/Microsoft/nni/tree/v1.9/examples/trials/cifar10_pytorch
[5]: Trials.md
[6]: https://github.com/Microsoft/nni/blob/v1.9/examples/trials/cifar10_pytorch/config.yml
[7]: https://github.com/Microsoft/nni/blob/v1.9/examples/trials/cifar10_pytorch/config_pai.yml
[8]: https://github.com/Microsoft/nni/blob/v1.9/examples/trials/cifar10_pytorch/search_space.json
[9]: https://github.com/Microsoft/nni/blob/v1.9/examples/trials/cifar10_pytorch/main.py
# EfficientNet
[EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946)
Use grid search to find the best combination of alpha, beta and gamma for EfficientNet-B1, as discussed in Section 3.3 of the paper. Search space, tuner, and configuration examples are provided here.
## Instructions
[Example code](https://github.com/microsoft/nni/tree/v1.9/examples/trials/efficientnet)
1. Set your working directory to the example code directory.
2. Run `git clone https://github.com/ultmaster/EfficientNet-PyTorch` to clone the [ultmaster modified version](https://github.com/ultmaster/EfficientNet-PyTorch) of the original [EfficientNet-PyTorch](https://github.com/lukemelas/EfficientNet-PyTorch). The modifications were made to adhere to the original [Tensorflow version](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet) as closely as possible (including EMA, label smoothing, etc.); also added is the part that gets parameters from the tuner and reports intermediate/final results. Clone it into `EfficientNet-PyTorch`; files such as `main.py` and `train_imagenet.sh` will appear inside, as specified in the configuration files.
3. Run `nnictl create --config config_local.yml` (use `config_pai.yml` for OpenPAI) to find the best EfficientNet-B1. Adjust the training service (PAI/local/remote) and batch size in the config files according to the environment.

For training on ImageNet, read `EfficientNet-PyTorch/train_imagenet.sh`. Download ImageNet beforehand and extract it adhering to the [PyTorch format](https://pytorch.org/docs/stable/torchvision/datasets.html#imagenet), and then replace `/mnt/data/imagenet` with the location of the ImageNet storage. This file should also be a good example to follow for mounting ImageNet into the container on OpenPAI.
## Results
The following image is a screenshot demonstrating the relationship between acc@1 and alpha, beta, and gamma.
![](../../img/efficientnet_search_result.png)
# GBDT in nni
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion as other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
Gradient boosting decision trees have many popular implementations, such as [lightgbm](https://github.com/Microsoft/LightGBM), [xgboost](https://github.com/dmlc/xgboost), and [catboost](https://github.com/catboost/catboost). GBDT is a great tool for solving traditional machine learning problems. Since GBDT is a robust algorithm, it can be used in many domains. The better the hyper-parameters for GBDT, the better the performance you can achieve.

NNI is a great platform for tuning hyper-parameters; you can try various built-in search algorithms in NNI and run multiple trials concurrently.
## 1. Search Space in GBDT
There are many hyper-parameters in GBDT, but which of them affect performance or speed? Based on practical experience, here are some suggestions (taking lightgbm as an example):
> * For better accuracy
* `learning_rate`. The range of `learning rate` could be [0.001, 0.9].
* `num_leaves`. `num_leaves` is related to `max_depth`, you don't have to tune both of them.
* `bagging_freq`. `bagging_freq` could be [1, 2, 4, 8, 10]
* `num_iterations`. May larger if underfitting.
> * For speed up
* `bagging_fraction`. The range of `bagging_fraction` could be [0.7, 1.0].
* `feature_fraction`. The range of `feature_fraction` could be [0.6, 1.0].
* `max_bin`.
> * To avoid overfitting
* `min_data_in_leaf`. This depends on your dataset.
* `min_sum_hessian_in_leaf`. This depend on your dataset.
* `lambda_l1` and `lambda_l2`.
* `min_gain_to_split`.
* `num_leaves`.
Reference link:
[lightgbm](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html) and [autoxgboost](https://github.com/ja-thomas/autoxgboost/blob/master/poster_2018.pdf)
## 2. Task description
Now we come back to our example "auto-gbdt", which runs with lightgbm and NNI. The data includes [train data](https://github.com/Microsoft/nni/blob/v1.9/examples/trials/auto-gbdt/data/regression.train) and [test data](https://github.com/Microsoft/nni/blob/v1.9/examples/trials/auto-gbdt/data/regression.test).
Given the features and label in train data, we train a GBDT regression model and use it to predict.
## 3. How to run in nni
### 3.1 Install all the requirements
```
pip install lightgbm
pip install pandas
```
### 3.2 Prepare your trial code
You need to prepare basic code like the following:
```python
import lightgbm as lgb
from sklearn.metrics import mean_squared_error

...
def get_default_parameters():
...
return params
def load_data(train_path='./data/regression.train', test_path='./data/regression.test'):
'''
Load or create dataset
'''
...
return lgb_train, lgb_eval, X_test, y_test
def run(lgb_train, lgb_eval, params, X_test, y_test):
# train
gbm = lgb.train(params,
lgb_train,
num_boost_round=20,
valid_sets=lgb_eval,
early_stopping_rounds=5)
# predict
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
# eval
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print('The rmse of prediction is:', rmse)
if __name__ == '__main__':
lgb_train, lgb_eval, X_test, y_test = load_data()
PARAMS = get_default_parameters()
# train
run(lgb_train, lgb_eval, PARAMS, X_test, y_test)
```
### 3.3 Prepare your search space.
If you would like to tune `num_leaves`, `learning_rate`, `bagging_fraction` and `bagging_freq`, you can write a [search_space.json](https://github.com/Microsoft/nni/blob/v1.9/examples/trials/auto-gbdt/search_space.json) as follows:
```json
{
"num_leaves":{"_type":"choice","_value":[31, 28, 24, 20]},
"learning_rate":{"_type":"choice","_value":[0.01, 0.05, 0.1, 0.2]},
"bagging_fraction":{"_type":"uniform","_value":[0.7, 1.0]},
"bagging_freq":{"_type":"choice","_value":[1, 2, 4, 8, 10]}
}
```
More supported variable types can be found [here](../Tutorial/SearchSpaceSpec.md).
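For instance, besides `choice` and `uniform`, the search space specification also supports types such as `randint` and `loguniform`; a hypothetical variant of the search space above could look like this:

```json
{
    "num_leaves":{"_type":"randint","_value":[20, 31]},
    "learning_rate":{"_type":"loguniform","_value":[0.001, 0.2]}
}
```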
### 3.4 Add the NNI SDK into your code.
```diff
+import nni
...
def get_default_parameters():
...
return params
def load_data(train_path='./data/regression.train', test_path='./data/regression.test'):
'''
Load or create dataset
'''
...
return lgb_train, lgb_eval, X_test, y_test
def run(lgb_train, lgb_eval, params, X_test, y_test):
# train
gbm = lgb.train(params,
lgb_train,
num_boost_round=20,
valid_sets=lgb_eval,
early_stopping_rounds=5)
# predict
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
# eval
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print('The rmse of prediction is:', rmse)
+ nni.report_final_result(rmse)
if __name__ == '__main__':
lgb_train, lgb_eval, X_test, y_test = load_data()
+ RECEIVED_PARAMS = nni.get_next_parameter()
PARAMS = get_default_parameters()
+ PARAMS.update(RECEIVED_PARAMS)
# train
run(lgb_train, lgb_eval, PARAMS, X_test, y_test)
```
### 3.5 Write a config file and run it.
In the config file, you could set some settings including:
* Experiment setting: `trialConcurrency`, `maxExecDuration`, `maxTrialNum`, `trial gpuNum`, etc.
* Platform setting: `trainingServicePlatform`, etc.
* Path setting: `searchSpacePath`, `trial codeDir`, etc.
* Algorithm setting: select `tuner` algorithm, `tuner optimize_mode`, etc.
An example config.yml is as follows:
```yaml
authorName: default
experimentName: example_auto-gbdt
trialConcurrency: 1
maxExecDuration: 10h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: local
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution, BatchTuner
#SMAC (SMAC should be installed through nnictl)
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: minimize
trial:
command: python3 main.py
codeDir: .
gpuNum: 0
```
Run this experiment with the following command:
```bash
nnictl create --config ./config.yml
```
Knowledge Distillation on NNI
===
## KnowledgeDistill
NNI supports knowledge distillation, as described in [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531): the compressed model is trained to mimic a pre-trained, larger model. This training setting is also referred to as "teacher-student", where the large model is the teacher and the small model is the student.
![](../../img/distill.png)
### Usage
PyTorch code
```python
from knowledge_distill.knowledge_distill import KnowledgeDistill
kd = KnowledgeDistill(kd_teacher_model, kd_T=5)
alpha = 1
beta = 0.8
for batch_idx, (data, target) in enumerate(train_loader):
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = F.cross_entropy(output, target)
    # you only need to add the following line to fine-tune with knowledge distillation
loss = alpha * loss + beta * kd.loss(data=data, student_out=output)
loss.backward()
```
#### User configuration for KnowledgeDistill
* **kd_teacher_model:** The pre-trained teacher model
* **kd_T:** Temperature for smoothing teacher model's output
The complete code can be found [here](https://github.com/microsoft/nni/tree/v1.3/examples/model_compress/knowledge_distill/)
# MNIST examples
A CNN MNIST classifier is the deep learning equivalent of `hello world` in programming languages. Thus, we use MNIST as an example to introduce different features of NNI. The examples are listed below:
- [MNIST with NNI API (TensorFlow v1.x)](#mnist-tfv1)
- [MNIST with NNI API (TensorFlow v2.x)](#mnist-tfv2)
- [MNIST with NNI annotation](#mnist-annotation)
- [MNIST in keras](#mnist-keras)
- [MNIST -- tuning with batch tuner](#mnist-batch)
- [MNIST -- tuning with hyperband](#mnist-hyperband)
- [MNIST -- tuning within a nested search space](#mnist-nested)
- [distributed MNIST (tensorflow) using kubeflow](#mnist-kubeflow-tf)
- [distributed MNIST (pytorch) using kubeflow](#mnist-kubeflow-pytorch)
<a name="mnist-tfv1"></a>
**MNIST with NNI API (TensorFlow v1.x)**
This is a simple network which has two convolutional layers, two pooling layers and a fully connected layer. We tune hyperparameters such as dropout rate, convolution size, hidden size, etc. It can be tuned with most NNI built-in tuners, such as TPE, SMAC, and Random. We also provide an example YAML file which enables the assessor.
`code directory: examples/trials/mnist-tfv1/`
<a name="mnist-tfv2"></a>
**MNIST with NNI API (TensorFlow v2.x)**
Same network to the example above, but written in TensorFlow v2.x Keras API.
`code directory: examples/trials/mnist-tfv2/`
<a name="mnist-annotation"></a>
**MNIST with NNI annotation**
This example is similar to the one above; the only difference is that this example uses NNI annotation to specify the search space and report results, while the example above uses NNI APIs to receive the configuration and report results.
`code directory: examples/trials/mnist-annotation/`
<a name="mnist-keras"></a>
**MNIST in keras**
This example is implemented in keras. It is also a network for MNIST dataset, with two convolution layers, one pooling layer, and two fully connected layers.
`code directory: examples/trials/mnist-keras/`
<a name="mnist-batch"></a>
**MNIST -- tuning with batch tuner**
This example is to show how to use batch tuner. Users simply list all the configurations they want to try in the search space file. NNI will try all of them.
`code directory: examples/trials/mnist-batch-tune-keras/`
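For reference, a batch tuner search space roughly takes the shape below; the concrete parameter names here are hypothetical, and the exact schema is described in the batch tuner documentation.

```json
{
    "combine_params": {
        "_type": "choice",
        "_value": [
            {"optimizer": "Adam", "learning_rate": 0.001},
            {"optimizer": "SGD", "learning_rate": 0.01}
        ]
    }
}
```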
<a name="mnist-hyperband"></a>
**MNIST -- tuning with hyperband**
This example is to show how to use hyperband to tune the model. There is one more key `STEPS` in the received configuration for trials to control how long it can run (e.g., number of iterations).
`code directory: examples/trials/mnist-hyperband/`
<a name="mnist-nested"></a>
**MNIST -- tuning within a nested search space**
This example shows that NNI also supports nested search spaces. The search space file is an example of how to define a nested search space.
`code directory: examples/trials/mnist-nested-search-space/`
<a name="mnist-kubeflow-tf"></a>
**distributed MNIST (tensorflow) using kubeflow**
This example shows how to run distributed training on kubeflow through NNI. Users simply provide the distributed training code and a configuration file which specifies the kubeflow mode, for example, the command to run the ps, the command to run the worker, and how many resources they consume. This example is implemented in tensorflow and thus uses the kubeflow tensorflow operator.
`code directory: examples/trials/mnist-distributed/`
<a name="mnist-kubeflow-pytorch"></a>
**distributed MNIST (pytorch) using kubeflow**
Similar to the previous example, the difference is that this example is implemented in pytorch, thus, it uses kubeflow pytorch operator.
`code directory: examples/trials/mnist-distributed-pytorch/`
# Tuning Tensor Operators on NNI
## Overview
Abundant applications raise the demands of training and inference deep neural networks (DNNs) efficiently on diverse hardware platforms ranging from cloud servers to embedded devices. Moreover, computational graph-level optimization of deep neural network, like tensor operator fusion, may introduce new tensor operators. Thus, manually optimized tensor operators provided by hardware-specific libraries have limitations in terms of supporting new hardware platforms or supporting new operators, so automatically optimizing tensor operators on diverse hardware platforms is essential for large-scale deployment and application of deep learning technologies in the real-world problems.
Tensor operator optimization is substantially a combinatorial optimization problem. The objective function is the performance of a tensor operator on specific hardware platform, which should be maximized with respect to the hyper-parameters of corresponding device code, such as how to tile a matrix or whether to unroll a loop. Unlike many typical problems of this type, such as travelling salesman problem, the objective function of tensor operator optimization is a black box and expensive to sample. One has to compile a device code with a specific configuration and run it on real hardware to get the corresponding performance metric. Therefore, a desired method for optimizing tensor operators should find the best configuration with as few samples as possible.
The expensive objective function makes solving tensor operator optimization problem with traditional combinatorial optimization methods, for example, simulated annealing and evolutionary algorithms, almost impossible. Although these algorithms inherently support combinatorial search spaces, they do not take sample-efficiency into account,
thus thousands of or even more samples are usually needed, which is unacceptable when tuning tensor operators in product environments. On the other hand, sequential model based optimization (SMBO) methods are proved sample-efficient for optimizing black-box functions with continuous search spaces. However, when optimizing ones with combinatorial search spaces, SMBO methods are not as sample-efficient as their continuous counterparts, because there is lack of prior assumptions about the objective functions, such as continuity and differentiability in the case of continuous search spaces. For example, if one could assume an objective function with a continuous search space is infinitely differentiable, a Gaussian process with a radial basis function (RBF) kernel could be used to model the objective function. In this way, a sample provides not only a single value at a point but also the local properties of the objective function in its neighborhood or even global properties,
which results in a high sample-efficiency. In contrast, SMBO methods for combinatorial optimization suffer poor sample-efficiency due to the lack of proper prior assumptions and surrogate models which can leverage them.
OpEvo is recently proposed for solving this challenging problem. It efficiently explores the search spaces of tensor operators by introducing a topology-aware mutation operation based on q-random walk distribution to leverage the topological structures over the search spaces. Following this example, you can use OpEvo to tune three representative types of tensor operators selected from two popular neural networks, BERT and AlexNet. Three comparison baselines, AutoTVM, G-BFS and N-A2C, are also provided. Please refer to [OpEvo: An Evolutionary Method for Tensor Operator Optimization](https://arxiv.org/abs/2006.05664) for detailed explanation about these algorithms.
## Environment Setup
We prepared a dockerfile for setting up experiment environments. Before starting, please make sure the Docker daemon is running and the driver of your GPU accelerator is properly installed. Enter into the example folder `examples/trials/systems/opevo` and run below command to build and instantiate a Docker image from the dockerfile.
```bash
# if you are using Nvidia GPU
make cuda-env
# if you are using AMD GPU
make rocm-env
```
## Run Experiments:
Three representative kinds of tensor operators, **matrix multiplication**, **batched matrix multiplication** and **2D convolution**, are chosen from BERT and AlexNet, and tuned with NNI. The `Trial` code for all tensor operators is `/root/compiler_auto_tune_stable.py`, and the `Search Space` and `config` files for each tuning algorithm are located in `/root/experiments/`, categorized by tensor operator. Here `/root` refers to the root of the container.
For tuning the operators of matrix multiplication, please run below commands from `/root`:
```bash
# (N, K) x (K, M) represents a matrix of shape (N, K) multiplies a matrix of shape (K, M)
# (512, 1024) x (1024, 1024)
# tuning with OpEvo
nnictl create --config experiments/mm/N512K1024M1024/config_opevo.yml
# tuning with G-BFS
nnictl create --config experiments/mm/N512K1024M1024/config_gbfs.yml
# tuning with N-A2C
nnictl create --config experiments/mm/N512K1024M1024/config_na2c.yml
# tuning with AutoTVM
OP=matmul STEP=512 N=512 M=1024 K=1024 P=NN ./run.sh
# (512, 1024) x (1024, 4096)
# tuning with OpEvo
nnictl create --config experiments/mm/N512K1024M4096/config_opevo.yml
# tuning with G-BFS
nnictl create --config experiments/mm/N512K1024M4096/config_gbfs.yml
# tuning with N-A2C
nnictl create --config experiments/mm/N512K1024M4096/config_na2c.yml
# tuning with AutoTVM
OP=matmul STEP=512 N=512 M=1024 K=4096 P=NN ./run.sh
# (512, 4096) x (4096, 1024)
# tuning with OpEvo
nnictl create --config experiments/mm/N512K4096M1024/config_opevo.yml
# tuning with G-BFS
nnictl create --config experiments/mm/N512K4096M1024/config_gbfs.yml
# tuning with N-A2C
nnictl create --config experiments/mm/N512K4096M1024/config_na2c.yml
# tuning with AutoTVM
OP=matmul STEP=512 N=512 M=4096 K=1024 P=NN ./run.sh
```
For tuning the operators of batched matrix multiplication, please run below commands from `/root`:
```bash
# batched matrix with batch size 960 and shape of matrix (128, 128) multiplies batched matrix with batch size 960 and shape of matrix (128, 64)
# tuning with OpEvo
nnictl create --config experiments/bmm/B960N128K128M64PNN/config_opevo.yml
# tuning with AutoTVM
OP=batch_matmul STEP=512 B=960 N=128 K=128 M=64 P=NN ./run.sh
# batched matrix with batch size 960 and shape of matrix (128, 128) is transposed first and then multiplies batched matrix with batch size 960 and shape of matrix (128, 64)
# tuning with OpEvo
nnictl create --config experiments/bmm/B960N128K128M64PTN/config_opevo.yml
# tuning with AutoTVM
OP=batch_matmul STEP=512 B=960 N=128 K=128 M=64 P=TN ./run.sh
# batched matrix with batch size 960 and shape of matrix (128, 64) is transposed first and then right multiplies batched matrix with batch size 960 and shape of matrix (128, 64).
# tuning with OpEvo
nnictl create --config experiments/bmm/B960N128K64M128PNT/config_opevo.yml
# tuning with AutoTVM
OP=batch_matmul STEP=512 B=960 N=128 K=64 M=128 P=NT ./run.sh
```
For tuning the operators of 2D convolution, please run below commands from `/root`:
```bash
# image tensor of shape (512, 3, 227, 227) convolves with kernel tensor of shape (64, 3, 11, 11) with stride 4 and padding 0
# tuning with OpEvo
nnictl create --config experiments/conv/N512C3HW227F64K11ST4PD0/config_opevo.yml
# tuning with AutoTVM
OP=convfwd_direct STEP=512 N=512 C=3 H=227 W=227 F=64 K=11 ST=4 PD=0 ./run.sh
# image tensor of shape (512, 64, 27, 27) convolves with kernel tensor of shape (192, 64, 5, 5) with stride 1 and padding 2
# tuning with OpEvo
nnictl create --config experiments/conv/N512C64HW27F192K5ST1PD2/config_opevo.yml
# tuning with AutoTVM
OP=convfwd_direct STEP=512 N=512 C=64 H=27 W=27 F=192 K=5 ST=1 PD=2 ./run.sh
```
Please note that G-BFS and N-A2C are only designed for tuning the tiling schemes of multiplying matrices whose rows and columns are powers of 2, so they are not compatible with other types of configuration spaces and thus are not eligible to tune the operators of batched matrix multiplication and 2D convolution. Here, AutoTVM is implemented by its authors in the TVM project, so its tuning results are printed on the screen rather than reported to the NNI manager. Port 8080 of the container is bound to the same port on the host, so one can access the NNI Web UI through `host_ip_addr:8080` and monitor the tuning process as in the screenshot below.
<img src="../../../examples/trials/systems/opevo/screenshot.png" />
## Citing OpEvo
If you feel OpEvo is helpful, please consider citing the paper as follows:
```
@misc{gao2020opevo,
title={OpEvo: An Evolutionary Method for Tensor Operator Optimization},
author={Xiaotian Gao and Cui Wei and Lintao Zhang and Mao Yang},
year={2020},
eprint={2006.05664},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
# Tuning RocksDB on NNI
## Overview
[RocksDB](https://github.com/facebook/rocksdb) is a popular high-performance embedded key-value database used in production systems at various web-scale enterprises including Facebook, Yahoo!, and LinkedIn. It is a fork of [LevelDB](https://github.com/google/leveldb) by Facebook, optimized to exploit many central processing unit (CPU) cores and make efficient use of fast storage, such as solid-state drives (SSD), for input/output (I/O) bound workloads.

The performance of RocksDB is highly contingent on its tuning. However, because of the complexity of its underlying technology and the large number of configurable parameters, a good configuration is sometimes hard to obtain. NNI can help to address this issue. NNI supports many kinds of tuning algorithms to search for the best configuration of RocksDB, and supports many kinds of environments such as local machines, remote servers, and the cloud.

This example illustrates how to use NNI to search for the best configuration of RocksDB for a `fillrandom` benchmark supported by the benchmark tool `db_bench`, which is an official benchmark tool provided by RocksDB itself. Therefore, before running this example, please make sure NNI is installed and [`db_bench`](https://github.com/facebook/rocksdb/wiki/Benchmarking-tools) is in your `PATH`. Please refer to [here](../Tutorial/QuickStart.md) for detailed information about installing and preparing the NNI environment, and [here](https://github.com/facebook/rocksdb/blob/master/INSTALL.md) for compiling RocksDB as well as `db_bench`.
We also provide a simple script [`db_bench_installation.sh`](https://github.com/microsoft/nni/tree/v1.9/examples/trials/systems/rocksdb-fillrandom/db_bench_installation.sh) helping to compile and install `db_bench` as well as its dependencies on Ubuntu. Installing RocksDB on other systems can follow the same procedure.
*code directory: [`example/trials/systems/rocksdb-fillrandom`](https://github.com/microsoft/nni/tree/v1.9/examples/trials/systems/rocksdb-fillrandom)*
## Experiment setup
There are mainly three steps to set up an experiment for tuning systems with NNI: define the search space in a `json` file, write the benchmark code, and start the NNI experiment by passing a config file to the NNI manager.
### Search Space
For simplicity, this example tunes three parameters, `write_buffer_size`, `min_write_buffer_num` and `level0_file_num_compaction_trigger`, for writing 16M keys with 20 Bytes of key size and 100 Bytes of value size randomly, based on writing operations per second (OPS). `write_buffer_size` sets the size of a single memtable. Once memtable exceeds this size, it is marked immutable and a new one is created. `min_write_buffer_num` is the minimum number of memtables to be merged before flushing to storage. Once the number of files in level 0 reaches `level0_file_num_compaction_trigger`, level 0 to level 1 compaction is triggered.
In this example, the search space is specified by a `search_space.json` file as shown below. Detailed explanation of search space could be found [here](../Tutorial/SearchSpaceSpec.md).
```json
{
"write_buffer_size": {
"_type": "quniform",
"_value": [2097152, 16777216, 1048576]
},
"min_write_buffer_number_to_merge": {
"_type": "quniform",
"_value": [2, 16, 1]
},
"level0_file_num_compaction_trigger": {
"_type": "quniform",
"_value": [2, 16, 1]
}
}
```
*code directory: [`example/trials/systems/rocksdb-fillrandom/search_space.json`](https://github.com/microsoft/nni/tree/v1.9/examples/trials/systems/rocksdb-fillrandom/search_space.json)*
### Benchmark code
The benchmark code should receive a configuration from the NNI manager and report the corresponding benchmark result back. The following NNI APIs are designed for this purpose (a simplified sketch is given after the list below). In this example, write operations per second (OPS) is used as the performance metric. Please refer to [here](Trials.md) for detailed information.
* Use `nni.get_next_parameter()` to get next system configuration.
* Use `nni.report_final_result(metric)` to report the benchmark result.
*code directory: [`example/trials/systems/rocksdb-fillrandom/main.py`](https://github.com/microsoft/nni/tree/v1.9/examples/trials/systems/rocksdb-fillrandom/main.py)*
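A minimal version of such a benchmark script could look like the sketch below; the `db_bench` flags and output parsing shown here are simplified assumptions, not the actual `main.py` linked above.

```python
import subprocess

import nni


def run_fillrandom(parameters):
    """Run `db_bench --benchmarks=fillrandom` with the given RocksDB options and return write OPS."""
    cmd = [
        'db_bench', '--benchmarks=fillrandom', '--num=16000000',
        '--key_size=20', '--value_size=100',
        '--write_buffer_size={}'.format(int(parameters['write_buffer_size'])),
        '--min_write_buffer_number_to_merge={}'.format(int(parameters['min_write_buffer_number_to_merge'])),
        '--level0_file_num_compaction_trigger={}'.format(int(parameters['level0_file_num_compaction_trigger'])),
    ]
    output = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    # db_bench prints a report line such as
    #   fillrandom   :    2.796 micros/op 357612 ops/sec; ...
    # The parsing below is a simplification of what the real main.py does.
    for line in output.splitlines():
        if line.strip().startswith('fillrandom'):
            return float(line.split('ops/sec')[0].split()[-1])
    raise RuntimeError('fillrandom result not found in db_bench output')


if __name__ == '__main__':
    params = nni.get_next_parameter()
    nni.report_final_result(run_fillrandom(params))
```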
### Config file
One could start an NNI experiment with a config file. A config file for NNI is a `yaml` file usually including experiment settings (`trialConcurrency`, `maxExecDuration`, `maxTrialNum`, `trial gpuNum`, etc.), platform settings (`trainingServicePlatform`, etc.), path settings (`searchSpacePath`, `trial codeDir`, etc.) and tuner settings (`tuner`, `tuner optimize_mode`, etc.). Please refer to [here](../Tutorial/QuickStart.md) for more information.
Here is an example of tuning RocksDB with SMAC algorithm:
*code directory: [`example/trials/systems/rocksdb-fillrandom/config_smac.yml`](https://github.com/microsoft/nni/tree/v1.9/examples/trials/systems/rocksdb-fillrandom/config_smac.yml)*
Here is an example of tuning RocksDB with TPE algorithm:
*code directory: [`example/trials/systems/rocksdb-fillrandom/config_tpe.yml`](https://github.com/microsoft/nni/tree/v1.9/examples/trials/systems/rocksdb-fillrandom/config_tpe.yml)*
Other tuners can be easily adopted in the same way. Please refer to [here](../Tuner/BuiltinTuner.md) for more information.
Finally, we could enter the example folder and start the experiment using following commands:
```bash
# tuning RocksDB with SMAC tuner
nnictl create --config ./config_smac.yml
# tuning RocksDB with TPE tuner
nnictl create --config ./config_tpe.yml
```
## Experiment results
We ran these two examples on the same machine with following details:
* 16 * Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
* 465 GB of rotational hard drive with ext4 file system
* 128 GB of RAM
* Kernel version: 4.15.0-58-generic
* NNI version: v1.0-37-g1bd24577
* RocksDB version: 6.4
* RocksDB DEBUG_LEVEL: 0
The detailed experiment results are shown in the below figure. Horizontal axis is sequential order of trials. Vertical axis is the metric, write OPS in this example. Blue dots represent trials for tuning RocksDB with SMAC tuner, and orange dots stand for trials for tuning RocksDB with TPE tuner.
![image](https://github.com/microsoft/nni/tree/v1.9/examples/trials/systems/rocksdb-fillrandom/plot.png)
Following table lists the best trials and corresponding parameters and metric obtained by the two tuners. Unsurprisingly, both of them found the same optimal configuration for `fillrandom` benchmark.
| Tuner | Best trial | Best OPS | write_buffer_size | min_write_buffer_number_to_merge | level0_file_num_compaction_trigger |
| :---: | :--------: | :------: | :---------------: | :------------------------------: | :--------------------------------: |
| SMAC | 255 | 779289 | 2097152 | 7.0 | 7.0 |
| TPE | 169 | 761456 | 2097152 | 7.0 | 7.0 |
# Scikit-learn in NNI
[Scikit-learn](https://github.com/scikit-learn/scikit-learn) is a popular machine learning tool for data mining and data analysis. It supports many kinds of machine learning models like LinearRegression, LogisticRegression, DecisionTree, SVM, etc. How to use scikit-learn more efficiently is a valuable topic.

NNI supports many kinds of tuning algorithms to search for the best models and/or hyper-parameters for scikit-learn, and supports many kinds of environments such as local machines, remote servers, and the cloud.
## 1. How to run the example
To start using NNI, you should install the NNI package, and use the command line tool `nnictl` to start an experiment. For more information about installation and preparing for the environment, please refer [here](../Tutorial/QuickStart.md).
After you have installed NNI, you can enter the corresponding folder and start the experiment using the following command:
```bash
nnictl create --config ./config.yml
```
## 2. Description of the example
### 2.1 classification
This example uses the digits dataset, which is made up of 1797 8x8 images, each of which is a hand-written digit. The goal is to classify these images into 10 classes.
In this example, we use SVC as the model, and choose some parameters of this model, including `"C", "kernel", "degree", "gamma" and "coef0"`. For more information of these parameters, please [refer](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
### 2.2 regression
This example uses the Boston Housing dataset, which consists of house prices in various places in Boston along with information such as crime rate (CRIM), the proportion of non-retail business in the town (INDUS), the age of the houses (AGE), etc. The task is to predict Boston house prices.
In this example, we tune different kinds of regression models including `"LinearRegression", "SVR", "KNeighborsRegressor", "DecisionTreeRegressor"` and some parameters like `"svr_kernel", "knr_weights"`. You could get more details about these models from [here](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning).
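For illustration, the `model_name` received from the tuner could be mapped to a scikit-learn estimator roughly like this (a sketch; the actual example code may differ):

```python
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor


def build_model(params):
    """Map the 'model_name' chosen by the tuner to a scikit-learn estimator."""
    name = params['model_name']
    if name == 'LinearRegression':
        return LinearRegression()
    if name == 'SVR':
        return SVR(kernel=params.get('svr_kernel', 'rbf'))
    if name == 'KNeighborsRegressor':
        return KNeighborsRegressor(weights=params.get('knr_weights', 'uniform'))
    return DecisionTreeRegressor()
```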
## 3. How to write scikit-learn code using NNI
It is easy to use NNI in your scikit-learn code; there are only a few steps.
* __step 1__
    Prepare a search_space.json to store your search space.
For example, if you want to choose different models, you may try:
```json
{
"model_name":{"_type":"choice","_value":["LinearRegression", "SVR", "KNeighborsRegressor", "DecisionTreeRegressor"]}
}
```
If you want to choose among both different models and their parameters, you can put them together in a single search_space.json file:
```json
{
"model_name":{"_type":"choice","_value":["LinearRegression", "SVR", "KNeighborsRegressor", "DecisionTreeRegressor"]},
"svr_kernel": {"_type":"choice","_value":["linear", "poly", "rbf"]},
"knr_weights": {"_type":"choice","_value":["uniform", "distance"]}
}
```
Then you can read these values as a dict in your Python code, as described in step 2.
* __step 2__
At the beginning of your Python code, you should `import nni` to make the NNI package available.
First, use the `nni.get_next_parameter()` function to get the parameters assigned by NNI. Then use these parameters to update your code.
For example, if you define your search_space.json as follows:
```json
{
"C": {"_type":"uniform","_value":[0.1, 1]},
"kernel": {"_type":"choice","_value":["linear", "rbf", "poly", "sigmoid"]},
"degree": {"_type":"choice","_value":[1, 2, 3, 4]},
"gamma": {"_type":"uniform","_value":[0.01, 0.1]},
"coef0": {"_type":"uniform","_value":[0.01, 0.1]}
}
```
You may get a parameter dict like this:
```python
params = {
'C': 1.0,
'kernel': 'linear',
'degree': 3,
'gamma': 0.01,
'coef0': 0.01
}
```
Then you can use these variables to write your scikit-learn code, as in the sketch below.
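For instance, here is a minimal sketch of such a trial script, assuming the SVC search space above; the digits data loading and train/test split are illustrative choices, not prescribed by NNI:
```python
import nni
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Ask NNI for the hyper-parameters chosen by the tuner for this trial.
params = nni.get_next_parameter()

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Build the model from the tuned hyper-parameters.
model = SVC(C=params['C'], kernel=params['kernel'], degree=params['degree'],
            gamma=params['gamma'], coef0=params['coef0'])
model.fit(X_train, y_train)
```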
* __step 3__
After you finish training, you will have a score for your model, such as precision, recall, or MSE. NNI needs this score to drive the tuning algorithm and generate the next group of parameters, so please report the score back to NNI so that the next trial job can start.
You just need to call `nni.report_final_result(score)` to communicate with NNI after your scikit-learn code has finished. Or, if you have multiple scores during training, you can also report them back to NNI using `nni.report_intermediate_result(score)`. Note that reporting intermediate results is optional, but you must report the final result.
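Continuing the sketch above, the reporting step could look like this:
```python
# Evaluate the trained model on the held-out data and report the score to NNI.
score = model.score(X_test, y_test)
nni.report_final_result(score)
```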
# Automatic Model Architecture Search for Reading Comprehension
This example shows how to use a genetic algorithm to find good model architectures for reading comprehension.
## 1. Search Space
Since attention and RNNs have proven effective in reading comprehension, we define the search space as follows:
1. IDENTITY (Effectively means keep training).
2. INSERT-RNN-LAYER (Inserts an LSTM. Comparing the performance of GRU and LSTM in our experiment, we decided to use LSTM here.)
3. REMOVE-RNN-LAYER
4. INSERT-ATTENTION-LAYER (Inserts an attention layer.)
5. REMOVE-ATTENTION-LAYER
6. ADD-SKIP (Identity between random layers).
7. REMOVE-SKIP (Removes random skip).
![](../../../examples/trials/ga_squad/ga_squad.png)
### New version
We also have another version with a lower time cost and better performance, which will be released soon.
## 2. How to run this example locally?
### 2.1 Use the download script to download data
Execute the following command to download the needed files
using the download script:
```bash
chmod +x ./download.sh
./download.sh
```
Or download the files manually:
1. Download "dev-v1.1.json" and "train-v1.1.json" from https://rajpurkar.github.io/SQuAD-explorer/
```bash
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
```
2. Download "glove.840B.300d.txt" from https://nlp.stanford.edu/projects/glove/
```bash
wget http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip glove.840B.300d.zip
```
### 2.2 Update configuration
Modify `nni/examples/trials/ga_squad/config.yml`. Here is the default configuration:
```yaml
authorName: default
experimentName: example_ga_squad
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 1
#choice: local, remote
trainingServicePlatform: local
#choice: true, false
useAnnotation: false
tuner:
codeDir: ~/nni/examples/tuners/ga_customer_tuner
classFileName: customer_tuner.py
className: CustomerTuner
classArgs:
optimize_mode: maximize
trial:
command: python3 trial.py
codeDir: ~/nni/examples/trials/ga_squad
gpuNum: 0
```
In the "trial" part, if you want to use GPU to perform the architecture search, change `gpuNum` from `0` to `1`. You need to increase the `maxTrialNum` and `maxExecDuration`, according to how long you want to wait for the search result.
### 2.3 Submit this job
```bash
nnictl create --config ~/nni/examples/trials/ga_squad/config.yml
```
## 3. Run this example on OpenPAI
Due to the upload size limitation, we only upload the source code and perform the data download and training on OpenPAI. This experiment requires sufficient memory (`memoryMB >= 32G`), and the training may last for several hours.
### 3.1 Update configuration
Modify `nni/examples/trials/ga_squad/config_pai.yml`. Here is the default configuration:
```yaml
authorName: default
experimentName: example_ga_squad
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: pai
#choice: true, false
useAnnotation: false
#Your nni_manager ip
nniManagerIp: 10.10.10.10
tuner:
codeDir: https://github.com/Microsoft/nni/tree/v1.9/examples/tuners/ga_customer_tuner
classFileName: customer_tuner.py
className: CustomerTuner
classArgs:
optimize_mode: maximize
trial:
command: chmod +x ./download.sh && ./download.sh && python3 trial.py
codeDir: .
gpuNum: 0
cpuNum: 1
memoryMB: 32869
#The docker image to run nni job on OpenPAI
image: msranni/nni:latest
paiConfig:
#The username to login OpenPAI
userName: username
#The password to login OpenPAI
passWord: password
#The host of restful server of OpenPAI
host: 10.10.10.10
```
Please change the default values to your personal account and machine information, including `nniManagerIp`, `userName`, `passWord`, and `host`.
In the "trial" part, if you want to use a GPU to perform the architecture search, change `gpuNum` from `0` to `1`. You should increase `maxTrialNum` and `maxExecDuration` according to how long you are willing to wait for the search result.
`trialConcurrency` is the number of trials running concurrently; if you set `gpuNum` to `1`, it is also the number of GPUs you want to use.
### 3.2 Submit this job
```bash
nnictl create --config ~/nni/examples/trials/ga_squad/config_pai.yml
```
## 4. Technical details about the trial
### 4.1 How does it work
The evolution-algorithm-based architecture search for question answering has two parts, just like any other example: the trial and the tuner.
### 4.2 The trial
The trial consists of many different files, functions, and classes. Here we only give a brief introduction to most of those files:
* `attention.py` contains an implementation of the attention mechanism in TensorFlow.
* `data.py` contains functions for data preprocessing.
* `evaluate.py` contains the evaluation script.
* `graph.py` contains the definition of the computation graph.
* `rnn.py` contains an implementation of GRU in TensorFlow.
* `train_model.py` is a wrapper for the whole question answering model.
Among those files, `trial.py` and `graph_to_tf.py` are special.
`graph_to_tf.py` has a function named `graph_to_network`; here is its skeleton code:
```python
def graph_to_network(input1,
input2,
input1_lengths,
input2_lengths,
graph,
dropout_rate,
is_training,
num_heads=1,
rnn_units=256):
topology = graph.is_topology()
layers = dict()
layers_sequence_lengths = dict()
num_units = input1.get_shape().as_list()[-1]
layers[0] = input1*tf.sqrt(tf.cast(num_units, tf.float32)) + \
positional_encoding(input1, scale=False, zero_pad=False)
layers[1] = input2*tf.sqrt(tf.cast(num_units, tf.float32))
layers[0] = dropout(layers[0], dropout_rate, is_training)
layers[1] = dropout(layers[1], dropout_rate, is_training)
layers_sequence_lengths[0] = input1_lengths
layers_sequence_lengths[1] = input2_lengths
for _, topo_i in enumerate(topology):
if topo_i == '|':
continue
if graph.layers[topo_i].graph_type == LayerType.input.value:
# ......
elif graph.layers[topo_i].graph_type == LayerType.attention.value:
# ......
# More layers to handle
```
As we can see, this function is actually a compiler that converts the internal model DAG configuration `graph` (which will be introduced in the `Model configuration format` section) into a TensorFlow computation graph.
```python
topology = graph.is_topology()
```
performs topological sorting on the internal graph representation, and the code inside the loop:
```python
for _, topo_i in enumerate(topology):
```
performs the actual conversion, mapping each layer to a part of the TensorFlow computation graph.
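To make the topological-sorting step more concrete, here is a minimal sketch of Kahn's algorithm over a toy layer graph; the `layers` mapping and the helper name are illustrative only and are not the actual NNI data structures:
```python
from collections import deque

def topological_order(layers):
    """Return layer indices in an order where every layer comes after its inputs.

    `layers` maps a layer index to the list of layer indices it takes as input
    (an illustrative structure, not the real NNI graph class).
    """
    # Count how many unresolved inputs each layer still has.
    in_degree = {i: len(inputs) for i, inputs in layers.items()}
    # Build the reverse adjacency: which layers consume each layer's output.
    consumers = {i: [] for i in layers}
    for i, inputs in layers.items():
        for j in inputs:
            consumers[j].append(i)

    queue = deque(i for i, degree in in_degree.items() if degree == 0)
    order = []
    while queue:
        i = queue.popleft()
        order.append(i)
        for j in consumers[i]:
            in_degree[j] -= 1
            if in_degree[j] == 0:
                queue.append(j)
    return order

# Two inputs (0, 1) feeding an attention layer (2), which feeds an output layer (3).
print(topological_order({0: [], 1: [], 2: [0, 1], 3: [2]}))  # [0, 1, 2, 3]
```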
### 4.3 The tuner
The tuner is much simpler than the trial. They actually share the same `graph.py`. Besides, the tuner has a `customer_tuner.py`, whose most important class is `CustomerTuner`:
```python
class CustomerTuner(Tuner):
# ......
def generate_parameters(self, parameter_id):
"""Returns a set of trial graph config, as a serializable object.
parameter_id : int
"""
if len(self.population) <= 0:
logger.debug("the len of poplution lower than zero.")
raise Exception('The population is empty')
pos = -1
for i in range(len(self.population)):
if self.population[i].result == None:
pos = i
break
if pos != -1:
indiv = copy.deepcopy(self.population[pos])
self.population.pop(pos)
temp = json.loads(graph_dumps(indiv.config))
else:
random.shuffle(self.population)
if self.population[0].result > self.population[1].result:
self.population[0] = self.population[1]
indiv = copy.deepcopy(self.population[0])
self.population.pop(1)
indiv.mutation()
graph = indiv.config
temp = json.loads(graph_dumps(graph))
# ......
```
As we can see, the overridden method `generate_parameters` implements a fairly naive mutation algorithm. The code lines:
```python
if self.population[0].result > self.population[1].result:
self.population[0] = self.population[1]
indiv = copy.deepcopy(self.population[0])
```
control the mutation process: the tuner always takes two random individuals from the population, keeping and mutating only the one with the better result.
### 4.4 Model configuration format
Here is an example of the model configuration, which is passed from the tuner to the trial in the architecture search procedure.
```json
{
"max_layer_num": 50,
"layers": [
{
"input_size": 0,
"type": 3,
"output_size": 1,
"input": [],
"size": "x",
"output": [4, 5],
"is_delete": false
},
{
"input_size": 0,
"type": 3,
"output_size": 1,
"input": [],
"size": "y",
"output": [4, 5],
"is_delete": false
},
{
"input_size": 1,
"type": 4,
"output_size": 0,
"input": [6],
"size": "x",
"output": [],
"is_delete": false
},
{
"input_size": 1,
"type": 4,
"output_size": 0,
"input": [5],
"size": "y",
"output": [],
"is_delete": false
},
{"Comment": "More layers will be here for actual graphs."}
]
}
```
Every model configuration will have a "layers" section, which is a JSON list of layer definitions. The definition of each layer is also a JSON object, where:
* `type` is the type of the layer. The codes 0, 1, 2, 3, and 4 correspond to attention, self-attention, RNN, input, and output layers, respectively.
* `size` is the length of the output. "x" and "y" correspond to the document length and the question length, respectively.
* `input_size` is the number of inputs the layer has.
* `input` lists the indices of the layers taken as input by this layer.
* `output` lists the indices of the layers that use this layer's output as their input.
* `is_delete` indicates whether the layer has been deleted and is therefore no longer available.
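As an illustration of how such a configuration can be consumed, here is a minimal sketch that walks the `layers` list and prints a readable summary of the layers that are not deleted; the type-code mapping follows the list above, and the helper function is ours, not part of NNI:
```python
import json

# Type codes as described above.
LAYER_TYPES = {0: "attention", 1: "self-attention", 2: "rnn", 3: "input", 4: "output"}

def summarize(config):
    """Print the active (not deleted) layers of a model configuration dict."""
    for index, layer in enumerate(config["layers"]):
        if layer["is_delete"]:
            continue  # Skip layers that a mutation has marked as deleted.
        name = LAYER_TYPES.get(layer["type"], "unknown")
        print(f"layer {index}: {name}, size={layer['size']}, "
              f"inputs={layer['input']}, outputs={layer['output']}")

# Usage: load a configuration received from the tuner and summarize it.
config_str = '{"max_layer_num": 50, "layers": [{"input_size": 0, "type": 3, ' \
             '"output_size": 1, "input": [], "size": "x", "output": [4, 5], ' \
             '"is_delete": false}]}'
summarize(json.loads(config_str))
```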