AdaptDL Training Service
========================
Now NNI supports running experiments on `AdaptDL <https://github.com/petuum/adaptdl>`__, a resource-adaptive deep learning training and scheduling framework. With the AdaptDL training service, your trial program runs as an AdaptDL job in a Kubernetes cluster.
AdaptDL aims to make distributed deep learning easy and efficient in dynamic-resource environments such as shared clusters and the cloud.

.. note:: AdaptDL doesn't support :ref:`reuse mode <training-service-reuse>`.
Prerequisite
------------
Before starting to use the NNI AdaptDL training service, you should have a Kubernetes cluster, either on-premises or on `Azure Kubernetes Service (AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__, and a Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is set up to connect to your Kubernetes cluster.

#. A **Kubernetes** cluster using Kubernetes 1.14 or later with storage. Follow this guideline to set up Kubernetes `on Azure <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__, or `on-premises <https://kubernetes.io/docs/setup/>`__ with `cephfs <https://kubernetes.io/docs/concepts/storage/storage-classes/#ceph-rbd>`__, or `microk8s with the storage add-on enabled <https://microk8s.io/docs/addons>`__.
#. Install the **AdaptDL scheduler** in your Kubernetes cluster with Helm. Follow this `guideline <https://adaptdl.readthedocs.io/en/latest/installation/install-adaptdl.html>`__ to set up the AdaptDL scheduler.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager uses ``$(HOME)/.kube/config`` as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. If your NNI trial jobs need GPU resources, follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure the **Nvidia device plugin for Kubernetes**.
#. (Optional) Prepare an **NFS server** and export a general purpose mount as external storage.
#. Install **NNI**.

Verify the Prerequisites
^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash

   nnictl --version
   # Expected: <version_number>

.. code-block:: bash

   kubectl version
   # Expected that the kubectl client version matches the server version.

.. code-block:: bash

   kubectl api-versions | grep adaptdl
   # Expected: adaptdl.petuum.com/v1
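
The last check can also be scripted. Below is a small hypothetical helper (for illustration only; not part of NNI) that inspects the output of ``kubectl api-versions``:

.. code-block:: python

   def has_adaptdl(api_versions_output: str) -> bool:
       """Return True if the AdaptDL API group is registered in the cluster.

       ``api_versions_output`` is the text printed by ``kubectl api-versions``.
       """
       return any(
           line.strip().startswith("adaptdl.petuum.com/")
           for line in api_versions_output.splitlines()
       )

   # In practice, feed it the real command output, e.g.:
   #   import subprocess
   #   out = subprocess.run(["kubectl", "api-versions"],
   #                        capture_output=True, text=True).stdout
   #   assert has_adaptdl(out)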
Usage
-----
We have a CIFAR10 example that fully leverages the AdaptDL scheduler under the :githublink:`examples/trials/cifar10_pytorch` folder (:githublink:`main_adl.py <examples/trials/cifar10_pytorch/main_adl.py>` and :githublink:`config_adl.yml <examples/trials/cifar10_pytorch/config_adl.yml>`).

Here is a template configuration specification for using AdaptDL as a training service.
.. code-block:: yaml

   authorName: default
   experimentName: minimal_adl

   trainingServicePlatform: adl
   nniManagerIp: 10.1.10.11
   logCollection: http

   tuner:
     builtinTunerName: GridSearch
   searchSpacePath: search_space.json

   trialConcurrency: 2
   maxTrialNum: 2

   trial:
     adaptive: false # optional.
     image: <image_tag>
     imagePullSecrets: # optional
       - name: stagingsecret
     codeDir: .
     command: python main.py
     gpuNum: 1
     cpuNum: 1 # optional
     memorySize: 8Gi # optional
     nfs: # optional
       server: 10.20.41.55
       path: /
       containerMountPath: /nfs
     checkpoint: # optional
       storageClass: dfs
       storageSize: 1Gi
.. warning::

   This configuration is written following the specification of the `legacy experiment configuration <https://nni.readthedocs.io/en/v2.6/Tutorial/ExperimentConfig.html>`__. It is still supported, and will be updated to the latest version in a future release.
The following explains the configuration fields of the AdaptDL training service.

* **trainingServicePlatform**\ : Choose ``adl`` to use the Kubernetes cluster with the AdaptDL scheduler.
* **nniManagerIp**\ : *Required* for the ``adl`` training service to get the correct info and metrics back from the cluster.
  IP address of the machine with NNI manager (NNICTL) that launches the NNI experiment.
* **logCollection**\ : *Recommended* to set as ``http``. It will collect the trial logs on the cluster back to your machine via http.
* **tuner**\ : It supports the Tuun tuner and all NNI built-in tuners (except for the checkpoint feature of the NNI PBT tuners).
* **trial**\ : It defines the specs of an ``adl`` trial.

  * **namespace**\ : (*Optional*\ ) Kubernetes namespace to launch the trials. Defaults to the ``default`` namespace.
  * **adaptive**\ : (*Optional*\ ) Boolean for the AdaptDL trainer. When ``true``\ , the job is preemptible and adaptive.
  * **image**\ : Docker image for the trial.
  * **imagePullSecrets**\ : (*Optional*\ ) If you are using a private registry, you need to provide the secret to successfully pull the image.
  * **codeDir**\ : the working directory of the container. ``.`` means the default working directory defined by the image.
  * **command**\ : the bash command to start the trial.
  * **gpuNum**\ : the number of GPUs requested for this trial. It must be a non-negative integer.
  * **cpuNum**\ : (*Optional*\ ) the number of CPUs requested for this trial. It must be a non-negative integer.
  * **memorySize**\ : (*Optional*\ ) the size of memory requested for this trial. It must follow the Kubernetes `default format <https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory>`__.
  * **nfs**\ : (*Optional*\ ) mounting external storage. For more information about using NFS please check the paragraph below.
  * **checkpoint**\ : (*Optional*\ ) storage settings for model checkpoints.

    * **storageClass**\ : check the `Kubernetes storage documentation <https://kubernetes.io/docs/concepts/storage/storage-classes/>`__ for how to use the appropriate ``storageClass``.
    * **storageSize**\ : this value should be large enough to fit your model's checkpoints, or it could cause a "disk quota exceeded" error.
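
As a quick illustration of these requirements, a trial spec can be sanity-checked before launching with a few lines of Python. This is a hypothetical helper, not part of NNI; NNI performs its own validation when the experiment starts.

.. code-block:: python

   # Minimal sanity check for an ``adl`` trial spec (illustrative only).
   REQUIRED_TRIAL_FIELDS = {"image", "codeDir", "command", "gpuNum"}

   def check_adl_trial(trial: dict) -> list:
       """Return a list of problems found in an ``adl`` trial spec."""
       problems = [f"missing field: {f}" for f in sorted(REQUIRED_TRIAL_FIELDS - trial.keys())]
       gpu_num = trial.get("gpuNum", 0)
       if not isinstance(gpu_num, int) or gpu_num < 0:
           problems.append("gpuNum must be a non-negative integer")
       return problems

   trial = {"image": "<image_tag>", "codeDir": ".", "command": "python main.py", "gpuNum": 1}
   assert check_adl_trial(trial) == []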
More Features
-------------
NFS Storage
^^^^^^^^^^^
As you may have noticed in the above configuration spec, an *optional* section is available to configure NFS external storage. It can be omitted when no external storage is required, for example when a Docker image is sufficient with code and data inside.

Note that the ``adl`` training service does NOT mount an NFS to the local dev machine, so you can manually mount it locally, manage the filesystem, copy the data or code, etc.
The ``adl`` training service can then mount it to Kubernetes for every trial, with the proper configurations:

* **server**\ : NFS server address, e.g. an IP address or a domain name.
* **path**\ : NFS server export path, i.e. the absolute path in NFS that can be mounted into trials.
* **containerMountPath**\ : the absolute path in the container at which to mount the NFS **path** above, so that every trial has access to the NFS.
  In the trial containers, you can access the NFS with this path.

Use cases:

* If your training trials depend on a large dataset, you may want to download it onto the NFS first, and mount it so that it can be shared across multiple trials.
* The storage of containers is ephemeral, and trial containers are deleted after a trial's lifecycle is over. So if you want to export your trained models, you may mount the NFS into the trial to persist and export them.

In short, there is no restriction on how a trial reads from or writes to the NFS storage, so you may use it flexibly as per your needs.
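
For instance, a trial can persist its trained model by writing under the configured ``containerMountPath`` (here ``/nfs``, as in the template above). The ``export_model`` helper and the file layout are hypothetical, just to sketch the idea:

.. code-block:: python

   import os

   def export_model(model_bytes: bytes, trial_id: str, mount_path: str = "/nfs") -> str:
       """Persist a trained model under the NFS mount so it survives the trial container."""
       out_dir = os.path.join(mount_path, "models", trial_id)
       os.makedirs(out_dir, exist_ok=True)
       out_file = os.path.join(out_dir, "model.bin")
       with open(out_file, "wb") as f:
           f.write(model_bytes)
       return out_file

   # Inside a trial this could be, e.g.:
   #   export_model(serialized_weights, os.environ["NNI_TRIAL_JOB_ID"])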
Monitor via Log Stream
^^^^^^^^^^^^^^^^^^^^^^
Follow the log streaming of a certain trial:

.. code-block:: bash

   nnictl log trial --trial_id=TRIAL_ID

Or, when multiple experiments are running, for a trial of a specific experiment:

.. code-block:: bash

   nnictl log trial EXPERIMENT_ID --trial_id=TRIAL_ID

Note that *after* a trial has finished and its pod has been deleted, no logs can be retrieved via this command.
However, you may still be able to access past trial logs using the following approach.
Monitor via TensorBoard
^^^^^^^^^^^^^^^^^^^^^^^
In the context of NNI, an experiment has multiple trials. For easy comparison across trials of a model tuning process, we support TensorBoard integration. Here each experiment has an independent TensorBoard logging directory, and thus its own dashboard.

You can only use TensorBoard while the monitored experiment is running. In other words, monitoring stopped experiments is not supported.

In the trial container you have access to two environment variables:

* ``ADAPTDL_TENSORBOARD_LOGDIR``\ : the TensorBoard logging directory for the current experiment,
* ``NNI_TRIAL_JOB_ID``\ : the ``trial`` job id of the current trial.

It is recommended to join them as the logging directory for a trial, for example in Python:
.. code-block:: python

   import os
   tensorboard_logdir = os.path.join(
       os.getenv("ADAPTDL_TENSORBOARD_LOGDIR"),
       os.getenv("NNI_TRIAL_JOB_ID")
   )
If an experiment is stopped, the data logged here (defined by *the above envs*, for monitoring with the following commands) will be lost. To persist the logged data, you can use external storage (e.g. mount an NFS) to export it and view the TensorBoard locally.
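
A minimal sketch of such an export, assuming the NFS is mounted at ``/nfs`` as in the template above (the destination layout is an assumption for illustration):

.. code-block:: python

   import os
   import shutil

   def export_tensorboard_logs(logdir: str, mount_path: str = "/nfs") -> str:
       """Copy the TensorBoard logs onto external storage so they outlive the container."""
       dest = os.path.join(mount_path, "tensorboard", os.path.basename(logdir))
       shutil.copytree(logdir, dest, dirs_exist_ok=True)
       return dest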
With the above setting, you can easily monitor the experiment via TensorBoard by running

.. code-block:: bash

   nnictl tensorboard start

If multiple experiments are running at the same time, you may use

.. code-block:: bash

   nnictl tensorboard start EXPERIMENT_ID

It will provide you the web URL to access the TensorBoard.
Note that you have the flexibility to set up the local ``--port`` for the TensorBoard.
AML Training Service
====================
To run your trials on `AzureML <https://azure.microsoft.com/en-us/services/machine-learning/>`__, you can use the AML training service, which programmatically submits runs to the AzureML platform and collects their metrics.
Prerequisite
------------
1. Create an Azure account/subscription using this `link <https://azure.microsoft.com/en-us/free/services/machine-learning/>`__. If you already have an Azure account/subscription, skip this step.
2. Install the Azure CLI on your machine; follow the install guide `here <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__.
3. Authenticate to your Azure subscription from the CLI. To authenticate interactively, open a command line or terminal and use the following command:

   .. code-block:: bash

      az login

4. Log into your Azure account with a web browser and create a Machine Learning resource. You will need to choose a resource group and specify a workspace name. Then download ``config.json``, which will be used later.

   .. image:: ../../../img/aml_workspace.png

5. Create an AML cluster as the compute target.

   .. image:: ../../../img/aml_cluster.png

6. Open a command line and install the AML packages:

   .. code-block:: bash

      python3 -m pip install azureml
      python3 -m pip install azureml-sdk
Usage
-----
We show an example configuration here with YAML (Python configuration should be similar).

.. code-block:: yaml

   trialConcurrency: 1
   maxTrialNumber: 10
   ...
   trainingService:
     platform: aml
     dockerImage: msranni/nni
     subscriptionId: ${your subscription ID}
     resourceGroup: ${your resource group}
     workspaceName: ${your workspace name}
     computeTarget: ${your compute target}
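
The same settings in Python would look roughly like the sketch below. This is a hedged sketch using the ``nni.experiment`` API; the snake_case field names mirror the YAML above and may need adjusting for your NNI version, and the placeholder values must be replaced with your own.

.. code-block:: python

   from nni.experiment import Experiment

   experiment = Experiment('aml')
   experiment.config.trial_concurrency = 1
   experiment.config.max_trial_number = 10
   experiment.config.training_service.docker_image = 'msranni/nni'
   experiment.config.training_service.subscription_id = '${your subscription ID}'
   experiment.config.training_service.resource_group = '${your resource group}'
   experiment.config.training_service.workspace_name = '${your workspace name}'
   experiment.config.training_service.compute_target = '${your compute target}'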
Configuration References
------------------------
Compared with :doc:`local` and :doc:`remote`, the AML training service supports the following additional configurations.
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Field name
     - Description
   * - dockerImage
     - Required field. The Docker image name used in the job. If you don't want to build your own, NNI provides a Docker image `msranni/nni <https://hub.docker.com/r/msranni/nni>`__, which is up-to-date with every NNI release.
   * - subscriptionId
     - Required field. The subscription ID of your account; it can be found in the ``config.json`` described above.
   * - resourceGroup
     - Required field. The resource group of your account; it can be found in the ``config.json`` described above.
   * - workspaceName
     - Required field. The workspace name of your account; it can be found in the ``config.json`` described above.
   * - computeTarget
     - Required field. The compute cluster name you want to use in your AML workspace. See this `reference <https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target>`__ and Step 5 above.
   * - maxTrialNumberPerGpu
     - Optional field. Default 1. Used to specify the max concurrent trial number on a GPU device.
   * - useActiveGpu
     - Optional field. Default false. Used to specify whether to use a GPU if there is another process on it. By default, NNI will use a GPU only if there is no other active process on it. See :doc:`local` for details.
Monitor your trial on the cloud by using AML studio
---------------------------------------------------
To see your trial job's detailed status on the cloud, visit the studio of the workspace you created at Step 4 above. Once the job completes, go to the **Outputs + logs** tab. There you can see a ``70_driver_log.txt`` file. This file contains the standard output from a run and can be useful when you're debugging remote runs in the cloud. Learn more about AML from `here <https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-1st-experiment-hello-world>`__.
Customize a Training Service
============================
Overview
--------
TrainingService is a module for platform management and job scheduling in NNI. TrainingService is designed to be easily implemented: we define an abstract class TrainingService as the parent class of all kinds of TrainingService, and users just need to inherit the parent class and complete their own child class if they want to implement a customized TrainingService.
System architecture
-------------------
.. image:: ../../../img/NNIDesign.jpg
   :target: ../../../img/NNIDesign.jpg
   :alt:

The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of the system, in charge of calling TrainingService to manage trial jobs, and of the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module to manage trial jobs; it communicates with the NNIManager module, and has different instances for different training platforms. For the time being, NNI supports :doc:`./local`, :doc:`./remote`, :doc:`./openpai`, :doc:`./kubeflow` and :doc:`./frameworkcontroller`.
In this document, we introduce the brief design of TrainingService. Users who want to add a new TrainingService instance just need to complete a child class that implements TrainingService; they don't need to understand the code details of NNIManager, Dispatcher or other modules.
Folder structure of code
------------------------
NNI's folder structure is shown below:
.. code-block:: text

   nni
   |- deployment
   |- docs
   |- examples
   |- src
   | |- nni_manager
   | | |- common
   | | |- config
   | | |- core
   | | |- coverage
   | | |- dist
   | | |- rest_server
   | | |- training_service
   | | | |- common
   | | | |- kubernetes
   | | | |- local
   | | | |- pai
   | | | |- remote_machine
   | | | |- test
   | |- sdk
   | |- webui
   |- test
   |- tools
   | |- nni_annotation
   | |- nni_cmd
   | |- nni_gpu_tool
   | |- nni_trial_tool
The ``nni/src/`` folder stores most of NNI's source code. The code in this folder is related to NNIManager, TrainingService, the SDK, the WebUI and other modules. Users can find the abstract class of TrainingService in the ``nni/src/nni_manager/common/trainingService.ts`` file, and should put their own implemented TrainingService in the ``nni/src/nni_manager/training_service`` folder. If users have implemented their own TrainingService code, they should also supplement it with unit tests, placed in the ``nni/src/nni_manager/training_service/test`` folder.
Function annotation of TrainingService
--------------------------------------
.. code-block:: typescript

   abstract class TrainingService {
       public abstract listTrialJobs(): Promise<TrialJobDetail[]>;
       public abstract getTrialJob(trialJobId: string): Promise<TrialJobDetail>;
       public abstract addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
       public abstract removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
       public abstract submitTrialJob(form: JobApplicationForm): Promise<TrialJobDetail>;
       public abstract updateTrialJob(trialJobId: string, form: JobApplicationForm): Promise<TrialJobDetail>;
       public abstract get isMultiPhaseJobSupported(): boolean;
       public abstract cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean): Promise<void>;
       public abstract setClusterMetadata(key: string, value: string): Promise<void>;
       public abstract getClusterMetadata(key: string): Promise<string>;
       public abstract cleanUp(): Promise<void>;
       public abstract run(): Promise<void>;
   }
The parent class of TrainingService has a few abstract functions, users need to inherit the parent class and implement all of these abstract functions.
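
The inherit-and-implement pattern can be sketched in a language-neutral way. Below is a much-simplified Python illustration (the real interface is the TypeScript class above; the method subset, names and statuses here are illustrative):

.. code-block:: python

   from abc import ABC, abstractmethod

   class TrainingService(ABC):
       """Toy stand-in for the TypeScript abstract class above."""
       @abstractmethod
       def submit_trial_job(self, form: dict) -> dict: ...
       @abstractmethod
       def cancel_trial_job(self, trial_job_id: str) -> None: ...
       @abstractmethod
       def run(self) -> None: ...
       @abstractmethod
       def clean_up(self) -> None: ...

   class LocalDebugService(TrainingService):
       """A toy child class: every abstract method must be implemented."""
       def __init__(self):
           self.jobs = {}
       def submit_trial_job(self, form):
           job = {"id": f"trial_{len(self.jobs)}", "status": "WAITING", "form": form}
           self.jobs[job["id"]] = job
           return job
       def cancel_trial_job(self, trial_job_id):
           self.jobs[trial_job_id]["status"] = "USER_CANCELED"
       def run(self):
           pass  # the main loop would live here
       def clean_up(self):
           self.jobs.clear()

Instantiating ``TrainingService`` directly fails with ``TypeError``, which is exactly the "implement all abstract functions" contract.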
**setClusterMetadata(key: string, value: string)**
ClusterMetadata is the data related to platform details. For example, the ClusterMetadata defined for a remote machine server is:
.. code-block:: typescript

   export class RemoteMachineMeta {
       public readonly ip: string;
       public readonly port: number;
       public readonly username: string;
       public readonly passwd?: string;
       public readonly sshKeyPath?: string;
       public readonly passphrase?: string;
       public gpuSummary: GPUSummary | undefined;
       /* GPU reservation info; the key is the GPU index, the value is the id of the job which reserves this GPU */
       public gpuReservation: Map<number, string>;

       constructor(ip: string, port: number, username: string, passwd: string,
           sshKeyPath: string, passphrase: string) {
           this.ip = ip;
           this.port = port;
           this.username = username;
           this.passwd = passwd;
           this.sshKeyPath = sshKeyPath;
           this.passphrase = passphrase;
           this.gpuReservation = new Map<number, string>();
       }
   }
The metadata includes the host address, the username, and other configuration related to the platform. Users need to define their own metadata format and set the metadata instance in this function. This function is called before the experiment starts, to set the configuration of remote machines.
**getClusterMetadata(key: string)**
This function returns the metadata value according to the given key; it can be left empty if users don't need to use it.
**submitTrialJob(form: JobApplicationForm)**
SubmitTrialJob is the function to submit new trial jobs; users should generate a job instance of the TrialJobDetail type. TrialJobDetail is defined as follows:
.. code-block:: typescript

   interface TrialJobDetail {
       readonly id: string;
       readonly status: TrialJobStatus;
       readonly submitTime: number;
       readonly startTime?: number;
       readonly endTime?: number;
       readonly tags?: string[];
       readonly url?: string;
       readonly workingDirectory: string;
       readonly form: JobApplicationForm;
       readonly sequenceId: number;
       isEarlyStopped?: boolean;
   }
Depending on the implementation, users can put the job detail into a job queue and keep fetching jobs from the queue to prepare and run them, or they can finish the prepare-and-run process inside this function and return the job detail after the submit work is done.
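
The queue-based variant can be sketched as follows. This is a minimal Python illustration of the pattern only, not NNI's actual TypeScript code; the statuses and field names are simplified:

.. code-block:: python

   import queue
   import time

   class QueueBasedService:
       """Sketch: submit_trial_job enqueues; run() drains the queue and runs jobs."""
       def __init__(self):
           self.job_queue = queue.Queue()
           self.finished = []
           self.stopped = False

       def submit_trial_job(self, form: dict) -> dict:
           detail = {
               "id": f"trial_{len(self.finished) + self.job_queue.qsize()}",
               "status": "WAITING",
               "submitTime": int(time.time() * 1000),
               "form": form,
           }
           self.job_queue.put(detail)
           return detail

       def run(self):
           # Main loop: prepare and run queued jobs until the experiment stops.
           while not self.stopped or not self.job_queue.empty():
               try:
                   job = self.job_queue.get(timeout=0.1)
               except queue.Empty:
                   continue
               job["status"] = "RUNNING"    # platform-specific launch would happen here
               job["status"] = "SUCCEEDED"  # ...and status detection here
               self.finished.append(job)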
**cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean)**
If this function is called, the trial started by the platform should be canceled. Different platforms have different methods to cancel a running job; this function should be implemented according to the specific platform.
**updateTrialJob(trialJobId: string, form: JobApplicationForm)**
This function is called to update the trial job's status. The status should be detected according to the specific platform, and updated to ``RUNNING``\ , ``SUCCEED``\ , ``FAILED``\ , etc.
**getTrialJob(trialJobId: string)**
This function returns a trialJob detail instance according to trialJobId.
**listTrialJobs()**
Users should put the detail information of all trial jobs into a list, and return the list.
**addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)**
NNI holds an EventEmitter to get job metrics; when new job metrics are detected, the EventEmitter is triggered. Users should start the EventEmitter in this function.
**removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)**
Close the EventEmitter.
**run()**
The run() function is the main loop in TrainingService. Users can set up a while loop here to execute their logic, and exit when the experiment is stopped.
**cleanUp()**
This function is called to clean up the environment when an experiment is stopped. Users should do platform-related cleanup operations in this function.
TrialKeeper tool
----------------
NNI offers a TrialKeeper tool to help maintain trial jobs. Users can find the source code in ``nni/tools/nni_trial_tool``. If users want to run trial jobs on a cloud platform, this tool is a fine choice to keep trials running on the platform.

The running architecture of TrialKeeper is shown as follows:
.. image:: ../../../img/trialkeeper.jpg
   :target: ../../../img/trialkeeper.jpg
   :alt:

When users submit a trial job to a cloud platform, they should wrap their trial command into TrialKeeper, and start a TrialKeeper process on the cloud platform. Notice that TrialKeeper uses a RESTful server to communicate with TrainingService; users should start a RESTful server on the local machine to receive metrics sent from TrialKeeper. The source code of the RESTful server can be found in ``nni/src/nni_manager/training_service/common/clusterJobRestServer.ts``.
Reference
---------
For the guideline on how to contribute, please refer to :doc:`/notes/contributing`.
FrameworkController Training Service
====================================
NNI supports running experiments using `FrameworkController <https://github.com/Microsoft/frameworkcontroller>`__, called frameworkcontroller mode.
FrameworkController is built to orchestrate all kinds of applications on Kubernetes, so you don't need to install Kubeflow for a specific deep learning framework like tf-operator or pytorch-operator.
Now you can use FrameworkController as the training service to run NNI experiments.
Prerequisite for on-premises Kubernetes Service
-----------------------------------------------
1. A **Kubernetes** cluster using Kubernetes 1.8 or later.
   Follow this `guideline <https://kubernetes.io/docs/setup/>`__ to set up Kubernetes.
2. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server.
   By default, NNI manager uses ``~/.kube/config`` as the kubeconfig file's path.
   You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable.
   Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
3. If your NNI trial jobs need GPU resources, follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure the **Nvidia device plugin for Kubernetes**.
4. Prepare an **NFS server** and export a general purpose mount
   (we recommend mapping your NFS server path with the ``root_squash`` option,
   otherwise permission issues may arise when NNI copies files to NFS;
   refer to this `page <https://linux.die.net/man/5/exports>`__ to learn what the root_squash option is),
   or **Azure File Storage**.
5. Install the **NFS client** on the machine where you install NNI and run nnictl to create experiments.
   Run this command to install the NFSv4 client:

   .. code-block:: bash

      apt install nfs-common

6. Install **NNI**:

   .. code-block:: bash

      python -m pip install nni
Prerequisite for Azure Kubernetes Service
-----------------------------------------
1. NNI supports FrameworkController based on Azure Kubernetes Service;
   follow the `guideline <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__ to set up Azure Kubernetes Service.
2. Install the `Azure CLI <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__ and **kubectl**.
   Use ``az login`` to set your Azure account, and connect the kubectl client to AKS;
   refer to this `guideline <https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster>`__.
3. Follow the `guideline <https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal>`__
   to create an Azure file storage account.
   If you use Azure Kubernetes Service, NNI needs the Azure Storage Service to store code files and output files.
4. To access the Azure storage service, NNI needs the access key of the storage account,
   and NNI uses the `Azure Key Vault <https://azure.microsoft.com/en-us/services/key-vault/>`__ service to protect your private key.
   Set up the Azure Key Vault service, and add a secret to Key Vault to store the access key of the Azure storage account.
   Follow this `guideline <https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli>`__ to store the access key.
Setup FrameworkController
-------------------------
Follow the `guideline <https://github.com/Microsoft/frameworkcontroller/tree/master/example/run>`__
to set up FrameworkController in the Kubernetes cluster; NNI supports FrameworkController in the stateful set mode.
If your cluster enforces authorization, you need to create a service account with granted permission for FrameworkController,
and then pass the name of the FrameworkController service account to the NNI experiment config.
Design
------
Please refer to the design of the :doc:`Kubeflow training service <kubeflow>`;
the FrameworkController training service pipeline is similar.
Example
-------
The FrameworkController config format is:

.. code-block:: python

   from nni.experiment import (
       Experiment,
       FrameworkAttemptCompletionPolicy,
       FrameworkControllerRoleConfig,
       K8sNfsConfig,
   )

   experiment = Experiment('frameworkcontroller')
   experiment.config.trial_code_directory = '.'
   experiment.config.search_space = search_space
   experiment.config.tuner.name = 'TPE'
   experiment.config.tuner.class_args['optimize_mode'] = 'maximize'
   experiment.config.max_trial_number = 10
   experiment.config.trial_concurrency = 2
   experiment.config.training_service.storage = K8sNfsConfig()
   experiment.config.training_service.storage.server = '10.20.30.40'
   experiment.config.training_service.storage.path = '/mnt/nfs/nni'
   experiment.config.training_service.task_roles = [FrameworkControllerRoleConfig()]
   experiment.config.training_service.task_roles[0].name = 'worker'
   experiment.config.training_service.task_roles[0].task_number = 1
   experiment.config.training_service.task_roles[0].command = 'python3 model.py'
   experiment.config.training_service.task_roles[0].gpuNumber = 1
   experiment.config.training_service.task_roles[0].cpuNumber = 1
   experiment.config.training_service.task_roles[0].memorySize = '4g'
   experiment.config.training_service.task_roles[0].framework_attempt_completion_policy = \
       FrameworkAttemptCompletionPolicy(min_failed_task_count=1, min_succeed_task_count=1)
If you use Azure Kubernetes Service, you should set the storage config as follows:

.. code-block:: python

   experiment.config.training_service.storage = K8sAzureStorageConfig()
   experiment.config.training_service.storage.azure_account = 'your_storage_account_name'
   experiment.config.training_service.storage.azure_share = 'your_azure_share_name'
   experiment.config.training_service.storage.key_vault_name = 'your_vault_name'
   experiment.config.training_service.storage.key_vault_key = 'your_secret_name'

If you set a `ServiceAccount <https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/>`__ in your k8s cluster,
please set ``service_account_name`` in your config:

.. code-block:: python

   experiment.config.training_service.service_account_name = 'frameworkcontroller'
The trial's config format for NNI frameworkcontroller mode is a simplified version of FrameworkController's official config;
you can refer to the `Tensorflow example of FrameworkController
<https://github.com/microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/ps/cpu/tensorflowdistributedtrainingwithcpu.yaml>`__
for a deeper understanding.

Once it's ready, run:

.. code-block:: python

   experiment.run(8080)

Notice: in frameworkcontroller mode, NNIManager starts a rest server and listens on a port which is your NNI web portal's port plus 1.
For example, if your web portal port is ``8080``, the rest server will listen on ``8081`` to receive metrics from trial jobs running in Kubernetes.
So you should enable TCP port ``8081`` in your firewall rules to allow incoming traffic.
Hybrid Training Service
=======================
Hybrid training service aggregates different types of computation resources into a virtually unified resource pool, into which trial jobs are dispatched. It lets all of a user's available computation resources jointly work on an AutoML task, and it is flexible enough to switch among different types of computation resources. For example, NNI could submit trial jobs to multiple remote machines and AML simultaneously.
Prerequisite
------------
NNI supports :doc:`./local`, :doc:`./remote`, :doc:`./openpai`, :doc:`./aml`, :doc:`./kubeflow`, and :doc:`./frameworkcontroller` for hybrid training service. Before starting an experiment using hybrid training service, users should first set up their chosen (sub) training services (e.g., remote training service) according to each training service's own document page.
.. note:: Reuse mode is disabled by default for local training service. But if you are using local training service in hybrid, :ref:`reuse mode <training-service-reuse>` is enabled by default.
Usage
-----
Unlike other training services (e.g., ``platform: remote`` in remote training service), there is no dedicated keyword for hybrid training service; users can simply list the configurations of their chosen training services under the ``trainingService`` field. Below is an example experiment configuration YAML for a hybrid training service containing a remote training service and a local training service.

.. code-block:: yaml

   # the experiment config yaml file
   ...
   trainingService:
     - platform: remote
       machineList:
         - host: 127.0.0.1 # your machine's IP address
           user: bob
           password: bob
     - platform: local
   ...

A complete example configuration file can be found in :githublink:`examples/trials/mnist-pytorch/config_hybrid.yml`.
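
The key point, that ``trainingService`` becomes a *list* of per-platform configs rather than a single config, can be illustrated with plain Python data structures (an illustrative sketch, not NNI's actual config loader):

.. code-block:: python

   # An illustrative in-memory form of the hybrid config above.
   hybrid_training_service = [
       {
           "platform": "remote",
           "machineList": [
               {"host": "127.0.0.1", "user": "bob", "password": "bob"},
           ],
       },
       {"platform": "local"},
   ]

   def platforms(training_service) -> list:
       """Normalize: a single dict means one platform, a list means hybrid."""
       if isinstance(training_service, dict):
           return [training_service["platform"]]
       return [ts["platform"] for ts in training_service]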
Kubeflow Training Service
=========================
Now NNI supports running experiments on `Kubeflow <https://github.com/kubeflow/kubeflow>`__, called kubeflow mode.
Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster,
either on-premises or `Azure Kubernetes Service (AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__,
and a Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__
is set up to connect to your Kubernetes cluster.
If you are not familiar with Kubernetes, `here <https://kubernetes.io/docs/tutorials/kubernetes-basics/>`__ is a good start.
In kubeflow mode, your trial program will run as a Kubeflow job in the Kubernetes cluster.
Prerequisite for on-premises Kubernetes Service
-----------------------------------------------
1. A **Kubernetes** cluster using Kubernetes 1.8 or later.
Follow this `guideline <https://kubernetes.io/docs/setup/>`__ to set up Kubernetes.
2. Download, set up, and deploy **Kubeflow** to your Kubernetes cluster.
Follow this `guideline <https://www.kubeflow.org/docs/started/getting-started/>`__ to setup Kubeflow.
3. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server.
By default, NNI manager will use ``~/.kube/config`` as kubeconfig file's path.
You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable.
Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__
to learn more about kubeconfig.
4. If your NNI trial job needs GPU resource, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__
to configure **Nvidia device plugin for Kubernetes**.
5. Prepare an **NFS server** and export a general purpose mount
(we recommend mapping your NFS server path with the ``root_squash`` option;
otherwise permission issues may arise when NNI copies files to NFS.
Refer to this `page <https://linux.die.net/man/5/exports>`__ to learn what the root_squash option is),
or **Azure File Storage**.
6. Install an **NFS client** on the machine where you install NNI and run nnictl to create experiments.
Run this command to install the NFSv4 client:
.. code-block:: bash
apt install nfs-common
7. Install **NNI**:
.. code-block:: bash
python -m pip install nni
Prerequisite for Azure Kubernetes Service
-----------------------------------------
1. NNI supports Kubeflow based on Azure Kubernetes Service;
follow the `guideline <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__ to set up Azure Kubernetes Service.
2. Install `Azure CLI <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__ and **kubectl**.
Use ``az login`` to set your Azure account, and connect the kubectl client to AKS;
refer to this `guideline <https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster>`__.
3. Deploy Kubeflow on Azure Kubernetes Service, follow the `guideline <https://www.kubeflow.org/docs/started/getting-started/>`__.
4. Follow the `guideline <https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal>`__
to create an Azure file storage account.
If you use Azure Kubernetes Service, NNI needs an Azure Storage Service to store code files and output files.
5. To access the Azure storage service, NNI needs the access key of the storage account,
and NNI uses the `Azure Key Vault <https://azure.microsoft.com/en-us/services/key-vault/>`__ Service to protect your private key.
Set up the Azure Key Vault Service and add a secret to Key Vault to store the access key of the Azure storage account.
Follow this `guideline <https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli>`__ to store the access key.
Design
------
.. image:: ../../../img/kubeflow_training_design.png
:target: ../../../img/kubeflow_training_design.png
:alt:
Kubeflow training service instantiates a Kubernetes rest client to interact with your K8s cluster's API server.
For each trial, we will upload all the files in your local ``trial_code_directory``
together with NNI generated files like parameter.cfg into a storage volume.
Right now we support two kinds of storage volumes:
`nfs <https://en.wikipedia.org/wiki/Network_File_System>`__
and `azure file storage <https://azure.microsoft.com/en-us/services/storage/files/>`__;
you should configure the storage volume in the experiment config.
After files are prepared, Kubeflow training service will call the K8S rest API to create Kubeflow jobs
(`tf-operator <https://github.com/kubeflow/tf-operator>`__ jobs
or `pytorch-operator <https://github.com/kubeflow/pytorch-operator>`__ jobs)
in K8S, and mount your storage volume into the job's pod.
Output files of the Kubeflow job, like stdout, stderr, trial.log or model files, will also be copied back to the storage volume.
NNI will show the storage volume's URL for each trial in the web portal, to let users browse the log files and the job's output files.
Supported operator
------------------
NNI only supports the tf-operator and pytorch-operator of Kubeflow; other operators are not tested.
Users can set the operator type in the experiment config.
The setting of tf-operator:
.. code-block:: yaml
config.training_service.operator = 'tf-operator'
The setting of pytorch-operator:
.. code-block:: yaml
config.training_service.operator = 'pytorch-operator'
If users want to use tf-operator, they can set ``ps`` and ``worker`` in the trial config.
If users want to use pytorch-operator, they can set ``master`` and ``worker`` in the trial config.
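The role names each operator accepts can be summarized in a small sketch. This is a hypothetical helper for illustration only; NNI performs this validation inside its own config classes.

```python
# Hypothetical helper illustrating which role names each Kubeflow
# operator accepts, per the paragraph above. Not NNI's actual code.
SUPPORTED_ROLES = {
    "tf-operator": {"ps", "worker"},
    "pytorch-operator": {"master", "worker"},
}

def check_roles(operator, roles):
    """Raise ValueError if a role is not valid for the given operator."""
    allowed = SUPPORTED_ROLES[operator]
    unknown = set(roles) - allowed
    if unknown:
        raise ValueError(f"{operator} does not support roles: {sorted(unknown)}")
    return True

# tf-operator accepts ps/worker; pytorch-operator accepts master/worker.
check_roles("tf-operator", ["ps", "worker"])
check_roles("pytorch-operator", ["master", "worker"])
```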
Supported storage type
----------------------
NNI supports NFS and Azure Storage to store the code and output files;
users can set the storage type in the config file along with the corresponding settings.
The settings for NFS storage are as follows:
.. code-block:: python
config.training_service.storage = K8sNfsConfig(
server = '10.20.30.40', # your NFS server IP
path = '/mnt/nfs/nni' # your NFS server export path
)
If you use Azure storage, you should set ``storage`` in your config as follows:
.. code-block:: python
config.training_service.storage = K8sAzureStorageConfig(
azure_account = your_azure_account_name,
azure_share = your_azure_share_name,
key_vault_name = your_vault_name,
key_vault_key = your_secret_name
)
Run an experiment
-----------------
Use :doc:`PyTorch quickstart </tutorials/hpo_quickstart_pytorch/main>` as an example.
This is a PyTorch job which uses the pytorch-operator of Kubeflow.
The experiment config looks like:
.. code-block:: python
from nni.experiment import Experiment, K8sNfsConfig, KubeflowRoleConfig
experiment = Experiment('kubeflow')
experiment.config.search_space = search_space
experiment.config.tuner.name = 'TPE'
experiment.config.tuner.class_args['optimize_mode'] = 'maximize'
experiment.config.max_trial_number = 10
experiment.config.trial_concurrency = 2
experiment.config.training_service.operator = 'pytorch-operator'
experiment.config.training_service.api_version = 'v1alpha2'
experiment.config.training_service.storage = K8sNfsConfig()
experiment.config.training_service.storage.server = '10.20.30.40'
experiment.config.training_service.storage.path = '/mnt/nfs/nni'
experiment.config.training_service.worker = KubeflowRoleConfig()
experiment.config.training_service.worker.replicas = 2
experiment.config.training_service.worker.command = 'python3 model.py'
experiment.config.training_service.worker.gpu_number = 1
experiment.config.training_service.worker.cpu_number = 1
experiment.config.training_service.worker.memory_size = '4g'
experiment.config.training_service.worker.code_directory = '.'
experiment.config.training_service.worker.docker_image = 'msranni/nni:latest' # default
experiment.config.training_service.master = KubeflowRoleConfig()
experiment.config.training_service.master.replicas = 1
experiment.config.training_service.master.command = 'python3 model.py'
experiment.config.training_service.master.gpu_number = 0
experiment.config.training_service.master.cpu_number = 1
experiment.config.training_service.master.memory_size = '4g'
experiment.config.training_service.master.code_directory = '.'
Once it's ready, run:
.. code-block:: python
experiment.run(8080)
NNI will create a Kubeflow pytorchjob for each trial;
the job name format is something like ``nni_exp_{experiment_id}_trial_{trial_id}``.
You can see the Kubeflow jobs created by NNI in your Kubernetes dashboard.
Notice: In kubeflow mode, NNIManager will start a rest server and listen on a port which is your NNI web portal's port plus 1.
For example, if your web portal port is ``8080``, the rest server will listen on ``8081``
to receive metrics from trial jobs running in Kubernetes.
So you should open TCP port ``8081`` in your firewall rule to allow incoming traffic.
Once a trial job is completed, you can go to NNI web portal's overview page (like http://localhost:8080/oview)
to check trials' information.
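The job-name format and the "web portal port plus 1" rule above can be sketched with two tiny helpers. These are illustrative only (the IDs are made-up placeholders), not NNI's internal code.

```python
# Illustrative sketch of the naming and port conventions described
# above; hypothetical helpers, not part of NNI.
def kubeflow_job_name(experiment_id, trial_id):
    """Job name format used for trials submitted to Kubeflow."""
    return f"nni_exp_{experiment_id}_trial_{trial_id}"

def rest_server_port(web_portal_port):
    """NNI's rest server listens on the web portal port plus 1."""
    return web_portal_port + 1

print(kubeflow_job_name("GgGsOIcq", "Ah9bO"))  # nni_exp_GgGsOIcq_trial_Ah9bO
print(rest_server_port(8080))                  # 8081
```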
Local Training Service
======================
With local training service, the whole experiment (e.g., tuning algorithms, trials) runs on a single machine, i.e., the user's dev machine. The generated trials run on this machine following the ``trialConcurrency`` set in the configuration yaml file. If GPUs are used by trials, local training service will allocate the required number of GPUs for each trial, like a resource scheduler.
.. note:: Currently, :ref:`reuse mode <training-service-reuse>` remains disabled by default in local training service.
Prerequisite
------------
We recommend going through the quickstart first, as this page only explains the configuration of local training service, which is one part of the experiment configuration yaml file.
Usage
-----
.. code-block:: yaml
# the experiment config yaml file
...
trainingService:
platform: local
useActiveGpu: false # optional
...
There are other supported fields for local training service, such as ``maxTrialNumberPerGpu`` and ``gpuIndices``, for concurrently running multiple trials on one GPU, or running trials on a subset of the GPUs on your machine. Please refer to :ref:`reference-local-config-label` in the reference for detailed usage.
.. note::
Users should set **useActiveGpu** to ``true`` if the local machine has GPUs and your trials use them, but the generated trials keep waiting. This is usually the case when you are using a graphical OS like Windows 10 or Ubuntu desktop.
Next we explain how local training service works with different configurations of ``trialGpuNumber`` and ``trialConcurrency``. Suppose the user's local machine has 4 GPUs. With configuration ``trialGpuNumber: 1`` and ``trialConcurrency: 4``, 4 trials will run on this machine concurrently, each using 1 GPU. With ``trialGpuNumber: 2`` and ``trialConcurrency: 2``, 2 trials will run concurrently, each using 2 GPUs. Which GPU is allocated to which trial is decided by local training service; users do not need to worry about it. An example configuration is shown below.
.. code-block:: yaml
...
trialGpuNumber: 1
trialConcurrency: 4
...
trainingService:
platform: local
useActiveGpu: false
A complete example configuration file can be found in :githublink:`examples/trials/mnist-pytorch/config.yml`.
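The allocation arithmetic described above can be sketched as a simple allocator. This is a minimal illustration of the math, assuming a machine with a fixed number of GPUs; it is not NNI's actual GPU scheduler.

```python
# Minimal sketch of the GPU allocation arithmetic described above;
# not NNI's actual scheduler.
def allocate_gpus(total_gpus, trial_gpu_number, trial_concurrency):
    """Return one list of GPU indices per concurrent trial."""
    if trial_gpu_number * trial_concurrency > total_gpus:
        raise ValueError("not enough GPUs for the requested concurrency")
    gpus = list(range(total_gpus))
    return [gpus[i * trial_gpu_number:(i + 1) * trial_gpu_number]
            for i in range(trial_concurrency)]

print(allocate_gpus(4, 1, 4))  # [[0], [1], [2], [3]]
print(allocate_gpus(4, 2, 2))  # [[0, 1], [2, 3]]
```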
OpenPAI Training Service
========================
NNI supports running an experiment on `OpenPAI <https://github.com/Microsoft/pai>`__. OpenPAI manages computing resources and is optimized for deep learning. Through docker technology, the computing hardware is decoupled from the software, so that it's easy to run distributed jobs, switch between different deep learning frameworks, or run other kinds of jobs in consistent environments.
Prerequisite
------------
1. Before starting to use OpenPAI training service, you should have an account to access an `OpenPAI <https://github.com/Microsoft/pai>`__ cluster. See `here <https://github.com/Microsoft/pai>`__ if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. Please note that, on OpenPAI, your trial program will run in Docker containers.
2. Get a token. Open the web portal of OpenPAI, and click the ``My profile`` button at the top-right corner.
.. image:: ../../../img/pai_profile.jpg
:scale: 80%
Click the ``copy`` button on the page to copy a JWT token.
.. image:: ../../../img/pai_token.jpg
:scale: 67%
3. Mount the NFS storage to your local machine. If you don't know where to find the NFS storage, please click the ``Submit job`` button in the web portal.
.. image:: ../../../img/pai_job_submission_page.jpg
:scale: 50%
Find the data management region on the job submission page.
.. image:: ../../../img/pai_data_management_page.jpg
:scale: 33%
``Preview container paths`` shows the NFS host and path that OpenPAI provides. You need to mount the corresponding host and path to your local machine first; then NNI can use OpenPAI's NFS storage to upload data/code to, or download output from, the OpenPAI cluster. To mount the storage, use the ``mount`` command, for example:
.. code-block:: bash
sudo mount -t nfs4 gcr-openpai-infra02:/pai/data /local/mnt
Then the ``/data`` folder in the container will be mounted to the ``/local/mnt`` folder on your local machine. Please keep in mind that ``localStorageMountPoint`` should be set to ``/local/mnt`` in this case.
4. Get OpenPAI's storage config name and ``containerStorageMountPoint``. They can also be found in the data management region of the job submission page. Find the ``Name`` and ``Path`` of your ``Team share storage``; they should be put into ``storageConfigName`` and ``containerStorageMountPoint``. For example,
.. code-block:: yaml
storageConfigName: confignfs-data
containerStorageMountPoint: /mnt/confignfs-data
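Since the local and container mount points refer to the same underlying NFS share, a file written under one mount point appears under the other at the corresponding relative path. A hypothetical helper (not part of NNI, which handles this mapping internally) can sketch the correspondence:

```python
import posixpath

# Hypothetical helper: map a path under the container mount point to the
# corresponding path under the local mount point. NNI does this mapping
# internally; the mount points below are the example values from above.
def container_to_local(path, container_mount="/mnt/confignfs-data",
                       local_mount="/local/mnt"):
    rel = posixpath.relpath(path, container_mount)
    return posixpath.join(local_mount, rel)

print(container_to_local("/mnt/confignfs-data/nni/exp1/code"))
# /local/mnt/nni/exp1/code
```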
Usage
-----
We show an example configuration here with YAML (Python configuration should be similar).
.. code-block:: yaml
trialGpuNumber: 0
trialConcurrency: 1
...
trainingService:
platform: openpai
host: http://123.123.123.123
username: ${your user name}
token: ${your token}
dockerImage: msranni/nni
trialCpuNumber: 1
trialMemorySize: 8GB
storageConfigName: confignfs-data
localStorageMountPoint: /local/mnt
containerStorageMountPoint: /mnt/confignfs-data
Once the configuration is completed, run nnictl or use Python to launch the experiment, and NNI will start to spawn trials on your specified OpenPAI platform.
The job name format is something like ``nni_exp_{experiment_id}_trial_{trial_id}``. You can see the jobs created by NNI on the OpenPAI cluster's web portal, like:
.. image:: ../../../img/nni_pai_joblist.jpg
.. note:: For OpenPAI training service, NNI will start an additional rest server and listen on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is ``8080``, the rest server will listen on ``8081`` to receive metrics from trial jobs running on OpenPAI. So you should open TCP port ``8081`` in your firewall rule to allow incoming traffic.
Once a trial job is completed, you can go to NNI WebUI's overview page (like ``http://localhost:8080/oview``) to check trials' information. For example, you can expand a trial's information in the trial list view and click the logPath link like:
.. image:: ../../../img/nni_webui_joblist.png
:scale: 30%
Configuration References
------------------------
Compared with :doc:`local` and :doc:`remote`, OpenPAI training service supports the following additional configurations.
.. list-table::
:header-rows: 1
:widths: auto
* - Field name
- Description
* - username
- Required field. User name of OpenPAI platform.
* - token
- Required field. Authentication key of OpenPAI platform.
* - host
- Required field. The host of OpenPAI platform. It's PAI's job submission page URI, like ``10.10.5.1``. The default protocol in NNI is HTTPS. If your PAI's cluster has disabled https, please use the URI in ``http://10.10.5.1`` format.
* - trialCpuNumber
- Optional field. Should be positive number based on your trial program's CPU requirement. If it's not set in trial configuration, it should be set in the config specified in ``openpaiConfig`` or ``openpaiConfigFile`` field.
* - trialMemorySize
- Optional field. Should be in format like ``2gb`` based on your trial program's memory requirement. If it's not set in trial configuration, it should be set in the config specified in ``openpaiConfig`` or ``openpaiConfigFile`` field.
* - dockerImage
- Optional field. In OpenPAI training service, your trial program will be scheduled by OpenPAI to run in `Docker container <https://www.docker.com/>`__. This key is used to specify the Docker image used to create the container in which your trial will run. Upon every NNI release, we build `a docker image <https://hub.docker.com/r/msranni/nni>`__ with `this Dockerfile <https://hub.docker.com/r/msranni/nni>`__. You can either use this image directly in your config file, or build your own image. If it's not set in trial configuration, it should be set in the config specified in ``openpaiConfig`` or ``openpaiConfigFile`` field.
* - virtualCluster
- Optional field. Set the virtualCluster of OpenPAI. If omitted, the job will run on ``default`` virtual cluster.
* - localStorageMountPoint
- Required field. Set the mount path in the machine you start the experiment.
* - containerStorageMountPoint
- Optional field. Set the mount path in your container used in OpenPAI.
* - storageConfigName
- Optional field. Set the storage name used in OpenPAI. If it's not set in trial configuration, it should be set in the config specified in ``openpaiConfig`` or ``openpaiConfigFile`` field.
* - openpaiConfigFile
- Optional field. Set the file path of OpenPAI job configuration, the file is in yaml format. If users set ``openpaiConfigFile`` in NNI's configuration file, there's no need to specify the fields ``storageConfigName``, ``virtualCluster``, ``dockerImage``, ``trialCpuNumber``, ``trialGpuNumber``, ``trialMemorySize`` in configuration. These fields will use the values from the config file specified by ``openpaiConfigFile``.
* - openpaiConfig
- Optional field. Similar to ``openpaiConfigFile``, but instead of referencing an external file, using this field you embed the content into NNI's config YAML.
.. note::
#. The job name in OpenPAI's configuration file will be replaced by a new job name created by NNI; the name format is ``nni_exp_{this.experimentId}_trial_{trialJobId}``.
#. If users set multiple taskRoles in OpenPAI's configuration file, NNI will wrap all of these taskRoles and start multiple tasks in one trial job. Users should ensure that only one taskRole reports metrics to NNI; otherwise there might be conflict errors.
Data management
---------------
Before using NNI to start your experiment, users should set the corresponding mount data path on the nniManager machine. OpenPAI has its own storage (NFS, AzureBlob, ...), and the storage used in OpenPAI will be mounted to the container when it starts a job. Users should set the OpenPAI storage type via the ``paiStorageConfigName`` field to choose a storage in OpenPAI. Then users should mount the storage to their nniManager machine and set the ``nniManagerNFSMountPath`` field in the configuration file. NNI will generate bash files, copy the data in ``codeDir`` to the ``nniManagerNFSMountPath`` folder, and then start a trial job. The data in ``nniManagerNFSMountPath`` will be synced to the OpenPAI storage and mounted into OpenPAI's container. The data path in the container is set by ``containerNFSMountPath``; NNI will enter this folder first, and then run scripts to start the trial job.
Version check
-------------
NNI supports a version check feature since version 0.6. It is a policy to ensure that the version of NNIManager is consistent with trialKeeper, and to avoid errors caused by version incompatibility.
Check policy:
#. NNIManager before v0.6 could run any version of trialKeeper; trialKeeper supports backward compatibility.
#. Since version 0.6, the NNIManager version should be the same as the trialKeeper version. For example, if the NNIManager version is 0.6, the trialKeeper version should be 0.6 too.
#. Note that the version check feature only checks the first two digits of the version. For example, NNIManager v0.6.1 could use trialKeeper v0.6 or trialKeeper v0.6.2, but could not use trialKeeper v0.5.1 or trialKeeper v0.7.
If you cannot run your experiment and want to know whether it is caused by the version check, you can check the WebUI, where there will be an error message about the version check.
.. image:: ../../../img/webui-img/experimentError.png
:scale: 80%
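The "first two digits" rule above can be sketched as a small comparison function. This is illustrative only, not NNI's actual check.

```python
# Sketch of the "first two version fields" compatibility rule described
# above (illustrative, not NNI's actual implementation).
def versions_compatible(manager_version, trialkeeper_version):
    """Compare only the first two dotted fields, ignoring a leading 'v'."""
    first_two = lambda v: v.lstrip("v").split(".")[:2]
    return first_two(manager_version) == first_two(trialkeeper_version)

print(versions_compatible("v0.6.1", "v0.6"))   # True
print(versions_compatible("v0.6", "v0.5.1"))   # False
```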
Overview
========
NNI supports the training services listed below. Users can go through each page to learn how to configure the corresponding training service. NNI is highly extensible by design; users can customize a new training service for their special resource, platform, or needs.
.. list-table::
:header-rows: 1
* - Training Service
- Description
* - :doc:`Local <local>`
- The whole experiment runs on your dev machine (i.e., a single local machine)
* - :doc:`Remote <remote>`
- The trials are dispatched to your configured SSH servers
* - :doc:`OpenPAI <openpai>`
- Running trials on OpenPAI, a DNN model training platform based on Kubernetes
* - :doc:`Kubeflow <kubeflow>`
- Running trials with Kubeflow, a DNN model training framework based on Kubernetes
* - :doc:`AdaptDL <adaptdl>`
- Running trials on AdaptDL, an elastic DNN model training platform
* - :doc:`FrameworkController <frameworkcontroller>`
- Running trials with FrameworkController, a DNN model training framework on Kubernetes
* - :doc:`AML <aml>`
- Running trials on Azure Machine Learning (AML) cloud service
* - :doc:`PAI-DLC <paidlc>`
- Running trials on PAI-DLC, which is deep learning containers based on Alibaba ACK
* - :doc:`Hybrid <hybrid>`
- Support jointly using multiple above training services
.. _training-service-reuse:
Training Service Under Reuse Mode
---------------------------------
Since NNI v2.0, there are two sets of training service implementations in NNI. The new one is called *reuse mode*. When reuse mode is enabled, a cluster, such as a remote machine or a computer instance on AML, will launch a long-running environment, so that NNI will submit trials to these environments iteratively, which saves the time to create new jobs. For instance, using OpenPAI training platform under reuse mode can avoid the overhead of pulling docker images, creating containers, and downloading data repeatedly.
.. note:: In the reuse mode, users need to make sure each trial can run independently in the same job (e.g., avoid loading checkpoints from previous trials).
PAI-DLC Training Service
========================
NNI supports running an experiment on `PAI-DSW <https://help.aliyun.com/document_detail/194831.html>`__ and submitting trials to `PAI-DLC <https://help.aliyun.com/document_detail/165137.html>`__, which provides deep learning containers based on Alibaba ACK.
The PAI-DSW server plays the role of submitting jobs, while PAI-DLC is where the training jobs run.
Prerequisite
------------
Step 1. Install NNI, follow the :doc:`install guide </installation>`.
Step 2. Create a PAI-DSW server following this `link <https://help.aliyun.com/document_detail/163684.html?section-2cw-lsi-es9#title-ji9-re9-88x>`__. Note that since the trials will run on PAI-DLC, submission won't require many resources, and a CPU-only PAI-DSW server may be enough.
Step 3. Open PAI-DLC `here <https://pai-dlc.console.aliyun.com/#/guide>`__, select the same region as your PAI-DSW server. Move to ``dataset configuration`` and mount the same NAS disk as the PAI-DSW server does. (Note currently only PAI-DLC public-cluster is supported.)
Step 4. Open your PAI-DSW server command line, download and install PAI-DLC python SDK to submit DLC tasks, refer to `this link <https://help.aliyun.com/document_detail/203290.html>`__. Skip this step if SDK is already installed.
.. code-block:: bash
wget https://sdk-portal-cluster-prod.oss-cn-zhangjiakou.aliyuncs.com/downloads/u-3536038a-3de7-4f2e-9379-0cb309d29355-python-pai-dlc.zip
unzip u-3536038a-3de7-4f2e-9379-0cb309d29355-python-pai-dlc.zip
pip install ./pai-dlc-20201203 # pai-dlc-20201203 refer to unzipped sdk file name, replace it accordingly.
Usage
-----
Use ``examples/trials/mnist-pytorch`` as an example. The NNI config YAML file's content is like:
.. code-block:: yaml
# working directory on DSW, please provide FULL path
experimentWorkingDirectory: /home/admin/workspace/{your_working_dir}
searchSpaceFile: search_space.json
# the command on trial runner(or, DLC container), be aware of data_dir
trialCommand: python mnist.py --data_dir /root/data/{your_data_dir}
trialConcurrency: 1 # NOTE: please provide number <= 3 due to DLC system limit.
maxTrialNumber: 10
tuner:
name: TPE
classArgs:
optimize_mode: maximize
# ref: https://help.aliyun.com/document_detail/203290.html?spm=a2c4g.11186623.6.727.6f9b5db6bzJh4x
trainingService:
platform: dlc
type: Worker
image: registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/pytorch-training:1.6.0-gpu-py37-cu101-ubuntu18.04
jobType: PyTorchJob # choices: [TFJob, PyTorchJob]
podCount: 1
ecsSpec: ecs.c6.large
region: cn-hangzhou
workspaceId: ${your_workspace_id}
accessKeyId: ${your_ak_id}
accessKeySecret: ${your_ak_key}
nasDataSourceId: ${your_nas_data_source_id} # NAS datasource ID, e.g., datat56by9n1xt0a
ossDataSourceId: ${your_oss_data_source_id} # OSS datasource ID, in case your data is on oss
localStorageMountPoint: /home/admin/workspace/ # default NAS path on DSW
containerStorageMountPoint: /root/data/ # default NAS path on DLC container, change it according to your setting
Note: you should set ``platform: dlc`` in the NNI config YAML file if you want to start the experiment in dlc mode.
Compared with :doc:`local`, the training service configuration in dlc mode has additional keys like ``type/image/jobType/podCount/ecsSpec/region/nasDataSourceId/accessKeyId/accessKeySecret``; for a detailed explanation refer to this `link <https://help.aliyun.com/document_detail/203111.html#h2-url-3>`__.
Also, as dlc mode requires DSW/DLC to mount the same NAS disk to share information, there are two extra keys related to this: ``localStorageMountPoint`` and ``containerStorageMountPoint``.
Run the following commands to start the example experiment:
.. code-block:: bash
git clone -b ${NNI_VERSION} https://github.com/microsoft/nni
cd nni/examples/trials/mnist-pytorch
# modify config_dlc.yml ...
nnictl create --config config_dlc.yml
Replace ``${NNI_VERSION}`` with a released version name or branch name, e.g., ``v2.3``.
Monitor your job
^^^^^^^^^^^^^^^^
To monitor your job on DLC, you need to visit `DLC <https://pai-dlc.console.aliyun.com/#/jobs>`__ to check job status.
Remote Training Service
=======================
NNI can run one experiment on multiple remote machines through SSH, called ``remote`` mode. It's like a lightweight training platform. In this mode, NNI can be started from your computer, and dispatch trials to remote machines in parallel.
The supported OSes of remote machines are ``Linux``\ , ``Windows 10``\ , and ``Windows Server 2019``.
Prerequisite
------------
1. Make sure the default environment of the remote machines meets the requirements of your trial code. If the default environment does not meet the requirements, the setup script can be added to the ``command`` field of the NNI config.
2. Make sure remote machines can be accessed through SSH from the machine which runs ``nnictl`` command. It supports both password and key authentication of SSH. For advanced usage, please refer to :ref:`reference-remote-config-label` in reference for detailed usage.
3. Make sure the NNI version on each machine is consistent. Follow the install guide :doc:`here </installation>` to install NNI.
4. Make sure the command of Trial is compatible with remote OSes, if you want to use remote Linux and Windows together. For example, the default python 3.x executable is called ``python3`` on Linux, and ``python`` on Windows.
In addition, there are several steps for Windows server.
1. Install and start ``OpenSSH Server``.
1) Open ``Settings`` app on Windows.
2) Click ``Apps``\ , then click ``Optional features``.
3) Click ``Add a feature``\ , search and select ``OpenSSH Server``\ , and then click ``Install``.
4) Once it's installed, run the commands below to start the service and set it to start automatically.
.. code-block:: bat
sc config sshd start=auto
net start sshd
2. Make sure remote account is administrator, so that it can stop running trials.
3. Make sure there is no welcome message beyond the default, since extra messages cause the ssh2 library in Node.js to fail. For example, if you're using a Data Science VM on Azure, you need to remove the extra echo commands in ``C:\dsvm\tools\setup\welcome.bat``.
Output like the below is OK when opening a new command window.
.. code-block:: text
Microsoft Windows [Version 10.0.17763.1192]
(c) 2018 Microsoft Corporation. All rights reserved.
(py37_default) C:\Users\AzureUser>
Usage
-----
Use ``examples/trials/mnist-pytorch`` as the example. Suppose there are two machines, which can be logged in with username and password or key authentication of SSH. Here is a template configuration specification.
.. code-block:: yaml
searchSpaceFile: search_space.json
trialCommand: python3 mnist.py
trialGpuNumber: 0
trialConcurrency: 4
maxTrialNumber: 20
tuner:
name: TPE
classArgs:
optimize_mode: maximize
trainingService:
platform: remote
machineList:
- host: 192.0.2.1
user: alice
ssh_key_file: ~/.ssh/id_rsa
- host: 192.0.2.2
port: 10022
user: bob
password: bob123
The example configuration is saved in ``examples/trials/mnist-pytorch/config_remote.yml``.
You can run below command on Windows, Linux, or macOS to spawn trials on remote Linux machines:
.. code-block:: bash
nnictl create --config examples/trials/mnist-pytorch/config_remote.yml
.. _nniignore:
.. Note:: If you are planning to use remote machines or clusters as your training service, to avoid too much pressure on network, NNI limits the number of files to 2000 and total size to 300MB. If your trial code directory contains too many files, you can choose which files and subfolders should be excluded by adding a ``.nniignore`` file that works like a ``.gitignore`` file. For more details on how to write this file, see the `git documentation <https://git-scm.com/docs/gitignore#_pattern_format>`__.
*Example:* :githublink:`config_detailed.yml <examples/trials/mnist-pytorch/config_detailed.yml>` and :githublink:`.nniignore <examples/trials/mnist-pytorch/.nniignore>`
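The 2000-file / 300 MB limits mentioned in the note above can be checked before launching. Below is an illustrative pre-flight check using only the standard library; it is not part of NNI, and the limits are the ones stated in this document.

```python
from pathlib import Path
import tempfile

# Illustrative pre-flight check against the limits described above
# (2000 files, 300 MB total in the trial code directory); not NNI code.
def check_code_dir(code_dir, max_files=2000, max_bytes=300 * 1024 * 1024):
    """Return True if the directory is within both limits."""
    files = [p for p in Path(code_dir).rglob("*") if p.is_file()]
    total = sum(p.stat().st_size for p in files)
    return len(files) <= max_files and total <= max_bytes

# Quick demonstration on a throwaway directory with one small file:
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "model.py").write_text("print('hello')\n")
    print(check_code_dir(d))  # True
```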
More features
-------------
Configure python environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
By default, commands and scripts are executed in the default environment of the remote machine. If there are multiple python virtual environments on your remote machine and you want to run experiments in a specific environment, use **pythonPath** to specify a python environment on your remote machine.
For example, with anaconda you can specify:
.. code-block:: yaml
pythonPath: /home/bob/.conda/envs/ENV-NAME/bin
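Conceptually, the effect is that the specified directory is searched ahead of the existing ``PATH``, so that environment's interpreter is found first. The sketch below illustrates the idea only (it is an assumption about the mechanism, not NNI's exact remote-setup code):

```python
import os

# Illustrative sketch: placing pythonPath ahead of the existing PATH so
# the chosen environment's interpreter wins. Not NNI's actual code.
def with_python_path(python_path, path):
    return python_path + os.pathsep + path

env_path = with_python_path("/home/bob/.conda/envs/ENV-NAME/bin", "/usr/bin")
print(env_path.split(os.pathsep)[0])  # /home/bob/.conda/envs/ENV-NAME/bin
```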
Configure shared storage
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Remote training service supports shared storage, which lets you use your own storage with NNI. Follow the guide :doc:`here <./shared_storage>` to learn how to use shared storage.
Monitor via TensorBoard
^^^^^^^^^^^^^^^^^^^^^^^
Remote training service supports trial visualization via TensorBoard. Follow the guide :doc:`/experiment/web_portal/tensorboard` to learn how to use TensorBoard.
How to Use Shared Storage
=========================
If you want to use your own storage when using NNI, shared storage can satisfy the need.
Instead of using training service native storage, shared storage brings you more convenience.
All the information generated by the experiment is stored under the ``/nni`` folder in your shared storage.
All the output produced by a trial is located under the ``/nni/{EXPERIMENT_ID}/trials/{TRIAL_ID}/nnioutput`` folder in your shared storage.
This saves you from searching for experiment-related information in various places.
Remember that your trial working directory is ``/nni/{EXPERIMENT_ID}/trials/{TRIAL_ID}``, so if you upload your data to this shared storage, you can open it like a local file in your trial code without downloading it.
We will develop more practical features in the future based on shared storage. The config reference can be found :ref:`here <reference-sharedstorage-config-label>`.
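The directory layout described above can be sketched with two small helpers. The experiment and trial IDs below are placeholders; the paths follow the layout stated in this document.

```python
import posixpath

# Sketch of the shared-storage layout described above; the IDs are
# placeholders, and the layout is as documented (/nni/{EXP}/trials/{TRIAL}).
def trial_working_dir(experiment_id, trial_id):
    return posixpath.join("/nni", experiment_id, "trials", trial_id)

def trial_output_dir(experiment_id, trial_id):
    return posixpath.join(trial_working_dir(experiment_id, trial_id), "nnioutput")

print(trial_output_dir("EXP123", "TRIAL456"))
# /nni/EXP123/trials/TRIAL456/nnioutput
```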
.. note::
Shared storage is currently in the experimental stage. We suggest using AzureBlob under Ubuntu/CentOS/RHEL, and NFS under Ubuntu/CentOS/RHEL/Fedora/Debian for remote.
Also make sure your local machine can mount NFS or fuse AzureBlob, and that the machine used by the training service has ``sudo`` permission without password. We only support shared storage with training services under reuse mode for now.
.. note::
What is the difference between training service native storage and shared storage? Training service native storage is usually provided by the specific training service,
e.g., the local storage on the remote machine in remote mode, or the provided storage in openpai mode. These storages might not be easy to use; for example, users may have to upload datasets to all remote machines to train the model.
In these cases, shared storage can be automatically mounted to the machines in the training platform. Users can directly save and load data from the shared storage, and all the data/logs used or generated in one experiment can be placed in the same place.
After the experiment is finished, the shared storage will be automatically unmounted from the training platform.
Example
-------
If you want to use AzureBlob, add the following to your config. For the full config file, see :githublink:`mnist-sharedstorage/config_azureblob.yml <examples/trials/mnist-sharedstorage/config_azureblob.yml>`.
.. code-block:: yaml
sharedStorage:
storageType: AzureBlob
# set localMountPoint to an absolute path outside the code directory,
# because nni will copy user code to localMountPoint
localMountPoint: ${your/local/mount/point}
# remoteMountPoint is the mount point on the training service machine; it can be an absolute or a relative path
# make sure you have password-less `sudo` permission on the training service machine
remoteMountPoint: ${your/remote/mount/point}
storageAccountName: ${replace_to_your_storageAccountName}
storageAccountKey: ${replace_to_your_storageAccountKey}
containerName: ${replace_to_your_containerName}
# usermount means you have already mounted this storage at localMountPoint
# nnimount means nni will try to mount this storage at localMountPoint
# nomount means the storage will not be mounted on the local machine; partial storages will be supported in the future
localMounted: nnimount
You can find ``storageAccountName``, ``storageAccountKey`` and ``containerName`` in the Azure storage account portal.
.. image:: ../../../img/azure_storage.png
:target: ../../../img/azure_storage.png
:alt:
If you want to use NFS, add the following to your config. For the full config file, see :githublink:`mnist-sharedstorage/config_nfs.yml <examples/trials/mnist-sharedstorage/config_nfs.yml>`.
.. code-block:: yaml
sharedStorage:
storageType: NFS
localMountPoint: ${your/local/mount/point}
remoteMountPoint: ${your/remote/mount/point}
nfsServer: ${nfs-server-ip}
exportedDirectory: ${nfs/exported/directory}
# usermount means you have already mounted this storage at localMountPoint
# nnimount means nni will try to mount this storage at localMountPoint
# nomount means the storage will not be mounted on the local machine; partial storages will be supported in the future
localMounted: nnimount
Training Service
================
.. toctree::
:hidden:
Overview <overview>
Local <local>
Remote <remote>
OpenPAI <openpai>
Kubeflow <kubeflow>
AdaptDL <adaptdl>
FrameworkController <frameworkcontroller>
AML <aml>
PAI-DLC <paidlc>
Hybrid <hybrid>
Customize a Training Service <customize>
Shared Storage <shared_storage>
Visualize Trial with TensorBoard
================================
You can launch a TensorBoard process across one or multiple trials within the web portal since NNI v2.2. For now, this feature supports local training service and reuse-mode training services with shared storage; more scenarios will be supported in later NNI versions.
Preparation
-----------
Make sure TensorBoard is installed in your environment. If you have never used TensorBoard, here are getting-started tutorials for your reference: `tensorboard with tensorflow <https://www.tensorflow.org/tensorboard/get_started>`__, `tensorboard with pytorch <https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html>`__.
Use WebUI to Launch TensorBoard
--------------------------------
Save Logs
^^^^^^^^^
NNI will automatically fetch the ``tensorboard`` subfolder under the trial's output folder as the TensorBoard logdir. So in the trial's source code, you need to save the TensorBoard logs under ``NNI_OUTPUT_DIR/tensorboard``. This log path can be joined as:
.. code-block:: python
log_dir = os.path.join(os.environ["NNI_OUTPUT_DIR"], 'tensorboard')
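For example, a trial can build this log path with a fallback for running outside NNI (the fallback folder is an assumption of this sketch, not an NNI convention):

```python
import os

# NNI sets NNI_OUTPUT_DIR for each trial; fall back to a local folder
# so the script also runs standalone outside NNI.
output_dir = os.environ.get("NNI_OUTPUT_DIR", "./output")
log_dir = os.path.join(output_dir, "tensorboard")
os.makedirs(log_dir, exist_ok=True)
# pass log_dir to your writer, e.g. torch.utils.tensorboard.SummaryWriter(log_dir)
```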
Launch Tensorboard
^^^^^^^^^^^^^^^^^^
* As with the compare feature, first select the trials you want to combine into one TensorBoard, then click the ``Tensorboard`` button.
.. image:: ../../../img/Tensorboard_1.png
:target: ../../../img/Tensorboard_1.png
:alt:
* After clicking the ``OK`` button in the pop-up box, you will jump to the TensorBoard portal.
.. image:: ../../../img/Tensorboard_2.png
:target: ../../../img/Tensorboard_2.png
:alt:
* You can see the ``SequenceID-TrialID`` on the tensorboard portal.
.. image:: ../../../img/Tensorboard_3.png
:target: ../../../img/Tensorboard_3.png
:alt:
Stop All
^^^^^^^^
If you want to reopen a portal you have already launched, click its tensorboard id. If you don't need TensorBoard anymore, click the ``Stop all tensorboard`` button.
.. image:: ../../../img/Tensorboard_4.png
:target: ../../../img/Tensorboard_4.png
:alt:
Web Portal
==========
.. toctree::
:hidden:
Experiment Web Portal <web_portal>
Visualize with TensorBoard <tensorboard>
Web Portal
==========
The web portal lets users conveniently visualize their NNI experiments: tuning and training progress, detailed metrics, and error logs. It also lets users control their experiments and trials, such as updating an experiment's concurrency or duration, and rerunning trials.
.. image:: ../../../static/img/webui.gif
:width: 100%
Q&A
---
There are many trials in the detail table but the ``Default metric`` chart is empty
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. note::
First, you should know that the ``Default metric`` and ``Hyper parameter`` charts only show succeeded trials.
What should you do when a chart looks wrong, such as ``Default metric`` or ``Hyper parameter``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Download the experiment results (``experiment config``, ``trial message`` and ``intermediate metrics``) from ``Experiment summary`` and then upload them in your issue.
.. image:: ../../../img/webui-img/summary.png
:width: 80%
What should you do when your experiment has an error
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Click the icon to the right of ``experiment status`` and take a screenshot of the error message.
* Then click ``learn more`` to download the ``nni-manager`` and ``dispatcher`` log files.
* Please file an issue via ``Feedback`` in the ``About`` menu and attach the information above.
.. image:: ../../../img/webui-img/experimentError.png
:width: 80%
What should you do when your trial fails
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* ``Customized trial`` can be used here. Just submit the same parameters to the experiment to rerun the trial.
.. image:: ../../../img/webui-img/detail/customizedTrialButton.png
:width: 25%
.. image:: ../../../img/webui-img/detail/customizedTrial.png
:width: 40%
* The ``Log`` module will help you find the reason for the error. In local mode there are three buttons: ``View trial log``, ``View trial error`` and ``View trial stdout``. If you run on the OpenPAI or Kubeflow platform, you can see the trial stdout and the NFS log.
If you have any question, you can tell us in the issue.
**local mode:**
.. image:: ../../../img/webui-img/detail/log-local.png
:width: 100%
**OpenPAI, Kubeflow and other modes:**
.. image:: ../../../img/webui-img/detail-pai.png
:width: 100%
How to use dict intermediate result
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
`The discussion <https://github.com/microsoft/nni/discussions/4289>`_ could help you.
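As a minimal sketch (the metric values are made up), a dict-valued result needs a ``default`` key, which the web portal uses for its charts; the ``nni.report_*`` calls are shown commented out since they only work inside a running trial:

```python
# A dict-valued result: 'default' drives the Default metric chart,
# the other keys show up as extra table columns.
result = {"default": 0.93, "accuracy": 0.93, "loss": 0.21}
assert "default" in result  # required key for the charts

# Inside a trial you would report it with:
# nni.report_intermediate_result(result)   # per epoch/step
# nni.report_final_result(result)          # once at the end
```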
.. _exp-manage-webportal:
Experiments management
----------------------
The experiments management page lets you manage all the experiments on your machine.
.. image:: ../../../img/webui-img/managerExperimentList/experimentListNav.png
:width: 100%
* On the ``All experiments`` page, you can see all the experiments on your machine.
.. image:: ../../../img/webui-img/managerExperimentList/expList.png
:width: 100%
* When you want to see more details about an experiment, click its trial id to jump to that experiment, as shown below:
.. image:: ../../../img/webui-img/managerExperimentList/toAnotherExp.png
:width: 100%
* If there are many experiments in the table, you can use the ``filter`` button.
.. image:: ../../../img/webui-img/managerExperimentList/expFilter.png
:width: 100%
Experiment details
------------------
View overview page
^^^^^^^^^^^^^^^^^^
* On the overview tab, you can see the experiment information, its status, and the performance of the ``top trials``.
.. image:: ../../../img/webui-img/full-oview.png
:width: 100%
* If you want to see the experiment search space and config, click the ``Search space`` and ``Config`` buttons on the right (visible when you hover over the button).
**Search space file:**
.. image:: ../../../img/webui-img/searchSpace.png
:width: 80%
**Config file:**
.. image:: ../../../img/webui-img/config.png
:width: 80%
* You can view and download the ``nni-manager/dispatcher log files`` here.
.. image:: ../../../img/webui-img/review-log.png
:width: 80%
* If your experiment has many trials, you can change the refresh interval here.
.. image:: ../../../img/webui-img/refresh-interval.png
:width: 100%
* You can change some experiment configurations such as ``maxExecDuration``, ``maxTrialNum`` and ``trial concurrency`` here.
.. image:: ../../../img/webui-img/edit-experiment-param.png
:width: 80%
View job default metric
^^^^^^^^^^^^^^^^^^^^^^^
* Click the ``Default metric`` tab to see the point chart of all trials. Hover over a point to see its default metric and search space details.
.. image:: ../../../img/webui-img/default-metric.png
:width: 100%
* Turn on the switch named ``Optimization curve`` to see the experiment's optimization curve.
.. image:: ../../../img/webui-img/best-curve.png
:width: 100%
View hyper parameter
^^^^^^^^^^^^^^^^^^^^
Click the tab ``Hyper-parameter`` to see the parallel chart.
* You can click the ``add/remove`` button to add or remove axes.
* Drag the axes to swap axes on the chart.
* You can select the percentage to see top trials.
.. image:: ../../../img/webui-img/hyperPara.png
:width: 100%
View Trial Duration
^^^^^^^^^^^^^^^^^^^
Click the tab ``Trial Duration`` to see the bar chart.
.. image:: ../../../img/webui-img/trial_duration.png
:width: 100%
View Trial Intermediate Result chart
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Click the tab ``Intermediate Result`` to see the line chart.
.. image:: ../../../img/webui-img/trials_intermeidate.png
:width: 100%
A trial may produce many intermediate results during training. To see the trend of some trials more clearly, we provide a filtering function for the intermediate result chart.
You may find that trials get better or worse at a particular intermediate result, which indicates that it is an important and relevant one. To take a closer look at that point, enter its X-value at ``#Intermediate``, then input the range of metrics at this intermediate result. In the picture below, we choose the No. 4 intermediate result and set the range of metrics to 0.8-1.
.. image:: ../../../img/webui-img/filter-intermediate.png
:width: 100%
View trials status
^^^^^^^^^^^^^^^^^^
Click the tab ``Trials Detail`` to see the status of all trials. Specifically:
* Trial detail: trial's id, trial's duration, start time, end time, status, accuracy, and search space file.
.. image:: ../../../img/webui-img/detail-local.png
:width: 100%
* You can search for a specific trial by its id, status, Trial No. or trial parameters.
**Trial id:**
.. image:: ../../../img/webui-img/detail/searchId.png
:width: 80%
**Trial No.:**
.. image:: ../../../img/webui-img/detail/searchNo.png
:width: 80%
**Trial status:**
.. image:: ../../../img/webui-img/detail/searchStatus.png
:width: 80%
**Trial parameters:**
``parameters whose type is choice:``
.. image:: ../../../img/webui-img/detail/searchParameterChoice.png
:width: 80%
``parameters whose type is not choice:``
.. image:: ../../../img/webui-img/detail/searchParameterRange.png
:width: 80%
* The ``Add column`` button lets you select which columns to show in the table. If you run an experiment whose final result is a dict, you can see the other keys in the table. You can choose the ``Intermediate count`` column to watch a trial's progress.
.. image:: ../../../img/webui-img/addColumn.png
:width: 40%
* If you want to compare some trials, you can select them and then click ``Compare`` to see the results.
.. image:: ../../../img/webui-img/select-trial.png
:width: 100%
.. image:: ../../../img/webui-img/compare.png
:width: 80%
* You can use the button named ``Copy as python`` to copy the trial's parameters.
.. image:: ../../../img/webui-img/copyParameter.png
:width: 100%
* Intermediate Result chart: you can see the default metric in this chart by clicking the intermediate button.
.. image:: ../../../img/webui-img/intermediate.png
:width: 100%
* Kill: you can kill a trial whose status is running.
.. image:: ../../../img/webui-img/kill-running.png
:width: 100%
GBDTSelector
------------
GBDTSelector is based on `LightGBM <https://github.com/microsoft/LightGBM>`__\ , which is a gradient boosting framework that uses tree-based learning algorithms.
When the data is passed into the GBDT model, the model constructs the boosting trees. The feature importance comes from the scores during construction, which indicate how useful or valuable each feature was in building the boosted decision trees within the model.
We can use this method as a strong baseline for feature selection, especially when using a GBDT model as the classifier or regressor.
For now, we support an ``importance_type`` of ``split`` or ``gain``. In the future we will support customized ``importance_type``, meaning users can define how to calculate the ``feature score`` by themselves.
Usage
^^^^^
First, you need to install the dependency:
.. code-block:: bash
pip install lightgbm
Then
.. code-block:: python
from nni.algorithms.feature_engineering.gbdt_selector import GBDTSelector
# load data
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# initialize a selector
fgs = GBDTSelector()
# fit data
fgs.fit(X_train, y_train, ...)
# get important features
# this will return the indices of the important features
print(fgs.get_selected_features(10))
...
You can also refer to the examples in ``/examples/feature_engineering/gbdt_selector/``.
**Requirement of fit FuncArgs**

* **X** (array-like, required) - The training input samples, whose shape is [n_samples, n_features].
* **y** (array-like, required) - The target values (class labels in classification, real numbers in regression), whose shape is [n_samples].
* **lgb_params** (dict, required) - The parameters for the lightgbm model. See details `here <https://lightgbm.readthedocs.io/en/latest/Parameters.html>`__.
* **eval_ratio** (float, required) - The ratio of data used to split the eval data and train data from self.X.
* **early_stopping_rounds** (int, required) - The early stopping setting in lightgbm. See details `here <https://lightgbm.readthedocs.io/en/latest/Parameters.html>`__.
* **importance_type** (str, required) - Can be 'split' or 'gain'. 'split' means the result contains the number of times the feature is used in a model, and 'gain' means the result contains the total gains of splits which use the feature. See details `here <https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.feature_importance>`__.
* **num_boost_round** (int, required) - Number of boosting rounds. See details `here <https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html#lightgbm.train>`__.
**Requirement of get_selected_features FuncArgs**

* **topk** (int, required) - The top-k important features you want to select.
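Putting the required ``fit`` arguments together, a plausible call might look like this (all values are illustrative, not tuned; the ``lgb_params`` keys follow lightgbm's parameter docs):

```python
# Illustrative fit() arguments for GBDTSelector, matching the
# requirements listed above (values are examples only).
lgb_params = {
    "boosting_type": "gbdt",
    "objective": "binary",     # classification; use "regression" otherwise
    "num_leaves": 31,
    "learning_rate": 0.05,
}
fit_kwargs = dict(
    lgb_params=lgb_params,
    eval_ratio=0.2,            # hold out 20% of X for early stopping
    early_stopping_rounds=10,
    importance_type="gain",    # or "split"
    num_boost_round=100,
)
# fgs.fit(X_train, y_train, **fit_kwargs)   # with an initialized GBDTSelector fgs
```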
GradientFeatureSelector
-----------------------
The algorithm in GradientFeatureSelector comes from `Feature Gradients: Scalable Feature Selection via Discrete Relaxation <https://arxiv.org/pdf/1908.10382.pdf>`__.
GradientFeatureSelector is a gradient-based search algorithm for feature selection.
1) This approach extends a recent result on the estimation of
learnability in the sublinear data regime by showing that the calculation can be performed iteratively (i.e., in mini-batches) and in **linear time and space** with respect to both the number of features D and the sample size N.
2) This, along with a discrete-to-continuous relaxation of the search domain, allows for an **efficient, gradient-based** search algorithm among feature subsets for very **large datasets**.
3) Crucially, this algorithm is capable of finding **higher-order correlations** between features and targets for both the N > D and N < D regimes, as opposed to approaches that do not consider such interactions and/or only consider one regime.
Usage
^^^^^
.. code-block:: python
from nni.algorithms.feature_engineering.gradient_selector import FeatureGradientSelector
# load data
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# initialize a selector
fgs = FeatureGradientSelector(n_features=10)
# fit data
fgs.fit(X_train, y_train)
# get important features
# this will return the indices of the important features
print(fgs.get_selected_features())
...
You can also refer to the examples in ``/examples/feature_engineering/gradient_feature_selector/``.
**Parameters of class FeatureGradientSelector constructor**

* **order** (int, optional, default = 4) - What order of interactions to include. Higher orders may be more accurate but increase the run time. 12 is the maximum allowed order.
* **penalty** (int, optional, default = 1) - Constant that multiplies the regularization term.
* **n_features** (int, optional, default = None) - If None, automatically chooses the number of features based on search. Otherwise, the number of top features to select.
* **max_features** (int, optional, default = None) - If not None, uses the 'elbow method' to determine the number of features, with max_features as the upper limit.
* **learning_rate** (float, optional, default = 1e-1) - Learning rate.
* **init** (*zero, on, off, onhigh, offhigh, or sklearn, optional, default = zero*) - How to initialize the vector of scores. 'zero' is the default.
* **n_epochs** (int, optional, default = 1) - Number of epochs to run.
* **shuffle** (bool, optional, default = True) - Shuffle "rows" prior to an epoch.
* **batch_size** (int, optional, default = 1000) - Number of "rows" to process at a time.
* **target_batch_size** (int, optional, default = 1000) - Number of "rows" to accumulate gradients over. Useful when many rows will not fit into memory but are needed for accurate estimation.
* **classification** (bool, optional, default = True) - If True, the problem is classification, else regression.
* **ordinal** (bool, optional, default = True) - If True, the problem is ordinal classification. Requires classification to be True.
* **balanced** (bool, optional, default = True) - If True, each class is weighted equally in optimization; otherwise weighting is done via the support of each class. Requires classification to be True.
* **prerocess** (str, optional, default = 'zscore') - 'zscore' centers and normalizes the data to unit variance; 'center' only centers the data to 0 mean.
* **soft_grouping** (bool, optional, default = True) - If True, groups represent features that come from the same source. Used to encourage sparsity of groups and features within groups.
* **verbose** (int, optional, default = 0) - Controls the verbosity when fitting. Set to 0 for no printing, 1 or higher for printing every ``verbose`` number of gradient steps.
* **device** (str, optional, default = 'cpu') - 'cpu' to run on CPU, 'cuda' to run on GPU. Runs much faster on GPU.
**Requirement of fit FuncArgs**

* **X** (array-like, required) - The training input samples, whose shape is [n_samples, n_features]. ``np.ndarray`` recommended.
* **y** (array-like, required) - The target values (class labels in classification, real numbers in regression), whose shape is [n_samples]. ``np.ndarray`` recommended.
* **groups** (array-like, optional, default = None) - Groups of columns that must be selected as a unit, whose shape is [n_features]. E.g., [0, 0, 1, 2] specifies that the first two columns are part of a group.
**Requirement of get_selected_features FuncArgs**
For now, the ``get_selected_features`` function has no parameters.
Feature Engineering with NNI
============================
.. note::
We are glad to announce the alpha release of the Feature Engineering toolkit on top of NNI. It is still in the experimental phase and might evolve based on user feedback. We'd like to invite you to use it, give feedback, and even contribute.
For now, we support the following feature selectors:
* :doc:`GradientFeatureSelector <./gradient_feature_selector>`
* :doc:`GBDTSelector <./gbdt_selector>`
These selectors are suitable for tabular data (i.e., they do not cover image, speech or text data).
In addition, these selectors perform only feature selection. If you want to:

1) generate high-order combined features on NNI while doing feature selection;
2) leverage your distributed resources;

you could try this :githublink:`example <examples/feature_engineering/auto-feature-engineering>`.
How to use?
-----------
.. code-block:: python
from nni.algorithms.feature_engineering.gradient_selector import FeatureGradientSelector
# from nni.algorithms.feature_engineering.gbdt_selector import GBDTSelector
# load data
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# initialize a selector
fgs = FeatureGradientSelector(...)
# fit data
fgs.fit(X_train, y_train)
# get important features
# this will return the indices of the important features
print(fgs.get_selected_features(...))
...
When using a built-in selector, you first need to ``import`` it and ``initialize`` it. You can call ``fit`` to pass the data to the selector, and then use ``get_selected_features`` to get the important features. The function parameters differ between selectors, so check the docs before using them.
How to customize?
-----------------
NNI provides *state-of-the-art* feature selector algorithms as built-in selectors. NNI also supports building a feature selector by yourself.
If you want to implement a customized feature selector, you need to:
#. Inherit the base FeatureSelector class
#. Implement the *fit* and *get_selected_features* functions
#. Integrate with sklearn (Optional)
Here is an example:
**1. Inherit the base FeatureSelector Class**
.. code-block:: python
from nni.feature_engineering.feature_selector import FeatureSelector
class CustomizedSelector(FeatureSelector):
def __init__(self, *args, **kwargs):
...
**2. Implement the fit and get_selected_features Functions**
.. code-block:: python
from nni.feature_engineering.feature_selector import FeatureSelector
class CustomizedSelector(FeatureSelector):
def __init__(self, *args, **kwargs):
...
def fit(self, X, y, **kwargs):
"""
Fit the training data to FeatureSelector
Parameters
------------
X : array-like numpy matrix
The training input samples, whose shape is [n_samples, n_features].
y : array-like numpy matrix
The target values (class labels in classification, real numbers in regression), whose shape is [n_samples].
"""
self.X = X
self.y = y
...
def get_selected_features(self):
"""
Get important features
Returns
-------
list :
Return the indices of the important features.
"""
...
return self.selected_features_
...
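To make the interface concrete, here is a tiny runnable example: a selector that keeps the k highest-variance columns. The base class is stubbed so the sketch runs without NNI installed; in real code you would inherit ``nni.feature_engineering.feature_selector.FeatureSelector`` as shown above.

```python
class FeatureSelector:  # stand-in for nni's base class (sketch only)
    def fit(self, X, y, **kwargs):
        raise NotImplementedError

    def get_selected_features(self):
        raise NotImplementedError


class VarianceTopKSelector(FeatureSelector):
    """Keep the k columns of X with the highest variance."""

    def __init__(self, k=2):
        self.k = k

    def fit(self, X, y, **kwargs):
        n, d = len(X), len(X[0])
        means = [sum(row[j] for row in X) / n for j in range(d)]
        variances = [sum((row[j] - means[j]) ** 2 for row in X) / n
                     for j in range(d)]
        # indices of the k highest-variance columns, returned in ascending order
        top = sorted(range(d), key=lambda j: variances[j], reverse=True)[:self.k]
        self.selected_features_ = sorted(top)
        return self

    def get_selected_features(self):
        return self.selected_features_


X = [[0, 1, 10], [0, 2, 20], [0, 3, 30]]  # column 0 is constant
y = [0, 1, 0]
print(VarianceTopKSelector(k=2).fit(X, y).get_selected_features())  # [1, 2]
```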
**3. Integrate with Sklearn**
``sklearn.pipeline.Pipeline`` can connect models in series, such as feature selector, normalization, and classification/regression to form a typical machine learning problem workflow.
The following steps can help us integrate better with sklearn, which means we can treat the customized feature selector as a module of the pipeline.
#. Inherit the class *sklearn.base.BaseEstimator*
#. Implement the *get_params* and *set_params* functions of *BaseEstimator*
#. Inherit the class *sklearn.feature_selection.base.SelectorMixin*
#. Implement the *get_support*, *transform* and *inverse_transform* functions of *SelectorMixin*
Here is an example:
**1. Inherit the BaseEstimator Class and its Function**
.. code-block:: python
from sklearn.base import BaseEstimator
from nni.feature_engineering.feature_selector import FeatureSelector
class CustomizedSelector(FeatureSelector, BaseEstimator):
def __init__(self, *args, **kwargs):
...
def get_params(self, *args, **kwargs):
"""
Get parameters for this estimator.
"""
params = self.__dict__
params = {key: val for (key, val) in params.items() if not key.endswith('_')}
return params
def set_params(self, **params):
"""
Set the parameters of this estimator.
"""
for param in params:
if hasattr(self, param):
setattr(self, param, params[param])
return self
**2. Inherit the SelectorMixin Class and its Function**
.. code-block:: python
from sklearn.base import BaseEstimator
from sklearn.feature_selection.base import SelectorMixin
from nni.feature_engineering.feature_selector import FeatureSelector
class CustomizedSelector(FeatureSelector, BaseEstimator, SelectorMixin):
def __init__(self, *args, **kwargs):
...
def get_params(self, *args, **kwargs):
"""
Get parameters for this estimator.
"""
params = self.__dict__
params = {key: val for (key, val) in params.items()
if not key.endswith('_')}
return params
def set_params(self, **params):
"""
Set the parameters of this estimator.
"""
for param in params:
if hasattr(self, param):
setattr(self, param, params[param])
return self
def get_support(self, indices=False):
"""
Get a mask, or integer index, of the features selected.
Parameters
----------
indices : bool
Default False. If True, the return value will be an array of integers, rather than a boolean mask.
Returns
-------
list :
returns support: An index that selects the retained features from a feature vector.
If ``indices`` is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention.
If ``indices`` is True, this is an integer array of shape [# output features] whose values
are indices into the input feature vector.
"""
...
return mask
def transform(self, X):
"""Reduce X to the selected features.
Parameters
----------
X : array
which shape is [n_samples, n_features]
Returns
-------
X_r : array
which shape is [n_samples, n_selected_features]
The input samples with only the selected features.
"""
...
return X_r
def inverse_transform(self, X):
"""
Reverse the transformation operation
Parameters
----------
X : array
shape is [n_samples, n_selected_features]
Returns
-------
X_r : array
shape is [n_samples, n_original_features]
"""
...
return X_r
After integrating with Sklearn, we could use the feature selector as follows:
.. code-block:: python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# load data
...
X_train, y_train = ...

# build a pipeline with your customized selector, or with sklearn's SelectFromModel:
pipeline = make_pipeline(XXXSelector(...), LogisticRegression())
pipeline = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression())
pipeline.fit(X_train, y_train)
# score
print("Pipeline Score: ", pipeline.score(X_train, y_train))
Benchmark
---------
``Baseline`` means passing the data directly to LogisticRegression without any feature selection. For this benchmark, we use only 10% of the training data as test data. For GradientFeatureSelector, we take only the top 20 features. The metric is the mean accuracy on the given test data and labels.
.. list-table::
:header-rows: 1
:widths: auto
* - Dataset
- All Features + LR (acc, time, memory)
- GradientFeatureSelector + LR (acc, time, memory)
- TreeBasedClassifier + LR (acc, time, memory)
- #Train
- #Feature
* - colon-cancer
- 0.7547, 890ms, 348MiB
- 0.7368, 363ms, 286MiB
- 0.7223, 171ms, 1171 MiB
- 62
- 2,000
* - gisette
- 0.9725, 215ms, 584MiB
- 0.89416, 446ms, 397MiB
- 0.9792, 911ms, 234MiB
- 6,000
- 5,000
* - avazu
- 0.8834, N/A, N/A
- N/A, N/A, N/A
- N/A, N/A, N/A
- 40,428,967
- 1,000,000
* - rcv1
- 0.9644, 557ms, 241MiB
- 0.7333, 401ms, 281MiB
- 0.9615, 752ms, 284MiB
- 20,242
- 47,236
* - news20.binary
- 0.9208, 707ms, 361MiB
- 0.6870, 565ms, 371MiB
- 0.9070, 904ms, 364MiB
- 19,996
- 1,355,191
* - real-sim
- 0.9681, 433ms, 274MiB
- 0.7969, 251ms, 274MiB
- 0.9591, 643ms, 367MiB
- 72,309
- 20,958
The benchmark datasets can be downloaded `here <https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/>`__.
The code can be found at ``/examples/feature_engineering/gradient_feature_selector/benchmark_test.py``.