Hybrid Training Service
=======================
Hybrid training service aggregates different types of computation resources into a virtually unified resource pool, in which trial jobs are dispatched. It lets users pool all of their available computation resources for one AutoML task, and it is flexible enough to switch among different types of computation resources. For example, NNI could submit trial jobs to multiple remote machines and AML simultaneously.

Prerequisite
------------
NNI supports :doc:`./local`, :doc:`./remote`, :doc:`./openpai`, :doc:`./aml`, :doc:`./kubeflow`, :doc:`./frameworkcontroller` for hybrid training service. Before starting an experiment using hybrid training service, users should first set up their chosen (sub) training services (e.g., remote training service) according to each training service's own documentation page.

.. note:: Reuse mode is disabled by default for local training service, but when local training service is used in hybrid, :ref:`reuse mode <training-service-reuse>` is enabled by default.

Usage
-----
Unlike other training services (e.g., ``platform: remote`` in remote training service), there is no dedicated keyword for hybrid training service; users simply list the configurations of their chosen training services under the ``trainingService`` field. Below is an example experiment configuration YAML of a hybrid training service containing remote training service and local training service.

.. code-block:: yaml

    # the experiment config yaml file
    ...
    trainingService:
      - platform: remote
        machineList:
          - host: 127.0.0.1 # your machine's IP address
            user: bob
            password: bob
      - platform: local
    ...

A complete example configuration file can be found in :githublink:`examples/trials/mnist-pytorch/config_hybrid.yml`.
Kubeflow Training Service
=========================
NNI supports running experiments on `Kubeflow <https://github.com/kubeflow/kubeflow>`__, called kubeflow mode.
Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster,
either on-premises or `Azure Kubernetes Service (AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__,
and a Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__
is set up to connect to your Kubernetes cluster.
If you are not familiar with Kubernetes, `here <https://kubernetes.io/docs/tutorials/kubernetes-basics/>`__ is a good starting point.
In kubeflow mode, your trial program will run as a Kubeflow job in your Kubernetes cluster.

Prerequisite for on-premises Kubernetes Service
-----------------------------------------------
1. A **Kubernetes** cluster using Kubernetes 1.8 or later.
   Follow this `guideline <https://kubernetes.io/docs/setup/>`__ to set up Kubernetes.
2. Download, set up, and deploy **Kubeflow** to your Kubernetes cluster.
   Follow this `guideline <https://www.kubeflow.org/docs/started/getting-started/>`__ to set up Kubeflow.
3. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server.
   By default, NNI manager uses ``~/.kube/config`` as the kubeconfig file's path.
   You can also specify another kubeconfig file by setting the **KUBECONFIG** environment variable.
   Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__
   to learn more about kubeconfig.
4. If your NNI trial job needs GPU resources, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__
   to configure the **Nvidia device plugin for Kubernetes**.
5. Prepare an **NFS server** and export a general purpose mount
   (we recommend mapping your NFS server path with the ``root_squash`` option,
   otherwise permission issues may arise when NNI copies files to NFS;
   refer to this `page <https://linux.die.net/man/5/exports>`__ to learn what the ``root_squash`` option is),
   or **Azure File Storage**.
6. Install the **NFS client** on the machine where you install NNI and run nnictl to create experiments.
   Run this command to install the NFSv4 client:

   .. code-block:: bash

      apt install nfs-common

7. Install **NNI**:

   .. code-block:: bash

      python -m pip install nni

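As noted in step 3, you can point NNI at a non-default kubeconfig by exporting the **KUBECONFIG** variable before launching the experiment. The file name below is hypothetical:

.. code-block:: bash

   # select a non-default kubeconfig (hypothetical path)
   export KUBECONFIG=$HOME/.kube/aks-config
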
Prerequisite for Azure Kubernetes Service
-----------------------------------------
1. NNI supports Kubeflow based on Azure Kubernetes Service.
   Follow the `guideline <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__ to set up Azure Kubernetes Service.
2. Install `Azure CLI <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__ and **kubectl**.
   Use ``az login`` to set your Azure account, and connect the kubectl client to AKS;
   refer to this `guideline <https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster>`__.
3. Deploy Kubeflow on Azure Kubernetes Service following the `guideline <https://www.kubeflow.org/docs/started/getting-started/>`__.
4. Follow the `guideline <https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal>`__
   to create an Azure file storage account.
   If you use Azure Kubernetes Service, NNI needs an Azure Storage Service account to store code files and output files.
5. To access the Azure storage service, NNI needs the access key of the storage account,
   and NNI uses the `Azure Key Vault <https://azure.microsoft.com/en-us/services/key-vault/>`__ service to protect your private key.
   Set up Azure Key Vault Service and add a secret to Key Vault to store the access key of the Azure storage account.
   Follow this `guideline <https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli>`__ to store the access key.

Design
------
.. image:: ../../../img/kubeflow_training_design.png
   :target: ../../../img/kubeflow_training_design.png
   :alt: Kubeflow training service design

Kubeflow training service instantiates a Kubernetes REST client to interact with your K8s cluster's API server.
For each trial, it uploads all the files in your local ``trial_code_directory``,
together with NNI-generated files such as parameter.cfg, into a storage volume.
Right now two kinds of storage volumes are supported:
`NFS <https://en.wikipedia.org/wiki/Network_File_System>`__
and `Azure file storage <https://azure.microsoft.com/en-us/services/storage/files/>`__;
you should configure the storage volume in the experiment config.
After files are prepared, Kubeflow training service calls the K8s REST API to create Kubeflow jobs
(`tf-operator <https://github.com/kubeflow/tf-operator>`__ jobs
or `pytorch-operator <https://github.com/kubeflow/pytorch-operator>`__ jobs)
in K8s, and mounts your storage volume into the job's pod.
Output files of the Kubeflow job, such as stdout, stderr, trial.log, and model files, are also copied back to the storage volume.
NNI shows the storage volume's URL for each trial in the web portal, so users can browse the log files and the job's output files.

Supported operator
------------------
NNI supports only the tf-operator and pytorch-operator of Kubeflow; other operators are not tested.
Users can set the operator type in the experiment config.
The setting for tf-operator:

.. code-block:: python

    config.training_service.operator = 'tf-operator'

The setting for pytorch-operator:

.. code-block:: python

    config.training_service.operator = 'pytorch-operator'

If users want to use tf-operator, they can set ``ps`` and ``worker`` in the trial config.
If users want to use pytorch-operator, they can set ``master`` and ``worker`` in the trial config.

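For illustration, a tf-operator trial could declare its two roles like this in the experiment YAML; the replica counts and commands are hypothetical, and other required fields are omitted:

.. code-block:: yaml

   trainingService:
     platform: kubeflow
     operator: tf-operator
     worker:
       replicas: 2
       command: python3 model.py
     ps:
       replicas: 1
       command: python3 model.py
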
Supported storage type
----------------------
NNI supports NFS and Azure Storage to store code and output files;
users can set the storage type in the config file along with the corresponding settings.
The settings for NFS storage are as follows:

.. code-block:: python

    config.training_service.storage = K8sNfsConfig(
        server = '10.20.30.40', # your NFS server IP
        path = '/mnt/nfs/nni'   # your NFS server export path
    )

If you use Azure storage, you should set ``storage`` in your config as follows:

.. code-block:: python

    config.training_service.storage = K8sAzureStorageConfig(
        azure_account = your_azure_account_name,
        azure_share = your_azure_share_name,
        key_vault_name = your_vault_name,
        key_vault_key = your_secret_name
    )

Run an experiment
-----------------
Use :doc:`PyTorch quickstart </tutorials/hpo_quickstart_pytorch/main>` as an example.
It is a PyTorch job, so it uses the pytorch-operator of Kubeflow.
The experiment config looks like:

.. code-block:: python

    from nni.experiment import Experiment, K8sNfsConfig, KubeflowRoleConfig

    experiment = Experiment('kubeflow')
    experiment.config.search_space = search_space
    experiment.config.tuner.name = 'TPE'
    experiment.config.tuner.class_args['optimize_mode'] = 'maximize'
    experiment.config.max_trial_number = 10
    experiment.config.trial_concurrency = 2
    experiment.config.training_service.operator = 'pytorch-operator'
    experiment.config.training_service.api_version = 'v1alpha2'
    experiment.config.training_service.storage = K8sNfsConfig()
    experiment.config.training_service.storage.server = '10.20.30.40'
    experiment.config.training_service.storage.path = '/mnt/nfs/nni'
    experiment.config.training_service.worker = KubeflowRoleConfig()
    experiment.config.training_service.worker.replicas = 2
    experiment.config.training_service.worker.command = 'python3 model.py'
    experiment.config.training_service.worker.gpu_number = 1
    experiment.config.training_service.worker.cpu_number = 1
    experiment.config.training_service.worker.memory_size = '4g'
    experiment.config.training_service.worker.code_directory = '.'
    experiment.config.training_service.worker.docker_image = 'msranni/nni:latest' # default
    experiment.config.training_service.master = KubeflowRoleConfig()
    experiment.config.training_service.master.replicas = 1
    experiment.config.training_service.master.command = 'python3 model.py'
    experiment.config.training_service.master.gpu_number = 0
    experiment.config.training_service.master.cpu_number = 1
    experiment.config.training_service.master.memory_size = '4g'
    experiment.config.training_service.master.code_directory = '.'

Once it's ready, run:

.. code-block:: python

    experiment.run(8080)

NNI will create a Kubeflow pytorchjob for each trial;
the job name looks like ``nni_exp_{experiment_id}_trial_{trial_id}``.
You can see the Kubeflow jobs created by NNI in your Kubernetes dashboard.
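The naming scheme can be sketched as follows; the IDs below are made up for illustration:

.. code-block:: python

   # Illustrative reconstruction of the job-name scheme; the IDs are hypothetical.
   experiment_id = "GE6GZ3Ta"  # hypothetical experiment ID
   trial_id = "Ab3de"          # hypothetical trial ID
   job_name = f"nni_exp_{experiment_id}_trial_{trial_id}"
   print(job_name)
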
.. note:: In kubeflow mode, NNI manager starts a REST server that listens on a port equal to your NNI web portal's port plus 1.
   For example, if your web portal port is ``8080``, the REST server will listen on ``8081``
   to receive metrics from trial jobs running in Kubernetes.
   You should therefore enable TCP port ``8081`` in your firewall rules to allow incoming traffic.

Once a trial job is completed, you can go to the NNI web portal's overview page (like http://localhost:8080/oview)
to check the trials' information.

Local Training Service
======================
With local training service, the whole experiment (e.g., tuning algorithms, trials) runs on a single machine, i.e., the user's dev machine. The generated trials run on this machine following the ``trialConcurrency`` set in the configuration YAML file. If trials use GPUs, local training service allocates the required number of GPUs to each trial, like a resource scheduler.

.. note:: Currently, :ref:`reuse mode <training-service-reuse>` remains disabled by default in local training service.
Prerequisite
------------
We recommend going through the quick start first, as this page only explains the configuration of local training service, one part of the experiment configuration YAML file.

Usage
-----
.. code-block:: yaml

    # the experiment config yaml file
    ...
    trainingService:
      platform: local
      useActiveGpu: false # optional
    ...

Local training service supports other fields, such as ``maxTrialNumberPerGpu`` and ``gpuIndices``, for concurrently running multiple trials on one GPU and for running trials on a subset of the GPUs on your machine. Please refer to :ref:`reference-local-config-label` in the reference for detailed usage.

.. note::
   Users should set **useActiveGpu** to ``true`` if the local machine has GPUs and your trial uses GPU, but generated trials keep waiting. This is usually the case when you are using a graphical OS like Windows 10 or Ubuntu desktop.

Next, we explain how local training service works with different configurations of ``trialGpuNumber`` and ``trialConcurrency``. Suppose the user's local machine has 4 GPUs. With the configuration ``trialGpuNumber: 1`` and ``trialConcurrency: 4``, 4 trials run on this machine concurrently, each using 1 GPU. With ``trialGpuNumber: 2`` and ``trialConcurrency: 2``, 2 trials run concurrently, each using 2 GPUs. Which GPU is allocated to which trial is decided by local training service; users do not need to worry about it. An example configuration is shown below.

.. code-block:: yaml

    ...
    trialGpuNumber: 1
    trialConcurrency: 4
    ...
    trainingService:
      platform: local
      useActiveGpu: false

A complete example configuration file can be found :githublink:`examples/trials/mnist-pytorch/config.yml`.
OpenPAI Training Service
========================
NNI supports running an experiment on `OpenPAI <https://github.com/Microsoft/pai>`__. OpenPAI manages computing resources and is optimized for deep learning. Through Docker technology, the computing hardware is decoupled from the software, so that it's easy to run distributed jobs, switch between different deep learning frameworks, or run other kinds of jobs in consistent environments.

Prerequisite
------------
1. Before starting to use OpenPAI training service, you should have an account to access an `OpenPAI <https://github.com/Microsoft/pai>`__ cluster. See `here <https://github.com/Microsoft/pai>`__ if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. Please note that, on OpenPAI, your trial program will run in Docker containers.
2. Get a token. Open the web portal of OpenPAI, and click the ``My profile`` button in the top-right corner.

   .. image:: ../../../img/pai_profile.jpg
      :scale: 80%

   Click the ``copy`` button on the page to copy a JWT token.

   .. image:: ../../../img/pai_token.jpg
      :scale: 67%

3. Mount NFS storage to your local machine. If you don't know where to find the NFS storage, please click the ``Submit job`` button in the web portal.

   .. image:: ../../../img/pai_job_submission_page.jpg
      :scale: 50%

   Find the data management region on the job submission page.

   .. image:: ../../../img/pai_data_management_page.jpg
      :scale: 33%

   The ``Preview container paths`` shows the NFS host and path that OpenPAI provides. You need to mount the corresponding host and path to your local machine first; then NNI can use OpenPAI's NFS storage to upload data/code to, or download from, the OpenPAI cluster. To mount the storage, use the ``mount`` command, for example:

   .. code-block:: bash

      sudo mount -t nfs4 gcr-openpai-infra02:/pai/data /local/mnt

   Then the ``/data`` folder in the container will be mounted to the ``/local/mnt`` folder on your local machine. Keep in mind that ``localStorageMountPoint`` should be set to ``/local/mnt`` in this case.

4. Get OpenPAI's storage config name and ``containerStorageMountPoint``. They can also be found in the data management region on the job submission page. Find the ``Name`` and ``Path`` of your ``Team share storage``; they should be put into ``storageConfigName`` and ``containerStorageMountPoint``. For example:

   .. code-block:: yaml

      storageConfigName: confignfs-data
      containerStorageMountPoint: /mnt/confignfs-data

Usage
-----
We show an example configuration here in YAML (the Python configuration is similar).

.. code-block:: yaml

    trialGpuNumber: 0
    trialConcurrency: 1
    ...
    trainingService:
      platform: openpai
      host: http://123.123.123.123
      username: ${your user name}
      token: ${your token}
      dockerImage: msranni/nni
      trialCpuNumber: 1
      trialMemorySize: 8GB
      storageConfigName: confignfs-data
      localStorageMountPoint: /local/mnt
      containerStorageMountPoint: /mnt/confignfs-data

Once the configuration is complete, run nnictl or use Python to launch the experiment, and NNI will start spawning trials on the specified OpenPAI platform.
The job name format is like ``nni_exp_{experiment_id}_trial_{trial_id}``. You can see the jobs created by NNI on the OpenPAI cluster's web portal, like:

.. image:: ../../../img/nni_pai_joblist.jpg
.. note:: For OpenPAI training service, NNI will start an additional REST server that listens on a port equal to your NNI WebUI's port plus 1. For example, if your WebUI port is ``8080``, the REST server will listen on ``8081`` to receive metrics from trial jobs running on OpenPAI. You should therefore enable TCP port ``8081`` in your firewall rules to allow incoming traffic.

Once a trial job is completed, you can go to the NNI WebUI's overview page (like ``http://localhost:8080/oview``) to check the trial's information. For example, you can expand a trial's information in the trial list view and click the logPath link, like:

.. image:: ../../../img/nni_webui_joblist.png
   :scale: 30%

Configuration References
------------------------
Compared with :doc:`local` and :doc:`remote`, OpenPAI training service supports the following additional configurations.
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Field name
     - Description
   * - username
     - Required field. User name of the OpenPAI platform.
   * - token
     - Required field. Authentication key of the OpenPAI platform.
   * - host
     - Required field. The host of the OpenPAI platform. It's the URI of PAI's job submission page, like ``10.10.5.1``. The default protocol in NNI is HTTPS. If your PAI cluster has disabled HTTPS, please use the URI in ``http://10.10.5.1`` format.
   * - trialCpuNumber
     - Optional field. Should be a positive number based on your trial program's CPU requirement. If it's not set in the trial configuration, it should be set in the config specified by the ``openpaiConfig`` or ``openpaiConfigFile`` field.
   * - trialMemorySize
     - Optional field. Should be in a format like ``2gb``, based on your trial program's memory requirement. If it's not set in the trial configuration, it should be set in the config specified by the ``openpaiConfig`` or ``openpaiConfigFile`` field.
   * - dockerImage
     - Optional field. In OpenPAI training service, your trial program will be scheduled by OpenPAI to run in a `Docker container <https://www.docker.com/>`__. This key is used to specify the Docker image used to create the container in which your trial will run. Upon every NNI release, we build `a docker image <https://hub.docker.com/r/msranni/nni>`__ with `this Dockerfile <https://hub.docker.com/r/msranni/nni>`__. You can either use this image directly in your config file or build your own image. If it's not set in the trial configuration, it should be set in the config specified by the ``openpaiConfig`` or ``openpaiConfigFile`` field.
   * - virtualCluster
     - Optional field. Sets the virtual cluster of OpenPAI. If omitted, the job will run on the ``default`` virtual cluster.
   * - localStorageMountPoint
     - Required field. Sets the mount path on the machine where you start the experiment.
   * - containerStorageMountPoint
     - Optional field. Sets the mount path in your container used in OpenPAI.
   * - storageConfigName
     - Optional field. Sets the storage name used in OpenPAI. If it's not set in the trial configuration, it should be set in the config specified by the ``openpaiConfig`` or ``openpaiConfigFile`` field.
   * - openpaiConfigFile
     - Optional field. Sets the file path of an OpenPAI job configuration in YAML format. If users set ``openpaiConfigFile`` in NNI's configuration file, there's no need to specify the fields ``storageConfigName``, ``virtualCluster``, ``dockerImage``, ``trialCpuNumber``, ``trialGpuNumber``, and ``trialMemorySize`` in the configuration; these fields will use the values from the config file specified by ``openpaiConfigFile``.
   * - openpaiConfig
     - Optional field. Similar to ``openpaiConfigFile``, but instead of referencing an external file, this field embeds the content into NNI's config YAML.

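For instance, a configuration that delegates these fields to an external OpenPAI job config might look like this (the referenced file path is hypothetical):

.. code-block:: yaml

   trainingService:
     platform: openpai
     host: http://123.123.123.123
     username: ${your user name}
     token: ${your token}
     localStorageMountPoint: /local/mnt
     # storageConfigName, dockerImage, trialCpuNumber, etc. come from this file:
     openpaiConfigFile: ./openpai_job_config.yaml
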
.. note::

   #. The job name in OpenPAI's configuration file will be replaced by a new job name created by NNI; the name format is ``nni_exp_{this.experimentId}_trial_{trialJobId}``.
   #. If users set multiple taskRoles in OpenPAI's configuration file, NNI will wrap all of these taskRoles and start multiple tasks in one trial job. Users should ensure that only one taskRole reports metrics to NNI, otherwise conflict errors may occur.

Data management
---------------
Before using NNI to start your experiment, users should set the corresponding mount data path on the nniManager machine. OpenPAI has its own storage (NFS, AzureBlob, etc.), and the storage used in OpenPAI is mounted into the container when it starts a job. Users should set the OpenPAI storage type via the ``paiStorageConfigName`` field to choose a storage in OpenPAI. Then users should mount the storage to their nniManager machine and set the ``nniManagerNFSMountPath`` field in the configuration file; NNI will generate bash files and copy the data in ``codeDir`` to the ``nniManagerNFSMountPath`` folder, and then start a trial job. The data in ``nniManagerNFSMountPath`` will be synced to OpenPAI storage and mounted into OpenPAI's container. The data path in the container is set by ``containerNFSMountPath``; NNI will enter this folder first, and then run scripts to start a trial job.

Version check
-------------
NNI supports a version check feature since version 0.6. It is a policy to ensure that the version of NNIManager is consistent with that of trialKeeper, and to avoid errors caused by version incompatibility.
Check policy:

#. NNIManager before v0.6 could run any version of trialKeeper; trialKeeper supports backward compatibility.
#. Since version 0.6, the NNIManager version should be the same as the trialKeeper version. For example, if the NNIManager version is 0.6, the trialKeeper version should be 0.6 too.
#. Note that the version check only checks the first two digits of the version. For example, NNIManager v0.6.1 could use trialKeeper v0.6 or trialKeeper v0.6.2, but could not use trialKeeper v0.5.1 or trialKeeper v0.7.

If you cannot run your experiment and want to know whether it is caused by the version check, you can check the WebUI; there will be an error message about the version check.

.. image:: ../../../img/webui-img/experimentError.png
   :scale: 80%

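The two-digit compatibility rule described above can be sketched in Python as follows; this is an illustrative reconstruction, not NNI's actual implementation:

.. code-block:: python

   def versions_compatible(manager: str, keeper: str) -> bool:
       """Sketch of the check policy: only the first two version
       components must match (illustrative, not NNI's actual code)."""
       return manager.split('.')[:2] == keeper.split('.')[:2]

   print(versions_compatible('0.6.1', '0.6'))  # True
   print(versions_compatible('0.6.1', '0.7'))  # False
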
Overview
========
NNI supports the training services listed below. Users can go through each page to learn how to configure the corresponding training service. NNI is highly extensible by design; users can customize new training services for their special resources, platforms, or needs.

.. list-table::
   :header-rows: 1

   * - Training Service
     - Description
   * - :doc:`Local <local>`
     - The whole experiment runs on your dev machine (i.e., a single local machine)
   * - :doc:`Remote <remote>`
     - The trials are dispatched to your configured SSH servers
   * - :doc:`OpenPAI <openpai>`
     - Running trials on OpenPAI, a DNN model training platform based on Kubernetes
   * - :doc:`Kubeflow <kubeflow>`
     - Running trials with Kubeflow, a DNN model training framework based on Kubernetes
   * - :doc:`AdaptDL <adaptdl>`
     - Running trials on AdaptDL, an elastic DNN model training platform
   * - :doc:`FrameworkController <frameworkcontroller>`
     - Running trials with FrameworkController, a DNN model training framework on Kubernetes
   * - :doc:`AML <aml>`
     - Running trials on Azure Machine Learning (AML) cloud service
   * - :doc:`PAI-DLC <paidlc>`
     - Running trials on PAI-DLC, which provides deep learning containers based on Alibaba ACK
   * - :doc:`Hybrid <hybrid>`
     - Jointly using multiple of the above training services

.. _training-service-reuse:
Training Service Under Reuse Mode
---------------------------------
Since NNI v2.0, there are two sets of training service implementations in NNI. The new one is called *reuse mode*. When reuse mode is enabled, a cluster, such as a remote machine or a computer instance on AML, will launch a long-running environment, so that NNI will submit trials to these environments iteratively, which saves the time to create new jobs. For instance, using OpenPAI training platform under reuse mode can avoid the overhead of pulling docker images, creating containers, and downloading data repeatedly.
.. note:: In the reuse mode, users need to make sure each trial can run independently in the same job (e.g., avoid loading checkpoints from previous trials).
PAI-DLC Training Service
========================
NNI supports running an experiment on `PAI-DSW <https://help.aliyun.com/document_detail/194831.html>`__ and submitting trials to `PAI-DLC <https://help.aliyun.com/document_detail/165137.html>`__, which provides deep learning containers based on Alibaba ACK.
The PAI-DSW server plays the role of submitting jobs, while PAI-DLC is where the training jobs run.

Prerequisite
------------
Step 1. Install NNI following the :doc:`install guide </installation>`.

Step 2. Create a PAI-DSW server following this `link <https://help.aliyun.com/document_detail/163684.html?section-2cw-lsi-es9#title-ji9-re9-88x>`__. Note that since the training jobs run on PAI-DLC, the PAI-DSW server won't consume many resources, and a CPU-only server may suffice.

Step 3. Open PAI-DLC `here <https://pai-dlc.console.aliyun.com/#/guide>`__ and select the same region as your PAI-DSW server. Move to ``dataset configuration`` and mount the same NAS disk as the PAI-DSW server does. (Note that currently only the PAI-DLC public cluster is supported.)

Step 4. Open your PAI-DSW server command line, then download and install the PAI-DLC Python SDK to submit DLC tasks; refer to `this link <https://help.aliyun.com/document_detail/203290.html>`__. Skip this step if the SDK is already installed.

.. code-block:: bash

    wget https://sdk-portal-cluster-prod.oss-cn-zhangjiakou.aliyuncs.com/downloads/u-3536038a-3de7-4f2e-9379-0cb309d29355-python-pai-dlc.zip
    unzip u-3536038a-3de7-4f2e-9379-0cb309d29355-python-pai-dlc.zip
    pip install ./pai-dlc-20201203 # pai-dlc-20201203 is the unzipped SDK folder name; replace it accordingly.

Usage
-----
Use ``examples/trials/mnist-pytorch`` as an example. The NNI config YAML file's content is like:
.. code-block:: yaml

    # working directory on DSW, please provide the FULL path
    experimentWorkingDirectory: /home/admin/workspace/{your_working_dir}
    searchSpaceFile: search_space.json
    # the command on the trial runner (i.e., the DLC container); be aware of data_dir
    trialCommand: python mnist.py --data_dir /root/data/{your_data_dir}
    trialConcurrency: 1 # NOTE: please provide a number <= 3 due to a DLC system limit.
    maxTrialNumber: 10
    tuner:
      name: TPE
      classArgs:
        optimize_mode: maximize
    # ref: https://help.aliyun.com/document_detail/203290.html?spm=a2c4g.11186623.6.727.6f9b5db6bzJh4x
    trainingService:
      platform: dlc
      type: Worker
      image: registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/pytorch-training:1.6.0-gpu-py37-cu101-ubuntu18.04
      jobType: PyTorchJob # choices: [TFJob, PyTorchJob]
      podCount: 1
      ecsSpec: ecs.c6.large
      region: cn-hangzhou
      workspaceId: ${your_workspace_id}
      accessKeyId: ${your_ak_id}
      accessKeySecret: ${your_ak_key}
      nasDataSourceId: ${your_nas_data_source_id} # NAS datasource ID, e.g., datat56by9n1xt0a
      ossDataSourceId: ${your_oss_data_source_id} # OSS datasource ID, in case your data is on OSS
      localStorageMountPoint: /home/admin/workspace/ # default NAS path on DSW
      containerStorageMountPoint: /root/data/ # default NAS path in the DLC container; change it according to your setting

Note: you should set ``platform: dlc`` in the NNI config YAML file if you want to start the experiment in dlc mode.
Compared with :doc:`local`, the training service configuration in dlc mode has additional keys such as ``type``, ``image``, ``jobType``, ``podCount``, ``ecsSpec``, ``region``, ``nasDataSourceId``, ``accessKeyId``, and ``accessKeySecret``; for detailed explanations refer to this `link <https://help.aliyun.com/document_detail/203111.html#h2-url-3>`__.
Also, since dlc mode requires DSW/DLC to mount the same NAS disk to share information, there are two extra keys related to this: ``localStorageMountPoint`` and ``containerStorageMountPoint``.

Run the following commands to start the example experiment:

.. code-block:: bash

    git clone -b ${NNI_VERSION} https://github.com/microsoft/nni
    cd nni/examples/trials/mnist-pytorch

    # modify config_dlc.yml ...
    nnictl create --config config_dlc.yml

Replace ``${NNI_VERSION}`` with a released version name or branch name, e.g., ``v2.3``.
Monitor your job
^^^^^^^^^^^^^^^^
To monitor your job on DLC, you need to visit `DLC <https://pai-dlc.console.aliyun.com/#/jobs>`__ to check job status.
Remote Training Service
=======================
NNI can run one experiment on multiple remote machines through SSH, called ``remote`` mode. It's like a lightweight training platform. In this mode, NNI is started on your computer and dispatches trials to remote machines in parallel.
The supported OSes for remote machines are ``Linux``, ``Windows 10``, and ``Windows Server 2019``.

Prerequisite
------------
1. Make sure the default environment of remote machines meets requirements of your trial code. If the default environment does not meet the requirements, the setup script can be added into ``command`` field of NNI config.
2. Make sure remote machines can be accessed through SSH from the machine which runs ``nnictl`` command. It supports both password and key authentication of SSH. For advanced usage, please refer to :ref:`reference-remote-config-label` in reference for detailed usage.
3. Make sure the NNI version on each machine is consistent. Follow the install guide :doc:`here </installation>` to install NNI.
4. Make sure the trial command is compatible with the remote OSes if you want to use remote Linux and Windows together. For example, the default Python 3.x executable is called ``python3`` on Linux and ``python`` on Windows.

In addition, there are several steps for Windows servers.

1. Install and start ``OpenSSH Server``.

   1) Open the ``Settings`` app on Windows.
   2) Click ``Apps``, then click ``Optional features``.
   3) Click ``Add a feature``, search for and select ``OpenSSH Server``, and then click ``Install``.
   4) Once it's installed, run the commands below to start the service and set it to start automatically.

   .. code-block:: bat

      sc config sshd start=auto
      net start sshd

2. Make sure the remote account is an administrator, so that it can stop running trials.
3. Make sure there is no welcome message beyond the default, since it causes ssh2 failures in Node.js. For example, if you're using a Data Science VM on Azure, you need to remove the extra echo commands in ``C:\dsvm\tools\setup\welcome.bat``.

   Output like the following is OK when opening a new command window.

   .. code-block:: text

      Microsoft Windows [Version 10.0.17763.1192]
      (c) 2018 Microsoft Corporation. All rights reserved.
      (py37_default) C:\Users\AzureUser>

Usage
-----
Use ``examples/trials/mnist-pytorch`` as the example. Suppose there are two machines, which can be logged in with username and password or key authentication of SSH. Here is a template configuration specification.
.. code-block:: yaml

    searchSpaceFile: search_space.json
    trialCommand: python3 mnist.py
    trialGpuNumber: 0
    trialConcurrency: 4
    maxTrialNumber: 20
    tuner:
      name: TPE
      classArgs:
        optimize_mode: maximize
    trainingService:
      platform: remote
      machineList:
        - host: 192.0.2.1
          user: alice
          ssh_key_file: ~/.ssh/id_rsa
        - host: 192.0.2.2
          port: 10022
          user: bob
          password: bob123

The example configuration is saved in ``examples/trials/mnist-pytorch/config_remote.yml``.
You can run the command below on Windows, Linux, or macOS to spawn trials on remote Linux machines:

.. code-block:: bash

    nnictl create --config examples/trials/mnist-pytorch/config_remote.yml

.. _nniignore:
.. Note:: If you are planning to use remote machines or clusters as your training service, to avoid too much pressure on network, NNI limits the number of files to 2000 and total size to 300MB. If your trial code directory contains too many files, you can choose which files and subfolders should be excluded by adding a ``.nniignore`` file that works like a ``.gitignore`` file. For more details on how to write this file, see the `git documentation <https://git-scm.com/docs/gitignore#_pattern_format>`__.
*Example:* :githublink:`config_detailed.yml <examples/trials/mnist-pytorch/config_detailed.yml>` and :githublink:`.nniignore <examples/trials/mnist-pytorch/.nniignore>`
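For reference, a hypothetical ``.nniignore`` that excludes common large artifacts could look like the following (the patterns here are illustrative, not taken from the linked example):

.. code-block:: text

   # .gitignore-style patterns; matched files are not uploaded with the trial code
   data/
   checkpoints/
   *.ckpt
   __pycache__/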
More features
-------------
Configure python environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
By default, commands and scripts are executed in the default environment on the remote machine. If there are multiple Python virtual environments on your remote machine and you want to run experiments in a specific one, use **pythonPath** to specify a Python environment on your remote machine.
For example, with Anaconda you can specify:
.. code-block:: yaml
pythonPath: /home/bob/.conda/envs/ENV-NAME/bin
Configure shared storage
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The remote training service supports shared storage, which lets you use your own storage with NNI. Follow the guide :doc:`here <./shared_storage>` to learn how to use shared storage.
Monitor via TensorBoard
^^^^^^^^^^^^^^^^^^^^^^^
The remote training service supports trial visualization via TensorBoard. Follow the guide :doc:`/experiment/web_portal/tensorboard` to learn how to use TensorBoard.
How to Use Shared Storage
=========================
If you want to use your own storage when working with NNI, shared storage can satisfy that need.
Compared with training service native storage, shared storage offers more convenience.
All the information generated by the experiment is stored under the ``/nni`` folder in your shared storage.
All the output produced by a trial is located under the ``/nni/{EXPERIMENT_ID}/trials/{TRIAL_ID}/nnioutput`` folder in your shared storage.
This saves you from searching for experiment-related information in various places.
Remember that your trial working directory is ``/nni/{EXPERIMENT_ID}/trials/{TRIAL_ID}``, so if you upload your data to this shared storage, you can open it like a local file in your trial code without downloading it.
We will develop more practical features based on shared storage in the future. The config reference can be found :ref:`here <reference-sharedstorage-config-label>`.
.. note::
Shared storage is currently in the experimental stage. We suggest using AzureBlob under Ubuntu/CentOS/RHEL, and NFS under Ubuntu/CentOS/RHEL/Fedora/Debian, for remote.
Make sure your local machine can mount NFS or fuse AzureBlob, and that the machine used in the training service has ``sudo`` permission without password. For now, we only support shared storage with training services in reuse mode.
.. note::
What is the difference between training service native storage and shared storage? Training service native storage is usually provided by the specific training service,
e.g., the local storage on the remote machine in remote mode, or the provided storage in openpai mode. These storages might not be easy to use; for example, users have to upload datasets to all remote machines to train the model.
In these cases, shared storage can be automatically mounted on the machines in the training platform. Users can directly save and load data from the shared storage, and all the data/logs used or generated in one experiment can be kept in the same place.
After the experiment finishes, the shared storage is automatically unmounted from the training platform.
Example
-------
If you want to use AzureBlob, add the following to your config. For the full config file, see :githublink:`mnist-sharedstorage/config_azureblob.yml <examples/trials/mnist-sharedstorage/config_azureblob.yml>`.
.. code-block:: yaml
sharedStorage:
storageType: AzureBlob
# set localMountPoint to an absolute path outside the code directory,
# because NNI will copy the user code to localMountPoint
localMountPoint: ${your/local/mount/point}
# remoteMountPoint is the mount point on the training service machine; it can be an absolute or a relative path
# make sure you have `sudo` permission without password on the training service machine
remoteMountPoint: ${your/remote/mount/point}
storageAccountName: ${replace_to_your_storageAccountName}
storageAccountKey: ${replace_to_your_storageAccountKey}
containerName: ${replace_to_your_containerName}
# usermount means you have already mounted this storage on localMountPoint
# nnimount means NNI will try to mount this storage on localMountPoint
# nomount means the storage will not be mounted on the local machine; partial storages will be supported in the future
localMounted: nnimount
You can find ``storageAccountName``, ``storageAccountKey``, and ``containerName`` in the Azure storage account portal.
.. image:: ../../../img/azure_storage.png
:target: ../../../img/azure_storage.png
:alt:
If you want to use NFS, add the following to your config. For the full config file, see :githublink:`mnist-sharedstorage/config_nfs.yml <examples/trials/mnist-sharedstorage/config_nfs.yml>`.
.. code-block:: yaml
sharedStorage:
storageType: NFS
localMountPoint: ${your/local/mount/point}
remoteMountPoint: ${your/remote/mount/point}
nfsServer: ${nfs-server-ip}
exportedDirectory: ${nfs/exported/directory}
# usermount means you have already mounted this storage on localMountPoint
# nnimount means NNI will try to mount this storage on localMountPoint
# nomount means the storage will not be mounted on the local machine; partial storages will be supported in the future
localMounted: nnimount
Training Service
================
.. toctree::
:hidden:
Overview <overview>
Local <local>
Remote <remote>
OpenPAI <openpai>
Kubeflow <kubeflow>
AdaptDL <adaptdl>
FrameworkController <frameworkcontroller>
AML <aml>
PAI-DLC <paidlc>
Hybrid <hybrid>
Customize a Training Service <customize>
Shared Storage <shared_storage>
Visualize Trial with TensorBoard
================================
You can launch a TensorBoard process across one or multiple trials within the web portal since NNI v2.2. For now, this feature supports the local training service and reuse-mode training services with shared storage, and will support more scenarios in later NNI versions.
Preparation
-----------
Make sure TensorBoard is installed in your environment. If you have never used TensorBoard, here are getting-started tutorials for your reference: `tensorboard with tensorflow <https://www.tensorflow.org/tensorboard/get_started>`__ and `tensorboard with pytorch <https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html>`__.
Use WebUI to Launch TensorBoard
--------------------------------
Save Logs
^^^^^^^^^
NNI automatically fetches the ``tensorboard`` subfolder under a trial's output folder as the TensorBoard logdir. So in the trial source code, you need to save the TensorBoard logs under ``NNI_OUTPUT_DIR/tensorboard``. This log path can be joined as:
.. code-block:: python
log_dir = os.path.join(os.environ["NNI_OUTPUT_DIR"], 'tensorboard')
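As a minimal sketch of a trial that writes its logs to that location (the PyTorch ``SummaryWriter`` lines are commented out so the path logic stands alone; the local fallback folder is an assumption for running the script outside of NNI):

.. code-block:: python

   import os

   # NNI sets NNI_OUTPUT_DIR for every trial; fall back to a local folder
   # when the script runs outside of NNI (e.g., for debugging).
   output_dir = os.environ.get("NNI_OUTPUT_DIR", "./output")
   log_dir = os.path.join(output_dir, "tensorboard")

   # With PyTorch, pass log_dir to SummaryWriter so NNI can locate the logs:
   # from torch.utils.tensorboard import SummaryWriter
   # writer = SummaryWriter(log_dir)
   # writer.add_scalar("loss", 0.5, global_step=1)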
Launch Tensorboard
^^^^^^^^^^^^^^^^^^
* As with the compare function, first select the trials whose logs you want to combine, then click the ``Tensorboard`` button.
.. image:: ../../../img/Tensorboard_1.png
:target: ../../../img/Tensorboard_1.png
:alt:
* After clicking the ``OK`` button in the pop-up box, you will jump to the TensorBoard portal.
.. image:: ../../../img/Tensorboard_2.png
:target: ../../../img/Tensorboard_2.png
:alt:
* You can see the ``SequenceID-TrialID`` on the tensorboard portal.
.. image:: ../../../img/Tensorboard_3.png
:target: ../../../img/Tensorboard_3.png
:alt:
Stop All
^^^^^^^^
If you want to open a portal you have already launched, click its TensorBoard id. If you do not need TensorBoard anymore, click the ``Stop all tensorboard`` button.
.. image:: ../../../img/Tensorboard_4.png
:target: ../../../img/Tensorboard_4.png
:alt:
Web Portal
==========
.. toctree::
:hidden:
Experiment Web Portal <web_portal>
Visualize with TensorBoard <tensorboard>
Web Portal
==========
The web portal lets users conveniently visualize their NNI experiments: tuning and training progress, detailed metrics, and error logs. It also allows users to control their experiments and trials, such as updating an experiment's concurrency or duration, and rerunning trials.
.. image:: ../../../static/img/webui.gif
:width: 100%
Q&A
---
There are many trials in the detail table but the ``Default Metric`` chart is empty
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. note::
First, you should know that the ``Default metric`` and ``Hyper parameter`` charts only show succeeded trials.
What should you do when a chart looks strange, such as ``Default metric`` or ``Hyper parameter``?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Download the experiment results (``experiment config``, ``trial message`` and ``intermediate metrics``) from ``Experiment summary`` and then upload them in your issue.
.. image:: ../../../img/webui-img/summary.png
:width: 80%
What should you do when your experiment has an error
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Click the icon to the right of ``experiment status`` and take a screenshot of the error message.
* Then click ``learn more`` to download the ``nni-manager`` and ``dispatcher`` log files.
* Please file an issue via ``Feedback`` in the ``About`` menu and upload the above information.
.. image:: ../../../img/webui-img/experimentError.png
:width: 80%
What should you do when your trial fails
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* ``Customized trial`` can be used here. Just submit the same parameters to the experiment to rerun the trial.
.. image:: ../../../img/webui-img/detail/customizedTrialButton.png
:width: 25%
.. image:: ../../../img/webui-img/detail/customizedTrial.png
:width: 40%
* The log module helps you find the reason for the error. There are three buttons in local mode: ``View trial log``, ``View trial error`` and ``View trial stdout``. If you run on the OpenPAI or Kubeflow platform, you can see the trial stdout and NFS log.
If you have any questions, you can tell us in the issue.
**local mode:**
.. image:: ../../../img/webui-img/detail/log-local.png
:width: 100%
**OpenPAI, Kubeflow and other mode:**
.. image:: ../../../img/webui-img/detail-pai.png
:width: 100%
How to use a dict intermediate result
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
`This discussion <https://github.com/microsoft/nni/discussions/4289>`_ may help you.
.. _exp-manage-webportal:
Experiments management
----------------------
The experiments management page can manage all the experiments on your machine.
.. image:: ../../../img/webui-img/managerExperimentList/experimentListNav.png
:width: 100%
* On the ``All experiments`` page, you can see all the experiments on your machine.
.. image:: ../../../img/webui-img/managerExperimentList/expList.png
:width: 100%
* When you want to see more details about an experiment, you can click its id to jump to the detail page, as shown below:
.. image:: ../../../img/webui-img/managerExperimentList/toAnotherExp.png
:width: 100%
* If there are many experiments in the table, you can use the ``filter`` button.
.. image:: ../../../img/webui-img/managerExperimentList/expFilter.png
:width: 100%
Experiment details
------------------
View overview page
^^^^^^^^^^^^^^^^^^
* On the overview tab, you can see the experiment information, status, and the performance of the ``top trials``.
.. image:: ../../../img/webui-img/full-oview.png
:width: 100%
* If you want to see the experiment search space and config, click the buttons on the right, ``Search space`` and ``Config`` (shown when you hover over the button).
**Search space file:**
.. image:: ../../../img/webui-img/searchSpace.png
:width: 80%
**Config file:**
.. image:: ../../../img/webui-img/config.png
:width: 80%
* You can view and download the ``nni-manager/dispatcher log files`` here.
.. image:: ../../../img/webui-img/review-log.png
:width: 80%
* If your experiment has many trials, you can change the refresh interval here.
.. image:: ../../../img/webui-img/refresh-interval.png
:width: 100%
* You can change some experiment configurations, such as ``maxExecDuration``, ``maxTrialNum`` and ``trial concurrency``, here.
.. image:: ../../../img/webui-img/edit-experiment-param.png
:width: 80%
View job default metric
^^^^^^^^^^^^^^^^^^^^^^^
* Click the ``Default metric`` tab to see the point chart of all trials. Hover over a point to see its default metric and search space information.
.. image:: ../../../img/webui-img/default-metric.png
:width: 100%
* Turn on the switch named ``Optimization curve`` to see the experiment's optimization curve.
.. image:: ../../../img/webui-img/best-curve.png
:width: 100%
View hyper parameter
^^^^^^^^^^^^^^^^^^^^
Click the ``Hyper-parameter`` tab to see the parallel coordinates chart.
* You can click the ``add/remove`` button to add or remove axes.
* Drag the axes to swap axes on the chart.
* You can select the percentage to see top trials.
.. image:: ../../../img/webui-img/hyperPara.png
:width: 100%
View Trial Duration
^^^^^^^^^^^^^^^^^^^
Click the tab ``Trial Duration`` to see the bar chart.
.. image:: ../../../img/webui-img/trial_duration.png
:width: 100%
View Trial Intermediate Result chart
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Click the tab ``Intermediate Result`` to see the line chart.
.. image:: ../../../img/webui-img/trials_intermeidate.png
:width: 100%
A trial may have many intermediate results in the training process. In order to see the trend of some trials more clearly, we provide a filtering function for the intermediate result chart.
You may find that trials get better or worse at a certain intermediate result, which indicates that it is an important and relevant one. To take a closer look at that point, enter its corresponding X-value at #Intermediate, then input the range of metrics at this intermediate result. In the picture below, we choose the No. 4 intermediate result and set the range of metrics to 0.8-1.
.. image:: ../../../img/webui-img/filter-intermediate.png
:width: 100%
View trials status
^^^^^^^^^^^^^^^^^^
Click the ``Trials Detail`` tab to see the status of all trials. Specifically:
* Trial detail: a trial's id, duration, start time, end time, status, accuracy, and search space file.
.. image:: ../../../img/webui-img/detail-local.png
:width: 100%
* Search for a specific trial by its id, status, Trial No., or trial parameters.
**Trial id:**
.. image:: ../../../img/webui-img/detail/searchId.png
:width: 80%
**Trial No.:**
.. image:: ../../../img/webui-img/detail/searchNo.png
:width: 80%
**Trial status:**
.. image:: ../../../img/webui-img/detail/searchStatus.png
:width: 80%
**Trial parameters:**
``parameters whose type is choice:``
.. image:: ../../../img/webui-img/detail/searchParameterChoice.png
:width: 80%
``parameters whose type is not choice:``
.. image:: ../../../img/webui-img/detail/searchParameterRange.png
:width: 80%
* The ``Add column`` button lets you select which columns to show in the table. If you run an experiment whose final result is a dict, you can see the other keys in the table. You can choose the ``Intermediate count`` column to watch a trial's progress.
.. image:: ../../../img/webui-img/addColumn.png
:width: 40%
* If you want to compare some trials, you can select them and then click ``Compare`` to see the results.
.. image:: ../../../img/webui-img/select-trial.png
:width: 100%
.. image:: ../../../img/webui-img/compare.png
:width: 80%
* You can use the button named ``Copy as python`` to copy the trial's parameters.
.. image:: ../../../img/webui-img/copyParameter.png
:width: 100%
* Intermediate Result chart: you can see the default metric in this chart by clicking the intermediate button.
.. image:: ../../../img/webui-img/intermediate.png
:width: 100%
* Kill: you can kill a trial whose status is running.
.. image:: ../../../img/webui-img/kill-running.png
:width: 100%
GBDTSelector
------------
GBDTSelector is based on `LightGBM <https://github.com/microsoft/LightGBM>`__\ , which is a gradient boosting framework that uses tree-based learning algorithms.
When the data is passed into the GBDT model, the model constructs the boosting trees. The feature importance comes from the scores during construction, which indicate how useful or valuable each feature was in building the boosted decision trees within the model.
We can use this method as a strong baseline in feature selection, especially when using a GBDT model as the classifier or regressor.
For now, the supported values of ``importance_type`` are ``split`` and ``gain``. We will support customized ``importance_type`` in the future, which means users can define how to calculate the ``feature score`` by themselves.
Usage
^^^^^
First, you need to install the dependency:
.. code-block:: bash
pip install lightgbm
Then
.. code-block:: python
from nni.algorithms.feature_engineering.gbdt_selector import GBDTSelector
# load data
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# initialize a selector
fgs = GBDTSelector()
# fit data
fgs.fit(X_train, y_train, ...)
# get important features
# this will return the indices of the important features
print(fgs.get_selected_features(10))
...
You can also refer to the examples in ``/examples/feature_engineering/gbdt_selector/``.
**Requirement of fit FuncArgs**

* **X** (array-like, required) - The training input samples, of shape [n_samples, n_features].
* **y** (array-like, required) - The target values (class labels in classification, real numbers in regression), of shape [n_samples].
* **lgb_params** (dict, required) - The parameters for the lightgbm model. For details, refer `here <https://lightgbm.readthedocs.io/en/latest/Parameters.html>`__.
* **eval_ratio** (float, required) - The ratio of data size used to split the eval data and train data from self.X.
* **early_stopping_rounds** (int, required) - The early stopping setting in lightgbm. For details, refer `here <https://lightgbm.readthedocs.io/en/latest/Parameters.html>`__.
* **importance_type** (str, required) - Can be 'split' or 'gain'. 'split' means the result contains the number of times the feature is used in a model, and 'gain' means the result contains the total gain of the splits which use the feature. For details, refer `here <https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.feature_importance>`__.
* **num_boost_round** (int, required) - The number of boosting rounds. For details, refer `here <https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html#lightgbm.train>`__.

**Requirement of get_selected_features FuncArgs**

* **topk** (int, required) - The top-k important features you want to select.
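To make the argument list above concrete, here is a hedged sketch of assembling the ``fit`` keyword arguments. All values are illustrative placeholders, not recommendations, and the final calls are commented out because they need ``nni``, ``lightgbm``, and real data:

.. code-block:: python

   # Illustrative lightgbm parameters for a binary classification task.
   lgb_params = {
       'objective': 'binary',
       'num_leaves': 31,
       'learning_rate': 0.05,
   }

   fit_kwargs = dict(
       lgb_params=lgb_params,
       eval_ratio=0.2,             # hold out 20% of X for early stopping
       early_stopping_rounds=10,
       importance_type='gain',     # or 'split'
       num_boost_round=100,
   )

   # selector = GBDTSelector()
   # selector.fit(X_train, y_train, **fit_kwargs)
   # print(selector.get_selected_features(topk=10))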
GradientFeatureSelector
-----------------------
The algorithm in GradientFeatureSelector comes from `Feature Gradients: Scalable Feature Selection via Discrete Relaxation <https://arxiv.org/pdf/1908.10382.pdf>`__.
GradientFeatureSelector is a gradient-based search algorithm for feature selection.
1) This approach extends a recent result on the estimation of
learnability in the sublinear data regime by showing that the calculation can be performed iteratively (i.e., in mini-batches) and in **linear time and space** with respect to both the number of features D and the sample size N.
2) This, along with a discrete-to-continuous relaxation of the search domain, allows for an **efficient, gradient-based** search algorithm among feature subsets for very **large datasets**.
3) Crucially, this algorithm is capable of finding **higher-order correlations** between features and targets for both the N > D and N < D regimes, as opposed to approaches that do not consider such interactions and/or only consider one regime.
Usage
^^^^^
.. code-block:: python
from nni.algorithms.feature_engineering.gradient_selector import FeatureGradientSelector
# load data
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# initialize a selector
fgs = FeatureGradientSelector(n_features=10)
# fit data
fgs.fit(X_train, y_train)
# get important features
# this will return the indices of the important features
print(fgs.get_selected_features())
...
You can also refer to the examples in ``/examples/feature_engineering/gradient_feature_selector/``.
**Parameters of class FeatureGradientSelector constructor**

* **order** (int, optional, default = 4) - What order of interactions to include. Higher orders may be more accurate but increase the run time. 12 is the maximum allowed order.
* **penalty** (int, optional, default = 1) - Constant that multiplies the regularization term.
* **n_features** (int, optional, default = None) - If None, automatically chooses the number of features based on the search. Otherwise, the number of top features to select.
* **max_features** (int, optional, default = None) - If not None, uses the 'elbow method' to determine the number of features, with max_features as the upper limit.
* **learning_rate** (float, optional, default = 1e-1) - Learning rate.
* **init** (*zero, on, off, onhigh, offhigh, or sklearn, optional, default = zero*) - How to initialize the vector of scores. 'zero' is the default.
* **n_epochs** (int, optional, default = 1) - Number of epochs to run.
* **shuffle** (bool, optional, default = True) - Shuffle "rows" prior to an epoch.
* **batch_size** (int, optional, default = 1000) - Number of "rows" to process at a time.
* **target_batch_size** (int, optional, default = 1000) - Number of "rows" to accumulate gradients over. Useful when many rows will not fit into memory but are needed for accurate estimation.
* **classification** (bool, optional, default = True) - If True, the problem is classification; otherwise regression.
* **ordinal** (bool, optional, default = True) - If True, the problem is ordinal classification. Requires classification to be True.
* **balanced** (bool, optional, default = True) - If True, each class is weighted equally in optimization; otherwise weighting is done via the support of each class. Requires classification to be True.
* **preprocess** (str, optional, default = 'zscore') - 'zscore' centers the data and normalizes it to unit variance; 'center' only centers the data to zero mean.
* **soft_grouping** (bool, optional, default = True) - If True, groups represent features that come from the same source. Used to encourage sparsity of groups and of features within groups.
* **verbose** (int, optional, default = 0) - Controls the verbosity when fitting. Set to 0 for no printing, or to 1 or higher to print every ``verbose`` number of gradient steps.
* **device** (str, optional, default = 'cpu') - 'cpu' to run on CPU, 'cuda' to run on GPU. Runs much faster on GPU.
**Requirement of fit FuncArgs**

* **X** (array-like, required) - The training input samples, of shape [n_samples, n_features]. ``np.ndarray`` recommended.
* **y** (array-like, required) - The target values (class labels in classification, real numbers in regression), of shape [n_samples]. ``np.ndarray`` recommended.
* **groups** (array-like, optional, default = None) - Groups of columns that must be selected as a unit, of shape [n_features]. E.g., [0, 0, 1, 2] specifies that the first two columns are part of a group.

**Requirement of get_selected_features FuncArgs**

For now, the ``get_selected_features`` function has no parameters.
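The ``groups`` argument is the least obvious of these, so here is a short sketch of how it encodes column grouping. The arrays are toy values, and the ``fit`` call is commented out since it needs ``nni`` installed:

.. code-block:: python

   import numpy as np

   X = np.random.rand(8, 4)          # 8 samples, 4 features
   y = np.random.randint(0, 2, 8)    # binary labels

   # Columns 0 and 1 come from the same source, so they must be
   # selected (or dropped) together; columns 2 and 3 are independent.
   groups = np.array([0, 0, 1, 2])

   # from nni.algorithms.feature_engineering.gradient_selector import FeatureGradientSelector
   # fgs = FeatureGradientSelector(n_features=2, n_epochs=1)
   # fgs.fit(X, y, groups=groups)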
Feature Engineering with NNI
============================
.. note::
We are glad to announce the alpha release of the Feature Engineering toolkit on top of NNI. It is still in the experimental phase and might evolve based on user feedback. We would like to invite you to use it, give feedback, and even contribute.
For now, we support the following feature selector:
* :doc:`GradientFeatureSelector <./gradient_feature_selector>`
* :doc:`GBDTSelector <./gbdt_selector>`
These selectors are suitable for tabular data (i.e., they do not support image, speech or text data).
In addition, these selectors are only for feature selection. If you want to:
1) generate high-order combined features on nni while doing feature selection;
2) leverage your distributed resources;
you could try this :githublink:`example <examples/feature_engineering/auto-feature-engineering>`.
How to use?
-----------
.. code-block:: python
from nni.algorithms.feature_engineering.gradient_selector import FeatureGradientSelector
# from nni.algorithms.feature_engineering.gbdt_selector import GBDTSelector
# load data
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# initialize a selector
fgs = FeatureGradientSelector(...)
# fit data
fgs.fit(X_train, y_train)
# get important features
# this will return the indices of the important features
print(fgs.get_selected_features(...))
...
When using a built-in selector, you first need to ``import`` the feature selector and ``initialize`` it. You can call the function ``fit`` on the selector to pass the data to it. After that, you can use ``get_selected_features`` to get the important features. The function parameters may differ between selectors, so you need to check the docs before using one.
How to customize?
-----------------
NNI provides *state-of-the-art* feature selector algorithms as built-in selectors. NNI also supports building a feature selector by yourself.
If you want to implement a customized feature selector, you need to:
#. Inherit the base FeatureSelector class
#. Implement the ``fit`` and ``get_selected_features`` functions
#. Integrate with sklearn (optional)
Here is an example:
**1. Inherit the base FeatureSelector Class**
.. code-block:: python
from nni.feature_engineering.feature_selector import FeatureSelector
class CustomizedSelector(FeatureSelector):
def __init__(self, *args, **kwargs):
...
**2. Implement the fit and get_selected_features Functions**
.. code-block:: python
from nni.feature_engineering.feature_selector import FeatureSelector
class CustomizedSelector(FeatureSelector):
def __init__(self, *args, **kwargs):
...
def fit(self, X, y, **kwargs):
"""
Fit the training data to FeatureSelector
Parameters
------------
X : array-like numpy matrix
The training input samples, which shape is [n_samples, n_features].
y: array-like numpy matrix
The target values (class labels in classification, real numbers in regression). Which shape is [n_samples].
"""
self.X = X
self.y = y
...
def get_selected_features(self):
"""
Get important feature
Returns
-------
list :
Return the index of the important feature.
"""
...
return self.selected_features_
...
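As a self-contained illustration of this interface, here is a toy selector that keeps the ``k`` highest-variance features. It uses plain NumPy and omits the NNI base class so that it runs standalone; a real implementation would inherit ``FeatureSelector`` as shown above:

.. code-block:: python

   import numpy as np

   class VarianceTopKSelector:
       """Toy selector following the FeatureSelector interface:
       fit(X, y) stores state, get_selected_features() returns indices."""

       def __init__(self, k=2):
           self.k = k
           self.selected_features_ = None

       def fit(self, X, y=None, **kwargs):
           X = np.asarray(X)
           variances = X.var(axis=0)
           # indices of the k highest-variance features, in column order
           self.selected_features_ = sorted(
               np.argsort(variances)[::-1][:self.k].tolist())
           return self

       def get_selected_features(self):
           return self.selected_features_

   X = np.array([[0.0, 1.0, 10.0],
                 [0.0, 2.0, -10.0],
                 [0.0, 3.0, 10.0]])
   sel = VarianceTopKSelector(k=2).fit(X)
   print(sel.get_selected_features())  # -> [1, 2]; column 0 is constant, so it is dropped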
**3. Integrate with Sklearn**
``sklearn.pipeline.Pipeline`` can connect models in series, such as feature selector, normalization, and classification/regression to form a typical machine learning problem workflow.
The following step could help us to better integrate with sklearn, which means we could treat the customized feature selector as a module of the pipeline.
#. Inherit the class ``sklearn.base.BaseEstimator``
#. Implement the ``get_params`` and ``set_params`` functions in *BaseEstimator*
#. Inherit the class ``sklearn.feature_selection.base.SelectorMixin``
#. Implement the ``get_support``, ``transform`` and ``inverse_transform`` functions in *SelectorMixin*
Here is an example:
**1. Inherit the BaseEstimator Class and Implement its Functions**
.. code-block:: python
from sklearn.base import BaseEstimator
from nni.feature_engineering.feature_selector import FeatureSelector
class CustomizedSelector(FeatureSelector, BaseEstimator):
def __init__(self, *args, **kwargs):
...
def get_params(self, *args, **kwargs):
"""
Get parameters for this estimator.
"""
params = self.__dict__
params = {key: val for (key, val) in params.items() if not key.endswith('_')}
return params
def set_params(self, **params):
"""
Set the parameters of this estimator.
"""
for param in params:
if hasattr(self, param):
setattr(self, param, params[param])
return self
**2. Inherit the SelectorMixin Class and Implement its Functions**
.. code-block:: python
from sklearn.base import BaseEstimator
from sklearn.feature_selection.base import SelectorMixin
from nni.feature_engineering.feature_selector import FeatureSelector
class CustomizedSelector(FeatureSelector, BaseEstimator, SelectorMixin):
def __init__(self, *args, **kwargs):
...
def get_params(self, *args, **kwargs):
"""
Get parameters for this estimator.
"""
params = self.__dict__
params = {key: val for (key, val) in params.items()
if not key.endswith('_')}
return params
def set_params(self, **params):
"""
Set the parameters of this estimator.
"""
for param in params:
if hasattr(self, param):
setattr(self, param, params[param])
return self
def get_support(self, indices=False):
"""
Get a mask, or integer index, of the features selected.
Parameters
----------
indices : bool
Default False. If True, the return value will be an array of integers, rather than a boolean mask.
Returns
-------
list :
Returns support: an index that selects the retained features from a feature vector.
If ``indices`` is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention.
If ``indices`` is True, this is an integer array of shape [# output features] whose values
are indices into the input feature vector.
"""
...
return mask
def transform(self, X):
"""Reduce X to the selected features.
Parameters
----------
X : array
which shape is [n_samples, n_features]
Returns
-------
X_r : array
which shape is [n_samples, n_selected_features]
The input samples with only the selected features.
"""
...
return X_r
def inverse_transform(self, X):
"""
Reverse the transformation operation
Parameters
----------
X : array
shape is [n_samples, n_selected_features]
Returns
-------
X_r : array
shape is [n_samples, n_original_features]
"""
...
return X_r
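To make the relationship between these three methods concrete, here is a pure-Python illustration (``MaskTransforms`` is a hypothetical helper, not an sklearn or NNI class): ``get_support``, ``transform`` and ``inverse_transform`` are all views of one boolean mask over the input features.

```python
# Pure-Python illustration of how get_support/transform/inverse_transform
# relate through a boolean mask; sklearn itself is not required here.
class MaskTransforms:
    def __init__(self, mask):
        self.mask = mask  # boolean list, one entry per input feature

    def get_support(self, indices=False):
        if indices:
            return [i for i, keep in enumerate(self.mask) if keep]
        return self.mask

    def transform(self, X):
        # keep only the selected columns of each sample
        return [[x for x, keep in zip(row, self.mask) if keep] for row in X]

    def inverse_transform(self, X):
        # scatter selected columns back to their original positions, zero elsewhere
        out = []
        for row in X:
            it = iter(row)
            out.append([next(it) if keep else 0 for keep in self.mask])
        return out

m = MaskTransforms([True, False, True])
print(m.get_support(indices=True))    # [0, 2]
print(m.transform([[1, 2, 3]]))       # [[1, 3]]
print(m.inverse_transform([[1, 3]]))  # [[1, 0, 3]]
```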
After integrating with Sklearn, we could use the feature selector as follows:
.. code-block:: python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
# load data
...
X_train, y_train = ...
# build a pipeline; XXXSelector stands for your customized or built-in selector:
# pipeline = make_pipeline(XXXSelector(...), LogisticRegression())
pipeline = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression())
pipeline.fit(X_train, y_train)
# score
print("Pipeline Score: ", pipeline.score(X_train, y_train))
Benchmark
---------
``Baseline`` means that we pass the data directly to LogisticRegression, without any feature selection. For this benchmark, we use only 10% of the training data as test data. For the GradientFeatureSelector, we take only the top 20 features. The metric is the mean accuracy on the given test data and labels.
.. list-table::
:header-rows: 1
:widths: auto
* - Dataset
- All Features + LR (acc, time, memory)
- GradientFeatureSelector + LR (acc, time, memory)
- TreeBasedClassifier + LR (acc, time, memory)
- #Train
- #Feature
* - colon-cancer
- 0.7547, 890ms, 348MiB
- 0.7368, 363ms, 286MiB
- 0.7223, 171ms, 1171 MiB
- 62
- 2,000
* - gisette
- 0.9725, 215ms, 584MiB
- 0.89416, 446ms, 397MiB
- 0.9792, 911ms, 234MiB
- 6,000
- 5,000
* - avazu
- 0.8834, N/A, N/A
- N/A, N/A, N/A
- N/A, N/A, N/A
- 40,428,967
- 1,000,000
* - rcv1
- 0.9644, 557ms, 241MiB
- 0.7333, 401ms, 281MiB
- 0.9615, 752ms, 284MiB
- 20,242
- 47,236
* - news20.binary
- 0.9208, 707ms, 361MiB
- 0.6870, 565ms, 371MiB
- 0.9070, 904ms, 364MiB
- 19,996
- 1,355,191
* - real-sim
- 0.9681, 433ms, 274MiB
- 0.7969, 251ms, 274MiB
- 0.9591, 643ms, 367MiB
- 72,309
- 20,958
The benchmark datasets can be downloaded `here <https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/>`__.
For reference code, see ``/examples/feature_engineering/gradient_feature_selector/benchmark_test.py``.
Feature Engineering
===================
.. toctree::
:maxdepth: 2
Overview <overview>
GradientFeatureSelector <gradient_feature_selector>
GBDTSelector <gbdt_selector>
Advanced Usage
==============
.. toctree::
:hidden:
Command Line Tool Example </tutorials/hpo_nnictl/nnictl>
Implement Custom Tuners and Assessors <custom_algorithm>
Install Custom or 3rd-party Tuners and Assessors <custom_algorithm_installation>
Tuner Benchmark <hpo_benchmark>
Tuner Benchmark Example Statistics <hpo_benchmark_stats>
Assessor: Early Stopping
========================
In HPO, some hyperparameter sets may have obviously poor performance and it will be unnecessary to finish the evaluation.
This is called *early stopping*, and in NNI early stopping algorithms are called *assessors*.
An assessor monitors *intermediate results* of each *trial*.
If a trial is predicted to produce suboptimal final result, the assessor will stop that trial immediately,
to save computing resources for other hyperparameter sets.
As introduced in quickstart tutorial, a trial is the evaluation process of a hyperparameter set,
and intermediate results are reported with :func:`nni.report_intermediate_result` API in trial code.
Typically, intermediate results are accuracy or loss metrics of each epoch.
Using an assessor will increase the efficiency of computing resources,
but may slightly reduce the prediction accuracy of tuners.
It is recommended to use an assessor when computing resources are insufficient.
Common Usage
------------
The usage of assessors is similar to that of tuners.
To use a built-in assessor you need to specify its name and arguments:
.. code-block:: python
config.assessor.name = 'Medianstop'
config.assessor.class_args = {'optimize_mode': 'maximize'}
Built-in Assessors
------------------
.. list-table::
:header-rows: 1
:widths: auto
* - Assessor
- Brief Introduction of Algorithm
* - :class:`Median Stop <nni.algorithms.hpo.medianstop_assessor.MedianstopAssessor>`
- Stop if the hyperparameter set performs worse than median at any step.
* - :class:`Curve Fitting <nni.algorithms.hpo.curvefitting_assessor.CurvefittingAssessor>`
- Stop if the learning curve will likely converge to suboptimal result.
Customizing Algorithms
======================
Customize Tuner
---------------
NNI provides state-of-the-art tuning algorithms among its built-in tuners. NNI also supports building a tuner by yourself for your tuning demand.
If you want to implement your own tuning algorithm, you can implement a customized tuner; there are three things to do:
#. Inherit the base Tuner class
#. Implement the ``receive_trial_result``, ``generate_parameters`` and ``update_search_space`` functions
#. Configure your customized tuner in experiment YAML config file
Here is an example:
**1. Inherit the base Tuner class**
.. code-block:: python
from nni.tuner import Tuner
class CustomizedTuner(Tuner):
def __init__(self, *args, **kwargs):
...
**2. Implement the receive_trial_result, generate_parameters and update_search_space Functions**
.. code-block:: python
from nni.tuner import Tuner
class CustomizedTuner(Tuner):
def __init__(self, *args, **kwargs):
...
def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
'''
Receive trial's final result.
parameter_id: int
parameters: object created by 'generate_parameters()'
value: final metrics of the trial, including default metric
'''
# your code implements here.
...
def generate_parameters(self, parameter_id, **kwargs):
'''
Returns a set of trial (hyper-)parameters, as a serializable object
parameter_id: int
'''
# your code implements here.
return your_parameters
...
def update_search_space(self, search_space):
'''
Tuners are advised to support updating search space at run-time.
If a tuner can only set search space once before generating first hyper-parameters,
it should explicitly document this behaviour.
search_space: JSON object created by experiment owner
'''
# your code implements here.
...
``receive_trial_result`` receives ``parameter_id``, ``parameters`` and ``value`` as input. The ``value`` object the tuner receives is exactly the same value that the trial sends.
The ``your_parameters`` returned from the ``generate_parameters`` function will be packaged as a JSON object by the NNI SDK. The NNI SDK will unpack the JSON object, so the trial will receive exactly the same ``your_parameters`` from the tuner.
For example:
If you implement ``generate_parameters`` like this:
.. code-block:: python
def generate_parameters(self, parameter_id, **kwargs):
'''
Returns a set of trial (hyper-)parameters, as a serializable object
parameter_id: int
'''
# your code implements here.
return {"dropout": 0.3, "learning_rate": 0.4}
It means your tuner will always generate parameters ``{"dropout": 0.3, "learning_rate": 0.4}``. Then the trial will receive ``{"dropout": 0.3, "learning_rate": 0.4}`` by calling the API ``nni.get_next_parameter()``. Once the trial ends with a result (normally some kind of metric), it can send the result to the tuner by calling the API ``nni.report_final_result()``, for example ``nni.report_final_result(0.93)``. Then your tuner's ``receive_trial_result`` function will receive the result like:
.. code-block:: python
parameter_id = 82347
parameters = {"dropout": 0.3, "learning_rate": 0.4}
value = 0.93
**Note that** the working directory of your tuner is ``<home>/nni-experiments/<experiment_id>/log``, which can be retrieved with the environment variable ``NNI_LOG_DIRECTORY``. Therefore, if you want to access a file (e.g., ``data.txt``) in the directory of your own tuner, you cannot use ``open('data.txt', 'r')``. Instead, you should use the following:
.. code-block:: python
import os

_pwd = os.path.dirname(__file__)
_fd = open(os.path.join(_pwd, 'data.txt'), 'r')
This is because your tuner is not executed in its own directory (i.e., ``pwd`` is not the directory of your own tuner).
**3. Configure your customized tuner in experiment YAML config file**
NNI needs to locate your customized tuner class and instantiate the class, so you need to specify the location of the customized tuner class and pass literal values as parameters to the __init__ constructor.
.. code-block:: yaml
tuner:
codeDir: /home/abc/mytuner
classFileName: my_customized_tuner.py
className: CustomizedTuner
# Any parameters that need to be passed to your tuner class's __init__
# constructor can be specified in this optional classArgs field, for example
classArgs:
arg1: value1
For more detailed examples, see:
* :githublink:`evolution-tuner <nni/algorithms/hpo/evolution_tuner.py>`
* :githublink:`hyperopt-tuner <nni/algorithms/hpo/hyperopt_tuner.py>`
* :githublink:`evolution-based-customized-tuner <examples/tuners/ga_customer_tuner>`
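The three-method contract above can be sketched with a minimal random-search tuner. This is a hedged illustration, not NNI's own random tuner: ``RandomChoiceTuner`` is a hypothetical name, the search-space handling covers only the ``choice`` type, and the ``except ImportError`` stub merely stands in for the base class when ``nni`` is not installed.

```python
# A minimal random-search sketch of the tuner contract; handles only the
# "choice" search-space type, for illustration.
import random

try:
    from nni.tuner import Tuner
except ImportError:
    class Tuner:  # stand-in base class when nni is not installed
        pass

class RandomChoiceTuner(Tuner):
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.space = {}
        self.results = {}

    def update_search_space(self, search_space):
        # e.g. {"lr": {"_type": "choice", "_value": [0.01, 0.1]}}
        self.space = search_space

    def generate_parameters(self, parameter_id, **kwargs):
        # sample one value per hyperparameter from its "choice" list
        return {name: self.rng.choice(spec['_value'])
                for name, spec in self.space.items()}

    def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
        # remember each trial's final metric, keyed by its parameter_id
        self.results[parameter_id] = value

tuner = RandomChoiceTuner()
tuner.update_search_space({"lr": {"_type": "choice", "_value": [0.01, 0.1]}})
params = tuner.generate_parameters(parameter_id=0)
tuner.receive_trial_result(0, params, value=0.93)
print(params["lr"] in (0.01, 0.1))  # True
```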
Write a more advanced automl algorithm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The methods above are usually enough to write a general tuner. However, users may also want access to more information, for example intermediate results and trials' state (i.e., the methods of an assessor), in order to implement a more powerful AutoML algorithm. Therefore, we have another concept called ``advisor``, which directly inherits from ``MsgDispatcherBase`` in :githublink:`msg_dispatcher_base.py <nni/runtime/msg_dispatcher_base.py>`.
Customize Assessor
------------------
NNI supports building an assessor by yourself for your tuning demand.
If you want to implement a customized assessor, there are three things to do:
#. Inherit the base Assessor class
#. Implement assess_trial function
#. Configure your customized Assessor in experiment YAML config file
**1. Inherit the base Assessor class**
.. code-block:: python
from nni.assessor import Assessor
class CustomizedAssessor(Assessor):
def __init__(self, *args, **kwargs):
...
**2. Implement the assess_trial Function**
.. code-block:: python
from nni.assessor import Assessor, AssessResult
class CustomizedAssessor(Assessor):
def __init__(self, *args, **kwargs):
...
def assess_trial(self, trial_history):
"""
Determines whether a trial should be killed. Must override.
trial_history: a list of intermediate result objects.
Returns AssessResult.Good or AssessResult.Bad.
"""
# your code implements here.
...
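As one possible filling of the skeleton above, here is a simplified threshold-style ``assess_trial`` (a hypothetical rule, simpler than NNI's built-in Medianstop): it stops a trial whose latest intermediate metric falls below half of a fixed reference curve. The ``except ImportError`` stubs only stand in when ``nni`` is not installed.

```python
# A simplified threshold-style assess_trial, for illustration only.
try:
    from nni.assessor import Assessor, AssessResult
except ImportError:
    import enum
    class Assessor:  # stand-in base class when nni is not installed
        pass
    class AssessResult(enum.Enum):
        Good = True
        Bad = False

class ThresholdAssessor(Assessor):
    def __init__(self, reference=(0.5, 0.7, 0.9)):
        # reference is a hypothetical per-step metric curve; a real assessor
        # would track running statistics of all completed trials instead.
        self.reference = reference

    def assess_trial(self, trial_history):
        step = len(trial_history) - 1
        if step >= len(self.reference):
            return AssessResult.Good
        # stop the trial if its latest metric falls below half of the
        # reference metric at the same step
        if trial_history[-1] < 0.5 * self.reference[step]:
            return AssessResult.Bad
        return AssessResult.Good

assessor = ThresholdAssessor()
print(assessor.assess_trial([0.6]))       # 0.6 >= 0.25 -> AssessResult.Good
print(assessor.assess_trial([0.6, 0.1]))  # 0.1 <  0.35 -> AssessResult.Bad
```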
**3. Configure your customized Assessor in experiment YAML config file**
NNI needs to locate your customized Assessor class and instantiate the class, so you need to specify the location of the customized Assessor class and pass literal values as parameters to the __init__ constructor.
.. code-block:: yaml
assessor:
codeDir: /home/abc/myassessor
classFileName: my_customized_assessor.py
className: CustomizedAssessor
# Any parameters that need to be passed to your assessor class's __init__
# constructor can be specified in this optional classArgs field, for example
classArgs:
arg1: value1
Please note in **2** that the object ``trial_history`` is exactly the object that the trial sends to the assessor by using the SDK ``report_intermediate_result`` function.
The working directory of your assessor is ``<home>/nni-experiments/<experiment_id>/log``, which can be retrieved with the environment variable ``NNI_LOG_DIRECTORY``.
For more detailed examples, see:
* :githublink:`medianstop-assessor <nni/algorithms/hpo/medianstop_assessor.py>`
* :githublink:`curvefitting-assessor <nni/algorithms/hpo/curvefitting_assessor/>`