Research and Publications
=========================
We are working intensively on both the tool chain and research to make automatic model design and tuning truly practical and powerful. On the one hand, our main work is tool-chain-oriented development. On the other hand, our research aims to improve this tool chain, rethink challenging problems in AutoML (on both the system and the algorithm side), and propose elegant solutions. Below we list some of our research works; we encourage more research on this topic and welcome collaboration with us.
System Research
---------------
* `Retiarii: A Deep Learning Exploratory-Training Framework <https://www.usenix.org/system/files/osdi20-zhang_quanlu.pdf>`__
.. code-block:: bibtex
@inproceedings{zhang2020retiarii,
title={Retiarii: A Deep Learning Exploratory-Training Framework},
author={Zhang, Quanlu and Han, Zhenhua and Yang, Fan and Zhang, Yuge and Liu, Zhe and Yang, Mao and Zhou, Lidong},
booktitle={14th $\{$USENIX$\}$ Symposium on Operating Systems Design and Implementation ($\{$OSDI$\}$ 20)},
pages={919--936},
year={2020}
}
* `AutoSys: The Design and Operation of Learning-Augmented Systems <https://www.usenix.org/system/files/atc20-liang-chieh-jan.pdf>`__
.. code-block:: bibtex
@inproceedings{liang2020autosys,
title={AutoSys: The Design and Operation of Learning-Augmented Systems},
author={Liang, Chieh-Jan Mike and Xue, Hui and Yang, Mao and Zhou, Lidong and Zhu, Lifei and Li, Zhao Lucis and Wang, Zibo and Chen, Qi and Zhang, Quanlu and Liu, Chuanjie and others},
booktitle={2020 $\{$USENIX$\}$ Annual Technical Conference ($\{$USENIX$\}$$\{$ATC$\}$ 20)},
pages={323--336},
year={2020}
}
* `Gandiva: Introspective Cluster Scheduling for Deep Learning <https://www.usenix.org/system/files/osdi18-xiao.pdf>`__
.. code-block:: bibtex
@inproceedings{xiao2018gandiva,
title={Gandiva: Introspective cluster scheduling for deep learning},
author={Xiao, Wencong and Bhardwaj, Romil and Ramjee, Ramachandran and Sivathanu, Muthian and Kwatra, Nipun and Han, Zhenhua and Patel, Pratyush and Peng, Xuan and Zhao, Hanyu and Zhang, Quanlu and others},
booktitle={13th $\{$USENIX$\}$ Symposium on Operating Systems Design and Implementation ($\{$OSDI$\}$ 18)},
pages={595--610},
year={2018}
}
Algorithm Research
------------------
New Algorithms
^^^^^^^^^^^^^^
* `TextNAS: A Neural Architecture Search Space Tailored for Text Representation <https://arxiv.org/pdf/1912.10729.pdf>`__
.. code-block:: bibtex
@inproceedings{wang2020textnas,
title={TextNAS: A Neural Architecture Search Space Tailored for Text Representation.},
author={Wang, Yujing and Yang, Yaming and Chen, Yiren and Bai, Jing and Zhang, Ce and Su, Guinan and Kou, Xiaoyu and Tong, Yunhai and Yang, Mao and Zhou, Lidong},
booktitle={AAAI},
pages={9242--9249},
year={2020}
}
* `Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search <https://papers.nips.cc/paper/2020/file/d072677d210ac4c03ba046120f0802ec-Paper.pdf>`__
.. code-block:: bibtex
@article{peng2020cream,
title={Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search},
author={Peng, Houwen and Du, Hao and Yu, Hongyuan and Li, Qi and Liao, Jing and Fu, Jianlong},
journal={Advances in Neural Information Processing Systems},
volume={33},
year={2020}
}
* `Metis: Robustly tuning tail latencies of cloud systems <https://www.usenix.org/system/files/conference/atc18/atc18-li-zhao.pdf>`__
.. code-block:: bibtex
@inproceedings{li2018metis,
title={Metis: Robustly tuning tail latencies of cloud systems},
author={Li, Zhao Lucis and Liang, Chieh-Jan Mike and He, Wenjia and Zhu, Lianjie and Dai, Wenjun and Jiang, Jin and Sun, Guangzhong},
booktitle={2018 $\{$USENIX$\}$ Annual Technical Conference ($\{$USENIX$\}$$\{$ATC$\}$ 18)},
pages={981--992},
year={2018}
}
* `OpEvo: An Evolutionary Method for Tensor Operator Optimization <https://arxiv.org/abs/2006.05664>`__
.. code-block:: bibtex
@article{gao2020opevo,
title={OpEvo: An Evolutionary Method for Tensor Operator Optimization},
author={Gao, Xiaotian and Wei, Cui and Zhang, Lintao and Yang, Mao},
journal={arXiv preprint arXiv:2006.05664},
year={2020}
}
Measurement and Understanding
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* `Deeper insights into weight sharing in neural architecture search <https://arxiv.org/pdf/2001.01431.pdf>`__
.. code-block:: bibtex
@article{zhang2020deeper,
title={Deeper insights into weight sharing in neural architecture search},
author={Zhang, Yuge and Lin, Zejun and Jiang, Junyang and Zhang, Quanlu and Wang, Yujing and Xue, Hui and Zhang, Chen and Yang, Yaming},
journal={arXiv preprint arXiv:2001.01431},
year={2020}
}
* `How Does Supernet Help in Neural Architecture Search? <https://arxiv.org/abs/2010.08219>`__
.. code-block:: bibtex
@article{zhang2020does,
title={How Does Supernet Help in Neural Architecture Search?},
author={Zhang, Yuge and Zhang, Quanlu and Yang, Yaming},
journal={arXiv preprint arXiv:2010.08219},
year={2020}
}
Applications
^^^^^^^^^^^^
* `AutoADR: Automatic Model Design for Ad Relevance <https://arxiv.org/pdf/2010.07075.pdf>`__
.. code-block:: bibtex
@inproceedings{chen2020autoadr,
title={AutoADR: Automatic Model Design for Ad Relevance},
author={Chen, Yiren and Yang, Yaming and Sun, Hong and Wang, Yujing and Xu, Yu and Shen, Wei and Zhou, Rong and Tong, Yunhai and Bai, Jing and Zhang, Ruofei},
booktitle={Proceedings of the 29th ACM International Conference on Information \& Knowledge Management},
pages={2365--2372},
year={2020}
}
.. role:: raw-html(raw)
   :format: html
Framework and Library Supports
==============================
With the built-in Python API, NNI naturally supports hyper-parameter tuning and neural architecture search for all AI frameworks and libraries that support Python (\ ``version >= 3.6``\ ). NNI also provides a set of examples and tutorials for some popular scenarios to make getting started easier.
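Concretely, a trial only needs a few NNI API calls to take part in tuning. Below is a minimal sketch; ``run_training`` is a hypothetical stand-in for your own training code:
.. code-block:: python
import nni

def run_training(**params):
    # Hypothetical stand-in for your own training code;
    # returns the metric you want to optimize.
    return 0.9

params = nni.get_next_parameter()  # hyper-parameters proposed by the tuner
score = run_training(**params)
nni.report_final_result(score)     # report the result back to the tuner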
Supported AI Frameworks
-----------------------
* :raw-html:`<b>[PyTorch]</b>` https://github.com/pytorch/pytorch
.. raw:: html
<ul>
<li><a href="../../examples/trials/mnist-distributed-pytorch">MNIST-pytorch</a><br/></li>
<li><a href="TrialExample/Cifar10Examples.md">CIFAR-10</a><br/></li>
<li><a href="../../examples/trials/kaggle-tgs-salt/README.md">TGS salt identification chanllenge</a><br/></li>
<li><a href="../../examples/trials/network_morphism/README.md">Network_morphism</a><br/></li>
</ul>
* :raw-html:`<b>[TensorFlow]</b>` https://github.com/tensorflow/tensorflow
.. raw:: html
<ul>
<li><a href="../../examples/trials/mnist-distributed">MNIST-tensorflow</a><br/></li>
<li><a href="../../examples/trials/ga_squad/README.md">Squad</a><br/></li>
</ul>
* :raw-html:`<b>[Keras]</b>` https://github.com/keras-team/keras
.. raw:: html
<ul>
<li><a href="../../examples/trials/mnist-keras">MNIST-keras</a><br/></li>
<li><a href="../../examples/trials/network_morphism/README.md">Network_morphism</a><br/></li>
</ul>
* :raw-html:`<b>[MXNet]</b>` https://github.com/apache/incubator-mxnet
* :raw-html:`<b>[Caffe2]</b>` https://github.com/BVLC/caffe
* :raw-html:`<b>[CNTK (Python language)]</b>` https://github.com/microsoft/CNTK
* :raw-html:`<b>[Spark MLlib]</b>` http://spark.apache.org/mllib/
* :raw-html:`<b>[Chainer]</b>` https://chainer.org/
* :raw-html:`<b>[Theano]</b>` https://pypi.org/project/Theano/ :raw-html:`<br/>`
You are encouraged to `contribute more examples <Tutorial/Contributing.rst>`__ for other NNI users.
Supported Libraries
-------------------
NNI also supports all libraries written in Python. Here are some common libraries, including GBDT-based algorithms: XGBoost, CatBoost and LightGBM.
* :raw-html:`<b>[Scikit-learn]</b>` https://scikit-learn.org/stable/
.. raw:: html
<ul>
<li><a href="TrialExample/SklearnExamples.md">Scikit-learn</a><br/></li>
</ul>
* :raw-html:`<b>[XGBoost]</b>` https://xgboost.readthedocs.io/en/latest/
* :raw-html:`<b>[CatBoost]</b>` https://catboost.ai/
* :raw-html:`<b>[LightGBM]</b>` https://lightgbm.readthedocs.io/en/latest/
.. raw:: html
<ul>
<li><a href="TrialExample/GbdtExample.md">Auto-gbdt</a><br/></li>
</ul>
This is just a small list of the libraries supported by NNI. If you are interested in NNI, you can refer to the `tutorial <TrialExample/Trials.rst>`__ to complete your own hacks; a minimal sketch of such a trial follows.
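For instance, here is a hedged sketch of tuning a GBDT model with NNI, assuming a search space that defines keys such as ``num_leaves`` and ``learning_rate``:
.. code-block:: python
import lightgbm as lgb
import nni
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hyper-parameters proposed by the tuner for this trial.
params = nni.get_next_parameter()

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

booster = lgb.train({"objective": "binary", **params},
                    lgb.Dataset(X_train, label=y_train))
preds = (booster.predict(X_val) > 0.5).astype(int)
nni.report_final_result(accuracy_score(y_val, preds))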
In addition to the above examples, we welcome more users to apply NNI to their own work; if you have any doubts, please refer to `Write a Trial Run on NNI <TrialExample/Trials.rst>`__. In particular, if you want to become a contributor to NNI, whether by sharing examples, writing Tuners, or otherwise, we look forward to your participation. More information is available `here <Tutorial/Contributing.rst>`__.
**Run an Experiment on Azure Machine Learning**
===================================================
NNI supports running an experiment on `AML <https://azure.microsoft.com/en-us/services/machine-learning/>`__ , called aml mode.
Setup environment
-----------------
Step 1. Install NNI, follow the install guide `here <../Tutorial/QuickStart.rst>`__.
Step 2. Create an Azure account/subscription using this `link <https://azure.microsoft.com/en-us/free/services/machine-learning/>`__. If you already have an Azure account/subscription, skip this step.
Step 3. Install the Azure CLI on your machine, follow the install guide `here <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__.
Step 4. Authenticate to your Azure subscription from the CLI. To authenticate interactively, open a command line or terminal and use the following command:
.. code-block:: bash
az login
Step 5. Log into your Azure account in a web browser and create a Machine Learning resource. You will need to choose a resource group and specify a workspace name. Then download ``config.json``\ , which will be used later.
.. image:: ../../img/aml_workspace.png
:target: ../../img/aml_workspace.png
:alt:
Step 6. Create an AML cluster as the computeTarget.
.. image:: ../../img/aml_cluster.png
:target: ../../img/aml_cluster.png
:alt:
Step 7. Open a command line and install the AML package environment.
.. code-block:: bash
python3 -m pip install azureml
python3 -m pip install azureml-sdk
Run an experiment
-----------------
Use ``examples/trials/mnist-tfv1`` as an example. The NNI config YAML file's content is as follows:
.. code-block:: yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
trainingServicePlatform: aml
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  image: msranni/nni
  gpuNum: 1
amlConfig:
  subscriptionId: ${replace_to_your_subscriptionId}
  resourceGroup: ${replace_to_your_resourceGroup}
  workspaceName: ${replace_to_your_workspaceName}
  computeTarget: ${replace_to_your_computeTarget}
Note: You should set ``trainingServicePlatform: aml`` in the NNI config YAML file if you want to start an experiment in aml mode.
Compared with `LocalMode <LocalMode.rst>`__\ , trial configuration in aml mode has these additional keys:
* image
  * Required key. The docker image name used in the job. NNI supports the image ``msranni/nni`` for running aml jobs. Note: this image is built on a CUDA environment, so it may not be suitable for CPU clusters in AML.
amlConfig:
* subscriptionId
  * Required key, the subscriptionId of your account.
* resourceGroup
  * Required key, the resourceGroup of your account.
* workspaceName
  * Required key, the workspaceName of your account.
* computeTarget
  * Required key, the name of the compute cluster you want to use in your AML workspace (see Step 6). `Refer <https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target>`__ for details.
* maxTrialNumPerGpu
  * Optional key, default 1. Used to specify the maximum number of concurrent trials on a GPU device.
* useActiveGpu
  * Optional key, default false. Used to specify whether to use a GPU when another process is running on it. By default, NNI uses a GPU only when there is no other active process on it.
The required information for ``amlConfig`` can be found in the ``config.json`` downloaded in Step 5.
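For example, here is a small Python sketch to pull these values out of the downloaded file, assuming the standard AML workspace ``config.json`` layout:
.. code-block:: python
import json

# Assumes the standard AML workspace config.json layout with these three fields.
with open("config.json") as f:
    ws = json.load(f)

print("subscriptionId:", ws["subscription_id"])
print("resourceGroup: ", ws["resource_group"])
print("workspaceName: ", ws["workspace_name"])
# computeTarget is the name of the AML cluster you created in Step 6.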
Run the following commands to start the example experiment:
.. code-block:: bash
git clone -b ${NNI_VERSION} https://github.com/microsoft/nni
cd nni/examples/trials/mnist-tfv1
# modify config_aml.yml ...
nnictl create --config config_aml.yml
Replace ``${NNI_VERSION}`` with a released version name or branch name, e.g., ``v1.9``.
Run an Experiment on AdaptDL
============================
NNI now supports running an experiment on `AdaptDL <https://github.com/petuum/adaptdl>`__. Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster, either on-premises or `Azure Kubernetes Service (AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , and an Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is set up to connect to your Kubernetes cluster. In AdaptDL mode, your trial program runs as an AdaptDL job in the Kubernetes cluster.
AdaptDL aims to make distributed deep learning easy and efficient in dynamic-resource environments such as shared clusters and the cloud.
Prerequisite for Kubernetes Service
-----------------------------------
#. A **Kubernetes** cluster using Kubernetes 1.14 or later with storage. Follow this guideline to set up Kubernetes `on Azure <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , or `on-premise <https://kubernetes.io/docs/setup/>`__ with `cephfs <https://kubernetes.io/docs/concepts/storage/storage-classes/#ceph-rbd>`__\ , or `microk8s with storage add-on enabled <https://microk8s.io/docs/addons>`__.
#. Helm install the **AdaptDL Scheduler** to your Kubernetes cluster. Follow this `guideline <https://adaptdl.readthedocs.io/en/latest/installation/install-adaptdl.html>`__ to set up the AdaptDL scheduler.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager uses $(HOME)/.kube/config as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. If your NNI trial job needs GPU resources, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure the **Nvidia device plugin for Kubernetes**.
#. (Optional) Prepare an **NFS server** and export a general-purpose mount as external storage.
#. Install **NNI**\ , follow the install guide `here <../Tutorial/QuickStart.rst>`__.
Verify Prerequisites
^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash
nnictl --version
# Expected: <version_number>
.. code-block:: bash
kubectl version
# Expected that the kubectl client version matches the server version.
.. code-block:: bash
kubectl api-versions | grep adaptdl
# Expected: adaptdl.petuum.com/v1
Run an experiment
-----------------
We have a CIFAR10 example that fully leverages the AdaptDL scheduler under the ``examples/trials/cifar10_pytorch`` folder (\ ``main_adl.py`` and ``config_adl.yaml``\ ).
Here is a template configuration specification to use AdaptDL as a training service.
.. code-block:: yaml
authorName: default
experimentName: minimal_adl
trainingServicePlatform: adl
nniManagerIp: 10.1.10.11
logCollection: http
tuner:
  builtinTunerName: GridSearch
searchSpacePath: search_space.json
trialConcurrency: 2
maxTrialNum: 2
trial:
  adaptive: false # optional.
  image: <image_tag>
  imagePullSecrets: # optional
    - name: stagingsecret
  codeDir: .
  command: python main.py
  gpuNum: 1
  cpuNum: 1 # optional
  memorySize: 8Gi # optional
  nfs: # optional
    server: 10.20.41.55
    path: /
    containerMountPath: /nfs
  checkpoint: # optional
    storageClass: microk8s-hostpath
    storageSize: 1Gi
Configs not mentioned below follow the
`default specs defined in the NNI doc </Tutorial/ExperimentConfig.html#configuration-spec>`__.
* **trainingServicePlatform**\ : Choose ``adl`` to use the Kubernetes cluster with the AdaptDL scheduler.
* **nniManagerIp**\ : *Required* by the ``adl`` training service to get the correct info and metrics back from the cluster.
  IP address of the machine running the NNI manager (NNICTL) that launches the NNI experiment.
* **logCollection**\ : *Recommended* to set as ``http``. It collects the trial logs on the cluster back to your machine via http.
* **tuner**\ : Supports the Tuun tuner and all NNI built-in tuners (except for the checkpoint feature of the NNI PBT tuners).
* **trial**\ : Defines the specs of an ``adl`` trial.
  * **adaptive**\ : (*Optional*\ ) Boolean for the AdaptDL trainer. When ``true``\ , the job is preemptible and adaptive.
  * **image**\ : Docker image for the trial.
  * **imagePullSecrets**\ : (*Optional*\ ) If you are using a private registry,
    you need to provide the secret to successfully pull the image.
  * **codeDir**\ : The working directory of the container. ``.`` means the default working directory defined by the image.
  * **command**\ : The bash command to start the trial.
  * **gpuNum**\ : The number of GPUs requested for this trial. It must be a non-negative integer.
  * **cpuNum**\ : (*Optional*\ ) The number of CPUs requested for this trial. It must be a non-negative integer.
  * **memorySize**\ : (*Optional*\ ) The size of memory requested for this trial. It must follow the Kubernetes
    `default format <https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory>`__.
  * **nfs**\ : (*Optional*\ ) Mounting external storage. For more information about using NFS please check the paragraph below.
  * **checkpoint**\ : (*Optional*\ ) `Storage settings <https://kubernetes.io/docs/concepts/storage/storage-classes/>`__ for AdaptDL internal checkpoints. You can leave it out unless you are a dev user.
NFS Storage
^^^^^^^^^^^
As you may have noticed in the above configuration spec,
an *optional* section is available to configure NFS external storage. It is not needed when no external storage is required, for example when a docker image is sufficient with code and data inside.
Note that the ``adl`` training service does NOT mount the NFS to the local dev machine, so one can mount it locally by hand, manage the filesystem, copy the data or code, etc.
The ``adl`` training service can then mount it into Kubernetes for every trial, given the proper configuration:
* **server**\ : NFS server address, e.g. an IP address or domain.
* **path**\ : NFS server export path, i.e. the absolute path in NFS that can be mounted into trials.
* **containerMountPath**\ : The absolute path in the container at which to mount the NFS **path** above,
  so that every trial has access to the NFS.
  In the trial containers, you can access the NFS with this path.
Use cases:
* If your training trials depend on a large dataset, you may want to download it onto the NFS once,
  and mount it so that it can be shared across multiple trials.
* The storage for containers is ephemeral, and trial containers are deleted after a trial's lifecycle is over.
  So if you want to export your trained models,
  you may mount the NFS to the trial to persist and export them.
In short, there is no restriction on how a trial reads from or writes to the NFS storage, so you may use it flexibly as per your needs; a sketch of persisting a checkpoint from a trial is shown below.
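For example, a trial could persist its trained model by copying it onto the mounted NFS path. This is only a sketch, assuming the ``containerMountPath: /nfs`` from the example config above; ``export_checkpoint`` is a hypothetical helper:
.. code-block:: python
import os
import shutil

NFS_MOUNT = "/nfs"  # containerMountPath from the example config above

def export_checkpoint(local_path: str) -> str:
    """Copy a model file from ephemeral container storage to the NFS mount."""
    trial_id = os.getenv("NNI_TRIAL_JOB_ID", "unknown-trial")
    target_dir = os.path.join(NFS_MOUNT, "models", trial_id)
    os.makedirs(target_dir, exist_ok=True)
    return shutil.copy(local_path, target_dir)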
Monitor via Log Stream
----------------------
Follow the log streaming of a certain trial:
.. code-block:: bash
nnictl log trial --trial_id=<trial_id>
.. code-block:: bash
nnictl log trial <experiment_id> --trial_id=<trial_id>
Note that *after* a trial has finished and its pod has been deleted,
its logs can no longer be retrieved via this command.
However, you may still be able to access past trial logs
using the following approach.
Monitor via TensorBoard
-----------------------
In the context of NNI, an experiment has multiple trials.
For easy comparison across trials during a model tuning process,
we support TensorBoard integration. Here each experiment has
an independent TensorBoard logging directory, and thus its own dashboard.
You can only use TensorBoard while the monitored experiment is running.
In other words, monitoring stopped experiments is not supported.
In the trial container you may have access to two environment variables:
* ``ADAPTDL_TENSORBOARD_LOGDIR``\ : the TensorBoard logging directory for the current experiment,
* ``NNI_TRIAL_JOB_ID``\ : the ``trial`` job id for the current trial.
It is recommended to join them to form the logging directory for a trial,
for example in Python:
.. code-block:: python
import os
tensorboard_logdir = os.path.join(
    os.getenv("ADAPTDL_TENSORBOARD_LOGDIR"),
    os.getenv("NNI_TRIAL_JOB_ID")
)
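A trial can then write its TensorBoard events to this directory. Below is a small sketch, assuming PyTorch is available in the trial image; the loss values are dummies standing in for a real training loop:
.. code-block:: python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir=tensorboard_logdir)  # tensorboard_logdir from above
for step in range(100):
    loss = 1.0 / (step + 1)  # dummy decreasing loss standing in for real training
    writer.add_scalar("train/loss", loss, step)
writer.close()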
If an experiment is stopped, the data logged here
(under the directory defined by *the above envs*)
will be lost. To persist the logged data, you can use external storage (e.g. mount an NFS)
to export it and view the TensorBoard locally.
With the above setting, you can monitor the experiment easily
via TensorBoard by
.. code-block:: bash
nnictl tensorboard start
If you have multiple experiments running at the same time, you may use
.. code-block:: bash
nnictl tensorboard start <experiment_id>
It will provide the web URL to access the TensorBoard.
Note that you have the flexibility to set up the local ``--port``
for the TensorBoard.
**Run an Experiment on DLTS**
=================================
NNI supports running an experiment on `DLTS <https://github.com/microsoft/DLWorkspace.git>`__\ , called dlts mode. Before starting to use NNI dlts mode, you should have an account to access the DLTS dashboard.
Setup Environment
-----------------
Step 1. Choose a cluster from the DLTS dashboard; ask the administrator for the cluster dashboard URL.
.. image:: ../../img/dlts-step1.png
:target: ../../img/dlts-step1.png
:alt: Choose Cluster
Step 2. Prepare an NNI config YAML like the following:
.. code-block:: yaml
# Set this field to "dlts"
trainingServicePlatform: dlts
authorName: your_name
experimentName: auto_mnist
trialConcurrency: 2
maxExecDuration: 3h
maxTrialNum: 100
searchSpacePath: search_space.json
useAnnotation: false
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 1
  image: msranni/nni
# Configuration to access DLTS
dltsConfig:
  dashboard: # Ask administrator for the cluster dashboard URL
Remember to fill in the cluster dashboard URL on the last line.
Step 3. Open your working directory on the cluster and paste the NNI config as well as related code into a directory.
.. image:: ../../img/dlts-step3.png
:target: ../../img/dlts-step3.png
:alt: Copy Config
Step 4. Submit an NNI manager job to the specified cluster.
.. image:: ../../img/dlts-step4.png
:target: ../../img/dlts-step4.png
:alt: Submit Job
Step 5. Go to the Endpoints tab of the newly created job and click the Port 40000 link to check the trial's information.
.. image:: ../../img/dlts-step5.png
:target: ../../img/dlts-step5.png
:alt: View NNI WebUI
Run an Experiment on FrameworkController
========================================
NNI supports running experiments using `FrameworkController <https://github.com/Microsoft/frameworkcontroller>`__\ , called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes; you don't need to install Kubeflow for a specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiments.
Prerequisite for on-premises Kubernetes Service
-----------------------------------------------
#. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this `guideline <https://kubernetes.io/docs/setup/>`__ to set up Kubernetes.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager uses $(HOME)/.kube/config as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. If your NNI trial job needs GPU resources, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure the **Nvidia device plugin for Kubernetes**.
#. Prepare an **NFS server** and export a general-purpose mount (we recommend mapping your NFS server path with the ``root_squash`` option; otherwise permission issues may arise when NNI copies files to NFS. Refer to this `page <https://linux.die.net/man/5/exports>`__ to learn what the root_squash option is), or **Azure File Storage**.
#. Install the **NFS client** on the machine where you install NNI and run nnictl to create experiments. Run this command to install the NFSv4 client:
.. code-block:: bash
apt-get install nfs-common
#. Install **NNI**\ , following the install guide `here <../Tutorial/QuickStart.rst>`__.
Prerequisite for Azure Kubernetes Service
-----------------------------------------
#. NNI supports FrameworkController on Azure Kubernetes Service; follow the `guideline <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__ to set up Azure Kubernetes Service.
#. Install the `Azure CLI <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__ and **kubectl**. Use ``az login`` to set your Azure account and connect the kubectl client to AKS; refer to this `guideline <https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster>`__.
#. Follow the `guideline <https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal>`__ to create an Azure file storage account. If you use Azure Kubernetes Service, NNI needs the Azure Storage Service to store code files and output files.
#. To access the Azure storage service, NNI needs the access key of the storage account, and NNI uses the `Azure Key Vault <https://azure.microsoft.com/en-us/services/key-vault/>`__ service to protect your private key. Set up the Azure Key Vault service and add a secret to Key Vault to store the access key of the Azure storage account. Follow this `guideline <https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli>`__ to store the access key.
Setup FrameworkController
-------------------------
Follow the `guideline <https://github.com/Microsoft/frameworkcontroller/tree/master/example/run>`__ to set up FrameworkController in the Kubernetes cluster; NNI supports FrameworkController in stateful set mode. If your cluster enforces authorization, you need to create a service account with granted permission for FrameworkController, and then pass the name of the FrameworkController service account to the NNI experiment config. `Refer to the details <https://github.com/Microsoft/frameworkcontroller/tree/master/example/run#run-by-kubernetes-statefulset>`__.
Design
------
Please refer to the design of the `Kubeflow training service <KubeflowMode.rst>`__\ ; the FrameworkController training service pipeline is similar.
Example
-------
The FrameworkController config file format is:
.. code-block:: yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 10h
maxTrialNum: 100
#choice: local, remote, pai, kubeflow, frameworkcontroller
trainingServicePlatform: frameworkcontroller
searchSpacePath: ~/nni/examples/trials/mnist-tfv1/search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
trial:
  codeDir: ~/nni/examples/trials/mnist-tfv1
  taskRoles:
    - name: worker
      taskNum: 1
      command: python3 mnist.py
      gpuNum: 1
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 1
        minSucceededTaskCount: 1
frameworkcontrollerConfig:
  storage: nfs
  nfs:
    server: {your_nfs_server}
    path: {your_nfs_server_exported_path}
If you use Azure Kubernetes Service, you should set ``frameworkcontrollerConfig`` in your config YAML file as follows:
.. code-block:: yaml
frameworkcontrollerConfig:
  storage: azureStorage
  serviceAccountName: {your_frameworkcontroller_service_account_name}
  keyVault:
    vaultName: {your_vault_name}
    name: {your_secret_name}
  azureStorage:
    accountName: {your_storage_account_name}
    azureShare: {your_azure_share_name}
Note: You should explicitly set ``trainingServicePlatform: frameworkcontroller`` in the NNI config YAML file if you want to start an experiment in frameworkcontroller mode.
The trial's config format for NNI frameworkcontroller mode is a simplified version of FrameworkController's official config; you can refer to the `TensorFlow example of FrameworkController <https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml>`__ for a deeper understanding.
Trial configuration in frameworkcontroller mode has the following configuration keys:
* taskRoles: you can set multiple task roles in the config file; each task role is a basic unit to process in the Kubernetes cluster.
  * name: the name of the task role, e.g. "worker", "ps", "master".
  * taskNum: the replica number of the task role.
  * command: the user's command to be run in the container.
  * gpuNum: the number of GPU devices used in the container.
  * cpuNum: the number of CPU devices used in the container.
  * memoryMB: the memory limitation to be specified for the container.
  * image: the docker image used to create the pod and run the program.
  * frameworkAttemptCompletionPolicy: the policy for running the framework; please refer to the `user-manual <https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.rst#frameworkattemptcompletionpolicy>`__ for details. Users can use this policy to control the pods; for example, if the workers stop but ps does not, the completion policy can help stop ps.
How to run an example
---------------------
After you prepare a config file, you can run your experiment with nnictl. The way to start an experiment on FrameworkController is similar to Kubeflow; please refer to the `document <KubeflowMode.rst>`__ for more information.
Version check
-------------
NNI has supported the version check feature since version 0.6; `refer to PaiMode <PaiMode.rst>`__ for details.
How to Implement Training Service in NNI
========================================
Overview
--------
TrainingService is a module for platform management and job scheduling in NNI. TrainingService is designed to be easily implemented: we define an abstract class TrainingService as the parent class of all kinds of TrainingService, and users just need to inherit the parent class and complete their own child class if they want to implement a customized TrainingService.
System architecture
-------------------
.. image:: ../../img/NNIDesign.jpg
:target: ../../img/NNIDesign.jpg
:alt:
The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of the system, in charge of calling TrainingService to manage trial jobs and of the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module that manages trial jobs; it communicates with the NNIManager module and has a different instance for each training platform. For the time being, NNI supports the `local platform <LocalMode.rst>`__\ , `remote platform <RemoteMachineMode.rst>`__\ , `PAI platform <PaiMode.rst>`__\ , `Kubeflow platform <KubeflowMode.rst>`__ and `FrameworkController platform <FrameworkControllerMode.rst>`__.
In this document, we introduce the brief design of TrainingService. If users want to add a new TrainingService instance, they just need to complete a child class that implements TrainingService; they don't need to understand the code details of NNIManager, Dispatcher or other modules.
Folder structure of code
------------------------
NNI's folder structure is shown below:
.. code-block:: bash
nni
|- deployment
|- docs
|- examples
|- src
| |- nni_manager
| | |- common
| | |- config
| | |- core
| | |- coverage
| | |- dist
| | |- rest_server
| | |- training_service
| | | |- common
| | | |- kubernetes
| | | |- local
| | | |- pai
| | | |- remote_machine
| | | |- test
| |- sdk
| |- webui
|- test
|- tools
| |-nni_annotation
| |-nni_cmd
| |-nni_gpu_tool
| |-nni_trial_tool
The ``nni/src/`` folder stores most of the source code of NNI. The code in this folder is related to NNIManager, TrainingService, SDK, WebUI and other modules. Users can find the abstract class of TrainingService in the ``nni/src/nni_manager/common/trainingService.ts`` file, and they should put their own implemented TrainingService in the ``nni/src/nni_manager/training_service`` folder. If users have implemented their own TrainingService code, they should also supply unit tests for it and place them in the ``nni/src/nni_manager/training_service/test`` folder.
Function annotation of TrainingService
--------------------------------------
.. code-block:: typescript
abstract class TrainingService {
    public abstract listTrialJobs(): Promise<TrialJobDetail[]>;
    public abstract getTrialJob(trialJobId: string): Promise<TrialJobDetail>;
    public abstract addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
    public abstract removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void): void;
    public abstract submitTrialJob(form: JobApplicationForm): Promise<TrialJobDetail>;
    public abstract updateTrialJob(trialJobId: string, form: JobApplicationForm): Promise<TrialJobDetail>;
    public abstract get isMultiPhaseJobSupported(): boolean;
    public abstract cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean): Promise<void>;
    public abstract setClusterMetadata(key: string, value: string): Promise<void>;
    public abstract getClusterMetadata(key: string): Promise<string>;
    public abstract cleanUp(): Promise<void>;
    public abstract run(): Promise<void>;
}
The parent class of TrainingService has a few abstract functions; users need to inherit the parent class and implement all of them.
**setClusterMetadata(key: string, value: string)**
ClusterMetadata is the data related to platform details. For example, the ClusterMetadata defined for the remote machine server is:
.. code-block:: typescript
export class RemoteMachineMeta {
    public readonly ip : string;
    public readonly port : number;
    public readonly username : string;
    public readonly passwd?: string;
    public readonly sshKeyPath?: string;
    public readonly passphrase?: string;
    public gpuSummary : GPUSummary | undefined;
    /* GPU Reservation info, the key is GPU index, the value is the job id which reserves this GPU */
    public gpuReservation : Map<number, string>;
    constructor(ip : string, port : number, username : string, passwd : string,
                sshKeyPath : string, passphrase : string) {
        this.ip = ip;
        this.port = port;
        this.username = username;
        this.passwd = passwd;
        this.sshKeyPath = sshKeyPath;
        this.passphrase = passphrase;
        this.gpuReservation = new Map<number, string>();
    }
}
The metadata includes the host address, the username, and other configuration related to the platform. Users need to define their own metadata format and set the metadata instance in this function. This function is called before the experiment starts, to set the configuration of remote machines.
**getClusterMetadata(key: string)**
This function returns the metadata value according to the given key; it can be left empty if users don't need to use it.
**submitTrialJob(form: JobApplicationForm)**
SubmitTrialJob is a function for submitting new trial jobs; users should generate a job instance of the TrialJobDetail type. TrialJobDetail is defined as follows:
.. code-block:: typescript
interface TrialJobDetail {
    readonly id: string;
    readonly status: TrialJobStatus;
    readonly submitTime: number;
    readonly startTime?: number;
    readonly endTime?: number;
    readonly tags?: string[];
    readonly url?: string;
    readonly workingDirectory: string;
    readonly form: JobApplicationForm;
    readonly sequenceId: number;
    isEarlyStopped?: boolean;
}
Depending on the implementation, users could put the job detail into a job queue and keep fetching jobs from the queue to prepare and run them. Alternatively, they could finish the preparing and running process inside this function and return the job detail after the submission work is done.
**cancelTrialJob(trialJobId: string, isEarlyStopped?: boolean)**
If this function is called, the trial started by the platform should be canceled. Different platforms have different methods to cancel a running job; this function should be implemented according to the specific platform.
**updateTrialJob(trialJobId: string, form: JobApplicationForm)**
This function is called to update the trial job's status. The trial job's status should be detected according to the platform and updated to ``RUNNING``\ , ``SUCCEED``\ , ``FAILED``\ , etc.
**getTrialJob(trialJobId: string)**
This function returns a trialJob detail instance according to trialJobId.
**listTrialJobs()**
Users should put all trial job detail information into a list and return the list.
**addTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)**
NNI holds an EventEmitter to get job metrics; when new job metrics are detected, the EventEmitter is triggered. Users should start the EventEmitter in this function.
**removeTrialJobMetricListener(listener: (metric: TrialJobMetric) => void)**
Close the EventEmitter.
**run()**
The run() function is the main loop of TrainingService. Users can set up a while loop here to execute their logic, and exit when the experiment is stopped.
**cleanUp()**
This function is called to clean up the environment when an experiment is stopped. Users should perform the platform-related cleanup operations in this function.
TrialKeeper tool
----------------
NNI offers a TrialKeeper tool to help maintain trial jobs. Users can find the source code in ``nni/tools/nni_trial_tool``. If users want to run trial jobs on a cloud platform, this tool is a fine choice to help keep trials running on the platform.
The running architecture of TrialKeeper is shown below:
.. image:: ../../img/trialkeeper.jpg
:target: ../../img/trialkeeper.jpg
:alt:
When users submit a trial job to a cloud platform, they should wrap their trial command into TrialKeeper and start a TrialKeeper process on the cloud platform. Note that TrialKeeper uses a RESTful server to communicate with TrainingService; users should start a RESTful server on the local machine to receive metrics sent from TrialKeeper. The source code of the RESTful server can be found in ``nni/src/nni_manager/training_service/common/clusterJobRestServer.ts``.
Reference
---------
For more information about how to debug, please refer to the `debugging guide <../Tutorial/HowToDebug.rst>`__.
For guidelines on how to contribute, please refer to the `contribution guide <../Tutorial/Contributing.rst>`__.
Run an Experiment on Kubeflow
=============================
NNI now supports running experiments on `Kubeflow <https://github.com/kubeflow/kubeflow>`__\ , called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or `Azure Kubernetes Service (AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , and an Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is set up to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, `here <https://kubernetes.io/docs/tutorials/kubernetes-basics/>`__ is a good start. In kubeflow mode, your trial program runs as a Kubeflow job in the Kubernetes cluster.
Prerequisite for on-premises Kubernetes Service
-----------------------------------------------
#. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this `guideline <https://kubernetes.io/docs/setup/>`__ to set up Kubernetes.
#. Download, set up, and deploy **Kubeflow** to your Kubernetes cluster. Follow this `guideline <https://www.kubeflow.org/docs/started/getting-started/>`__ to set up Kubeflow.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager uses $(HOME)/.kube/config as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. If your NNI trial job needs GPU resources, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure the **Nvidia device plugin for Kubernetes**.
#. Prepare an **NFS server** and export a general-purpose mount (we recommend mapping your NFS server path with the ``root_squash`` option; otherwise permission issues may arise when NNI copies files to NFS. Refer to this `page <https://linux.die.net/man/5/exports>`__ to learn what the root_squash option is), or **Azure File Storage**.
#. Install the **NFS client** on the machine where you install NNI and run nnictl to create experiments. Run this command to install the NFSv4 client:
.. code-block:: bash
apt-get install nfs-common
#. Install **NNI**\ , following the install guide `here <../Tutorial/QuickStart.rst>`__.
Prerequisite for Azure Kubernetes Service
-----------------------------------------
#. NNI supports Kubeflow on Azure Kubernetes Service; follow the `guideline <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__ to set up Azure Kubernetes Service.
#. Install the `Azure CLI <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__ and **kubectl**. Use ``az login`` to set your Azure account and connect the kubectl client to AKS; refer to this `guideline <https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster>`__.
#. Deploy Kubeflow on Azure Kubernetes Service following the `guideline <https://www.kubeflow.org/docs/started/getting-started/>`__.
#. Follow the `guideline <https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal>`__ to create an Azure file storage account. If you use Azure Kubernetes Service, NNI needs the Azure Storage Service to store code files and output files.
#. To access the Azure storage service, NNI needs the access key of the storage account, and NNI uses the `Azure Key Vault <https://azure.microsoft.com/en-us/services/key-vault/>`__ service to protect your private key. Set up the Azure Key Vault service and add a secret to Key Vault to store the access key of the Azure storage account. Follow this `guideline <https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli>`__ to store the access key.
Design
------
.. image:: ../../img/kubeflow_training_design.png
:target: ../../img/kubeflow_training_design.png
:alt:
Kubeflow training service instantiates a Kubernetes rest client to interact with your K8s cluster's API server.
For each trial, we upload all the files in your local codeDir path (configured in nni_config.yml), together with NNI-generated files like parameter.cfg, into a storage volume. Right now we support two kinds of storage volumes: `nfs <https://en.wikipedia.org/wiki/Network_File_System>`__ and `azure file storage <https://azure.microsoft.com/en-us/services/storage/files/>`__\ ; you should configure the storage volume in the NNI config YAML file. After the files are prepared, the Kubeflow training service calls the K8S REST API to create Kubeflow jobs (a `tf-operator <https://github.com/kubeflow/tf-operator>`__ job or a `pytorch-operator <https://github.com/kubeflow/pytorch-operator>`__ job) in K8S, and mounts your storage volume into the job's pod. Output files of the Kubeflow job, like stdout, stderr, trial.log or model files, will also be copied back to the storage volume. NNI shows the storage volume's URL for each trial in the WebUI, to allow users to browse the log files and the job's output files.
Supported operator
------------------
NNI only supports the tf-operator and pytorch-operator of Kubeflow; other operators are not tested.
Users can set the operator type in the config file.
The setting of tf-operator:
.. code-block:: yaml
kubeflowConfig:
  operator: tf-operator
The setting of pytorch-operator:
.. code-block:: yaml
kubeflowConfig:
  operator: pytorch-operator
If users want to use tf-operator, they can set ``ps`` and ``worker`` in the trial config. If they want to use pytorch-operator, they can set ``master`` and ``worker`` in the trial config.
Supported storage type
----------------------
NNI supports NFS and Azure Storage for storing code and output files; users can set the storage type in the config file along with the corresponding configuration.
The setting for NFS storage is as follows:
.. code-block:: yaml
kubeflowConfig:
  storage: nfs
  nfs:
    # Your NFS server IP, like 10.10.10.10
    server: {your_nfs_server_ip}
    # Your NFS server export path, like /var/nfs/nni
    path: {your_nfs_server_export_path}
If you use Azure storage, you should set ``kubeflowConfig`` in your config YAML file as follows:
.. code-block:: yaml
kubeflowConfig:
  storage: azureStorage
  keyVault:
    vaultName: {your_vault_name}
    name: {your_secret_name}
  azureStorage:
    accountName: {your_storage_account_name}
    azureShare: {your_azure_share_name}
Run an experiment
-----------------
Use ``examples/trials/mnist-tfv1`` as an example. This is a TensorFlow job that uses the tf-operator of Kubeflow. The NNI config YAML file's content is as follows:
.. code-block:: yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 2
maxExecDuration: 1h
maxTrialNum: 20
#choice: local, remote, pai, kubeflow
trainingServicePlatform: kubeflow
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
trial:
  codeDir: .
  worker:
    replicas: 2
    command: python3 dist_mnist.py
    gpuNum: 1
    cpuNum: 1
    memoryMB: 8196
    image: msranni/nni:latest
  ps:
    replicas: 1
    command: python3 dist_mnist.py
    gpuNum: 0
    cpuNum: 1
    memoryMB: 8196
    image: msranni/nni:latest
kubeflowConfig:
  operator: tf-operator
  apiVersion: v1alpha2
  storage: nfs
  nfs:
    # Your NFS server IP, like 10.10.10.10
    server: {your_nfs_server_ip}
    # Your NFS server export path, like /var/nfs/nni
    path: {your_nfs_server_export_path}
Note: You should explicitly set ``trainingServicePlatform: kubeflow`` in the NNI config YAML file if you want to start an experiment in kubeflow mode.
If you want to run PyTorch jobs, you could set your config file as follows:
.. code-block:: yaml
authorName: default
experimentName: example_mnist_distributed_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: kubeflow
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: minimize
trial:
  codeDir: .
  master:
    replicas: 1
    command: python3 dist_mnist.py
    gpuNum: 1
    cpuNum: 1
    memoryMB: 2048
    image: msranni/nni:latest
  worker:
    replicas: 1
    command: python3 dist_mnist.py
    gpuNum: 0
    cpuNum: 1
    memoryMB: 2048
    image: msranni/nni:latest
kubeflowConfig:
  operator: pytorch-operator
  apiVersion: v1alpha2
  nfs:
    # Your NFS server IP, like 10.10.10.10
    server: {your_nfs_server_ip}
    # Your NFS server export path, like /var/nfs/nni
    path: {your_nfs_server_export_path}
Trial configuration in kubeflow mode has the following configuration keys:
* codeDir
  * The code directory, where you put your training code and config files.
* worker (required). This config section is used to configure the TensorFlow worker role.
  * replicas
    * Required key. A positive number depending on how many replicas you want to run for the TensorFlow worker role.
  * command
    * Required key. The command to launch your trial job, like ``python mnist.py``.
  * memoryMB
    * Required key. A positive number based on your trial program's memory requirement.
  * cpuNum
  * gpuNum
  * image
    * Required key. In kubeflow mode, your trial program will be scheduled by Kubernetes to run in a `Pod <https://kubernetes.io/docs/concepts/workloads/pods/pod/>`__. This key is used to specify the Docker image used to create the pod in which your trial program will run.
    * We have already built a docker image :githublink:`msranni/nni <deployment/docker/Dockerfile>`. You can either use this image directly in your config file, or build your own image based on it.
  * privateRegistryAuthPath
    * Optional field. Specifies the ``config.json`` file path that holds an authorization token of the docker registry, used to pull images from a private registry. `Refer <https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/>`__.
* apiVersion
  * Required key. The API version of your Kubeflow.
* ps (optional). This config section is used to configure the TensorFlow parameter server role.
* master (optional). This config section is used to configure the PyTorch parameter server role.
Once you have filled in the NNI experiment config file and saved it (for example, as exp_kubeflow.yml), run the following command
.. code-block:: bash
nnictl create --config exp_kubeflow.yml
to start the experiment in kubeflow mode. NNI will create a Kubeflow tfjob or pytorchjob for each trial, and the job name format is something like ``nni_exp_{experiment_id}_trial_{trial_id}``.
You can see the Kubeflow tfjob created by NNI in your Kubernetes dashboard.
Notice: In kubeflow mode, NNIManager starts a REST server and listens on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is ``8080``\ , the REST server will listen on ``8081`` to receive metrics from trial jobs running in Kubernetes. So you should enable the ``8081`` TCP port in your firewall rules to allow incoming traffic.
Once a trial job is completed, you can go to NNI WebUI's overview page (like http://localhost:8080/oview) to check trial's information.
Version check
-------------
NNI has supported the version check feature since version 0.6; `refer to PaiMode <PaiMode.rst>`__ for details.
If you encounter any problems when using NNI in kubeflow mode, please create an issue on the `NNI GitHub repo <https://github.com/Microsoft/nni>`__.
**Tutorial: Create and Run an Experiment on local with NNI API**
====================================================================
In this tutorial, we will use the example in ``examples/trials/mnist-tfv1`` to explain how to create and run an experiment locally with the NNI API.
..
Before you start
You have an implementation of an MNIST classifier using convolutional layers; the Python code is in ``mnist_before.py``.
..
Step 1 - Update model codes
To enable the NNI API, make the following changes:
* Declare NNI API: include ``import nni`` in your trial code to use NNI APIs.
* Get predefined parameters
Use the following code snippet:
.. code-block:: python
RECEIVED_PARAMS = nni.get_next_parameter()
to get the hyper-parameter values assigned by the tuner. ``RECEIVED_PARAMS`` is an object, for example:
.. code-block:: json
{"conv_size": 2, "hidden_size": 124, "learning_rate": 0.0307, "dropout_rate": 0.2029}
* Report NNI results: Use the API ``nni.report_intermediate_result(accuracy)`` to send ``accuracy`` to the assessor,
  and the API ``nni.report_final_result(accuracy)`` to send ``accuracy`` to the tuner.
We have made these changes and saved the code to ``mnist.py``.
**NOTE**\ :
.. code-block:: text
accuracy - The `accuracy` could be any python object, but if you use the NNI built-in tuner/assessor, `accuracy` should be a numerical variable (e.g. float, int).
assessor - The assessor will decide which trial should stop early based on the trial's history performance (intermediate results of one trial).
tuner - The tuner will generate the next parameters/architecture based on the exploration history (final results of all trials).
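Putting the pieces together, a trial script has roughly the following shape. This is only a condensed sketch; ``build_model`` and ``train_one_epoch`` are hypothetical stand-ins for the actual code in ``mnist.py``:
.. code-block:: python
import nni

params = nni.get_next_parameter()  # e.g. {"conv_size": 2, "hidden_size": 124, ...}
# build_model and train_one_epoch are hypothetical stand-ins for your own code.
model = build_model(conv_size=params["conv_size"],
                    hidden_size=params["hidden_size"],
                    dropout_rate=params["dropout_rate"])
for epoch in range(10):
    accuracy = train_one_epoch(model, learning_rate=params["learning_rate"])
    nni.report_intermediate_result(accuracy)  # per-epoch result, sent to the assessor
nni.report_final_result(accuracy)             # final result, sent to the tuner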
..
Step 2 - Define SearchSpace
The hyper-parameters used in ``Step 1.2 - Get predefined parameters`` are defined in a ``search_space.json`` file like below:
.. code-block:: json
{
    "dropout_rate": {"_type": "uniform", "_value": [0.1, 0.5]},
    "conv_size": {"_type": "choice", "_value": [2, 3, 5, 7]},
    "hidden_size": {"_type": "choice", "_value": [124, 512, 1024]},
    "learning_rate": {"_type": "uniform", "_value": [0.0001, 0.1]}
}
Refer to `define search space <../Tutorial/SearchSpaceSpec.rst>`__ to learn more about search space.
..
Step 3 - Define Experiment
..
3.1 enable NNI API mode
To enable NNI API mode, you need to set useAnnotation to *false* and provide the path of the SearchSpace file (the one you just defined in step 2):
.. code-block:: yaml
useAnnotation: false
searchSpacePath: /path/to/your/search_space.json
To run an experiment in NNI, you only need to:
* Provide a runnable trial
* Provide or choose a tuner
* Provide a YAML experiment configuration file
* (optional) Provide or choose an assessor
**Prepare trial**\ :
Let's use a simple trial example, e.g. mnist, provided by NNI. After you install NNI, the NNI examples are placed under ~/nni/examples; run ``ls ~/nni/examples/trials`` to see all the trial examples. You can simply execute the following command to run the NNI mnist example:
.. code-block:: bash
python ~/nni/examples/trials/mnist-annotation/mnist.py
This command will be filled into the YAML configuration file below. Please refer to `here <../TrialExample/Trials.rst>`__ for how to write your own trial.
**Prepare tuner**\ : NNI supports several popular AutoML algorithms, including Random Search, Tree of Parzen Estimators (TPE), Evolution algorithm, etc. Users can write their own tuner (refer to `here <../Tuner/CustomizeTuner.rst>`__\ ), but for simplicity, here we choose a tuner provided by NNI as below:
.. code-block:: yaml
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
*builtinTunerName* is used to specify a tuner in NNI, *classArgs* are the arguments passed to the tuner (the spec of built-in tuners can be found `here <../Tuner/BuiltinTuner.rst>`__\ ), and *optimize_mode* indicates whether you want to maximize or minimize your trial's result.
**Prepare the configuration file**\ : Since you have already decided which trial code to run and which tuner to use, it is time to prepare the YAML configuration file. NNI provides a demo configuration file for each trial example; run ``cat ~/nni/examples/trials/mnist-annotation/config.yml`` to see it. Its content is basically as shown below:
.. code-block:: yaml
authorName: your_name
experimentName: auto_mnist
# how many trials could be concurrently running
trialConcurrency: 1
# maximum experiment running duration
maxExecDuration: 3h
# empty means never stop
maxTrialNum: 100
# choice: local, remote
trainingServicePlatform: local
# search space file
searchSpacePath: search_space.json
# choice: true, false
useAnnotation: true
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  command: python mnist.py
  codeDir: ~/nni/examples/trials/mnist-annotation
  gpuNum: 0
Here *useAnnotation* is true because this trial example uses our Python annotation (refer to `here <../Tutorial/AnnotationSpec.rst>`__ for details). For the trial, we should provide the *command* to run the trial and the *codeDir* where the trial code lives; the command will be executed in this directory. We should also specify how many GPUs a trial requires.
With all these steps done, we can run the experiment with the following command:
.. code-block:: bash
nnictl create --config ~/nni/examples/trials/mnist-annotation/config.yml
You can refer to `here <../Tutorial/Nnictl.rst>`__ for more usage of the *nnictl* command line tool.
View experiment results
-----------------------
The experiment is running now. Besides *nnictl*\ , NNI also provides a WebUI for you to view the experiment's progress, control your experiment, and more.
Using multiple local GPUs to speed up search
--------------------------------------------
The following steps assume that you have 4 NVIDIA GPUs installed locally and `tensorflow with GPU support <https://www.tensorflow.org/install/gpu>`__. The demo enables 4 concurrent trial jobs, and each trial job uses 1 GPU.
**Prepare the configuration file**\ : NNI provides a demo configuration file for the setting above; run ``cat ~/nni/examples/trials/mnist-annotation/config_gpu.yml`` to see it. The *trialConcurrency* and *gpuNum* values differ from the basic configuration file:
.. code-block:: yaml

...
# how many trials could be concurrently running
trialConcurrency: 4
...
trial:
  command: python mnist.py
  codeDir: ~/nni/examples/trials/mnist-annotation
  gpuNum: 1
We can run the experiment with the following command:
.. code-block:: bash
nnictl create --config ~/nni/examples/trials/mnist-annotation/config_gpu.yml
You can use the *nnictl* command line tool or the WebUI to trace the training progress. The *nvidia-smi* command line tool can also help you monitor GPU usage during training.
Training Service
================
What is Training Service?
-------------------------
NNI training service is designed to allow users to focus on AutoML itself, agnostic to the underlying computing infrastructure where the trials are actually run. When migrating from one cluster to another (e.g., local machine to Kubeflow), users only need to tweak several configurations, and the experiment can be easily scaled.
Users can use a training service provided by NNI to run trial jobs on a `local machine <./LocalMode.rst>`__\ , on `remote machines <./RemoteMachineMode.rst>`__\ , and on clusters like `PAI <./PaiMode.rst>`__\ , `Kubeflow <./KubeflowMode.rst>`__\ , `AdaptDL <./AdaptDLMode.rst>`__\ , `FrameworkController <./FrameworkControllerMode.rst>`__\ , `DLTS <./DLTSMode.rst>`__ and `AML <./AMLMode.rst>`__. These are called *built-in training services*.
If the computing resource you want to use is not listed above, NNI provides an interface that allows users to build their own training service easily. Please refer to "\ `how to implement training service <./HowToImplementTrainingService>`__\ " for details.
How to use Training Service?
----------------------------
The training service needs to be chosen and configured properly in the experiment configuration YAML file. Users can refer to the documentation of each training service for how to write the configuration. Also, the `reference <../Tutorial/ExperimentConfig>`__ provides more details on the specification of the experiment configuration file.
Next, users should prepare code directory, which is specified as ``codeDir`` in config file. Please note that in non-local mode, the code directory will be uploaded to remote or cluster before the experiment. Therefore, we limit the number of files to 2000 and total size to 300MB. If the code directory contains too many files, users can choose which files and subfolders should be excluded by adding a ``.nniignore`` file that works like a ``.gitignore`` file. For more details on how to write this file, see :githublink:`this example <examples/trials/mnist-tfv1/.nniignore>` and the `git documentation <https://git-scm.com/docs/gitignore#_pattern_format>`__.
In case users intend to use large files in their experiment (like large-scale datasets) and they are not using local mode, they can either: 1) download the data before each trial launches by putting the download step into the trial command; or 2) use a shared storage that is accessible to worker nodes. Usually, training platforms are equipped with shared storage, and NNI allows users to use them easily. Refer to the docs of each built-in training service for details.
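For option 1), one simple pattern is to let the trial code fetch the data itself when it starts; the URL and file name below are placeholders for illustration only:
.. code-block:: python

import os
import urllib.request

DATA_URL = 'https://example.com/dataset.tar.gz'   # placeholder URL
DATA_PATH = './dataset.tar.gz'

# download the dataset into the trial's working directory before training starts
if not os.path.exists(DATA_PATH):
    urllib.request.urlretrieve(DATA_URL, DATA_PATH)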
Built-in Training Services
--------------------------
.. list-table::
:header-rows: 1
:widths: auto
* - TrainingService
- Brief Introduction
* - `**Local** <./LocalMode.rst>`__
- NNI supports running an experiment on the local machine, called local mode. Local mode means that NNI will run the trial jobs and the nniManager process on the same machine, with GPU scheduling support for the trial jobs.
* - `**Remote** <./RemoteMachineMode.rst>`__
- NNI supports running an experiment on multiple machines through the SSH channel, called remote mode. NNI assumes that you have access to those machines and have already set up the environment for running deep learning training code. NNI will submit trial jobs to the remote machines and schedule them onto machines with enough GPU resources, if specified.
* - `**PAI** <./PaiMode.rst>`__
- NNI supports running an experiment on `OpenPAI <https://github.com/Microsoft/pai>`__ (aka PAI), called PAI mode. Before starting to use NNI PAI mode, you should have an account to access an `OpenPAI <https://github.com/Microsoft/pai>`__ cluster. See `here <https://github.com/Microsoft/pai#how-to-deploy>`__ if you don't have an OpenPAI account and want to deploy an OpenPAI cluster. In PAI mode, your trial program will run in a Docker container created by PAI.
* - `**Kubeflow** <./KubeflowMode.rst>`__
- NNI supports running an experiment on `Kubeflow <https://github.com/kubeflow/kubeflow>`__\ , called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or `Azure Kubernetes Service (AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , and an Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is set up to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, `here <https://kubernetes.io/docs/tutorials/kubernetes-basics/>`__ is a good start. In kubeflow mode, your trial program will run as a Kubeflow job in your Kubernetes cluster.
* - `**AdaptDL** <./AdaptDLMode.rst>`__
- NNI supports running an experiment on `AdaptDL <https://github.com/petuum/adaptdl>`__\ , called AdaptDL mode. Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster.
* - `**FrameworkController** <./FrameworkControllerMode.rst>`__
- NNI supports running an experiment using `FrameworkController <https://github.com/Microsoft/frameworkcontroller>`__\ , called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes; you don't need to install Kubeflow or a framework-specific operator like tf-operator or pytorch-operator. You can use FrameworkController as the training service to run NNI experiments.
* - `**DLTS** <./DLTSMode.rst>`__
- NNI supports running an experiment using `DLTS <https://github.com/microsoft/DLWorkspace.git>`__\ , an open source toolkit developed by Microsoft that allows AI scientists to spin up an AI cluster in a turn-key fashion.
* - `**AML** <./AMLMode.rst>`__
- NNI supports running an experiment on `AML <https://azure.microsoft.com/en-us/services/machine-learning/>`__ , called aml mode.
What does Training Service do?
------------------------------
.. raw:: html
<p align="center">
<img src="https://user-images.githubusercontent.com/23273522/51816536-ed055580-2301-11e9-8ad8-605a79ee1b9a.png" alt="drawing" width="700"/>
</p>
According to the architecture shown in `Overview <../Overview>`__\ , the training service (platform) is responsible for three tasks: 1) initiating a new trial; 2) collecting metrics and communicating with the NNI core (NNI manager); 3) monitoring trial job status. To demonstrate in detail how the training service works, we show its workflow from the very beginning to the moment the first trial succeeds.
Step 1. **Validate config and prepare the training platform.** The training service first checks whether the training platform the user specifies is valid (e.g., is there anything wrong with authentication). After that, the training service starts to prepare for the experiment by making the code directory (\ ``codeDir``\ ) accessible to the training platform.
.. Note:: Different training services have different ways to handle ``codeDir``. For example, local training service directly runs trials in ``codeDir``. Remote training service packs ``codeDir`` into a zip and uploads it to each machine. K8S-based training services copy ``codeDir`` onto a shared storage, which is either provided by training platform itself, or configured by users in config file.
Step 2. **Submit the first trial.** To initiate a trial, usually (in non-reuse mode), NNI copies a few more files (including parameters, a launch script, etc.) onto the training platform. After that, NNI launches the trial, e.g., via a subprocess, SSH, or a RESTful API.
.. Warning:: The working directory of the trial command has exactly the same content as ``codeDir``, but can have a different path (even on different machines). Local mode is the only training service that shares one ``codeDir`` across all trials. Other training services copy ``codeDir`` from the shared copy prepared in step 1, and each trial has an independent working directory. We strongly advise users not to rely on the shared behavior in local mode, as it will make your experiments difficult to scale to other training services.
Step 3. **Collect metrics.** NNI then monitors the status of the trial, updates the recorded status (e.g., from ``WAITING`` to ``RUNNING``\ , or ``RUNNING`` to ``SUCCEEDED``\ ), and collects the metrics. Currently, most training services are implemented in an "active" way, i.e., the training service calls a RESTful API on the NNI manager to update the metrics. Note that this usually requires the machine that runs the NNI manager to be accessible from the worker nodes.
.. role:: raw-html(raw)
:format: html
Run an Experiment on OpenPAI
============================
NNI supports running an experiment on `OpenPAI <https://github.com/Microsoft/pai>`__\ , called pai mode. Before starting to use NNI pai mode, you should have an account to access an `OpenPAI <https://github.com/Microsoft/pai>`__ cluster. See `here <https://github.com/Microsoft/pai#how-to-deploy>`__ if you don't have an OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in a Docker container created by OpenPAI.
Setup environment
-----------------
**Step 1. Install NNI** following the install guide `here <../Tutorial/QuickStart.rst>`__.
**Step 2. Get token.**
Open the web portal of OpenPAI and click the ``My profile`` button at the top-right.
.. image:: ../../img/pai_profile.jpg
:scale: 80%
Click the ``copy`` button on the page to copy a JWT token.
.. image:: ../../img/pai_token.jpg
:scale: 67%
**Step 3. Mount NFS storage to local machine.**
Click the ``Submit job`` button in the web portal.
.. image:: ../../img/pai_job_submission_page.jpg
:scale: 50%
Find the data management section on the job submission page.
.. image:: ../../img/pai_data_management_page.jpg
:scale: 33%
``Preview container paths`` shows the NFS host and path provided by OpenPAI; you need to mount the corresponding host and path to your local machine first, so that NNI can use OpenPAI's NFS storage.\ :raw-html:`<br>`
For example, use the following command:
.. code-block:: bash
sudo mount -t nfs4 gcr-openpai-infra02:/pai/data /local/mnt
Then the ``/data`` folder in the container will be mounted to the ``/local/mnt`` folder on your local machine.\ :raw-html:`<br>`
You can then use the following configuration in your NNI config file:
.. code-block:: yaml
nniManagerNFSMountPath: /local/mnt
**Step 4. Get OpenPAI's storage config name and container mount path.**
The ``Team share storage`` field is the storage configuration used to specify storage in OpenPAI. You can get the ``paiStorageConfigName`` and ``containerNFSMountPath`` fields from ``Team share storage``\ , for example:
.. code-block:: yaml
paiStorageConfigName: confignfs-data
containerNFSMountPath: /mnt/confignfs-data
Run an experiment
-----------------
Use ``examples/trials/mnist-annotation`` as an example. The NNI config YAML file's content is as follows:
.. code-block:: yaml
authorName: your_name
experimentName: auto_mnist
# how many trials could be concurrently running
trialConcurrency: 2
# maximum experiment running duration
maxExecDuration: 3h
# empty means never stop
maxTrialNum: 100
# choice: local, remote, pai
trainingServicePlatform: pai
# search space file
searchSpacePath: search_space.json
# choice: true, false
useAnnotation: true
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: ~/nni/examples/trials/mnist-annotation
  gpuNum: 0
  cpuNum: 1
  memoryMB: 8196
  image: msranni/nni:latest
  virtualCluster: default
  nniManagerNFSMountPath: /local/mnt
  containerNFSMountPath: /mnt/confignfs-data
  paiStorageConfigName: confignfs-data
# Configuration to access OpenPAI Cluster
paiConfig:
  userName: your_pai_nni_user
  token: your_pai_token
  host: 10.1.1.1
  # optional, experimental feature.
  reuse: true
Note: You should set ``trainingServicePlatform: pai`` in the NNI config YAML file if you want to start an experiment in pai mode. The ``host`` field in the configuration file is OpenPAI's job submission page URI, like ``10.10.5.1``. The default protocol in NNI is ``http``; if your OpenPAI cluster has enabled HTTPS, please use the URI in the format ``https://10.10.5.1``.
Trial configurations
^^^^^^^^^^^^^^^^^^^^
Compared with `LocalMode <LocalMode.rst>`__ and `RemoteMachineMode <RemoteMachineMode.rst>`__\ , the ``trial`` configuration in pai mode has the following additional keys:
* ``cpuNum``: Optional key. Should be a positive number based on your trial program's CPU requirement. If it is not set in the trial configuration, it should be set in the config file specified in the ``paiConfigPath`` field.
* ``memoryMB``: Optional key. Should be a positive number based on your trial program's memory requirement. If it is not set in the trial configuration, it should be set in the config file specified in the ``paiConfigPath`` field.
* ``image``: Optional key. In pai mode, your trial program will be scheduled by OpenPAI to run in a `Docker container <https://www.docker.com/>`__. This key is used to specify the Docker image used to create the container in which your trial will run. We have already built a Docker image :githublink:`msranni/nni <deployment/docker/Dockerfile>`. You can either use this image directly in your config file, or build your own image based on it. If it is not set in the trial configuration, it should be set in the config file specified in the ``paiConfigPath`` field.
* ``virtualCluster``: Optional key. Set the virtualCluster of OpenPAI. If omitted, the job will run on the default virtual cluster.
* ``nniManagerNFSMountPath``: Required key. Set the mount path on your nniManager machine.
* ``containerNFSMountPath``: Required key. Set the mount path in the container used in OpenPAI.
* ``paiStorageConfigName``: Optional key. Set the storage name used in OpenPAI. If it is not set in the trial configuration, it should be set in the config file specified in the ``paiConfigPath`` field.
* ``command``: Optional key. Set the command used in the OpenPAI container.
* ``paiConfigPath``: Optional key. Set the file path of the OpenPAI job configuration; the file is in YAML format.
If users set ``paiConfigPath`` in NNI's configuration file, there is no need to specify the fields ``command``\ , ``paiStorageConfigName``\ , ``virtualCluster``\ , ``image``\ , ``memoryMB``\ , ``cpuNum``\ , and ``gpuNum`` in the ``trial`` configuration. These fields will take their values from the config file specified by ``paiConfigPath``.
Note:

#. The job name in OpenPAI's configuration file will be replaced by a new job name created by NNI, in the format ``nni_exp_{experiment_id}_trial_{trial_id}``.
#. If users set multiple taskRoles in OpenPAI's configuration file, NNI will wrap all of these taskRoles and start multiple tasks in one trial job. Users should ensure that only one taskRole reports metrics to NNI; otherwise conflict errors may occur.
OpenPAI configurations
^^^^^^^^^^^^^^^^^^^^^^
``paiConfig`` includes OpenPAI-specific configurations:
* ``userName``: Required key. User name of the OpenPAI platform.
* ``token``: Required key. Authentication key of the OpenPAI platform.
* ``host``: Required key. The host of the OpenPAI platform. It is OpenPAI's job submission page URI, like ``10.10.5.1``. The default protocol in NNI is ``http``; if your OpenPAI cluster has enabled HTTPS, please use the URI in the format ``https://10.10.5.1``.
* ``reuse`` (experimental feature): Optional key, default is false. If it is true, NNI will reuse OpenPAI jobs to run as many trials as possible. This saves the time of creating new jobs. Users need to make sure each trial can run independently in the same job; for example, avoid loading checkpoints from previous trials.
Once you have filled in the NNI experiment config file and saved it (for example, as exp_pai.yml), run the following command
.. code-block:: bash
nnictl create --config exp_pai.yml
to start the experiment in pai mode. NNI will create an OpenPAI job for each trial, and the job name format is something like ``nni_exp_{experiment_id}_trial_{trial_id}``.
You can see jobs created by NNI in the OpenPAI cluster's web portal, like:
.. image:: ../../img/nni_pai_joblist.jpg
:target: ../../img/nni_pai_joblist.jpg
:alt:
Notice: In pai mode, NNIManager starts a REST server that listens on a port equal to your NNI WebUI's port plus 1. For example, if your WebUI port is ``8080``\ , the REST server will listen on ``8081`` to receive metrics from trial jobs running on OpenPAI. So you should open TCP port ``8081`` in your firewall rules to allow incoming traffic.
Once a trial job is completed, you can go to the NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.
Expand a trial's information in the trial list view and click the logPath link, like:
.. image:: ../../img/nni_webui_joblist.jpg
:scale: 30%
And you will be redirected to HDFS web portal to browse the output files of that trial in HDFS:
.. image:: ../../img/nni_trial_hdfs_output.jpg
:scale: 80%
You can see there are three files in the output folder: stderr, stdout, and trial.log.
Data management
---------------
Before using NNI to start your experiment, users should set the corresponding mounted data path on the nniManager machine. OpenPAI has its own storage (NFS, AzureBlob, etc.), and the storage used in OpenPAI is mounted to the container when it starts a job. Users should choose a storage in OpenPAI by setting the ``paiStorageConfigName`` field. Then they should mount that storage to their nniManager machine and set the ``nniManagerNFSMountPath`` field in the configuration file. NNI will generate bash files, copy the data in ``codeDir`` to the ``nniManagerNFSMountPath`` folder, and then start a trial job. The data in ``nniManagerNFSMountPath`` will be synced to the OpenPAI storage and mounted into OpenPAI's container. The data path in the container is set by ``containerNFSMountPath``\ ; NNI enters this folder first and then runs the scripts to start a trial job.
Version check
-------------
NNI has supported a version check feature since version 0.6. It is a policy to ensure that the version of NNIManager is consistent with that of trialKeeper, avoiding errors caused by version incompatibility.
Check policy:
#. NNIManager before v0.6 could run any version of trialKeeper; trialKeeper supports backward compatibility.
#. Since version 0.6, the NNIManager version should be the same as the trialKeeper version. For example, if the NNIManager version is 0.6, the trialKeeper version should be 0.6 too.
#. Note that the version check only compares the first two digits of the version (see the sketch below). For example, NNIManager v0.6.1 could use trialKeeper v0.6 or trialKeeper v0.6.2, but could not use trialKeeper v0.5.1 or trialKeeper v0.7.
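In other words, compatibility hinges only on the major and minor version. A sketch of the comparison (for illustration; not NNI's actual implementation) is:
.. code-block:: python

def versions_compatible(manager_version, trialkeeper_version):
    # only the first two digits are compared, e.g. '0.6.1' matches '0.6.2'
    return manager_version.split('.')[:2] == trialkeeper_version.split('.')[:2]

assert versions_compatible('0.6.1', '0.6.2')
assert not versions_compatible('0.6.1', '0.7.0')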
If you cannot run your experiment and want to know whether it is caused by the version check, you can check your WebUI, where an error message about the version check will be shown.
.. image:: ../../img/version_check.png
:scale: 80%
.. role:: raw-html(raw)
:format: html
Run an Experiment on OpenpaiYarn
================================
The original ``pai`` mode has been changed to ``paiYarn`` mode, which is a distributed training platform based on YARN.
Setup environment
-----------------
Install NNI following the install guide `here <../Tutorial/QuickStart.rst>`__.
Run an experiment
-----------------
Use ``examples/trials/mnist-tfv1`` as an example. The NNI config YAML file's content is as follows:
.. code-block:: yaml
authorName: your_name
experimentName: auto_mnist
# how many trials could be concurrently running
trialConcurrency: 2
# maximum experiment running duration
maxExecDuration: 3h
# empty means never stop
maxTrialNum: 100
# choice: local, remote, pai, paiYarn
trainingServicePlatform: paiYarn
# search space file
searchSpacePath: search_space.json
# choice: true, false
useAnnotation: false
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: ~/nni/examples/trials/mnist-tfv1
  gpuNum: 0
  cpuNum: 1
  memoryMB: 8196
  image: msranni/nni:latest
# Configuration to access OpenpaiYarn Cluster
paiYarnConfig:
  userName: your_paiYarn_nni_user
  passWord: your_paiYarn_password
  host: 10.1.1.1
Note: You should set ``trainingServicePlatform: paiYarn`` in the NNI config YAML file if you want to start an experiment in paiYarn mode.
Compared with `LocalMode <LocalMode.rst>`__ and `RemoteMachineMode <RemoteMachineMode.rst>`__\ , the trial configuration in paiYarn mode has these additional keys:
* ``cpuNum``: Required key. Should be a positive number based on your trial program's CPU requirement.
* ``memoryMB``: Required key. Should be a positive number based on your trial program's memory requirement.
* ``image``: Required key. In paiYarn mode, your trial program will be scheduled by OpenpaiYarn to run in a `Docker container <https://www.docker.com/>`__. This key is used to specify the Docker image used to create the container in which your trial will run. We have already built a Docker image :githublink:`msranni/nni <deployment/docker/Dockerfile>`. You can either use this image directly in your config file, or build your own image based on it.
* ``virtualCluster``: Optional key. Set the virtualCluster of OpenpaiYarn. If omitted, the job will run on the default virtual cluster.
* ``shmMB``: Optional key. Set the shmMB configuration of OpenpaiYarn; it sets the shared memory for one task in the task role.
* ``authFile``: Optional key. Set the auth file path for a private registry while using paiYarn mode (`refer <https://github.com/microsoft/paiYarn/blob/2ea69b45faa018662bc164ed7733f6fdbb4c42b3/docs/faq.rst#q-how-to-use-private-docker-registry-job-image-when-submitting-an-openpaiYarn-job>`__\ ). You can prepare the authFile and simply provide the local path of this file; NNI will upload it to HDFS for you.
* ``portList``: Optional key. Set the portList configuration of OpenpaiYarn; it specifies a list of ports used in the container (`refer <https://github.com/microsoft/paiYarn/blob/b2324866d0280a2d22958717ea6025740f71b9f0/docs/job_tutorial.rst#specification>`__\ ).
The config schema in NNI is shown below:
.. code-block:: yaml

portList:
  - label: test
    beginAt: 8080
    portNumber: 2
Let's say you want to launch TensorBoard in the mnist example using one of these ports. The first step is to write a wrapper script ``launch_paiYarn.sh`` around ``mnist.py``\ :
.. code-block:: bash
export TENSORBOARD_PORT=paiYarn_PORT_LIST_${paiYarn_CURRENT_TASK_ROLE_NAME}_0_tensorboard
tensorboard --logdir . --port ${!TENSORBOARD_PORT} &
python3 mnist.py
The portList in the config file should be filled in as follows:
.. code-block:: yaml
trial:
  command: bash launch_paiYarn.sh
  portList:
    - label: tensorboard
      beginAt: 0
      portNumber: 1
NNI supports two kinds of authorization methods in paiYarn: password and paiYarn token; see the `reference <https://github.com/microsoft/paiYarn/blob/b6bd2ab1c8890f91b7ac5859743274d2aa923c22/docs/rest-server/API.rst#2-authentication>`__. The authorization is configured in the ``paiYarnConfig`` field.\ :raw-html:`<br>`
For password authorization, the ``paiYarnConfig`` schema is:
.. code-block:: yaml

paiYarnConfig:
  userName: your_paiYarn_nni_user
  passWord: your_paiYarn_password
  host: 10.1.1.1
For paiYarn token authorization, the ``paiYarnConfig`` schema is:
.. code-block:: yaml

paiYarnConfig:
  userName: your_paiYarn_nni_user
  token: your_paiYarn_token
  host: 10.1.1.1
Once you have filled in the NNI experiment config file and saved it (for example, as exp_paiYarn.yml), run the following command
.. code-block:: bash
nnictl create --config exp_paiYarn.yml
to start the experiment in paiYarn mode. NNI will create an OpenpaiYarn job for each trial, and the job name format is something like ``nni_exp_{experiment_id}_trial_{trial_id}``.
You can see jobs created by NNI in the OpenpaiYarn cluster's web portal, like:
.. image:: ../../img/nni_pai_joblist.jpg
:target: ../../img/nni_pai_joblist.jpg
:alt:
Notice: In paiYarn mode, NNIManager starts a REST server that listens on a port equal to your NNI WebUI's port plus 1. For example, if your WebUI port is ``8080``\ , the REST server will listen on ``8081`` to receive metrics from trial jobs running on OpenpaiYarn. So you should open TCP port ``8081`` in your firewall rules to allow incoming traffic.
Once a trial job is completed, you can go to the NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.
Expand a trial's information in the trial list view and click the logPath link, like:
.. image:: ../../img/nni_webui_joblist.jpg
:target: ../../img/nni_webui_joblist.jpg
:alt:
And you will be redirected to HDFS web portal to browse the output files of that trial in HDFS:
.. image:: ../../img/nni_trial_hdfs_output.jpg
:target: ../../img/nni_trial_hdfs_output.jpg
:alt:
You can see there are three files in the output folder: stderr, stdout, and trial.log.
Data management
---------------
If your training data is not too large, it can be put into ``codeDir`` and NNI will upload it to HDFS, or you can build your own Docker image with the data included. If you have a large dataset, it is not appropriate to put the data in ``codeDir``\ ; instead, you can follow the `guidance <https://github.com/microsoft/paiYarn/blob/master/docs/user/storage.rst>`__ to mount the data folder in the container.
If you also want to save a trial's other output, like model files, into HDFS, you can use the environment variable ``NNI_OUTPUT_DIR`` in your trial code to save your own output files, and the NNI SDK will copy all the files in ``NNI_OUTPUT_DIR`` from the trial's container to HDFS. The target path is ``hdfs://host:port/{username}/nni/{experiments}/{experimentId}/trials/{trialId}/nnioutput``.
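For example, a PyTorch trial could save its model as sketched below; the ``model`` variable is a hypothetical stand-in for your trained network:
.. code-block:: python

import os
import torch

# NNI sets NNI_OUTPUT_DIR for each trial; fall back to the current directory when run standalone
output_dir = os.environ.get('NNI_OUTPUT_DIR', '.')
torch.save(model.state_dict(), os.path.join(output_dir, 'model.pth'))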
Version check
-------------
NNI has supported a version check feature since version 0.6. It is a policy to ensure that the version of NNIManager is consistent with that of trialKeeper, avoiding errors caused by version incompatibility.
Check policy:
#. NNIManager before v0.6 could run any version of trialKeeper; trialKeeper supports backward compatibility.
#. Since version 0.6, the NNIManager version should be the same as the trialKeeper version. For example, if the NNIManager version is 0.6, the trialKeeper version should be 0.6 too.
#. Note that the version check only compares the first two digits of the version. For example, NNIManager v0.6.1 could use trialKeeper v0.6 or trialKeeper v0.6.2, but could not use trialKeeper v0.5.1 or trialKeeper v0.7.
If you cannot run your experiment and want to know whether it is caused by the version check, you can check your WebUI, where an error message about the version check will be shown.
.. image:: ../../img/version_check.png
:target: ../../img/version_check.png
:alt:
Run an Experiment on Remote Machines
====================================
NNI can run one experiment on multiple remote machines through SSH, called ``remote`` mode. It's like a lightweight training platform. In this mode, NNI is started on your computer and dispatches trials to remote machines in parallel.
The supported OSes for remote machines are ``Linux``\ , ``Windows 10``\ , and ``Windows Server 2019``.
Requirements
------------
* Make sure the default environment of the remote machines meets the requirements of your trial code. If it does not, a setup script can be added to the ``command`` field of the NNI config.
* Make sure the remote machines can be accessed through SSH from the machine which runs the ``nnictl`` command. Both password and key authentication of SSH are supported. For advanced usage, please refer to the `machineList part of the configuration <../Tutorial/ExperimentConfig.rst>`__.
* Make sure the NNI version on each machine is consistent.
* Make sure the command of the trial is compatible with the remote OSes, if you want to use remote Linux and Windows machines together. For example, the default Python 3.x executable is called ``python3`` on Linux and ``python`` on Windows.
Linux
^^^^^
* Follow `installation <../Tutorial/InstallationLinux.rst>`__ to install NNI on the remote machine.
Windows
^^^^^^^
* Follow `installation <../Tutorial/InstallationWin.rst>`__ to install NNI on the remote machine.
* Install and start ``OpenSSH Server``:

  #. Open the ``Settings`` app on Windows.
  #. Click ``Apps``\ , then click ``Optional features``.
  #. Click ``Add a feature``\ , search for and select ``OpenSSH Server``\ , and then click ``Install``.
  #. Once it's installed, run the commands below to start the service and set it to start automatically:

     .. code-block:: bat

        sc config sshd start= auto
        net start sshd

* Make sure the remote account is an administrator, so that it can stop running trials.
* Make sure there is no welcome message beyond the default, since extra output causes the ssh2 library in Node.js to fail. For example, if you're using a Data Science VM on Azure, you need to remove the extra echo commands in ``C:\dsvm\tools\setup\welcome.bat``.

Output like the following is OK when opening a new command window:
.. code-block:: text
Microsoft Windows [Version 10.0.17763.1192]
(c) 2018 Microsoft Corporation. All rights reserved.
(py37_default) C:\Users\AzureUser>
Run an experiment
-----------------
For example, suppose there are three machines which can be logged in to with a username and password:
.. list-table::
:header-rows: 1
:widths: auto
* - IP
- Username
- Password
* - 10.1.1.1
- bob
- bob123
* - 10.1.1.2
- bob
- bob123
* - 10.1.1.3
- bob
- bob123
Install and run NNI on one of those three machines, or on another machine that has network access to them.
Use ``examples/trials/mnist-annotation`` as the example. Below is the content of ``examples/trials/mnist-annotation/config_remote.yml``\ :
.. code-block:: yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: remote
# search space file
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: true
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 0
#machineList can be empty if the platform is local
machineList:
  - ip: 10.1.1.1
    username: bob
    passwd: bob123
    #port can be skipped if using default ssh port 22
    #port: 22
  - ip: 10.1.1.2
    username: bob
    passwd: bob123
  - ip: 10.1.1.3
    username: bob
    passwd: bob123
Files in ``codeDir`` will be uploaded to the remote machines automatically. You can run the following command on Windows, Linux, or macOS to spawn trials on remote Linux machines:
.. code-block:: bash
nnictl create --config examples/trials/mnist-annotation/config_remote.yml
Configure python environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
By default, commands and scripts will be executed in the default environment of the remote machine. If there are multiple Python virtual environments on your remote machine and you want to run experiments in a specific one, use **preCommand** to specify a Python environment on your remote machine.
Use ``examples/trials/mnist-tfv2`` as the example. Below is the content of ``examples/trials/mnist-tfv2/config_remote.yml``\ :
.. code-block:: yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: remote
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 0
#machineList can be empty if the platform is local
machineList:
  - ip: ${replace_to_your_remote_machine_ip}
    username: ${replace_to_your_remote_machine_username}
    sshKeyPath: ${replace_to_your_remote_machine_sshKeyPath}
    # Pre-command will be executed before the remote machine executes other commands.
    # Below is an example of specifying python environment.
    # If you want to execute multiple commands, please use "&&" to connect them.
    # preCommand: source ${replace_to_absolute_path_recommended_here}/bin/activate
    # preCommand: source ${replace_to_conda_path}/bin/activate ${replace_to_conda_env_name}
    preCommand: export PATH=${replace_to_python_environment_path_in_your_remote_machine}:$PATH
The **preCommand** will be executed before the remote machine executes other commands, so you can configure the Python environment path like this:
.. code-block:: yaml
# Linux remote machine
preCommand: export PATH=${replace_to_python_environment_path_in_your_remote_machine}:$PATH
# Windows remote machine
preCommand: set path=${replace_to_python_environment_path_in_your_remote_machine};%path%
Or if you want to activate the ``virtualenv`` environment:
.. code-block:: yaml
# Linux remote machine
preCommand: source ${replace_to_absolute_path_recommended_here}/bin/activate
# Windows remote machine
preCommand: ${replace_to_absolute_path_recommended_here}\\scripts\\activate
Or if you want to activate the ``conda`` environment:
.. code-block:: yaml
# Linux remote machine
preCommand: source ${replace_to_conda_path}/bin/activate ${replace_to_conda_env_name}
# Windows remote machine
preCommand: call activate ${replace_to_conda_env_name}
If you want multiple commands to be executed, you can use ``&&`` to connect these commands:
.. code-block:: yaml
preCommand: command1 && command2 && command3
**Note**\ : Because **preCommand** executes before every other command, it is strongly recommended not to use a **preCommand** that makes changes to the system, e.g. ``mkdir`` or ``touch``.
CIFAR-10 examples
=================
Overview
--------
`CIFAR-10 <https://www.cs.toronto.edu/~kriz/cifar.html>`__ classification is a common benchmark problem in machine learning. The CIFAR-10 dataset is a collection of images. It is one of the most widely used datasets for machine learning research and contains 60,000 32x32 color images in 10 different classes. Thus, we use CIFAR-10 classification as an example to introduce NNI usage.
**Goals**
^^^^^^^^^^^^^
As we all know, the choice of optimizer directly affects the performance of the final metrics. The goal of this tutorial is to **tune a better-performing optimizer** to train a relatively small convolutional neural network (CNN) for recognizing images.
In this example, we have selected the following common deep learning optimizers (one way to instantiate them from the tuned name is sketched after the list):
..
"SGD", "Adadelta", "Adagrad", "Adam", "Adamax"
**Experiment**
^^^^^^^^^^^^^^^^^^^^
Preparations
^^^^^^^^^^^^
This example requires PyTorch. The PyTorch install package should be chosen based on your Python and CUDA versions.
Here is an example for an environment with python==3.5 and cuda==8.0; use the following commands to install `PyTorch <https://pytorch.org/>`__\ :
.. code-block:: bash
python3 -m pip install http://download.pytorch.org/whl/cu80/torch-0.4.1-cp35-cp35m-linux_x86_64.whl
python3 -m pip install torchvision
CIFAR-10 with NNI
^^^^^^^^^^^^^^^^^
**Search Space**
As stated in the goals, we aim to find the best ``optimizer`` for training CIFAR-10 classification. When using different optimizers, we also need to adjust ``learning rates`` and ``network structure`` accordingly, so we chose these three parameters as hyperparameters and wrote the following search space:
.. code-block:: json
{
  "lr": {"_type": "choice", "_value": [0.1, 0.01, 0.001, 0.0001]},
  "optimizer": {"_type": "choice", "_value": ["SGD", "Adadelta", "Adagrad", "Adam", "Adamax"]},
  "model": {"_type": "choice", "_value": ["vgg", "resnet18", "googlenet", "densenet121", "mobilenet", "dpn92", "senet18"]}
}
*Implemented code directory: :githublink:`search_space.json <examples/trials/cifar10_pytorch/search_space.json>`*
**Trial**
The trial code trains the CNN for each hyperparameter set. Pay particular attention to the following NNI-specific points (a skeleton combining them follows the list):
* Use ``nni.get_next_parameter()`` to get the next hyperparameter set for training.
* Use ``nni.report_intermediate_result(acc)`` to report the intermediate result after finishing each epoch.
* Use ``nni.report_final_result(acc)`` to report the final result before the trial ends.
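Putting these three calls together, a trial skeleton looks roughly like the sketch below; ``train_one_epoch`` is a hypothetical helper standing in for the actual training loop in ``main.py``\ :
.. code-block:: python

import nni

params = nni.get_next_parameter()        # e.g. {'lr': 0.01, 'optimizer': 'Adam', 'model': 'vgg'}
acc = 0.0
for epoch in range(10):
    acc = train_one_epoch(params)        # hypothetical: trains one epoch, returns accuracy
    nni.report_intermediate_result(acc)  # report after finishing each epoch
nni.report_final_result(acc)             # report once before the trial ends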
*Implemented code directory: :githublink:`main.py <examples/trials/cifar10_pytorch/main.py>`*
You can also use your previous code directly; refer to `How to define a trial <Trials.rst>`__ for how to modify it.
**Config**
Here is an example of running this experiment locally (with multiple GPUs):
code directory: :githublink:`examples/trials/cifar10_pytorch/config.yml <examples/trials/cifar10_pytorch/config.yml>`
Here is the example of running this experiment on OpenPAI:
code directory: :githublink:`examples/trials/cifar10_pytorch/config_pai.yml <examples/trials/cifar10_pytorch/config_pai.yml>`
*The complete examples we have implemented: :githublink:`examples/trials/cifar10_pytorch/ <examples/trials/cifar10_pytorch>`*
Launch the experiment
^^^^^^^^^^^^^^^^^^^^^
We are ready for the experiment. Let's now **run the config.yml file from your command line to start the experiment**.
.. code-block:: bash
nnictl create --config nni/examples/trials/cifar10_pytorch/config.yml
EfficientNet
============
`EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks <https://arxiv.org/abs/1905.11946>`__
This example uses grid search to find the best combination of alpha, beta, and gamma for EfficientNet-B1, as discussed in Section 3.3 of the paper. The search space, tuner, and configuration examples are provided here.
Instructions
------------
:githublink:`Example code <examples/trials/efficientnet>`
#. Set your working directory to the example code directory.
#. Run ``git clone https://github.com/ultmaster/EfficientNet-PyTorch`` to clone the `ultmaster modified version <https://github.com/ultmaster/EfficientNet-PyTorch>`__ of the original `EfficientNet-PyTorch <https://github.com/lukemelas/EfficientNet-PyTorch>`__. The modifications adhere as closely as possible to the original `Tensorflow version <https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet>`__ (including EMA, label smoothing, etc.); the parts that get parameters from the tuner and report intermediate/final results were also added. Clone it into ``EfficientNet-PyTorch``\ ; files like ``main.py`` and ``train_imagenet.sh`` will appear inside, as specified in the configuration files.
#. Run ``nnictl create --config config_local.yml`` (use ``config_pai.yml`` for OpenPAI) to find the best EfficientNet-B1. Adjust the training service (PAI/local/remote) and batch size in the config files according to your environment.
For training on ImageNet, read ``EfficientNet-PyTorch/train_imagenet.sh``. Download ImageNet beforehand, extract it following the `PyTorch format <https://pytorch.org/docs/stable/torchvision/datasets.html#imagenet>`__\ , and then replace ``/mnt/data/imagenet`` with the location of your ImageNet storage. This file is also a good example to follow for mounting ImageNet into the container on OpenPAI.
Results
-------
The following image is a screenshot demonstrating the relationship between acc@1 and alpha, beta, and gamma.
.. image:: ../../img/efficientnet_search_result.png
:target: ../../img/efficientnet_search_result.png
:alt:
GBDT in NNI
===========
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion as other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
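For intuition, the stage-wise construction can be written as follows (a standard formulation of gradient boosting, stated here for reference): each stage adds a weak learner :math:`h_m` chosen to reduce the loss :math:`L` of the current ensemble.
.. math::

F_m(x) = F_{m-1}(x) + \gamma_m h_m(x), \qquad h_m = \arg\min_h \sum_{i=1}^{n} L\bigl(y_i,\; F_{m-1}(x_i) + h(x_i)\bigr)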
Gradient boosting decision trees have many popular implementations, such as `lightgbm <https://github.com/Microsoft/LightGBM>`__\ , `xgboost <https://github.com/dmlc/xgboost>`__\ , and `catboost <https://github.com/catboost/catboost>`__. GBDT is a great tool for solving traditional machine learning problems. Since GBDT is a robust algorithm, it can be used in many domains. The better the hyper-parameters for GBDT, the better performance you can achieve.
NNI is a great platform for tuning hyper-parameters; you can try various built-in search algorithms in NNI and run multiple trials concurrently.
1. Search Space in GBDT
-----------------------
There are many hyper-parameters in GBDT, but which of them affect performance or speed? Based on practical experience, here are some suggestions (taking lightgbm as an example):
* For better accuracy:

  * ``learning_rate``. The range of ``learning_rate`` could be [0.001, 0.9].
  * ``num_leaves``. ``num_leaves`` is related to ``max_depth``\ ; you don't have to tune both of them.
  * ``bagging_freq``. ``bagging_freq`` could be [1, 2, 4, 8, 10].
  * ``num_iterations``. May be larger if underfitting.

* For speed up:

  * ``bagging_fraction``. The range of ``bagging_fraction`` could be [0.7, 1.0].
  * ``feature_fraction``. The range of ``feature_fraction`` could be [0.6, 1.0].
  * ``max_bin``.

* To avoid overfitting:

  * ``min_data_in_leaf``. This depends on your dataset.
  * ``min_sum_hessian_in_leaf``. This depends on your dataset.
  * ``lambda_l1`` and ``lambda_l2``.
  * ``min_gain_to_split``.
  * ``num_leaves``.
Reference links: `lightgbm <https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html>`__ and `autoxgboost <https://github.com/ja-thomas/autoxgboost/blob/master/poster_2018.pdf>`__.
2. Task description
-------------------
Now we come back to our example "auto-gbdt", which runs with lightgbm and NNI. The data includes the :githublink:`train data <examples/trials/auto-gbdt/data/regression.train>` and the :githublink:`test data <examples/trials/auto-gbdt/data/regression.test>`.
Given the features and labels in the train data, we train a GBDT regression model and use it to predict.
3. How to run in NNI
--------------------
3.1 Install all the requirements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash
pip install lightgbm
pip install pandas
3.2 Prepare your trial code
^^^^^^^^^^^^^^^^^^^^^^^^^^^
You need to prepare basic code like the following:
.. code-block:: python
...

def get_default_parameters():
    ...
    return params

def load_data(train_path='./data/regression.train', test_path='./data/regression.test'):
    '''
    Load or create dataset
    '''
    ...
    return lgb_train, lgb_eval, X_test, y_test

def run(lgb_train, lgb_eval, params, X_test, y_test):
    # train
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=20,
                    valid_sets=lgb_eval,
                    early_stopping_rounds=5)
    # predict
    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
    # eval
    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    print('The rmse of prediction is:', rmse)

if __name__ == '__main__':
    lgb_train, lgb_eval, X_test, y_test = load_data()
    PARAMS = get_default_parameters()
    # train
    run(lgb_train, lgb_eval, PARAMS, X_test, y_test)
3.3 Prepare your search space
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you would like to tune ``num_leaves``\ , ``learning_rate``\ , ``bagging_fraction``\ , and ``bagging_freq``\ , you could write a :githublink:`search_space.json <examples/trials/auto-gbdt/search_space.json>` as follows:
.. code-block:: json
{
  "num_leaves": {"_type": "choice", "_value": [31, 28, 24, 20]},
  "learning_rate": {"_type": "choice", "_value": [0.01, 0.05, 0.1, 0.2]},
  "bagging_fraction": {"_type": "uniform", "_value": [0.7, 1.0]},
  "bagging_freq": {"_type": "choice", "_value": [1, 2, 4, 8, 10]}
}
More supported variable types can be found `here <../Tutorial/SearchSpaceSpec.rst>`__.
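With this search space, each call to ``nni.get_next_parameter()`` in the trial returns one sampled configuration as a plain Python dict, conceptually like the sketch below (the values are illustrative, not fixed):
.. code-block:: python

# one possible configuration sampled from the search space above
RECEIVED_PARAMS = {
    'num_leaves': 24,
    'learning_rate': 0.05,
    'bagging_fraction': 0.85,
    'bagging_freq': 4,
}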
3.4 Add the NNI SDK into your code
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: diff
+import nni
...

def get_default_parameters():
    ...
    return params

def load_data(train_path='./data/regression.train', test_path='./data/regression.test'):
    '''
    Load or create dataset
    '''
    ...
    return lgb_train, lgb_eval, X_test, y_test

def run(lgb_train, lgb_eval, params, X_test, y_test):
    # train
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=20,
                    valid_sets=lgb_eval,
                    early_stopping_rounds=5)
    # predict
    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
    # eval
    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    print('The rmse of prediction is:', rmse)
+   nni.report_final_result(rmse)

if __name__ == '__main__':
    lgb_train, lgb_eval, X_test, y_test = load_data()
+   RECEIVED_PARAMS = nni.get_next_parameter()
    PARAMS = get_default_parameters()
+   PARAMS.update(RECEIVED_PARAMS)
    # train
    run(lgb_train, lgb_eval, PARAMS, X_test, y_test)
3.5 Write a config file and run it
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In the config file, you can configure settings including:
* Experiment setting: ``trialConcurrency``\ , ``maxExecDuration``\ , ``maxTrialNum``\ , ``trial gpuNum``\ , etc.
* Platform setting: ``trainingServicePlatform``\ , etc.
* Path settings: ``searchSpacePath``\ , ``trial codeDir``\ , etc.
* Algorithm setting: select ``tuner`` algorithm, ``tuner optimize_mode``\ , etc.
A config.yml looks as follows:
.. code-block:: yaml
authorName: default
experimentName: example_auto-gbdt
trialConcurrency: 1
maxExecDuration: 10h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: local
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: minimize
trial:
  command: python3 main.py
  codeDir: .
  gpuNum: 0
Run this experiment with the following command:
.. code-block:: bash
nnictl create --config ./config.yml
Knowledge Distillation on NNI
=============================
KnowledgeDistill
----------------
NNI supports knowledge distillation. In `Distilling the Knowledge in a Neural Network <https://arxiv.org/abs/1503.02531>`__\ , the compressed model is trained to mimic a pre-trained, larger model. This training setting is also referred to as "teacher-student", where the large model is the teacher and the small model is the student.
.. image:: ../../img/distill.png
:target: ../../img/distill.png
:alt:
Usage
^^^^^
PyTorch code
.. code-block:: python
from knowledge_distill.knowledge_distill import KnowledgeDistill

kd = KnowledgeDistill(kd_teacher_model, kd_T=5)
alpha = 1
beta = 0.8
for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data)
    loss = F.cross_entropy(output, target)
    # you only need to add the following line to fine-tune with knowledge distillation
    loss = alpha * loss + beta * kd.loss(data=data, student_out=output)
    loss.backward()
User configuration for KnowledgeDistill
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* **kd_teacher_model:** The pre-trained teacher model
* **kd_T:** Temperature for smoothing the teacher model's output (sketched below)
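For intuition, the temperature-scaled distillation term can be sketched as follows; this mirrors the standard formulation from the paper above and is not necessarily NNI's exact implementation:
.. code-block:: python

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T):
    # soften both output distributions with temperature T, then measure
    # how far the student's distribution is from the teacher's
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction='batchmean') * (T * T)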
The complete code can be found `here <https://github.com/microsoft/nni/tree/v1.3/examples/model_compress/knowledge_distill/>`__.
.. role:: raw-html(raw)
:format: html
MNIST examples
==============
A CNN MNIST classifier for deep learning is similar to ``hello world`` for programming languages. Thus, we use MNIST as an example to introduce different features of NNI. The examples are listed below:
* `MNIST with NNI API (TensorFlow v1.x) <#mnist-tfv1>`__
* `MNIST with NNI API (TensorFlow v2.x) <#mnist-tfv2>`__
* `MNIST with NNI annotation <#mnist-annotation>`__
* `MNIST in keras <#mnist-keras>`__
* `MNIST -- tuning with batch tuner <#mnist-batch>`__
* `MNIST -- tuning with hyperband <#mnist-hyperband>`__
* `MNIST -- tuning within a nested search space <#mnist-nested>`__
* `distributed MNIST (tensorflow) using kubeflow <#mnist-kubeflow-tf>`__
* `distributed MNIST (pytorch) using kubeflow <#mnist-kubeflow-pytorch>`__
:raw-html:`<a name="mnist-tfv1"></a>`
**MNIST with NNI API (TensorFlow v1.x)**
This is a simple network which has two convolutional layers, two pooling layers, and a fully connected layer. We tune hyperparameters such as dropout rate, convolution size, hidden size, etc. It can be tuned with most NNI built-in tuners, such as TPE, SMAC, and Random. We also provide an example YAML file which enables the assessor.
``code directory: examples/trials/mnist-tfv1/``
:raw-html:`<a name="mnist-tfv2"></a>`
**MNIST with NNI API (TensorFlow v2.x)**
The same network as the example above, but written with the TensorFlow v2.x Keras API.
``code directory: examples/trials/mnist-tfv2/``
:raw-html:`<a name="mnist-annotation"></a>`
**MNIST with NNI annotation**
This example is similar to the example above; the only difference is that this example uses NNI annotation to specify the search space and report results, while the example above uses NNI APIs to receive configuration and report results.
``code directory: examples/trials/mnist-annotation/``
:raw-html:`<a name="mnist-keras"></a>`
**MNIST in keras**
This example is implemented in Keras. It is also a network for the MNIST dataset, with two convolution layers, one pooling layer, and two fully connected layers.
``code directory: examples/trials/mnist-keras/``
:raw-html:`<a name="mnist-batch"></a>`
**MNIST -- tuning with batch tuner**
This example shows how to use the batch tuner. Users simply list all the configurations they want to try in the search space file, and NNI will try all of them.
``code directory: examples/trials/mnist-batch-tune-keras/``
:raw-html:`<a name="mnist-hyperband"></a>`
**MNIST -- tuning with hyperband**
This example shows how to use Hyperband to tune the model. There is one more key, ``STEPS``\ , in the received configuration for trials to control how long they can run (e.g., number of iterations).
``code directory: examples/trials/mnist-hyperband/``
:raw-html:`<a name="mnist-nested"></a>`
**MNIST -- tuning within a nested search space**
This example shows that NNI also supports nested search spaces. The search space file is an example of how to define a nested search space.
``code directory: examples/trials/mnist-nested-search-space/``
:raw-html:`<a name="mnist-kubeflow-tf"></a>`
**distributed MNIST (tensorflow) using kubeflow**
This example shows how to run distributed training on Kubeflow through NNI. Users simply provide distributed training code and a configuration file which specifies the kubeflow mode, for example, the command to run ps, the command to run worker, and how many resources they consume. This example is implemented in TensorFlow and thus uses the Kubeflow TensorFlow operator.
``code directory: examples/trials/mnist-distributed/``
:raw-html:`<a name="mnist-kubeflow-pytorch"></a>`
**distributed MNIST (pytorch) using kubeflow**
Similar to the previous example; the difference is that this example is implemented in PyTorch and thus uses the Kubeflow PyTorch operator.
``code directory: examples/trials/mnist-distributed-pytorch/``
.. role:: raw-html(raw)
:format: html
Tuning Tensor Operators on NNI
==============================
Overview
--------
Abundant applications raise the demand for training and inferring deep neural networks (DNNs) efficiently on diverse hardware platforms, ranging from cloud servers to embedded devices. Moreover, computational graph-level optimization of deep neural networks, like tensor operator fusion, may introduce new tensor operators. Thus, manually optimized tensor operators provided by hardware-specific libraries have limitations in supporting new hardware platforms or new operators, so automatically optimizing tensor operators on diverse hardware platforms is essential for the large-scale deployment and application of deep learning technologies in real-world problems.
Tensor operator optimization is substantially a combinatorial optimization problem. The objective function is the performance of a tensor operator on a specific hardware platform, which should be maximized with respect to the hyper-parameters of the corresponding device code, such as how to tile a matrix or whether to unroll a loop. Unlike many typical problems of this type, such as the travelling salesman problem, the objective function of tensor operator optimization is a black box and expensive to sample. One has to compile device code with a specific configuration and run it on real hardware to get the corresponding performance metric. Therefore, a desired method for optimizing tensor operators should find the best configuration with as few samples as possible.
The expensive objective function makes solving the tensor operator optimization problem with traditional combinatorial optimization methods, for example simulated annealing and evolutionary algorithms, almost impossible. Although these algorithms inherently support combinatorial search spaces, they do not take sample-efficiency into account, so thousands of samples or even more are usually needed, which is unacceptable when tuning tensor operators in production environments. On the other hand, sequential model-based optimization (SMBO) methods have proven sample-efficient for optimizing black-box functions with continuous search spaces. However, when optimizing ones with combinatorial search spaces, SMBO methods are not as sample-efficient as their continuous counterparts, because there is a lack of prior assumptions about the objective functions, such as continuity and differentiability in the case of continuous search spaces. For example, if one could assume an objective function with a continuous search space is infinitely differentiable, a Gaussian process with a radial basis function (RBF) kernel could be used to model the objective function. In this way, a sample provides not only a single value at a point but also the local properties of the objective function in its neighborhood, or even global properties, which results in high sample-efficiency. In contrast, SMBO methods for combinatorial optimization suffer from poor sample-efficiency due to the lack of proper prior assumptions and surrogate models which can leverage them.
OpEvo was recently proposed to solve this challenging problem. It efficiently explores the search spaces of tensor operators by introducing a topology-aware mutation operation based on a q-random walk distribution to leverage the topological structures over the search spaces. Following this example, you can use OpEvo to tune three representative types of tensor operators selected from two popular neural networks, BERT and AlexNet. Three comparison baselines, AutoTVM, G-BFS and N-A2C, are also provided. Please refer to `OpEvo: An Evolutionary Method for Tensor Operator Optimization <https://arxiv.org/abs/2006.05664>`__ for a detailed explanation of these algorithms.
Environment Setup
-----------------
We have prepared a dockerfile for setting up the experiment environment. Before starting, please make sure the Docker daemon is running and the driver of your GPU accelerator is properly installed. Enter the example folder ``examples/trials/systems/opevo`` and run the command below to build and instantiate a Docker image from the dockerfile.
.. code-block:: bash
# if you are using Nvidia GPU
make cuda-env
# if you are using AMD GPU
make rocm-env
Run Experiments
---------------
Three representative kinds of tensor operators, **matrix multiplication**\ , **batched matrix multiplication**\ , and **2D convolution**\ , are chosen from BERT and AlexNet and tuned with NNI. The ``Trial`` code for all tensor operators is ``/root/compiler_auto_tune_stable.py``\ , and the ``Search Space`` files and ``config`` files for each tuning algorithm are located in ``/root/experiments/``\ , categorized by tensor operator. Here ``/root`` refers to the root of the container.
For tuning the operators of matrix multiplication, please run the commands below from ``/root``\ :
.. code-block:: bash
# (N, K) x (K, M) represents a matrix of shape (N, K) multiplies a matrix of shape (K, M)
# (512, 1024) x (1024, 1024)
# tuning with OpEvo
nnictl create --config experiments/mm/N512K1024M1024/config_opevo.yml
# tuning with G-BFS
nnictl create --config experiments/mm/N512K1024M1024/config_gbfs.yml
# tuning with N-A2C
nnictl create --config experiments/mm/N512K1024M1024/config_na2c.yml
# tuning with AutoTVM
OP=matmul STEP=512 N=512 M=1024 K=1024 P=NN ./run.sh
# (512, 1024) x (1024, 4096)
# tuning with OpEvo
nnictl create --config experiments/mm/N512K1024M4096/config_opevo.yml
# tuning with G-BFS
nnictl create --config experiments/mm/N512K1024M4096/config_gbfs.yml
# tuning with N-A2C
nnictl create --config experiments/mm/N512K1024M4096/config_na2c.yml
# tuning with AutoTVM
OP=matmul STEP=512 N=512 M=1024 K=4096 P=NN ./run.sh
# (512, 4096) x (4096, 1024)
# tuning with OpEvo
nnictl create --config experiments/mm/N512K4096M1024/config_opevo.yml
# tuning with G-BFS
nnictl create --config experiments/mm/N512K4096M1024/config_gbfs.yml
# tuning with N-A2C
nnictl create --config experiments/mm/N512K4096M1024/config_na2c.yml
# tuning with AutoTVM
OP=matmul STEP=512 N=512 M=4096 K=1024 P=NN ./run.sh
For tuning the operators of batched matrix multiplication, please run the commands below from ``/root``\ :
.. code-block:: bash

   # a batched matrix of batch size 960 and shape (128, 128) multiplies a batched matrix of batch size 960 and shape (128, 64)
   # tuning with OpEvo
   nnictl create --config experiments/bmm/B960N128K128M64PNN/config_opevo.yml
   # tuning with AutoTVM
   OP=batch_matmul STEP=512 B=960 N=128 K=128 M=64 P=NN ./run.sh

   # a batched matrix of batch size 960 and shape (128, 128) is transposed first and then multiplies a batched matrix of batch size 960 and shape (128, 64)
   # tuning with OpEvo
   nnictl create --config experiments/bmm/B960N128K128M64PTN/config_opevo.yml
   # tuning with AutoTVM
   OP=batch_matmul STEP=512 B=960 N=128 K=128 M=64 P=TN ./run.sh

   # a batched matrix of batch size 960 and shape (128, 64) is transposed first and then right-multiplies another batched matrix of batch size 960 and shape (128, 64)
   # tuning with OpEvo
   nnictl create --config experiments/bmm/B960N128K64M128PNT/config_opevo.yml
   # tuning with AutoTVM
   OP=batch_matmul STEP=512 B=960 N=128 K=64 M=128 P=NT ./run.sh
For tuning the 2D convolution operators, please run the commands below from ``/root``\ :
.. code-block:: bash

   # image tensor of shape (512, 3, 227, 227) convolves with kernel tensor of shape (64, 3, 11, 11) with stride 4 and padding 0
   # tuning with OpEvo
   nnictl create --config experiments/conv/N512C3HW227F64K11ST4PD0/config_opevo.yml
   # tuning with AutoTVM
   OP=convfwd_direct STEP=512 N=512 C=3 H=227 W=227 F=64 K=11 ST=4 PD=0 ./run.sh

   # image tensor of shape (512, 64, 27, 27) convolves with kernel tensor of shape (192, 64, 5, 5) with stride 1 and padding 2
   # tuning with OpEvo
   nnictl create --config experiments/conv/N512C64HW27F192K5ST1PD2/config_opevo.yml
   # tuning with AutoTVM
   OP=convfwd_direct STEP=512 N=512 C=64 H=27 W=27 F=192 K=5 ST=1 PD=2 ./run.sh
Please note that G-BFS and N-A2C are designed only for tuning tiling schemes for multiplying matrices whose numbers of rows and columns are powers of 2. They are not compatible with other types of configuration spaces, and thus cannot tune the batched matrix multiplication and 2D convolution operators. The AutoTVM baseline is implemented by its authors in the TVM project, so its tuning results are printed on the screen rather than reported to the NNI manager. Port 8080 of the container is bound to the same port on the host, so one can access the NNI Web UI through ``host_ip_addr:8080`` and monitor the tuning process as in the screenshot below.
:raw-html:`<img src="../../../examples/trials/systems/opevo/screenshot.png" />`
Citing OpEvo
------------
If you find OpEvo helpful, please consider citing the paper as follows:
.. code-block:: bibtex

   @misc{gao2020opevo,
       title={OpEvo: An Evolutionary Method for Tensor Operator Optimization},
       author={Xiaotian Gao and Cui Wei and Lintao Zhang and Mao Yang},
       year={2020},
       eprint={2006.05664},
       archivePrefix={arXiv},
       primaryClass={cs.LG}
   }
Tuning RocksDB on NNI
=====================
Overview
--------
`RocksDB <https://github.com/facebook/rocksdb>`__ is a popular high-performance embedded key-value database used in production systems at various web-scale enterprises, including Facebook, Yahoo!, and LinkedIn. It is a fork of `LevelDB <https://github.com/google/leveldb>`__ by Facebook, optimized to exploit many central processing unit (CPU) cores and to make efficient use of fast storage, such as solid-state drives (SSDs), for input/output (I/O) bound workloads.
The performance of RocksDB is highly contingent on its tuning. However, because of the complexity of its underlying technology and the large number of configurable parameters, a good configuration is sometimes hard to obtain. NNI can help address this issue: it supports many kinds of tuning algorithms for searching for the best configuration of RocksDB, and many kinds of environments, such as local machines, remote servers, and the cloud.
This example illustrates how to use NNI to search for the best configuration of RocksDB for a ``fillrandom`` benchmark run with `db_bench <https://github.com/facebook/rocksdb/wiki/Benchmarking-tools>`__\ , the official benchmark tool provided by RocksDB itself. Before running this example, please make sure NNI is installed and ``db_bench`` is in your ``PATH``. Please refer to `here <../Tutorial/QuickStart.md>`__ for detailed information about installing and preparing the NNI environment, and `here <https://github.com/facebook/rocksdb/blob/master/INSTALL.rst>`__ for compiling RocksDB as well as ``db_bench``.
We also provide a simple script, :githublink:`db_bench_installation.sh <examples/trials/systems/rocksdb-fillrandom/db_bench_installation.sh>`\ , that helps compile and install ``db_bench`` and its dependencies on Ubuntu. Installation on other systems can follow a similar procedure.
*code directory: :githublink:`example/trials/systems/rocksdb-fillrandom <examples/trials/systems/rocksdb-fillrandom>`*
Experiment setup
----------------
There are mainly three steps to set up an experiment for tuning systems on NNI: define the search space with a ``json`` file, write benchmark code, and start the NNI experiment by passing a config file to the NNI manager.
Search Space
^^^^^^^^^^^^
For simplicity, this example tunes three parameters, ``write_buffer_size``\ , ``min_write_buffer_number_to_merge`` and ``level0_file_num_compaction_trigger``\ , for randomly writing 16M keys of 20 bytes each with values of 100 bytes each, measured by write operations per second (OPS). ``write_buffer_size`` sets the size of a single memtable; once a memtable exceeds this size, it is marked immutable and a new one is created. ``min_write_buffer_number_to_merge`` is the minimum number of memtables to be merged before being flushed to storage. Once the number of files in level 0 reaches ``level0_file_num_compaction_trigger``\ , compaction from level 0 to level 1 is triggered.
In this example, the search space is specified by a ``search_space.json`` file, as shown below; a small sketch of the ``quniform`` sampling semantics follows. A detailed explanation of search spaces can be found `here <../Tutorial/SearchSpaceSpec.rst>`__.
.. code-block:: json

   {
       "write_buffer_size": {
           "_type": "quniform",
           "_value": [2097152, 16777216, 1048576]
       },
       "min_write_buffer_number_to_merge": {
           "_type": "quniform",
           "_value": [2, 16, 1]
       },
       "level0_file_num_compaction_trigger": {
           "_type": "quniform",
           "_value": [2, 16, 1]
       }
   }
*code directory: :githublink:`example/trials/systems/rocksdb-fillrandom/search_space.json <examples/trials/systems/rocksdb-fillrandom/search_space.json>`*
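For intuition, ``quniform`` with ``_value`` set to ``[low, high, q]`` draws a uniform sample and rounds it to the nearest multiple of ``q`` (clipped back into range), so ``write_buffer_size`` above is effectively sampled from multiples of 1 MiB between 2 MiB and 16 MiB. A tiny sketch of these semantics:

.. code-block:: python

   import random

   def quniform(low, high, q):
       # round(uniform(low, high) / q) * q, clipped back into [low, high]
       return min(high, max(low, round(random.uniform(low, high) / q) * q))

   # write_buffer_size: multiples of 1 MiB (1048576) between 2 MiB and 16 MiB
   print(quniform(2097152, 16777216, 1048576))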
Benchmark code
^^^^^^^^^^^^^^
The benchmark code should receive a configuration from the NNI manager and report the corresponding benchmark result back. The following NNI APIs are designed for this purpose; a minimal sketch of a trial follows the code pointer below. In this example, write operations per second (OPS) is used as the performance metric. Please refer to `here <Trials.rst>`__ for detailed information.

* Use ``nni.get_next_parameter()`` to get the next system configuration.
* Use ``nni.report_final_result(metric)`` to report the benchmark result.
*code directory: :githublink:`example/trials/systems/rocksdb-fillrandom/main.py <examples/trials/systems/rocksdb-fillrandom/main.py>`*
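Below is a minimal sketch of what such a trial could look like, assuming ``db_bench`` is on ``PATH``. The flags and the ``ops/sec`` output format parsed here are assumptions about ``db_bench``\ 's usual interface; the actual trial code is in the ``main.py`` linked above.

.. code-block:: python

   import re
   import subprocess

   import nni

   def run_fillrandom(params):
       # Run db_bench's fillrandom benchmark with the sampled configuration.
       cmd = [
           "db_bench", "--benchmarks=fillrandom",
           "--num=16000000", "--key_size=20", "--value_size=100",
           "--write_buffer_size=%d" % params["write_buffer_size"],
           "--min_write_buffer_number_to_merge=%d"
           % params["min_write_buffer_number_to_merge"],
           "--level0_file_num_compaction_trigger=%d"
           % params["level0_file_num_compaction_trigger"],
       ]
       out = subprocess.run(cmd, capture_output=True, text=True).stdout
       # db_bench typically prints a line such as:
       #   fillrandom : 2.854 micros/op 350318 ops/sec; ...
       return float(re.search(r"fillrandom\s*:.*?(\d+)\s*ops/sec", out).group(1))

   if __name__ == "__main__":
       params = nni.get_next_parameter()                # configuration from the tuner
       nni.report_final_result(run_fillrandom(params))  # report write OPS back to NNI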
Config file
^^^^^^^^^^^
One can start an NNI experiment with a config file. A config file for NNI is a ``yaml`` file that usually includes experiment settings (\ ``trialConcurrency``\ , ``maxExecDuration``\ , ``maxTrialNum``\ , ``trial gpuNum``\ , etc.), platform settings (\ ``trainingServicePlatform``\ , etc.), path settings (\ ``searchSpacePath``\ , ``trial codeDir``\ , etc.), and tuner settings (\ ``tuner``\ , ``tuner optimize_mode``\ , etc.). Please refer to `here <../Tutorial/QuickStart.rst>`__ for more information.
Here is an example of tuning RocksDB with the SMAC algorithm:
*code directory: :githublink:`example/trials/systems/rocksdb-fillrandom/config_smac.yml <examples/trials/systems/rocksdb-fillrandom/config_smac.yml>`*
Here is an example of tuning RocksDB with the TPE algorithm:
*code directory: :githublink:`example/trials/systems/rocksdb-fillrandom/config_tpe.yml <examples/trials/systems/rocksdb-fillrandom/config_tpe.yml>`*
Other tuners can be easily adopted in the same way. Please refer to `here <../Tuner/BuiltinTuner.rst>`__ for more information.
Finally, we can enter the example folder and start the experiment with the following commands:
.. code-block:: bash

   # tuning RocksDB with SMAC tuner
   nnictl create --config ./config_smac.yml
   # tuning RocksDB with TPE tuner
   nnictl create --config ./config_tpe.yml
Experiment results
------------------
We ran these two examples on the same machine, with the following details:
* 16 * Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
* 465 GB of rotational hard drive with ext4 file system
* 128 GB of RAM
* Kernel version: 4.15.0-58-generic
* NNI version: v1.0-37-g1bd24577
* RocksDB version: 6.4
* RocksDB DEBUG_LEVEL: 0
The detailed experiment results are shown in the figure below. The horizontal axis is the sequential order of trials, and the vertical axis is the metric (write OPS in this example). Blue dots represent trials tuned with the SMAC tuner, and orange dots represent trials tuned with the TPE tuner.
.. image:: https://github.com/microsoft/nni/tree/v1.9/examples/trials/systems/rocksdb-fillrandom/plot.png
   :target: https://github.com/microsoft/nni/tree/v1.9/examples/trials/systems/rocksdb-fillrandom/plot.png
   :alt: image
The following table lists the best trial and the corresponding parameters and metric obtained by each tuner. Unsurprisingly, both of them found the same optimal configuration for the ``fillrandom`` benchmark.
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Tuner
     - Best trial
     - Best OPS
     - write_buffer_size
     - min_write_buffer_number_to_merge
     - level0_file_num_compaction_trigger
   * - SMAC
     - 255
     - 779289
     - 2097152
     - 7.0
     - 7.0
   * - TPE
     - 169
     - 761456
     - 2097152
     - 7.0
     - 7.0