Unverified Commit 4784cc6c authored by liuzhe-lz's avatar liuzhe-lz Committed by GitHub
Browse files

Merge pull request #3302 from microsoft/v2.0-merge

Merge branch v2.0 into master (no squash)
parents 25db55ca 349ead41
This diff is collapsed.
......@@ -11,44 +11,30 @@ Supported AI Frameworks
-----------------------
* :raw-html:`<b>[PyTorch]</b>` https://github.com/pytorch/pytorch
* `PyTorch <https://github.com/pytorch/pytorch>`__
.. raw:: html
* :githublink:`MNIST-pytorch <examples/trials/mnist-distributed-pytorch>`
* `CIFAR-10 <./TrialExample/Cifar10Examples.rst>`__
* :githublink:`TGS salt identification chanllenge <examples/trials/kaggle-tgs-salt/README.md>`
* :githublink:`Network_morphism <examples/trials/network_morphism/README.md>`
<ul>
<li><a href="../../examples/trials/mnist-distributed-pytorch">MNIST-pytorch</a><br/></li>
<li><a href="TrialExample/Cifar10Examples.md">CIFAR-10</a><br/></li>
<li><a href="../../examples/trials/kaggle-tgs-salt/README.md">TGS salt identification chanllenge</a><br/></li>
<li><a href="../../examples/trials/network_morphism/README.md">Network_morphism</a><br/></li>
</ul>
* `TensorFlow <https://github.com/tensorflow/tensorflow>`__
* :githublink:`MNIST-tensorflow <examples/trials/mnist-distributed>`
* :githublink:`Squad <examples/trials/ga_squad/README.md>`
* :raw-html:`<b>[TensorFlow]</b>` https://github.com/tensorflow/tensorflow
* `Keras <https://github.com/keras-team/keras>`__
.. raw:: html
* :githublink:`MNIST-keras <examples/trials/mnist-keras>`
* :githublink:`Network_morphism <examples/trials/network_morphism/README.md>`
<ul>
<li><a href="../../examples/trials/mnist-distributed">MNIST-tensorflow</a><br/></li>
<li><a href="../../examples/trials/ga_squad/README.md">Squad</a><br/></li>
</ul>
* :raw-html:`<b>[Keras]</b>` https://github.com/keras-team/keras
.. raw:: html
<ul>
<li><a href="../../examples/trials/mnist-keras">MNIST-keras</a><br/></li>
<li><a href="../../examples/trials/network_morphism/README.md">Network_morphism</a><br/></li>
</ul>
* :raw-html:`<b>[MXNet]</b>` https://github.com/apache/incubator-mxnet
* :raw-html:`<b>[Caffe2]</b>` https://github.com/BVLC/caffe
* :raw-html:`<b>[CNTK (Python language)]</b>` https://github.com/microsoft/CNTK
* :raw-html:`<b>[Spark MLlib]</b>` http://spark.apache.org/mllib/
* :raw-html:`<b>[Chainer]</b>` https://chainer.org/
* :raw-html:`<b>[Theano]</b>` https://pypi.org/project/Theano/ :raw-html:`<br/>`
* `MXNet <https://github.com/apache/incubator-mxnet>`__
* `Caffe2 <https://github.com/BVLC/caffe>`__
* `CNTK (Python language) <https://github.com/microsoft/CNTK>`__
* `Spark MLlib <http://spark.apache.org/mllib/>`__
* `Chainer <https://chainer.org/>`__
* `Theano <https://pypi.org/project/Theano/>`__
You are encouraged to `contribute more examples <Tutorial/Contributing.rst>`__ for other NNI users.
......@@ -58,22 +44,16 @@ Supported Library
NNI also supports all libraries written in python.Here are some common libraries, including some algorithms based on GBDT: XGBoost, CatBoost and lightGBM.
* :raw-html:`<b>[Scikit-learn]</b>` https://scikit-learn.org/stable/
.. raw:: html
* `Scikit-learn <https://scikit-learn.org/stable/>`__
<ul>
<li><a href="TrialExample/SklearnExamples.md">Scikit-learn</a><br/></li>
</ul>
* `Scikit-learn <TrialExample/SklearnExamples.rst>`__
* `XGBoost <https://xgboost.readthedocs.io/en/latest/>`__
* `CatBoost <https://catboost.ai/>`__
* `LightGBM <https://lightgbm.readthedocs.io/en/latest/>`__
* :raw-html:`<b>[XGBoost]</b>` https://xgboost.readthedocs.io/en/latest/
* :raw-html:`<b>[CatBoost]</b>` https://catboost.ai/
* :raw-html:`<b>[LightGBM]</b>` https://lightgbm.readthedocs.io/en/latest/
:raw-html:`<ul>
<li><a href="TrialExample/GbdtExample.md">Auto-gbdt</a><br/></li>
</ul>`
* `Auto-gbdt <TrialExample/GbdtExample.rst>`__
Here is just a small list of libraries that supported by NNI. If you are interested in NNI, you can refer to the `tutorial <TrialExample/Trials.rst>`__ to complete your own hacks.
In addition to the above examples, we also welcome more and more users to apply NNI to your own work, if you have any doubts, please refer `Write a Trial Run on NNI <TrialExample/Trials.md>`__. In particular, if you want to be a contributor of NNI, whether it is the sharing of examples , writing of Tuner or otherwise, we are all looking forward to your participation.More information please refer to `here <Tutorial/Contributing.rst>`__.
In addition to the above examples, we also welcome more and more users to apply NNI to your own work, if you have any doubts, please refer `Write a Trial Run on NNI <TrialExample/Trials.rst>`__. In particular, if you want to be a contributor of NNI, whether it is the sharing of examples , writing of Tuner or otherwise, we are all looking forward to your participation.More information please refer to `here <Tutorial/Contributing.rst>`__.
......@@ -42,7 +42,7 @@ Step 7. Open a command line and install AML package environment.
Run an experiment
-----------------
Use ``examples/trials/mnist-tfv1`` as an example. The NNI config YAML file's content is like:
Use ``examples/trials/mnist-pytorch`` as an example. The NNI config YAML file's content is like:
.. code-block:: yaml
......@@ -81,9 +81,8 @@ Compared with `LocalMode <LocalMode.rst>`__ trial configuration in aml mode have
* image
* required key. The docker image name used in job. NNI support image ``msranni/nni`` for running aml jobs.
.. code-block:: bash
Note: This image is build based on cuda environment, may not be suitable for CPU clusters in AML.
.. Note:: This image is build based on cuda environment, may not be suitable for CPU clusters in AML.
amlConfig:
......@@ -119,10 +118,10 @@ Run the following commands to start the example experiment:
.. code-block:: bash
git clone -b ${NNI_VERSION} https://github.com/microsoft/nni
cd nni/examples/trials/mnist-tfv1
cd nni/examples/trials/mnist-pytorch
# modify config_aml.yml ...
nnictl create --config config_aml.yml
Replace ``${NNI_VERSION}`` with a released version name or branch name, e.g., ``v1.9``.
Replace ``${NNI_VERSION}`` with a released version name or branch name, e.g., ``v2.0``.
......@@ -11,7 +11,7 @@ Prerequisite for Kubernetes Service
#. A **Kubernetes** cluster using Kubernetes 1.14 or later with storage. Follow this guideline to set up Kubernetes `on Azure <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , or `on-premise <https://kubernetes.io/docs/setup/>`__ with `cephfs <https://kubernetes.io/docs/concepts/storage/storage-classes/#ceph-rbd>`__\ , or `microk8s with storage add-on enabled <https://microk8s.io/docs/addons>`__.
#. Helm install **AdaptDL Scheduler** to your Kubernetes cluster. Follow this `guideline <https://adaptdl.readthedocs.io/en/latest/installation/install-adaptdl.html>`__ to setup AdaptDL scheduler.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as kubeconfig file's path. You can also specify other kubeconfig files by setting the** KUBECONFIG** environment variable. Refer this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use ``$(HOME)/.kube/config`` as kubeconfig file's path. You can also specify other kubeconfig files by setting the ** KUBECONFIG** environment variable. Refer this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. If your NNI trial job needs GPU resource, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure **Nvidia device plugin for Kubernetes**.
#. (Optional) Prepare a **NFS server** and export a general purpose mount as external storage.
#. Install **NNI**\ , follow the install guide `here <../Tutorial/QuickStart.rst>`__.
......@@ -76,7 +76,7 @@ Here is a template configuration specification to use AdaptDL as a training serv
storageSize: 1Gi
Those configs not mentioned below, are following the
`default specs defined in the NNI doc </Tutorial/ExperimentConfig.html#configuration-spec>`__.
`default specs defined </Tutorial/ExperimentConfig.rst#configuration-spec>`__ in the NNI doc.
* **trainingServicePlatform**\ : Choose ``adl`` to use the Kubernetes cluster with AdaptDL scheduler.
......
Run an Experiment on FrameworkController
========================================
===
NNI supports running experiment using `FrameworkController <https://github.com/Microsoft/frameworkcontroller>`__\ , called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.
Prerequisite for on-premises Kubernetes Service
......@@ -18,7 +17,7 @@ Prerequisite for on-premises Kubernetes Service
apt-get install nfs-common
#. Install **NNI**\ , follow the install guide `here <../Tutorial/QuickStart>`__.
7. Install **NNI**\ , follow the install guide `here <../Tutorial/QuickStart.rst>`__.
Prerequisite for Azure Kubernetes Service
-----------------------------------------
......@@ -101,7 +100,7 @@ If you use Azure Kubernetes Service, you should set ``frameworkcontrollerConfig
Note: You should explicitly set ``trainingServicePlatform: frameworkcontroller`` in NNI config YAML file if you want to start experiment in frameworkcontrollerConfig mode.
The trial's config format for NNI frameworkcontroller mode is a simple version of FrameworkController's official config, you could refer the `Tensorflow example of FrameworkController <https://github.com/Microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/cpu/tensorflowdistributedtrainingwithcpu.yaml>`__ for deep understanding.
The trial's config format for NNI frameworkcontroller mode is a simple version of FrameworkController's official config, you could refer the `Tensorflow example of FrameworkController <https://github.com/microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/ps/cpu/tensorflowdistributedtrainingwithcpu.yaml>`__ for deep understanding.
Trial configuration in frameworkcontroller mode have the following configuration keys:
......@@ -115,7 +114,7 @@ Trial configuration in frameworkcontroller mode have the following configuration
* cpuNum: the number of cpu device used in container.
* memoryMB: the memory limitaion to be specified in container.
* image: the docker image used to create pod and run the program.
* frameworkAttemptCompletionPolicy: the policy to run framework, please refer the `user-manual <https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.rst#frameworkattemptcompletionpolicy>`__ to get the specific information. Users could use the policy to control the pod, for example, if ps does not stop, only worker stops, The completion policy could helps stop ps.
* frameworkAttemptCompletionPolicy: the policy to run framework, please refer the `user-manual <https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md#frameworkattemptcompletionpolicy>`__ to get the specific information. Users could use the policy to control the pod, for example, if ps does not stop, only worker stops, The completion policy could helps stop ps.
How to run example
------------------
......
......@@ -15,7 +15,7 @@ System architecture
:alt:
The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of system, in charge of calling TrainingService to manage trial jobs and the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module to manage trial jobs, it communicates with nniManager module, and has different instance according to different training platform. For the time being, NNI supports `local platfrom <LocalMode.md>`__\ , `remote platfrom <RemoteMachineMode.md>`__\ , `PAI platfrom <PaiMode.md>`__\ , `kubeflow platform <KubeflowMode.md>`__ and `FrameworkController platfrom <FrameworkControllerMode.rst>`__.
The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of system, in charge of calling TrainingService to manage trial jobs and the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module to manage trial jobs, it communicates with nniManager module, and has different instance according to different training platform. For the time being, NNI supports `local platfrom <LocalMode.rst>`__\ , `remote platfrom <RemoteMachineMode.rst>`__\ , `PAI platfrom <PaiMode.rst>`__\ , `kubeflow platform <KubeflowMode.rst>`__ and `FrameworkController platfrom <FrameworkControllerMode.rst>`__.
In this document, we introduce the brief design of TrainingService. If users want to add a new TrainingService instance, they just need to complete a child class to implement TrainingService, don't need to understand the code detail of NNIManager, Dispatcher or other modules.
......
**Run an Experiment on Heterogeneous Mode**
**Run an Experiment on Hybrid Mode**
===========================================
Run NNI on heterogeneous mode means that NNI will run trials jobs in multiple kinds of training platforms. For example, NNI could submit trial jobs to remote machine and AML simultaneously
Run NNI on hybrid mode means that NNI will run trials jobs in multiple kinds of training platforms. For example, NNI could submit trial jobs to remote machine and AML simultaneously.
## Setup environment
NNI has supported [local](./LocalMode.md), [remote](./RemoteMachineMode.md), [pai](./PaiMode.md) and [AML](./AMLMode.md) for heterogeneous training service. Before starting an experiment using these mode, users should setup the corresponding environment for the platforms. More details about the environment setup could be found in the corresponding docs.
Setup environment
-----------------
NNI has supported `local <./LocalMode.rst>`__\ , `remote <./RemoteMachineMode.rst>`__\ , `PAI <./PaiMode.rst>`__\ , and `AML <./AMLMode.rst>`__ for hybrid training service. Before starting an experiment using these mode, users should setup the corresponding environment for the platforms. More details about the environment setup could be found in the corresponding docs.
Run an experiment
-----------------
## Run an experiment
Use `examples/trials/mnist-tfv1` as an example. The NNI config YAML file's content is like:
Use ``examples/trials/mnist-tfv1`` as an example. The NNI config YAML file's content is like:
.. code-block:: yaml
......@@ -18,7 +20,7 @@ Use `examples/trials/mnist-tfv1` as an example. The NNI config YAML file's conte
trialConcurrency: 2
maxExecDuration: 1h
maxTrialNum: 10
trainingServicePlatform: heterogeneous
trainingServicePlatform: hybrid
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
......@@ -31,7 +33,7 @@ Use `examples/trials/mnist-tfv1` as an example. The NNI config YAML file's conte
command: python3 mnist.py
codeDir: .
gpuNum: 1
heterogeneousConfig:
hybridConfig:
trainingServicePlatforms:
- local
- remote
......@@ -42,11 +44,11 @@ Use `examples/trials/mnist-tfv1` as an example. The NNI config YAML file's conte
username: bob
passwd: bob123
Configurations for heterogeneous mode:
Configurations for hybrid mode:
heterogeneousConfig:
* trainingServicePlatforms. required key. This field specify the platforms used in heterogeneous mode, the values using yaml list format. NNI support setting `local`, `remote`, `aml`, `pai` in this field.
hybridConfig:
* trainingServicePlatforms. required key. This field specify the platforms used in hybrid mode, the values using yaml list format. NNI support setting ``local``, ``remote``, ``aml``, ``pai`` in this field.
Note:
If setting a platform in trainingServicePlatforms mode, users should also set the corresponding configuration for the platform. For example, if set `remote` as one of the platform, should also set `machineList` and `remoteConfig` configuration.
.. Note:: If setting a platform in trainingServicePlatforms mode, users should also set the corresponding configuration for the platform. For example, if set ``remote`` as one of the platform, should also set ``machineList`` and ``remoteConfig`` configuration. Local platform in hybrid mode does not support windows for now.
Run an Experiment on Kubeflow
=============================
===
Now NNI supports running experiment on `Kubeflow <https://github.com/kubeflow/kubeflow>`__\ , called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or `Azure Kubernetes Service(AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , a Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, `here <https://kubernetes.io/docs/tutorials/kubernetes-basics/>`__ is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.
Prerequisite for on-premises Kubernetes Service
......@@ -11,16 +9,16 @@ Prerequisite for on-premises Kubernetes Service
#. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this `guideline <https://kubernetes.io/docs/setup/>`__ to set up Kubernetes
#. Download, set up, and deploy **Kubeflow** to your Kubernetes cluster. Follow this `guideline <https://www.kubeflow.org/docs/started/getting-started/>`__ to setup Kubeflow.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use $(HOME)/.kube/config as kubeconfig file's path. You can also specify other kubeconfig files by setting the**KUBECONFIG** environment variable. Refer this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use ``$(HOME)/.kube/config`` as kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. If your NNI trial job needs GPU resource, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure **Nvidia device plugin for Kubernetes**.
#. Prepare a **NFS server** and export a general purpose mount (we recommend to map your NFS server path in ``root_squash option``\ , otherwise permission issue may raise when NNI copy files to NFS. Refer this `page <https://linux.die.net/man/5/exports>`__ to learn what root_squash option is), or**Azure File Storage**.
#. Prepare a **NFS server** and export a general purpose mount (we recommend to map your NFS server path in ``root_squash option``\ , otherwise permission issue may raise when NNI copy files to NFS. Refer this `page <https://linux.die.net/man/5/exports>`__ to learn what root_squash option is), or **Azure File Storage**.
#. Install **NFS client** on the machine where you install NNI and run nnictl to create experiment. Run this command to install NFSv4 client:
.. code-block:: bash
apt-get install nfs-common
#. Install **NNI**\ , follow the install guide `here <../Tutorial/QuickStart>`__.
7. Install **NNI**\ , follow the install guide `here <../Tutorial/QuickStart.rst>`__.
Prerequisite for Azure Kubernetes Service
-----------------------------------------
......@@ -231,6 +229,8 @@ Trial configuration in kubeflow mode have the following configuration keys:
* Required key. The API version of your Kubeflow.
.. cannot find :githublink:`msranni/nni <deployment/docker/Dockerfile>`
* ps (optional). This config section is used to configure Tensorflow parameter server role.
* master(optional). This config section is used to configure PyTorch parameter server role.
......
**Tutorial: Create and Run an Experiment on local with NNI API**
====================================================================
================================================================
In this tutorial, we will use the example in [~/examples/trials/mnist-tfv1] to explain how to create and run an experiment on local with NNI API.
In this tutorial, we will use the example in [nni/examples/trials/mnist-pytorch] to explain how to create and run an experiment on local with NNI API.
..
......@@ -17,23 +17,25 @@ You have an implementation for MNIST classifer using convolutional layers, the P
To enable NNI API, make the following changes:
* Declare NNI API: include ``import nni`` in your trial code to use NNI APIs.
* Get predefined parameters
1.1 Declare NNI API: include ``import nni`` in your trial code to use NNI APIs.
1.2 Get predefined parameters
Use the following code snippet:
.. code-block:: python
RECEIVED_PARAMS = nni.get_next_parameter()
tuner_params = nni.get_next_parameter()
to get hyper-parameters' values assigned by tuner. ``RECEIVED_PARAMS`` is an object, for example:
to get hyper-parameters' values assigned by tuner. ``tuner_params`` is an object, for example:
.. code-block:: json
{"conv_size": 2, "hidden_size": 124, "learning_rate": 0.0307, "dropout_rate": 0.2029}
{"batch_size": 32, "hidden_size": 128, "lr": 0.01, "momentum": 0.2029}
..
* Report NNI results: Use the API: ``nni.report_intermediate_result(accuracy)`` to send ``accuracy`` to assessor.
Use the API: ``nni.report_final_result(accuracy)`` to send `accuracy` to tuner.
1.3 Report NNI results: Use the API: ``nni.report_intermediate_result(accuracy)`` to send ``accuracy`` to assessor. Use the API: ``nni.report_final_result(accuracy)`` to send `accuracy` to tuner.
We had made the changes and saved it to ``mnist.py``.
......@@ -54,12 +56,12 @@ The hyper-parameters used in ``Step 1.2 - Get predefined parameters`` is defined
.. code-block:: bash
{
"dropout_rate":{"_type":"uniform","_value":[0.1,0.5]},
"conv_size":{"_type":"choice","_value":[2,3,5,7]},
"hidden_size":{"_type":"choice","_value":[124, 512, 1024]},
"learning_rate":{"_type":"uniform","_value":[0.0001, 0.1]}
}
{
"batch_size": {"_type":"choice", "_value": [16, 32, 64, 128]},
"hidden_size":{"_type":"choice","_value":[128, 256, 512, 1024]},
"lr":{"_type":"choice","_value":[0.0001, 0.001, 0.01, 0.1]},
"momentum":{"_type":"uniform","_value":[0, 1]}
}
Refer to `define search space <../Tutorial/SearchSpaceSpec.rst>`__ to learn more about search space.
......@@ -91,7 +93,7 @@ To run an experiment in NNI, you only needed:
..
A set of examples can be found in ~/nni/examples after your installation, run ``ls ~/nni/examples/trials`` to see all the trial examples.
You can download nni source code and a set of examples can be found in ``nni/examples``, run ``ls nni/examples/trials`` to see all the trial examples.
Let's use a simple trial example, e.g. mnist, provided by NNI. After you installed NNI, NNI examples have been put in ~/nni/examples, run ``ls ~/nni/examples/trials`` to see all the trial examples. You can simply execute the following command to run the NNI mnist example:
......
......@@ -6,14 +6,14 @@ What is Training Service?
NNI training service is designed to allow users to focus on AutoML itself, agnostic to the underlying computing infrastructure where the trials are actually run. When migrating from one cluster to another (e.g., local machine to Kubeflow), users only need to tweak several configurations, and the experiment can be easily scaled.
Users can use training service provided by NNI, to run trial jobs on `local machine <./LocalMode.md>`__\ , `remote machines <./RemoteMachineMode.md>`__\ , and on clusters like `PAI <./PaiMode.md>`__\ , `Kubeflow <./KubeflowMode.md>`__\ , `AdaptDL <./AdaptDLMode.md>`__\ , `FrameworkController <./FrameworkControllerMode.md>`__\ , `DLTS <./DLTSMode.md>`__ and `AML <./AMLMode.rst>`__. These are called *built-in training services*.
Users can use training service provided by NNI, to run trial jobs on `local machine <./LocalMode.rst>`__\ , `remote machines <./RemoteMachineMode.rst>`__\ , and on clusters like `PAI <./PaiMode.rst>`__\ , `Kubeflow <./KubeflowMode.rst>`__\ , `AdaptDL <./AdaptDLMode.rst>`__\ , `FrameworkController <./FrameworkControllerMode.rst>`__\ , `DLTS <./DLTSMode.rst>`__ and `AML <./AMLMode.rst>`__. These are called *built-in training services*.
If the computing resource customers try to use is not listed above, NNI provides interface that allows users to build their own training service easily. Please refer to "\ `how to implement training service <./HowToImplementTrainingService>`__\ " for details.
If the computing resource customers try to use is not listed above, NNI provides interface that allows users to build their own training service easily. Please refer to `how to implement training service <./HowToImplementTrainingService.rst>`__ for details.
How to use Training Service?
----------------------------
Training service needs to be chosen and configured properly in experiment configuration YAML file. Users could refer to the document of each training service for how to write the configuration. Also, `reference <../Tutorial/ExperimentConfig>`__ provides more details on the specification of the experiment configuration file.
Training service needs to be chosen and configured properly in experiment configuration YAML file. Users could refer to the document of each training service for how to write the configuration. Also, `reference <../Tutorial/ExperimentConfig.rst>`__ provides more details on the specification of the experiment configuration file.
Next, users should prepare code directory, which is specified as ``codeDir`` in config file. Please note that in non-local mode, the code directory will be uploaded to remote or cluster before the experiment. Therefore, we limit the number of files to 2000 and total size to 300MB. If the code directory contains too many files, users can choose which files and subfolders should be excluded by adding a ``.nniignore`` file that works like a ``.gitignore`` file. For more details on how to write this file, see :githublink:`this example <examples/trials/mnist-tfv1/.nniignore>` and the `git documentation <https://git-scm.com/docs/gitignore#_pattern_format>`__.
......@@ -28,21 +28,21 @@ Built-in Training Services
* - TrainingService
- Brief Introduction
* - `**Local** <./LocalMode.rst>`__
* - `Local <./LocalMode.rst>`__
- NNI supports running an experiment on local machine, called local mode. Local mode means that NNI will run the trial jobs and nniManager process in same machine, and support gpu schedule function for trial jobs.
* - `**Remote** <./RemoteMachineMode.rst>`__
* - `Remote <./RemoteMachineMode.rst>`__
- NNI supports running an experiment on multiple machines through SSH channel, called remote mode. NNI assumes that you have access to those machines, and already setup the environment for running deep learning training code. NNI will submit the trial jobs in remote machine, and schedule suitable machine with enough gpu resource if specified.
* - `**PAI** <./PaiMode.rst>`__
* - `PAI <./PaiMode.rst>`__
- NNI supports running an experiment on `OpenPAI <https://github.com/Microsoft/pai>`__ (aka PAI), called PAI mode. Before starting to use NNI PAI mode, you should have an account to access an `OpenPAI <https://github.com/Microsoft/pai>`__ cluster. See `here <https://github.com/Microsoft/pai#how-to-deploy>`__ if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. In PAI mode, your trial program will run in PAI's container created by Docker.
* - `**Kubeflow** <./KubeflowMode.rst>`__
* - `Kubeflow <./KubeflowMode.rst>`__
- NNI supports running experiment on `Kubeflow <https://github.com/kubeflow/kubeflow>`__\ , called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or `Azure Kubernetes Service(AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , a Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is setup to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, `here <https://kubernetes.io/docs/tutorials/kubernetes-basics/>`__ is a good start. In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.
* - `**AdaptDL** <./AdaptDLMode.rst>`__
- NNI supports running experiment on `AdaptDL <https://github.com/petuum/adaptdl>`__\ , called AdaptDL mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster.
* - `**FrameworkController** <./FrameworkControllerMode.rst>`__
* - `AdaptDL <./AdaptDLMode.rst>`__
- NNI supports running experiment on `AdaptDL <https://github.com/petuum/adaptdl>`__\ , called AdaptDL mode. Before starting to use AdaptDL mode, you should have a Kubernetes cluster.
* - `FrameworkController <./FrameworkControllerMode.rst>`__
- NNI supports running experiment using `FrameworkController <https://github.com/Microsoft/frameworkcontroller>`__\ , called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator. Now you can use FrameworkController as the training service to run NNI experiment.
* - `**DLTS** <./DLTSMode.rst>`__
* - `DLTS <./DLTSMode.rst>`__
- NNI supports running experiment using `DLTS <https://github.com/microsoft/DLWorkspace.git>`__\ , which is an open source toolkit, developed by Microsoft, that allows AI scientists to spin up an AI cluster in turn-key fashion.
* - `**AML** <./AMLMode.rst>`__
* - `AML <./AMLMode.rst>`__
- NNI supports running an experiment on `AML <https://azure.microsoft.com/en-us/services/machine-learning/>`__ , called aml mode.
......@@ -57,7 +57,7 @@ What does Training Service do?
</p>
According to the architecture shown in `Overview <../Overview>`__\ , training service (platform) is actually responsible for two events: 1) initiating a new trial; 2) collecting metrics and communicating with NNI core (NNI manager); 3) monitoring trial job status. To demonstrated in detail how training service works, we show the workflow of training service from the very beginning to the moment when first trial succeeds.
According to the architecture shown in `Overview <../Overview.rst>`__\ , training service (platform) is actually responsible for two events: 1) initiating a new trial; 2) collecting metrics and communicating with NNI core (NNI manager); 3) monitoring trial job status. To demonstrated in detail how training service works, we show the workflow of training service from the very beginning to the moment when first trial succeeds.
Step 1. **Validate config and prepare the training platform.** Training service will first check whether the training platform user specifies is valid (e.g., is there anything wrong with authentication). After that, training service will start to prepare for the experiment by making the code directory (\ ``codeDir``\ ) accessible to training platform.
......@@ -65,6 +65,6 @@ Step 1. **Validate config and prepare the training platform.** Training service
Step 2. **Submit the first trial.** To initiate a trial, usually (in non-reuse mode), NNI copies another few files (including parameters, launch script and etc.) onto training platform. After that, NNI launches the trial through subprocess, SSH, RESTful API, and etc.
.. Warning:: The working directory of trial command has exactly the same content as ``codeDir``, but can have a differen path (even on differen machines) Local mode is the only training service that shares one ``codeDir`` across all trials. Other training services copies a ``codeDir`` from the shared copy prepared in step 1 and each trial has an independent working directory. We strongly advise users not to rely on the shared behavior in local mode, as it will make your experiments difficult to scale to other training services.
.. Warning:: The working directory of trial command has exactly the same content as ``codeDir``, but can have different paths (even on different machines) Local mode is the only training service that shares one ``codeDir`` across all trials. Other training services copies a ``codeDir`` from the shared copy prepared in step 1 and each trial has an independent working directory. We strongly advise users not to rely on the shared behavior in local mode, as it will make your experiments difficult to scale to other training services.
Step 3. **Collect metrics.** NNI then monitors the status of trial, updates the status (e.g., from ``WAITING`` to ``RUNNING``\ , ``RUNNING`` to ``SUCCEEDED``\ ) recorded, and also collects the metrics. Currently, most training services are implemented in an "active" way, i.e., training service will call the RESTful API on NNI manager to update the metrics. Note that this usually requires the machine that runs NNI manager to be at least accessible to the worker node.
......@@ -110,7 +110,7 @@ Note: You should set ``trainingServicePlatform: pai`` in NNI config YAML file if
Trial configurations
^^^^^^^^^^^^^^^^^^^^
Compared with `LocalMode <LocalMode.md>`__ and `RemoteMachineMode <RemoteMachineMode.rst>`__\ , ``trial`` configuration in pai mode has the following additional keys:
Compared with `LocalMode <LocalMode.rst>`__ and `RemoteMachineMode <RemoteMachineMode.rst>`__\ , ``trial`` configuration in pai mode has the following additional keys:
*
......@@ -130,6 +130,8 @@ Compared with `LocalMode <LocalMode.md>`__ and `RemoteMachineMode <RemoteMachine
We already build a docker image :githublink:`nnimsra/nni <deployment/docker/Dockerfile>`. You can either use this image directly in your config file, or build your own image based on it. If it is not set in trial configuration, it should be set in the config file specified in ``paiConfigPath`` field.
.. cannot find :githublink:`nnimsra/nni <deployment/docker/Dockerfile>`
*
virtualCluster
......@@ -165,7 +167,7 @@ Compared with `LocalMode <LocalMode.md>`__ and `RemoteMachineMode <RemoteMachine
#.
The job name in OpenPAI's configuration file will be replaced by a new job name, the new job name is created by NNI, the name format is nni\ *exp*\ ${this.experimentId}*trial*\ ${trialJobId}.
The job name in OpenPAI's configuration file will be replaced by a new job name, the new job name is created by NNI, the name format is ``nni_exp_{this.experimentId}_trial_{trialJobId}`` .
#.
If users set multiple taskRoles in OpenPAI's configuration file, NNI will wrap all of these taksRoles and start multiple tasks in one trial job, users should ensure that only one taskRole report metric to NNI, otherwise there might be some conflict error.
......@@ -216,7 +218,7 @@ Once a trial job is completed, you can goto NNI WebUI's overview page (like http
Expand a trial information in trial list view, click the logPath link like:
.. image:: ../../img/nni_webui_joblist.jpg
.. image:: ../../img/nni_webui_joblist.png
:scale: 30%
And you will be redirected to HDFS web portal to browse the output files of that trial in HDFS:
......@@ -245,5 +247,5 @@ Check policy:
If you could not run your experiment and want to know if it is caused by version check, you could check your webUI, and there will be an error message about version check.
.. image:: ../../img/version_check.png
.. image:: ../../img/webui-img/experimentError.png
:scale: 80%
.. role:: raw-html(raw)
:format: html
**Run an Experiment on OpenpaiYarn**
========================================
The original ``pai`` mode is modificated to ``paiYarn`` mode, which is a distributed training platform based on Yarn.
Setup environment
-----------------
Install NNI, follow the install guide `here <../Tutorial/QuickStart.rst>`__.
Run an experiment
-----------------
Use ``examples/trials/mnist-tfv1`` as an example. The NNI config YAML file's content is like:
.. code-block:: yaml
authorName: your_name
experimentName: auto_mnist
# how many trials could be concurrently running
trialConcurrency: 2
# maximum experiment running duration
maxExecDuration: 3h
# empty means never stop
maxTrialNum: 100
# choice: local, remote, pai, paiYarn
trainingServicePlatform: paiYarn
# search space file
searchSpacePath: search_space.json
# choice: true, false
useAnnotation: false
tuner:
builtinTunerName: TPE
classArgs:
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: ~/nni/examples/trials/mnist-tfv1
gpuNum: 0
cpuNum: 1
memoryMB: 8196
image: msranni/nni:latest
# Configuration to access OpenpaiYarn Cluster
paiYarnConfig:
userName: your_paiYarn_nni_user
passWord: your_paiYarn_password
host: 10.1.1.1
Note: You should set ``trainingServicePlatform: paiYarn`` in NNI config YAML file if you want to start experiment in paiYarn mode.
Compared with `LocalMode <LocalMode.md>`__ and `RemoteMachineMode <RemoteMachineMode.rst>`__\ , trial configuration in paiYarn mode have these additional keys:
* cpuNum
* Required key. Should be positive number based on your trial program's CPU requirement
* memoryMB
* Required key. Should be positive number based on your trial program's memory requirement
* image
* Required key. In paiYarn mode, your trial program will be scheduled by OpenpaiYarn to run in `Docker container <https://www.docker.com/>`__. This key is used to specify the Docker image used to create the container in which your trial will run.
* We already build a docker image :githublink:`nnimsra/nni <deployment/docker/Dockerfile>`. You can either use this image directly in your config file, or build your own image based on it.
* virtualCluster
* Optional key. Set the virtualCluster of OpenpaiYarn. If omitted, the job will run on default virtual cluster.
* shmMB
* Optional key. Set the shmMB configuration of OpenpaiYarn, it set the shared memory for one task in the task role.
* authFile
* Optional key, Set the auth file path for private registry while using paiYarn mode, `Refer <https://github.com/microsoft/paiYarn/blob/2ea69b45faa018662bc164ed7733f6fdbb4c42b3/docs/faq.rst#q-how-to-use-private-docker-registry-job-image-when-submitting-an-openpaiYarn-job>`__\ , you can prepare the authFile and simply provide the local path of this file, NNI will upload this file to HDFS for you.
*
portList
*
Optional key. Set the portList configuration of OpenpaiYarn, it specifies a list of port used in container, `Refer <https://github.com/microsoft/paiYarn/blob/b2324866d0280a2d22958717ea6025740f71b9f0/docs/job_tutorial.rst#specification>`__.\ :raw-html:`<br>`
The config schema in NNI is shown below:
.. code-block:: bash
portList:
- label: test
beginAt: 8080
portNumber: 2
Let's say you want to launch a tensorboard in the mnist example using the port. So the first step is to write a wrapper script ``launch_paiYarn.sh`` of ``mnist.py``.
.. code-block:: bash
export TENSORBOARD_PORT=paiYarn_PORT_LIST_${paiYarn_CURRENT_TASK_ROLE_NAME}_0_tensorboard
tensorboard --logdir . --port ${!TENSORBOARD_PORT} &
python3 mnist.py
The config file of portList should be filled as following:
.. code-block:: yaml
trial:
command: bash launch_paiYarn.sh
portList:
- label: tensorboard
beginAt: 0
portNumber: 1
NNI support two kind of authorization method in paiYarn, including password and paiYarn token, `refer <https://github.com/microsoft/paiYarn/blob/b6bd2ab1c8890f91b7ac5859743274d2aa923c22/docs/rest-server/API.rst#2-authentication>`__. The authorization is configured in ``paiYarnConfig`` field.\ :raw-html:`<br>`
For password authorization, the ``paiYarnConfig`` schema is:
.. code-block:: bash
paiYarnConfig:
userName: your_paiYarn_nni_user
passWord: your_paiYarn_password
host: 10.1.1.1
For paiYarn token authorization, the ``paiYarnConfig`` schema is:
.. code-block:: bash
paiYarnConfig:
userName: your_paiYarn_nni_user
token: your_paiYarn_token
host: 10.1.1.1
Once complete to fill NNI experiment config file and save (for example, save as exp_paiYarn.yml), then run the following command
.. code-block:: bash
nnictl create --config exp_paiYarn.yml
to start the experiment in paiYarn mode. NNI will create OpenpaiYarn job for each trial, and the job name format is something like ``nni_exp_{experiment_id}_trial_{trial_id}``.
You can see jobs created by NNI in the OpenpaiYarn cluster's web portal, like:
.. image:: ../../img/nni_pai_joblist.jpg
:target: ../../img/nni_pai_joblist.jpg
:alt:
Notice: In paiYarn mode, NNIManager will start a rest server and listen on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is ``8080``\ , the rest server will listen on ``8081``\ , to receive metrics from trial job running in Kubernetes. So you should ``enable 8081`` TCP port in your firewall rule to allow incoming traffic.
Once a trial job is completed, you can goto NNI WebUI's overview page (like http://localhost:8080/oview) to check trial's information.
Expand a trial information in trial list view, click the logPath link like:
.. image:: ../../img/nni_webui_joblist.jpg
:target: ../../img/nni_webui_joblist.jpg
:alt:
And you will be redirected to HDFS web portal to browse the output files of that trial in HDFS:
.. image:: ../../img/nni_trial_hdfs_output.jpg
:target: ../../img/nni_trial_hdfs_output.jpg
:alt:
You can see there're three fils in output folder: stderr, stdout, and trial.log
data management
---------------
If your training data is not too large, it could be put into codeDir, and nni will upload the data to hdfs, or you could build your own docker image with the data. If you have large dataset, it's not appropriate to put the data in codeDir, and you could follow the `guidance <https://github.com/microsoft/paiYarn/blob/master/docs/user/storage.rst>`__ to mount the data folder in container.
If you also want to save trial's other output into HDFS, like model files, you can use environment variable ``NNI_OUTPUT_DIR`` in your trial code to save your own output files, and NNI SDK will copy all the files in ``NNI_OUTPUT_DIR`` from trial's container to HDFS, the target path is ``hdfs://host:port/{username}/nni/{experiments}/{experimentId}/trials/{trialId}/nnioutput``
version check
-------------
NNI support version check feature in since version 0.6. It is a policy to insure the version of NNIManager is consistent with trialKeeper, and avoid errors caused by version incompatibility.
Check policy:
#. NNIManager before v0.6 could run any version of trialKeeper, trialKeeper support backward compatibility.
#. Since version 0.6, NNIManager version should keep same with triakKeeper version. For example, if NNIManager version is 0.6, trialKeeper version should be 0.6 too.
#. Note that the version check feature only check first two digits of version.For example, NNIManager v0.6.1 could use trialKeeper v0.6 or trialKeeper v0.6.2, but could not use trialKeeper v0.5.1 or trialKeeper v0.7.
If you could not run your experiment and want to know if it is caused by version check, you could check your webUI, and there will be an error message about version check.
.. image:: ../../img/version_check.png
:target: ../../img/version_check.png
:alt:
......@@ -217,3 +217,11 @@ If you want multiple commands to be executed, you can use ``&&`` to connect thes
preCommand: command1 && command2 && command3
**Note**\ : Because **preCommand** will execute before other commands each time, it is strongly not recommended to set **preCommand** that will make changes to system, i.e. ``mkdir`` or ``touch``.
Remote machine supports running experiment in reuse mode. In this mode, NNI will reuse remote machine jobs to run as many as possible trials. It can save time of creating new jobs. User needs to make sure each trial can run independent in the same job, for example, avoid loading checkpoint from previous trials.
Follow the setting to enable reuse mode:
.. code-block:: yaml
remoteConfig:
reuse: true
\ No newline at end of file
......@@ -13,7 +13,7 @@ As we all know, the choice of model optimizer is directly affects the performanc
In this example, we have selected the following common deep learning optimizer:
..
.. code-block:: bash
"SGD", "Adadelta", "Adagrad", "Adam", "Adamax"
......@@ -48,7 +48,7 @@ As we stated in the target, we target to find out the best ``optimizer`` for tra
"model":{"_type":"choice", "_value":["vgg", "resnet18", "googlenet", "densenet121", "mobilenet", "dpn92", "senet18"]}
}
*Implemented code directory: :githublink:`search_space.json <examples/trials/cifar10_pytorch/search_space.json>`*
Implemented code directory: :githublink:`search_space.json <examples/trials/cifar10_pytorch/search_space.json>`
**Trial**
......@@ -59,7 +59,7 @@ The code for CNN training of each hyperparameters set, paying particular attenti
* Use ``nni.report_intermediate_result(acc)`` to report the intermedian result after finish each epoch.
* Use ``nni.report_final_result(acc)`` to report the final result before the trial end.
*Implemented code directory: :githublink:`main.py <examples/trials/cifar10_pytorch/main.py>`*
Implemented code directory: :githublink:`main.py <examples/trials/cifar10_pytorch/main.py>`
You can also use your previous code directly, refer to `How to define a trial <Trials.rst>`__ for modify.
......@@ -73,7 +73,7 @@ Here is the example of running this experiment on OpenPAI:
code directory: :githublink:`examples/trials/cifar10_pytorch/config_pai.yml <examples/trials/cifar10_pytorch/config_pai.yml>`
*The complete examples we have implemented: :githublink:`examples/trials/cifar10_pytorch/ <examples/trials/cifar10_pytorch>`*
The complete examples we have implemented: :githublink:`examples/trials/cifar10_pytorch/ <examples/trials/cifar10_pytorch>`
Launch the experiment
^^^^^^^^^^^^^^^^^^^^^
......
......@@ -8,8 +8,9 @@ MNIST examples
CNN MNIST classifier for deep learning is similar to ``hello world`` for programming languages. Thus, we use MNIST as example to introduce different features of NNI. The examples are listed below:
* `MNIST with NNI API (TensorFlow v1.x) <#mnist-tfv1>`__
* `MNIST with NNI API (PyTorch) <#mnist-pytorch>`__
* `MNIST with NNI API (TensorFlow v2.x) <#mnist-tfv2>`__
* `MNIST with NNI API (TensorFlow v1.x) <#mnist-tfv1>`__
* `MNIST with NNI annotation <#mnist-annotation>`__
* `MNIST in keras <#mnist-keras>`__
* `MNIST -- tuning with batch tuner <#mnist-batch>`__
......@@ -18,65 +19,70 @@ CNN MNIST classifier for deep learning is similar to ``hello world`` for program
* `distributed MNIST (tensorflow) using kubeflow <#mnist-kubeflow-tf>`__
* `distributed MNIST (pytorch) using kubeflow <#mnist-kubeflow-pytorch>`__
:raw-html:`<a name="mnist-tfv1"></a>`
**MNIST with NNI API (TensorFlow v1.x)**
:raw-html:`<a name="mnist-pytorch"></a>`
**MNIST with NNI API (PyTorch)**
This is a simple network which has two convolutional layers, two pooling layers and a fully connected layer. We tune hyperparameters, such as dropout rate, convolution size, hidden size, etc. It can be tuned with most NNI built-in tuners, such as TPE, SMAC, Random. We also provide an exmaple YAML file which enables assessor.
This is a simple network which has two convolutional layers, two pooling layers and a fully connected layer.
We tune hyperparameters, such as dropout rate, convolution size, hidden size, etc.
It can be tuned with most NNI built-in tuners, such as TPE, SMAC, Random.
We also provide an exmaple YAML file which enables assessor.
``code directory: examples/trials/mnist-tfv1/``
code directory: :githublink:`mnist-pytorch/ <examples/trials/mnist-pytorch/>`
:raw-html:`<a name="mnist-tfv2"></a>`
**MNIST with NNI API (TensorFlow v2.x)**
Same network to the example above, but written in TensorFlow v2.x Keras API.
Same network to the example above, but written in TensorFlow.
``code directory: examples/trials/mnist-tfv2/``
code directory: :githublink:`mnist-tfv2/ <examples/trials/mnist-tfv2/>`
:raw-html:`<a name="mnist-annotation"></a>`
**MNIST with NNI annotation**
:raw-html:`<a name="mnist-tfv1"></a>`
**MNIST with NNI API (TensorFlow v1.x)**
This example is similar to the example above, the only difference is that this example uses NNI annotation to specify search space and report results, while the example above uses NNI apis to receive configuration and report results.
Same network to the example above, but written in TensorFlow v1.x API.
``code directory: examples/trials/mnist-annotation/``
code directory: :githublink:`mnist-tfv1/ <examples/trials/mnist-tfv1/>`
:raw-html:`<a name="mnist-keras"></a>`
**MNIST in keras**
:raw-html:`<a name="mnist-annotation"></a>`
**MNIST with NNI annotation**
This example is implemented in keras. It is also a network for MNIST dataset, with two convolution layers, one pooling layer, and two fully connected layers.
This example is similar to the example above, the only difference is that this example uses NNI annotation to specify search space and report results, while the example above uses NNI apis to receive configuration and report results.
``code directory: examples/trials/mnist-keras/``
code directory: :githublink:`mnist-annotation/ <examples/trials/mnist-annotation/>`
:raw-html:`<a name="mnist-batch"></a>`
**MNIST -- tuning with batch tuner**
This example is to show how to use batch tuner. Users simply list all the configurations they want to try in the search space file. NNI will try all of them.
``code directory: examples/trials/mnist-batch-tune-keras/``
code directory: :githublink:`mnist-batch-tune-keras/ <examples/trials/mnist-batch-tune-keras/>`
:raw-html:`<a name="mnist-hyperband"></a>`
**MNIST -- tuning with hyperband**
This example is to show how to use hyperband to tune the model. There is one more key ``STEPS`` in the received configuration for trials to control how long it can run (e.g., number of iterations).
``code directory: examples/trials/mnist-hyperband/``
.. cannot find :githublink:`mnist-hyperband/ <examples/trials/mnist-hyperband/>`
code directory: :githublink:`mnist-hyperband/ <examples/trials/mnist-hyperband/>`
:raw-html:`<a name="mnist-nested"></a>`
**MNIST -- tuning within a nested search space**
This example is to show that NNI also support nested search space. The search space file is an example of how to define nested search space.
``code directory: examples/trials/mnist-nested-search-space/``
code directory: :githublink:`mnist-nested-search-space/ <examples/trials/mnist-nested-search-space/>`
:raw-html:`<a name="mnist-kubeflow-tf"></a>`
**distributed MNIST (tensorflow) using kubeflow**
This example is to show how to run distributed training on kubeflow through NNI. Users can simply provide distributed training code and a configure file which specifies the kubeflow mode. For example, what is the command to run ps and what is the command to run worker, and how many resources they consume. This example is implemented in tensorflow, thus, uses kubeflow tensorflow operator.
``code directory: examples/trials/mnist-distributed/``
code directory: :githublink:`mnist-distributed/ <examples/trials/mnist-distributed/>`
:raw-html:`<a name="mnist-kubeflow-pytorch"></a>`
**distributed MNIST (pytorch) using kubeflow**
Similar to the previous example, the difference is that this example is implemented in pytorch, thus, it uses kubeflow pytorch operator.
``code directory: examples/trials/mnist-distributed-pytorch/``
code directory: :githublink:`mnist-distributed-pytorch/ <examples/trials/mnist-distributed-pytorch/>`
......@@ -33,7 +33,7 @@ We prepared a dockerfile for setting up experiment environments. Before starting
Run Experiments:
----------------
Three representative kinds of tensor operators, **matrix multiplication**\ ,** batched matrix multiplication** and **2D convolution**\ , are chosen from BERT and AlexNet, and tuned with NNI. The ``Trial`` code for all tensor operators is ``/root/compiler_auto_tune_stable.py``\ , and ``Search Space`` files and ``config`` files for each tuning algorithm locate in ``/root/experiments/``\ , which are categorized by tensor operators. Here ``/root`` refers to the root of the container.
Three representative kinds of tensor operators, **matrix multiplication**\ , **batched matrix multiplication** and **2D convolution**\ , are chosen from BERT and AlexNet, and tuned with NNI. The ``Trial`` code for all tensor operators is ``/root/compiler_auto_tune_stable.py``\ , and ``Search Space`` files and ``config`` files for each tuning algorithm locate in ``/root/experiments/``\ , which are categorized by tensor operators. Here ``/root`` refers to the root of the container.
For tuning the operators of matrix multiplication, please run below commands from ``/root``\ :
......@@ -111,7 +111,7 @@ For tuning the operators of 2D convolution, please run below commands from ``/ro
Please note that G-BFS and N-A2C are only designed for tuning tiling schemes of multiplication of matrices with only power of 2 rows and columns, so they are not compatible with other types of configuration spaces, thus not eligible to tune the operators of batched matrix multiplication and 2D convolution. Here, AutoTVM is implemented by its authors in the TVM project, so the tuning results are printed on the screen rather than reported to NNI manager. The port 8080 of the container is bind to the host on the same port, so one can access the NNI Web UI through ``host_ip_addr:8080`` and monitor tuning process as below screenshot.
:raw-html:`<img src="../../../examples/trials/systems/opevo/screenshot.png" />`
.. image:: ../../img/opevo.png
Citing OpEvo
------------
......
......@@ -8,11 +8,11 @@ Overview
The performance of RocksDB is highly contingent on its tuning. However, because of the complexity of its underlying technology and a large number of configurable parameters, a good configuration is sometimes hard to obtain. NNI can help to address this issue. NNI supports many kinds of tuning algorithms to search the best configuration of RocksDB, and support many kinds of environments like local machine, remote servers and cloud.
This example illustrates how to use NNI to search the best configuration of RocksDB for a ``fillrandom`` benchmark supported by a benchmark tool ``db_bench``\ , which is an official benchmark tool provided by RocksDB itself. Therefore, before running this example, please make sure NNI is installed and `\ ``db_bench`` <https://github.com/facebook/rocksdb/wiki/Benchmarking-tools>`__ is in your ``PATH``. Please refer to `here <../Tutorial/QuickStart.md>`__ for detailed information about installation and preparing of NNI environment, and `here <https://github.com/facebook/rocksdb/blob/master/INSTALL.rst>`__ for compiling RocksDB as well as ``db_bench``.
This example illustrates how to use NNI to search the best configuration of RocksDB for a ``fillrandom`` benchmark supported by a benchmark tool ``db_bench``\ , which is an official benchmark tool provided by RocksDB itself. Therefore, before running this example, please make sure NNI is installed and `db_bench <https://github.com/facebook/rocksdb/wiki/Benchmarking-tools>`__ is in your ``PATH``. Please refer to `here <../Tutorial/QuickStart.rst>`__ for detailed information about installation and preparing of NNI environment, and `here <https://github.com/facebook/rocksdb/blob/master/INSTALL.md>`__ for compiling RocksDB as well as ``db_bench``.
We also provide a simple script :githublink:`db_bench_installation.sh <examples/trials/systems/rocksdb-fillrandom/db_bench_installation.sh>` helping to compile and install ``db_bench`` as well as its dependencies on Ubuntu. Installing RocksDB on other systems can follow the same procedure.
We also provide a simple script :githublink:`db_bench_installation.sh <examples/trials/systems_auto_tuning/rocksdb-fillrandom/db_bench_installation.sh>` helping to compile and install ``db_bench`` as well as its dependencies on Ubuntu. Installing RocksDB on other systems can follow the same procedure.
*code directory: :githublink:`example/trials/systems/rocksdb-fillrandom <examples/trials/systems/rocksdb-fillrandom>`*
:githublink:`code directory <examples/trials/systems_auto_tuning/rocksdb-fillrandom>`
Experiment setup
----------------
......@@ -43,7 +43,7 @@ In this example, the search space is specified by a ``search_space.json`` file a
}
}
*code directory: :githublink:`example/trials/systems/rocksdb-fillrandom/search_space.json <examples/trials/systems/rocksdb-fillrandom/search_space.json>`*
:githublink:`code directory <examples/trials/systems_auto_tuning/rocksdb-fillrandom/search_space.json>`
Benchmark code
^^^^^^^^^^^^^^
......@@ -54,7 +54,7 @@ Benchmark code should receive a configuration from NNI manager, and report the c
* Use ``nni.get_next_parameter()`` to get next system configuration.
* Use ``nni.report_final_result(metric)`` to report the benchmark result.
*code directory: :githublink:`example/trials/systems/rocksdb-fillrandom/main.py <examples/trials/systems/rocksdb-fillrandom/main.py>`*
:githublink:`code directory <examples/trials/systems_auto_tuning/rocksdb-fillrandom/main.py>`
Config file
^^^^^^^^^^^
......@@ -63,11 +63,11 @@ One could start a NNI experiment with a config file. A config file for NNI is a
Here is an example of tuning RocksDB with SMAC algorithm:
*code directory: :githublink:`example/trials/systems/rocksdb-fillrandom/config_smac.yml <examples/trials/systems/rocksdb-fillrandom/config_smac.yml>`*
:githublink:`code directory <examples/trials/systems_auto_tuning/rocksdb-fillrandom/config_smac.yml>`
Here is an example of tuning RocksDB with TPE algorithm:
*code directory: :githublink:`example/trials/systems/rocksdb-fillrandom/config_tpe.yml <examples/trials/systems/rocksdb-fillrandom/config_tpe.yml>`*
:githublink:`code directory <examples/trials/systems_auto_tuning/rocksdb-fillrandom/config_tpe.yml>`
Other tuners can be easily adopted in the same way. Please refer to `here <../Tuner/BuiltinTuner.rst>`__ for more information.
......@@ -97,8 +97,8 @@ We ran these two examples on the same machine with following details:
The detailed experiment results are shown in the below figure. Horizontal axis is sequential order of trials. Vertical axis is the metric, write OPS in this example. Blue dots represent trials for tuning RocksDB with SMAC tuner, and orange dots stand for trials for tuning RocksDB with TPE tuner.
.. image:: https://github.com/microsoft/nni/tree/v1.9/examples/trials/systems/rocksdb-fillrandom/plot.png
:target: https://github.com/microsoft/nni/tree/v1.9/examples/trials/systems/rocksdb-fillrandom/plot.png
.. image:: ../../img/rocksdb-fillrandom-plot.png
:target: ../../img/rocksdb-fillrandom-plot.png
:alt: image
......
......@@ -45,7 +45,7 @@ using the downloading script:
Or Download manually
#. download "dev-v1.1.json" and "train-v1.1.json" in https://rajpurkar.github.io/SQuAD-explorer/
#. download ``dev-v1.1.json`` and ``train-v1.1.json`` `here <https://rajpurkar.github.io/SQuAD-explorer/>`__
.. code-block:: bash
......@@ -53,7 +53,7 @@ Or Download manually
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
#. download "glove.840B.300d.txt" in https://nlp.stanford.edu/projects/glove/
#. download ``glove.840B.300d.txt`` `here <https://nlp.stanford.edu/projects/glove/>`__
.. code-block:: bash
......@@ -87,7 +87,7 @@ Modify ``nni/examples/trials/ga_squad/config.yml``\ , here is the default config
codeDir: ~/nni/examples/trials/ga_squad
gpuNum: 0
In the "trial" part, if you want to use GPU to perform the architecture search, change ``gpuNum`` from ``0`` to ``1``. You need to increase the ``maxTrialNum`` and ``maxExecDuration``\ , according to how long you want to wait for the search result.
In the **trial** part, if you want to use GPU to perform the architecture search, change ``gpuNum`` from ``0`` to ``1``. You need to increase the ``maxTrialNum`` and ``maxExecDuration``\ , according to how long you want to wait for the search result.
2.3 submit this job
^^^^^^^^^^^^^^^^^^^
......@@ -120,7 +120,7 @@ Modify ``nni/examples/trials/ga_squad/config_pai.yml``\ , here is the default co
#Your nni_manager ip
nniManagerIp: 10.10.10.10
tuner:
codeDir: https://github.com/Microsoft/nni/tree/v1.9/examples/tuners/ga_customer_tuner
codeDir: https://github.com/Microsoft/nni/tree/v2.0/examples/tuners/ga_customer_tuner
classFileName: customer_tuner.py
className: CustomerTuner
classArgs:
......
......@@ -28,7 +28,7 @@ An example is shown below:
"learning_rate":{"_type":"uniform","_value":[0.0001, 0.1]}
}
Refer to `SearchSpaceSpec.md <../Tutorial/SearchSpaceSpec.rst>`__ to learn more about search spaces. Tuner will generate configurations from this search space, that is, choosing a value for each hyperparameter from the range.
Refer to `SearchSpaceSpec <../Tutorial/SearchSpaceSpec.rst>`__ to learn more about search spaces. Tuner will generate configurations from this search space, that is, choosing a value for each hyperparameter from the range.
Step 2 - Update model code
^^^^^^^^^^^^^^^^^^^^^^^^^^
......@@ -80,7 +80,7 @@ To enable NNI API mode, you need to set useAnnotation to *false* and provide the
You can refer to `here <../Tutorial/ExperimentConfig.rst>`__ for more information about how to set up experiment configurations.
Please refer to `here </sdk_reference.html>`__ for more APIs (e.g., ``nni.get_sequence_id()``\ ) provided by NNI.
Please refer to `here <../sdk_reference.rst>`__ for more APIs (e.g., ``nni.get_sequence_id()``\ ) provided by NNI.
:raw-html:`<a name="nni-annotation"></a>`
......@@ -167,7 +167,7 @@ NNI supports a standalone mode for trial code to run without starting an NNI exp
nni.get_trial_id # return "STANDALONE"
nni.get_sequence_id # return 0
You can try standalone mode with the :githublink:`mnist example <examples/trials/mnist-tfv1>`. Simply run ``python3 mnist.py`` under the code directory. The trial code should successfully run with the default hyperparameter values.
You can try standalone mode with the :githublink:`mnist example <examples/trials/mnist-pytorch>`. Simply run ``python3 mnist.py`` under the code directory. The trial code should successfully run with the default hyperparameter values.
For more information on debugging, please refer to `How to Debug <../Tutorial/HowToDebug.rst>`__
......
......@@ -73,7 +73,7 @@ BOHB advisor requires the `ConfigSpace <https://github.com/automl/ConfigSpace>`_
.. code-block:: bash
nnictl package install --name=BOHB
pip install nni[BOHB]
To use BOHB, you should add the following spec in your experiment's YAML config file:
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment