Unverified Commit a911b856 authored by Yuge Zhang's avatar Yuge Zhang Committed by GitHub
Browse files

Resolve conflicts for #4760 (#4762)

parent 14d2966b
Quantizer in NNI
================
NNI implements the main part of the quantizaiton algorithm as quantizer. All quantizers are implemented as close as possible to what is described in the paper (if it has).
The following table provides a brief introduction to the quantizers implemented in nni, click the link in table to view a more detailed introduction and use cases.
.. list-table::
:header-rows: 1
:widths: auto
* - Name
- Brief Introduction of Algorithm
* - :ref:`naive-quantizer`
- Quantize weights to default 8 bits
* - :ref:`qat-quantizer`
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. `Reference Paper <http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf>`__
* - :ref:`dorefa-quantizer`
- DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. `Reference Paper <https://arxiv.org/abs/1606.06160>`__
* - :ref:`bnn-quantizer`
- Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. `Reference Paper <https://arxiv.org/abs/1602.02830>`__
* - :ref:`lsq-quantizer`
- Learned step size quantization. `Reference Paper <https://arxiv.org/pdf/1902.08153.pdf>`__
* - :ref:`observer-quantizer`
- Post training quantizaiton. Collect quantization information during calibration with observers.
Compression
===========
.. toctree::
:hidden:
:maxdepth: 2
Overview <overview>
Pruning <toctree_pruning>
Quantization <toctree_quantization>
Config Specification <compression_config_list>
Advanced Usage <advanced_usage>
Pruning
=======
.. toctree::
:hidden:
:maxdepth: 2
Overview <pruning>
Quickstart </tutorials/pruning_quick_start_mnist>
Pruner <pruner>
Speedup </tutorials/pruning_speedup>
Quantization
============
.. toctree::
:hidden:
:maxdepth: 2
Overview <quantization>
Quickstart </tutorials/quantization_quick_start_mnist>
Quantizer <quantizer>
SpeedUp </tutorials/quantization_speedup>
......@@ -13,6 +13,7 @@
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import re
import subprocess
import sys
sys.path.insert(0, os.path.abspath('../..'))
......@@ -30,7 +31,7 @@ author = 'Microsoft'
version = ''
# The full version, including alpha/beta/rc tags
# FIXME: this should be written somewhere globally
release = 'v2.6'
release = 'v2.7'
# -- General configuration ---------------------------------------------------
......@@ -44,6 +45,8 @@ release = 'v2.6'
extensions = [
'sphinx_gallery.gen_gallery',
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.intersphinx',
'sphinx.ext.mathjax',
'sphinxarg4nni.ext',
'sphinx.ext.napoleon',
......@@ -53,15 +56,64 @@ extensions = [
# 'nbsphinx', # nbsphinx has conflicts with sphinx-gallery.
'sphinx.ext.extlinks',
'IPython.sphinxext.ipython_console_highlighting',
'sphinx_tabs.tabs',
'sphinx_copybutton',
# Custom extensions in extension/ folder.
'tutorial_links', # this has to be after sphinx-gallery
'getpartialtext',
'inplace_translation',
'cardlinkitem',
'patch_docutils',
'codesnippetcard',
'patch_autodoc',
'toctree_check',
]
# Autosummary related settings
autosummary_imported_members = True
autosummary_ignore_module_all = False
# Auto-generate stub files before building docs
autosummary_generate = True
# Add mock modules
autodoc_mock_imports = ['apex', 'nni_node', 'tensorrt', 'pycuda', 'nn_meter']
autodoc_mock_imports = [
'apex', 'nni_node', 'tensorrt', 'pycuda', 'nn_meter', 'azureml',
'ConfigSpace', 'ConfigSpaceNNI', 'smac', 'statsmodels', 'pybnn',
]
# Some of our modules cannot generate summary
autosummary_mock_imports = [
'nni.retiarii.codegen.tensorflow',
'nni.nas.benchmarks.nasbench101.db_gen',
'nni.tools.jupyter_extension.management',
] + autodoc_mock_imports
autodoc_typehints = 'description'
autodoc_typehints_description_target = 'documented'
autodoc_inherit_docstrings = False
# Sphinx will warn about all references where the target cannot be found.
nitpicky = False # disabled for now
# A list of regular expressions that match URIs that should not be checked.
linkcheck_ignore = [
r'http://localhost:\d+',
r'.*://.*/#/', # Modern websites that has URLs like xxx.com/#/guide
r'https://github.com/JSong-Jia/Pic/', # Community links can't be found any more
]
# Ignore all links located in release.rst
linkcheck_exclude_documents = ['^release']
# Bibliography files
bibtex_bibfiles = ['refs.bib']
# Add a heading to bibliography
bibtex_footbibliography_header = '.. rubric:: Bibliography'
# Set bibliography style
bibtex_default_style = 'plain'
# Bibliography files
bibtex_bibfiles = ['refs.bib']
......@@ -86,6 +138,41 @@ sphinx_gallery_conf = {
'default_thumb_file': os.path.join(os.path.dirname(__file__), '../img/thumbnails/nni_icon_blue.png'),
}
# Copybutton: strip and configure input prompts for code cells.
copybutton_prompt_text = r">>> |\.\.\. |\$ |In \[\d*\]: | {2,5}\.\.\.: | {5,8}: "
copybutton_prompt_is_regexp = True
# Copybutton: customize selector to exclude gallery outputs.
copybutton_selector = ":not(div.sphx-glr-script-out) > div.highlight pre"
# Allow additional builders to be considered compatible.
sphinx_tabs_valid_builders = ['linkcheck']
# Disallow the sphinx tabs css from loading.
sphinx_tabs_disable_css_loading = True
# Some tutorials might need to appear more than once in toc.
# In this list, we make source/target tutorial pairs.
# Each "source" tutorial rst will be copied to "target" tutorials.
# The anchors will be replaced to avoid dupilcate labels.
# Target should start with ``cp_`` to be properly ignored in git.
tutorials_copy_list = [
# Seems that we don't need it for now.
# Add tuples back if we need it in future.
]
# Toctree ensures that toctree docs do not contain any other contents.
# Home page should be an exception.
toctree_check_whitelist = [
'index',
# FIXME: Other exceptions should be correctly handled.
'compression/index',
'compression/pruning',
'compression/quantization',
'hpo/hpo_benchmark',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['../templates']
......@@ -103,6 +190,18 @@ master_doc = 'index'
# Usually you set "language" from the command line for these cases.
language = None
# Translation related settings
locale_dir = ['locales']
# Documents that requires translation: https://github.com/microsoft/nni/issues/4298
gettext_documents = [
r'^index$',
r'^quickstart$',
r'^installation$',
r'^(nas|hpo|compression)/overview$',
r'^tutorials/(hello_nas|pruning_quick_start_mnist|hpo_quickstart_pytorch/main)$',
]
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
......@@ -110,7 +209,6 @@ exclude_patterns = [
'_build',
'Thumbs.db',
'.DS_Store',
'Release_v1.0.md',
'**.ipynb_checkpoints',
# Exclude translations. They will be added back via replacement later if language is set.
'**_zh.rst',
......@@ -151,17 +249,20 @@ html_theme_options = {
'base_url': 'https://nni.readthedocs.io/',
# Set the color and the accent color
# We can't have our customized themes currently
# Remember to update static/css/material_custom.css when this is updated.
'color_primary': 'indigo',
'color_accent': 'pink',
# Set those colors in layout.html.
'color_primary': 'custom',
'color_accent': 'custom',
# Set the repo location to get a badge with stats
'repo_url': 'https://github.com/microsoft/nni/',
'repo_name': 'nni',
'repo_name': 'GitHub',
# Visible levels of the global TOC; -1 means unlimited
'globaltoc_depth': 3,
'globaltoc_depth': 5,
# Expand all toc so that they can be dynamically collapsed
'globaltoc_collapse': False,
'version_dropdown': True,
# This is a placeholder, which should be replaced later.
......@@ -171,8 +272,8 @@ html_theme_options = {
# Text to appear at the top of the home page in a "hero" div.
'heroes': {
# We can have heroes for the home pages of HPO, NAS, Compression in future.
'index': 'An open source AutoML toolkit for neural architecture search, model compression and hyper-parameter tuning.'
'index': 'An open source AutoML toolkit for hyperparameter optimization, neural architecture search, '
'model compression and feature engineering.'
}
}
......@@ -200,6 +301,7 @@ html_title = 'Neural Network Intelligence'
# Add extra css files and js files
html_css_files = [
'css/material_theme.css',
'css/material_custom.css',
'css/material_dropdown.css',
'css/sphinx_gallery.css',
......@@ -209,6 +311,7 @@ html_js_files = [
'js/version.js',
'js/github.js',
'js/sphinx_gallery.js',
'js/misc.js'
]
# HTML context that can be used in jinja templates
......
###############################
Contribute to NNI
###############################
.. toctree::
Development Setup<./Tutorial/SetupNniDeveloperEnvironment>
Contribution Guide<./Tutorial/Contributing>
\ No newline at end of file
.. 24da49b25d3d36c476a69aceb825cb94
###############################
贡献代码
###############################
.. toctree::
设置开发环境<./Tutorial/SetupNniDeveloperEnvironment>
贡献指南<./Tutorial/Contributing>
\ No newline at end of file
######################
Examples
######################
========
.. toctree::
:maxdepth: 2
More examples can be found in our :githublink:`GitHub repository <examples>`.
MNIST<./TrialExample/MnistExamples>
Cifar10<./TrialExample/Cifar10Examples>
Scikit-learn<./TrialExample/SklearnExamples>
GBDT<./TrialExample/GbdtExample>
Pix2pix<./TrialExample/Pix2pixExample>
.. cardlinkitem::
:header: HPO Quickstart with PyTorch
:description: Use HPO to tune a PyTorch FashionMNIST model
:link: tutorials/hpo_quickstart_pytorch/main
:image: ../img/thumbnails/hpo-pytorch.svg
:background: purple
:tags: HPO
.. cardlinkitem::
:header: HPO Quickstart with TensorFlow
:description: Use HPO to tune a TensorFlow MNIST model
:link: tutorials/hpo_quickstart_tensorflow/main
:image: ../img/thumbnails/hpo-tensorflow.svg
:background: purple
:tags: HPO
.. cardlinkitem::
:header: HPO using command line tool
:description: Run HPO experiment with nnictl
:link: tutorials/hpo_nnictl/nnictl
:image: ../img/thumbnails/hpo-pytorch.svg
:background: purple
:tags: HPO
.. cardlinkitem::
:header: Hello, NAS!
:description: Beginners' NAS tutorial on how to search for neural architectures for MNIST dataset.
:link: tutorials/hello_nas
:image: ../img/thumbnails/nas-tutorial.svg
:background: cyan
:tags: NAS
.. cardlinkitem::
:header: Use NAS Benchmarks as Datasets
:description: Query data from popular NAS benchmarks from our preprocessed benchmark database.
:link: tutorials/nasbench_as_dataset
:image: ../img/thumbnails/nas-benchmark.svg
:background: cyan
:tags: NAS
.. cardlinkitem::
:header: Get Started with Model Pruning on MNIST
:description: Familiarize yourself with pruning to compress your model
:link: tutorials/pruning_quick_start_mnist
:image: ../img/thumbnails/pruning-tutorial.svg
:background: blue
:tags: Compression
.. cardlinkitem::
:header: Get Started with Model Quantization on MNIST
:description: Familiarize yourself with quantization to compress your model
:link: tutorials/quantization_quick_start_mnist
:image: ../img/thumbnails/quantization-tutorial.svg
:background: indigo
:tags: Compression
.. cardlinkitem::
:header: Speedup Model with Mask
:description: Make your model real smaller and faster with speed-up after pruned by pruner
:link: tutorials/pruning_speedup
:image: ../img/thumbnails/pruning-speed-up.svg
:background: blue
:tags: Compression
.. cardlinkitem::
:header: Speedup Model with Calibration Config
:description: Make your model real smaller and faster with speed-up after quantized by quantizer
:link: tutorials/quantization_speedup
:image: ../img/thumbnails/quantization-speed-up.svg
:background: indigo
:tags: Compression
.. d19a00598b8eca71c825d80c0a7106f2
######################
示例
######################
.. toctree::
:maxdepth: 2
MNIST<./TrialExample/MnistExamples>
Cifar10<./TrialExample/Cifar10Examples>
Scikit-learn<./TrialExample/SklearnExamples>
GBDT<./TrialExample/GbdtExample>
Pix2pix<./TrialExample/Pix2pixExample>
\ No newline at end of file
Experiment Management
=====================
An experiment can be created with command line tool ``nnictl`` or python APIs. NNI provides both command line tool ``nnictl`` and web Portal to manage the experiments, such as, creating, stopping, resuming, deleting, ranking, and comparing the experiments.
Management with ``nnictl``
--------------------------
The ability of ``nnictl`` on experiment management is almost equivalent to :doc:`web_portal/web_portal`. Users can refer to :doc:`../reference/nnictl` for detailed usage. It is highly suggested when visualization is not well supported in your environment (e.g., web browser is not supported in your environment).
Management with web portal
--------------------------
Experiment management on web potral gives an quick overview of all the experiment on users' machine. Users can easily switch to one experiment from this page. Users can refer to the :ref:`exp-manage-webportal` page for details. The experiment management on web portal is still under intensive development to bring more user-friendly features.
\ No newline at end of file
Overview of NNI Experiment
==========================
An NNI experiment is a unit of one tuning process. For example, it is one run of hyper-parameter tuning on a specific search space, it is one run of neural architecture search on a search space, or it is one run of automatic model compression on user specified goal on latency and accuracy. Usually, the tuning process requires many trials to explore feasible and potentially good-performing models. Thus, an important component of NNI experiment is **training service**, which is a unified interface to abstract diverse computation resources (e.g., local machine, remote servers, AKS). Users can easily run the tuning process on their prefered computation resource and platform. On the other hand, NNI experiment provides **WebUI** to visualize the tuning process to users.
During developing a DNN model, users need to manage the tuning process, such as, creating an experiment, adjusting an experiment, kill or rerun a trial in an experiment, dumping experiment data for customized analysis. Also, users may create a new experiment for comparison, or concurrently for new model developing tasks. Thus, NNI provides the functionality of **experiment management**. Users can use :doc:`../reference/nnictl` to interact with experiments.
The relation of the components in NNI experiment is illustrated in the following figure. Hyper-parameter optimization (HPO), neural architecture search (NAS), and model compression are three key features in NNI that help users develop and tune their models. Training serivce provides the ability of parallel running trials on available computation resources. WebUI visualizes the tuning process. *nnictl* is for managing the experiments.
.. image:: ../../img/experiment_arch.png
:scale: 80 %
:align: center
Before reading the following content, you are recommended to go through either :doc:`the quickstart of HPO </tutorials/hpo_quickstart_pytorch/main>` or :doc:`quickstart of NAS </tutorials/hello_nas>` first.
* :doc:`Overview of NNI training service <training_service/overview>`
* :doc:`Introduction to Web Portal <web_portal/web_portal>`
* :doc:`Manange Multiple Experiments <experiment_management>`
Experiment
==========
.. toctree::
:maxdepth: 2
Overview <overview>
Training Service <training_service/toctree>
Web Portal <web_portal/toctree>
Experiment Management <experiment_management>
Run an Experiment on AdaptDL
============================
Now NNI supports running experiment on `AdaptDL <https://github.com/petuum/adaptdl>`__. Before starting to use NNI AdaptDL mode, you should have a Kubernetes cluster, either on-premises or `Azure Kubernetes Service(AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , a Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is setup to connect to your Kubernetes cluster. In AdaptDL mode, your trial program will run as AdaptDL job in Kubernetes cluster.
AdaptDL Training Service
========================
Now NNI supports running experiment on `AdaptDL <https://github.com/petuum/adaptdl>`__, which is a resource-adaptive deep learning training and scheduling framework. With AdaptDL training service, your trial program will run as AdaptDL job in Kubernetes cluster.
AdaptDL aims to make distributed deep learning easy and efficient in dynamic-resource environments such as shared clusters and the cloud.
Prerequisite for Kubernetes Service
-----------------------------------
.. note:: AdaptDL doesn't support :ref:`reuse mode <training-service-reuse>`.
Prerequisite
------------
Before starting to use NNI AdaptDL training service, you should have a Kubernetes cluster, either on-premises or `Azure Kubernetes Service(AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , a Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is setup to connect to your Kubernetes cluster.
#. A **Kubernetes** cluster using Kubernetes 1.14 or later with storage. Follow this guideline to set up Kubernetes `on Azure <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , or `on-premise <https://kubernetes.io/docs/setup/>`__ with `cephfs <https://kubernetes.io/docs/concepts/storage/storage-classes/#ceph-rbd>`__\ , or `microk8s with storage add-on enabled <https://microk8s.io/docs/addons>`__.
#. Helm install **AdaptDL Scheduler** to your Kubernetes cluster. Follow this `guideline <https://adaptdl.readthedocs.io/en/latest/installation/install-adaptdl.html>`__ to setup AdaptDL scheduler.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use ``$(HOME)/.kube/config`` as kubeconfig file's path. You can also specify other kubeconfig files by setting the ** KUBECONFIG** environment variable. Refer this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use ``$(HOME)/.kube/config`` as kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. If your NNI trial job needs GPU resource, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure **Nvidia device plugin for Kubernetes**.
#. (Optional) Prepare a **NFS server** and export a general purpose mount as external storage.
#. Install **NNI**\ , follow the install guide `here <../Tutorial/QuickStart.rst>`__.
#. Install **NNI**.
Verify Prerequisites
^^^^^^^^^^^^^^^^^^^^
Verify the Prerequisites
^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash
.. code-block:: bash
nnictl --version
# Expected: <version_number>
nnictl --version
# Expected: <version_number>
.. code-block:: bash
.. code-block:: bash
kubectl version
# Expected that the kubectl client version matches the server version.
kubectl version
# Expected that the kubectl client version matches the server version.
.. code-block:: bash
.. code-block:: bash
kubectl api-versions | grep adaptdl
# Expected: adaptdl.petuum.com/v1
kubectl api-versions | grep adaptdl
# Expected: adaptdl.petuum.com/v1
Run an experiment
-----------------
Usage
-----
We have a CIFAR10 example that fully leverages the AdaptDL scheduler under ``examples/trials/cifar10_pytorch`` folder. (\ ``main_adl.py`` and ``config_adl.yaml``\ )
We have a CIFAR10 example that fully leverages the AdaptDL scheduler under :githublink:`examples/trials/cifar10_pytorch` folder. (:githublink:`main_adl.py <examples/trials/cifar10_pytorch/main_adl.py>` and :githublink:`config_adl.yaml <examples/trials/cifar10_pytorch/config_adl.yml>`)
Here is a template configuration specification to use AdaptDL as a training service.
.. code-block:: yaml
authorName: default
experimentName: minimal_adl
trainingServicePlatform: adl
nniManagerIp: 10.1.10.11
logCollection: http
tuner:
builtinTunerName: GridSearch
searchSpacePath: search_space.json
trialConcurrency: 2
maxTrialNum: 2
trial:
adaptive: false # optional.
image: <image_tag>
imagePullSecrets: # optional
- name: stagingsecret
codeDir: .
command: python main.py
gpuNum: 1
cpuNum: 1 # optional
memorySize: 8Gi # optional
nfs: # optional
server: 10.20.41.55
path: /
containerMountPath: /nfs
checkpoint: # optional
storageClass: dfs
storageSize: 1Gi
Those configs not mentioned below, are following the
`default specs defined </Tutorial/ExperimentConfig.rst#configuration-spec>`__ in the NNI doc.
.. code-block:: yaml
authorName: default
experimentName: minimal_adl
trainingServicePlatform: adl
nniManagerIp: 10.1.10.11
logCollection: http
tuner:
builtinTunerName: GridSearch
searchSpacePath: search_space.json
trialConcurrency: 2
maxTrialNum: 2
trial:
adaptive: false # optional.
image: <image_tag>
imagePullSecrets: # optional
- name: stagingsecret
codeDir: .
command: python main.py
gpuNum: 1
cpuNum: 1 # optional
memorySize: 8Gi # optional
nfs: # optional
server: 10.20.41.55
path: /
containerMountPath: /nfs
checkpoint: # optional
storageClass: dfs
storageSize: 1Gi
.. warning::
This configuration is written following the specification of `legacy experiment configuration <https://nni.readthedocs.io/en/v2.6/Tutorial/ExperimentConfig.html>`__. It is still supported, and will be updated to the latest version in future release.
The following explains the configuration fields of AdaptDL training service.
* **trainingServicePlatform**\ : Choose ``adl`` to use the Kubernetes cluster with AdaptDL scheduler.
* **nniManagerIp**\ : *Required* to get the correct info and metrics back from the cluster, for ``adl`` training service.
......@@ -103,6 +106,9 @@ Those configs not mentioned below, are following the
* **storageClass**\ : check `Kubernetes storage documentation <https://kubernetes.io/docs/concepts/storage/storage-classes/>`__ for how to use the appropriate ``storageClass``.
* **storageSize**\ : this value should be large enough to fit your model's checkpoints, or it could cause "disk quota exceeded" error.
More Features
-------------
NFS Storage
^^^^^^^^^^^
......@@ -121,7 +127,6 @@ The ``adl`` training service can then mount it to the kubernetes for every trial
Use cases:
* If your training trials depend on a dataset of large size, you may want to download it first onto the NFS first,
and mount it so that it can be shared across multiple trials.
* The storage for containers are ephemeral and the trial containers will be deleted after a trial's lifecycle is over.
......@@ -131,17 +136,17 @@ Use cases:
In short, it is not limited how a trial wants to read from or write on the NFS storage, so you may use it flexibly as per your needs.
Monitor via Log Stream
----------------------
^^^^^^^^^^^^^^^^^^^^^^
Follow the log streaming of a certain trial:
.. code-block:: bash
nnictl log trial --trial_id=<trial_id>
nnictl log trial --trial_id=TRIAL_ID
.. code-block:: bash
nnictl log trial <experiment_id> --trial_id=<trial_id>
nnictl log trial EXPERIMENT_ID --trial_id=TRIAL_ID
Note that *after* a trial has done and its pod has been deleted,
no logs can be retrieved then via this command.
......@@ -149,7 +154,7 @@ However you may still be able to access the past trial logs
according to the following approach.
Monitor via TensorBoard
-----------------------
^^^^^^^^^^^^^^^^^^^^^^^
In the context of NNI, an experiment has multiple trials.
For easy comparison across trials for a model tuning process,
......@@ -192,7 +197,7 @@ If having multiple experiment running at the same time, you may use
.. code-block:: bash
nnictl tensorboard start <experiment_id>
nnictl tensorboard start EXPERIMENT_ID
It will provide you the web url to access the tensorboard.
......
AML Training Service
====================
To run your trials on `AzureML <https://azure.microsoft.com/en-us/services/machine-learning/>`__, you can use AML training service. AML training service can programmatically submit runs to AzureML platform and collect their metrics.
Prerequisite
------------
1. Create an Azure account/subscription using this `link <https://azure.microsoft.com/en-us/free/services/machine-learning/>`__. If you already have an Azure account/subscription, skip this step.
2. Install the Azure CLI on your machine, follow the install guide `here <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__.
3. Authenticate to your Azure subscription from the CLI. To authenticate interactively, open a command line or terminal and use the following command:
.. code-block:: bash
az login
4. Log into your Azure account with a web browser and create a Machine Learning resource. You will need to choose a resource group and specific a workspace name. Then download ``config.json`` which will be used later.
.. image:: ../../../img/aml_workspace.png
5. Create an AML cluster as the compute target.
.. image:: ../../../img/aml_cluster.png
6. Open a command line and install AML package environment.
.. code-block:: bash
python3 -m pip install azureml
python3 -m pip install azureml-sdk
Usage
-----
We show an example configuration here with YAML (Python configuration should be similar).
.. code-block:: yaml
trialConcurrency: 1
maxTrialNumber: 10
...
trainingService:
platform: aml
dockerImage: msranni/nni
subscriptionId: ${your subscription ID}
resourceGroup: ${your resource group}
workspaceName: ${your workspace name}
computeTarget: ${your compute target}
Configuration References
------------------------
Compared with :doc:`local` and :doc:`remote`, OpenPAI training service supports the following additional configurations.
.. list-table::
:header-rows: 1
:widths: auto
* - Field name
- Description
* - dockerImage
- Required field. The docker image name used in job. If you don't want to build your own, NNI has provided a docker image `msranni/nni <https://hub.docker.com/r/msranni/nni>`__, which is up-to-date with every NNI release.
* - subscriptionId
- Required field. The subscription id of your account, can be found in ``config.json`` described above.
* - resourceGroup
- Required field. The resource group of your account, can be found in ``config.json`` described above.
* - workspaceName
- Required field. The workspace name of your account, can be found in ``config.json`` described above.
* - computeTarget
- Required field. The compute cluster name you want to use in your AML workspace. See `reference <https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target>`__ and Step 5 above.
* - maxTrialNumberPerGpu
- Optional field. Default 1. Used to specify the max concurrency trial number on a GPU device.
* - useActiveGpu
- Optional field. Default false. Used to specify whether to use a GPU if there is another process. By default, NNI will use the GPU only if there is no other active process in the GPU. See :doc:`local` for details.
Monitor your trial on the cloud by using AML studio
---------------------------------------------------
To see your trial job's detailed status on the cloud, you need to visit your studio which you create at Step 5 above. Once the job completes, go to the **Outputs + logs** tab. There you can see a ``70_driver_log.txt`` file, This file contains the standard output from a run and can be useful when you're debugging remote runs in the cloud. Learn more about aml from `here <https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-1st-experiment-hello-world>`__.
How to Implement Training Service in NNI
========================================
Customize a Training Service
============================
Overview
--------
......@@ -10,12 +10,12 @@ System architecture
-------------------
.. image:: ../../img/NNIDesign.jpg
:target: ../../img/NNIDesign.jpg
.. image:: ../../../img/NNIDesign.jpg
:target: ../../../img/NNIDesign.jpg
:alt:
The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of system, in charge of calling TrainingService to manage trial jobs and the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module to manage trial jobs, it communicates with nniManager module, and has different instance according to different training platform. For the time being, NNI supports `local platfrom <LocalMode.rst>`__\ , `remote platfrom <RemoteMachineMode.rst>`__\ , `PAI platfrom <PaiMode.rst>`__\ , `kubeflow platform <KubeflowMode.rst>`__ and `FrameworkController platfrom <FrameworkControllerMode.rst>`__.
The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of system, in charge of calling TrainingService to manage trial jobs and the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module to manage trial jobs, it communicates with nniManager module, and has different instance according to different training platform. For the time being, NNI supports :doc:`./local`, :doc:`./remote`, :doc:`./openpai`, :doc:`./kubeflow` and :doc:`./frameworkcontroller`.
In this document, we introduce the brief design of TrainingService. If users want to add a new TrainingService instance, they just need to complete a child class to implement TrainingService, don't need to understand the code detail of NNIManager, Dispatcher or other modules.
......@@ -24,7 +24,7 @@ Folder structure of code
NNI's folder structure is shown below:
.. code-block:: bash
.. code-block:: text
nni
|- deployment
......@@ -59,7 +59,7 @@ NNI's folder structure is shown below:
Function annotation of TrainingService
--------------------------------------
.. code-block:: bash
.. code-block:: typescript
abstract class TrainingService {
public abstract listTrialJobs(): Promise<TrialJobDetail[]>;
......@@ -82,7 +82,7 @@ The parent class of TrainingService has a few abstract functions, users need to
ClusterMetadata is the data related to platform details, for examples, the ClusterMetadata defined in remote machine server is:
.. code-block:: bash
.. code-block:: typescript
export class RemoteMachineMeta {
public readonly ip : string;
......@@ -117,7 +117,7 @@ This function will return the metadata value according to the values, it could b
SubmitTrialJob is a function to submit new trial jobs, users should generate a job instance in TrialJobDetail type. TrialJobDetail is defined as follow:
.. code-block:: bash
.. code-block:: typescript
interface TrialJobDetail {
readonly id: string;
......@@ -175,8 +175,8 @@ NNI offers a TrialKeeper tool to help maintaining trial jobs. Users can find the
The running architecture of TrialKeeper is show as follow:
.. image:: ../../img/trialkeeper.jpg
:target: ../../img/trialkeeper.jpg
.. image:: ../../../img/trialkeeper.jpg
:target: ../../../img/trialkeeper.jpg
:alt:
......@@ -185,6 +185,4 @@ When users submit a trial job to cloud platform, they should wrap their trial co
Reference
---------
For more information about how to debug, please `refer <../Tutorial/HowToDebug.rst>`__.
The guideline of how to contribute, please `refer <../Tutorial/Contributing.rst>`__.
The guideline of how to contribute, please refer to :doc:`/notes/contributing`.
FrameworkController Training Service
====================================
NNI supports running experiment using `FrameworkController <https://github.com/Microsoft/frameworkcontroller>`__,
called frameworkcontroller mode.
FrameworkController is built to orchestrate all kinds of applications on Kubernetes,
you don't need to install Kubeflow for specific deep learning framework like tf-operator or pytorch-operator.
Now you can use FrameworkController as the training service to run NNI experiment.
Prerequisite for on-premises Kubernetes Service
-----------------------------------------------
1. A **Kubernetes** cluster using Kubernetes 1.8 or later.
Follow this `guideline <https://kubernetes.io/docs/setup/>`__ to set up Kubernetes.
2. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server.
By default, NNI manager will use ``~/.kube/config`` as kubeconfig file's path.
You can also specify other kubeconfig files by setting the**KUBECONFIG** environment variable.
Refer this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__
to learn more about kubeconfig.
3. If your NNI trial job needs GPU resource, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__
to configure **Nvidia device plugin for Kubernetes**.
4. Prepare a **NFS server** and export a general purpose mount
(we recommend to map your NFS server path in ``root_squash option``,
otherwise permission issue may raise when NNI copies files to NFS.
Refer this `page <https://linux.die.net/man/5/exports>`__ to learn what root_squash option is),
or **Azure File Storage**.
5. Install **NFS client** on the machine where you install NNI and run nnictl to create experiment.
Run this command to install NFSv4 client:
.. code-block:: bash
apt install nfs-common
6. Install **NNI**:
.. code-block:: bash
python -m pip install nni
Prerequisite for Azure Kubernetes Service
-----------------------------------------
1. NNI support FrameworkController based on Azure Kubernetes Service,
follow the `guideline <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__ to set up Azure Kubernetes Service.
2. Install `Azure CLI <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__ and **kubectl**.
Use ``az login`` to set azure account, and connect kubectl client to AKS,
refer this `guideline <https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster>`__.
3. Follow the `guideline <https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal>`__
to create azure file storage account.
If you use Azure Kubernetes Service, NNI need Azure Storage Service to store code files and the output files.
4. To access Azure storage service, NNI need the access key of the storage account,
and NNI uses `Azure Key Vault <https://azure.microsoft.com/en-us/services/key-vault/>`__ Service to protect your private key.
Set up Azure Key Vault Service, add a secret to Key Vault to store the access key of Azure storage account.
Follow this `guideline <https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli>`__ to store the access key.
Setup FrameworkController
-------------------------
Follow the `guideline <https://github.com/Microsoft/frameworkcontroller/tree/master/example/run>`__
to set up FrameworkController in the Kubernetes cluster, NNI supports FrameworkController by the stateful set mode.
If your cluster enforces authorization, you need to create a service account with granted permission for FrameworkController,
and then pass the name of the FrameworkController service account to the NNI Experiment Config.
If the k8s cluster enforces Authorization, you also need to create a ServiceAccount with granted permission for FrameworkController.
Design
------
Please refer the design of :doc:`Kubeflow training service <kubeflow>`,
FrameworkController training service pipeline is similar.
Example
-------
The FrameworkController config format is:
.. code-block:: python
from nni.experiment import (
Experiment,
FrameworkAttemptCompletionPolicy,
FrameworkControllerRoleConfig,
K8sNfsConfig,
)
experiment = Experiment('frameworkcontroller')
experiment.config.trial_code_directory = '.'
experiment.config.search_space = search_space
experiment.config.tuner.name = 'TPE'
experiment.config.tuner.class_args['optimize_mode'] = 'maximize'
experiment.config.max_trial_number = 10
experiment.config.trial_concurrency = 2
experiment.config.training_service.storage = K8sNfsConfig()
experiment.config.training_service.storage.server = '10.20.30.40'
experiment.config.training_service.storage.path = '/mnt/nfs/nni'
experiment.config.training_service.task_roles = [FrameworkControllerRoleConfig()]
experiment.config.training_service.task_roles[0].name = 'worker'
experiment.config.training_service.task_roles[0].task_number = 1
experiment.config.training_service.task_roles[0].command = 'python3 model.py'
experiment.config.training_service.task_roles[0].gpuNumber = 1
experiment.config.training_service.task_roles[0].cpuNumber = 1
experiment.config.training_service.task_roles[0].memorySize = '4g'
experiment.config.training_service.task_roles[0].framework_attempt_completion_policy = \
FrameworkAttemptCompletionPolicy(min_failed_task_count = 1, min_succeed_task_count = 1)
If you use Azure Kubernetes Service, you should set storage config as follows:
.. code-block:: python
experiment.config.training_service.storage = K8sAzureStorageConfig()
experiment.config.training_service.storage.azure_account = 'your_storage_account_name'
experiment.config.training_service.storage.azure_share = 'your_azure_share_name'
experiment.config.training_service.storage.key_vault_name = 'your_vault_name'
experiment.config.training_service.storage.key_vault_key = 'your_secret_name'
If you set `ServiceAccount <https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/>`__ in your k8s,
please set ``serviceAccountName`` in your config:
.. code-block:: python
experiment.config.training_service.service_account_name = 'frameworkcontroller'
The trial's config format for NNI frameworkcontroller mode is a simple version of FrameworkController's official config,
you could refer the `Tensorflow example of FrameworkController
<https://github.com/microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/ps/cpu/tensorflowdistributedtrainingwithcpu.yaml>`__
for deep understanding.
Once it's ready, run:
.. code-block:: python
experiment.run(8080)
Notice: In frameworkcontroller mode,
NNIManager will start a rest server and listen on a port which is your NNI web portal's port plus 1.
For example, if your web portal port is ``8080``, the rest server will listen on ``8081``,
to receive metrics from trial job running in Kubernetes.
So you should ``enable 8081`` TCP port in your firewall rule to allow incoming traffic.
Hybrid Training Service
=======================
Hybrid training service is for aggregating different types of computation resources into a virtually unified resource pool, in which trial jobs are dispatched. Hybrid training service is for collecting user's all available computation resources to jointly work on an AutoML task, it is flexibile enough to switch among different types of computation resources. For example, NNI could submit trial jobs to multiple remote machines and AML simultaneously.
Prerequisite
------------
NNI has supported :doc:`./local`, :doc:`./remote`, :doc:`./openpai`, :doc:`./aml`, :doc:`./kubeflow`, :doc:`./frameworkcontroller`, for hybrid training service. Before starting an experiment using using hybrid training service, users should first setup their chosen (sub) training services (e.g., remote training service) according to each training service's own document page.
.. note:: Reuse mode is disabled by default for local training service. But if you are using local training service in hybrid, :ref:`reuse mode <training-service-reuse>` is enabled by default.
Usage
-----
Unlike other training services (e.g., ``platform: remote`` in remote training service), there is no dedicated keyword for hybrid training service, users can simply list the configurations of their chosen training services under the ``trainingService`` field. Below is an example of a hybrid training service containing remote training service and local training service in experiment configuration yaml.
.. code-block:: yaml
# the experiment config yaml file
...
trainingService:
- platform: remote
machineList:
- host: 127.0.0.1 # your machine's IP address
user: bob
password: bob
- platform: local
...
A complete example configuration file can be found in :githublink:`examples/trials/mnist-pytorch/config_hybrid.yml`.
\ No newline at end of file
Kubeflow Training Service
=========================
Now NNI supports running experiment on `Kubeflow <https://github.com/kubeflow/kubeflow>`__, called kubeflow mode.
Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster,
either on-premises or `Azure Kubernetes Service(AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__,
a Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__
is setup to connect to your Kubernetes cluster.
If you are not familiar with Kubernetes, `here <https://kubernetes.io/docs/tutorials/kubernetes-basics/>`__ is a good start.
In kubeflow mode, your trial program will run as Kubeflow job in Kubernetes cluster.
Prerequisite for on-premises Kubernetes Service
-----------------------------------------------
1. A **Kubernetes** cluster using Kubernetes 1.8 or later.
Follow this `guideline <https://kubernetes.io/docs/setup/>`__ to set up Kubernetes.
2. Download, set up, and deploy **Kubeflow** to your Kubernetes cluster.
Follow this `guideline <https://www.kubeflow.org/docs/started/getting-started/>`__ to setup Kubeflow.
3. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server.
By default, NNI manager will use ``~/.kube/config`` as kubeconfig file's path.
You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable.
Refer this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__
to learn more about kubeconfig.
4. If your NNI trial job needs GPU resource, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__
to configure **Nvidia device plugin for Kubernetes**.
5. Prepare a **NFS server** and export a general purpose mount
(we recommend to map your NFS server path in ``root_squash option``,
otherwise permission issue may raise when NNI copy files to NFS.
Refer this `page <https://linux.die.net/man/5/exports>`__ to learn what root_squash option is),
or **Azure File Storage**.
6. Install **NFS client** on the machine where you install NNI and run nnictl to create experiment.
Run this command to install NFSv4 client:
.. code-block:: bash
apt install nfs-common
7. Install **NNI**:
.. code-block:: bash
python -m pip install nni
Prerequisite for Azure Kubernetes Service
-----------------------------------------
1. NNI support Kubeflow based on Azure Kubernetes Service,
follow the `guideline <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__ to set up Azure Kubernetes Service.
2. Install `Azure CLI <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__ and **kubectl**.
Use ``az login`` to set azure account, and connect kubectl client to AKS,
refer this `guideline <https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster>`__.
3. Deploy Kubeflow on Azure Kubernetes Service, follow the `guideline <https://www.kubeflow.org/docs/started/getting-started/>`__.
4. Follow the `guideline <https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal>`__
to create azure file storage account.
If you use Azure Kubernetes Service, NNI need Azure Storage Service to store code files and the output files.
5. To access Azure storage service, NNI need the access key of the storage account,
and NNI use `Azure Key Vault <https://azure.microsoft.com/en-us/services/key-vault/>`__ Service to protect your private key.
Set up Azure Key Vault Service, add a secret to Key Vault to store the access key of Azure storage account.
Follow this `guideline <https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli>`__ to store the access key.
Design
------
.. image:: ../../../img/kubeflow_training_design.png
:target: ../../../img/kubeflow_training_design.png
:alt:
Kubeflow training service instantiates a Kubernetes rest client to interact with your K8s cluster's API server.
For each trial, we will upload all the files in your local ``trial_code_directory``
together with NNI generated files like parameter.cfg into a storage volumn.
Right now we support two kinds of storage volumes:
`nfs <https://en.wikipedia.org/wiki/Network_File_System>`__
and `azure file storage <https://azure.microsoft.com/en-us/services/storage/files/>`__,
you should configure the storage volumn in experiment config.
After files are prepared, Kubeflow training service will call K8S rest API to create Kubeflow jobs
(`tf-operator <https://github.com/kubeflow/tf-operator>`__ job
or `pytorch-operator <https://github.com/kubeflow/pytorch-operator>`__ job)
in K8S, and mount your storage volume into the job's pod.
Output files of Kubeflow job, like stdout, stderr, trial.log or model files, will also be copied back to the storage volumn.
NNI will show the storage volumn's URL for each trial in web portal, to allow user browse the log files and job's output files.
Supported operator
------------------
NNI only support tf-operator and pytorch-operator of Kubeflow, other operators are not tested.
Users can set operator type in experiment config.
The setting of tf-operator:
.. code-block:: yaml
config.training_service.operator = 'tf-operator'
The setting of pytorch-operator:
.. code-block:: yaml
config.training_service.operator = 'pytorch-operator'
If users want to use tf-operator, he could set ``ps`` and ``worker`` in trial config.
If users want to use pytorch-operator, he could set ``master`` and ``worker`` in trial config.
Supported storage type
----------------------
NNI support NFS and Azure Storage to store the code and output files,
users could set storage type in config file and set the corresponding config.
The setting for NFS storage are as follows:
.. code-block:: python
config.training_service.storage = K8sNfsConfig(
server = '10.20.30.40', # your NFS server IP
path = '/mnt/nfs/nni' # your NFS server export path
)
If you use Azure storage, you should set ``storage`` in your config as follows:
.. code-block:: python
config.training_service.storage = K8sAzureStorageConfig(
azure_account = your_azure_account_name,
azure_share = your_azure_share_name,
key_vault_name = your_vault_name,
key_vault_key = your_secret_name
)
Run an experiment
-----------------
Use :doc:`PyTorch quickstart </tutorials/hpo_quickstart_pytorch/main>` as an example.
This is a PyTorch job, and use pytorch-operator of Kubeflow.
The experiment config is like:
.. code-block:: python
from nni.experiment import Experiment, K8sNfsConfig, KubeflowRowConfig
experiment = Experiment('kubeflow')
experiment.config.search_space = search_space
experiment.config.tuner.name = 'TPE'
experiment.config.tuner.class_args['optimize_mode'] = 'maximize'
experiment.config.max_trial_number = 10
experiment.config.trial_concurrency = 2
experiment.config.operator = 'pytorch-operator'
experiment.config.api_version = 'v1alpha2'
experiment.config.training_service.storage = K8sNfsConfig()
experiment.config.training_service.storage.server = '10.20.30.40'
experiment.config.training_service.storage.path = '/mnt/nfs/nni'
experiment.config.training_service.worker = KubeflowRowConfig()
experiment.config.training_service.worker.replicas = 2
experiment.config.training_service.worker.command = 'python3 model.py'
experiment.config.training_service.worker.gpuNumber = 1
experiment.config.training_service.worker.cpuNumber = 1
experiment.config.training_service.worker.memorySize = '4g'
experiment.config.training_service.worker.code_directory = '.'
experiment.config.training_service.master = KubeflowRowConfig()
experiment.config.training_service.master.replicas = 1
experiment.config.training_service.master.command = 'python3 model.py'
experiment.config.training_service.master.gpuNumber = 0
experiment.config.training_service.master.cpuNumber = 1
experiment.config.training_service.master.memorySize = '4g'
experiment.config.training_service.master.code_directory = '.'
experiment.config.training_service.worker.docker_image = 'msranni/nni:latest' # default
Once it's ready, run:
.. code-block:: python
experiment.run(8080)
NNI will create Kubeflow pytorchjob for each trial,
and the job name format is something like ``nni_exp_{experiment_id}_trial_{trial_id}``.
You can see the Kubeflow jobs created by NNI in your Kubernetes dashboard.
Notice: In kubeflow mode, NNIManager will start a rest server and listen on a port which is your NNI web portal's port plus 1.
For example, if your web portal port is ``8080``, the rest server will listen on ``8081``,
to receive metrics from trial job running in Kubernetes.
So you should ``enable 8081`` TCP port in your firewall rule to allow incoming traffic.
Once a trial job is completed, you can go to NNI web portal's overview page (like http://localhost:8080/oview)
to check trials' information.
Local Training Service
======================
With local training service, the whole experiment (e.g., tuning algorithms, trials) runs on a single machine, i.e., user's dev machine. The generated trials run on this machine following ``trialConcurrency`` set in the configuration yaml file. If GPUs are used by trial, local training service will allocate required number of GPUs for each trial, like a resource scheduler.
.. note:: Currently, :ref:`reuse mode <training-service-reuse>` remains disabled by default in local training service.
Prerequisite
------------
You are recommended to go through quick start first, as this document page only explains the configuration of local training service, one part of the experiment configuration yaml file.
Usage
-----
.. code-block:: yaml
# the experiment config yaml file
...
trainingService:
platform: local
useActiveGpu: false # optional
...
There are other supported fields for local training service, such as ``maxTrialNumberPerGpu``, ``gpuIndices``, for concurrently running multiple trials on one GPU, and running trials on a subset of GPUs on your machine. Please refer to :ref:`reference-local-config-label` in reference for detailed usage.
.. note::
Users should set **useActiveGpu** to `true`, if the local machine has GPUs and your trial uses GPU, but generated trials keep waiting. This is usually the case when you are using graphical OS like Windows 10 and Ubuntu desktop.
Then we explain how local training service works with different configurations of ``trialGpuNumber`` and ``trialConcurrency``. Suppose user's local machine has 4 GPUs, with configuration ``trialGpuNumber: 1`` and ``trialConcurrency: 4``, there will be 4 trials run on this machine concurrently, each of which uses 1 GPU. If the configuration is ``trialGpuNumber: 2`` and ``trialConcurrency: 2``, there will be 2 trials run on this machine concurrently, each of which uses 2 GPUs. Which GPU is allocated to which trial is decided by local training service, users do not need to worry about it. An exmaple configuration below.
.. code-block:: yaml
...
trialGpuNumber: 1
trialConcurrency: 4
...
trainingService:
platform: local
useActiveGpu: false
A complete example configuration file can be found :githublink:`examples/trials/mnist-pytorch/config.yml`.
\ No newline at end of file
OpenPAI Training Service
========================
NNI supports running an experiment on `OpenPAI <https://github.com/Microsoft/pai>`__. OpenPAI manages computing resources and is optimized for deep learning. Through docker technology, the computing hardware are decoupled with software, so that it's easy to run distributed jobs, switch with different deep learning frameworks, or run other kinds of jobs on consistent environments.
Prerequisite
------------
1. Before starting to use OpenPAI training service, you should have an account to access an `OpenPAI <https://github.com/Microsoft/pai>`__ cluster. See `here <https://github.com/Microsoft/pai>`__ if you don't have any OpenPAI account and want to deploy an OpenPAI cluster. Please note that, on OpenPAI, your trial program will run in Docker containers.
2. Get token. Open web portal of OpenPAI, and click ``My profile`` button in the top-right side.
.. image:: ../../../img/pai_profile.jpg
:scale: 80%
Click ``copy`` button in the page to copy a jwt token.
.. image:: ../../../img/pai_token.jpg
:scale: 67%
3. Mount NFS storage to local machine. If you don't know where to find the NFS storage, please click ``Submit job`` button in web portal.
.. image:: ../../../img/pai_job_submission_page.jpg
:scale: 50%
Find the data management region in job submission page.
.. image:: ../../../img/pai_data_management_page.jpg
:scale: 33%
The ``Preview container paths`` is the NFS host and path that OpenPAI provided, you need to mount the corresponding host and path to your local machine first, then NNI could use the OpenPAI's NFS storage to upload data/code to or download from OpenPAI cluster. To mount the storage, please use ``mount`` command, for example:
.. code-block:: bash
sudo mount -t nfs4 gcr-openpai-infra02:/pai/data /local/mnt
Then the ``/data`` folder in container will be mounted to ``/local/mnt`` folder in your local machine. Please keep in mind that ``localStorageMountPoint`` should be set to ``/local/mnt`` in this case.
4. Get OpenPAI's storage config name and ``containerStorageMountPoint``. They can also be found in data management region in job submission page. Please find the ``Name`` and ``Path`` of your ``Team share storage``. They should be put into ``storageConfigName`` and ``containerStorageMountPoint``. For example,
.. code-block:: yaml
storageConfigName: confignfs-data
containerStorageMountPoint: /mnt/confignfs-data
Usage
-----
We show an example configuration here with YAML (Python configuration should be similar).
.. code-block:: yaml
trialGpuNumber: 0
trialConcurrency: 1
...
trainingService:
platform: openpai
host: http://123.123.123.123
username: ${your user name}
token: ${your token}
dockerImage: msranni/nni
trialCpuNumber: 1
trialMemorySize: 8GB
storageConfigName: confignfs-data
localStorageMountPoint: /local/mnt
containerStorageMountPoint: /mnt/confignfs-data
Once completing the configuration and run nnictl / use Python to launch the experiment. NNI will start to spawn trials to your specified OpenPAI platform.
The job name format is something like ``nni_exp_{experiment_id}_trial_{trial_id}``. You can see jobs created by NNI on the OpenPAI cluster's web portal, like:
.. image:: ../../../img/nni_pai_joblist.jpg
.. note:: For OpenPAI training service, NNI will start an additional rest server and listen on a port which is your NNI WebUI's port plus 1. For example, if your WebUI port is ``8080``, the rest server will listen on ``8081``, to receive metrics from trial job running in Kubernetes. So you should ``enable 8081`` TCP port in your firewall rule to allow incoming traffic.
Once a trial job is completed, you can go to NNI WebUI's overview page (like ``http://localhost:8080/oview``) to check trial's information. For example, you can expand a trial information in trial list view, click the logPath link like:
.. image:: ../../../img/nni_webui_joblist.png
:scale: 30%
Configuration References
------------------------
Compared with :doc:`local` and :doc:`remote`, OpenPAI training service supports the following additional configurations.
.. list-table::
:header-rows: 1
:widths: auto
* - Field name
- Description
* - username
- Required field. User name of OpenPAI platform.
* - token
- Required field. Authentication key of OpenPAI platform.
* - host
- Required field. The host of OpenPAI platform. It's PAI's job submission page URI, like ``10.10.5.1``. The default protocol in NNI is HTTPS. If your PAI's cluster has disabled https, please use the URI in ``http://10.10.5.1`` format.
* - trialCpuNumber
- Optional field. Should be positive number based on your trial program's CPU requirement. If it's not set in trial configuration, it should be set in the config specified in ``openpaiConfig`` or ``openpaiConfigFile`` field.
* - trialMemorySize
- Optional field. Should be in format like ``2gb`` based on your trial program's memory requirement. If it's not set in trial configuration, it should be set in the config specified in ``openpaiConfig`` or ``openpaiConfigFile`` field.
* - dockerImage
- Optional field. In OpenPAI training service, your trial program will be scheduled by OpenPAI to run in `Docker container <https://www.docker.com/>`__. This key is used to specify the Docker image used to create the container in which your trial will run. Upon every NNI release, we build `a docker image <https://hub.docker.com/r/msranni/nni>`__ with `this Dockerfile <https://hub.docker.com/r/msranni/nni>`__. You can either use this image directly in your config file, or build your own image. If it's not set in trial configuration, it should be set in the config specified in ``openpaiConfig`` or ``openpaiConfigFile`` field.
* - virtualCluster
- Optional field. Set the virtualCluster of OpenPAI. If omitted, the job will run on ``default`` virtual cluster.
* - localStorageMountPoint
- Required field. Set the mount path in the machine you start the experiment.
* - containerStorageMountPoint
- Optional field. Set the mount path in your container used in OpenPAI.
* - storageConfigName
- Optional field. Set the storage name used in OpenPAI. If it's not set in trial configuration, it should be set in the config specified in ``openpaiConfig`` or ``openpaiConfigFile`` field.
* - openpaiConfigFile
- Optional field. Set the file path of OpenPAI job configuration, the file is in yaml format. If users set ``openpaiConfigFile`` in NNI's configuration file, there's no need to specify the fields ``storageConfigName``, ``virtualCluster``, ``dockerImage``, ``trialCpuNumber``, ``trialGpuNumber``, ``trialMemorySize`` in configuration. These fields will use the values from the config file specified by ``openpaiConfigFile``.
* - openpaiConfig
- Optional field. Similar to ``openpaiConfigFile``, but instead of referencing an external file, using this field you embed the content into NNI's config YAML.
.. note::
#. The job name in OpenPAI's configuration file will be replaced by a new job name, the new job name is created by NNI, the name format is ``nni_exp_{this.experimentId}_trial_{trialJobId}`` .
#. If users set multiple taskRoles in OpenPAI's configuration file, NNI will wrap all of these taskRoles and start multiple tasks in one trial job, users should ensure that only one taskRole report metric to NNI, otherwise there might be some conflict error.
Data management
---------------
Before using NNI to start your experiment, users should set the corresponding mount data path in your nniManager machine. OpenPAI has their own storage (NFS, AzureBlob ...), and the storage will used in OpenPAI will be mounted to the container when it start a job. Users should set the OpenPAI storage type by ``paiStorageConfigName`` field to choose a storage in OpenPAI. Then users should mount the storage to their nniManager machine, and set the ``nniManagerNFSMountPath`` field in configuration file, NNI will generate bash files and copy data in ``codeDir`` to the ``nniManagerNFSMountPath`` folder, then NNI will start a trial job. The data in ``nniManagerNFSMountPath`` will be sync to OpenPAI storage, and will be mounted to OpenPAI's container. The data path in container is set in ``containerNFSMountPath``, NNI will enter this folder first, and then run scripts to start a trial job.
Version check
-------------
NNI support version check feature in since version 0.6. It is a policy to insure the version of NNIManager is consistent with trialKeeper, and avoid errors caused by version incompatibility.
Check policy:
#. NNIManager before v0.6 could run any version of trialKeeper, trialKeeper support backward compatibility.
#. Since version 0.6, NNIManager version should keep same with triakKeeper version. For example, if NNIManager version is 0.6, trialKeeper version should be 0.6 too.
#. Note that the version check feature only check first two digits of version.For example, NNIManager v0.6.1 could use trialKeeper v0.6 or trialKeeper v0.6.2, but could not use trialKeeper v0.5.1 or trialKeeper v0.7.
If you could not run your experiment and want to know if it is caused by version check, you could check your webUI, and there will be an error message about version check.
.. image:: ../../../img/webui-img/experimentError.png
:scale: 80%
With local training service, the whole experiment (e.g., tuning algorithms, trials) runs on a single machine, i.e., user's dev machine. The generated trials run on this machine following ``trialConcurrency`` set in the configuration yaml file. If GPUs are used by trial, local training service will allocate required number of GPUs for each trial, like a resource scheduler.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment