Unverified commit f5b89bb6 authored by J-shang, committed by GitHub

Merge pull request #4776 from microsoft/v2.7

parents 7aa44612 1546962f
@@ -20,8 +20,8 @@ NNI automates feature engineering, neural architecture search, hyperparameter tu

## What's NEW! &nbsp;<a href="#nni-released-reminder"><img width="48" src="docs/img/release_icon.png"></a>
* **New release**: [v2.7 is available](https://github.com/microsoft/nni/releases/tag/v2.7) - _released on Apr-18-2022_
* **New demo available**: [Youtube entry](https://www.youtube.com/channel/UCKcafm6861B2mnYhPbZHavw) | [Bilibili 入口](https://space.bilibili.com/1649051673) - _last updated on Apr-18-2022_
* **New webinar**: [Introducing Retiarii: A deep learning exploratory-training framework on NNI](https://note.microsoft.com/MSR-Webinar-Retiarii-Registration-Live.html) - _scheduled on June-24-2021_
* **New community channel**: [Discussions](https://github.com/microsoft/nni/discussions)
* **New emoticons release**: [nnSpider](https://nni.readthedocs.io/en/latest/sharings/nn_spider.html)
...
This diff is collapsed.
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.5 BLOCK -->
## Security
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, including [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us as described below.
## Reporting Security Issues
**Please do not report security vulnerabilities through public GitHub issues.**
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).
If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit it
This information will help us triage your report more quickly.
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs.
## Preferred Languages
We prefer all communications to be in English.
## Policy
Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).
<!-- END MICROSOFT SECURITY.MD BLOCK -->
@@ -160,6 +160,8 @@ If the output shape of the pruned conv layer is not divisible by 1024 (for exampl

   not_safe = not_safe_to_prune(model, dummy_input)

.. _flops-counter:

Model FLOPs/Parameters Counter
------------------------------
...
@@ -4,7 +4,7 @@ Pruner in NNI

NNI implements the main part of each pruning algorithm as a pruner. All pruners are implemented as closely as possible to what is described in their papers (if any).
The following table provides a brief introduction to the pruners implemented in NNI; click the links in the table to view a more detailed introduction and use cases.
There are two kinds of pruners in NNI; please refer to :ref:`basic pruner <basic-pruner>` and :ref:`scheduled pruner <scheduled-pruner>` for details.

.. list-table::
   :header-rows: 1
...
@@ -31,7 +31,7 @@ author = 'Microsoft'

version = ''
# The full version, including alpha/beta/rc tags
# FIXME: this should be written somewhere globally
release = 'v2.7'

# -- General configuration ---------------------------------------------------
@@ -95,6 +95,16 @@ autodoc_inherit_docstrings = False

# Sphinx will warn about all references where the target cannot be found.
nitpicky = False  # disabled for now
# A list of regular expressions that match URIs that should not be checked.
linkcheck_ignore = [
    r'http://localhost:\d+',
    r'.*://.*/#/',  # modern websites that have URLs like xxx.com/#/guide
    r'https://github.com/JSong-Jia/Pic/',  # community links can't be found anymore
]
# Ignore all links located in release.rst
linkcheck_exclude_documents = ['^release']
# Bibliography files
bibtex_bibfiles = ['refs.bib']
...
Examples
========

More examples can be found in our :githublink:`GitHub repository <examples>`.

.. cardlinkitem::
   :header: HPO Quickstart with PyTorch
@@ -19,6 +19,14 @@ More examples can be found in our :githublink:`GitHub repository <nni/examples>`
   :background: purple
   :tags: HPO
.. cardlinkitem::
   :header: HPO using command line tool
   :description: Run an HPO experiment with nnictl
   :link: tutorials/hpo_nnictl/nnictl
   :image: ../img/thumbnails/hpo-pytorch.svg
   :background: purple
   :tags: HPO

.. cardlinkitem::
   :header: Hello, NAS!
   :description: Beginners' NAS tutorial on how to search for neural architectures for MNIST dataset.
...
@@ -6,7 +6,7 @@ An experiment can be created with command line tool ``nnictl`` or python APIs. N

Management with ``nnictl``
--------------------------
The experiment management abilities of ``nnictl`` are almost equivalent to those of :doc:`web_portal/web_portal`. Users can refer to :doc:`../reference/nnictl` for detailed usage. It is highly recommended when visualization is not well supported in your environment (e.g., no web browser is available).
Management with web portal
--------------------------
...
@@ -4,6 +4,8 @@ AdaptDL Training Service

Now NNI supports running an experiment on `AdaptDL <https://github.com/petuum/adaptdl>`__, which is a resource-adaptive deep learning training and scheduling framework. With the AdaptDL training service, your trial program will run as an AdaptDL job in a Kubernetes cluster.
AdaptDL aims to make distributed deep learning easy and efficient in dynamic-resource environments such as shared clusters and the cloud.
.. note:: AdaptDL doesn't support :ref:`reuse mode <training-service-reuse>`.
Prerequisite
------------
@@ -37,7 +39,7 @@ Verify the Prerequisites

Usage
-----
We have a CIFAR10 example that fully leverages the AdaptDL scheduler under the :githublink:`examples/trials/cifar10_pytorch` folder (:githublink:`main_adl.py <examples/trials/cifar10_pytorch/main_adl.py>` and :githublink:`config_adl.yaml <examples/trials/cifar10_pytorch/config_adl.yml>`).
Here is a template configuration specification to use AdaptDL as a training service.
...
@@ -15,7 +15,7 @@ System architecture

   :alt:

The brief system architecture of NNI is shown in the picture. NNIManager is the core management module of the system, in charge of calling TrainingService to manage trial jobs and of the communication between different modules. Dispatcher is a message processing center responsible for message dispatch. TrainingService is a module to manage trial jobs; it communicates with the NNIManager module and has a different instance for each training platform. For the time being, NNI supports :doc:`./local`, :doc:`./remote`, :doc:`./openpai`, :doc:`./kubeflow` and :doc:`./frameworkcontroller`.
In this document, we introduce the brief design of TrainingService. If users want to add a new TrainingService instance, they only need to complete a child class that implements TrainingService; they don't need to understand the code details of NNIManager, Dispatcher, or other modules.
@@ -185,6 +185,4 @@ When users submit a trial job to cloud platform, they should wrap their trial co

Reference
---------
For the guideline on how to contribute, please refer to :doc:`/notes/contributing`.
@@ -60,14 +60,12 @@ Follow the `guideline <https://github.com/Microsoft/frameworkcontroller/tree/mas

to set up FrameworkController in the Kubernetes cluster; NNI supports FrameworkController in the stateful set mode.
If your cluster enforces authorization, you need to create a ServiceAccount with granted permission for FrameworkController,
and then pass the name of that ServiceAccount to the NNI experiment config.
Design
------
Please refer to the design of the :doc:`Kubeflow training service <kubeflow>`;
the FrameworkController training service pipeline is similar.

Example
@@ -115,7 +113,7 @@ If you use Azure Kubernetes Service, you should set storage config as follows:

   experiment.config.training_service.storage.key_vault_name = 'your_vault_name'
   experiment.config.training_service.storage.key_vault_key = 'your_secret_name'

If you set a `ServiceAccount <https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/>`__ in your k8s cluster,
please set ``serviceAccountName`` in your config:

.. code-block:: python

...
@@ -8,6 +8,8 @@ Prerequisite

NNI supports :doc:`./local`, :doc:`./remote`, :doc:`./openpai`, :doc:`./aml`, :doc:`./kubeflow`, and :doc:`./frameworkcontroller` as sub-services of the hybrid training service. Before starting an experiment using the hybrid training service, users should first set up their chosen (sub) training services (e.g., remote training service) according to each training service's own document page.
.. note:: Reuse mode is disabled by default for local training service. But if you are using local training service in hybrid, :ref:`reuse mode <training-service-reuse>` is enabled by default.
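As a hedged sketch (based on NNI's v2 YAML schema; the trial command, hosts, and credentials are placeholders), a hybrid configuration lists several sub training services under ``trainingService``:

```yaml
searchSpaceFile: search_space.json
trialCommand: python3 train.py          # placeholder trial command
trialConcurrency: 2
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:                        # a list of platforms makes it hybrid
  - platform: local
  - platform: remote
    machineList:
      - host: 192.0.2.10                # placeholder address
        user: bob                       # placeholder account
        password: bob_password
```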
Usage
-----
...
@@ -3,6 +3,8 @@ Local Training Service

With the local training service, the whole experiment (e.g., tuning algorithms, trials) runs on a single machine, i.e., the user's dev machine. The generated trials run on this machine following the ``trialConcurrency`` setting in the configuration YAML file. If GPUs are used by trials, the local training service will allocate the required number of GPUs to each trial, like a resource scheduler.
.. note:: Currently, :ref:`reuse mode <training-service-reuse>` remains disabled by default in local training service.
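As an illustrative sketch (field names follow NNI's v2 YAML schema; the trial command and tuner choice are placeholders), a local experiment configuration might look like:

```yaml
searchSpaceFile: search_space.json
trialCommand: python3 mnist.py          # placeholder trial command
trialCodeDirectory: .
trialConcurrency: 4                     # up to 4 trials run at the same time
trialGpuNumber: 1                       # each trial gets one GPU
tuner:
  name: TPE
trainingService:
  platform: local
  useActiveGpu: false                   # skip GPUs occupied by other programs
```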
Prerequisite
------------
...
@@ -6,7 +6,7 @@ NNI supports running an experiment on `OpenPAI <https://github.com/Microsoft/pai

Prerequisite
------------
1. Before starting to use the OpenPAI training service, you should have an account to access an `OpenPAI <https://github.com/Microsoft/pai>`__ cluster. See `here <https://github.com/Microsoft/pai>`__ if you don't have an OpenPAI account and want to deploy an OpenPAI cluster. Please note that on OpenPAI, your trial program will run in Docker containers.
2. Get a token. Open the web portal of OpenPAI, and click the ``My profile`` button on the top-right side.
@@ -100,7 +100,7 @@ Compared with :doc:`local` and :doc:`remote`, OpenPAI training service supports

   * - trialMemorySize
     - Optional field. Should be in a format like ``2gb``, based on your trial program's memory requirement. If it's not set in the trial configuration, it should be set in the config specified in the ``openpaiConfig`` or ``openpaiConfigFile`` field.
   * - dockerImage
     - Optional field. In the OpenPAI training service, your trial program will be scheduled by OpenPAI to run in a `Docker container <https://www.docker.com/>`__. This key specifies the Docker image used to create the container in which your trial will run. Upon every NNI release, we build `a docker image <https://hub.docker.com/r/msranni/nni>`__ with `this Dockerfile <https://hub.docker.com/r/msranni/nni>`__. You can either use this image directly in your config file or build your own image. If it's not set in the trial configuration, it should be set in the config specified in the ``openpaiConfig`` or ``openpaiConfigFile`` field.
   * - virtualCluster
     - Optional field. Set the virtualCluster of OpenPAI. If omitted, the job will run on the ``default`` virtual cluster.
   * - localStorageMountPoint
...
@@ -9,21 +9,30 @@ NNI has supported many training services listed below. Users can go through each

   * - Training Service
     - Description
   * - :doc:`Local <local>`
     - The whole experiment runs on your dev machine (i.e., a single local machine)
   * - :doc:`Remote <remote>`
     - The trials are dispatched to your configured SSH servers
   * - :doc:`OpenPAI <openpai>`
     - Running trials on OpenPAI, a DNN model training platform based on Kubernetes
   * - :doc:`Kubeflow <kubeflow>`
     - Running trials with Kubeflow, a DNN model training framework based on Kubernetes
   * - :doc:`AdaptDL <adaptdl>`
     - Running trials on AdaptDL, an elastic DNN model training platform
   * - :doc:`FrameworkController <frameworkcontroller>`
     - Running trials with FrameworkController, a DNN model training framework on Kubernetes
   * - :doc:`AML <aml>`
     - Running trials on Azure Machine Learning (AML) cloud service
   * - :doc:`PAI-DLC <paidlc>`
     - Running trials on PAI-DLC, deep learning containers based on Alibaba ACK
   * - :doc:`Hybrid <hybrid>`
     - Supports jointly using multiple of the above training services
.. _training-service-reuse:
Training Service Under Reuse Mode
---------------------------------
Since NNI v2.0, there have been two sets of training service implementations in NNI. The newer one is called *reuse mode*. When reuse mode is enabled, a cluster, such as a remote machine or a compute instance on AML, launches a long-running environment, and NNI submits trials to this environment iteratively, which saves the time of creating new jobs. For instance, using the OpenPAI training platform under reuse mode can avoid the overhead of repeatedly pulling docker images, creating containers, and downloading data.
.. note:: In the reuse mode, users need to make sure each trial can run independently in the same job (e.g., avoid loading checkpoints from previous trials).
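As an illustrative sketch, assuming the ``reuseMode`` field of NNI's v2 YAML schema (the host and key path are placeholders), reuse mode can be toggled per training service:

```yaml
trainingService:
  platform: remote
  reuseMode: true                 # keep the remote environment alive between trials
  machineList:
    - host: 192.0.2.10            # placeholder address
      user: bob                   # placeholder account
      sshKeyFile: ~/.ssh/id_rsa
```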
@@ -8,7 +8,7 @@ PAI-DSW server performs the role to submit a job while PAI-DLC is where the trai

Prerequisite
------------
Step 1. Install NNI, following the :doc:`install guide </installation>`.
Step 2. Create a PAI-DSW server following this `link <https://help.aliyun.com/document_detail/163684.html?section-2cw-lsi-es9#title-ji9-re9-88x>`__. Note that since the training service will run on PAI-DLC, it won't cost many resources to run, and you may just need a PAI-DSW server with CPU only.
@@ -60,7 +60,7 @@ Use ``examples/trials/mnist-pytorch`` as an example. The NNI config YAML file's

Note: You should set ``platform: dlc`` in the NNI config YAML file if you want to start an experiment in dlc mode.
Compared with :doc:`local`, the training service configuration in dlc mode has additional keys such as ``type/image/jobType/podCount/ecsSpec/region/nasDataSourceId/accessKeyId/accessKeySecret``; for a detailed explanation, refer to this `link <https://help.aliyun.com/document_detail/203111.html#h2-url-3>`__.
Also, as dlc mode requires DSW/DLC to mount the same NAS disk to share information, there are two extra keys related to this: ``localStorageMountPoint`` and ``containerStorageMountPoint``.
...
@@ -13,7 +13,7 @@ Prerequisite

2. Make sure remote machines can be accessed through SSH from the machine which runs the ``nnictl`` command. Both password and key authentication of SSH are supported. For advanced usage, please refer to :ref:`reference-remote-config-label` in the reference for detailed usage.
3. Make sure the NNI version on each machine is consistent. Follow the install guide :doc:`here </installation>` to install NNI.
4. Make sure the command of the trial is compatible with remote OSes if you want to use remote Linux and Windows together. For example, the default Python 3.x executable is called ``python3`` on Linux and ``python`` on Windows.
@@ -21,18 +21,18 @@ In addition, there are several steps for Windows server.

1. Install and start ``OpenSSH Server``.

   1) Open the ``Settings`` app on Windows.
   2) Click ``Apps``, then click ``Optional features``.
   3) Click ``Add a feature``, search for and select ``OpenSSH Server``, and then click ``Install``.
   4) Once it's installed, run the commands below to start the service and set it to start automatically.

   .. code-block:: bat

      sc config sshd start=auto
      net start sshd

2. Make sure the remote account is an administrator, so that it can stop running trials.
@@ -85,7 +85,7 @@ You can run below command on Windows, Linux, or macOS to spawn trials on remote

.. _nniignore:

.. note:: If you are planning to use remote machines or clusters as your training service, to avoid too much pressure on the network, NNI limits the number of files to 2000 and the total size to 300MB. If your trial code directory contains too many files, you can choose which files and subfolders to exclude by adding a ``.nniignore`` file that works like a ``.gitignore`` file. For more details on how to write this file, see the `git documentation <https://git-scm.com/docs/gitignore#_pattern_format>`__.

*Example:* :githublink:`config_detailed.yml <examples/trials/mnist-pytorch/config_detailed.yml>` and :githublink:`.nniignore <examples/trials/mnist-pytorch/.nniignore>`
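For illustration, a hypothetical ``.nniignore`` excluding datasets and checkpoints from upload might look like this (the paths are placeholders; the pattern syntax follows ``.gitignore``):

```
# exclude large artifacts from the codeDir upload
data/
checkpoints/
logs/
*.tar.gz
```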
@@ -111,4 +111,4 @@ Remote training service support shared storage, which can help use your own stor

Monitor via TensorBoard
^^^^^^^^^^^^^^^^^^^^^^^
The remote training service supports trial visualization via TensorBoard. Follow the guide :doc:`/experiment/web_portal/tensorboard` to learn how to use TensorBoard.
@@ -7,7 +7,7 @@ All the information generated by the experiment will be stored under ``/nni`` fo

All the output produced by the trial will be located under the ``/nni/{EXPERIMENT_ID}/trials/{TRIAL_ID}/nnioutput`` folder in your shared storage.
This saves you from searching for experiment-related information in various places.
Remember that your trial working directory is ``/nni/{EXPERIMENT_ID}/trials/{TRIAL_ID}``, so if you upload your data to this shared storage, you can open it like a local file in your trial code without downloading it.
We will develop more practical features in the future based on shared storage. The config reference can be found :ref:`here <reference-sharedstorage-config-label>`.

.. note::
   Shared storage is currently in the experimental stage. We suggest using AzureBlob under Ubuntu/CentOS/RHEL, and NFS under Ubuntu/CentOS/RHEL/Fedora/Debian for remote.
...
Advanced Usage
==============

.. toctree::
   :hidden:

   Command Line Tool Example </tutorials/hpo_nnictl/nnictl>
   Implement Custom Tuners and Assessors <custom_algorithm>
   Install Custom or 3rd-party Tuners and Assessors <custom_algorithm_installation>
   Tuner Benchmark <hpo_benchmark>
   Tuner Benchmark Example Statistics <hpo_benchmark_stats>
@@ -125,7 +125,7 @@ More detail example you could see:

Write a more advanced automl algorithm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The methods above are usually enough to write a general tuner. However, users may also want more methods, for example, intermediate results and trials' state (e.g., the methods in the assessor), in order to build a more powerful AutoML algorithm. Therefore, we have another concept called ``advisor``, which directly inherits from ``MsgDispatcherBase`` in :githublink:`msg_dispatcher_base.py <nni/runtime/msg_dispatcher_base.py>`.
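To make the tuner-side division of responsibilities concrete, here is a minimal, self-contained sketch of the tuner interface. A real NNI tuner would subclass ``nni.tuner.Tuner``; the class below is a dependency-free stand-in, and the random-search logic and search-space entries are illustrative only:

```python
import random


class RandomSearchTunerSketch:
    """Illustrative stand-in for the core interface of nni.tuner.Tuner."""

    def __init__(self):
        self.search_space = {}
        self.results = {}  # parameter_id -> final metric

    def update_search_space(self, search_space):
        # Called once at experiment start with the JSON search space.
        self.search_space = search_space

    def generate_parameters(self, parameter_id, **kwargs):
        # Called each time a new trial needs a configuration.
        params = {}
        for name, spec in self.search_space.items():
            if spec['_type'] == 'choice':
                params[name] = random.choice(spec['_value'])
            elif spec['_type'] == 'uniform':
                low, high = spec['_value']
                params[name] = random.uniform(low, high)
        return params

    def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
        # Called when a trial reports its final metric.
        self.results[parameter_id] = value


# usage sketch
tuner = RandomSearchTunerSketch()
tuner.update_search_space({
    'lr': {'_type': 'uniform', '_value': [0.001, 0.1]},
    'batch_size': {'_type': 'choice', '_value': [16, 32, 64]},
})
params = tuner.generate_parameters(parameter_id=0)
tuner.receive_trial_result(0, params, value=0.93)
```

An advisor exposes a superset of these hooks (plus intermediate results and trial state), which is why it inherits from ``MsgDispatcherBase`` instead.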
Customize Assessor
------------------
...