Merge pull request #4668 from microsoft/doc-refactor
PAI-DLC Training Service
========================
NNI supports running an experiment on `PAI-DSW <https://help.aliyun.com/document_detail/194831.html>`__ and submitting trials to `PAI-DLC <https://help.aliyun.com/document_detail/165137.html>`__, which provides deep learning containers based on Alibaba ACK. This is called ``dlc`` mode.
The PAI-DSW server plays the role of submitting jobs, while PAI-DLC is where the training jobs run.
Prerequisite
------------
Step 1. Install NNI; follow the install guide `here <../Tutorial/QuickStart.rst>`__.
Step 4. Open your PAI-DSW server command line, then download and install the PAI-DLC Python SDK:
.. code-block:: bash

    pip install ./pai-dlc-20201203  # pai-dlc-20201203 refers to the unzipped SDK file name; replace it accordingly.
Usage
-----
Use ``examples/trials/mnist-pytorch`` as an example. The content of the NNI config YAML file looks like:
Run the following commands to start the example experiment:
Replace ``${NNI_VERSION}`` with a released version name or branch name, e.g., ``v2.3``.
Monitor your job
^^^^^^^^^^^^^^^^
To monitor your job on DLC, you need to visit `DLC <https://pai-dlc.console.aliyun.com/#/jobs>`__ to check the job status.
Remote Training Service
=======================
NNI can run one experiment on multiple remote machines through SSH; this is called ``remote`` mode. It works like a lightweight training platform: NNI is started on your computer and dispatches trials to the remote machines in parallel.
The supported operating systems for remote machines are ``Linux``, ``Windows 10``, and ``Windows Server 2019``.
Prerequisite
------------
1. Make sure the default environment of the remote machines meets the requirements of your trial code. If it does not, a setup script can be added to the ``command`` field of the NNI config.
2. Make sure the remote machines can be accessed through SSH from the machine that runs the ``nnictl`` command. Both password and key authentication are supported. For advanced usage, please refer to :ref:`reference-remote-config-label` in the reference.
3. Make sure the NNI version on each machine is consistent. Follow the install guide `here <../Tutorial/QuickStart.rst>`__ to install NNI.
4. Make sure the trial command is compatible with the remote OSes if you want to use remote Linux and Windows machines together. For example, the default Python 3.x executable is called ``python3`` on Linux and ``python`` on Windows.
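Where a setup script is needed (see prerequisite 1 above), it can be chained into the trial command. A minimal sketch, assuming a hypothetical ``requirements.txt`` in the code directory:

.. code-block:: yaml

    # Install trial dependencies on the remote machine before launching the trial itself.
    trialCommand: pip install -r requirements.txt && python3 mnist.py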
In addition, a few extra steps are needed for Windows servers.
1. Install and start ``OpenSSH Server``.
1) Open ``Settings`` app on Windows.
2) Click ``Apps``\ , then click ``Optional features``.
3) Click ``Add a feature``\ , search and select ``OpenSSH Server``\ , and then click ``Install``.
4) Once it's installed, run the commands below to start the service and set it to start automatically.
.. code-block:: bat

    sc config sshd start=auto
    net start sshd
2. Make sure the remote account is an administrator, so that it can stop running trials.
3. Make sure there is no welcome message beyond the default, since extra output causes the ssh2 library in Node.js to fail. For example, if you're using a Data Science VM on Azure, you need to remove the extra echo commands in ``C:\dsvm\tools\setup\welcome.bat``.
Output like the following is fine when opening a new command window.
.. code-block:: text

    Microsoft Windows [Version 10.0.17763.1192]
    (c) 2018 Microsoft Corporation. All rights reserved.

    (py37_default) C:\Users\AzureUser>
Usage
-----
Use ``examples/trials/mnist-pytorch`` as the example. Suppose there are two machines that can be logged into with a username and password or with SSH key authentication. Here is a template configuration specification.
.. code-block:: yaml

    searchSpaceFile: search_space.json
    trialCommand: python3 mnist.py
    trialGpuNumber: 0
    trialConcurrency: 4
    maxTrialNumber: 20
    tuner:
      name: TPE
      classArgs:
        optimize_mode: maximize
    trainingService:
      platform: remote
      machineList:
        - host: 192.0.2.1
          user: alice
          ssh_key_file: ~/.ssh/id_rsa
        - host: 192.0.2.2
          port: 10022
          user: bob
          password: bob123
The example configuration is saved in ``examples/trials/mnist-pytorch/config_remote.yml``.
You can run the command below on Windows, Linux, or macOS to spawn trials on remote Linux machines:
.. code-block:: bash

    nnictl create --config examples/trials/mnist-pytorch/config_remote.yml
.. _nniignore:
.. note:: If you are planning to use remote machines or clusters as your training service, then to avoid putting too much pressure on the network, NNI limits the number of files to 2000 and the total size to 300 MB. If your ``codeDir`` contains too many files, you can choose which files and subfolders should be excluded by adding a ``.nniignore`` file that works like a ``.gitignore`` file. For more details on how to write this file, see the `git documentation <https://git-scm.com/docs/gitignore#_pattern_format>`__.
*Example:* :githublink:`config_detailed.yml <examples/trials/mnist-pytorch/config_detailed.yml>` and :githublink:`.nniignore <examples/trials/mnist-pytorch/.nniignore>`
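For illustration, a hypothetical ``.nniignore`` that keeps datasets and checkpoints out of the upload might look like:

.. code-block:: text

    # Hypothetical excludes; adjust to your own codeDir layout.
    data/
    checkpoints/
    *.ckpt
    __pycache__/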
More features
-------------
Configure python environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
By default, commands and scripts are executed in the default environment on the remote machine. If there are multiple Python virtual environments on your remote machine and you want to run experiments in a specific one, use **pythonPath** to specify the Python environment on your remote machine.
For example, with anaconda you can specify:
.. code-block:: yaml

    pythonPath: /home/bob/.conda/envs/ENV-NAME/bin
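In the full configuration, ``pythonPath`` is set per machine under ``machineList``. A sketch, with a placeholder host, user, and a hypothetical environment name ``nni-env``:

.. code-block:: yaml

    trainingService:
      platform: remote
      machineList:
        - host: 192.0.2.1
          user: alice
          ssh_key_file: ~/.ssh/id_rsa
          # Trials on this machine run with this environment's Python.
          pythonPath: /home/alice/.conda/envs/nni-env/bin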
Configure shared storage
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The remote training service supports shared storage, which helps you use your own storage while using NNI. Follow the guide `here <./shared_storage.rst>`__ to learn how to use shared storage.
Monitor via TensorBoard
^^^^^^^^^^^^^^^^^^^^^^^
The remote training service supports trial visualization via TensorBoard. Follow the guide `here <./tensorboard.rst>`__ to learn how to use TensorBoard.
How to Use Shared Storage
=========================
If you want to use your own storage while using NNI, shared storage can satisfy that need.
Compared with the training service's native storage, shared storage brings you more convenience.
Visualize Trial with TensorBoard
================================
Since NNI v2.2, you can launch a TensorBoard process across one or multiple trials from the web portal. For now, this feature supports the local training service and reuse-mode training services with shared storage; more scenarios will be supported in later NNI versions.
Preparation
-----------
Make sure TensorBoard is installed in your environment. If you have never used TensorBoard, here are getting-started tutorials for reference: `TensorBoard with TensorFlow <https://www.tensorflow.org/tensorboard/get_started>`__ and `TensorBoard with PyTorch <https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html>`__.
Launch TensorBoard from the WebUI
---------------------------------
Save Logs
^^^^^^^^^
NNI automatically uses the ``tensorboard`` subfolder under a trial's output folder as the TensorBoard logdir. So in the trial's source code, you need to save the TensorBoard logs under ``NNI_OUTPUT_DIR/tensorboard``. This log path can be constructed as:
.. code-block:: python

    import os

    log_dir = os.path.join(os.environ["NNI_OUTPUT_DIR"], 'tensorboard')
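A runnable sketch of the same idea; the fallback directory here is hypothetical and only exists so the snippet also runs outside an NNI trial:

.. code-block:: python

    import os

    # Inside a trial, NNI sets NNI_OUTPUT_DIR; the fallback below is only
    # for running this sketch standalone.
    output_dir = os.environ.get("NNI_OUTPUT_DIR", "/tmp/nni_demo")
    log_dir = os.path.join(output_dir, "tensorboard")
    os.makedirs(log_dir, exist_ok=True)

Writing TensorBoard event files under this ``log_dir`` lets NNI pick them up automatically.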
Launch Tensorboard
^^^^^^^^^^^^^^^^^^
* As with the compare feature, first select the trials you want to combine, then click the ``Tensorboard`` button.
.. image:: ../../img/Tensorboard_1.png
:target: ../../img/Tensorboard_1.png
:alt:
* After clicking the ``OK`` button in the pop-up box, you will jump to the TensorBoard portal.
.. image:: ../../img/Tensorboard_2.png
:target: ../../img/Tensorboard_2.png
:alt:
* You can see the ``SequenceID-TrialID`` on the tensorboard portal.
.. image:: ../../img/Tensorboard_3.png
:target: ../../img/Tensorboard_3.png
:alt:
Stop All
^^^^^^^^
If you want to reopen a portal you have already launched, click its TensorBoard ID. If you don't need TensorBoard anymore, click the ``Stop all tensorboard`` button.
.. image:: ../../img/Tensorboard_4.png
:target: ../../img/Tensorboard_4.png
:alt:
Training Services
=================
NNI supports the many training services listed below. Users can go through each page to learn how to configure the corresponding training service. NNI is highly extensible by design; users can customize a new training service for their own resources, platforms, or needs.
.. toctree::
    :hidden:

    Local <local>
    Remote <remote>
    OpenPAI <openpai>
    Kubeflow <kubeflow>
    AdaptDL <adaptdl>
    FrameworkController <frameworkcontroller>
    AML <aml>
    PAI-DLC <paidlc>
    Hybrid <hybrid>
    Customize a Training Service <customize>
    Shared Storage <shared_storage>
.. list-table::
    :header-rows: 1

    * - Training Service
      - Description
    * - Local
      - The whole experiment runs on your dev machine (i.e., a single local machine)
    * - Remote
      - The trials are dispatched to your configured remote servers
    * - OpenPAI
      - Running trials on OpenPAI, a DNN model training platform based on Kubernetes
    * - Kubeflow
      - Running trials with Kubeflow, a DNN model training framework based on Kubernetes
    * - AdaptDL
      - Running trials on AdaptDL, an elastic DNN model training platform
    * - FrameworkController
      - Running trials with FrameworkController, a DNN model training framework on Kubernetes
    * - AML
      - Running trials on the AML cloud service
    * - PAI-DLC
      - Running trials on PAI-DLC, which provides deep learning containers based on Alibaba ACK
    * - Hybrid
      - Jointly using multiple of the above training services
Web Portal
==========
The web portal lets users conveniently visualize their NNI experiments: tuning and training progress, detailed metrics, and error logs. It also allows users to control their experiments and trials, such as updating an experiment's concurrency or duration and rerunning trials.
.. toctree::
    :hidden:

    Experiment Web Portal <webui>
    Visualize with TensorBoard <tensorboard>
Web Portal
==========
Q&A
---
There are many trials in the detail table but the ``Default Metric`` chart is empty
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. note:: First, you should know that the ``Default metric`` and ``Hyper parameter`` charts only show succeeded trials.
What should you do when a chart such as ``Default metric`` or ``Hyper parameter`` looks wrong
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Download the experiment results (``experiment config``, ``trial messages``, and ``intermediate metrics``) from ``Experiment summary`` and then upload these results in your issue.
.. image:: ../../img/webui-img/summary.png
:target: ../../img/webui-img/summary.png
:alt: summary
What should you do when your experiment has an error
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Click the icon to the right of ``experiment status`` and take a screenshot of the error message.
* Then click ``learn about`` to download the ``nni-manager`` and ``dispatcher`` log files.
* Please file an issue via `Feedback` under `About` and attach the information above.
.. image:: ../../img/webui-img/experimentError.png
:target: ../../img/webui-img/experimentError.png
:alt: experimentError
What should you do when your trial fails
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* ``Customized trial`` can be used here: just submit the same parameters to the experiment to rerun the trial.
.. image:: ../../img/webui-img/detail/customizedTrialButton.png
:target: ../../img/webui-img/detail/customizedTrialButton.png
:alt: customizedTrialButton
.. image:: ../../img/webui-img/detail/customizedTrial.png
:target: ../../img/webui-img/detail/customizedTrial.png
:alt: customizedTrial
* The ``Log`` module will help you find the cause of the error. In local mode there are three buttons: ``View trial log``, ``View trial error`` and ``View trial stdout``. If you run on the OpenPAI or Kubeflow platform, you can see the trial stdout and NFS logs.
If you have any questions, you can tell us in the issue.
**local mode:**
.. image:: ../../img/webui-img/detail/log-local.png
:target: ../../img/webui-img/detail/log-local.png
:alt: logOnLocal
**OpenPAI, Kubeflow and other modes:**
.. image:: ../../img/webui-img/detail-pai.png
:target: ../../img/webui-img/detail-pai.png
:alt: detailPai
How to use dict intermediate results
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
`This discussion <https://github.com/microsoft/nni/discussions/4289>`_ may help you.
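As a minimal sketch of the convention (assuming the standard NNI dict-metric rule that the tuner reads the ``default`` key):

.. code-block:: python

    # Sketch: a dict metric must contain a 'default' key, which the tuner and
    # the Default metric chart use; other keys show up as extra table columns.
    def make_metric(accuracy, loss):
        return {'default': accuracy, 'loss': loss}

    metric = make_metric(0.93, 0.21)
    # In a real trial: nni.report_intermediate_result(metric) per epoch,
    # and nni.report_final_result(metric) at the end.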
.. _exp-manage-webportal:
Experiments management
----------------------
The experiments management page lets you manage all the experiments on your machine.
.. image:: ../../img/webui-img/managerExperimentList/experimentListNav.png
:target: ../../img/webui-img/managerExperimentList/experimentListNav.png
:alt: ExperimentList nav
* On the ``All experiments`` page, you can see all the experiments on your machine.
.. image:: ../../img/webui-img/managerExperimentList/expList.png
:target: ../../img/webui-img/managerExperimentList/expList.png
:alt: Experiments list
* When you want to see more details about an experiment, you can click its ID to jump to that experiment's detail page:
.. image:: ../../img/webui-img/managerExperimentList/toAnotherExp.png
:target: ../../img/webui-img/managerExperimentList/toAnotherExp.png
:alt: See this experiment detail
* If there are many experiments in the table, you can use the ``filter`` button.
.. image:: ../../img/webui-img/managerExperimentList/expFilter.png
:target: ../../img/webui-img/managerExperimentList/expFilter.png
:alt: filter button
Experiment details
------------------
View overview page
^^^^^^^^^^^^^^^^^^
* On the overview tab, you can see the experiment information, its status, and the performance of the ``top trials``.
.. image:: ../../img/webui-img/full-oview.png
:target: ../../img/webui-img/full-oview.png
:alt: overview
* If you want to see the experiment search space and config, click the ``Search space`` and ``Config`` buttons on the right (visible when you hover over them).
**Search space file:**
.. image:: ../../img/webui-img/searchSpace.png
:target: ../../img/webui-img/searchSpace.png
:alt: searchSpace
**Config file:**
.. image:: ../../img/webui-img/config.png
:target: ../../img/webui-img/config.png
:alt: config
* You can view and download the ``nni-manager/dispatcher log files`` here.
.. image:: ../../img/webui-img/review-log.png
:target: ../../img/webui-img/review-log.png
:alt: logfile
* If your experiment has many trials, you can change the refresh interval here.
.. image:: ../../img/webui-img/refresh-interval.png
:target: ../../img/webui-img/refresh-interval.png
:alt: refresh
* You can change some experiment configurations, such as ``maxExecDuration``, ``maxTrialNum`` and ``trial concurrency``, here.
.. image:: ../../img/webui-img/edit-experiment-param.png
:target: ../../img/webui-img/edit-experiment-param.png
:alt: editExperimentParams
View job default metric
^^^^^^^^^^^^^^^^^^^^^^^
* Click the ``Default metric`` tab to see the point chart of all trials. Hover over a point to see its default metric and search space information.
.. image:: ../../img/webui-img/default-metric.png
:target: ../../img/webui-img/default-metric.png
:alt: defaultMetricGraph
* Turn on the switch named ``Optimization curve`` to see the experiment's optimization curve.
.. image:: ../../img/webui-img/best-curve.png
:target: ../../img/webui-img/best-curve.png
:alt: bestCurveGraph
View hyper parameter
^^^^^^^^^^^^^^^^^^^^
Click the tab ``Hyper-parameter`` to see the parallel chart.
* You can click the ``add/remove`` button to add or remove axes.
* Drag the axes to swap axes on the chart.
* You can select the percentage to see top trials.
.. image:: ../../img/webui-img/hyperPara.png
:target: ../../img/webui-img/hyperPara.png
:alt: hyperParameterGraph
View Trial Duration
^^^^^^^^^^^^^^^^^^^
Click the tab ``Trial Duration`` to see the bar chart.
.. image:: ../../img/webui-img/trial_duration.png
:target: ../../img/webui-img/trial_duration.png
:alt: trialDurationGraph
View Trial Intermediate Result chart
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Click the tab ``Intermediate Result`` to see the line chart.
.. image:: ../../img/webui-img/trials_intermeidate.png
:target: ../../img/webui-img/trials_intermeidate.png
:alt: trialIntermediateGraph
A trial may produce many intermediate results during training. To see the trend of some trials more clearly, we provide a filtering function for the intermediate result chart.
You may find that some trials get better or worse at a certain intermediate result. This indicates that it is an important and relevant intermediate result. To take a closer look at this point, enter its corresponding X value at #Intermediate, then input the range of metrics at this intermediate result. In the picture below, we choose the No. 4 intermediate result and set the range of metrics to 0.8-1.
.. image:: ../../img/webui-img/filter-intermediate.png
:target: ../../img/webui-img/filter-intermediate.png
:alt: filterIntermediateGraph
View trials status
^^^^^^^^^^^^^^^^^^
Click the tab ``Trials Detail`` to see the status of all trials. Specifically:
* Trial detail: the trial's ID, duration, start time, end time, status, accuracy, and search space file.
.. image:: ../../img/webui-img/detail-local.png
:target: ../../img/webui-img/detail-local.png
:alt: detailLocalImage
* Searching for a specific trial by its ID, status, Trial No., and trial parameters is supported.
**Trial id:**
.. image:: ../../img/webui-img/detail/searchId.png
:target: ../../img/webui-img/detail/searchId.png
:alt: searchTrialId
**Trial No.:**
.. image:: ../../img/webui-img/detail/searchNo.png
:target: ../../img/webui-img/detail/searchNo.png
:alt: searchTrialNo.
**Trial status:**
.. image:: ../../img/webui-img/detail/searchStatus.png
:target: ../../img/webui-img/detail/searchStatus.png
:alt: searchStatus
**Trial parameters:**
``parameters whose type is choice:``
.. image:: ../../img/webui-img/detail/searchParameterChoice.png
:target: ../../img/webui-img/detail/searchParameterChoice.png
:alt: searchParameterChoice
``parameters whose type is not choice:``
.. image:: ../../img/webui-img/detail/searchParameterRange.png
:target: ../../img/webui-img/detail/searchParameterRange.png
:alt: searchParameterRange
* The ``Add column`` button lets you select which columns to show in the table. If you run an experiment whose final result is a dict, you can see the other keys in the table. You can choose the ``Intermediate count`` column to watch a trial's progress.
.. image:: ../../img/webui-img/addColumn.png
:target: ../../img/webui-img/addColumn.png
:alt: addColumnGraph
* If you want to compare some trials, you can select them and then click ``Compare`` to see the results.
.. image:: ../../img/webui-img/select-trial.png
:target: ../../img/webui-img/select-trial.png
:alt: selectTrialGraph
.. image:: ../../img/webui-img/compare.png
:target: ../../img/webui-img/compare.png
:alt: compareTrialsGraph
* You can use the button named ``Copy as python`` to copy the trial's parameters.
.. image:: ../../img/webui-img/copyParameter.png
:target: ../../img/webui-img/copyParameter.png
:alt: copyTrialParameters
* Intermediate Result chart: you can see the default metric in this chart by clicking the intermediate button.
.. image:: ../../img/webui-img/intermediate.png
:target: ../../img/webui-img/intermediate.png
:alt: intermeidateGraph
* Kill: you can kill a job whose status is running.
.. image:: ../../img/webui-img/kill-running.png
:target: ../../img/webui-img/kill-running.png
:alt: killTrial
.. bb68c969dbc2b3a2ec79d323cbd31401 .. ebf0627529ecdbf758f9db38701b4225
Web 界面 Web 界面
================== ========
Q&A
---
在 detail 页面的表格里明明有很多 trial 但是 Default Metric 图是空的没有数据
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. note::
首先你要明白 ``Default metric`` 和 ``Hyper parameter`` 图只展示成功 trial。
当你觉得 ``Default metric``、``Hyper parameter`` 图有问题的时候应该做什么
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* 从 Experiment summary 下载实验结果(实验配置,trial 信息,中间值),并把这些结果上传进 issue 里。
.. image:: ../../img/webui-img/summary.png
:target: ../../img/webui-img/summary.png
:alt: summary
当你的实验有故障时应该做什么
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* 点击实验状态右边的小图标把 error 信息截屏。
* 然后点击 learn about 去下载 log 文件。And then click the ``learn about`` to download ``nni-manager`` and ``dispatcher`` logfile.
* 点击页面导航栏的 About 按钮点 Feedback 开一个 issue,附带上以上的截屏和 log 信息。
.. image:: ../../img/webui-img/experimentError.png
:target: ../../img/webui-img/experimentError.png
:alt: experimentError
当你的 trial 跑失败了你应该怎么做
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* 使用 Customized trial 功能。向实验提交相同的 trial 参数即可。
.. image:: ../../img/webui-img/detail/customizedTrialButton.png
:target: ../../img/webui-img/detail/customizedTrialButton.png
:alt: customizedTrialButton
.. image:: ../../img/webui-img/detail/customizedTrial.png
:target: ../../img/webui-img/detail/customizedTrial.png
:alt: customizedTrial
* ``Log 模块`` 能帮助你找到错误原因。 有三个按钮: ``View trial log``, ``View trial error`` 和 ``View trial stdout`` 可查 log。如果你用 OpenPai 或者 Kubeflow,你能看到 trial stdout 和 nfs log。
有任何问题请在 issue 里联系我们。
**local mode:**
.. image:: ../../img/webui-img/detail/log-local.png
:target: ../../img/webui-img/detail/log-local.png
:alt: logOnLocal
**OpenPAI, Kubeflow and other mode:**
.. image:: ../../img/webui-img/detail-pai.png
:target: ../../img/webui-img/detail-pai.png
:alt: detailPai
怎样去使用 dict intermediate result
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
`The discussion <https://github.com/microsoft/nni/discussions/4289>`_ 能帮助你。
.. _exp-manage-webportal:
实验管理
--------
实验管理页面能统筹你机器上的所有实验。
Experiments 管理
-----------------------
点击导航栏上的 ``All experiments`` 标签。
.. image:: ../../img/webui-img/managerExperimentList/experimentListNav.png .. image:: ../../img/webui-img/managerExperimentList/experimentListNav.png
:target: ../../img/webui-img/managerExperimentList/experimentListNav.png :target: ../../img/webui-img/managerExperimentList/experimentListNav.png
...@@ -16,6 +107,8 @@ Experiments 管理 ...@@ -16,6 +107,8 @@ Experiments 管理
* 在 ``All experiments`` 页面,可以看到机器上的所有 Experiment。 * 在 ``All experiments`` 页面,可以看到机器上的所有 Experiment。
.. image:: ../../img/webui-img/managerExperimentList/expList.png .. image:: ../../img/webui-img/managerExperimentList/expList.png
:target: ../../img/webui-img/managerExperimentList/expList.png :target: ../../img/webui-img/managerExperimentList/expList.png
:alt: Experiments list :alt: Experiments list
...@@ -24,6 +117,8 @@ Experiments 管理 ...@@ -24,6 +117,8 @@ Experiments 管理
* 查看 Experiment 更多详细信息时,可以单击 trial ID 跳转至该 Experiment 详情页,如下所示: * 查看 Experiment 更多详细信息时,可以单击 trial ID 跳转至该 Experiment 详情页,如下所示:
.. image:: ../../img/webui-img/managerExperimentList/toAnotherExp.png .. image:: ../../img/webui-img/managerExperimentList/toAnotherExp.png
:target: ../../img/webui-img/managerExperimentList/toAnotherExp.png :target: ../../img/webui-img/managerExperimentList/toAnotherExp.png
:alt: See this experiment detail :alt: See this experiment detail
...@@ -32,50 +127,58 @@ Experiments 管理 ...@@ -32,50 +127,58 @@ Experiments 管理
* 如果表格里有很多 Experiment,可以使用 ``filter`` 按钮。 * 如果表格里有很多 Experiment,可以使用 ``filter`` 按钮。
.. image:: ../../img/webui-img/managerExperimentList/expFilter.png .. image:: ../../img/webui-img/managerExperimentList/expFilter.png
:target: ../../img/webui-img/managerExperimentList/expFilter.png :target: ../../img/webui-img/managerExperimentList/expFilter.png
:alt: filter button :alt: filter button
查看概要页面 实验详情
----------------- --------
点击 ``Overview`` 标签。
查看实验 overview 页面
^^^^^^^^^^^^^^^^^^^^^^^
* 在 Overview 标签上,可看到 Experiment trial 的概况、搜索空间以及 ``top trials`` 的结果。 * 在 Overview 标签上,可看到 Experiment trial 的概况、搜索空间以及 ``top trials`` 的结果。
.. image:: ../../img/webui-img/full-oview.png .. image:: ../../img/webui-img/full-oview.png
:target: ../../img/webui-img/full-oview.png :target: ../../img/webui-img/full-oview.png
:alt: overview :alt: overview
如果想查看 Experiment 配置和搜索空间,点击右边的 ``Search space`` 和 ``Config`` 按钮。 * 如果想查看 Experiment 配置和搜索空间,点击右边的 ``Search space`` 和 ``Config`` 按钮。
1. 搜索空间文件: **搜索空间文件:**
.. image:: ../../img/webui-img/searchSpace.png
:target: ../../img/webui-img/searchSpace.png
:alt: searchSpace
.. image:: ../../img/webui-img/searchSpace.png
:target: ../../img/webui-img/searchSpace.png
:alt: searchSpace
2. 配置文件:
**配置文件:**
.. image:: ../../img/webui-img/config.png
:target: ../../img/webui-img/config.png
:alt: config .. image:: ../../img/webui-img/config.png
:target: ../../img/webui-img/config.png
:alt: config
* 你可以在这里查看和下载 ``nni-manager/dispatcher 日志文件``。 * 你可以在这里查看和下载 ``nni-manager/dispatcher 日志文件``。
.. image:: ../../img/webui-img/review-log.png .. image:: ../../img/webui-img/review-log.png
:target: ../../img/webui-img/review-log.png :target: ../../img/webui-img/review-log.png
:alt: logfile :alt: logfile
...@@ -85,48 +188,28 @@ Experiments 管理 ...@@ -85,48 +188,28 @@ Experiments 管理
* 如果 Experiment 包含了较多 Trial,可改变刷新间隔。 * 如果 Experiment 包含了较多 Trial,可改变刷新间隔。
.. image:: ../../img/webui-img/refresh-interval.png .. image:: ../../img/webui-img/refresh-interval.png
:target: ../../img/webui-img/refresh-interval.png :target: ../../img/webui-img/refresh-interval.png
:alt: refresh :alt: refresh
* 单击按钮 ``Experiment summary`` ,可以查看和下载 Experiment 结果(``Experiment 配置``,``trial 信息`` 和 ``中间结果`` )。
.. image:: ../../img/webui-img/summary.png
:target: ../../img/webui-img/summary.png
:alt: summary
* 在这里修改 Experiment 配置(例如 ``maxExecDuration``, ``maxTrialNum`` 和 ``trial concurrency``)。 * 在这里修改 Experiment 配置(例如 ``maxExecDuration``, ``maxTrialNum`` 和 ``trial concurrency``)。
.. image:: ../../img/webui-img/edit-experiment-param.png .. image:: ../../img/webui-img/edit-experiment-param.png
:target: ../../img/webui-img/edit-experiment-param.png :target: ../../img/webui-img/edit-experiment-param.png
:alt: editExperimentParams :alt: editExperimentParams
* 通过单击 ``Learn about`` ,可以查看错误消息和 ``nni-manager/dispatcher 日志文件``
.. image:: ../../img/webui-img/experimentError.png
:target: ../../img/webui-img/experimentError.png
:alt: experimentError
* ``About`` 菜单内含有版本信息以及问题反馈渠道。
查看 trial 最终结果 查看 trial 最终结果
---------------------------------------------- ^^^^^^^^^^^^^^^^^^^^^
* ``Default metric`` 是所有 trial 的最终结果图。 在每一个结果上悬停鼠标可以看到 trial 信息,比如 trial id、No. 超参等。
* ``Default metric`` 是所有 trial 的最终结果图。 在每一个结果上悬停鼠标可以看到 trial 信息,比如 trial id、No.、超参等。
.. image:: ../../img/webui-img/default-metric.png .. image:: ../../img/webui-img/default-metric.png
...@@ -138,13 +221,15 @@ Experiments 管理 ...@@ -138,13 +221,15 @@ Experiments 管理
* 打开 ``Optimization curve`` 来查看 Experiment 的优化曲线。 * 打开 ``Optimization curve`` 来查看 Experiment 的优化曲线。
.. image:: ../../img/webui-img/best-curve.png .. image:: ../../img/webui-img/best-curve.png
:target: ../../img/webui-img/best-curve.png :target: ../../img/webui-img/best-curve.png
:alt: bestCurveGraph :alt: bestCurveGraph
查看超参 查看超参
-------------------- ^^^^^^^^^^
单击 ``Hyper-parameter`` 标签查看平行坐标系图。 单击 ``Hyper-parameter`` 标签查看平行坐标系图。
...@@ -154,6 +239,7 @@ Experiments 管理 ...@@ -154,6 +239,7 @@ Experiments 管理
* 通过调节百分比来查看 top trial。 * 通过调节百分比来查看 top trial。
.. image:: ../../img/webui-img/hyperPara.png .. image:: ../../img/webui-img/hyperPara.png
:target: ../../img/webui-img/hyperPara.png :target: ../../img/webui-img/hyperPara.png
:alt: hyperParameterGraph :alt: hyperParameterGraph
...@@ -161,11 +247,12 @@ Experiments 管理 ...@@ -161,11 +247,12 @@ Experiments 管理
查看 Trial 运行时间 查看 Trial 运行时间
------------------- ^^^^^^^^^^^^^^^^^^^^^^
点击 ``Trial Duration`` 标签来查看柱状图。 点击 ``Trial Duration`` 标签来查看柱状图。
.. image:: ../../img/webui-img/trial_duration.png .. image:: ../../img/webui-img/trial_duration.png
:target: ../../img/webui-img/trial_duration.png :target: ../../img/webui-img/trial_duration.png
:alt: trialDurationGraph :alt: trialDurationGraph
...@@ -173,11 +260,12 @@ Experiments 管理 ...@@ -173,11 +260,12 @@ Experiments 管理
查看 Trial 中间结果 查看 Trial 中间结果
------------------------------------ ^^^^^^^^^^^^^^^^^^^^^^
单击 ``Intermediate Result`` 标签查看折线图。 单击 ``Intermediate Result`` 标签查看折线图。
.. image:: ../../img/webui-img/trials_intermeidate.png .. image:: ../../img/webui-img/trials_intermeidate.png
:target: ../../img/webui-img/trials_intermeidate.png :target: ../../img/webui-img/trials_intermeidate.png
:alt: trialIntermediateGraph :alt: trialIntermediateGraph
...@@ -189,6 +277,7 @@ Trial 在训练过程中可能有大量中间结果。 为了更清楚的理解 ...@@ -189,6 +277,7 @@ Trial 在训练过程中可能有大量中间结果。 为了更清楚的理解
这样可以发现 Trial 在某个中间结果上会变得更好或更差。 这表明它是一个重要的并相关的中间结果。 如果要仔细查看这个点,可以在 #Intermediate 中输入其 X 坐标。 并输入这个中间结果的指标范围。 在下图中,选择了第四个中间结果并将指标范围设置为了 0.8 -1。 这样可以发现 Trial 在某个中间结果上会变得更好或更差。 这表明它是一个重要的并相关的中间结果。 如果要仔细查看这个点,可以在 #Intermediate 中输入其 X 坐标。 并输入这个中间结果的指标范围。 在下图中,选择了第四个中间结果并将指标范围设置为了 0.8 -1。
.. image:: ../../img/webui-img/filter-intermediate.png .. image:: ../../img/webui-img/filter-intermediate.png
:target: ../../img/webui-img/filter-intermediate.png :target: ../../img/webui-img/filter-intermediate.png
:alt: filterIntermediateGraph :alt: filterIntermediateGraph
...@@ -196,7 +285,7 @@ Trial 在训练过程中可能有大量中间结果。 为了更清楚的理解 ...@@ -196,7 +285,7 @@ Trial 在训练过程中可能有大量中间结果。 为了更清楚的理解
查看 Trial 状态 查看 Trial 状态
------------------ ^^^^^^^^^^^^^^^^^^
点击 ``Trials Detail`` 标签查看所有 Trial 的状态。具体如下: 点击 ``Trials Detail`` 标签查看所有 Trial 的状态。具体如下:
...@@ -204,52 +293,71 @@ Trial 在训练过程中可能有大量中间结果。 为了更清楚的理解 ...@@ -204,52 +293,71 @@ Trial 在训练过程中可能有大量中间结果。 为了更清楚的理解
* Trial 详情:Trial id,持续时间,开始时间,结束时间,状态,精度和 search space 文件。 * Trial 详情:Trial id,持续时间,开始时间,结束时间,状态,精度和 search space 文件。
.. image:: ../../img/webui-img/detail-local.png .. image:: ../../img/webui-img/detail-local.png
:target: ../../img/webui-img/detail-local.png :target: ../../img/webui-img/detail-local.png
:alt: detailLocalImage :alt: detailLocalImage
* Support searching by trial id, status, trial number and trial parameters.

**Trial id:**

.. image:: ../../img/webui-img/detail/searchId.png
   :target: ../../img/webui-img/detail/searchId.png
   :alt: searchTrialId

**Trial No.:**

.. image:: ../../img/webui-img/detail/searchNo.png
   :target: ../../img/webui-img/detail/searchNo.png
   :alt: searchTrialNo.

**Trial status:**

.. image:: ../../img/webui-img/detail/searchStatus.png
   :target: ../../img/webui-img/detail/searchStatus.png
   :alt: searchStatus

**Trial parameters:**

Parameters whose type is choice:

.. image:: ../../img/webui-img/detail/searchParameterChoice.png
   :target: ../../img/webui-img/detail/searchParameterChoice.png
   :alt: searchParameterChoice

Parameters whose type is not choice:

.. image:: ../../img/webui-img/detail/searchParameterRange.png
   :target: ../../img/webui-img/detail/searchParameterRange.png
   :alt: searchParameterRange
* The ``Add column`` button allows you to select which columns are shown in the table. If the final result of the experiment is a dict, you can view other keys in the table. You can select the ``Intermediate count`` column to see the trials' progress.

.. image:: ../../img/webui-img/addColumn.png
   :target: ../../img/webui-img/addColumn.png
   :alt: addColumnGraph
* To compare some trials, select them and click ``Compare`` to see the results.

.. image:: ../../img/webui-img/select-trial.png
   :target: ../../img/webui-img/select-trial.png
   :alt: selectTrialGraph

.. image:: ../../img/webui-img/compare.png
   :target: ../../img/webui-img/compare.png
   :alt: compareTrialsGraph
* For ``Tensorboard``, please refer to `this document <Tensorboard.rst>`__.
* You can use the ``Copy as python`` button to copy a trial's parameters.

.. image:: ../../img/webui-img/copyParameter.png
   :target: ../../img/webui-img/copyParameter.png
   :alt: copyTrialParameters
* You can see a trial's log on the ``Log`` tab. In local mode there are three buttons: ``View trial log``, ``View trial error`` and ``View trial stdout``. If the experiment runs on the OpenPAI or Kubeflow platform, you can also see the hdfsLog.

1. Local mode:

.. image:: ../../img/webui-img/detail/log-local.png
   :target: ../../img/webui-img/detail/log-local.png
   :alt: logOnLocal

2. OpenPAI, Kubeflow and other modes:

.. image:: ../../img/webui-img/detail-pai.png
   :target: ../../img/webui-img/detail-pai.png
   :alt: detailPai
* Intermediate result graph: you can see the default metric in this graph by clicking the ``Intermediate`` button.

.. image:: ../../img/webui-img/intermediate.png
   :target: ../../img/webui-img/intermediate.png
   :alt: intermediateGraph
* Kill: you can kill a running trial.

.. image:: ../../img/webui-img/kill-running.png
* Customized trial: you can change a trial's parameters and then submit them to the experiment. If you want to rerun a failed trial, you can submit the same parameters to the experiment.
.. image:: ../../img/webui-img/detail/customizedTrialButton.png
:target: ../../img/webui-img/detail/customizedTrialButton.png
:alt: customizedTrialButton
.. image:: ../../img/webui-img/detail/customizedTrial.png
:target: ../../img/webui-img/detail/customizedTrial.png
:alt: customizedTrial
###########################
Hyperparameter Optimization
###########################
.. toctree::
:maxdepth: 2
Implement Custom Tuners and Assessors <custom_algorithm>
Install Custom or 3rd-party Tuners and Assessors <custom_algorithm_installation>
Tuner Benchmark <hpo_benchmark>
Assessor: Early Stopping
========================
In HPO, some hyperparameter sets may show obviously poor performance, and it is unnecessary to finish evaluating them.
This is called *early stopping*, and in NNI early stopping algorithms are called *assessors*.
An assessor monitors *intermediate results* of each *trial*.
If a trial is predicted to produce a suboptimal final result, the assessor will stop that trial immediately,
to save computing resources for other hyperparameter sets.
As introduced in the quickstart tutorial, a trial is the evaluation process of a hyperparameter set,
and intermediate results are reported with :func:`nni.report_intermediate_result` API in trial code.
Typically, intermediate results are accuracy or loss metrics of each epoch.
Using an assessor will increase the efficiency of computing resources,
but may slightly reduce the prediction accuracy of tuners.
It is recommended to use an assessor when computing resources are insufficient.
Common Usage
------------
The usage of assessors is similar to that of tuners.
To use a built-in assessor you need to specify its name and arguments:
.. code-block:: python
   config.assessor.name = 'Medianstop'
   config.assessor.class_args = {'optimize_mode': 'maximize'}
Built-in Assessors
------------------
.. list-table::
:header-rows: 1
:widths: auto
* - Assessor
- Brief Introduction of Algorithm
* - :class:`Medianstop <nni.algorithms.hpo.medianstop_assessor.MedianstopAssessor>`
- It stops a pending trial X at step S if
the trial’s best objective value by step S is strictly worse than the median value of
the running averages of all completed trials’ objectives reported up to step S.
* - :class:`Curvefitting <nni.algorithms.hpo.curvefitting_assessor.CurvefittingAssessor>`
- It stops a pending trial X at step S if
the trial's forecast result at the target step has converged and is worse than the best performance in the history.
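For intuition, the median stopping rule can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not NNI's actual ``MedianstopAssessor`` code, and it assumes a higher metric is better:

```python
from statistics import median

def median_stop(trial_history, completed_histories):
    """Decide whether to stop a running trial at its current step.

    trial_history: intermediate results of the running trial so far.
    completed_histories: intermediate-result lists of completed trials.
    Returns True if the trial should be stopped.
    """
    step = len(trial_history)
    best_so_far = max(trial_history)
    # Running average of each completed trial's results up to this step.
    running_avgs = [
        sum(h[:step]) / step
        for h in completed_histories if len(h) >= step
    ]
    if not running_avgs:
        return False  # not enough history to judge
    return best_so_far < median(running_avgs)

# A trial stuck at low accuracy is stopped once it falls below the median.
print(median_stop([0.1, 0.12, 0.13],
                  [[0.3, 0.5, 0.6], [0.2, 0.4, 0.5], [0.1, 0.3, 0.4]]))
```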
Customize Tuner
===============

NNI provides state-of-the-art tuning algorithms among its built-in tuners. NNI also supports building a tuner by yourself for your tuning demands.
Write a more advanced automl algorithm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The methods above are usually enough to write a general tuner. However, users may also want more methods, for example intermediate results and trials' states (i.e., the methods in assessor), in order to build a more powerful automl algorithm. Therefore, we have another concept called ``advisor``, which directly inherits from ``MsgDispatcherBase`` in :githublink:`msg_dispatcher_base.py <nni/runtime/msg_dispatcher_base.py>`. Please refer to `here <CustomizeAdvisor.rst>`__ for how to write a customized advisor.
Customize Assessor
==================
NNI supports building an assessor by yourself for your tuning demands.

If you want to implement a customized assessor, there are three things to do:

#. Inherit the base Assessor class
#. Implement the assess_trial function
#. Configure your customized assessor in the experiment YAML config file
**1. Inherit the base Assessor class**
.. code-block:: python
from nni.assessor import Assessor
class CustomizedAssessor(Assessor):
def __init__(self, ...):
...
**2. Implement the assess_trial function**

.. code-block:: python

   from nni.assessor import Assessor, AssessResult

   class CustomizedAssessor(Assessor):
       def __init__(self, ...):
           ...

       def assess_trial(self, trial_history):
           """
           Determine whether a trial should be stopped early. Must override.

           trial_history: a list of intermediate result objects.
           Returns AssessResult.Good or AssessResult.Bad.
           """
           # your code goes here
           ...
**3. Configure your customized Assessor in the experiment YAML config file**

NNI needs to locate your customized assessor class and instantiate it, so you need to specify the location of the customized assessor class and pass literal values as parameters to the ``__init__`` constructor.
.. code-block:: yaml
assessor:
codeDir: /home/abc/myassessor
classFileName: my_customized_assessor.py
className: CustomizedAssessor
# Any parameter need to pass to your Assessor class __init__ constructor
# can be specified in this optional classArgs field, for example
classArgs:
arg1: value1
Please note that in step **2**, the object ``trial_history`` is exactly the object that the trial sends to the assessor via the SDK function ``report_intermediate_result``.

The working directory of your assessor is ``<home>/nni-experiments/<experiment_id>/log``, which can be retrieved with the environment variable ``NNI_LOG_DIRECTORY``.

For more detailed examples, see:
* :githublink:`medianstop-assessor <nni/algorithms/hpo/medianstop_assessor.py>`
* :githublink:`curvefitting-assessor <nni/algorithms/hpo/curvefitting_assessor/>`
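To build intuition for what an ``assess_trial`` implementation might do, here is a standalone pure-Python sketch of one possible stopping criterion. This is a hypothetical rule, not one of NNI's built-in assessors; the function name and its ``tolerance`` parameter are illustrative:

```python
def assess_trial_sketch(trial_history, best_history, tolerance=0.05):
    """Return True (Good) to continue the trial, False (Bad) to stop it.

    trial_history: intermediate metrics of the running trial (higher is better).
    best_history: intermediate metrics of the best finished trial.
    The trial is stopped when it lags the best trial by more than
    `tolerance` at the same step.
    """
    step = len(trial_history) - 1
    if step >= len(best_history):
        return True  # no reference point at this step; keep running
    return trial_history[step] >= best_history[step] - tolerance

# A trial lagging far behind the best trial gets stopped.
print(assess_trial_sketch([0.2, 0.3], [0.5, 0.9]))
```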
After the script finishes, the final scores of each tuner are summarized in the file ``results[time]/reports/performances.txt``.
Since the file is large, we only show the following screenshot and summarize other important statistics instead.
.. image:: ../../img/hpo_benchmark/performances.png
   :target: ../../img/hpo_benchmark/performances.png
   :alt:
When the results are parsed, the tuners are also ranked based on their final performance. The following three tables show
Besides these reports, our script also generates two graphs for each fold of each task: one graph presents the best score received by each tuner until trial x, and another graph shows the score that each tuner receives in trial x. These two graphs can give some information regarding how the tuners are "converging" to their final solution. We found that for "nnismall", tuners on the random forest model with search space defined in ``/examples/trials/benchmarking/automlbenchmark/nni/extensions/NNI/architectures/run_random_forest.py`` generally converge to the final solution after 40 to 60 trials. As there are too many graphs to include in a single report (96 graphs in total), we only present 10 graphs here.
.. image:: ../../img/hpo_benchmark/car_fold1_1.jpg
   :target: ../../img/hpo_benchmark/car_fold1_1.jpg
   :alt:

.. image:: ../../img/hpo_benchmark/car_fold1_2.jpg
   :target: ../../img/hpo_benchmark/car_fold1_2.jpg
   :alt:
The previous two graphs are generated for fold 1 of the task "car". In the first graph, we observe that most tuners find a relatively good solution within 40 trials. In this experiment, among all tuners, the DNGOTuner converges fastest to the best solution (within 10 trials). Its best score improved three times over the entire experiment. In the second graph, we observe that most tuners have their scores fluctuate between 0.8 and 1 throughout the experiment. However, it seems that the Anneal tuner (green line) is more unstable (having more fluctuations) while the GPTuner has a more stable pattern. This may be interpreted as the Anneal tuner exploring more aggressively than the GPTuner, so its scores for different trials vary a lot. Regardless, although this pattern can to some extent hint at a tuner's position on the explore-exploit tradeoff, it is not a comprehensive evaluation of a tuner's effectiveness.
.. image:: ../../img/hpo_benchmark/christine_fold0_1.jpg
   :target: ../../img/hpo_benchmark/christine_fold0_1.jpg
   :alt:

.. image:: ../../img/hpo_benchmark/christine_fold0_2.jpg
   :target: ../../img/hpo_benchmark/christine_fold0_2.jpg
   :alt:

.. image:: ../../img/hpo_benchmark/cnae-9_fold0_1.jpg
   :target: ../../img/hpo_benchmark/cnae-9_fold0_1.jpg
   :alt:

.. image:: ../../img/hpo_benchmark/cnae-9_fold0_2.jpg
   :target: ../../img/hpo_benchmark/cnae-9_fold0_2.jpg
   :alt:

.. image:: ../../img/hpo_benchmark/credit-g_fold1_1.jpg
   :target: ../../img/hpo_benchmark/credit-g_fold1_1.jpg
   :alt:

.. image:: ../../img/hpo_benchmark/credit-g_fold1_2.jpg
   :target: ../../img/hpo_benchmark/credit-g_fold1_2.jpg
   :alt:

.. image:: ../../img/hpo_benchmark/titanic_2_fold1_1.jpg
   :target: ../../img/hpo_benchmark/titanic_2_fold1_1.jpg
   :alt:

.. image:: ../../img/hpo_benchmark/titanic_2_fold1_2.jpg
   :target: ../../img/hpo_benchmark/titanic_2_fold1_2.jpg
   :alt:
###########################
Hyperparameter Optimization
###########################
Auto hyperparameter optimization (HPO), or auto tuning, is one of the key features of NNI.
.. raw:: html
<script>
const parts = window.location.href.split('/');
if (parts.pop() === 'index.html') {
window.location.replace(parts.join('/') + '/overview.html')
}
</script>
.. toctree::
:maxdepth: 2
Overview <overview>
Tutorial </tutorials/hpo_quickstart_pytorch/main>
Search Space <search_space>
Tuners <tuners>
Assessors <assessors>
TensorBoard Integration <tensorboard>
Advanced Usage <advanced_toctree.rst>
:orphan:
NNI Annotation
==============
Hyperparameter Optimization Overview
====================================
Auto hyperparameter optimization (HPO), or auto tuning, is one of the key features of NNI.
Introduction to HPO
-------------------
In machine learning, a hyperparameter is a parameter whose value is used to control the learning process,
and HPO is the problem of choosing a set of optimal hyperparameters for a learning algorithm.
(`From <https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)>`__
`Wikipedia <https://en.wikipedia.org/wiki/Hyperparameter_optimization>`__)
The following code snippet demonstrates a naive HPO process:
.. code-block:: python
best_hyperparameters = None
best_accuracy = 0
for learning_rate in [0.1, 0.01, 0.001, 0.0001]:
for momentum in [i / 10 for i in range(10)]:
for activation_type in ['relu', 'tanh', 'sigmoid']:
model = build_model(activation_type)
train_model(model, learning_rate, momentum)
accuracy = evaluate_model(model)
if accuracy > best_accuracy:
best_accuracy = accuracy
best_hyperparameters = (learning_rate, momentum, activation_type)
print('Best hyperparameters:', best_hyperparameters)
As you may have noticed, the example will train 4×10×3 = 120 models in total.
Since it consumes so many computing resources, you may want to:

1. Find the best set of hyperparameters with fewer iterations.
2. Train the models on distributed platforms.
3. Have a portal to monitor and control the process.
And NNI will do them for you.
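For instance, a tuner that samples configurations at random, instead of exhaustively enumerating the grid, can often find a good configuration in far fewer than 120 trainings. Below is a minimal sketch of this idea; the ``build_model``/``train_model``/``evaluate_model`` stubs are hypothetical placeholders standing in for real model code:

```python
import random

# Hypothetical stand-ins for the model functions used in the grid example.
def build_model(activation_type):
    return activation_type

def train_model(model, learning_rate, momentum):
    pass

def evaluate_model(model):
    return random.random()  # pretend this measures accuracy

random.seed(0)
search_space = {
    'learning_rate': [0.1, 0.01, 0.001, 0.0001],
    'momentum': [i / 10 for i in range(10)],
    'activation_type': ['relu', 'tanh', 'sigmoid'],
}

best_hyperparameters, best_accuracy = None, 0
for _ in range(20):  # 20 random samples instead of 120 grid points
    hp = {name: random.choice(values) for name, values in search_space.items()}
    model = build_model(hp['activation_type'])
    train_model(model, hp['learning_rate'], hp['momentum'])
    accuracy = evaluate_model(model)
    if accuracy > best_accuracy:
        best_accuracy, best_hyperparameters = accuracy, hp

print('Best hyperparameters:', best_hyperparameters)
```

A real tuner goes further by using the observed results to bias future samples, which is exactly what NNI's built-in tuners provide.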
Key Features of NNI HPO
-----------------------
Tuning Algorithms
^^^^^^^^^^^^^^^^^
NNI provides *tuners* to speed up the process of finding the best hyperparameter set.
A tuner, or a tuning algorithm, decides the order in which hyperparameter sets are evaluated.
Based on the results of historical hyperparameter sets, an efficient tuner can predict where the best hyperparameters are located,
and find them in far fewer attempts.
The naive example above evaluates all possible hyperparameter sets in constant order, ignoring the historical results.
This is the brute-force tuning algorithm called *grid search*.
NNI has out-of-the-box support for a variety of popular tuners.
These include naive algorithms like random search and grid search, Bayesian-based algorithms like TPE and SMAC,
RL-based algorithms like PPO, and much more.
Main article: :doc:`tuners`
Training Platforms
^^^^^^^^^^^^^^^^^^
If you are not interested in distributed platforms, you can simply run NNI HPO on your current computer,
just like any ordinary Python library.
And when you want to leverage more computing resources, NNI provides built-in integration for training platforms
from simple on-premise servers to scalable commercial clouds.
With NNI you can write one piece of model code, and concurrently evaluate hyperparameter sets on local machine, SSH servers,
Kubernetes-based clusters, AzureML service, and much more.
Main article: :doc:`/experiment/training_service`
Web Portal
^^^^^^^^^^
NNI provides a web portal to monitor training progress, to visualize hyperparameter performance,
to manually customize hyperparameters, and to manage multiple HPO experiments.
Main article: :doc:`/experiment/web_portal`
.. image:: ../../static/img/webui.gif
:width: 100%
Tutorials
---------
To start using NNI HPO, choose the quickstart tutorial of your favorite framework:
* :doc:`PyTorch tutorial </tutorials/hpo_quickstart_pytorch/main>`
* :doc:`TensorFlow tutorial </tutorials/hpo_quickstart_tensorflow/main>`
Extra Features
--------------
After you are familiar with basic usage, you can explore more HPO features:
* :doc:`Use command line tool to create and manage experiments (nnictl) </reference/nnictl>`
* :doc:`Early stop non-optimal models (assessor) <assessors>`
* :doc:`TensorBoard integration <tensorboard>`
* :doc:`Implement your own algorithm <custom_algorithm>`
* :doc:`Benchmark tuners <hpo_benchmark>`
Built-in Algorithms
-------------------
Tuning Algorithms
^^^^^^^^^^^^^^^^^
Main article: :doc:`tuners`
.. list-table::
:header-rows: 1
:widths: auto
* - Name
- Category
- Brief Description
* - :class:`Random <nni.algorithms.hpo.random_tuner.RandomTuner>`
- Basic
- Naive random search.
* - :class:`GridSearch <nni.algorithms.hpo.gridsearch_tuner.GridSearchTuner>`
- Basic
- Brute-force search.
* - :class:`TPE <nni.algorithms.hpo.tpe_tuner.TpeTuner>`
- Bayesian
- Tree-structured Parzen Estimator.
* - :class:`Anneal <nni.algorithms.hpo.hyperopt_tuner.HyperoptTuner>`
- Classic
- Simulated annealing algorithm.
* - :class:`Evolution <nni.algorithms.hpo.evolution_tuner.EvolutionTuner>`
- Classic
- Naive evolution algorithm.
* - :class:`SMAC <nni.algorithms.hpo.smac_tuner.SMACTuner>`
- Bayesian
- Sequential Model-based optimization for general Algorithm Configuration.
* - :class:`Hyperband <nni.algorithms.hpo.hyperband_advisor.Hyperband>`
- Advanced
- Evaluate more hyperparameter sets by adaptively allocating resources.
* - :class:`MetisTuner <nni.algorithms.hpo.metis_tuner.MetisTuner>`
- Bayesian
- Robustly optimizing tail latencies of cloud systems.
* - :class:`BOHB <nni.algorithms.hpo.bohb_advisor.BOHB>`
- Advanced
- Bayesian Optimization with HyperBand.
* - :class:`GPTuner <nni.algorithms.hpo.gp_tuner.GPTuner>`
- Bayesian
- Gaussian Process.
* - :class:`PBTTuner <nni.algorithms.hpo.pbt_tuner.PBTTuner>`
- Advanced
- Population Based Training of neural networks.
* - :class:`DNGOTuner <nni.algorithms.hpo.dngo_tuner.DNGOTuner>`
- Bayesian
- Deep Networks for Global Optimization.
* - :class:`PPOTuner <nni.algorithms.hpo.ppo_tuner.PPOTuner>`
- RL
- Proximal Policy Optimization.
* - :class:`BatchTuner <nni.algorithms.hpo.batch_tuner.BatchTuner>`
- Basic
- Manually specify hyperparameter sets.
Early Stopping
^^^^^^^^^^^^^^
Main article: :doc:`assessors`
.. list-table::
:header-rows: 1
:widths: auto
* - Name
- Brief Description
* - :class:`Medianstop <nni.algorithms.hpo.medianstop_assessor.MedianstopAssessor>`
- Stop if the hyperparameter set performs worse than median at any step.
* - :class:`Curvefitting <nni.algorithms.hpo.curvefitting_assessor.CurvefittingAssessor>`
- Stop if the learning curve will likely converge to suboptimal result.
     -
   * - Grid Search Tuner
     - :raw-html:`&#10003;`
     - :raw-html:`&#10003;`
     - :raw-html:`&#10003;`
     - :raw-html:`&#10003;`
     - :raw-html:`&#10003;`
     - :raw-html:`&#10003;`
     - :raw-html:`&#10003;`
     - :raw-html:`&#10003;`
     - :raw-html:`&#10003;`
     - :raw-html:`&#10003;`
     - :raw-html:`&#10003;`
   * - Hyperband Advisor
     - :raw-html:`&#10003;`
     -
Tuner: Tuning Algorithms
========================
The tuner decides which hyperparameter sets will be evaluated. It is one of the most important parts of NNI HPO.

A tuner works like the following pseudocode:
.. code-block:: python
space = get_search_space()
history = []
while not experiment_end:
hp = suggest_hyperparameter_set(space, history)
result = run_trial(hp)
history.append((hp, result))
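To make the loop concrete, here is a pure-Python sketch of a random ``suggest_hyperparameter_set`` working against a toy objective. All names here are illustrative placeholders, not NNI APIs:

```python
import random

def get_search_space():
    # A toy search space: one numeric and one categorical hyperparameter.
    return {'x': [0, 1, 2, 3, 4], 'y': ['a', 'b']}

def suggest_hyperparameter_set(space, history):
    # A random "tuner": ignore history and sample each dimension uniformly.
    # A smarter tuner would exploit history to bias future suggestions.
    return {name: random.choice(values) for name, values in space.items()}

def run_trial(hp):
    # Toy objective: bigger x is better, and 'a' gets a bonus.
    return hp['x'] + (1 if hp['y'] == 'a' else 0)

random.seed(0)
space = get_search_space()
history = []
for _ in range(10):  # the experiment ends after a fixed trial budget
    hp = suggest_hyperparameter_set(space, history)
    result = run_trial(hp)
    history.append((hp, result))

best_hp, best_result = max(history, key=lambda pair: pair[1])
print(best_hp, best_result)
```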
NNI has out-of-the-box support for many popular tuning algorithms.
They should be sufficient to cover most typical machine learning scenarios.
However, if you have a very specific demand, or if you have designed an algorithm yourself,
you can also implement your own tuner: :doc:`custom_algorithm`
Common Usage
------------
All built-in tuners have similar usage.
To use a built-in tuner, you need to specify its name and arguments in the experiment config,
and provide a standard :doc:`search_space`.
Some tuners, like SMAC and DNGO, have extra dependencies that need to be installed separately.
Please check each tuner's reference page for what arguments it supports and whether it needs extra dependencies.
As a general example, the random tuner can be configured as follows:
.. code-block:: python
config.search_space = {
'x': {'_type': 'uniform', '_value': [0, 1]},
'y': {'_type': 'choice', '_value': ['a', 'b', 'c']}
}
config.tuner.name = 'Random'
config.tuner.class_args = {'seed': 0}
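To illustrate what the two search space entries above mean, a hypothetical sampler (illustrative code, not NNI internals) would draw hyperparameter sets from such a spec like this:

```python
import random

search_space = {
    'x': {'_type': 'uniform', '_value': [0, 1]},
    'y': {'_type': 'choice', '_value': ['a', 'b', 'c']},
}

def sample(space, rng):
    """Draw one hyperparameter set from an NNI-style search space spec."""
    params = {}
    for name, spec in space.items():
        if spec['_type'] == 'uniform':
            low, high = spec['_value']
            params[name] = rng.uniform(low, high)   # continuous range
        elif spec['_type'] == 'choice':
            params[name] = rng.choice(spec['_value'])  # categorical
        else:
            raise ValueError('unsupported type: ' + spec['_type'])
    return params

rng = random.Random(0)
print(sample(search_space, rng))
```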
Built-in Tuners
---------------
.. list-table::
:header-rows: 1
:widths: auto
* - Tuner
- Brief Introduction
* - :class:`TPE <nni.algorithms.hpo.tpe_tuner.TpeTuner>`
- Tree-structured Parzen Estimator, a classic Bayesian optimization algorithm.
(`paper <https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf>`__)
TPE is a lightweight tuner that has no extra dependency and supports all search space types.
Good to start with.
The drawback is that TPE cannot discover relationships between different hyperparameters.
* - :class:`Random <nni.algorithms.hpo.random_tuner.RandomTuner>`
- Naive random search, the baseline. It supports all search space types.
* - :class:`Grid Search <nni.algorithms.hpo.gridsearch_tuner.GridSearchTuner>`
- Divides search space into evenly spaced grid, and performs brute-force traverse. Another baseline.
It supports all search space types.
Recommended when the search space is small, and when you want to find the strictly optimal hyperparameters.
* - :class:`Anneal <nni.algorithms.hpo.hyperopt_tuner.HyperoptTuner>`
- This simple annealing algorithm begins by sampling from the prior, but tends over time to sample from points closer and closer to the best ones observed. This algorithm is a simple variation on the random search that leverages smoothness in the response surface. The annealing rate is not adaptive.
* - :class:`Evolution <nni.algorithms.hpo.evolution_tuner.EvolutionTuner>`
- Naive Evolution comes from Large-Scale Evolution of Image Classifiers. It randomly initializes a population based on the search space. For each generation, it chooses better ones and does some mutation (e.g., change a hyperparameter, add/remove one layer) on them to get the next generation. Naive Evolution requires many trials to work, but it's very simple and easy to extend with new features. `Reference paper <https://arxiv.org/pdf/1703.01041.pdf>`__
* - :class:`SMAC <nni.algorithms.hpo.smac_tuner.SMACTuner>`
- SMAC is based on Sequential Model-Based Optimization (SMBO). It adapts the most prominent previously used model class (Gaussian stochastic process models) and introduces the model class of random forests to SMBO, in order to handle categorical parameters. The SMAC supported by NNI is a wrapper on the SMAC3 GitHub repo.
Note that SMAC needs to be installed with the ``pip install nni[SMAC]`` command. `Reference Paper, <https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf>`__ `GitHub Repo <https://github.com/automl/SMAC3>`__
* - :class:`Batch <nni.algorithms.hpo.batch_tuner.BatchTuner>`
- Batch tuner allows users to simply provide several configurations (i.e., choices of hyper-parameters) for their trial code. After finishing all the configurations, the experiment is done. Batch tuner only supports the type choice in search space spec.
* - :class:`Hyperband <nni.algorithms.hpo.hyperband_advisor.Hyperband>`
- Hyperband tries to use limited resources to explore as many configurations as possible and returns the most promising ones as a final result. The basic idea is to generate many configurations and run them for a small number of trials. The half least-promising configurations are thrown out, the remaining are further trained along with a selection of new configurations. The size of these populations is sensitive to resource constraints (e.g. allotted search time). `Reference Paper <https://arxiv.org/pdf/1603.06560.pdf>`__
* - :class:`Metis <nni.algorithms.hpo.metis_tuner.MetisTuner>`
- Metis offers the following benefits when it comes to tuning parameters: While most tools only predict the optimal configuration, Metis gives you two outputs: (a) current prediction of optimal configuration, and (b) suggestion for the next trial. No more guesswork. While most tools assume training datasets do not have noisy data, Metis actually tells you if you need to re-sample a particular hyper-parameter. `Reference Paper <https://www.microsoft.com/en-us/research/publication/metis-robustly-tuning-tail-latencies-cloud-systems/>`__
* - :class:`BOHB <nni.algorithms.hpo.bohb_advisor.BOHB>`
- BOHB is a follow-up work to Hyperband. It targets the weakness of Hyperband that new configurations are generated randomly without leveraging finished trials. For the name BOHB, HB means Hyperband, BO means Bayesian Optimization. BOHB leverages finished trials by building multiple TPE models, a proportion of new configurations are generated through these models. `Reference Paper <https://arxiv.org/abs/1807.01774>`__
* - :class:`GP <nni.algorithms.hpo.gp_tuner.GPTuner>`
- Gaussian Process Tuner is a sequential model-based optimization (SMBO) approach with Gaussian Process as the surrogate. `Reference Paper <https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf>`__, `Github Repo <https://github.com/fmfn/BayesianOptimization>`__
* - :class:`PBT <nni.algorithms.hpo.pbt_tuner.PBTTuner>`
- PBT Tuner is a simple asynchronous optimization algorithm which effectively utilizes a fixed computational budget to jointly optimize a population of models and their hyperparameters to maximize performance. `Reference Paper <https://arxiv.org/abs/1711.09846v1>`__
* - :class:`DNGO <nni.algorithms.hpo.dngo_tuner.DNGOTuner>`
- Use of neural networks as an alternative to GPs to model distributions over functions in Bayesian optimization.
Comparison
----------
These articles have compared built-in tuners' performance on some different tasks:
:doc:`hpo_benchmark_stats`
:doc:`/misc/hpo_comparison`