Install on Linux & Mac
======================
Installation
------------
Installation on Linux and macOS follows the same instructions, given below.
Install NNI through pip
^^^^^^^^^^^^^^^^^^^^^^^
Prerequisite: ``python 64-bit >= 3.6``
.. code-block:: bash
python3 -m pip install --upgrade nni
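To quickly verify the installation, you can optionally import the package and print its version; a minimal check, assuming ``nni.__version__`` is exposed by the installed release:
.. code-block:: bash
python3 -c "import nni; print(nni.__version__)"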
Install NNI through source code
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you are interested in a specific or the latest code version, you can install NNI from source code.
Prerequisites: ``python 64-bit >=3.6``, ``git``
.. code-block:: bash
git clone -b v2.6 https://github.com/Microsoft/nni.git
cd nni
python3 -m pip install -U -r dependencies/setup.txt
python3 -m pip install -r dependencies/develop.txt
python3 setup.py develop
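After the source install completes, a quick sanity check is to confirm that the ``nnictl`` command line tool is available on your PATH:
.. code-block:: bash
nnictl --version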
Build wheel package from NNI source code
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The previous section shows how to install NNI in `development mode <https://setuptools.readthedocs.io/en/latest/userguide/development_mode.html>`__.
If you want to perform a persistent install instead, we recommend building your own wheel package and installing from the wheel.
.. code-block:: bash
git clone -b v2.6 https://github.com/Microsoft/nni.git
cd nni
export NNI_RELEASE=2.6
python3 -m pip install -U -r dependencies/setup.txt
python3 -m pip install -r dependencies/develop.txt
python3 setup.py clean --all
python3 setup.py build_ts
python3 setup.py bdist_wheel -p manylinux1_x86_64
python3 -m pip install dist/nni-2.6-py3-none-manylinux1_x86_64.whl
Use NNI in a docker image
^^^^^^^^^^^^^^^^^^^^^^^^^
You can also install NNI in a docker image. Please follow the instructions `here <../Tutorial/HowToUseDocker.rst>`__ to build an NNI docker image. The NNI docker image can also be retrieved from Docker Hub through the command ``docker pull msranni/nni:latest``.
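As a minimal sketch, pulling the image and opening a shell inside a container might look like the following (the ``8080`` port mapping for the Web UI is an assumption; adjust it to your setup):
.. code-block:: bash
docker pull msranni/nni:latest
# map the default Web UI port and open an interactive shell in the container
docker run -it -p 8080:8080 msranni/nni:latest /bin/bash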
Verify installation
-------------------
*
Download the examples by cloning the source code.
.. code-block:: bash
git clone -b v2.6 https://github.com/Microsoft/nni.git
*
Run the MNIST example.
.. code-block:: bash
nnictl create --config nni/examples/trials/mnist-pytorch/config.yml
*
Wait for the message ``INFO: Successfully started experiment!`` in the command line. This message indicates that your experiment has been successfully started. You can explore the experiment using the ``Web UI url``.
.. code-block:: text
INFO: Starting restful server...
INFO: Successfully started Restful server!
INFO: Setting local config...
INFO: Successfully set local config!
INFO: Starting experiment...
INFO: Successfully started experiment!
-----------------------------------------------------------------------
The experiment id is egchD4qy
The Web UI urls are: http://223.255.255.1:8080 http://127.0.0.1:8080
-----------------------------------------------------------------------
You can use these commands to get more information about the experiment
-----------------------------------------------------------------------
commands description
1. nnictl experiment show show the information of experiments
2. nnictl trial ls list all of trial jobs
3. nnictl top monitor the status of running experiments
4. nnictl log stderr show stderr log content
5. nnictl log stdout show stdout log content
6. nnictl stop stop an experiment
7. nnictl trial kill kill a trial job by id
8. nnictl --help get help information about nnictl
-----------------------------------------------------------------------
* Open the ``Web UI url`` in your browser; you can view detailed information about the experiment and all the submitted trial jobs, as shown below. `Here <../Tutorial/WebUI.rst>`__ are more Web UI pages.
.. image:: ../../img/webui_overview_page.png
:target: ../../img/webui_overview_page.png
:alt: overview
.. image:: ../../img/webui_trialdetail_page.png
:target: ../../img/webui_trialdetail_page.png
:alt: detail
System requirements
-------------------
Due to potential programming changes, the minimum system requirements of NNI may change over time.
Linux
^^^^^
.. list-table::
:header-rows: 1
:widths: auto
* -
- Recommended
- Minimum
* - **Operating System**
- Ubuntu 16.04 or above
-
* - **CPU**
- Intel® Core™ i5 or AMD Phenom™ II X3 or better
- Intel® Core™ i3 or AMD Phenom™ X3 8650
* - **GPU**
- NVIDIA® GeForce® GTX 660 or better
- NVIDIA® GeForce® GTX 460
* - **Memory**
- 6 GB RAM
- 4 GB RAM
* - **Storage**
- 30 GB available hard drive space
-
* - **Internet**
- Broadband internet connection
-
* - **Resolution**
- 1024 x 768 minimum display resolution
-
macOS
^^^^^
.. list-table::
:header-rows: 1
:widths: auto
* -
- Recommended
- Minimum
* - **Operating System**
- macOS 10.14.1 or above
-
* - **CPU**
- Intel® Core™ i7-4770 or better
- Intel® Core™ i5-760 or better
* - **GPU**
- AMD Radeon™ R9 M395X or better
- NVIDIA® GeForce® GT 750M or AMD Radeon™ R9 M290 or better
* - **Memory**
- 8 GB RAM
- 4 GB RAM
* - **Storage**
- 70 GB available space on an SSD
- 70 GB available space on a 7200 RPM HDD
* - **Internet**
- Boardband internet connection
-
* - **Resolution**
- 1024 x 768 minimum display resolution
-
Further reading
---------------
* `Overview <../Overview.rst>`__
* `Use command line tool nnictl <Nnictl.rst>`__
* `Use NNIBoard <WebUI.rst>`__
* `Define search space <SearchSpaceSpec.rst>`__
* `Config an experiment <ExperimentConfig.rst>`__
* `How to run an experiment on local (with multiple GPUs)? <../TrainingService/LocalMode.rst>`__
* `How to run an experiment on multiple machines? <../TrainingService/RemoteMachineMode.rst>`__
* `How to run an experiment on OpenPAI? <../TrainingService/PaiMode.rst>`__
* `How to run an experiment on Kubernetes through Kubeflow? <../TrainingService/KubeflowMode.rst>`__
* `How to run an experiment on Kubernetes through FrameworkController? <../TrainingService/FrameworkControllerMode.rst>`__
* `How to run an experiment on Kubernetes through AdaptDL? <../TrainingService/AdaptDLMode.rst>`__
Install on Windows
==================
Prerequisites
-------------
*
Python 3.6 (or above) 64-bit. `Anaconda <https://www.anaconda.com/products/individual>`__ or `Miniconda <https://docs.conda.io/en/latest/miniconda.html>`__ is highly recommended to manage multiple Python environments on Windows.
*
If it's a newly installed Python environment, you need to install `Microsoft C++ Build Tools <https://visualstudio.microsoft.com/visual-cpp-build-tools/>`__ to support building NNI dependencies such as ``scikit-learn``.
.. code-block:: bat
pip install cython wheel
*
git, for verifying the installation.
Install NNI
-----------
In most cases, you can install and upgrade NNI from the pip package, which is easy and fast.
If you are interested in a specific or the latest code version, you can install NNI from source code.
If you want to contribute to NNI, refer to `setup development environment <SetupNniDeveloperEnvironment.rst>`__.
*
From pip package
.. code-block:: bat
python -m pip install --upgrade nni
*
From source code
.. code-block:: bat
git clone -b v2.6 https://github.com/Microsoft/nni.git
cd nni
python -m pip install -U -r dependencies/setup.txt
python -m pip install -r dependencies/develop.txt
python setup.py develop
Verify installation
-------------------
*
Clone the examples from the source code.
.. code-block:: bat
git clone -b v2.6 https://github.com/Microsoft/nni.git
*
Run the MNIST example.
.. code-block:: bat
nnictl create --config nni\examples\trials\mnist-pytorch\config_windows.yml
Note: If you are familiar with other frameworks, you can choose the corresponding example under ``examples\trials``. You need to change the trial command from ``python3`` to ``python`` in each example YAML, since the default installation provides a ``python.exe`` executable, not ``python3.exe``.
*
Wait for the message ``INFO: Successfully started experiment!`` in the command line. This message indicates that your experiment has been successfully started. You can explore the experiment using the ``Web UI url``.
.. code-block:: text
INFO: Starting restful server...
INFO: Successfully started Restful server!
INFO: Setting local config...
INFO: Successfully set local config!
INFO: Starting experiment...
INFO: Successfully started experiment!
-----------------------------------------------------------------------
The experiment id is egchD4qy
The Web UI urls are: http://223.255.255.1:8080 http://127.0.0.1:8080
-----------------------------------------------------------------------
You can use these commands to get more information about the experiment
-----------------------------------------------------------------------
commands description
1. nnictl experiment show show the information of experiments
2. nnictl trial ls list all of trial jobs
3. nnictl top monitor the status of running experiments
4. nnictl log stderr show stderr log content
5. nnictl log stdout show stdout log content
6. nnictl stop stop an experiment
7. nnictl trial kill kill a trial job by id
8. nnictl --help get help information about nnictl
-----------------------------------------------------------------------
* Open the ``Web UI url`` in your browser; you can view detailed information about the experiment and all the submitted trial jobs, as shown below. `Here <../Tutorial/WebUI.rst>`__ are more Web UI pages.
.. image:: ../../img/webui_overview_page.png
:target: ../../img/webui_overview_page.png
:alt: overview
.. image:: ../../img/webui_trialdetail_page.png
:target: ../../img/webui_trialdetail_page.png
:alt: detail
System requirements
-------------------
Below are the minimum system requirements for NNI on Windows; Windows 10 1809 is well tested and recommended. Due to potential programming changes, the minimum system requirements for NNI may change over time.
.. list-table::
:header-rows: 1
:widths: auto
* -
- Recommended
- Minimum
* - **Operating System**
- Windows 10 1809 or above
-
* - **CPU**
- Intel® Core™ i5 or AMD Phenom™ II X3 or better
- Intel® Core™ i3 or AMD Phenom™ X3 8650
* - **GPU**
- NVIDIA® GeForce® GTX 660 or better
- NVIDIA® GeForce® GTX 460
* - **Memory**
- 6 GB RAM
- 4 GB RAM
* - **Storage**
- 30 GB available hard drive space
-
* - **Internet**
- Broadband internet connection
-
* - **Resolution**
- 1024 x 768 minimum display resolution
-
FAQ
---
simplejson failed when installing NNI
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Make sure a C++ 14.0 compiler is installed.
..
building 'simplejson._speedups' extension error: [WinError 3] The system cannot find the path specified
Trial failed with missing DLL in command line or PowerShell
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This error is caused by missing LIBIFCOREMD.DLL and LIBMMD.DLL, together with a failure to install SciPy. Using Anaconda or Miniconda with 64-bit Python can solve it.
..
ImportError: DLL load failed
Trial failed on webUI
^^^^^^^^^^^^^^^^^^^^^
Please check the trial log file ``stderr`` for more details.
If a stderr file exists, check its content. Two possible causes are:
* forgetting to change the trial command ``python3`` to ``python`` in each experiment YAML.
* forgetting to install experiment dependencies such as TensorFlow, Keras and so on.
Fail to use BOHB on Windows
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Make sure a C++ 14.0 compiler is installed, then run ``pip install nni[BOHB]`` to install the dependencies.
Unsupported tuners on Windows
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SMAC is not supported currently; for the specific reason refer to this `GitHub issue <https://github.com/automl/SMAC3/issues/483>`__.
Use Windows as a remote worker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Refer to `Remote Machine mode <../TrainingService/RemoteMachineMode.rst>`__.
Segmentation fault (core dumped) when installing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Refer to `FAQ <FAQ.rst>`__.
Further reading
---------------
* `Overview <../Overview.rst>`__
* `Use command line tool nnictl <Nnictl.rst>`__
* `Use NNIBoard <WebUI.rst>`__
* `Define search space <SearchSpaceSpec.rst>`__
* `Config an experiment <ExperimentConfig.rst>`__
* `How to run an experiment on local (with multiple GPUs)? <../TrainingService/LocalMode.rst>`__
* `How to run an experiment on multiple machines? <../TrainingService/RemoteMachineMode.rst>`__
* `How to run an experiment on OpenPAI? <../TrainingService/PaiMode.rst>`__
* `How to run an experiment on Kubernetes through Kubeflow? <../TrainingService/KubeflowMode.rst>`__
* `How to run an experiment on Kubernetes through FrameworkController? <../TrainingService/FrameworkControllerMode.rst>`__
<p align="center">
<img src=".././img/release-1-title-1.png" width="100%" />
</p>
From September 2018 to September 2019, we are still moving on…
**Great news!**&nbsp;&nbsp;With the tags of **Scalability** and **Ease of Use**, NNI v1.0 is coming. Based on the various types of [Tuning Algorithms](./Tuner/BuiltinTuner.md), NNI supports hyperparameter tuning, neural architecture search, and automatic feature engineering, which is exciting news for algorithm engineers. Besides these, NNI v1.0 has made many improvements in the optimization of tuning algorithms, [WebUI simplicity and intuitiveness](./Tutorial/WebUI.md), and [platform diversification](./TrainingService/SupportTrainingService.md). NNI has grown into a more intelligent automated machine learning (AutoML) toolkit.
<br/>
<br/>
<br/>
<p align="center">
<img src=".././img/nni-1.png" width="80%" />
</p>
<br />
<br />
<p align="center">
<img src=".././img/release-1-title-2.png" width="100%" />
</p>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Step one**: Start with the [Tutorial Doc](./Tutorial/Installation.md), and install NNI v1.0 first.<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Step two**: Find a "Hello world" example, follow the [Tutorial Doc](./Tutorial/QuickStart.md), and have a quick start. <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Step three**: Get familiar with the [WebUI Tutorial](./Tutorial/WebUI.md) and let NNI better assist with your tuning tour.<br>
This fully automated tool greatly improves the efficiency of the tuning process. For more details about the 1.0 updates, you can refer to [Release 1.0](https://github.com/microsoft/nni/releases). For more of our upcoming plans, you can refer to our [Roadmap](https://github.com/microsoft/nni/wiki/Roadmap). Besides, we welcome more and more contributors to join us; there are many ways to participate. Please refer to [How to contribute](./Tutorial/Contributing.md) for more details.
.. role:: raw-html(raw)
:format: html
Framework and Library Supports
==============================
With its built-in Python API, NNI naturally supports hyperparameter tuning and neural architecture search for all AI frameworks and libraries that support Python models (``version >= 3.6``). NNI also provides a set of examples and tutorials for some popular scenarios to make getting started easier.
Supported AI Frameworks
-----------------------
* `PyTorch <https://github.com/pytorch/pytorch>`__
* :githublink:`MNIST-pytorch <examples/trials/mnist-distributed-pytorch>`
* `CIFAR-10 <./TrialExample/Cifar10Examples.rst>`__
* :githublink:`TGS salt identification challenge <examples/trials/kaggle-tgs-salt/README.md>`
* :githublink:`Network_morphism <examples/trials/network_morphism/README.md>`
* `TensorFlow <https://github.com/tensorflow/tensorflow>`__
* :githublink:`MNIST-tensorflow <examples/trials/mnist-distributed>`
* :githublink:`Squad <examples/trials/ga_squad/README.md>`
* `Keras <https://github.com/keras-team/keras>`__
* :githublink:`MNIST-keras <examples/trials/mnist-keras>`
* :githublink:`Network_morphism <examples/trials/network_morphism/README.md>`
* `MXNet <https://github.com/apache/incubator-mxnet>`__
* `Caffe2 <https://github.com/BVLC/caffe>`__
* `CNTK (Python language) <https://github.com/microsoft/CNTK>`__
* `Spark MLlib <http://spark.apache.org/mllib/>`__
* `Chainer <https://chainer.org/>`__
* `Theano <https://pypi.org/project/Theano/>`__
You are encouraged to `contribute more examples <Tutorial/Contributing.rst>`__ for other NNI users.
Supported Library
-----------------
NNI also supports all libraries written in Python. Here are some common libraries, including several GBDT-based algorithms: XGBoost, CatBoost, and LightGBM.
* `Scikit-learn <https://scikit-learn.org/stable/>`__
* `Scikit-learn <TrialExample/SklearnExamples.rst>`__
* `XGBoost <https://xgboost.readthedocs.io/en/latest/>`__
* `CatBoost <https://catboost.ai/>`__
* `LightGBM <https://lightgbm.readthedocs.io/en/latest/>`__
* `Auto-gbdt <TrialExample/GbdtExample.rst>`__
This is just a small list of the libraries supported by NNI. If you are interested in NNI, you can refer to the `tutorial <TrialExample/Trials.rst>`__ to complete your own hacks.
In addition to the above examples, we welcome more and more users to apply NNI to their own work; if you have any questions, please refer to `Write a Trial Run on NNI <TrialExample/Trials.rst>`__. In particular, if you want to become a contributor to NNI, whether by sharing examples, writing tuners, or otherwise, we look forward to your participation. For more information, please refer to `here <Tutorial/Contributing.rst>`__.
**Run an Experiment on Azure Machine Learning**
===================================================
NNI supports running an experiment on `AML <https://azure.microsoft.com/en-us/services/machine-learning/>`__, called aml mode.
Setup environment
-----------------
Step 1. Install NNI; follow the install guide `here <../Tutorial/QuickStart.rst>`__.
Step 2. Create an Azure account/subscription using this `link <https://azure.microsoft.com/en-us/free/services/machine-learning/>`__. If you already have an Azure account/subscription, skip this step.
Step 3. Install the Azure CLI on your machine, follow the install guide `here <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__.
Step 4. Authenticate to your Azure subscription from the CLI. To authenticate interactively, open a command line or terminal and use the following command:
.. code-block:: bash
az login
Step 5. Log in to your Azure account with a web browser and create a Machine Learning resource. You will need to choose a resource group and specify a workspace name. Then download ``config.json``, which will be used later.
.. image:: ../../img/aml_workspace.png
:target: ../../img/aml_workspace.png
:alt:
Step 6. Create an AML cluster as the computeTarget.
.. image:: ../../img/aml_cluster.png
:target: ../../img/aml_cluster.png
:alt:
Step 7. Open a command line and install the AML package environment:
.. code-block:: bash
python3 -m pip install azureml
python3 -m pip install azureml-sdk
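You can optionally confirm that the SDK imports correctly before proceeding; a minimal check, assuming ``azureml.core.VERSION`` is exposed by the installed SDK:
.. code-block:: bash
python3 -c "import azureml.core; print(azureml.core.VERSION)"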
Run an experiment
-----------------
Use ``examples/trials/mnist-pytorch`` as an example. The NNI config YAML file's content is like:
.. code-block:: yaml
searchSpaceFile: search_space.json
trialCommand: python3 mnist.py
trialConcurrency: 1
maxTrialNumber: 10
tuner:
name: TPE
classArgs:
optimize_mode: maximize
trainingService:
platform: aml
dockerImage: msranni/nni
subscriptionId: ${your subscription ID}
resourceGroup: ${your resource group}
workspaceName: ${your workspace name}
computeTarget: ${your compute target}
Note: You should set ``platform: aml`` in the NNI config YAML file if you want to start an experiment in aml mode.
Compared with `LocalMode <LocalMode.rst>`__, the training service configuration in aml mode has these additional keys:
* dockerImage
* Required key. The docker image name used in the job. NNI provides the image ``msranni/nni`` for running aml jobs.
.. Note:: This image is built on a CUDA environment and may not be suitable for CPU clusters in AML.
amlConfig:
* subscriptionId
* Required key. The subscription ID of your Azure account.
* resourceGroup
* Required key. The resource group of your account.
* workspaceName
* Required key. The workspace name of your account.
* computeTarget
* Required key. The compute cluster name you want to use in your AML workspace (see Step 6 and `this reference <https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target>`__).
* maxTrialNumberPerGpu
* Optional key, default 1. Used to specify the maximum number of concurrent trials on a GPU device.
* useActiveGpu
* Optional key, default false. Used to specify whether to use a GPU when another process is already running on it. By default, NNI will use a GPU only if there is no other active process on it.
The required information for ``amlConfig`` can be found in the ``config.json`` downloaded in Step 5.
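For example, assuming the ``jq`` tool is available, you can read those values from the downloaded file (the field names follow the standard AML ``config.json`` layout):
.. code-block:: bash
# print the subscription, resource group, and workspace recorded in config.json
jq '.subscription_id, .resource_group, .workspace_name' config.json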
Run the following commands to start the example experiment:
.. code-block:: bash
git clone -b ${NNI_VERSION} https://github.com/microsoft/nni
cd nni/examples/trials/mnist-pytorch
# modify config_aml.yml ...
nnictl create --config config_aml.yml
Replace ``${NNI_VERSION}`` with a released version name or branch name, e.g., ``v2.4``.
Monitor your code in the cloud by using the studio
--------------------------------------------------
To monitor your job, visit the studio you created in Step 5. Once the job completes, go to the Outputs + logs tab. There you can see a ``70_driver_log.txt`` file; this file contains the standard output from a run and can be useful when you're debugging remote runs in the cloud. Learn more about AML `here <https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-1st-experiment-hello-world>`__.
Run an Experiment on FrameworkController
========================================
NNI supports running experiments using `FrameworkController <https://github.com/Microsoft/frameworkcontroller>`__, called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes, so you don't need to install Kubeflow or a framework-specific operator such as tf-operator or pytorch-operator. You can use FrameworkController as the training service to run NNI experiments.
Prerequisite for on-premises Kubernetes Service
-----------------------------------------------
#. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this `guideline <https://kubernetes.io/docs/setup/>`__ to set up Kubernetes
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use ``$(HOME)/.kube/config`` as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. If your NNI trial job needs GPU resource, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure **Nvidia device plugin for Kubernetes**.
#. Prepare an **NFS server** and export a general purpose mount (we recommend mapping your NFS server path with the ``root_squash`` option; otherwise permission issues may arise when NNI copies files to NFS. Refer to this `page <https://linux.die.net/man/5/exports>`__ to learn what the root_squash option is), or **Azure File Storage**. A minimal export sketch is shown after this list.
#. Install the **NFS client** on the machine where you install NNI and run nnictl to create the experiment. Run this command to install the NFSv4 client:
.. code-block:: bash
apt-get install nfs-common
7. Install **NNI**\ , following the install guide `here <../Tutorial/QuickStart.rst>`__.
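As referenced in step 4, a minimal sketch of exporting an NFS path on the server follows; the path and subnet are placeholders, and whether to disable root squashing with ``no_root_squash`` depends on your security requirements:
.. code-block:: bash
# on the NFS server: append an export entry and reload the export table
echo '/var/nfs/nni 10.0.0.0/24(rw,sync,no_root_squash)' | sudo tee -a /etc/exports
sudo exportfs -ra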
Prerequisite for Azure Kubernetes Service
-----------------------------------------
#. NNI supports FrameworkController based on Azure Kubernetes Service; follow the `guideline <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__ to set up Azure Kubernetes Service.
#. Install the `Azure CLI <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__ and **kubectl**. Use ``az login`` to set your Azure account and connect the kubectl client to AKS; refer to this `guideline <https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster>`__.
#. Follow the `guideline <https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal>`__ to create an Azure file storage account. If you use Azure Kubernetes Service, NNI needs an Azure Storage Service to store code files and output files.
#. To access the Azure storage service, NNI needs the access key of the storage account, and NNI uses the `Azure Key Vault <https://azure.microsoft.com/en-us/services/key-vault/>`__ service to protect your private key. Set up Azure Key Vault and add a secret to the Key Vault to store the access key of the Azure storage account. Follow this `guideline <https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli>`__ to store the access key.
Prerequisite for PVC storage mode
-----------------------------------------
In order to use persistent volume claims instead of NFS or Azure storage, the related storage must be created manually in the namespace your trials will later run in. This restriction exists because persistent volume claims are hard to recycle and can thus quickly clutter a cluster's storage management. Persistent volume claims can be created, for example, with kubectl; please refer to the official Kubernetes documentation for `further information <https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims>`__. A minimal sketch is shown below.
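In this sketch the claim name, namespace, and size are placeholders to adapt; omitting ``storageClassName`` falls back to the cluster's default storage class:
.. code-block:: bash
# create a persistent volume claim in the namespace the trials will use
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nni-pvc
  namespace: twin-pipelines
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF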
Setup FrameworkController
-------------------------
Follow the `guideline <https://github.com/Microsoft/frameworkcontroller/tree/master/example/run>`__ to set up FrameworkController in the Kubernetes cluster; NNI supports FrameworkController in stateful set mode (`reference <https://github.com/Microsoft/frameworkcontroller/tree/master/example/run#run-by-kubernetes-statefulset>`__).
If the k8s cluster enforces authorization, you need to create a ServiceAccount with granted permissions for FrameworkController (`reference <https://github.com/microsoft/frameworkcontroller/tree/master/example/run#prerequisite>`__), and then pass the name of the service account to the NNI experiment config; a sketch is shown below.
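A minimal sketch of creating such a service account with kubectl; the names are placeholders, and the ``cluster-admin`` binding is used only for brevity (a narrower role is advisable in production):
.. code-block:: bash
kubectl create serviceaccount frameworkcontroller --namespace default
# grant the account permissions via a cluster role binding
kubectl create clusterrolebinding frameworkcontroller --clusterrole=cluster-admin --serviceaccount=default:frameworkcontroller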
Design
------
Please refer to the design of the `Kubeflow training service <KubeflowMode.rst>`__; the FrameworkController training service pipeline is similar.
Example
-------
The FrameworkController config file format is:
.. code-block:: yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 10h
maxTrialNum: 100
#choice: local, remote, pai, kubeflow, frameworkcontroller
trainingServicePlatform: frameworkcontroller
searchSpacePath: ~/nni/examples/trials/mnist-tfv1/search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
assessor:
builtinAssessorName: Medianstop
classArgs:
optimize_mode: maximize
trial:
codeDir: ~/nni/examples/trials/mnist-tfv1
taskRoles:
- name: worker
taskNum: 1
command: python3 mnist.py
gpuNum: 1
cpuNum: 1
memoryMB: 8192
image: msranni/nni:latest
frameworkAttemptCompletionPolicy:
minFailedTaskCount: 1
minSucceededTaskCount: 1
frameworkcontrollerConfig:
storage: nfs
nfs:
server: {your_nfs_server}
path: {your_nfs_server_exported_path}
If you use Azure Kubernetes Service, you should set ``frameworkcontrollerConfig`` in your config YAML file as follows:
.. code-block:: yaml
frameworkcontrollerConfig:
storage: azureStorage
serviceAccountName: {your_frameworkcontroller_service_account_name}
keyVault:
vaultName: {your_vault_name}
name: {your_secret_name}
azureStorage:
accountName: {your_storage_account_name}
azureShare: {your_azure_share_name}
If you have set up a `ServiceAccount <https://github.com/microsoft/frameworkcontroller/tree/master/example/run#prerequisite>`__ in your k8s cluster, please set ``serviceAccountName`` in your config file, for example:
.. code-block:: yaml
frameworkcontrollerConfig:
serviceAccountName: frameworkcontroller
Note: You should explicitly set ``trainingServicePlatform: frameworkcontroller`` in the NNI config YAML file if you want to start an experiment in frameworkcontroller mode.
The trial config format for NNI frameworkcontroller mode is a simplified version of FrameworkController's official config; you can refer to the `TensorFlow example of FrameworkController <https://github.com/microsoft/frameworkcontroller/blob/master/example/framework/scenario/tensorflow/ps/cpu/tensorflowdistributedtrainingwithcpu.yaml>`__ for a deeper understanding.
Trial configuration in frameworkcontroller mode has the following configuration keys:
* taskRoles: you can set multiple task roles in the config file; each task role is a basic unit of processing in the Kubernetes cluster.
* name: the name of the task role, like "worker", "ps", or "master".
* taskNum: the replica number of the task role.
* command: the user command to be run in the container.
* gpuNum: the number of GPU devices used by the container.
* cpuNum: the number of CPU devices used by the container.
* memoryMB: the memory limitation to be specified for the container.
* image: the docker image used to create the pod and run the program.
* frameworkAttemptCompletionPolicy: the policy for running the framework; please refer to the `user-manual <https://github.com/Microsoft/frameworkcontroller/blob/master/doc/user-manual.md#frameworkattemptcompletionpolicy>`__ for the specifics. You can use the policy to control the pods; for example, if the worker stops but ps does not, the completion policy can help stop ps.
NNI also offers the possibility to include a customized FrameworkController template similar
to the aforementioned TensorFlow example. A valid configuration may look like:
.. code-block:: yaml
experimentName: example_mnist_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 2
logLevel: trace
trainingServicePlatform: frameworkcontroller
searchSpacePath: search_space.json
tuner:
builtinTunerName: TPE
classArgs:
optimize_mode: maximize
assessor:
builtinAssessorName: Medianstop
classArgs:
optimize_mode: maximize
trial:
codeDir: .
frameworkcontrollerConfig:
configPath: fc_template.yml
storage: pvc
namespace: twin-pipelines
pvc:
path: /mnt/data
Note that in this example a persistent volume claim is used, which must be created manually in the specified namespace beforehand. Refer to the mnist-pytorch example (:githublink:`examples/trials/mnist-pytorch <examples/trials/mnist-pytorch>`) for a more detailed config (:githublink:`config_frameworkcontroller_custom.yml <examples/trials/mnist-pytorch/config_frameworkcontroller_custom.yml>`) and FrameworkController template (:githublink:`fc_template.yml <examples/trials/fc_template.yml>`).
How to run example
------------------
After you have prepared a config file, you can run your experiment with nnictl. The way to start an experiment on FrameworkController is similar to Kubeflow; please refer to the `document <KubeflowMode.rst>`__ for more information.
Version check
-------------
NNI has supported the version check feature since version 0.6; see `PaiMode <PaiMode.rst>`__ for details.
FrameworkController reuse mode
------------------------------
NNI supports setting reuse mode for trial jobs. In reuse mode, NNI submits a long-running trial runner process to occupy the container and starts trial jobs as subprocesses of the trial runner; this means k8s does not need to schedule a new container for each trial, it simply reuses the old one.
Currently, frameworkcontroller reuse mode only supports the V2 config.
Here is the example:
.. code-block:: yaml
searchSpaceFile: search_space.json
trialCommand: python3 mnist.py
trialGpuNumber: 0
trialConcurrency: 4
maxTrialNumber: 20
tuner:
name: TPE
classArgs:
optimize_mode: maximize
trainingService:
reuseMode: true
platform: frameworkcontroller
taskRoles:
- name:
dockerImage: 'msranni/nni:latest'
taskNumber: 1
command:
gpuNumber:
cpuNumber:
memorySize:
frameworkAttemptCompletionPolicy:
minFailedTaskCount: 1
minSucceededTaskCount: 1
storage:
storageType: azureStorage
azureAccount: {your_account}
azureShare: {your_share}
keyVaultName: {your_vault_name}
keyVaultKey: {your_vault_key}
**Run an Experiment on Hybrid Mode**
===========================================
Running NNI in hybrid mode means that NNI will run trial jobs on multiple kinds of training platforms. For example, NNI could submit trial jobs to a remote machine and AML simultaneously.
Setup environment
-----------------
NNI supports `local <./LocalMode.rst>`__, `remote <./RemoteMachineMode.rst>`__, `PAI <./PaiMode.rst>`__, `AML <./AMLMode.rst>`__, `Kubeflow <./KubeflowMode.rst>`__, and `FrameworkController <./FrameworkControllerMode.rst>`__ for the hybrid training service. Before starting an experiment using these modes, users should set up the corresponding environment for each platform. More details about the environment setup can be found in the corresponding docs.
Run an experiment
-----------------
Use ``examples/trials/mnist-tfv1`` as an example. The NNI config YAML file's content is like:
.. code-block:: yaml
experimentName: MNIST
searchSpaceFile: search_space.json
trialCommand: python3 mnist.py
trialCodeDirectory: .
trialConcurrency: 2
trialGpuNumber: 0
maxExperimentDuration: 24h
maxTrialNumber: 100
tuner:
name: TPE
classArgs:
optimize_mode: maximize
trainingService:
- platform: remote
machineList:
- host: 127.0.0.1
user: bob
password: bob
- platform: local
To use hybrid training services, users should set the training service configurations as a list in the ``trainingService`` field.
Currently, hybrid mode supports the ``local``, ``remote``, ``pai``, ``aml``, ``kubeflow``, and ``frameworkcontroller`` training services.
Run an Experiment on Kubeflow
=============================
NNI supports running an experiment on `Kubeflow <https://github.com/kubeflow/kubeflow>`__, called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or `Azure Kubernetes Service (AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__, and a Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is set up to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, `here <https://kubernetes.io/docs/tutorials/kubernetes-basics/>`__ is a good start. In kubeflow mode, your trial program will run as a Kubeflow job in the Kubernetes cluster.
Prerequisite for on-premises Kubernetes Service
-----------------------------------------------
#. A **Kubernetes** cluster using Kubernetes 1.8 or later. Follow this `guideline <https://kubernetes.io/docs/setup/>`__ to set up Kubernetes
#. Download, set up, and deploy **Kubeflow** to your Kubernetes cluster. Follow this `guideline <https://www.kubeflow.org/docs/started/getting-started/>`__ to setup Kubeflow.
#. Prepare a **kubeconfig** file, which will be used by NNI to interact with your Kubernetes API server. By default, NNI manager will use ``$(HOME)/.kube/config`` as the kubeconfig file's path. You can also specify other kubeconfig files by setting the **KUBECONFIG** environment variable. Refer to this `guideline <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig>`__ to learn more about kubeconfig.
#. If your NNI trial job needs GPU resource, you should follow this `guideline <https://github.com/NVIDIA/k8s-device-plugin>`__ to configure **Nvidia device plugin for Kubernetes**.
#. Prepare an **NFS server** and export a general purpose mount (we recommend mapping your NFS server path with the ``root_squash`` option; otherwise permission issues may arise when NNI copies files to NFS. Refer to this `page <https://linux.die.net/man/5/exports>`__ to learn what the root_squash option is), or **Azure File Storage**.
#. Install the **NFS client** on the machine where you install NNI and run nnictl to create the experiment. Run this command to install the NFSv4 client:
.. code-block:: bash
apt-get install nfs-common
7. Install **NNI**\ , following the install guide `here <../Tutorial/QuickStart.rst>`__.
Prerequisite for Azure Kubernetes Service
-----------------------------------------
#. NNI supports Kubeflow based on Azure Kubernetes Service; follow the `guideline <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__ to set up Azure Kubernetes Service.
#. Install the `Azure CLI <https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest>`__ and **kubectl**. Use ``az login`` to set your Azure account and connect the kubectl client to AKS; refer to this `guideline <https://docs.microsoft.com/en-us/azure/aks/kubernetes-walkthrough#connect-to-the-cluster>`__.
#. Deploy Kubeflow on Azure Kubernetes Service, follow the `guideline <https://www.kubeflow.org/docs/started/getting-started/>`__.
#. Follow the `guideline <https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=portal>`__ to create an Azure file storage account. If you use Azure Kubernetes Service, NNI needs an Azure Storage Service to store code files and output files.
#. To access the Azure storage service, NNI needs the access key of the storage account, and NNI uses the `Azure Key Vault <https://azure.microsoft.com/en-us/services/key-vault/>`__ service to protect your private key. Set up Azure Key Vault and add a secret to the Key Vault to store the access key of the Azure storage account. Follow this `guideline <https://docs.microsoft.com/en-us/azure/key-vault/quick-create-cli>`__ to store the access key.
Design
------
.. image:: ../../img/kubeflow_training_design.png
:target: ../../img/kubeflow_training_design.png
:alt:
The Kubeflow training service instantiates a Kubernetes REST client to interact with your K8s cluster's API server.
For each trial, we upload all the files in your local codeDir path (configured in nni_config.yml), together with NNI-generated files like parameter.cfg, into a storage volume. Right now we support two kinds of storage volumes: `nfs <https://en.wikipedia.org/wiki/Network_File_System>`__ and `azure file storage <https://azure.microsoft.com/en-us/services/storage/files/>`__; you should configure the storage volume in the NNI config YAML file. After the files are prepared, the Kubeflow training service calls the K8S REST API to create Kubeflow jobs (a `tf-operator <https://github.com/kubeflow/tf-operator>`__ job or a `pytorch-operator <https://github.com/kubeflow/pytorch-operator>`__ job) in K8S and mounts your storage volume into the job's pod. Output files of the Kubeflow job, like stdout, stderr, trial.log, or model files, are also copied back to the storage volume. NNI shows the storage volume's URL for each trial in the WebUI so that users can browse the log files and the job's output files.
Supported operator
------------------
NNI only supports the tf-operator and pytorch-operator of Kubeflow; other operators have not been tested.
Users can set the operator type in the config file.
The setting of tf-operator:
.. code-block:: yaml
kubeflowConfig:
operator: tf-operator
The setting of pytorch-operator:
.. code-block:: yaml
kubeflowConfig:
operator: pytorch-operator
If you want to use the tf-operator, you can set ``ps`` and ``worker`` in the trial config. If you want to use the pytorch-operator, you can set ``master`` and ``worker`` in the trial config.
Supported storage type
----------------------
NNI supports NFS and Azure Storage to store the code and output files; users can set the storage type in the config file along with the corresponding settings.
The settings for NFS storage are as follows:
.. code-block:: yaml
kubeflowConfig:
storage: nfs
nfs:
# Your NFS server IP, like 10.10.10.10
server: {your_nfs_server_ip}
# Your NFS server export path, like /var/nfs/nni
path: {your_nfs_server_export_path}
If you use Azure storage, you should set ``kubeflowConfig`` in your config YAML file as follows:
.. code-block:: yaml
kubeflowConfig:
storage: azureStorage
keyVault:
vaultName: {your_vault_name}
name: {your_secret_name}
azureStorage:
accountName: {your_storage_account_name}
azureShare: {your_azure_share_name}
Run an experiment
-----------------
Use ``examples/trials/mnist-tfv1`` as an example. This is a TensorFlow job that uses the tf-operator of Kubeflow. The NNI config YAML file's content is like:
.. code-block:: yaml
authorName: default
experimentName: example_mnist
trialConcurrency: 2
maxExecDuration: 1h
maxTrialNum: 20
#choice: local, remote, pai, kubeflow
trainingServicePlatform: kubeflow
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
assessor:
builtinAssessorName: Medianstop
classArgs:
optimize_mode: maximize
trial:
codeDir: .
worker:
replicas: 2
command: python3 dist_mnist.py
gpuNum: 1
cpuNum: 1
memoryMB: 8196
image: msranni/nni:latest
ps:
replicas: 1
command: python3 dist_mnist.py
gpuNum: 0
cpuNum: 1
memoryMB: 8196
image: msranni/nni:latest
kubeflowConfig:
operator: tf-operator
apiVersion: v1alpha2
storage: nfs
nfs:
# Your NFS server IP, like 10.10.10.10
server: {your_nfs_server_ip}
# Your NFS server export path, like /var/nfs/nni
path: {your_nfs_server_export_path}
Note: You should explicitly set ``trainingServicePlatform: kubeflow`` in the NNI config YAML file if you want to start an experiment in kubeflow mode.
If you want to run PyTorch jobs, you can set your config file as follows:
.. code-block:: yaml
authorName: default
experimentName: example_mnist_distributed_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: kubeflow
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: minimize
trial:
codeDir: .
master:
replicas: 1
command: python3 dist_mnist.py
gpuNum: 1
cpuNum: 1
memoryMB: 2048
image: msranni/nni:latest
worker:
replicas: 1
command: python3 dist_mnist.py
gpuNum: 0
cpuNum: 1
memoryMB: 2048
image: msranni/nni:latest
kubeflowConfig:
operator: pytorch-operator
apiVersion: v1alpha2
nfs:
# Your NFS server IP, like 10.10.10.10
server: {your_nfs_server_ip}
# Your NFS server export path, like /var/nfs/nni
path: {your_nfs_server_export_path}
Trial configuration in kubeflow mode has the following configuration keys:
* codeDir
* code directory, where you put training code and config files
* worker (required). This config section is used to configure the TensorFlow worker role
* replicas
* Required key. Should be a positive number, depending on how many replicas you want to run for the TensorFlow worker role.
* command
* Required key. Command to launch your trial job, like ``python mnist.py``
* memoryMB
* Required key. Should be a positive number based on your trial program's memory requirements
* cpuNum
* gpuNum
* image
* Required key. In kubeflow mode, your trial program will be scheduled by Kubernetes to run in a `Pod <https://kubernetes.io/docs/concepts/workloads/pods/pod/>`__. This key is used to specify the Docker image used to create the pod in which your trial program will run.
* We already build a docker image :githublink:`msranni/nni <deployment/docker/Dockerfile>`. You can either use this image directly in your config file, or build your own image based on it.
* privateRegistryAuthPath
* Optional field. Specifies the path to a ``config.json`` file that holds an authorization token for a docker registry, used to pull images from a private registry. See `this reference <https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/>`__.
* apiVersion
* Required key. The API version of your Kubeflow.
* ps (optional). This config section is used to configure the TensorFlow parameter server role.
* master (optional). This config section is used to configure the PyTorch master role.
Once you have completed the NNI experiment config file and saved it (for example, as ``exp_kubeflow.yml``), run the following command
.. code-block:: bash
nnictl create --config exp_kubeflow.yml
to start the experiment in kubeflow mode. NNI will create a Kubeflow tfjob or pytorchjob for each trial; the job name format is something like ``nni_exp_{experiment_id}_trial_{trial_id}``.
You can see the Kubeflow tfjob created by NNI in your Kubernetes dashboard.
Notice: In kubeflow mode, NNIManager will start a REST server and listen on a port that is your NNI WebUI's port plus 1. For example, if your WebUI port is ``8080``, the REST server will listen on ``8081`` to receive metrics from trial jobs running in Kubernetes. So you should enable TCP port ``8081`` in your firewall rules to allow incoming traffic.
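For example, on an Ubuntu machine with ``ufw`` enabled, opening that port might look like the following sketch (adapt it to your firewall tooling):
.. code-block:: bash
sudo ufw allow 8081/tcp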
Once a trial job is completed, you can go to the NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.
Version check
-------------
NNI has supported the version check feature since version 0.6; see `PaiMode <PaiMode.rst>`__ for details.
If you run into any problems when using NNI in kubeflow mode, please create an issue on the `NNI GitHub repo <https://github.com/Microsoft/nni>`__.
Kubeflow reuse mode
----------------------
NNI supports setting reuse mode for trial jobs. In reuse mode, NNI submits a long-running trial runner process to occupy the container and starts trial jobs as subprocesses of the trial runner; this means k8s does not need to schedule a new container for each trial, it simply reuses the old one.
Currently, kubeflow reuse mode only supports the V2 config.
Here is the example:
.. code-block:: yaml
searchSpaceFile: search_space.json
trialCommand: python3 mnist.py
trialGpuNumber: 0
trialConcurrency: 4
maxTrialNumber: 20
tuner:
name: TPE
classArgs:
optimize_mode: maximize
trainingService:
reuseMode: true
platform: kubeflow
worker:
command: python3 mnist.py
code_directory: .
dockerImage: msranni/nni
cpuNumber: 1
gpuNumber: 0
memorySize: 8192
replicas: 1
operator: tf-operator
storage:
storageType: azureStorage
azureAccount: {your_account}
azureShare: {your_share}
keyVaultName: {your_vault_name}
keyVaultKey: {your_vault_key}
apiVersion: v1
Training Service
================
What is Training Service?
-------------------------
NNI training service is designed to allow users to focus on AutoML itself, agnostic to the underlying computing infrastructure where the trials are actually run. When migrating from one cluster to another (e.g., local machine to Kubeflow), users only need to tweak several configurations, and the experiment can be easily scaled.
Users can use training service provided by NNI, to run trial jobs on `local machine <./LocalMode.rst>`__\ , `remote machines <./RemoteMachineMode.rst>`__\ , and on clusters like `PAI <./PaiMode.rst>`__\ , `Kubeflow <./KubeflowMode.rst>`__\ , `AdaptDL <./AdaptDLMode.rst>`__\ , `FrameworkController <./FrameworkControllerMode.rst>`__\ , `DLTS <./DLTSMode.rst>`__, `AML <./AMLMode.rst>`__ and `DLC <./DLCMode.rst>`__. These are called *built-in training services*.
If the computing resource you want to use is not listed above, NNI provides an interface that allows users to build their own training service easily. Please refer to `how to implement training service <./HowToImplementTrainingService.rst>`__ for details.
How to use Training Service?
----------------------------
Training service needs to be chosen and configured properly in experiment configuration YAML file. Users could refer to the document of each training service for how to write the configuration. Also, `reference <../Tutorial/ExperimentConfig.rst>`__ provides more details on the specification of the experiment configuration file.
Next, users should prepare the code directory, which is specified as ``codeDir`` in the config file. Please note that in non-local mode, the code directory will be uploaded to the remote machine or cluster before the experiment starts. Therefore, we limit the number of files to 2000 and the total size to 300MB. If the code directory contains too many files, users can choose which files and subfolders to exclude by adding a ``.nniignore`` file that works like a ``.gitignore`` file. For more details on how to write this file, see :githublink:`this example <examples/trials/mnist-tfv1/.nniignore>` and the `git documentation <https://git-scm.com/docs/gitignore#_pattern_format>`__.
In case users intend to use large files in their experiment (like large-scale datasets) and they are not using local mode, they can either: 1) download the data before each trial launches, as part of the trial command or script; or 2) use a shared storage that is accessible to worker nodes. Usually, training platforms are equipped with shared storage, and NNI allows users to easily use them. Refer to the docs of each built-in training service for details.
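For option 1), the following is a minimal sketch of the download-at-trial-start pattern; the URL and file name here are placeholders, not a real dataset:

.. code-block:: python

   import os
   import urllib.request

   DATA_URL = 'https://example.com/my-dataset.tar.gz'   # placeholder URL
   DATA_FILE = 'my-dataset.tar.gz'

   # Download once into the trial's working directory; skip if already present.
   if not os.path.exists(DATA_FILE):
       urllib.request.urlretrieve(DATA_URL, DATA_FILE)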
Built-in Training Services
--------------------------
.. list-table::
:header-rows: 1
:widths: auto
* - TrainingService
- Brief Introduction
* - `Local <./LocalMode.rst>`__
- NNI supports running an experiment on the local machine, called local mode. In local mode, NNI runs the trial jobs and the nniManager process on the same machine, and supports GPU scheduling for trial jobs.
* - `Remote <./RemoteMachineMode.rst>`__
- NNI supports running an experiment on multiple machines through the SSH channel, called remote mode. NNI assumes that you have access to those machines and have already set up the environment for running deep learning training code. NNI will submit trial jobs to the remote machines and, if GPU requirements are specified, schedule them onto suitable machines with enough GPU resources.
* - `PAI <./PaiMode.rst>`__
- NNI supports running an experiment on `OpenPAI <https://github.com/Microsoft/pai>`__ (aka PAI), called PAI mode. Before starting to use NNI PAI mode, you should have an account to access an `OpenPAI <https://github.com/Microsoft/pai>`__ cluster. See `here <https://github.com/Microsoft/pai#how-to-deploy>`__ if you don't have an OpenPAI account and want to deploy an OpenPAI cluster. In PAI mode, your trial program will run in a PAI container created by Docker.
* - `Kubeflow <./KubeflowMode.rst>`__
- NNI supports running an experiment on `Kubeflow <https://github.com/kubeflow/kubeflow>`__\ , called kubeflow mode. Before starting to use NNI kubeflow mode, you should have a Kubernetes cluster, either on-premises or `Azure Kubernetes Service (AKS) <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__\ , and an Ubuntu machine on which `kubeconfig <https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/>`__ is set up to connect to your Kubernetes cluster. If you are not familiar with Kubernetes, `here <https://kubernetes.io/docs/tutorials/kubernetes-basics/>`__ is a good start. In kubeflow mode, your trial program will run as a Kubeflow job in your Kubernetes cluster.
* - `AdaptDL <./AdaptDLMode.rst>`__
- NNI supports running experiment on `AdaptDL <https://github.com/petuum/adaptdl>`__\ , called AdaptDL mode. Before starting to use AdaptDL mode, you should have a Kubernetes cluster.
* - `FrameworkController <./FrameworkControllerMode.rst>`__
- NNI supports running an experiment using `FrameworkController <https://github.com/Microsoft/frameworkcontroller>`__\ , called frameworkcontroller mode. FrameworkController is built to orchestrate all kinds of applications on Kubernetes; you don't need to install Kubeflow or a framework-specific operator like tf-operator or pytorch-operator. You can use FrameworkController directly as the training service to run NNI experiments.
* - `DLTS <./DLTSMode.rst>`__
- NNI supports running an experiment using `DLTS <https://github.com/microsoft/DLWorkspace.git>`__\ , an open source toolkit developed by Microsoft that allows AI scientists to spin up an AI cluster in a turn-key fashion.
* - `AML <./AMLMode.rst>`__
- NNI supports running an experiment on `AML <https://azure.microsoft.com/en-us/services/machine-learning/>`__ , called aml mode.
* - `DLC <./DLCMode.rst>`__
- NNI supports running an experiment on `PAI-DLC <https://help.aliyun.com/document_detail/165137.html>`__ , called dlc mode.
What does Training Service do?
------------------------------
.. raw:: html
<p align="center">
<img src="https://user-images.githubusercontent.com/23273522/51816536-ed055580-2301-11e9-8ad8-605a79ee1b9a.png" alt="drawing" width="700"/>
</p>
According to the architecture shown in `Overview <../Overview.rst>`__\ , the training service (platform) is responsible for three things: 1) initiating a new trial; 2) collecting metrics and communicating with the NNI core (NNI manager); 3) monitoring trial job status. To demonstrate in detail how a training service works, we walk through its workflow from the very beginning to the moment the first trial succeeds.
Step 1. **Validate config and prepare the training platform.** The training service first checks whether the training platform the user specifies is valid (e.g., whether there is anything wrong with authentication). After that, it prepares for the experiment by making the code directory (\ ``codeDir``\ ) accessible to the training platform.
.. Note:: Different training services handle ``codeDir`` differently. For example, the local training service runs trials directly in ``codeDir``. The remote training service packs ``codeDir`` into a zip and uploads it to each machine. K8S-based training services copy ``codeDir`` onto a shared storage, which is either provided by the training platform itself or configured by users in the config file.
Step 2. **Submit the first trial.** To initiate a trial, usually (in non-reuse mode), NNI copies a few more files (including parameters, a launch script, etc.) onto the training platform. After that, NNI launches the trial through a subprocess, SSH, a RESTful API, etc.
.. Warning:: The working directory of the trial command has exactly the same content as ``codeDir``, but can have a different path (even on a different machine). Local mode is the only training service that shares one ``codeDir`` across all trials. Other training services copy ``codeDir`` from the shared copy prepared in step 1, and each trial has an independent working directory. We strongly advise users not to rely on the shared behavior in local mode, as it will make your experiments difficult to scale to other training services.
Step 3. **Collect metrics.** NNI then monitors the status of the trial, updates its recorded status (e.g., from ``WAITING`` to ``RUNNING``\ , or ``RUNNING`` to ``SUCCEEDED``\ ), and collects the metrics. Currently, most training services are implemented in an "active" way, i.e., the training service calls a RESTful API on the NNI manager to update the metrics. Note that this usually requires the machine running the NNI manager to be accessible from the worker nodes.
Training Service Under Reuse Mode
---------------------------------
When reuse mode is enabled, a cluster, such as a remote machine or a compute instance on AML, launches a long-running environment, and NNI submits trials to this environment iteratively, which saves the time of creating new jobs. For instance, using the OpenPAI training platform under reuse mode avoids the overhead of repeatedly pulling docker images, creating containers, and downloading data.
In reuse mode, users need to make sure each trial can run independently in the same job (e.g., avoid loading checkpoints from previous trials).
.. note:: Currently, only the `Local <./LocalMode.rst>`__, `Remote <./RemoteMachineMode.rst>`__, `OpenPAI <./PaiMode.rst>`__, `AML <./AMLMode.rst>`__ and `DLC <./DLCMode.rst>`__ training services support reuse mode. For the Remote and OpenPAI training platforms, you can enable reuse mode manually as described `here <../reference/experiment_config.rst>`__. AML is implemented on top of reuse mode, so reuse mode is the default there and does not need to be enabled manually.
.. role:: raw-html(raw)
:format: html
**Run an Experiment on OpenPAI**
====================================
NNI supports running an experiment on `OpenPAI <https://github.com/Microsoft/pai>`__\ , called pai mode. Before starting to use NNI pai mode, you should have an account to access an `OpenPAI <https://github.com/Microsoft/pai>`__ cluster. See `here <https://github.com/Microsoft/pai#how-to-deploy>`__ if you don't have an OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program will run in a PAI container created by Docker.
Setup environment
-----------------
**Step 1. Install NNI following the install guide** `here <../Tutorial/QuickStart.rst>`__.
**Step 2. Get token.**
Open the web portal of OpenPAI, and click the ``My profile`` button in the top-right corner.
.. image:: ../../img/pai_profile.jpg
:scale: 80%
Click the ``copy`` button on the page to copy a JWT token.
.. image:: ../../img/pai_token.jpg
:scale: 67%
**Step 3. Mount NFS storage to local machine.**
Click the ``Submit job`` button in the web portal.
.. image:: ../../img/pai_job_submission_page.jpg
:scale: 50%
Find the data management region on the job submission page.
.. image:: ../../img/pai_data_management_page.jpg
:scale: 33%
``Preview container paths`` shows the NFS host and path that OpenPAI provides. You need to mount the corresponding host and path to your local machine first, so that NNI can use OpenPAI's NFS storage.\ :raw-html:`<br>`
For example, use the following command:
.. code-block:: bash
sudo mount -t nfs4 gcr-openpai-infra02:/pai/data /local/mnt
Then the ``/data`` folder in the container will be mounted to the ``/local/mnt`` folder on your local machine.\ :raw-html:`<br>`
You can then use the following configuration in your NNI config file:
.. code-block:: yaml
localStorageMountPoint: /local/mnt
**Step 4. Get OpenPAI's storage config name and containerStorageMountPoint**
The ``Team share storage`` field is the storage configuration used to specify storage values in OpenPAI. You can get the ``storageConfigName`` and ``containerStorageMountPoint`` fields from ``Team share storage``\ , for example:
.. code-block:: yaml
storageConfigName: confignfs-data
containerStorageMountPoint: /mnt/confignfs-data
Run an experiment
-----------------
Use ``examples/trials/mnist-pytorch`` as an example. The content of the NNI config YAML file is as follows:
.. code-block:: yaml
searchSpaceFile: search_space.json
trialCommand: python3 mnist.py
trialGpuNumber: 0
trialConcurrency: 1
maxTrialNumber: 10
tuner:
name: TPE
classArgs:
optimize_mode: maximize
trainingService:
platform: openpai
host: http://123.123.123.123
username: ${your user name}
token: ${your token}
dockerImage: msranni/nni
trialCpuNumber: 1
trialMemorySize: 8GB
storageConfigName: ${your storage config name}
localStorageMountPoint: ${NFS mount point on local machine}
containerStorageMountPoint: ${NFS mount point inside Docker container}
Note: You should set ``platform: openpai`` in the NNI config YAML file if you want to start an experiment in pai mode. The ``host`` field in the configuration file is OpenPAI's job submission page URI, like ``10.10.5.1``. The default protocol in NNI is HTTPS; if your OpenPAI cluster has disabled HTTPS, please use the URI in the ``http://10.10.5.1`` format.
OpenPAI configurations
^^^^^^^^^^^^^^^^^^^^^^
Compared with `LocalMode <LocalMode.rst>`__ and `RemoteMachineMode <RemoteMachineMode.rst>`__\ , the ``trainingService`` configuration in pai mode has the following additional keys:
*
username
Required key. User name of OpenPAI platform.
*
token
Required key. Authentication key of OpenPAI platform.
*
host
Required key. The host of the OpenPAI platform. It's OpenPAI's job submission page URI, like ``10.10.5.1``. The default protocol in NNI is HTTPS; if your OpenPAI cluster has disabled HTTPS, please use the URI in the ``http://10.10.5.1`` format.
*
trialCpuNumber
Optional key. Should be a positive number based on your trial program's CPU requirement. If it is not set in the trial configuration, it should be set in the config specified by the ``openpaiConfig`` or ``openpaiConfigFile`` field.
*
trialMemorySize
Optional key. Should be in a format like ``2gb``\ , based on your trial program's memory requirement. If it is not set in the trial configuration, it should be set in the config specified by the ``openpaiConfig`` or ``openpaiConfigFile`` field.
*
dockerImage
Optional key. In pai mode, your trial program will be scheduled by OpenPAI to run in a `Docker container <https://www.docker.com/>`__. This key is used to specify the Docker image used to create the container in which your trial will run.
We have already built a docker image :githublink:`msranni/nni <deployment/docker/Dockerfile>`. You can either use this image directly in your config file or build your own image based on it. If it is not set in the trial configuration, it should be set in the config specified by the ``openpaiConfig`` or ``openpaiConfigFile`` field.
*
virtualCluster
Optional key. Set the virtualCluster of OpenPAI. If omitted, the job will run on the default virtual cluster.
*
localStorageMountPoint
Required key. Set the mount path on the machine where you run nnictl.
*
containerStorageMountPoint
Required key. Set the mount path in the container used in OpenPAI.
*
storageConfigName
Optional key. Set the storage name used in OpenPAI. If it is not set in the trial configuration, it should be set in the config specified by the ``openpaiConfig`` or ``openpaiConfigFile`` field.
*
openpaiConfigFile
Optional key. Set the file path of the OpenPAI job configuration; the file is in YAML format.
If users set ``openpaiConfigFile`` in NNI's configuration file, there is no need to specify the fields ``storageConfigName``, ``virtualCluster``, ``dockerImage``, ``trialCpuNumber``, ``trialGpuNumber``, and ``trialMemorySize``. These fields will take their values from the config file specified by ``openpaiConfigFile``.
*
openpaiConfig
Optional key. Similar to ``openpaiConfigFile``, but instead of referencing an external file, this field embeds the content directly into NNI's config YAML.
Note:
#.
The job name in OpenPAI's configuration file will be replaced by a new job name created by NNI, in the format ``nni_exp_{this.experimentId}_trial_{trialJobId}``.
#.
If users set multiple taskRoles in OpenPAI's configuration file, NNI will wrap all of these taskRoles and start multiple tasks in one trial job. Users should ensure that only one taskRole reports metrics to NNI, otherwise conflict errors may occur.
Once the NNI experiment config file is complete and saved (for example, as exp_pai.yml), run the following command
.. code-block:: bash
nnictl create --config exp_pai.yml
to start the experiment in pai mode. NNI will create an OpenPAI job for each trial, and the job name has the format ``nni_exp_{experiment_id}_trial_{trial_id}``.
You can see jobs created by NNI in the OpenPAI cluster's web portal, like:
.. image:: ../../img/nni_pai_joblist.jpg
:target: ../../img/nni_pai_joblist.jpg
:alt:
Notice: In pai mode, NNIManager will start a REST server listening on a port that is your NNI WebUI's port plus 1. For example, if your WebUI port is ``8080``\ , the REST server will listen on ``8081`` to receive metrics from trial jobs running on OpenPAI. You should therefore allow incoming traffic on TCP port ``8081`` in your firewall rules.
Once a trial job is completed, you can go to NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.
Expand a trial's information in the trial list view and click the logPath link, like:
.. image:: ../../img/nni_webui_joblist.png
:scale: 30%
You will be redirected to the HDFS web portal to browse the output files of that trial in HDFS:
.. image:: ../../img/nni_trial_hdfs_output.jpg
:scale: 80%
You can see there are three files in the output folder: stderr, stdout, and trial.log.
Data management
---------------
Before using NNI to start your experiment, users should set the corresponding mounted data path on the nniManager machine. OpenPAI has its own storage (NFS, AzureBlob, ...), and the storage used in OpenPAI is mounted into the container when it starts a job. Users should choose a storage in OpenPAI by setting the ``paiStorageConfigName`` field. Then users should mount that storage to their nniManager machine and set the ``nniManagerNFSMountPath`` field in the configuration file. NNI will generate bash files and copy the data in ``codeDir`` to the ``nniManagerNFSMountPath`` folder, then start a trial job. The data in ``nniManagerNFSMountPath`` will be synced to the OpenPAI storage and mounted into OpenPAI's container. The data path in the container is set by ``containerNFSMountPath``\ ; NNI enters this folder first and then runs the scripts to start a trial job.
Version check
-------------
NNI supports the version check feature since version 0.6. It is a policy to ensure that the version of NNIManager is consistent with that of trialKeeper, avoiding errors caused by version incompatibility.
Check policy:
#. NNIManager before v0.6 could run any version of trialKeeper; trialKeeper supports backward compatibility.
#. Since version 0.6, the NNIManager version should be the same as the trialKeeper version. For example, if the NNIManager version is 0.6, the trialKeeper version should be 0.6 too.
#. Note that the version check only checks the first two fields of the version. For example, NNIManager v0.6.1 could use trialKeeper v0.6 or trialKeeper v0.6.2, but could not use trialKeeper v0.5.1 or trialKeeper v0.7.
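As a toy illustration of the two-field rule in point 3 (this helper is not part of NNI; it is only a sketch):

.. code-block:: python

   def versions_compatible(a: str, b: str) -> bool:
       """Return True when the first two version fields match, e.g. 0.6.x and 0.6.y."""
       return a.split('.')[:2] == b.split('.')[:2]

   assert versions_compatible('0.6.1', '0.6.2')
   assert not versions_compatible('0.6.1', '0.7.0')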
If you cannot run your experiment and want to know whether it is caused by the version check, you can check your WebUI; there will be an error message about the version check.
.. image:: ../../img/webui-img/experimentError.png
:scale: 80%
Run an Experiment on Remote Machines
====================================
NNI can run one experiment on multiple remote machines through SSH, called ``remote`` mode. It's like a lightweight training platform. In this mode, NNI can be started from your computer, and dispatch trials to remote machines in parallel.
The supported OSes for remote machines are ``Linux``\ , ``Windows 10``\ , and ``Windows Server 2019``.
Requirements
------------
*
Make sure the default environment of the remote machines meets the requirements of your trial code. If it does not, a setup script can be added to the ``command`` field of the NNI config.
*
Make sure the remote machines can be accessed through SSH from the machine that runs the ``nnictl`` command. Both password and key authentication of SSH are supported. For advanced usage, please refer to the `machineList part of configuration <../Tutorial/ExperimentConfig.rst>`__.
*
Make sure the NNI version on each machine is consistent.
*
Make sure the trial command is compatible with the remote OSes if you want to use remote Linux and Windows machines together. For example, the default python 3.x executable is called ``python3`` on Linux and ``python`` on Windows.
Linux
^^^^^
* Follow `installation <../Tutorial/InstallationLinux.rst>`__ to install NNI on the remote machine.
Windows
^^^^^^^
*
Follow `installation <../Tutorial/InstallationWin.rst>`__ to install NNI on the remote machine.
*
Install and start ``OpenSSH Server``.
#.
Open ``Settings`` app on Windows.
#.
Click ``Apps``\ , then click ``Optional features``.
#.
Click ``Add a feature``\ , search and select ``OpenSSH Server``\ , and then click ``Install``.
#.
Once it's installed, run the commands below to start the service and set it to start automatically.
.. code-block:: bat
sc config sshd start= auto
net start sshd
*
Make sure the remote account is an administrator, so that it can stop running trials.
*
Make sure there is no welcome message beyond the default, since extra output causes the ssh2 library in Node.js to fail. For example, if you're using a Data Science VM on Azure, you need to remove the extra echo commands in ``C:\dsvm\tools\setup\welcome.bat``.
Output like the below is OK when opening a new command window.
.. code-block:: text
Microsoft Windows [Version 10.0.17763.1192]
(c) 2018 Microsoft Corporation. All rights reserved.
(py37_default) C:\Users\AzureUser>
Run an experiment
-----------------
For example, suppose there are three machines that can be logged into with a username and password.
.. list-table::
:header-rows: 1
:widths: auto
* - IP
- Username
- Password
* - 10.1.1.1
- bob
- bob123
* - 10.1.1.2
- bob
- bob123
* - 10.1.1.3
- bob
- bob123
Install and run NNI on one of those three machines or on another machine that has network access to them.
Use ``examples/trials/mnist-pytorch`` as the example. Below is the content of ``examples/trials/mnist-pytorch/config_remote.yml``\ :
.. code-block:: yaml
searchSpaceFile: search_space.json
trialCommand: python3 mnist.py
trialCodeDirectory: . # default value, can be omitted
trialGpuNumber: 0
trialConcurrency: 4
maxTrialNumber: 20
tuner:
name: TPE
classArgs:
optimize_mode: maximize
trainingService:
platform: remote
machineList:
- host: 192.0.2.1
user: alice
ssh_key_file: ~/.ssh/id_rsa
- host: 192.0.2.2
port: 10022
user: bob
password: bob123
pythonPath: /usr/bin
Files in ``trialCodeDirectory`` will be uploaded to the remote machines automatically. You can run the command below on Windows, Linux, or macOS to spawn trials on remote Linux machines:
.. code-block:: bash
nnictl create --config examples/trials/mnist-pytorch/config_remote.yml
Configure python environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
By default, commands and scripts are executed in the default environment of the remote machine. If there are multiple python virtual environments on your remote machine and you want to run experiments in a specific environment, use **pythonPath** to specify the python environment on your remote machine.
For example, with anaconda you can specify:
.. code-block:: yaml
pythonPath: /home/bob/.conda/envs/ENV-NAME/bin
CIFAR-10 examples
=================
Overview
--------
`CIFAR-10 <https://www.cs.toronto.edu/~kriz/cifar.html>`__ classification is a common benchmark problem in machine learning. The CIFAR-10 dataset is a collection of images; it is one of the most widely used datasets for machine learning research and contains 60,000 32x32 color images in 10 different classes. Thus, we use CIFAR-10 classification as an example to introduce NNI usage.
**Goals**
^^^^^^^^^^^^^
As we all know, the choice of optimizer directly affects the performance of the final metrics. The goal of this tutorial is to **tune a better-performing optimizer** to train a relatively small convolutional neural network (CNN) for recognizing images.
In this example, we have selected the following common deep learning optimizers:
.. code-block:: bash
"SGD", "Adadelta", "Adagrad", "Adam", "Adamax"
**Experiments**
^^^^^^^^^^^^^^^^^^^^
Preparations
^^^^^^^^^^^^
This example requires PyTorch. The PyTorch package should be chosen based on your python version and CUDA version.
Here is an example for an environment with python==3.5 and cuda==8.0; use the following commands to install `PyTorch <https://pytorch.org/>`__\ :
.. code-block:: bash
python3 -m pip install http://download.pytorch.org/whl/cu80/torch-0.4.1-cp35-cp35m-linux_x86_64.whl
python3 -m pip install torchvision
CIFAR-10 with NNI
^^^^^^^^^^^^^^^^^
**Search Space**
As stated in the goals, we aim to find the best ``optimizer`` for training CIFAR-10 classification. When using different optimizers, we also need to adjust ``learning rates`` and the ``network structure`` accordingly. So we chose these three parameters as hyperparameters and wrote the following search space:
.. code-block:: json
{
"lr":{"_type":"choice", "_value":[0.1, 0.01, 0.001, 0.0001]},
"optimizer":{"_type":"choice", "_value":["SGD", "Adadelta", "Adagrad", "Adam", "Adamax"]},
"model":{"_type":"choice", "_value":["vgg", "resnet18", "googlenet", "densenet121", "mobilenet", "dpn92", "senet18"]}
}
Implemented code directory: :githublink:`search_space.json <examples/trials/cifar10_pytorch/search_space.json>`
**Trial**
The trial code trains a CNN with each set of hyperparameters; pay particular attention to the following points, which are specific to NNI:
* Use ``nni.get_next_parameter()`` to get next training hyperparameter set.
* Use ``nni.report_intermediate_result(acc)`` to report the intermediate result after finishing each epoch.
* Use ``nni.report_final_result(acc)`` to report the final result before the trial ends.
Implemented code directory: :githublink:`main.py <examples/trials/cifar10_pytorch/main.py>`
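As a minimal sketch of how the tuned ``optimizer`` name can be turned into a PyTorch optimizer inside the trial (the ``nn.Linear`` model below is only a stand-in for the real CNN, and the defaults are fallbacks for standalone runs):

.. code-block:: python

   import nni
   import torch.nn as nn
   import torch.optim as optim

   # Defaults keep the script runnable without NNI; get_next_parameter() overrides them.
   params = {'lr': 0.1, 'optimizer': 'SGD', 'model': 'vgg'}
   params.update(nni.get_next_parameter())

   model = nn.Linear(32 * 32 * 3, 10)  # stand-in for the tuned CNN

   # All five candidate names ("SGD", "Adadelta", "Adagrad", "Adam", "Adamax")
   # exist as classes in torch.optim, so the name can be looked up directly.
   optimizer_cls = getattr(optim, params['optimizer'])
   optimizer = optimizer_cls(model.parameters(), lr=params['lr'])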
You can also use your previous code directly; refer to `How to define a trial <Trials.rst>`__ for how to modify it.
**Config**
Here is an example of running this experiment locally (with multiple GPUs):
code directory: :githublink:`examples/trials/cifar10_pytorch/config.yml <examples/trials/cifar10_pytorch/config.yml>`
Here is the example of running this experiment on OpenPAI:
code directory: :githublink:`examples/trials/cifar10_pytorch/config_pai.yml <examples/trials/cifar10_pytorch/config_pai.yml>`
The complete examples we have implemented: :githublink:`examples/trials/cifar10_pytorch/ <examples/trials/cifar10_pytorch>`
Launch the experiment
^^^^^^^^^^^^^^^^^^^^^
We are ready for the experiment. Let's now **run the config.yml file from your command line to start the experiment**.
.. code-block:: bash
nnictl create --config nni/examples/trials/cifar10_pytorch/config.yml
GBDT in NNI
===========
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion as other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
Gradient boosting decision trees have many popular implementations, such as `lightgbm <https://github.com/Microsoft/LightGBM>`__\ , `xgboost <https://github.com/dmlc/xgboost>`__\ , and `catboost <https://github.com/catboost/catboost>`__. GBDT is a great tool for solving traditional machine learning problems. Since GBDT is a robust algorithm, it can be used in many domains. The better the hyper-parameters for GBDT, the better performance you can achieve.
NNI is a great platform for tuning hyper-parameters; you can try the various built-in search algorithms in NNI and run multiple trials concurrently.
1. Search Space in GBDT
-----------------------
There are many hyper-parameters in GBDT, but which of them affect performance or speed? Based on practical experience, here are some suggestions (taking lightgbm as an example):
* For better accuracy:

  * ``learning_rate``. The range of ``learning_rate`` could be [0.001, 0.9].
  * ``num_leaves``. ``num_leaves`` is related to ``max_depth``\ ; you don't have to tune both of them.
  * ``bagging_freq``. ``bagging_freq`` could be [1, 2, 4, 8, 10].
  * ``num_iterations``. May be larger if underfitting.

* For speed-up:

  * ``bagging_fraction``. The range of ``bagging_fraction`` could be [0.7, 1.0].
  * ``feature_fraction``. The range of ``feature_fraction`` could be [0.6, 1.0].
  * ``max_bin``.

* To avoid overfitting:

  * ``min_data_in_leaf``. This depends on your dataset.
  * ``min_sum_hessian_in_leaf``. This depends on your dataset.
  * ``lambda_l1`` and ``lambda_l2``.
  * ``min_gain_to_split``.
  * ``num_leaves``.
Reference links:
`lightgbm <https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html>`__ and `autoxgboost <https://github.com/ja-thomas/autoxgboost/blob/master/poster_2018.pdf>`__
2. Task description
-------------------
Now we come back to our example "auto-gbdt", which runs with lightgbm and NNI. The data includes :githublink:`train data <examples/trials/auto-gbdt/data/regression.train>` and :githublink:`test data <examples/trials/auto-gbdt/data/regression.test>`.
Given the features and label in train data, we train a GBDT regression model and use it to predict.
3. How to run in nni
--------------------
3.1 Install all the requirements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash
pip install lightgbm
pip install pandas
3.2 Prepare your trial code
^^^^^^^^^^^^^^^^^^^^^^^^^^^
You need to prepare basic code like the following:
.. code-block:: python
import lightgbm as lgb
from sklearn.metrics import mean_squared_error

...
def get_default_parameters():
...
return params
def load_data(train_path='./data/regression.train', test_path='./data/regression.test'):
'''
Load or create dataset
'''
...
return lgb_train, lgb_eval, X_test, y_test
def run(lgb_train, lgb_eval, params, X_test, y_test):
# train
gbm = lgb.train(params,
lgb_train,
num_boost_round=20,
valid_sets=lgb_eval,
early_stopping_rounds=5)
# predict
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
# eval
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print('The rmse of prediction is:', rmse)
if __name__ == '__main__':
lgb_train, lgb_eval, X_test, y_test = load_data()
PARAMS = get_default_parameters()
# train
run(lgb_train, lgb_eval, PARAMS, X_test, y_test)
3.3 Prepare your search space
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you would like to tune ``num_leaves``\ , ``learning_rate``\ , ``bagging_fraction`` and ``bagging_freq``\ , you could write a :githublink:`search_space.json <examples/trials/auto-gbdt/search_space.json>` as follows:
.. code-block:: json
{
"num_leaves":{"_type":"choice","_value":[31, 28, 24, 20]},
"learning_rate":{"_type":"choice","_value":[0.01, 0.05, 0.1, 0.2]},
"bagging_fraction":{"_type":"uniform","_value":[0.7, 1.0]},
"bagging_freq":{"_type":"choice","_value":[1, 2, 4, 8, 10]}
}
For more supported variable types, see `here <../Tutorial/SearchSpaceSpec.rst>`__.
3.4 Add the NNI SDK to your code
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: diff
+import nni
...
def get_default_parameters():
...
return params
def load_data(train_path='./data/regression.train', test_path='./data/regression.test'):
'''
Load or create dataset
'''
...
return lgb_train, lgb_eval, X_test, y_test
def run(lgb_train, lgb_eval, params, X_test, y_test):
# train
gbm = lgb.train(params,
lgb_train,
num_boost_round=20,
valid_sets=lgb_eval,
early_stopping_rounds=5)
# predict
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
# eval
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print('The rmse of prediction is:', rmse)
+ nni.report_final_result(rmse)
if __name__ == '__main__':
lgb_train, lgb_eval, X_test, y_test = load_data()
+ RECEIVED_PARAMS = nni.get_next_parameter()
PARAMS = get_default_parameters()
+ PARAMS.update(RECEIVED_PARAMS)
# train
run(lgb_train, lgb_eval, PARAMS, X_test, y_test)
3.5 Write a config file and run it
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In the config file, you can configure the following settings:
* Experiment setting: ``trialConcurrency``\ , ``trialGpuNumber``\ , etc.
* Platform setting: ``trainingService``\ , etc.
* Path setting: ``searchSpaceFile``\ , ``trialCodeDirectory``\ , etc.
* Algorithm setting: select ``tuner`` algorithm, ``tuner optimize_mode``\ , etc.
An example config.yml is as follows:
.. code-block:: yaml
experimentName: auto-gbdt example
searchSpaceFile: search_space.json
trialCommand: python3 main.py
trialGpuNumber: 0
trialConcurrency: 1
maxTrialNumber: 10
trainingService:
platform: local
tuner:
name: TPE #choice: TPE, Random, Anneal, Evolution, BatchTuner, etc
classArgs:
optimize_mode: minimize
Run this experiment with the following command:
.. code-block:: bash
nnictl create --config ./config.yml
.. role:: raw-html(raw)
:format: html
MNIST examples
==============
A CNN MNIST classifier for deep learning is similar to ``hello world`` for programming languages. Thus, we use MNIST as an example to introduce different features of NNI. The examples are listed below:
* `MNIST with NNI API (PyTorch) <#mnist-pytorch>`__
* `MNIST with NNI API (TensorFlow v2.x) <#mnist-tfv2>`__
* `MNIST with NNI API (TensorFlow v1.x) <#mnist-tfv1>`__
* `MNIST with NNI annotation <#mnist-annotation>`__
* `MNIST -- tuning with batch tuner <#mnist-batch>`__
* `MNIST -- tuning with hyperband <#mnist-hyperband>`__
* `MNIST -- tuning within a nested search space <#mnist-nested>`__
* `distributed MNIST (tensorflow) using kubeflow <#mnist-kubeflow-tf>`__
* `distributed MNIST (pytorch) using kubeflow <#mnist-kubeflow-pytorch>`__
:raw-html:`<a name="mnist-pytorch"></a>`
**MNIST with NNI API (PyTorch)**
This is a simple network which has two convolutional layers, two pooling layers and a fully connected layer.
We tune hyperparameters, such as dropout rate, convolution size, hidden size, etc.
It can be tuned with most NNI built-in tuners, such as TPE, SMAC, Random.
We also provide an example YAML file that enables the assessor.
code directory: :githublink:`mnist-pytorch/ <examples/trials/mnist-pytorch/>`
:raw-html:`<a name="mnist-tfv2"></a>`
**MNIST with NNI API (TensorFlow v2.x)**
The same network as the example above, but written with the TensorFlow v2.x API.
code directory: :githublink:`mnist-tfv2/ <examples/trials/mnist-tfv2/>`
:raw-html:`<a name="mnist-tfv1"></a>`
**MNIST with NNI API (TensorFlow v1.x)**
The same network as the example above, but written with the TensorFlow v1.x API.
code directory: :githublink:`mnist-tfv1/ <examples/trials/mnist-tfv1/>`
:raw-html:`<a name="mnist-annotation"></a>`
**MNIST with NNI annotation**
This example is similar to the one above; the only difference is that this example uses NNI annotation to specify the search space and report results, while the example above uses NNI APIs to receive configuration and report results.
code directory: :githublink:`mnist-annotation/ <examples/trials/mnist-annotation/>`
:raw-html:`<a name="mnist-batch"></a>`
**MNIST -- tuning with batch tuner**
This example is to show how to use batch tuner. Users simply list all the configurations they want to try in the search space file. NNI will try all of them.
code directory: :githublink:`mnist-batch-tune-keras/ <examples/trials/mnist-batch-tune-keras/>`
:raw-html:`<a name="mnist-hyperband"></a>`
**MNIST -- tuning with hyperband**
This example is to show how to use hyperband to tune the model. There is one more key ``STEPS`` in the received configuration for trials to control how long it can run (e.g., number of iterations).
code directory: :githublink:`mnist-hyperband/ <examples/trials/mnist-hyperband/>`
:raw-html:`<a name="mnist-nested"></a>`
**MNIST -- tuning within a nested search space**
This example shows that NNI also supports nested search spaces. The search space file is an example of how to define a nested search space.
code directory: :githublink:`mnist-nested-search-space/ <examples/trials/mnist-nested-search-space/>`
:raw-html:`<a name="mnist-kubeflow-tf"></a>`
**distributed MNIST (tensorflow) using kubeflow**
This example shows how to run distributed training on Kubeflow through NNI. Users simply provide the distributed training code and a config file which specifies the kubeflow mode, e.g., the command to run the ps, the command to run the worker, and how many resources they consume. This example is implemented in tensorflow and thus uses the kubeflow tensorflow operator.
code directory: :githublink:`mnist-distributed/ <examples/trials/mnist-distributed/>`
:raw-html:`<a name="mnist-kubeflow-pytorch"></a>`
**distributed MNIST (pytorch) using kubeflow**
Similar to the previous example; the difference is that this example is implemented in pytorch, so it uses the kubeflow pytorch operator.
code directory: :githublink:`mnist-distributed-pytorch/ <examples/trials/mnist-distributed-pytorch/>`
Pix2pix example
=================
Overview
--------
`Pix2pix <https://arxiv.org/abs/1611.07004>`__ is a conditional generative adversarial network (conditional GAN) framework proposed by Isola et al. in 2016, aimed at solving image-to-image translation problems. This framework performs well in a wide range of image generation problems. In the original paper, the authors demonstrate how to use pix2pix to solve the following image translation problems: 1) labels to street scene; 2) labels to facade; 3) BW to color; 4) aerial to map; 5) day to night; and 6) edges to photo. If you are interested, please read more on the `official project page <https://phillipi.github.io/pix2pix/>`__. In this example, we use pix2pix to introduce how to use NNI for tuning conditional GANs.
**Goals**
^^^^^^^^^^^^^
Although GANs are known to be able to generate high-resolution realistic images, they are generally fragile and difficult to optimize, and mode collapse can happen during training due to improper optimization settings, loss formulation, model architecture, weight initialization, or even data augmentation patterns. The goal of this tutorial is to leverage NNI hyperparameter tuning tools to automatically find a good setting for these important factors.
In this example, we aim at selecting the following hyperparameters automatically:
* ``ngf``: number of generator filters in the last conv layer
* ``ndf``: number of discriminator filters in the first conv layer
* ``netG``: generator architecture
* ``netD``: discriminator architecture
* ``norm``: normalization type
* ``init_type``: weight initialization method
* ``lr``: initial learning rate for adam
* ``beta1``: momentum term of adam
* ``lr_policy``: learning rate policy
* ``gan_mode``: type of GAN objective
* ``lambda_L1``: weight of L1 loss in the generator objective
**Experiments**
^^^^^^^^^^^^^^^^^^^^
Preparations
^^^^^^^^^^^^
This example requires the GPU version of PyTorch. The PyTorch installation should be chosen based on your system, python version, and CUDA version.
Please refer to the detailed instructions for installing `PyTorch <https://pytorch.org/get-started/locally/>`__.
Next, run the following shell script to clone the repository maintained by the original authors of pix2pix. This example relies on the implementations in this repository.
.. code-block:: bash
./setup.sh
Pix2pix with NNI
^^^^^^^^^^^^^^^^^
**Search Space**
We summarize the range of values for each hyperparameter mentioned above into a single search space JSON object:
.. code-block:: json
{
"ngf": {"_type":"choice","_value":[16, 32, 64, 128, 256]},
"ndf": {"_type":"choice","_value":[16, 32, 64, 128, 256]},
"netG": {"_type":"choice","_value":["resnet_9blocks", "unet_256"]},
"netD": {"_type":"choice","_value":["basic", "pixel", "n_layers"]},
"norm": {"_type":"choice","_value":["batch", "instance", "none"]},
"init_type": {"_type":"choice","_value":["xavier", "normal", "kaiming", "orthogonal"]},
"lr":{"_type":"choice","_value":[0.0001, 0.0002, 0.0005, 0.001, 0.005, 0.01, 0.1]},
"beta1":{"_type":"uniform","_value":[0, 1]},
"lr_policy": {"_type":"choice","_value":["linear", "step", "plateau", "cosine"]},
"gan_mode": {"_type":"choice","_value":["vanilla", "lsgan", "wgangp"]} ,
"lambda_L1": {"_type":"choice","_value":[1, 5, 10, 100, 250, 500]}
}
Starting from v2.0, the search space is directly included in the config. Please find the example here: :githublink:`config.yml <examples/trials/pix2pix-pytorch/config.yml>`
**Trial**
To experiment on this set of hyperparameters using NNI, we have to write trial code, which receives a set of parameter settings from NNI, trains a generator and a discriminator using these parameters, and then reports the final scores back to NNI. In the experiment, NNI repeatedly calls this trial code, passing in different sets of hyperparameter settings. It is important that the following three lines are incorporated in the trial code:
* Use ``nni.get_next_parameter()`` to get next hyperparameter set.
* (Optional) Use ``nni.report_intermediate_result(score)`` to report the intermediate result after finishing each epoch.
* Use ``nni.report_final_result(score)`` to report the final result before the trial ends.
Implemented code directory: :githublink:`pix2pix.py <examples/trials/pix2pix-pytorch/pix2pix.py>`
Some notes on the implementation:
* The trial code for this example is adapted from the `repository maintained by the authors of Pix2pix and CycleGAN <https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix>`__ . You can also use your previous code directly. Please refer to `How to define a trial <Trials.rst>`__ for modifying the code.
* By default, the code uses the dataset "facades". It also supports the datasets "night2day", "edges2handbags", "edges2shoes", and "maps".
* For "facades", 200 epochs are enough for the model to converge to a point where the difference between models trained with different hyperparameters are salient enough for evaluation. If you are using other datasets, please consider increasing the ``n_epochs`` and ``n_epochs_decay`` parameters by either passing them as arguments when calling ``pix2pix.py`` in the config file (discussed below) or changing the ``pix2pix.py`` directly. Also, for "facades", 200 epochs are enought for the final training, while the number may vary for other datasets.
* In this example, we use the L1 loss on the test set as the score to report to NNI. Although L1 is by no means a comprehensive measure of image generation performance, in most cases it makes sense for evaluating pix2pix models with a similar architectural setup. In this example, for the hyperparameters we experiment on, a lower L1 loss generally indicates better generation performance.
**Config**
Here is the example config of running this experiment on local (with a single GPU):
code directory: :githublink:`examples/trials/pix2pix-pytorch/config.yml <examples/trials/pix2pix-pytorch/config.yml>`
To have a full glance on our implementation, check: :githublink:`examples/trials/pix2pix-pytorch/ <examples/trials/pix2pix-pytorch>`
Launch the experiment
^^^^^^^^^^^^^^^^^^^^^
We are ready for the experiment. Let's now **run the config.yml file from your command line to start the experiment**.
.. code-block:: bash
nnictl create --config nni/examples/trials/pix2pix-pytorch/config.yml
Collecting the Results
^^^^^^^^^^^^^^^^^^^^^^
By default, our trial code saves the final trained model for each trial in the ``checkpoints/`` directory inside the trial directory of the NNI experiment. ``latest_net_G.pth`` and ``latest_net_D.pth`` correspond to the saved checkpoints for the generator and the discriminator.
To make it easier to run inference and see the generated images, we also incorporate a simple inference code here: :githublink:`test.py <examples/trials/pix2pix-pytorch/test.py>`
To use the code, run the following command:
.. code-block:: bash
python3 test.py -c CHECKPOINT -p PARAMETER_CFG -d DATASET_NAME -o OUTPUT_DIR
``CHECKPOINT`` is the directory containing the saved checkpoints (e.g., the ``checkpoints/`` directory in the trial directory). ``PARAMETER_CFG`` is the ``parameter.cfg`` file generated by NNI that records the hyperparameter settings. This file can be found in the trial directory created by NNI.
Results and Discussions
^^^^^^^^^^^^^^^^^^^^^^^
Following the previous steps, we ran the example for 40 trials using the TPE tuner. We found the best-performing parameters on the 'facades' dataset to be the following set:
.. code-block:: json
{
"ngf": 16,
"ndf": 128,
"netG": "unet_256",
"netD": "pixel",
"norm": "none",
"init_type": "normal",
"lr": 0.0002,
"beta1": 0.6954,
"lr_policy": "step",
"gan_mode": "lsgan",
"lambda_L1": 500
}
Meanwhile, we compare the results with a model trained using the following default empirical hyperparameter settings:
.. code-block:: json
{
"ngf": 128,
"ndf": 128,
"netG": "unet_256",
"netD": "basic",
"norm": "batch",
"init_type": "xavier",
"lr": 0.0002,
"beta1": 0.5,
"lr_policy": "linear",
"gan_mode": "lsgan",
"lambda_L1": 100
}
We can observe that for the learning rate (0.0002), the generator architecture (U-Net), and the GAN objective (LSGAN), the two results agree with each other. This is also consistent with the widely accepted practice on this dataset. Meanwhile, the hyperparameters "beta1", "lambda_L1", "ngf", and "ndf" are slightly changed in NNI's found solution to fit the target dataset. We found that the parameters searched by NNI outperform the empirical parameters on the facades dataset both in terms of L1 loss and the visual quality of the images. While the searched hyperparameters reach an L1 loss of 0.3317 on the facades test set, the empirical hyperparameters only achieve an L1 loss of 0.4148. The following image shows some sample input-output pairs from the facades test set produced by the model with hyperparameters tuned with NNI.
.. image:: ../../img/pix2pix_pytorch_facades.png
:target: ../../img/pix2pix_pytorch_facades.png
:alt:
Scikit-learn in NNI
===================
`Scikit-learn <https://github.com/scikit-learn/scikit-learn>`__ is a popular machine learning tool for data mining and data analysis. It supports many kinds of machine learning models like LinearRegression, LogisticRegression, DecisionTree, SVM etc. Making the use of scikit-learn more efficient is a valuable topic.
NNI supports many kinds of tuning algorithms to search for the best models and/or hyper-parameters for scikit-learn, and supports many kinds of environments like local machines, remote servers, and the cloud.
1. How to run the example
-------------------------
To start using NNI, you should install the NNI package and use the command line tool ``nnictl`` to start an experiment. For more information about installation and preparing the environment, please refer to `here <../Tutorial/QuickStart.rst>`__.
After you have installed NNI, you can enter the corresponding folder and start the experiment using the following command:
.. code-block:: bash
nnictl create --config ./config.yml
2. Description of the example
-----------------------------
2.1 classification
^^^^^^^^^^^^^^^^^^
This example uses the digits dataset, which is made up of 1797 8x8 images, each a hand-written digit. The goal is to classify these images into 10 classes.
In this example, we use SVC as the model and choose some parameters of this model, including ``"C", "kernel", "degree", "gamma" and "coef0"``. For more information about these parameters, please refer `here <https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html>`__.
2.2 regression
^^^^^^^^^^^^^^
This example uses the Boston Housing Dataset, which consists of house prices in various places in Boston along with information such as the crime rate (CRIM), areas of non-retail business in the town (INDUS), the age of people who own the house (AGE), etc. The task is to predict Boston house prices.
In this example, we tune different kinds of regression models including ``"LinearRegression", "SVR", "KNeighborsRegressor", "DecisionTreeRegressor"`` and some parameters like ``"svr_kernel", "knr_weights"``. You can get more details about these models from `here <https://scikit-learn.org/stable/supervised_learning.html#supervised-learning>`__.
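For illustration, the tuned ``"model_name"`` string could be mapped to an estimator as in the sketch below; the keyword routing here is a simplification, not the exact example code:

.. code-block:: python

   from sklearn.linear_model import LinearRegression
   from sklearn.neighbors import KNeighborsRegressor
   from sklearn.svm import SVR
   from sklearn.tree import DecisionTreeRegressor

   def build_model(params):
       """Build a regressor from tuned parameters (simplified sketch)."""
       name = params['model_name']
       if name == 'SVR':
           return SVR(kernel=params.get('svr_kernel', 'rbf'))
       if name == 'KNeighborsRegressor':
           return KNeighborsRegressor(weights=params.get('knr_weights', 'uniform'))
       return {'LinearRegression': LinearRegression,
               'DecisionTreeRegressor': DecisionTreeRegressor}[name]()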
3. How to write scikit-learn code using NNI
-------------------------------------------
It is easy to use NNI in your scikit-learn code; there are only a few steps.
*
**step 1**
Prepare a search_space.json to store your search space.
For example, if you want to choose different models, you may try:
.. code-block:: json
{
"model_name":{"_type":"choice","_value":["LinearRegression", "SVR", "KNeighborsRegressor", "DecisionTreeRegressor"]}
}
If you want to choose different models and parameters, you can put them together in one search_space.json file:
.. code-block:: json
{
"model_name":{"_type":"choice","_value":["LinearRegression", "SVR", "KNeighborsRegressor", "DecisionTreeRegressor"]},
"svr_kernel": {"_type":"choice","_value":["linear", "poly", "rbf"]},
"knr_weights": {"_type":"choice","_value":["uniform", "distance"]}
}
Then you can read these values as a dict in your python code, as described in step 2.
*
**step 2**
At the beginning of your python code, you should ``import nni`` to ensure the package works normally.
First, use the ``nni.get_next_parameter()`` function to get the parameters given by NNI. Then use these parameters to update your code.
For example, if you define your search_space.json in the following format:
.. code-block:: json
{
"C": {"_type":"uniform","_value":[0.1, 1]},
"kernel": {"_type":"choice","_value":["linear", "rbf", "poly", "sigmoid"]},
"degree": {"_type":"choice","_value":[1, 2, 3, 4]},
"gamma": {"_type":"uniform","_value":[0.01, 0.1]},
"coef0": {"_type":"uniform","_value":[0.01, 0.1]}
}
You may get a parameter dict like this:
.. code-block:: python
params = {
'C': 1.0,
'kernel': 'linear',
'degree': 3,
'gamma': 0.01,
'coef0': 0.01
}
Then you could use these variables to write your scikit-learn code.
*
**step 3**
After you finish your training, you will have a score for the model, such as precision, recall, or MSE. NNI needs your score for its tuner algorithms to generate the next group of parameters, so please report the score back to NNI so that the next trial job can start.
You just need to use ``nni.report_final_result(score)`` to communicate with NNI after your scikit-learn code finishes. Or, if you have multiple scores during training, you can also report them back to NNI using ``nni.report_intermediate_result(score)``. Note that reporting intermediate results is optional, but you must report the final result.
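Putting the three steps together, a minimal sketch of the classification example (using the digits dataset and the SVC search space from section 2.1) could look like this:

.. code-block:: python

   import nni
   from sklearn import datasets
   from sklearn.model_selection import train_test_split
   from sklearn.svm import SVC

   if __name__ == '__main__':
       # Step 2: defaults keep the script runnable in standalone mode.
       params = {'C': 1.0, 'kernel': 'linear', 'degree': 3, 'gamma': 0.01, 'coef0': 0.01}
       params.update(nni.get_next_parameter())

       X, y = datasets.load_digits(return_X_y=True)
       X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=99)

       model = SVC(**params)
       model.fit(X_train, y_train)

       # Step 3: report the test accuracy as the final result.
       nni.report_final_result(model.score(X_test, y_test))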
.. role:: raw-html(raw)
:format: html
Write a Trial Run on NNI
========================
A **Trial** in NNI is an individual attempt at applying a configuration (e.g., a set of hyper-parameters) to a model.
To define an NNI trial, you need to first define the set of parameters (i.e., search space) and then update the model. NNI provides two approaches for you to define a trial: `NNI API <#nni-api>`__ and `NNI Python annotation <#nni-annotation>`__. You could also refer to `here <#more-examples>`__ for more trial examples.
:raw-html:`<a name="nni-api"></a>`
NNI API
-------
Step 1 - Prepare a SearchSpace parameters file.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
An example is shown below:
.. code-block:: json
{
"dropout_rate":{"_type":"uniform","_value":[0.1,0.5]},
"conv_size":{"_type":"choice","_value":[2,3,5,7]},
"hidden_size":{"_type":"choice","_value":[124, 512, 1024]},
"learning_rate":{"_type":"uniform","_value":[0.0001, 0.1]}
}
Refer to `SearchSpaceSpec <../Tutorial/SearchSpaceSpec.rst>`__ to learn more about search spaces. The tuner will generate configurations from this search space; that is, it chooses a value for each hyperparameter from its range.
Step 2 - Update model code
^^^^^^^^^^^^^^^^^^^^^^^^^^
* Import NNI
Include ``import nni`` in your trial code to use NNI APIs.
* Get configuration from Tuner
.. code-block:: python
RECEIVED_PARAMS = nni.get_next_parameter()
``RECEIVED_PARAMS`` is an object, for example:
``{"conv_size": 2, "hidden_size": 124, "learning_rate": 0.0307, "dropout_rate": 0.2029}``.
* Report metric data periodically (optional)
.. code-block:: python
nni.report_intermediate_result(metrics)
``metrics`` can be any python object. If users use the NNI built-in tuner/assessor, ``metrics`` can only have two formats: 1) a number e.g., float, int, or 2) a dict object that has a key named ``default`` whose value is a number. These ``metrics`` are reported to `assessor <../Assessor/BuiltinAssessor.rst>`__. Often, ``metrics`` includes the periodically evaluated loss or accuracy.
* Report performance of the configuration
.. code-block:: python
nni.report_final_result(metrics)
``metrics`` can also be any python object. If users use the NNI built-in tuner/assessor, ``metrics`` follows the same format rule as in ``report_intermediate_result``\ ; the number indicates the model's performance, for example, the model's accuracy or loss. These ``metrics`` are reported to the `tuner <../Tuner/BuiltinTuner.rst>`__.
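For example, here is a minimal sketch of reporting dict-form metrics (``val_acc`` and ``val_loss`` are placeholders for values computed by your own training loop):

.. code-block:: python

   import nni

   val_acc, val_loss = 0.93, 0.21   # placeholders for your training loop's outputs

   # 'default' is the number used by the built-in tuner/assessor;
   # the other keys are kept alongside the recorded metric.
   nni.report_intermediate_result({'default': val_acc, 'loss': val_loss})
   nni.report_final_result({'default': val_acc})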
Step 3 - Enable NNI API
^^^^^^^^^^^^^^^^^^^^^^^
To enable NNI API mode, you need to set useAnnotation to *false* and provide the path of the SearchSpace file defined in step 1:
.. code-block:: yaml
searchSpacePath: /path/to/your/search_space.json
You can refer to `here <../Tutorial/ExperimentConfig.rst>`__ for more information about how to set up experiment configurations.
Please refer to `here <../sdk_reference.rst>`__ for more APIs (e.g., ``nni.get_sequence_id()``\ ) provided by NNI.
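As one small usage sketch, ``nni.get_sequence_id()`` can be used to give each trial a distinct output directory (the directory layout here is just an example):

.. code-block:: python

   import os

   import nni

   # Each trial in an experiment gets a unique sequence id, so these paths never collide.
   out_dir = os.path.join('outputs', 'trial_{}'.format(nni.get_sequence_id()))
   os.makedirs(out_dir, exist_ok=True)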
:raw-html:`<a name="nni-annotation"></a>`
NNI Python Annotation
---------------------
An alternative to writing a trial is to use NNI's syntax for python. NNI annotations are simple, similar to comments. You don't have to make structural changes to your existing code. With a few lines of NNI annotation, you will be able to:
* annotate the variables you want to tune
* specify the range in which you want to tune the variables
* annotate which variable you want to report as an intermediate result to ``assessor``
* annotate which variable you want to report as the final result (e.g. model accuracy) to ``tuner``.
Again, take MNIST as an example; it only requires 2 steps to write a trial with NNI annotation.
Step 1 - Update codes with annotations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following is a TensorFlow code snippet for NNI Annotation where the highlighted four lines are annotations that:
#. tune batch_size and dropout_rate
#. report test_acc every 100 steps
#. lastly report test_acc as the final result.
It's worth noting that, since these newly added lines are merely annotations, you can still run your code as usual in environments without NNI installed.
.. code-block:: diff
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
+ """@nni.variable(nni.choice(50, 250, 500), name=batch_size)"""
batch_size = 128
for i in range(10000):
batch = mnist.train.next_batch(batch_size)
+ """@nni.variable(nni.choice(0.1, 0.5), name=dropout_rate)"""
dropout_rate = 0.5
mnist_network.train_step.run(feed_dict={mnist_network.images: batch[0],
mnist_network.labels: batch[1],
mnist_network.keep_prob: dropout_rate})
if i % 100 == 0:
test_acc = mnist_network.accuracy.eval(
feed_dict={mnist_network.images: mnist.test.images,
mnist_network.labels: mnist.test.labels,
mnist_network.keep_prob: 1.0})
+ """@nni.report_intermediate_result(test_acc)"""
test_acc = mnist_network.accuracy.eval(
feed_dict={mnist_network.images: mnist.test.images,
mnist_network.labels: mnist.test.labels,
mnist_network.keep_prob: 1.0})
+ """@nni.report_final_result(test_acc)"""
**NOTE**\ :
* ``@nni.variable`` will affect its following line which should be an assignment statement whose left-hand side must be the same as the keyword ``name`` in the ``@nni.variable`` statement.
* ``@nni.report_intermediate_result``\ /\ ``@nni.report_final_result`` will send the data to assessor/tuner at that line.
For more information about annotation syntax and its usage, please refer to `Annotation <../Tutorial/AnnotationSpec.rst>`__.
Step 2 - Enable NNI Annotation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In the YAML configuration file, you need to set *useAnnotation* to true to enable NNI annotation:
.. code-block:: yaml
useAnnotation: true
Standalone mode for debugging
-----------------------------
NNI supports a standalone mode in which trial code can run without starting an NNI experiment. This makes it more convenient to find bugs in trial code. NNI annotation natively supports standalone mode, as the added NNI-related lines are comments. For NNI trial APIs, the APIs have changed behaviors in standalone mode: some APIs return dummy values, and some APIs do not really report values. Please refer to the following list for these APIs.
.. code-block:: python
# NOTE: please assign default values to the hyperparameters in your trial code
nni.get_next_parameter # return {}
nni.report_final_result # have log printed on stdout, but does not report
nni.report_intermediate_result # have log printed on stdout, but does not report
nni.get_experiment_id # return "STANDALONE"
nni.get_trial_id # return "STANDALONE"
nni.get_sequence_id # return 0
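One common way to follow the note at the top of that list is to merge NNI's parameters over explicit defaults, so the same script runs unchanged in standalone mode (a sketch):

.. code-block:: python

   import nni

   params = {'lr': 0.001, 'batch_size': 32}   # defaults used when running standalone
   params.update(nni.get_next_parameter())    # returns {} in standalone mode
   print(params)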
You can try standalone mode with the :githublink:`mnist example <examples/trials/mnist-pytorch>`. Simply run ``python3 mnist.py`` under the code directory. The trial code should successfully run with the default hyperparameter values.
For more information on debugging, please refer to `How to Debug <../Tutorial/HowToDebug.rst>`__.
Where are my trials?
--------------------
Local Mode
^^^^^^^^^^
In NNI, every trial has a dedicated directory to output its own data. In each trial, an environment variable called ``NNI_OUTPUT_DIR`` is exported. Under this directory, you can find each trial's code, data, and other logs. In addition, each trial's log (including stdout) will be redirected to a file named ``trial.log`` under that directory.
If NNI annotation is used, the trial's converted code is in another temporary directory. You can check this in the file named ``run.sh`` under the directory indicated by ``NNI_OUTPUT_DIR``. The second line (i.e., the ``cd`` command) of this file changes to the actual directory where the code is located. Below is an example of ``run.sh``\ :
.. code-block:: bash

   #!/bin/bash
   cd /tmp/user_name/nni/annotation/tmpzj0h72x6 # This is the actual directory
   export NNI_PLATFORM=local
   export NNI_SYS_DIR=/home/user_name/nni-experiments/$experiment_id$/trials/$trial_id$
   export NNI_TRIAL_JOB_ID=nrbb2
   export NNI_OUTPUT_DIR=/home/user_name/nni-experiments/$experiment_id$/trials/$trial_id$
   export NNI_TRIAL_SEQ_ID=1
   export MULTI_PHASE=false
   export CUDA_VISIBLE_DEVICES=
   eval python3 mnist.py 2>/home/user_name/nni-experiments/$experiment_id$/trials/$trial_id$/stderr
   echo $? `date +%s%3N` >/home/user_name/nni-experiments/$experiment_id$/trials/$trial_id$/.nni/state
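Your trial code can rely on ``NNI_OUTPUT_DIR`` to store additional artifacts next to ``trial.log``. A small sketch (the fallback value and file name are our own choices):

.. code-block:: python

   import os

   # Write an auxiliary artifact into the trial's dedicated output directory;
   # fall back to the current directory when running outside NNI.
   output_dir = os.environ.get('NNI_OUTPUT_DIR', '.')
   with open(os.path.join(output_dir, 'my_artifact.txt'), 'w') as f:
       f.write('saved from inside the trial\n')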
Other Modes
^^^^^^^^^^^
When running trials on other platforms, such as a remote machine or PAI, the environment variable ``NNI_OUTPUT_DIR`` only refers to the output directory of the trial, and the trial code and ``run.sh`` might not be there. However, ``trial.log`` will be transmitted back to the local machine into the trial's directory, which defaults to ``~/nni-experiments/$experiment_id$/trials/$trial_id$/``.
For more information, please refer to `HowToDebug <../Tutorial/HowToDebug.rst>`__.
:raw-html:`<a name="more-examples"></a>`
More Trial Examples
-------------------
* `Write logs to trial output directory for tensorboard <../Tutorial/Tensorboard.rst>`__
* `MNIST examples <MnistExamples.rst>`__
* `Finding out best optimizer for Cifar10 classification <Cifar10Examples.rst>`__
* `How to tune Scikit-learn on NNI <SklearnExamples.rst>`__
* `Automatic Model Architecture Search for Reading Comprehension. <SquadEvolutionExamples.rst>`__
* `Tuning GBDT on NNI <GbdtExample.rst>`__
* `Tuning RocksDB on NNI <RocksdbExamples.rst>`__
BOHB Advisor
============
BOHB is a robust and efficient hyperparameter tuning algorithm proposed in `this reference paper <https://arxiv.org/abs/1807.01774>`__. "BO" is an abbreviation for "Bayesian Optimization" and "HB" is an abbreviation for "Hyperband".
BOHB relies on HB (Hyperband) to determine how many configurations to evaluate with which budget, but it **replaces the random selection of configurations at the beginning of each HB iteration by a model-based search (Bayesian Optimization)**. Once the desired number of configurations for the iteration is reached, the standard successive halving procedure is carried out using these configurations. We keep track of the performance of all function evaluations g(x, b) of configurations x on all budgets b to use as a basis for our models in later iterations.
Below we divide the introduction of the BOHB process into two parts:
HB (Hyperband)
^^^^^^^^^^^^^^
We follow Hyperband’s way of choosing the budgets and continue to use SuccessiveHalving. For more details, you can refer to the `Hyperband in NNI <HyperbandAdvisor.rst>`__ and the `reference paper for Hyperband <https://arxiv.org/abs/1603.06560>`__. This procedure is summarized by the pseudocode below.
.. image:: ../../img/bohb_1.png
:target: ../../img/bohb_1.png
:alt:
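To make the budget schedule concrete, the sketch below computes the SuccessiveHalving brackets for the settings used in the workflow example later in this document (min_budget = 1, max_budget = 9, eta = 3). This is our own illustrative arithmetic, not NNI's implementation:

.. code-block:: python

   import math

   def brackets(min_budget=1, max_budget=9, eta=3):
       # s_max determines how many bracket types the {s=2, s=1, s=0} cycle contains.
       s_max = int(math.log(max_budget / min_budget, eta))
       for s in reversed(range(s_max + 1)):
           n = math.ceil((s_max + 1) * eta ** s / (s + 1))  # initial number of configs
           budget = max_budget / eta ** s                    # initial budget per config
           # Each successive-halving round keeps the top 1/eta configurations
           # and multiplies their budget by eta.
           rounds = [(n // eta ** i, budget * eta ** i) for i in range(s + 1)]
           print(f"s={s}: {rounds}")

   brackets()
   # s=2: [(9, 1.0), (3, 3.0), (1, 9.0)]
   # s=1: [(5, 3.0), (1, 9.0)]
   # s=0: [(3, 9.0)]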
BO (Bayesian Optimization)
^^^^^^^^^^^^^^^^^^^^^^^^^^
The BO part of BOHB closely resembles TPE with one major difference: we opted for a single multidimensional KDE compared to the hierarchy of one-dimensional KDEs used in TPE in order to better handle interaction effects in the input space.
The Tree Parzen Estimator (TPE) uses a KDE (kernel density estimator) to model the densities:
.. image:: ../../img/bohb_2.png
:target: ../../img/bohb_2.png
:alt:
To fit useful KDEs, we require a minimum number of data points Nmin; this is set to d + 1 for our experiments, where d is the number of hyperparameters. To build a model as early as possible, we do not wait until Nb = \|Db\|, the number of observations for budget b, is large enough to satisfy q · Nb ≥ Nmin. Instead, after initializing with Nmin + 2 random configurations, we choose the
.. image:: ../../img/bohb_3.png
:target: ../../img/bohb_3.png
:alt:
best and worst configurations, respectively, to model the two densities.
Note that we also sample a constant fraction of the configurations, called the **random fraction**, uniformly at random.
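The following is a deliberately simplified sketch of this model-based proposal step: fit KDEs over good and bad observations, draw candidates from a widened good KDE, and keep the candidate maximizing l(x)/g(x). It is our own illustration (continuous parameters only, using SciPy's ``gaussian_kde``); NNI's actual implementation in ``config_generator.py`` differs in detail:

.. code-block:: python

   import numpy as np
   from scipy.stats import gaussian_kde

   def propose(good_obs, bad_obs, num_samples=64, bandwidth_factor=3.0, rng=None):
       """good_obs, bad_obs: arrays of shape (n_points, n_hyperparameters)."""
       if rng is None:
           rng = np.random.default_rng()
       l = gaussian_kde(good_obs.T)   # density of good configurations, l(x)
       g = gaussian_kde(bad_obs.T)    # density of bad configurations, g(x)
       # Sample candidates around good points from a 'widened' KDE for diversity.
       centers = good_obs[rng.integers(len(good_obs), size=num_samples)]
       noise = rng.normal(scale=bandwidth_factor * l.factor, size=centers.shape)
       candidates = centers + noise
       # Keep the candidate with the maximum density ratio l(x) / g(x).
       scores = l(candidates.T) / np.maximum(g(candidates.T), 1e-32)
       return candidates[np.argmax(scores)]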
Workflow
--------
.. image:: ../../img/bohb_6.jpg
:target: ../../img/bohb_6.jpg
:alt:
This image shows the workflow of BOHB. Here we set max_budget = 9, min_budget = 1, eta = 3, and leave the other settings at their defaults. In this case s_max = 2, so we continuously run the {s=2, s=1, s=0, s=2, s=1, s=0, ...} cycle. In each stage of SuccessiveHalving (the orange box), we pick the top 1/eta configurations and run them again with more budget, repeating the SuccessiveHalving stage until the end of the iteration. At the same time, we collect the configurations, budgets, and final metrics of each trial, and use these to build a multidimensional KDE model keyed by budget.
Multidimensional KDE is used to guide the selection of configurations for the next iteration.
The sampling procedure (using Multidimensional KDE to guide selection) is summarized by the pseudocode below.
.. image:: ../../img/bohb_4.png
:target: ../../img/bohb_4.png
:alt:
Usage
-----
Installation
^^^^^^^^^^^^
BOHB advisor requires the `ConfigSpace <https://github.com/automl/ConfigSpace>`__ package. ConfigSpace can be installed using the following command.
.. code-block:: bash
pip install nni[BOHB]
Config File
^^^^^^^^^^^
To use BOHB, you should add the following spec in your experiment's YAML config file:
.. code-block:: yaml
advisor:
builtinAdvisorName: BOHB
classArgs:
optimize_mode: maximize
min_budget: 1
max_budget: 27
eta: 3
min_points_in_model: 7
top_n_percent: 15
num_samples: 64
random_fraction: 0.33
bandwidth_factor: 3.0
min_bandwidth: 0.001
**classArgs Requirements:**

* **optimize_mode** (*maximize or minimize, optional, default = maximize*) - If 'maximize', the tuner will try to maximize metrics. If 'minimize', the tuner will try to minimize metrics.
* **min_budget** (*int, optional, default = 1*) - The smallest budget to assign to a trial job (the budget can be the number of mini-batches or epochs). Must be positive.
* **max_budget** (*int, optional, default = 3*) - The largest budget to assign to a trial job (the budget can be the number of mini-batches or epochs). Must be larger than min_budget.
* **eta** (*int, optional, default = 3*) - In each iteration, a complete run of successive halving is executed. In it, after evaluating each configuration on the same subset size, only a fraction of 1/eta of them 'advances' to the next round. Must be greater than or equal to 2.
* **min_points_in_model** (*int, optional, default = None*) - Number of observations required to start building a KDE. The default 'None' means dim + 1; when the number of completed trials in a budget is equal to or larger than ``max{dim+1, min_points_in_model}``, BOHB will start to build a KDE model for this budget and use it to guide configuration selection. Must be positive. (dim is the number of hyperparameters in the search space.)
* **top_n_percent** (*int, optional, default = 15*) - Percentage (between 1 and 99) of the observations that are considered good. Good points and bad points are used for building the KDE models. For example, if you have 100 observed trials and top_n_percent is 15, the top 15% of points will be used for building the good-point model "l(x)", and the remaining 85% for building the bad-point model "g(x)".
* **num_samples** (*int, optional, default = 64*) - Number of samples used to optimize EI. BOHB samples "num_samples" points and compares their l(x)/g(x) values; it then returns the point with the maximum l(x)/g(x) value as the next configuration if optimize_mode is ``maximize``, and the one with the smallest value otherwise.
* **random_fraction** (*float, optional, default = 0.33*) - Fraction of purely random configurations that are sampled from the prior without the model.
* **bandwidth_factor** (*float, optional, default = 3.0*) - To encourage diversity, the points proposed to optimize EI are sampled from a 'widened' KDE whose bandwidth is multiplied by this factor. We suggest keeping the default value if you are not familiar with KDEs.
* **min_bandwidth** (*float, optional, default = 0.001*) - To keep diversity, even when all (good) samples have the same value for one of the parameters, a minimum bandwidth (default: 1e-3) is used instead of zero. We suggest keeping the default value if you are not familiar with KDEs.
* **config_space** (*str, optional*) - Directly use a .pcs file serialized by `ConfigSpace <https://automl.github.io/ConfigSpace/>`__ in "pcs new" format. In this case, the search space file (if provided in the config) will be ignored. Note that this path must be an absolute path; relative paths are currently not supported.

*Please note that the float type currently only supports decimal representations. You have to use 0.333 instead of 1/3 and 0.001 instead of 1e-3.*
File Structure
--------------
The advisor consists of many files, functions, and classes. Here we give only a brief introduction to the most important files:
* ``bohb_advisor.py`` Definition of BOHB, handles interaction with the dispatcher, including generating new trials and processing results. Also includes the implementation of the HB (Hyperband) part.
* ``config_generator.py`` Includes the implementation of the BO (Bayesian Optimization) part. The function *get_config* can generate new configurations based on BO; the function *new_result* will update the model with the new result.
Experiment
----------
MNIST with BOHB
^^^^^^^^^^^^^^^
Code implementation: :githublink:`examples/trials/mnist-advisor <examples/trials/>`

We used BOHB to build a CNN on the MNIST dataset. The following are our final experimental results:
.. image:: ../../img/bohb_5.png
:target: ../../img/bohb_5.png
:alt:
More experimental results can be found in the `reference paper <https://arxiv.org/abs/1807.01774>`__. We can see that BOHB makes good use of previous results and strikes a balanced trade-off between exploration and exploitation.