**Run an Experiment on DLTS**
=================================
NNI supports running an experiment on `DLTS <https://github.com/microsoft/DLWorkspace.git>`__\ , called dlts mode. Before starting to use NNI dlts mode, you should have an account to access the DLTS dashboard.
Setup Environment
-----------------
Step 1. Choose a cluster from the DLTS dashboard and ask the administrator for the cluster dashboard URL.
.. image:: ../../img/dlts-step1.png
   :target: ../../img/dlts-step1.png
   :alt: Choose Cluster
Step 2. Prepare an NNI config YAML file like the following:
.. code-block:: yaml

   # Set this field to "dlts"
   trainingServicePlatform: dlts
   authorName: your_name
   experimentName: auto_mnist
   trialConcurrency: 2
   maxExecDuration: 3h
   maxTrialNum: 100
   searchSpacePath: search_space.json
   useAnnotation: false
   tuner:
     builtinTunerName: TPE
     classArgs:
       optimize_mode: maximize
   trial:
     command: python3 mnist.py
     codeDir: .
     gpuNum: 1
     image: msranni/nni
   # Configuration to access DLTS
   dltsConfig:
     dashboard: # Ask administrator for the cluster dashboard URL
Remember to fill in the cluster dashboard URL on the last line.
Step 3. Open your working directory on the cluster and paste the NNI config as well as the related code into a directory.
.. image:: ../../img/dlts-step3.png
   :target: ../../img/dlts-step3.png
   :alt: Copy Config
Step 4. Submit an NNI manager job to the specified cluster.
.. image:: ../../img/dlts-step4.png
   :target: ../../img/dlts-step4.png
   :alt: Submit Job
Step 5. Go to the Endpoints tab of the newly created job and click the Port 40000 link to check the trials' information.
.. image:: ../../img/dlts-step5.png
   :target: ../../img/dlts-step5.png
   :alt: View NNI WebUI
**Tutorial: Create and Run an Experiment on local with NNI API**
================================================================
In this tutorial, we will use the example in ``nni/examples/trials/mnist-pytorch`` to explain how to create and run an experiment locally with the NNI API.
..
Before you start

You have an implementation of an MNIST classifier using convolutional layers; the Python code is similar to ``mnist.py``.
..
Step 1 - Update model codes
To enable NNI API, make the following changes:
1.1 Declare NNI API: include ``import nni`` in your trial code to use NNI APIs.
1.2 Get predefined parameters
Use the following code snippet:
.. code-block:: python

   tuner_params = nni.get_next_parameter()
to get the hyper-parameter values assigned by the tuner. ``tuner_params`` is an object, for example:
.. code-block:: json

   {"batch_size": 32, "hidden_size": 128, "lr": 0.01, "momentum": 0.2029}
..
1.3 Report NNI results: Use the API ``nni.report_intermediate_result(accuracy)`` to send ``accuracy`` to the assessor. Use the API ``nni.report_final_result(accuracy)`` to send ``accuracy`` to the tuner.
**NOTE**\ :
.. code-block:: text

   accuracy - The `accuracy` could be any Python object, but if you use an NNI built-in tuner/assessor, `accuracy` should be a numerical value (e.g. float, int).
   tuner - The tuner will generate the next parameters/architecture based on the exploration history (the final results of all trials).
   assessor - The assessor will decide which trial should stop early based on the trial's performance history (the intermediate results of one trial).
..
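Putting the three changes of Step 1 together, a minimal trial sketch could look like the following; ``train_one_epoch`` and ``evaluate`` are hypothetical placeholders for your own training and evaluation code, not NNI APIs:

.. code-block:: python

   import random
   import nni

   def train_one_epoch(params):
       """Placeholder for your real training step."""
       pass

   def evaluate(params):
       """Placeholder for your real evaluation step; returns a fake accuracy."""
       return random.random()

   def main():
       # 1.1 + 1.2: get the hyper-parameters chosen by the tuner
       params = nni.get_next_parameter()  # e.g. {"batch_size": 32, "lr": 0.01, ...}
       accuracy = 0.0
       for epoch in range(10):
           train_one_epoch(params)
           accuracy = evaluate(params)
           # 1.3: report intermediate results to the assessor
           nni.report_intermediate_result(accuracy)
       # 1.3: report the final result to the tuner
       nni.report_final_result(accuracy)

   if __name__ == '__main__':
       main()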
Step 2 - Define SearchSpace
The hyper-parameters used in ``Step 1.2 - Get predefined parameters`` are defined in a ``search_space.json`` file like below:
.. code-block:: json

   {
       "batch_size": {"_type": "choice", "_value": [16, 32, 64, 128]},
       "hidden_size": {"_type": "choice", "_value": [128, 256, 512, 1024]},
       "lr": {"_type": "choice", "_value": [0.0001, 0.001, 0.01, 0.1]},
       "momentum": {"_type": "uniform", "_value": [0, 1]}
   }
Refer to `define search space <../Tutorial/SearchSpaceSpec.rst>`__ to learn more about search space.
..
Step 3 - Define Experiment
..
To run an experiment in NNI, you only need to:
* Provide a runnable trial
* Provide or choose a tuner
* Provide a YAML experiment configuration file
* (optional) Provide or choose an assessor
**Prepare trial**\ :
..
You can download the NNI source code; a set of examples is provided under ``nni/examples``, and running ``ls ~/nni/examples/trials`` lists all the trial examples. Let's use a simple trial example, e.g. mnist, provided by NNI. You can simply execute the following command to run the NNI mnist example:
.. code-block:: bash
python ~/nni/examples/trials/mnist-pytorch/mnist.py
This command will be filled into the YAML configuration file below. Please refer to `here <../TrialExample/Trials.rst>`__ for how to write your own trial.
**Prepare tuner**\ : NNI supports several popular AutoML algorithms, including Random Search, Tree-structured Parzen Estimator (TPE), the Evolution algorithm, etc. Users can also write their own tuner (refer to `here <../Tuner/CustomizeTuner.rst>`__\ ), but for simplicity, here we choose a tuner provided by NNI as below:
.. code-block:: yaml

   tuner:
     name: TPE
     classArgs:
       optimize_mode: maximize
*name* is used to specify a tuner in NNI, *classArgs* are the arguments passed to the tuner (the spec of the built-in tuners can be found `here <../Tuner/BuiltinTuner.rst>`__\ ), and *optimize_mode* indicates whether you want to maximize or minimize your trial's result.
**Prepare configuration file**\ : Since you already know which trial code you are going to run and which tuner you are going to use, it is time to prepare the YAML configuration file. NNI provides a demo configuration file for each trial example; run ``cat ~/nni/examples/trials/mnist-pytorch/config.yml`` to see it. Its content is basically shown below:
.. code-block:: yaml

   experimentName: local training service example
   searchSpaceFile: ~/nni/examples/trials/mnist-pytorch/search_space.json
   trialCommand: python3 mnist.py
   trialCodeDirectory: ~/nni/examples/trials/mnist-pytorch
   trialGpuNumber: 0
   trialConcurrency: 1
   maxExperimentDuration: 3h
   maxTrialNumber: 10
   trainingService:
     platform: local
   tuner:
     name: TPE
     classArgs:
       optimize_mode: maximize
With all these steps done, we can run the experiment with the following command:
.. code-block:: bash
nnictl create --config ~/nni/examples/trials/mnist-pytorch/config.yml
You can refer to `here <../Tutorial/Nnictl.rst>`__ for more usage of the *nnictl* command line tool.
View experiment results
-----------------------
The experiment is now running. Other than *nnictl*\ , NNI also provides a WebUI for you to view the experiment's progress, control your experiment, and use some other appealing features.
Using multiple local GPUs to speed up search
--------------------------------------------
The following steps assume that you have 4 NVIDIA GPUs installed locally and PyTorch with CUDA support. The demo enables 4 concurrent trial jobs, and each trial job uses 1 GPU.
**Prepare configuration file**\ : NNI provides a demo configuration file for the setting above; run ``cat ~/nni/examples/trials/mnist-pytorch/config_detailed.yml`` to see it. The ``trialConcurrency`` and ``trialGpuNumber`` values differ from the basic configuration file:
.. code-block:: yaml

   ...
   trialGpuNumber: 1
   trialConcurrency: 4
   ...
   trainingService:
     platform: local
     useActiveGpu: false # set to "true" if you are using a graphical OS like Windows 10 or Ubuntu desktop
We can run the experiment with the following command:
.. code-block:: bash
nnictl create --config ~/nni/examples/trials/mnist-pytorch/config_detailed.yml
You can use the *nnictl* command line tool or the WebUI to trace the training progress. The *nvidia-smi* command line tool can also help you monitor GPU usage during training.
.. role:: raw-html(raw)
   :format: html
Implement Trial Code for NNI
============================
A **Trial** is an individual attempt at applying a set of parameters (for example, hyper-parameters) to a model.
To define an NNI trial, you first need to define the set of parameters (i.e., the search space) and then update the model code. There are two approaches to define a trial: `NNI Python API <#nni-api>`__ and `NNI Python annotation <#nni-annotation>`__. Refer to `here <#more-examples>`__ for more trial examples.
:raw-html:`<a name="nni-api"></a>`
NNI Trial API
-------------
Step 1: Prepare a search space parameter file
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
An example is shown below:
.. code-block:: json

   {
       "dropout_rate": {"_type": "uniform", "_value": [0.1, 0.5]},
       "conv_size": {"_type": "choice", "_value": [2, 3, 5, 7]},
       "hidden_size": {"_type": "choice", "_value": [124, 512, 1024]},
       "learning_rate": {"_type": "uniform", "_value": [0.0001, 0.1]}
   }
Refer to `SearchSpaceSpec.rst <../Tutorial/SearchSpaceSpec.rst>`__ to learn more about search spaces. The tuner generates configurations from the search space, i.e., it picks a value from the range of each hyper-parameter.
Step 2: Update the model code
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* Import NNI
Add ``import nni`` to your trial code.
* Get the parameter values from the tuner
.. code-block:: python

   RECEIVED_PARAMS = nni.get_next_parameter()
``RECEIVED_PARAMS`` is an object, for example:
``{"conv_size": 2, "hidden_size": 124, "learning_rate": 0.0307, "dropout_rate": 0.2029}``
* Report intermediate metrics periodically (optional)
.. code-block:: python

   nni.report_intermediate_result(metrics)
``metrics`` can be any Python object. If you use an NNI built-in tuner/assessor, ``metrics`` can only be one of two types: 1) a number, such as a float or an int, or 2) a dict object that contains a key named ``default`` whose value is a number. ``metrics`` is sent to the `assessor <../Assessor/BuiltinAssessor.rst>`__. Usually, ``metrics`` is the periodically evaluated loss or accuracy.
* Report the final performance of the configuration
.. code-block:: python

   nni.report_final_result(metrics)
``metrics`` can be any Python object. If you use a built-in tuner/assessor, ``metrics`` has the same format as in ``report_intermediate_result``; the number indicates the model's performance, for example accuracy or loss. ``metrics`` is sent to the `tuner <../Tuner/BuiltinTuner.rst>`__.
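A minimal sketch of both reporting calls, assuming you track accuracy per epoch (the per-epoch values are stand-ins; the dict form with a numeric ``default`` entry is shown for the built-in tuners/assessors):

.. code-block:: python

   import nni

   best_acc = 0.0
   for epoch in range(10):
       acc = 0.1 * (epoch + 1)  # stand-in for a real evaluation result
       best_acc = max(best_acc, acc)
       # a plain number also works, or a dict with a numeric "default" entry
       nni.report_intermediate_result({'default': acc, 'epoch': epoch})

   nni.report_final_result(best_acc)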
Step 3: Start an NNI experiment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To start an NNI experiment, provide the path to the search space file defined in step 1:
.. code-block:: yaml

   searchSpacePath: /path/to/your/search_space.json
Refer to `here <../Tutorial/ExperimentConfig.rst>`__ to learn more about how to configure an experiment.
Refer to `here <../sdk_reference.rst>`__ for more NNI trial APIs (for example, ``nni.get_sequence_id()``).
:raw-html:`<a name="nni-annotation"></a>`
NNI Annotation
---------------------
Another way to implement a trial is to use Python annotations to mark NNI. NNI annotations are simple and work like comments. You don't have to make any structural changes to your existing code. With just a few NNI annotations, you can:

* Annotate the variables you want to tune
* Specify the range within which each variable should be tuned
* Annotate which variable should be reported as an intermediate result to the ``assessor``
* Annotate which variable should be reported as the final result (for example, model accuracy) to the ``tuner``

Again taking MNIST as the example, it only takes two steps to implement trial code with NNI annotations.
Step 1: Add annotations to the code
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following is a TensorFlow code snippet with NNI annotations, where the four highlighted annotation lines:

#. Tune batch_size and dropout_rate
#. Report test_acc every 100 steps
#. Report test_acc as the final result at the end

Note that the newly added code lines are all comments, so they do not affect the previous execution logic; the code can still run in environments without NNI installed.
.. code-block:: diff

   with tf.Session() as sess:
       sess.run(tf.global_variables_initializer())
   +   """@nni.variable(nni.choice(50, 250, 500), name=batch_size)"""
       batch_size = 128
       for i in range(10000):
           batch = mnist.train.next_batch(batch_size)
   +       """@nni.variable(nni.choice(0.1, 0.5), name=dropout_rate)"""
           dropout_rate = 0.5
           mnist_network.train_step.run(feed_dict={mnist_network.images: batch[0],
                                                   mnist_network.labels: batch[1],
                                                   mnist_network.keep_prob: dropout_rate})
           if i % 100 == 0:
               test_acc = mnist_network.accuracy.eval(
                   feed_dict={mnist_network.images: mnist.test.images,
                              mnist_network.labels: mnist.test.labels,
                              mnist_network.keep_prob: 1.0})
   +           """@nni.report_intermediate_result(test_acc)"""
       test_acc = mnist_network.accuracy.eval(
           feed_dict={mnist_network.images: mnist.test.images,
                      mnist_network.labels: mnist.test.labels,
                      mnist_network.keep_prob: 1.0})
   +   """@nni.report_final_result(test_acc)"""
**Note**:

* ``@nni.variable`` takes effect on the line directly below it, and the variable assigned on the left-hand side must have the same name as the ``name`` keyword of ``@nni.variable``.
* ``@nni.report_intermediate_result``\ /\ ``@nni.report_final_result`` sends the data to the assessor/tuner.

For the syntax and usage of annotations, refer to `Annotation <../Tutorial/AnnotationSpec.rst>`__.
Step 2: Enable NNI annotation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In the YAML configuration file, set *useAnnotation* to true to enable NNI annotation:
.. code-block:: yaml

   useAnnotation: true
Standalone mode for debugging
-----------------------------
NNI supports a standalone mode in which trial code can run without launching an NNI experiment. This makes it easier to find bugs in trial code. NNI annotation supports standalone mode naturally, since the added NNI-related lines are comments. NNI trial APIs behave differently in standalone mode: some APIs return dummy values, and some APIs do not really report values. Please refer to the following list for the complete set of these APIs.
.. code-block:: python

   # NOTE: please assign default values to the hyper-parameters in your trial code
   nni.get_next_parameter            # returns {}
   nni.report_final_result           # logs to stdout, but does not report
   nni.report_intermediate_result    # logs to stdout, but does not report
   nni.get_experiment_id             # returns "STANDALONE"
   nni.get_trial_id                  # returns "STANDALONE"
   nni.get_sequence_id               # returns 0
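Because ``nni.get_next_parameter()`` returns ``{}`` in standalone mode, a common pattern (a sketch, not something NNI requires) is to merge the received parameters into a dict of default values so the trial also runs on its own:

.. code-block:: python

   import nni

   # default hyper-parameters used when running the trial standalone
   params = {'batch_size': 32, 'hidden_size': 128, 'lr': 0.01, 'momentum': 0.5}
   params.update(nni.get_next_parameter())  # {} in standalone mode, real values in an experiment
   print(params)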
You can try standalone mode with the :githublink:`mnist example <examples/trials/mnist-pytorch>`. Simply run ``python3 mnist.py`` under the code directory. The trial code should run successfully with the default hyper-parameter values.
For more information on debugging, please refer to `How to Debug <../Tutorial/HowToDebug.rst>`__.
Where are my trials?
--------------------
Local mode
^^^^^^^^^^
Every trial has a dedicated directory for its own output. In each trial run, an environment variable called ``NNI_OUTPUT_DIR`` is exported. Under this directory you can find the trial's code, data, and logs. In addition, the trial's log (including stdout) is redirected to a file named ``trial.log`` in this directory.
If NNI annotation is used, the converted trial code is placed in another temporary directory. You can check this directory via the ``NNI_OUTPUT_DIR`` variable in the ``run.sh`` file; the second line of that file (i.e., the ``cd`` command) switches to the actual directory of the code. An example of ``run.sh``:
.. code-block:: bash

   #!/bin/bash
   cd /tmp/user_name/nni/annotation/tmpzj0h72x6 # This is the actual directory
   export NNI_PLATFORM=local
   export NNI_SYS_DIR=/home/user_name/nni-experiments/$experiment_id$/trials/$trial_id$
   export NNI_TRIAL_JOB_ID=nrbb2
   export NNI_OUTPUT_DIR=/home/user_name/nni-experiments/$experiment_id$/trials/$trial_id$
   export NNI_TRIAL_SEQ_ID=1
   export MULTI_PHASE=false
   export CUDA_VISIBLE_DEVICES=
   eval python3 mnist.py 2>/home/user_name/nni-experiments/$experiment_id$/trials/$trial_id$/stderr
   echo $? `date +%s%3N` >/home/user_name/nni-experiments/$experiment_id$/trials/$trial_id$/.nni/state
Other modes
^^^^^^^^^^^
When a trial runs on a remote server such as OpenPAI, ``NNI_OUTPUT_DIR`` only refers to the trial's output directory, and ``run.sh`` is not in this directory. The ``trial.log`` file is copied back to the trial directory on the local machine, whose default location is ``~/nni-experiments/$experiment_id$/trials/$trial_id$/``.
For more information on debugging, please refer to `How to Debug <../Tutorial/HowToDebug.rst>`__.
:raw-html:`<a name="more-examples"></a>`
More trial examples
-------------------
* `Write trial logs to the output directory for TensorBoard <../Tutorial/Tensorboard.rst>`__
* `MNIST examples <MnistExamples.rst>`__
* `Finding the best optimizer for CIFAR-10 classification <Cifar10Examples.rst>`__
* `How to tune the hyper-parameters of scikit-learn with NNI <SklearnExamples.rst>`__
* `Automatic model architecture search for reading comprehension <SquadEvolutionExamples.rst>`__
* `How to tune GBDT with NNI <GbdtExample.rst>`__
* `Tuning RocksDB with NNI <RocksdbExamples.rst>`__
Anneal Tuner
============
This simple annealing algorithm begins by sampling from the prior but tends over time to sample from points closer and closer to the best ones observed. This algorithm is a simple variation on random search that leverages smoothness in the response surface. The annealing rate is not adaptive.
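As a toy illustration of this idea (a sketch only, not NNI's actual implementation), the search starts from the prior and narrows the sampling neighbourhood around the best observed point as a fixed "temperature" decays:

.. code-block:: python

   import random

   def toy_anneal_search(objective, low, high, n_trials=50, t0=1.0, decay=0.95):
       """Toy annealing-style search over a single uniform parameter."""
       best_x, best_y, temperature = None, float("-inf"), t0
       for _ in range(n_trials):
           if best_x is None:
               x = random.uniform(low, high)            # sample from the prior
           else:
               spread = (high - low) * temperature      # shrink the neighbourhood over time
               x = min(high, max(low, random.gauss(best_x, spread)))
           y = objective(x)
           if y > best_y:
               best_x, best_y = x, y
           temperature *= decay                         # fixed, non-adaptive annealing rate
       return best_x, best_y

   print(toy_anneal_search(lambda x: -(x - 0.3) ** 2, 0.0, 1.0))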
Usage
-----
classArgs Requirements
^^^^^^^^^^^^^^^^^^^^^^
* **optimize_mode** (*maximize or minimize, optional, default = maximize*) - If 'maximize', the tuner will try to maximize metrics. If 'minimize', the tuner will try to minimize metrics.
Example Configuration
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: yaml

   # config.yml
   tuner:
     name: Anneal
     classArgs:
       optimize_mode: maximize
Batch Tuner
===========
Batch tuner allows users to simply provide several configurations (i.e., choices of hyper-parameters) for their trial code. After finishing all the configurations, the experiment is done. Batch tuner only supports the type ``choice`` in the `search space spec <../Tutorial/SearchSpaceSpec.rst>`__.
Suggested scenario: If the configurations you want to try have been decided, you can list them in the SearchSpace file (using ``choice``) and run them using the batch tuner.
Usage
-----
Example Configuration
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: yaml

   # config.yml
   tuner:
     name: BatchTuner
Note that the search space for BatchTuner should look like:
.. code-block:: json

   {
       "combine_params":
       {
           "_type": "choice",
           "_value": [{"optimizer": "Adam", "learning_rate": 0.00001},
                      {"optimizer": "Adam", "learning_rate": 0.0001},
                      {"optimizer": "Adam", "learning_rate": 0.001},
                      {"optimizer": "SGD", "learning_rate": 0.01},
                      {"optimizer": "SGD", "learning_rate": 0.005},
                      {"optimizer": "SGD", "learning_rate": 0.0002}]
       }
   }
The search space file should include the high-level key ``combine_params``. The type of the parameter in the search space must be ``choice``, and its ``_value`` must include all the combined parameter values.
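For illustration, a trial consuming one of these combinations might look like the sketch below (assuming the ``combine_params`` search space above; the training code itself is omitted):

.. code-block:: python

   import nni

   # Each trial receives one of the listed combinations, e.g.
   # {"optimizer": "SGD", "learning_rate": 0.01}.
   params = nni.get_next_parameter()
   optimizer_name = params["optimizer"]
   learning_rate = params["learning_rate"]

   # ... build the model with optimizer_name / learning_rate and train ...
   nni.report_final_result(0.0)  # replace 0.0 with the real metric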
Grid Search Tuner
=================
Grid Search performs an exhaustive search through a search space.
For parameters with uniform or normal distributions, the grid search tuner samples them at progressively decreasing intervals.
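As a toy illustration of this sampling scheme (not the tuner's actual code), subdividing a uniform range at progressively finer levels could look like:

.. code-block:: python

   def progressive_grid(low, high, depth):
       """Sample a continuous range at progressively finer grids.

       Level 1 uses the midpoint, level 2 adds the quartiles, and so on,
       so the interval between neighbouring samples halves at every level.
       """
       points = []
       for level in range(1, depth + 1):
           step = (high - low) / (2 ** level)
           points.append([low + step * k for k in range(1, 2 ** level, 2)])
       return points

   # e.g. for uniform(0, 1): [[0.5], [0.25, 0.75], [0.125, 0.375, 0.625, 0.875]]
   print(progressive_grid(0.0, 1.0, 3))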
Usage
-----
The grid search tuner has no arguments.
Example Configuration
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: yaml

   tuner:
     name: GridSearch
Network Morphism Tuner
======================
`Autokeras <https://arxiv.org/abs/1806.10282>`__ is a popular AutoML tool using Network Morphism. The basic idea of Autokeras is to use Bayesian regression to estimate the metric of a neural network architecture. Each time, it generates several child networks from the parent networks. Then it uses a naïve Bayesian regression to estimate each child's metric value from the history of trained (network, metric) pairs. Next, it chooses the child with the best estimated performance and adds it to the training queue. Inspired by the work of Autokeras and referring to its `code <https://github.com/jhfjhfj1/autokeras>`__, we implemented our Network Morphism method on the NNI platform.
If you want to know more about network morphism trial usage, please see the :githublink:`Readme.md <examples/trials/network_morphism/README.rst>`.
Usage
-----
Installation
^^^^^^^^^^^^
NetworkMorphism requires :githublink:`PyTorch <examples/trials/network_morphism/requirements.txt>`.
classArgs Requirements
^^^^^^^^^^^^^^^^^^^^^^
* **optimize_mode** (*maximize or minimize, optional, default = maximize*) - If 'maximize', the tuner will try to maximize metrics. If 'minimize', the tuner will try to minimize metrics.
* **task** (*('cv'), optional, default = 'cv'*) - The domain of the experiment. For now, this tuner only supports the computer vision (CV) domain.
* **input_width** (*int, optional, default = 32*) - input image width
* **input_channel** (*int, optional, default = 3*) - input image channel
* **n_output_node** (*int, optional, default = 10*) - number of classes
Config File
^^^^^^^^^^^
To use Network Morphism, you should modify the following spec in your ``config.yml`` file:
.. code-block:: yaml

   tuner:
     #choice: NetworkMorphism
     name: NetworkMorphism
     classArgs:
       #choice: maximize, minimize
       optimize_mode: maximize
       #for now, this tuner only supports cv domain
       task: cv
       #modify to fit your input image width
       input_width: 32
       #modify to fit your input image channel
       input_channel: 3
       #modify to fit your number of classes
       n_output_node: 10
Example Configuration
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: yaml

   # config.yml
   tuner:
     name: NetworkMorphism
     classArgs:
       optimize_mode: maximize
       task: cv
       input_width: 32
       input_channel: 3
       n_output_node: 10
During the training procedure, the tuner generates a JSON file which represents a network graph. Users can call the ``json_to_graph()`` function to build a PyTorch or Keras model from this JSON file.
.. code-block:: python

   import nni
   from nni.networkmorphism_tuner.graph import json_to_graph

   def build_graph_from_json(ir_model_json):
       """build a pytorch model from json representation
       """
       graph = json_to_graph(ir_model_json)
       model = graph.produce_torch_model()
       return model

   # trial gets the next parameter from the network morphism tuner
   RCV_CONFIG = nni.get_next_parameter()
   # call the function to build a pytorch model or keras model
   net = build_graph_from_json(RCV_CONFIG)

   # training procedure
   # ....

   # report the final accuracy to NNI
   nni.report_final_result(best_acc)
If you want to save and load the **best model**, the following methods are recommended.
.. code-block:: python

   # 1. Use NNI API
   ## You can get the best model ID from WebUI
   ## or `nni-experiments/experiment_id/log/model_path/best_model.txt'
   ## read the json string from the model file and load it with the NNI API
   with open("best-model.json") as json_file:
       json_of_model = json_file.read()
   model = build_graph_from_json(json_of_model)

   # 2. Use Framework API (Related to Framework)
   ## 2.1 Keras API
   ## Save the model with the Keras API in the trial code
   ## it's better to save the model with its id in nni local mode
   model_id = nni.get_sequence_id()
   ## serialize model to JSON
   model_json = model.to_json()
   with open("model-{}.json".format(model_id), "w") as json_file:
       json_file.write(model_json)
   ## serialize weights to HDF5
   model.save_weights("model-{}.h5".format(model_id))

   ## Load the model with the Keras API if you want to reuse the model
   ## load json and create model
   model_id = ""  # id of the model you want to reuse
   with open('model-{}.json'.format(model_id), 'r') as json_file:
       loaded_model_json = json_file.read()
   loaded_model = model_from_json(loaded_model_json)
   ## load weights into the new model
   loaded_model.load_weights("model-{}.h5".format(model_id))

   ## 2.2 PyTorch API
   ## Save the model with the PyTorch API in the trial code
   model_id = nni.get_sequence_id()
   torch.save(model, "model-{}.pt".format(model_id))
   ## Load the model with the PyTorch API if you want to reuse the model
   model_id = ""  # id of the model you want to reuse
   loaded_model = torch.load("model-{}.pt".format(model_id))
File Structure
--------------
The tuner is composed of many different files, functions, and classes. Here we give only a brief introduction to most of them:
* ``networkmorphism_tuner.py`` is a tuner which uses network morphism techniques.
* ``bayesian.py`` is a Bayesian method to estimate the metric of an unseen model based on the models we have already searched.
* ``graph.py`` is the meta graph data structure. The class Graph represents the neural architecture graph of a model.

  * Graph extracts the neural architecture graph from a model.
  * Each node in the graph is an intermediate tensor between layers.
  * Each layer is an edge in the graph.
  * Notably, multiple edges may refer to the same layer.

* ``graph_transformer.py`` includes some graph transformers which widen, deepen, or add skip-connections to the graph.
* ``layers.py`` includes all the layers we use in our model.
* ``layer_transformer.py`` includes some layer transformers which widen, deepen, or add skip-connections to the layer.
* ``nn.py`` includes the class which generates the initial network.
* ``metric.py`` includes some metric classes such as Accuracy and MSE.
* ``utils.py`` contains the example search network architectures for the ``cifar10`` dataset, using Keras.
The Network Representation Json Example
---------------------------------------
Here is an example of the intermediate representation JSON file we defined, which is passed from the tuner to the trial during the architecture search procedure. Users can call the ``json_to_graph()`` function in the trial code to build a PyTorch or Keras model from this JSON file.
.. code-block:: json
{
"input_shape": [32, 32, 3],
"weighted": false,
"operation_history": [],
"layer_id_to_input_node_ids": {"0": [0],"1": [1],"2": [2],"3": [3],"4": [4],"5": [5],"6": [6],"7": [7],"8": [8],"9": [9],"10": [10],"11": [11],"12": [12],"13": [13],"14": [14],"15": [15],"16": [16]
},
"layer_id_to_output_node_ids": {"0": [1],"1": [2],"2": [3],"3": [4],"4": [5],"5": [6],"6": [7],"7": [8],"8": [9],"9": [10],"10": [11],"11": [12],"12": [13],"13": [14],"14": [15],"15": [16],"16": [17]
},
"adj_list": {
"0": [[1, 0]],
"1": [[2, 1]],
"2": [[3, 2]],
"3": [[4, 3]],
"4": [[5, 4]],
"5": [[6, 5]],
"6": [[7, 6]],
"7": [[8, 7]],
"8": [[9, 8]],
"9": [[10, 9]],
"10": [[11, 10]],
"11": [[12, 11]],
"12": [[13, 12]],
"13": [[14, 13]],
"14": [[15, 14]],
"15": [[16, 15]],
"16": [[17, 16]],
"17": []
},
"reverse_adj_list": {
"0": [],
"1": [[0, 0]],
"2": [[1, 1]],
"3": [[2, 2]],
"4": [[3, 3]],
"5": [[4, 4]],
"6": [[5, 5]],
"7": [[6, 6]],
"8": [[7, 7]],
"9": [[8, 8]],
"10": [[9, 9]],
"11": [[10, 10]],
"12": [[11, 11]],
"13": [[12, 12]],
"14": [[13, 13]],
"15": [[14, 14]],
"16": [[15, 15]],
"17": [[16, 16]]
},
"node_list": [
[0, [32, 32, 3]],
[1, [32, 32, 3]],
[2, [32, 32, 64]],
[3, [32, 32, 64]],
[4, [16, 16, 64]],
[5, [16, 16, 64]],
[6, [16, 16, 64]],
[7, [16, 16, 64]],
[8, [8, 8, 64]],
[9, [8, 8, 64]],
[10, [8, 8, 64]],
[11, [8, 8, 64]],
[12, [4, 4, 64]],
[13, [64]],
[14, [64]],
[15, [64]],
[16, [64]],
[17, [10]]
],
"layer_list": [
[0, ["StubReLU", 0, 1]],
[1, ["StubConv2d", 1, 2, 3, 64, 3]],
[2, ["StubBatchNormalization2d", 2, 3, 64]],
[3, ["StubPooling2d", 3, 4, 2, 2, 0]],
[4, ["StubReLU", 4, 5]],
[5, ["StubConv2d", 5, 6, 64, 64, 3]],
[6, ["StubBatchNormalization2d", 6, 7, 64]],
[7, ["StubPooling2d", 7, 8, 2, 2, 0]],
[8, ["StubReLU", 8, 9]],
[9, ["StubConv2d", 9, 10, 64, 64, 3]],
[10, ["StubBatchNormalization2d", 10, 11, 64]],
[11, ["StubPooling2d", 11, 12, 2, 2, 0]],
[12, ["StubGlobalPooling2d", 12, 13]],
[13, ["StubDropout2d", 13, 14, 0.25]],
[14, ["StubDense", 14, 15, 64, 64]],
[15, ["StubReLU", 15, 16]],
[16, ["StubDense", 16, 17, 64, 10]]
]
}
You can consider the model to be a `directed acyclic graph <https://en.wikipedia.org/wiki/Directed_acyclic_graph>`__. The definition of each model is a JSON object where:
* ``input_shape`` is a list of integers which do not include the batch axis.
* ``weighted`` means whether the weights and biases in the neural network should be included in the graph.
* ``operation_history`` is a list saving all the network morphism operations.
* ``layer_id_to_input_node_ids`` is a dictionary mapping from layer identifiers to their input nodes identifiers.
* ``layer_id_to_output_node_ids`` is a dictionary mapping from layer identifiers to their output nodes identifiers
* ``adj_list`` is a two-dimensional list; the adjacency list of the graph. The first dimension is identified by tensor identifiers. In each edge list, the elements are two-element tuples of (tensor identifier, layer identifier).
* ``reverse_adj_list`` is a reverse adjacent list in the same format as adj_list.
* ``node_list`` is a list of integers. The indices of the list are the identifiers.
* ``layer_list`` is a list of stub layers. The indices of the list are the identifiers.

  * For ``StubConv (StubConv1d, StubConv2d, StubConv3d)``, the numbering follows the format: its node input id (or id list), node output id, input_channel, filters, kernel_size, stride, and padding.
  * For ``StubDense``, the numbering follows the format: its node input id (or id list), node output id, input_units, and units.
  * For ``StubBatchNormalization (StubBatchNormalization1d, StubBatchNormalization2d, StubBatchNormalization3d)``, the numbering follows the format: its node input id (or id list), node output id, and number of features.
  * For ``StubDropout (StubDropout1d, StubDropout2d, StubDropout3d)``, the numbering follows the format: its node input id (or id list), node output id, and dropout rate.
  * For ``StubPooling (StubPooling1d, StubPooling2d, StubPooling3d)``, the numbering follows the format: its node input id (or id list), node output id, kernel_size, stride, and padding.
  * For all other layers, the numbering follows the format: its node input id (or id list) and node output id.
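To make the representation concrete, the short sketch below walks the ``layer_list`` of such a JSON file and prints one line per stub layer (the file name ``best-model.json`` is only an assumption here):

.. code-block:: python

   import json

   # Load the intermediate representation produced by the tuner.
   with open("best-model.json") as f:
       graph = json.load(f)

   # Each entry is [layer_id, [layer_type, *numbering]] as described above.
   for layer_id, layer in graph["layer_list"]:
       layer_type, *args = layer
       print(f"layer {layer_id}: {layer_type} with args {args}")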
TODO
----
As a next step, we will change the API from a fixed network generator to a network generator with more available operators. In the future, we will use ONNX instead of JSON as the intermediate representation spec.
Random Tuner
============
`Random Search for Hyper-Parameter Optimization <http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf>`__ shows that random search might be surprisingly effective despite its simplicity.
We suggest using Random Search as a baseline when no knowledge about the prior distribution of hyper-parameters is available.
Usage
-----
Example Configuration
^^^^^^^^^^^^^^^^^^^^^
.. code-block:: yaml

   tuner:
     name: Random
     classArgs:
       seed: 100  # optional
TPE Tuner
=========
The Tree-structured Parzen Estimator (TPE) is a sequential model-based optimization (SMBO) approach.
SMBO methods sequentially construct models to approximate the performance of hyperparameters based on historical measurements,
and then subsequently choose new hyperparameters to test based on this model.
The TPE approach models P(x|y) and P(y), where x represents hyperparameters and y the associated evaluation metric.
P(x|y) is modeled by transforming the generative process of hyperparameters,
replacing the distributions of the configuration prior with non-parametric densities.
This optimization approach is described in detail in `Algorithms for Hyper-Parameter Optimization <https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf>`__.
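As a brief recap of the paper's formulation (not an NNI-specific detail), TPE splits the observed results at a quantile :math:`y^*` and models two conditional densities:

.. math::

   l(x) = p(x \mid y < y^*), \qquad g(x) = p(x \mid y \ge y^*).

It then proposes the candidate that maximizes the ratio :math:`l(x)/g(x)`, which the paper shows maximizes the expected improvement.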
Parallel TPE optimization
^^^^^^^^^^^^^^^^^^^^^^^^^
TPE approaches were actually run asynchronously in order to make use of multiple compute nodes and to avoid wasting time waiting for trial evaluations to complete.
The original algorithm design was optimized for sequential computation.
If we were to use TPE with high concurrency, its performance would degrade.
We have optimized this case using the Constant Liar algorithm.
For these principles of optimization, please refer to our `research blog <../CommunitySharings/ParallelizingTpeSearch.rst>`__.
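As a rough sketch of the Constant Liar idea (a generic illustration, not NNI's actual implementation), each still-running trial is temporarily assigned a "lied" metric so that the surrogate model does not propose the same point again:

.. code-block:: python

   def lie_value(observed, liar_type="best", minimize=True):
       """Value temporarily assigned to pending trials.

       `observed` holds the finished trial metrics; `liar_type` mirrors the
       `constant_liar_type` options described below.
       """
       if not observed:
           return 0.0
       if liar_type == "best":
           return min(observed) if minimize else max(observed)
       if liar_type == "worst":
           return max(observed) if minimize else min(observed)
       return sum(observed) / len(observed)  # "mean"

   # While two trials are still running, the optimizer is fitted on the
   # finished results plus one lied value per pending trial:
   finished = [0.32, 0.27, 0.41]
   pending_lies = [lie_value(finished, "best") for _ in range(2)]
   print(finished + pending_lies)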
Usage
-----
To use TPE, you should add the following spec in your experiment's YAML config file:
.. code-block:: yaml

   ## minimal config ##
   tuner:
     name: TPE
     classArgs:
       optimize_mode: minimize

.. code-block:: yaml

   ## advanced config ##
   tuner:
     name: TPE
     classArgs:
       optimize_mode: maximize
       seed: 12345
       tpe_args:
         constant_liar_type: 'mean'
         n_startup_jobs: 10
         n_ei_candidates: 20
         linear_forgetting: 100
         prior_weight: 0
         gamma: 0.5
classArgs
^^^^^^^^^
.. list-table::
   :widths: 10 20 10 60
   :header-rows: 1

   * - Field
     - Type
     - Default
     - Description
   * - ``optimize_mode``
     - ``'minimize' | 'maximize'``
     - ``'minimize'``
     - Whether to minimize or maximize trial metrics.
   * - ``seed``
     - ``int | null``
     - ``null``
     - The random seed.
   * - ``tpe_args.constant_liar_type``
     - ``'best' | 'worst' | 'mean' | null``
     - ``'best'``
     - The TPE algorithm itself does not support parallel tuning. This parameter specifies how to optimize for trial_concurrency > 1. How each liar works is explained in section 6.1 of the paper.
       In general, ``best`` suits a small number of trials and ``worst`` suits a large number of trials.
   * - ``tpe_args.n_startup_jobs``
     - ``int``
     - ``20``
     - The first N sets of hyper-parameters are generated fully randomly for warm-up.
       If the search space is large, you can increase this value; if max_trial_number is small, you may want to decrease it.
   * - ``tpe_args.n_ei_candidates``
     - ``int``
     - ``24``
     - In each iteration, TPE samples EI for N sets of parameters and chooses the best one (loosely speaking).
   * - ``tpe_args.linear_forgetting``
     - ``int``
     - ``25``
     - TPE will lower the weights of old trials. This controls how many iterations it takes for a trial to start decaying.
   * - ``tpe_args.prior_weight``
     - ``float``
     - ``1.0``
     - TPE treats the user-provided search space as a prior.
       When generating new trials, it also incorporates the prior into the trial history by transforming the search space into
       one trial configuration (i.e., each parameter of this configuration chooses the mean of its candidate range).
       Here, prior_weight determines the weight of this trial configuration among the history trial configurations.
       With prior weight 1.0, the search space is treated as one good trial.
       For example, "normal(0, 1)" effectively equals a trial with x = 0 which has yielded a good result.
   * - ``tpe_args.gamma``
     - ``float``
     - ``0.25``
     - Controls how many trials are considered "good".
       The number is calculated as "min(gamma * sqrt(N), linear_forgetting)".
Experiment Config Reference (legacy)
====================================
This is the previous version (V1) of the experiment configuration specification. It is still supported for now, but we recommend using `the new version of the experiment configuration (V2) <../reference/experiment_config.rst>`_.
A config file is needed when creating an experiment. The path of the config file is provided to ``nnictl``.
The config file is in YAML format.
This document describes the rules to write the config file, and provides some examples and templates.
* `Experiment Config Reference <#experiment-config-reference>`__
* `Template <#template>`__
* `Configuration Spec <#configuration-spec>`__
* `authorName <#authorname>`__
* `experimentName <#experimentname>`__
* `trialConcurrency <#trialconcurrency>`__
* `maxExecDuration <#maxexecduration>`__
* `versionCheck <#versioncheck>`__
* `debug <#debug>`__
* `maxTrialNum <#maxtrialnum>`__
* `maxTrialDuration <#maxtrialduration>`__
* `trainingServicePlatform <#trainingserviceplatform>`__
* `searchSpacePath <#searchspacepath>`__
* `useAnnotation <#useannotation>`__
* `multiThread <#multithread>`__
* `nniManagerIp <#nnimanagerip>`__
* `logDir <#logdir>`__
* `logLevel <#loglevel>`__
* `logCollection <#logcollection>`__
* `tuner <#tuner>`__
* `builtinTunerName <#builtintunername>`__
* `codeDir <#codedir>`__
* `classFileName <#classfilename>`__
* `className <#classname>`__
* `classArgs <#classargs>`__
* `gpuIndices <#gpuindices>`__
* `includeIntermediateResults <#includeintermediateresults>`__
* `assessor <#assessor>`__
* `builtinAssessorName <#builtinassessorname>`__
* `codeDir <#codedir-1>`__
* `classFileName <#classfilename-1>`__
* `className <#classname-1>`__
* `classArgs <#classargs-1>`__
* `advisor <#advisor>`__
* `builtinAdvisorName <#builtinadvisorname>`__
* `codeDir <#codedir-2>`__
* `classFileName <#classfilename-2>`__
* `className <#classname-2>`__
* `classArgs <#classargs-2>`__
* `gpuIndices <#gpuindices-1>`__
* `trial <#trial>`__
* `localConfig <#localconfig>`__
* `gpuIndices <#gpuindices-2>`__
* `maxTrialNumPerGpu <#maxtrialnumpergpu>`__
* `useActiveGpu <#useactivegpu>`__
* `machineList <#machinelist>`__
* `ip <#ip>`__
* `port <#port>`__
* `username <#username>`__
* `passwd <#passwd>`__
* `sshKeyPath <#sshkeypath>`__
* `passphrase <#passphrase>`__
* `gpuIndices <#gpuindices-3>`__
* `maxTrialNumPerGpu <#maxtrialnumpergpu-1>`__
* `useActiveGpu <#useactivegpu-1>`__
* `pythonPath <#pythonPath>`__
* `kubeflowConfig <#kubeflowconfig>`__
* `operator <#operator>`__
* `storage <#storage>`__
* `nfs <#nfs>`__
* `keyVault <#keyvault>`__
* `azureStorage <#azurestorage>`__
* `uploadRetryCount <#uploadretrycount>`__
* `paiConfig <#paiconfig>`__
* `userName <#username>`__
* `password <#password>`__
* `token <#token>`__
* `host <#host>`__
* `reuse <#reuse>`__
* `Examples <#examples>`__
* `Local mode <#local-mode>`__
* `Remote mode <#remote-mode>`__
* `PAI mode <#pai-mode>`__
* `Kubeflow mode <#kubeflow-mode>`__
* `Kubeflow with azure storage <#kubeflow-with-azure-storage>`__
Template
--------
* **Light weight (without Annotation and Assessor)**
.. code-block:: yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName:
classArgs:
#choice: maximize, minimize
optimize_mode:
gpuIndices:
trial:
command:
codeDir:
gpuNum:
#machineList can be empty if the platform is local
machineList:
- ip:
port:
username:
passwd:
* **Use Assessor**
.. code-block:: yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
searchSpacePath:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName:
classArgs:
#choice: maximize, minimize
optimize_mode:
gpuIndices:
assessor:
#choice: Medianstop
builtinAssessorName:
classArgs:
#choice: maximize, minimize
optimize_mode:
trial:
command:
codeDir:
gpuNum:
#machineList can be empty if the platform is local
machineList:
- ip:
port:
username:
passwd:
* **Use Annotation**
.. code-block:: yaml
authorName:
experimentName:
trialConcurrency:
maxExecDuration:
maxTrialNum:
#choice: local, remote, pai, kubeflow
trainingServicePlatform:
#choice: true, false, default: false
useAnnotation:
#choice: true, false, default: false
multiThread:
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName:
classArgs:
#choice: maximize, minimize
optimize_mode:
gpuIndices:
assessor:
#choice: Medianstop
builtinAssessorName:
classArgs:
#choice: maximize, minimize
optimize_mode:
trial:
command:
codeDir:
gpuNum:
#machineList can be empty if the platform is local
machineList:
- ip:
port:
username:
passwd:
Configuration Spec
------------------
authorName
^^^^^^^^^^
Required. String.
The name of the author who created the experiment.
*TBD: add default value.*
experimentName
^^^^^^^^^^^^^^
Required. String.
The name of the experiment created.
*TBD: add default value.*
trialConcurrency
^^^^^^^^^^^^^^^^
Required. Integer between 1 and 99999.
Specifies the max number of trial jobs that run simultaneously.
If trialGpuNum is bigger than the number of free GPUs and the number of trial jobs running simultaneously cannot reach **trialConcurrency**, some trial jobs will be put into a queue to wait for GPU allocation.
maxExecDuration
^^^^^^^^^^^^^^^
Optional. String. Default: 999d.
**maxExecDuration** specifies the max duration time of an experiment. The unit of the time is {**s**\ , **m**\ , **h**\ , **d**\ }, which means {*seconds*\ , *minutes*\ , *hours*\ , *days*\ }.
Note: The maxExecDuration spec sets the duration of the experiment, not of a trial job. If the experiment reaches the max duration, it will not stop, but it can no longer submit new trial jobs.
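For illustration only (this helper is hypothetical and not part of NNI), the unit suffixes map to seconds as follows:

.. code-block:: python

   # Convert a duration string such as "3h" or "999d" into seconds,
   # using the units documented above.
   UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

   def duration_to_seconds(duration: str) -> int:
       value, unit = duration[:-1], duration[-1]
       return int(value) * UNITS[unit]

   assert duration_to_seconds("3h") == 10800
   assert duration_to_seconds("999d") == 999 * 86400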
versionCheck
^^^^^^^^^^^^
Optional. Bool. Default: true.
NNI will check the version of the nniManager process and the version of the trialKeeper on the remote, pai and kubernetes platforms. If you want to disable the version check, you can set versionCheck to false.
debug
^^^^^
Optional. Bool. Default: false.
Debug mode will set versionCheck to false and set logLevel to be 'debug'.
maxTrialNum
^^^^^^^^^^^
Optional. Integer between 1 and 99999. Default: 99999.
Specifies the max number of trial jobs created by NNI, including succeeded and failed jobs.
maxTrialDuration
^^^^^^^^^^^^^^^^
Optional. String. Default: 999d.
**maxTrialDuration** specifies the max duration of each trial job. The unit of the time is {**s**\ , **m**\ , **h**\ , **d**\ }, which means {*seconds*\ , *minutes*\ , *hours*\ , *days*\ }. If the current trial job reaches the max duration, it will stop.
trainingServicePlatform
^^^^^^^^^^^^^^^^^^^^^^^
Required. String.
Specifies the platform to run the experiment, including **local**\ , **remote**\ , **pai**\ , **kubeflow**\ , **frameworkcontroller**.
* **local**: runs an experiment on the local Ubuntu machine.
* **remote**: submits trial jobs to remote Ubuntu machines; the **machineList** field should be filled in to set up the SSH connection to the remote machines.
* **pai**: submits trial jobs to Microsoft's `OpenPAI <https://github.com/Microsoft/pai>`__. For more details of pai configuration, please refer to the `Guide to PAI Mode <../TrainingService/PaiMode.rst>`__.
* **kubeflow**: submits trial jobs to `kubeflow <https://www.kubeflow.org/docs/about/kubeflow/>`__. NNI supports Kubeflow based on normal Kubernetes and `Azure Kubernetes <https://azure.microsoft.com/en-us/services/kubernetes-service/>`__. For details, please refer to the `Kubeflow Docs <../TrainingService/KubeflowMode.rst>`__.
* **adl**: submits trial jobs to `AdaptDL <https://www.kubeflow.org/docs/about/kubeflow/>`__. NNI supports AdaptDL on Kubernetes clusters. For details, please refer to the `AdaptDL Docs <../TrainingService/AdaptDLMode.rst>`__.
* TODO: explain frameworkcontroller.
searchSpacePath
^^^^^^^^^^^^^^^
Optional. Path to existing file.
Specifies the path of the search space file, which should be a valid path on the local Linux machine.
The only case in which **searchSpacePath** can be omitted is when ``useAnnotation=True``.
useAnnotation
^^^^^^^^^^^^^
Optional. Bool. Default: false.
Use annotation to analyze the trial code and generate the search space.
Note: if **useAnnotation** is true, the searchSpacePath field should be removed.
multiThread
^^^^^^^^^^^
Optional. Bool. Default: false.
Enable multi-thread mode for dispatcher. If multiThread is enabled, dispatcher will start a thread to process each command from NNI Manager.
nniManagerIp
^^^^^^^^^^^^
Optional. String. Default: eth0 device IP.
Sets the IP address of the machine on which the NNI manager process runs. This field is optional, and if it's not set, the eth0 device IP will be used instead.
Note: run ``ifconfig`` on the NNI manager's machine to check whether the eth0 device exists. If not, it is recommended to set **nniManagerIp** explicitly.
logDir
^^^^^^
Optional. Path to a directory. Default: ``<user home directory>/nni-experiments``.
Configures the directory to store logs and data of the experiment.
logLevel
^^^^^^^^
Optional. String. Default: ``info``.
Sets log level for the experiment. Available log levels are: ``trace``\ , ``debug``\ , ``info``\ , ``warning``\ , ``error``\ , ``fatal``.
logCollection
^^^^^^^^^^^^^
Optional. ``http`` or ``none``. Default: ``none``.
Sets the way to collect logs on the remote, pai, kubeflow, and frameworkcontroller platforms. There are two ways to collect logs. One is ``http``\ : the trial keeper posts the log content back via http requests, but this may slow down log processing in the trialKeeper. The other is ``none``\ : the trial keeper does not post the log content back and only posts job metrics. If your log content is too big, you could consider setting this parameter to ``none``.
tuner
^^^^^
Required.
Specifies the tuner algorithm of the experiment. There are two ways to set the tuner. One is to use a tuner provided by the NNI SDK (a built-in tuner), in which case you need to set **builtinTunerName** and **classArgs**. The other is to use your own tuner file, in which case **codeDir**\ , **classFileName**\ , **className** and **classArgs** are needed. *Users must choose exactly one way.*
builtinTunerName
^^^^^^^^^^^^^^^^
Required if using built-in tuners. String.
Specifies the name of system tuner, NNI sdk provides different tuners introduced `here <../Tuner/BuiltinTuner.rst>`__.
codeDir
^^^^^^^
Required if using customized tuners. Path relative to the location of config file.
Specifies the directory of tuner code.
classFileName
^^^^^^^^^^^^^
Required if using customized tuners. File path relative to **codeDir**.
Specifies the name of tuner file.
className
^^^^^^^^^
Required if using customized tuners. String.
Specifies the name of tuner class.
classArgs
^^^^^^^^^
Optional. Key-value pairs. Default: empty.
Specifies the arguments of tuner algorithm. Please refer to `this file <../Tuner/BuiltinTuner.rst>`__ for the configurable arguments of each built-in tuner.
gpuIndices
^^^^^^^^^^
Optional. String. Default: empty.
Specifies the GPUs that can be used by the tuner process. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by comma ``,``. For example, ``1``\ , or ``0,1,3``. If the field is not set, no GPU will be visible to tuner (by setting ``CUDA_VISIBLE_DEVICES`` to be an empty string).
includeIntermediateResults
^^^^^^^^^^^^^^^^^^^^^^^^^^
Optional. Bool. Default: false.
If **includeIntermediateResults** is true, the last intermediate result of the trial that is early stopped by assessor is sent to tuner as final result.
assessor
^^^^^^^^
Specifies the assessor algorithm to run in the experiment. Similar to tuners, there are two ways to set the assessor. One is to use an assessor provided by the NNI SDK, in which case you need to set **builtinAssessorName** and **classArgs**. The other is to use your own assessor file, in which case **codeDir**\ , **classFileName**\ , **className** and **classArgs** are needed. *Users must choose exactly one way.*
By default, there is no assessor enabled.
builtinAssessorName
^^^^^^^^^^^^^^^^^^^
Required if using built-in assessors. String.
Specifies the name of built-in assessor, NNI sdk provides different assessors introduced `here <../Assessor/BuiltinAssessor.rst>`__.
codeDir
^^^^^^^
Required if using customized assessors. Path relative to the location of config file.
Specifies the directory of assessor code.
classFileName
^^^^^^^^^^^^^
Required if using customized assessors. File path relative to **codeDir**.
Specifies the name of assessor file.
className
^^^^^^^^^
Required if using customized assessors. String.
Specifies the name of assessor class.
classArgs
^^^^^^^^^
Optional. Key-value pairs. Default: empty.
Specifies the arguments of assessor algorithm.
advisor
^^^^^^^
Optional.
Specifies the advisor algorithm of the experiment. Similar to tuners and assessors, there are two ways to specify the advisor. One is to use an advisor provided by the NNI SDK, which requires setting **builtinAdvisorName** and **classArgs**. The other is to use your own advisor file, which requires setting **codeDir**\ , **classFileName**\ , **className** and **classArgs**.
When an advisor is enabled, the tuner and assessor settings are bypassed.
builtinAdvisorName
^^^^^^^^^^^^^^^^^^
Specifies the name of a built-in advisor. NNI sdk provides `BOHB <../Tuner/BohbAdvisor.rst>`__ and `Hyperband <../Tuner/HyperbandAdvisor.rst>`__.
codeDir
^^^^^^^
Required if using customized advisors. Path relative to the location of config file.
Specifies the directory of advisor code.
classFileName
^^^^^^^^^^^^^
Required if using customized advisors. File path relative to **codeDir**.
Specifies the name of advisor file.
className
^^^^^^^^^
Required if using customized advisors. String.
Specifies the name of advisor class.
classArgs
^^^^^^^^^
Optional. Key-value pairs. Default: empty.
Specifies the arguments of advisor.
gpuIndices
^^^^^^^^^^
Optional. String. Default: empty.
Specifies the GPUs that can be used. Single or multiple GPU indices can be specified. Multiple GPU indices are separated by comma ``,``. For example, ``1``\ , or ``0,1,3``. If the field is not set, no GPU will be visible to tuner (by setting ``CUDA_VISIBLE_DEVICES`` to be an empty string).
trial
^^^^^
Required. Key-value pairs.
In local and remote mode, the following keys are required.
* **command**\ : Required string. Specifies the command to run the trial process.
* **codeDir**\ : Required string. Specifies the directory of your own trial file. This directory will be automatically uploaded in remote mode.
* **gpuNum**\ : Optional integer. Specifies the number of GPUs used to run the trial process. Default value is 0.

In PAI mode, the following keys are required.

* **command**\ : Required string. Specifies the command to run the trial process.
* **codeDir**\ : Required string. Specifies the directory of the own trial file. Files in the directory will be uploaded in PAI mode.
* **gpuNum**\ : Required integer. Specifies the number of GPUs used to run the trial process. Default value is 0.
* **cpuNum**\ : Required integer. Specifies the number of CPUs to be used in the pai container.
* **memoryMB**\ : Required integer. Sets the memory size to be used in the pai container, in megabytes.
* **image**\ : Required string. Sets the image to be used in pai.
* **authFile**\ : Optional string. Used to provide a Docker registry which needs authentication for image pull in PAI. `Reference <https://github.com/microsoft/pai/blob/2ea69b45faa018662bc164ed7733f6fdbb4c42b3/docs/faq.rst#q-how-to-use-private-docker-registry-job-image-when-submitting-an-openpai-job>`__.
* **shmMB**\ : Optional integer. Shared memory size of the container.
* **portList**\ : List of key-value pairs with ``label``\ , ``beginAt``\ , ``portNumber``. See the `job tutorial of PAI <https://github.com/microsoft/pai/blob/master/docs/job_tutorial.rst>`__ for details.
.. cannot find `Reference <https://github.com/microsoft/pai/blob/2ea69b45faa018662bc164ed7733f6fdbb4c42b3/docs/faq.rst#q-how-to-use-private-docker-registry-job-image-when-submitting-an-openpai-job>`__ and `job tutorial of PAI <https://github.com/microsoft/pai/blob/master/docs/job_tutorial.rst>`__
In Kubeflow mode, the following keys are required.
* **codeDir**\ : The local directory where the code files are.
* **ps**\ : An optional configuration for kubeflow's tensorflow-operator, which includes

  * **replicas**\ : The replica number of the **ps** role.
  * **command**\ : The run script in the **ps** container.
  * **gpuNum**\ : The number of GPUs to be used in the **ps** container.
  * **cpuNum**\ : The number of CPUs to be used in the **ps** container.
  * **memoryMB**\ : The memory size of the container.
  * **image**\ : The image to be used in **ps**.

* **worker**\ : An optional configuration for kubeflow's tensorflow-operator.

  * **replicas**\ : The replica number of the **worker** role.
  * **command**\ : The run script in the **worker** container.
  * **gpuNum**\ : The number of GPUs to be used in the **worker** container.
  * **cpuNum**\ : The number of CPUs to be used in the **worker** container.
  * **memoryMB**\ : The memory size of the container.
  * **image**\ : The image to be used in **worker**.
localConfig
^^^^^^^^^^^
Optional in local mode. Key-value pairs.
Only applicable if **trainingServicePlatform** is set to ``local``\ , otherwise there should not be **localConfig** section in configuration file.
gpuIndices
^^^^^^^^^^
Optional. String. Default: none.
Used to specify designated GPU devices for NNI, if it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified. Multiple GPU indices should be separated with comma (\ ``,``\ ), such as ``1`` or ``0,1,3``. By default, all GPUs available will be used.
maxTrialNumPerGpu
^^^^^^^^^^^^^^^^^
Optional. Integer. Default: 1.
Used to specify the max concurrency trial number on a GPU device.
useActiveGpu
^^^^^^^^^^^^
Optional. Bool. Default: false.
Used to specify whether to use a GPU if there is another process on it. By default, NNI will use a GPU only if there is no other active process on it. If **useActiveGpu** is set to true, NNI will use the GPU regardless of other processes. This field is not applicable to NNI on Windows.
machineList
^^^^^^^^^^^
Required in remote mode. A list of key-value pairs with the following keys.
ip
^^
Required. IP address or host name that is accessible from the current machine.
The IP address or host name of remote machine.
port
^^^^
Optional. Integer. Valid port. Default: 22.
The ssh port to be used to connect machine.
username
^^^^^^^^
Required if authentication with username/password. String.
The account of remote machine.
passwd
^^^^^^
Required if authentication with username/password. String.
Specifies the password of the account.
sshKeyPath
^^^^^^^^^^
Required if authentication with ssh key. Path to private key file.
If users use ssh key to login remote machine, **sshKeyPath** should be a valid path to a ssh key file.
*Note: if users set passwd and sshKeyPath simultaneously, NNI will try passwd first.*
passphrase
^^^^^^^^^^
Optional. String.
Used to protect ssh key, which could be empty if users don't have passphrase.
gpuIndices
^^^^^^^^^^
Optional. String. Default: none.
Used to specify designated GPU devices for NNI, if it is set, only the specified GPU devices are used for NNI trial jobs. Single or multiple GPU indices can be specified. Multiple GPU indices should be separated with comma (\ ``,``\ ), such as ``1`` or ``0,1,3``. By default, all GPUs available will be used.
maxTrialNumPerGpu
^^^^^^^^^^^^^^^^^
Optional. Integer. Default: 1.
Used to specify the max concurrency trial number on a GPU device.
useActiveGpu
^^^^^^^^^^^^
Optional. Bool. Default: false.
Used to specify whether to use a GPU if there is another process on it. By default, NNI will use a GPU only if there is no other active process on it. If **useActiveGpu** is set to true, NNI will use the GPU regardless of other processes. This field is not applicable to NNI on Windows.
pythonPath
^^^^^^^^^^
Optional. String.
Users can configure the python path environment on remote machine by setting **pythonPath**.
remoteConfig
^^^^^^^^^^^^
Optional field in remote mode. Users could set per machine information in ``machineList`` field, and set global configuration for remote mode in this field.
reuse
^^^^^
Optional. Bool. default: ``false``. It's an experimental feature.
If it's true, NNI will reuse remote jobs to run as many trials as possible. This can save the time of creating new jobs. Users need to make sure each trial can run independently in the same job, for example, by avoiding loading checkpoints from previous trials.
kubeflowConfig
^^^^^^^^^^^^^^
operator
^^^^^^^^
Required. String. Has to be ``tf-operator`` or ``pytorch-operator``.
Specifies the Kubeflow operator to be used; NNI supports ``tf-operator`` in the current version.
storage
^^^^^^^
Optional. String. Default: ``nfs``.
Specifies the storage type of kubeflow, including ``nfs`` and ``azureStorage``.
nfs
^^^
Required if using nfs. Key-value pairs.
* **server** is the host of the nfs server.
* **path** is the mounted path of nfs.
keyVault
^^^^^^^^
Required if using azure storage. Key-value pairs.
Set **keyVault** to store the private key of your azure storage account. Refer to `the doc <https://docs.microsoft.com/en-us/azure/key-vault/key-vault-manage-with-cli2>`__.

* **vaultName** is the value of ``--vault-name`` used in the az command.
* **name** is the value of ``--name`` used in the az command.
azureStorage
^^^^^^^^^^^^
Required if using azure storage. Key-value pairs.
Set the azure storage account to store code files.

* **accountName** is the name of the azure storage account.
* **azureShare** is the share of the azure file storage.
uploadRetryCount
^^^^^^^^^^^^^^^^
Required if using azure storage. Integer between 1 and 99999.
If uploading files to azure storage fails, NNI will retry the upload; this field specifies the number of attempts to re-upload files.
paiConfig
^^^^^^^^^
userName
^^^^^^^^
Required. String.
The user name of your pai account.
password
^^^^^^^^
Required if using password authentication. String.
The password of the pai account.
token
^^^^^
Required if using token authentication. String.
Personal access token that can be retrieved from PAI portal.
host
^^^^
Required. String.
The hostname or IP address of PAI.
reuse
^^^^^
Optional. Bool. default: ``false``. It's an experimental feature.
If it's true, NNI will reuse OpenPAI jobs to run as many trials as possible. This can save the time of creating new jobs. Users need to make sure each trial can run independently in the same job, for example, by avoiding loading checkpoints from previous trials.
sharedStorage
^^^^^^^^^^^^^
storageType
^^^^^^^^^^^
Required. String.
The type of the storage, support ``NFS`` and ``AzureBlob``.
localMountPoint
^^^^^^^^^^^^^^^
Required. String.
The absolute or relative path that the storage has been or will be mounted in local. If the path does not exist, it will be created automatically. Recommended to use an absolute path. i.e. ``/tmp/nni-shared-storage``.
remoteMountPoint
^^^^^^^^^^^^^^^^
Required. String.
The absolute or relative path that the storage will be mounted in remote. If the path does not exist, it will be created automatically. Note that the directory must be empty if using AzureBlob. Recommended to use a relative path. i.e. ``./nni-shared-storage``.
localMounted
^^^^^^^^^^^^
Required. String.
One of ``usermount``, ``nnimount`` or ``nomount``. ``usermount`` means you have already mount this storage on localMountPoint. ``nnimount`` means nni will try to mount this storage on localMountPoint. ``nomount`` means storage will not mount in local machine, will support partial storages in the future.
nfsServer
^^^^^^^^^
Optional. String.
Required if using NFS storage. The NFS server host.
exportedDirectory
^^^^^^^^^^^^^^^^^
Optional. String.
Required if using NFS storage. The exported directory of NFS server.
storageAccountName
^^^^^^^^^^^^^^^^^^
Optional. String.
Required if using AzureBlob storage. The azure storage account name.
storageAccountKey
^^^^^^^^^^^^^^^^^
Optional. String.
Required if using AzureBlob storage. The azure storage account key.
containerName
^^^^^^^^^^^^^
Optional. String.
Required if using AzureBlob storage. The AzureBlob container name.
Examples
--------
Local mode
^^^^^^^^^^
If users want to run trial jobs on the local machine and use annotation to generate the search space, they can use the following config:
.. code-block:: yaml
authorName: test
experimentName: test_experiment
trialConcurrency: 3
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: local
#choice: true, false
useAnnotation: true
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: /nni/mnist
gpuNum: 0
You can add assessor configuration.
.. code-block:: yaml
authorName: test
experimentName: test_experiment
trialConcurrency: 3
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: local
searchSpacePath: /nni/search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
assessor:
#choice: Medianstop
builtinAssessorName: Medianstop
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: /nni/mnist
gpuNum: 0
Or you could specify your own tuner and assessor files as follows:
.. code-block:: yaml
authorName: test
experimentName: test_experiment
trialConcurrency: 3
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: local
searchSpacePath: /nni/search_space.json
#choice: true, false
useAnnotation: false
tuner:
codeDir: /nni/tuner
classFileName: mytuner.py
className: MyTuner
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
assessor:
codeDir: /nni/assessor
classFileName: myassessor.py
className: MyAssessor
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: /nni/mnist
gpuNum: 0
Remote mode
^^^^^^^^^^^
To run trial jobs on remote machines, users could specify the remote machine information in the following format:
.. code-block:: yaml
authorName: test
experimentName: test_experiment
trialConcurrency: 3
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: remote
searchSpacePath: /nni/search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: /nni/mnist
gpuNum: 0
#machineList can be empty if the platform is local
machineList:
- ip: 10.10.10.10
port: 22
username: test
passwd: test
- ip: 10.10.10.11
port: 22
username: test
passwd: test
- ip: 10.10.10.12
port: 22
username: test
sshKeyPath: /nni/sshkey
passphrase: qwert
# Below is an example of specifying python environment.
pythonPath: ${replace_to_python_environment_path_in_your_remote_machine}
PAI mode
^^^^^^^^
.. code-block:: yaml
authorName: test
experimentName: nni_test1
trialConcurrency: 1
maxExecDuration: 500h
maxTrialNum: 1
#choice: local, remote, pai, kubeflow
trainingServicePlatform: pai
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution, BatchTuner
#SMAC (SMAC should be installed through nnictl)
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 main.py
codeDir: .
gpuNum: 4
cpuNum: 2
memoryMB: 10000
#The docker image to run NNI job on pai
image: msranni/nni:latest
paiConfig:
#The username to login pai
userName: test
#The password to login pai
passWord: test
#The host of restful server of pai
host: 10.10.10.10
Kubeflow mode
^^^^^^^^^^^^^
Kubeflow with NFS storage:
.. code-block:: yaml
authorName: default
experimentName: example_mni
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 1
#choice: local, remote, pai, kubeflow
trainingServicePlatform: kubeflow
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
codeDir: .
worker:
replicas: 1
command: python3 mnist.py
gpuNum: 0
cpuNum: 1
memoryMB: 8192
image: msranni/nni:latest
kubeflowConfig:
operator: tf-operator
nfs:
server: 10.10.10.10
path: /var/nfs/general
Kubeflow with Azure storage
^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: yaml
authorName: default
experimentName: example_mni
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 1
#choice: local, remote, pai, kubeflow
trainingServicePlatform: kubeflow
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
#nniManagerIp: 10.10.10.10
tuner:
#choice: TPE, Random, Anneal, Evolution
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
assessor:
builtinAssessorName: Medianstop
classArgs:
optimize_mode: maximize
trial:
codeDir: .
worker:
replicas: 1
command: python3 mnist.py
gpuNum: 0
cpuNum: 1
memoryMB: 4096
image: msranni/nni:latest
kubeflowConfig:
operator: tf-operator
keyVault:
vaultName: Contoso-Vault
name: AzureStorageAccountKey
azureStorage:
accountName: storage
azureShare: share01
QuickStart
==========
Installation
------------
Currently, NNI supports running on Linux, macOS and Windows. Ubuntu 16.04 or higher, macOS 10.14.1, and Windows 10.1809 are tested and supported. Simply run the following ``pip install`` in an environment that has ``python >= 3.6``.
Linux and macOS
^^^^^^^^^^^^^^^
.. code-block:: bash
python3 -m pip install --upgrade nni
Windows
^^^^^^^
.. code-block:: bash
python -m pip install --upgrade nni
.. Note:: For Linux and macOS, ``--user`` can be added if you want to install NNI in your home directory, which does not require any special privileges.
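For example, a per-user installation on Linux or macOS would be:

.. code-block:: bash

    python3 -m pip install --user --upgrade nni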
.. Note:: If there is an error like ``Segmentation fault``, please refer to the :doc:`FAQ <FAQ>`.
.. Note:: For the system requirements of NNI, please refer to :doc:`Install NNI on Linux & Mac <InstallationLinux>` or :doc:`Windows <InstallationWin>`. If you want to use docker, refer to :doc:`HowToUseDocker <HowToUseDocker>`.
"Hello World" example on MNIST
------------------------------
NNI is a toolkit to help users run automated machine learning experiments. It can automatically do the cyclic process of getting hyperparameters, running trials, testing results, and tuning hyperparameters. Here, we'll show how to use NNI to help you find the optimal hyperparameters on the MNIST dataset.
Here is an example script to train a CNN on the MNIST dataset **without NNI**:
.. code-block:: python
def main(args):
# load data
train_loader = torch.utils.data.DataLoader(datasets.MNIST(...), batch_size=args['batch_size'], shuffle=True)
test_loader = torch.utils.data.DataLoader(datasets.MNIST(...), batch_size=1000, shuffle=True)
# build model
model = Net(hidden_size=args['hidden_size'])
optimizer = optim.SGD(model.parameters(), lr=args['lr'], momentum=args['momentum'])
# train
for epoch in range(10):
train(args, model, device, train_loader, optimizer, epoch)
test_acc = test(args, model, device, test_loader)
print(test_acc)
print('final accuracy:', test_acc)
if __name__ == '__main__':
params = {
'batch_size': 32,
'hidden_size': 128,
'lr': 0.001,
'momentum': 0.5
}
main(params)
The above code can only try one set of parameters at a time. If you want to tune the learning rate, you need to manually modify the hyperparameter and start the trial again and again.
NNI was born to help users tune their jobs; its working process is presented below:
.. code-block:: text
input: search space, trial code, config file
output: one optimal hyperparameter configuration
1: For t = 0, 1, 2, ..., maxTrialNum,
2: hyperparameter = choose a set of parameters from the search space
3: final result = run_trial_and_evaluate(hyperparameter)
4: report final result to NNI
5: If the upper time limit is reached,
6: Stop the experiment
7: return hyperparameter value with best final result
.. note::
If you want to use NNI to automatically train your model and find the optimal hyper-parameters, there are two approaches:
1. Write a config file and start the experiment from the command line.
2. Configure and launch the experiment directly from a Python file
In this part, we will focus on the first approach. For the second approach, please refer to `this tutorial <HowToLaunchFromPython.rst>`__\ .
Step 1: Modify the ``Trial`` Code
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Modify your ``Trial`` file to get the hyperparameter set from NNI and report the final results to NNI.
.. code-block:: diff
+ import nni
def main(args):
# load data
train_loader = torch.utils.data.DataLoader(datasets.MNIST(...), batch_size=args['batch_size'], shuffle=True)
test_loader = torch.utils.data.DataLoader(datasets.MNIST(...), batch_size=1000, shuffle=True)
# build model
model = Net(hidden_size=args['hidden_size'])
optimizer = optim.SGD(model.parameters(), lr=args['lr'], momentum=args['momentum'])
# train
for epoch in range(10):
train(args, model, device, train_loader, optimizer, epoch)
test_acc = test(args, model, device, test_loader)
- print(test_acc)
+ nni.report_intermediate_result(test_acc)
- print('final accuracy:', test_acc)
+ nni.report_final_result(test_acc)
if __name__ == '__main__':
- params = {'batch_size': 32, 'hidden_size': 128, 'lr': 0.001, 'momentum': 0.5}
+ params = nni.get_next_parameter()
main(params)
*Example:* :githublink:`mnist.py <examples/trials/mnist-pytorch/mnist.py>`
Step 2: Define the Search Space
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Define a ``Search Space`` in a YAML file, including the ``name`` and the ``distribution`` (discrete-valued or continuous-valued) of all the hyperparameters you want to search.
.. code-block:: yaml
searchSpace:
batch_size:
_type: choice
_value: [16, 32, 64, 128]
hidden_size:
_type: choice
_value: [128, 256, 512, 1024]
lr:
_type: choice
_value: [0.0001, 0.001, 0.01, 0.1]
momentum:
_type: uniform
_value: [0, 1]
*Example:* :githublink:`config_detailed.yml <examples/trials/mnist-pytorch/config_detailed.yml>`
You can also write your search space in a JSON file and specify the file path in the configuration. For detailed tutorial on how to write the search space, please see `here <SearchSpaceSpec.rst>`__.
Step 3: Config the Experiment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In addition to the search space defined in `Step 2 <step-2-define-the-search-space>`__, you need to configure the experiment in the YAML file. It specifies the key information of the experiment, such as the trial files, the tuning algorithm, the max trial number, and the max duration.
.. code-block:: yaml
experimentName: MNIST # An optional name to distinguish the experiments
trialCommand: python3 mnist.py # NOTE: change "python3" to "python" if you are using Windows
trialConcurrency: 2 # Run 2 trials concurrently
maxTrialNumber: 10 # Generate at most 10 trials
maxExperimentDuration: 1h # Stop generating trials after 1 hour
tuner: # Configure the tuning algorithm
name: TPE
classArgs: # Algorithm specific arguments
optimize_mode: maximize
trainingService: # Configure the training platform
platform: local
Experiment config reference could be found `here <../reference/experiment_config.rst>`__.
.. _nniignore:
.. Note:: If you are planning to use remote machines or clusters as your :doc:`training service <../TrainingService/Overview>`, to avoid too much pressure on network, NNI limits the number of files to 2000 and total size to 300MB. If your codeDir contains too many files, you can choose which files and subfolders should be excluded by adding a ``.nniignore`` file that works like a ``.gitignore`` file. For more details on how to write this file, see the `git documentation <https://git-scm.com/docs/gitignore#_pattern_format>`__.
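For instance, a minimal ``.nniignore`` that excludes a (hypothetical) data directory, a log directory, and checkpoint files could look like:

.. code-block:: text

    data/
    logs/
    *.ckpt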
*Example:* :githublink:`config_detailed.yml <examples/trials/mnist-pytorch/config_detailed.yml>` and :githublink:`.nniignore <examples/trials/mnist-pytorch/.nniignore>`
All the code above is already prepared and stored in :githublink:`examples/trials/mnist-pytorch <examples/trials/mnist-pytorch>`.
Step 4: Launch the Experiment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Linux and macOS
***************
Run the **config_detailed.yml** file from your command line to start the experiment.
.. code-block:: bash
nnictl create --config nni/examples/trials/mnist-pytorch/config_detailed.yml
Windows
*******
Change ``python3`` to ``python`` in the ``trialCommand`` field of the **config_detailed.yml** file, and run the **config_detailed.yml** file from your command line to start the experiment.
.. code-block:: bash
nnictl create --config nni\examples\trials\mnist-pytorch\config_detailed.yml
.. Note:: ``nnictl`` is a command line tool that can be used to control experiments, such as start/stop/resume an experiment, start/stop NNIBoard, etc. Click :doc:`here <../reference/nnictl>` for more usage of ``nnictl``.
Wait for the message ``INFO: Successfully started experiment!`` in the command line. This message indicates that your experiment has been successfully started. This is the output we expect to see:
.. code-block:: text
INFO: Starting restful server...
INFO: Successfully started Restful server!
INFO: Setting local config...
INFO: Successfully set local config!
INFO: Starting experiment...
INFO: Successfully started experiment!
-----------------------------------------------------------------------
The experiment id is egchD4qy
The Web UI urls are: [Your IP]:8080
-----------------------------------------------------------------------
You can use these commands to get more information about the experiment
-----------------------------------------------------------------------
commands description
1. nnictl experiment show show the information of experiments
2. nnictl trial ls list all of trial jobs
3. nnictl top monitor the status of running experiments
4. nnictl log stderr show stderr log content
5. nnictl log stdout show stdout log content
6. nnictl stop stop an experiment
7. nnictl trial kill kill a trial job by id
8. nnictl --help get help information about nnictl
-----------------------------------------------------------------------
If you have prepared the ``trial``\ , ``search space``\ , and ``config`` according to the above steps and successfully created an NNI job, NNI will automatically tune the optimal hyper-parameters and run a different hyper-parameter set for each trial according to the defined search space. You can clearly see its progress through the WebUI.
Step 5: View the Experiment
^^^^^^^^^^^^^^^^^^^^^^^^^^^
After starting the experiment successfully, you can find a message in the command-line interface that tells you the ``Web UI url`` like this:
.. code-block:: text
The Web UI urls are: [Your IP]:8080
Open the ``Web UI url`` (here it is ``[Your IP]:8080``\ ) in your browser to view detailed information about the experiment and all the submitted trial jobs, as shown below. If you cannot open the WebUI link in your terminal, please refer to the `FAQ <FAQ.rst#could-not-open-webui-link>`__.
View Overview Page
******************
Information about this experiment will be shown in the WebUI, including the experiment profile and the search space. NNI also supports downloading this information and the parameters through the **Experiment summary** button.
.. image:: ../../img/webui-img/full-oview.png
:target: ../../img/webui-img/full-oview.png
:alt: overview
View Trials Detail Page
***********************
You can see the best trial metrics and the hyper-parameter graph on this page. The table shows more columns when you click the ``Add/Remove columns`` button.
.. image:: ../../img/webui-img/full-detail.png
:target: ../../img/webui-img/full-detail.png
:alt: detail
View Experiments Management Page
********************************
On the ``All experiments`` page, you can see all the experiments on your machine.
.. image:: ../../img/webui-img/managerExperimentList/expList.png
:target: ../../img/webui-img/managerExperimentList/expList.png
:alt: Experiments list
For more detailed usage of WebUI, please refer to `this doc <./WebUI.rst>`__.
Related Topic
-------------
* `How to debug? <HowToDebug.rst>`__
* `How to write a trial? <../TrialExample/Trials.rst>`__
* `How to try different Tuners? <../Tuner/BuiltinTuner.rst>`__
* `How to try different Assessors? <../Assessor/BuiltinAssessor.rst>`__
* `How to run an experiment on the different training platforms? <../training_services.rst>`__
* `How to use Annotation? <AnnotationSpec.rst>`__
* `How to use the command line tool nnictl? <Nnictl.rst>`__
* `How to launch Tensorboard on WebUI? <Tensorboard.rst>`__
Python API Reference of Auto Tune
=================================
.. contents::
Trial
-----
.. autofunction:: nni.get_next_parameter
.. autofunction:: nni.get_current_parameter
.. autofunction:: nni.report_intermediate_result
.. autofunction:: nni.report_final_result
.. autofunction:: nni.get_experiment_id
.. autofunction:: nni.get_trial_id
.. autofunction:: nni.get_sequence_id
Tuner
-----
.. autoclass:: nni.tuner.Tuner
:members:
.. autoclass:: nni.algorithms.hpo.tpe_tuner.TpeTuner
:members:
.. autoclass:: nni.algorithms.hpo.random_tuner.RandomTuner
:members:
.. autoclass:: nni.algorithms.hpo.hyperopt_tuner.HyperoptTuner
:members:
.. autoclass:: nni.algorithms.hpo.evolution_tuner.EvolutionTuner
:members:
.. autoclass:: nni.algorithms.hpo.smac_tuner.SMACTuner
:members:
.. autoclass:: nni.algorithms.hpo.gridsearch_tuner.GridSearchTuner
:members:
.. autoclass:: nni.algorithms.hpo.networkmorphism_tuner.NetworkMorphismTuner
:members:
.. autoclass:: nni.algorithms.hpo.metis_tuner.MetisTuner
:members:
.. autoclass:: nni.algorithms.hpo.ppo_tuner.PPOTuner
:members:
.. autoclass:: nni.algorithms.hpo.batch_tuner.BatchTuner
:members:
.. autoclass:: nni.algorithms.hpo.gp_tuner.GPTuner
:members:
Assessor
--------
.. autoclass:: nni.assessor.Assessor
:members:
.. autoclass:: nni.assessor.AssessResult
:members:
.. autoclass:: nni.algorithms.hpo.curvefitting_assessor.CurvefittingAssessor
:members:
.. autoclass:: nni.algorithms.hpo.medianstop_assessor.MedianstopAssessor
:members:
Advisor
-------
.. autoclass:: nni.runtime.msg_dispatcher_base.MsgDispatcherBase
:members:
.. autoclass:: nni.algorithms.hpo.hyperband_advisor.Hyperband
:members:
.. autoclass:: nni.algorithms.hpo.bohb_advisor.BOHB
:members:
Utilities
---------
.. autofunction:: nni.utils.merge_parameter
.. autofunction:: nni.trace
.. autofunction:: nni.dump
.. autofunction:: nni.load
Builtin-Assessors
=================
In order to save on computing resources, NNI supports an early stopping policy and has an interface called **Assessor** to do this job.
Assessor receives the intermediate results from a trial and decides, using a specific algorithm, whether the trial should be killed. Once a trial meets the early stopping conditions (which means the Assessor is pessimistic about its final result), the Assessor will kill the trial and the status of the trial will be ``EARLY_STOPPED``.
Here is an experimental result on MNIST after using the 'Curvefitting' Assessor in 'maximize' mode. You can see that the Assessor successfully **early stopped** many trials with bad hyperparameters in advance. If you use an Assessor, you may get better hyperparameters with the same computing resources.
Example config file: :githublink:`config_assessor.yml <examples/trials/mnist-pytorch/config_assessor.yml>`
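As a reference, enabling a built-in assessor is a small addition to the experiment config; a minimal sketch in the legacy config format used in the examples above:

.. code-block:: yaml

    assessor:
      builtinAssessorName: Medianstop
      classArgs:
        optimize_mode: maximize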
.. image:: ../img/Assessor.png
.. toctree::
:maxdepth: 1
Overview<./Assessor/BuiltinAssessor>
Medianstop<./Assessor/MedianstopAssessor>
Curvefitting<./Assessor/CurvefittingAssessor>
Builtin-Tuners
==============
NNI provides an easy way to adopt hyper-parameter tuning algorithms, which we call **Tuners**.
A Tuner receives metrics from a `Trial` to evaluate the performance of a specific parameter/architecture configuration, and sends the next hyper-parameter or architecture configuration to a new Trial.
The following table briefly describes the built-in tuners provided by NNI. Click the **Tuner's name** to get the Tuner's installation requirements, suggested scenario, and an example configuration. A link for a detailed description of each algorithm is located at the end of the suggested scenario for each tuner. Here is an `article <../CommunitySharings/HpoComparison.rst>`__ comparing different Tuners on several problems.
.. list-table::
:header-rows: 1
:widths: auto
* - Tuner
- Brief Introduction of Algorithm
* - `TPE <./TpeTuner.rst>`__
- The Tree-structured Parzen Estimator (TPE) is a sequential model-based optimization (SMBO) approach. SMBO methods sequentially construct models to approximate the performance of hyperparameters based on historical measurements, and then subsequently choose new hyperparameters to test based on this model. `Reference Paper <https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf>`__
* - `Random Search <./RandomTuner.rst>`__
- *Random Search for Hyper-Parameter Optimization* shows that Random Search might be surprisingly simple and effective. We suggest using Random Search as a baseline when there is no knowledge about the prior distribution of hyper-parameters. `Reference Paper <http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf>`__
* - `Anneal <./AnnealTuner.rst>`__
- This simple annealing algorithm begins by sampling from the prior, but tends over time to sample from points closer and closer to the best ones observed. This algorithm is a simple variation on the random search that leverages smoothness in the response surface. The annealing rate is not adaptive.
* - `Naïve Evolution <./EvolutionTuner.rst>`__
- Naïve Evolution comes from Large-Scale Evolution of Image Classifiers. It randomly initializes a population based on the search space. For each generation, it chooses the better ones and does some mutation (e.g., changing a hyperparameter, adding/removing one layer) on them to get the next generation. Naïve Evolution requires many trials to work, but it is very simple and easy to extend with new features. `Reference paper <https://arxiv.org/pdf/1703.01041.pdf>`__
* - `SMAC <./SmacTuner.rst>`__
- SMAC is based on Sequential Model-Based Optimization (SMBO). It adapts the most prominent previously used model class (Gaussian stochastic process models) and introduces the model class of random forests to SMBO, in order to handle categorical parameters. The SMAC supported by NNI is a wrapper on the SMAC3 GitHub repo.
Note that SMAC needs to be installed with the ``pip install nni[SMAC]`` command. `Reference Paper <https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf>`__, `GitHub Repo <https://github.com/automl/SMAC3>`__
* - `Batch tuner <./BatchTuner.rst>`__
- Batch tuner allows users to simply provide several configurations (i.e., choices of hyper-parameters) for their trial code. After finishing all the configurations, the experiment is done. Batch tuner only supports the ``choice`` type in the search space spec.
* - `Grid Search <./GridsearchTuner.rst>`__
- Grid Search performs an exhaustive search over the search space.
* - `Hyperband <./HyperbandAdvisor.rst>`__
- Hyperband tries to use limited resources to explore as many configurations as possible and returns the most promising ones as the final result. The basic idea is to generate many configurations and run each of them with a small budget. The least-promising half of the configurations is thrown out, and the remaining ones are trained further along with a selection of new configurations. The size of these populations is sensitive to resource constraints (e.g. allotted search time). `Reference Paper <https://arxiv.org/pdf/1603.06560.pdf>`__
* - `Metis Tuner <./MetisTuner.rst>`__
- Metis offers the following benefits when it comes to tuning parameters: While most tools only predict the optimal configuration, Metis gives you two outputs: (a) current prediction of optimal configuration, and (b) suggestion for the next trial. No more guesswork. While most tools assume training datasets do not have noisy data, Metis actually tells you if you need to re-sample a particular hyper-parameter. `Reference Paper <https://www.microsoft.com/en-us/research/publication/metis-robustly-tuning-tail-latencies-cloud-systems/>`__
* - `BOHB <./BohbAdvisor.rst>`__
- BOHB is a follow-up work to Hyperband. It targets the weakness of Hyperband that new configurations are generated randomly without leveraging finished trials. For the name BOHB, HB means Hyperband, BO means Bayesian Optimization. BOHB leverages finished trials by building multiple TPE models, a proportion of new configurations are generated through these models. `Reference Paper <https://arxiv.org/abs/1807.01774>`__
* - `GP Tuner <./GPTuner.rst>`__
- Gaussian Process Tuner is a sequential model-based optimization (SMBO) approach with Gaussian Process as the surrogate. `Reference Paper <https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf>`__, `Github Repo <https://github.com/fmfn/BayesianOptimization>`__
* - `PBT Tuner <./PBTTuner.rst>`__
- PBT Tuner is a simple asynchronous optimization algorithm which effectively utilizes a fixed computational budget to jointly optimize a population of models and their hyperparameters to maximize performance. `Reference Paper <https://arxiv.org/abs/1711.09846v1>`__
* - `DNGO Tuner <./DngoTuner.rst>`__
- DNGO uses neural networks as an alternative to GPs to model distributions over functions in Bayesian optimization.
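Whichever tuner you choose, it is selected by name in the experiment configuration. A minimal sketch in the v2 config format used in the QuickStart above:

.. code-block:: yaml

    tuner:
      name: TPE
      classArgs:
        optimize_mode: maximize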
.. toctree::
:maxdepth: 1
TPE <Tuner/TpeTuner>
Random Search <Tuner/RandomTuner>
Anneal <Tuner/AnnealTuner>
Naive Evolution <Tuner/EvolutionTuner>
SMAC <Tuner/SmacTuner>
Metis Tuner <Tuner/MetisTuner>
Batch Tuner <Tuner/BatchTuner>
Grid Search <Tuner/GridsearchTuner>
GP Tuner <Tuner/GPTuner>
Network Morphism <Tuner/NetworkmorphismTuner>
Hyperband <Tuner/HyperbandAdvisor>
BOHB <Tuner/BohbAdvisor>
PBT Tuner <Tuner/PBTTuner>
DNGO Tuner <Tuner/DngoTuner>
Advanced Usage
==============
.. toctree::
:maxdepth: 2
Customize Basic Pruner <../tutorials/pruning_customize>
Customize Quantizer <../tutorials/quantization_customize>
Customize Scheduled Pruning Process <pruning_scheduler>
Utilities <compression_utils>
Compression Config Specification
================================
Each sub-config in the config list is a dict, and the scope of each setting (key) is only internal to each sub-config.
If multiple sub-configs are configured for the same layer, the later ones will overwrite the previous ones.
Shared Keys in both Pruning & Quantization Config
-------------------------------------------------
op_types
^^^^^^^^
The type of the layers targeted by this sub-config.
op_names
^^^^^^^^
The name of the layers targeted by this sub-config.
If ``op_types`` is set in this sub-config, the selected layers should satisfy both type and name.
exclude
^^^^^^^
The ``exclude`` and ``sparsity`` keywords are mutually exclusive and cannot exist in the same sub-config.
If ``exclude`` is set in a sub-config, the layers selected by this sub-config will not be compressed.
The Keys in Pruning Config
--------------------------
op_partial_names
^^^^^^^^^^^^^^^^
This key will be shared with the `Quantization Config` in the future.
This key is for pruning layers whose names contain a given sub-string. NNI will look at all the layer names in the model,
find the names that contain one of the ``op_partial_names``, and append them to ``op_names``.
In the ``total_sparsity`` example, there are 1200 parameters that need to be masked, and without a per-layer limit most of them could be masked in a single layer.
To avoid this situation, ``max_sparsity_per_layer`` can be set to 0.9, which means at most 450 parameters can be masked in ``layer_1``
and at most 900 parameters can be masked in ``layer_2``.
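As an illustration, a pruning config list combining the keys described above might look like the sketch below; the layer names, types, and sparsity values are hypothetical:

.. code-block:: python

    config_list = [{
        'total_sparsity': 0.5,
        'max_sparsity_per_layer': 0.9,
        'op_types': ['Conv2d'],
        'op_partial_names': ['block1.']   # matches e.g. 'block1.conv1', 'block1.conv2'
    }, {
        'exclude': True,
        'op_names': ['fc']                # leave the (hypothetical) classifier layer uncompressed
    }]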
The Keys in Quantization Config List
------------------------------------
quant_types
^^^^^^^^^^^
Currently, NNI supports three kinds of quantization types: 'weight', 'input', and 'output'.
It can be set as a ``str`` or a ``List[str]``.
Note that 'weight' and 'input' are always quantized together, e.g., ``['input', 'weight']``.
quant_bits
^^^^^^^^^^
The bit length of quantization. When the value is a dict, the key is a quantization type set in ``quant_types`` and the value is the bit length,
e.g. ``{'weight': 8}``; when the value is an int, all quantization types share the same bit length.
quant_start_step
^^^^^^^^^^^^^^^^
A key specific to the ``QAT Quantizer``. It disables quantization until the model has been run for a certain number of steps,
which allows the network to enter a more stable state where the output quantization ranges do not exclude a significant fraction
of values. The default value is 0.
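Putting these keys together, a quantization config list for the QAT quantizer might be sketched as follows; the layer types, bit widths, and step count are illustrative:

.. code-block:: python

    config_list = [{
        'quant_types': ['input', 'weight'],
        'quant_bits': {'input': 8, 'weight': 8},
        'quant_start_step': 1000,          # only used by the QAT Quantizer
        'op_types': ['Conv2d', 'Linear']
    }]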
Analysis Utils for Model Compression
====================================
.. contents::
We provide several easy-to-use tools for users to analyze their model during model compression.
Sensitivity Analysis
--------------------