Commit 5bf2cb19 authored by Bill Wu, committed by GitHub

HPO Benchmark Fixes and New Features (#3925)

parent 7fc5af07
HPO Benchmarks
==============
.. toctree::
   :hidden:

   HPO Benchmark Example Statistics <hpo_benchmark_stats>

We provide a benchmarking tool to compare the performance of tuners provided by NNI (and users' custom tuners) on different
types of tasks. This tool uses the `automlbenchmark repository <https://github.com/openml/automlbenchmark>`_ to run different *benchmarks* on the NNI *tuners*.
The tool is located in ``examples/trials/benchmarking/automlbenchmark``. This document provides a brief introduction to the tool, its usage, and currently available benchmarks.
Overview and Terminologies
^^^^^^^^^^^^^^^^^^^^^^^^^^
Ideally, an **HPO Benchmark** provides a tuner with a search space, calls the tuner repeatedly, and evaluates how the tuner probes
the search space and approaches good solutions. In addition, inside the benchmark, an evaluator should be associated with
each search space to score points in that search space and give feedback to the tuner. For instance,
the search space could be the space of hyperparameters for a neural network. Then the evaluator should contain training data,
test data, and a criterion. To evaluate a point in the search space, the evaluator trains the network on the training data
and reports the score of the model on the test data as the score for the point.
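
The propose-evaluate-feedback loop described above can be sketched in a few lines. This is only an illustration: ``RandomSearchTuner``, ``evaluate``, and ``run_benchmark`` are hypothetical names, the tuner is a plain random search rather than an NNI tuner, and the evaluator is a toy function instead of a model trained on real data.

```python
import random


class RandomSearchTuner:
    """Hypothetical stand-in for a tuner: samples uniformly from the search space."""

    def __init__(self, search_space, seed=0):
        self.search_space = search_space
        self.rng = random.Random(seed)
        self.history = []  # (point, score) pairs received as feedback

    def propose(self):
        # Pick one candidate value per hyperparameter.
        return {name: self.rng.choice(values)
                for name, values in self.search_space.items()}

    def receive(self, point, score):
        # A real tuner would use this feedback to guide later proposals.
        self.history.append((point, score))


def evaluate(point):
    """Toy evaluator (higher is better); the optimum is x=3, y=-1 with score 0."""
    return -((point["x"] - 3) ** 2 + (point["y"] + 1) ** 2)


def run_benchmark(tuner, evaluator, n_trials):
    """Call the tuner repeatedly and track the best point seen so far."""
    best_point, best_score = None, float("-inf")
    for _ in range(n_trials):
        point = tuner.propose()
        score = evaluator(point)
        tuner.receive(point, score)
        if score > best_score:
            best_point, best_score = point, score
    return best_point, best_score


search_space = {"x": list(range(-5, 6)), "y": list(range(-5, 6))}
tuner = RandomSearchTuner(search_space)
best_point, best_score = run_benchmark(tuner, evaluate, n_trials=50)
print(best_point, best_score)
```

In a real HPO benchmark the evaluator would train a model and report a test or validation score, but the control flow stays the same.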
However, a **benchmark** provided by the automlbenchmark repository only provides part of the functionality of the evaluator.
More concretely, it assumes that it is evaluating a **framework**. Unlike a tuner, a **framework**, given training data,
can directly solve a **task** and predict on the test set. The **benchmark** from the automlbenchmark repository directly provides
train and test datasets to a **framework**, evaluates the prediction on the test set, and reports this score as the final score.
Therefore, to implement an **HPO Benchmark** using automlbenchmark, we pair up a tuner with a search space to form a **framework**,
and handle the repeated trial-evaluate-feedback loop in the **framework** abstraction. In other words, each **HPO Benchmark**
contains two main components: a **benchmark** from the automlbenchmark library, and an **architecture** which defines the search
space and the evaluator. To further clarify, we define the terms used in this document below.
* **tuner**\ : a `tuner or advisor provided by NNI <https://nni.readthedocs.io/en/stable/builtin_tuner.html>`_, or a custom tuner provided by the user.
* **task**\ : an abstraction used by automlbenchmark. A task can be thought of as a tuple (dataset, metric). It provides train and test datasets to the frameworks. Then, based on the returned predictions on the test set, the task evaluates the metric (e.g., mse for regression, f1 for classification) and reports the score.
* **benchmark**\ : an abstraction used by automlbenchmark. A benchmark is a set of tasks, along with other external constraints such as time limits.
* **framework**\ : an abstraction used by automlbenchmark. Given a task, a framework solves the proposed regression or classification problem using train data and produces predictions on the test set. In our implementation, each framework is an architecture, which defines a search space. To evaluate a task given by the benchmark on a specific tuner, we let the tuner continuously tune the hyperparameters (by giving it cross-validation score on the train data as feedback) until the time or trial limit is reached. Then, the architecture is retrained on the entire train set using the best set of hyperparameters.
* **architecture**\ : an architecture is a specific method for solving the tasks, along with a set of hyperparameters to optimize (i.e., the search space). See ``./nni/extensions/NNI/architectures`` for examples.
Supported HPO Benchmarks
^^^^^^^^^^^^^^^^^^^^^^^^
From the previous discussion, we can see that to define an **HPO Benchmark**, we need to specify a **benchmark** and an **architecture**.
Currently, the only architectures we support are random forest and MLP. We use the
`scikit-learn implementation <https://scikit-learn.org/stable/modules/classes.html#>`_. Typically, there are a number of
hyperparameters that may directly affect the performances of random forest and MLP models. We design the search
spaces to be the following.
Search Space for Random Forest:

.. code-block:: json

   {
       "n_estimators": {"_type":"randint", "_value": [4, 2048]},
       "max_depth": {"_type":"choice", "_value": [4, 8, 16, 32, 64, 128, 256, 0]},
       "min_samples_leaf": {"_type":"randint", "_value": [1, 8]},
       "min_samples_split": {"_type":"randint", "_value": [2, 16]},
       "max_leaf_nodes": {"_type":"randint", "_value": [0, 4096]}
   }

Search Space for MLP:

.. code-block:: json

   {
       "hidden_layer_sizes": {"_type":"choice", "_value": [[16], [64], [128], [256], [16, 16], [64, 64], [128, 128], [256, 256], [16, 16, 16], [64, 64, 64], [128, 128, 128], [256, 256, 256], [256, 128, 64, 16], [128, 64, 16], [64, 16], [16, 64, 128, 256], [16, 64, 128], [16, 64]]},
       "learning_rate_init": {"_type":"choice", "_value": [0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005, 0.00001]},
       "alpha": {"_type":"choice", "_value": [0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001]},
       "momentum": {"_type":"uniform","_value":[0, 1]},
       "beta_1": {"_type":"uniform","_value":[0, 1]},
       "tol": {"_type":"choice", "_value": [0.001, 0.0005, 0.0001, 0.00005, 0.00001]},
       "max_iter": {"_type":"randint", "_value": [2, 256]}
   }

In addition, we write the search space in different ways (e.g., using "choice" or "randint" or "loguniform").
The architecture implementation and search space definition can be found in ``./nni/extensions/NNI/architectures/``.
You may replace the search space definition in these files to experiment with different search spaces.
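
To make the search space format more concrete, here is a minimal sketch of how one point could be drawn from such a definition. The excerpt mixes a few entries from the two search spaces above, ``sample_point`` is a hypothetical helper rather than part of NNI, and the sketch assumes that ``randint`` excludes its upper bound, which is our reading of the NNI search space specification.

```python
import random

# Excerpt combining a few entries from the two search spaces shown above.
SEARCH_SPACE = {
    "n_estimators": {"_type": "randint", "_value": [4, 2048]},
    "max_depth": {"_type": "choice", "_value": [4, 8, 16, 32, 64, 128, 256, 0]},
    "momentum": {"_type": "uniform", "_value": [0, 1]},
}


def sample_point(search_space, rng):
    """Draw one hyperparameter configuration (hypothetical helper, not an NNI API)."""
    point = {}
    for name, spec in search_space.items():
        kind, value = spec["_type"], spec["_value"]
        if kind == "choice":
            point[name] = rng.choice(value)
        elif kind == "randint":
            low, high = value
            # Assumption: lower bound inclusive, upper bound exclusive.
            point[name] = rng.randrange(low, high)
        elif kind == "uniform":
            low, high = value
            point[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"unsupported _type: {kind}")
    return point


rng = random.Random(0)
print(sample_point(SEARCH_SPACE, rng))
```

A real tuner replaces the uniform sampling with its own strategy, but it consumes the same dictionary structure.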
As for the benchmarks, in addition to the built-in benchmarks provided by automlbenchmark
(defined in ``/examples/trials/benchmarking/automlbenchmark/automlbenchmark/resources/benchmarks/``), we design several
additional benchmarks, defined in ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks``.
One example of larger benchmarks is "nnismall", which consists of 8 regression tasks, 8 binary classification tasks, and
8 multi-class classification tasks. We also provide three separate 8-task benchmarks "nnismall-regression", "nnismall-binary", and "nnismall-multiclass"
corresponding to the three types of tasks in nnismall. These tasks are suitable to solve with random forest and MLP.
The following table summarizes the benchmarks we provide. For ``nnismall``, please check ``/examples/trials/benchmarking/automlbenchmark/automlbenchmark/resources/benchmarks/``
for a more detailed description for each task. Also, since all tasks are from the OpenML platform, you can find the descriptions
of all datasets at `this webpage <https://www.openml.org/search?type=data>`_.

.. list-table::
   :header-rows: 1
   :widths: 1 2 2 2

   * - Benchmark name
     - Description
     - Task List
     - Location
   * - nnivalid
     - A three-task benchmark to validate benchmark installation.
     - ``kc2, iris, cholesterol``
     - ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/``
   * - nnismall-regression
     - An eight-task benchmark consisting of **regression** tasks only.
     - ``cholesterol, liver-disorders, kin8nm, cpu_small, titanic_2, boston, stock, space_ga``
     - ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/``
   * - nnismall-binary
     - An eight-task benchmark consisting of **binary classification** tasks only.
     - ``Australian, blood-transfusion, christine, credit-g, kc1, kr-vs-kp, phoneme, sylvine``
     - ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/``
   * - nnismall-multiclass
     - An eight-task benchmark consisting of **multi-class classification** tasks only.
     - ``car, cnae-9, dilbert, fabert, jasmine, mfeat-factors, segment, vehicle``
     - ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/``
   * - nnismall
     - A 24-task benchmark that is the superset of nnismall-regression, nnismall-binary, and nnismall-multiclass.
     - ``cholesterol, liver-disorders, kin8nm, cpu_small, titanic_2, boston, stock, space_ga, Australian, blood-transfusion, christine, credit-g, kc1, kr-vs-kp, phoneme, sylvine, car, cnae-9, dilbert, fabert, jasmine, mfeat-factors, segment, vehicle``
     - ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/``

Setup
^^^^^

...@@ -32,26 +131,47 @@ Run predefined benchmarks on existing tuners

./runbenchmark_nni.sh [tuner-names]

This script runs the benchmark 'nnivalid', which consists of a regression task, a binary classification task, and a
multi-class classification task. After the script finishes, you can find a summary of the results in the folder ``results_[time]/reports/``.
To run on other predefined benchmarks, change the ``benchmark`` variable in ``runbenchmark_nni.sh``. To change to another
search space (by using another architecture), change the ``arch_type`` parameter in ``./nni/frameworks.yaml``. Note that currently,
we only support ``random_forest`` or ``mlp`` as the ``arch_type``. To experiment on other search spaces with the same
architecture, please change the search space defined in ``./nni/extensions/NNI/architectures/run_[architecture].py``.

``./nni/frameworks.yaml`` is the actual configuration file for the HPO Benchmark. The ``limit_type`` parameter specifies
the limits for running the benchmark on one tuner. If ``limit_type`` is set to ``ntrials``, then the tuner is called
``trial_limit`` times and then stopped. If ``limit_type`` is set to ``time``, then the tuner is called repeatedly until
the timeout for the benchmark is reached. The timeout can be changed in each benchmark file located
in ``./nni/benchmarks``.
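
The two values of ``limit_type`` amount to two different stopping conditions for the tuning loop. The sketch below illustrates that logic only; ``should_stop`` and its arguments are hypothetical names and are not taken from the benchmark's implementation.

```python
import time


def should_stop(limit_type, trial_count, start_time,
                trial_limit=None, time_limit_seconds=None, now=None):
    """Decide whether the tuning loop should end (hypothetical helper).

    limit_type == "ntrials": stop once trial_limit trials have been run.
    limit_type == "time":    stop once time_limit_seconds have elapsed.
    """
    now = time.time() if now is None else now
    if limit_type == "ntrials":
        return trial_count >= trial_limit
    if limit_type == "time":
        return now - start_time >= time_limit_seconds
    raise ValueError(f"unknown limit_type: {limit_type}")


# With a trial limit of 10, the loop ends after the tenth trial.
print(should_stop("ntrials", trial_count=10, start_time=0.0, trial_limit=10))
# With a 300-second budget, 120 seconds in is not yet a reason to stop.
print(should_stop("time", trial_count=3, start_time=0.0,
                  time_limit_seconds=300, now=120.0))
```

The 300-second budget mirrors the ``max_runtime_seconds`` default that appears in the benchmark files of this commit.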

By default, the script runs the benchmark on all embedded tuners in NNI. If provided a list of tuners in [tuner-names],
it only runs the tuners in the list. Currently, the following tuner names are supported: "TPE", "Random", "Anneal",
"Evolution", "SMAC", "GPTuner", "MetisTuner", "DNGOTuner", "Hyperband", "BOHB". It is also possible to run the benchmark
on custom tuners. See the next sections for details.

By default, the script runs the specified tuners against the specified benchmark one by one. To run the experiment for
all tuners simultaneously in the background, set the "serialize" flag to false in ``runbenchmark_nni.sh``.

Note: the SMAC tuner, DNGO tuner, and the BOHB advisor have to be manually installed before running benchmarks on them.
Please refer to `this page <https://nni.readthedocs.io/en/stable/Tuner/BuiltinTuner.html?highlight=nni>`_ for more details
on installation.

Run customized benchmarks on existing tuners
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can design your own benchmarks and evaluate the performance of NNI tuners on them. To run customized benchmarks,
add a ``benchmark_name.yaml`` file in the folder ``./nni/benchmarks``, and change the ``benchmark`` variable in ``runbenchmark_nni.sh``.
See ``./automlbenchmark/resources/benchmarks/`` for some examples of defining a custom benchmark.
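
As a rough illustration of what such a file contains, the following sketch renders a benchmark definition in the same layout as the benchmark files included in this commit (a ``__defaults__`` entry followed by one entry per OpenML task). ``render_benchmark`` is a hypothetical helper; the two task IDs are taken from the files in this commit, and the defaults are copied from them as well.

```python
def render_benchmark(tasks, folds=2, cores=2, max_runtime_seconds=300):
    """Return the text of a benchmark definition (hypothetical helper).

    tasks is a list of (name, openml_task_id) pairs.
    """
    lines = [
        "---",
        "- name: __defaults__",
        f"  folds: {folds}",
        f"  cores: {cores}",
        f"  max_runtime_seconds: {max_runtime_seconds}",
    ]
    for name, task_id in tasks:
        lines.append(f"- name: {name}")
        lines.append(f"  openml_task_id: {task_id}")
    return "\n".join(lines) + "\n"


# Task IDs as listed in the benchmark files of this commit.
text = render_benchmark([("cholesterol", 2295), ("credit-g", 31)])
print(text)
```

Saving the resulting text under ``./nni/benchmarks`` with a name of your choice and pointing the ``benchmark`` variable at that name follows the procedure described above.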
Run benchmarks on custom tuners
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You may also use the benchmark to compare a custom tuner of your own with the NNI built-in tuners. To use custom
tuners, first make sure that the tuner inherits from ``nni.tuner.Tuner`` and correctly implements the required APIs. For
more information on implementing a custom tuner, please refer to `here <https://nni.readthedocs.io/en/stable/Tuner/CustomizeTuner.html>`_.
Next, perform the following steps:

#. Install the custom tuner via the command ``nnictl algo register``. Check `this document <https://nni.readthedocs.io/en/stable/Tutorial/Nnictl.html>`_ for details.
#. In ``./nni/frameworks.yaml``\ , add a new framework extending the base framework NNI. Make sure that the parameter ``tuner_type`` corresponds to the "builtinName" of the tuner installed in step 1.
#. Run the following command

...@@ -59,182 +179,4 @@ To use custom tuners, first make sure that the tuner inherits from ``nni.tuner.T

./runbenchmark_nni.sh new-tuner-builtinName
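
For step 1, a custom tuner could look roughly like the sketch below. ``RandomChoiceTuner`` is a hypothetical example that only handles ``choice`` entries; the three overridden methods are the ones a class derived from ``nni.tuner.Tuner`` is expected to implement, and the fallback base class exists only so that the sketch also runs where NNI is not installed.

```python
import random

try:
    from nni.tuner import Tuner
except ImportError:
    # Fallback so the sketch still runs without NNI installed.
    class Tuner:
        pass


class RandomChoiceTuner(Tuner):
    """Hypothetical custom tuner that samples uniformly from 'choice' entries."""

    def __init__(self, seed=0):
        super().__init__()
        self.rng = random.Random(seed)
        self.search_space = {}
        self.results = {}  # parameter_id -> reported value

    def update_search_space(self, search_space):
        # Called with the search space before parameters are requested.
        self.search_space = search_space

    def generate_parameters(self, parameter_id, **kwargs):
        # Return one configuration for the trial identified by parameter_id.
        return {name: self.rng.choice(spec["_value"])
                for name, spec in self.search_space.items()}

    def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
        # Record the feedback; a smarter tuner would use it to steer the search.
        self.results[parameter_id] = value


custom_tuner = RandomChoiceTuner(seed=0)
custom_tuner.update_search_space(
    {"max_depth": {"_type": "choice", "_value": [4, 8, 16]}})
params = custom_tuner.generate_parameters(0)
custom_tuner.receive_trial_result(0, params, 0.9)
print(params, custom_tuner.results)
```

The sketch drives the tuner by hand; in the benchmark, the framework makes these calls.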
The benchmark will automatically find and match the tuner newly added to your NNI installation.
HPO Benchmark Example Statistics
================================
A Benchmark Example
^^^^^^^^^^^^^^^^^^^
As an example, we ran the "nnismall" benchmark with the random forest search space on the following 8 tuners: "TPE",
"Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "DNGOTuner". For convenience of reference, we also list
the search space we experimented on here. Note that the way in which the search space is written may significantly affect
hyperparameter optimization performance, and we plan to conduct further experiments on how well NNI built-in tuners adapt
to different search space formulations using this benchmarking tool.

.. code-block:: json

   {
       "n_estimators": {"_type":"randint", "_value": [8, 512]},
       "max_depth": {"_type":"choice", "_value": [4, 8, 16, 32, 64, 128, 256, 0]},
       "min_samples_leaf": {"_type":"randint", "_value": [1, 8]},
       "min_samples_split": {"_type":"randint", "_value": [2, 16]},
       "max_leaf_nodes": {"_type":"randint", "_value": [0, 4096]}
   }

As some of the tasks contain a considerable amount of training data, it took about 2 days to run the whole benchmark on
one tuner. For a more detailed description of the tasks, please check
``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/nnismall_description.txt``. For binary and multi-class
classification tasks, the metrics "auc" and "logloss" were used for evaluation, while for regression, "r2" and "rmse" were used.
After the script finishes, the final scores of each tuner are summarized in the file ``results[time]/reports/performances.txt``.
Since the file is large, we only show the following screenshot and summarize other important statistics instead.

.. image:: ../img/hpo_benchmark/performances.png
   :target: ../img/hpo_benchmark/performances.png
   :alt:

When the results are parsed, the tuners are also ranked based on their final performance. The following three tables show
the average ranking of the tuners for each metric (logloss, rmse, auc).
Also, for every tuner, the performance on each type of metric is summarized (another view of the same data).
We present these statistics in the fourth table. Note that this information can be found at ``results[time]/reports/rankings.txt``.
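
To clarify what an average ranking means, here is a minimal sketch of one way to compute it: rank the tuners within each task, then average each tuner's rank over the tasks. The scores are made-up numbers for illustration, ``rank_tuners`` and ``average_rankings`` are hypothetical helpers, and giving tied tuners the mean of their ranks is an assumption rather than a description of how the benchmark script breaks ties.

```python
def rank_tuners(scores, higher_is_better=True):
    """Rank tuners on one task; rank 1 is best, ties share the mean rank."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j over the run of tuners tied with ordered[i].
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k]] = mean_rank
        i = j + 1
    return ranks


def average_rankings(per_task_scores, higher_is_better=True):
    """Average each tuner's per-task rank over all tasks."""
    collected = {}
    for scores in per_task_scores.values():
        for tuner, rank in rank_tuners(scores, higher_is_better).items():
            collected.setdefault(tuner, []).append(rank)
    return {tuner: sum(r) / len(r) for tuner, r in collected.items()}


# Made-up scores for illustration only (higher is better, as with auc).
per_task_scores = {
    "task_a": {"TPE": 0.90, "Random": 0.85, "Anneal": 0.90},
    "task_b": {"TPE": 0.70, "Random": 0.80, "Anneal": 0.75},
}
print(average_rankings(per_task_scores))
```

For metrics where lower is better, such as rmse or logloss, pass ``higher_is_better=False``.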

Average rankings for metric rmse (for regression tasks). We found that Anneal performs the best among all NNI built-in tuners.

.. list-table::
   :header-rows: 1

   * - Tuner Name
     - Average Ranking
   * - Anneal
     - 3.75
   * - Random
     - 4.00
   * - Evolution
     - 4.44
   * - DNGOTuner
     - 4.44
   * - SMAC
     - 4.56
   * - TPE
     - 4.94
   * - GPTuner
     - 4.94
   * - MetisTuner
     - 4.94

Average rankings for metric auc (for classification tasks). We found that SMAC performs the best among all NNI built-in tuners.

.. list-table::
   :header-rows: 1

   * - Tuner Name
     - Average Ranking
   * - SMAC
     - 3.67
   * - GPTuner
     - 4.00
   * - Evolution
     - 4.22
   * - Anneal
     - 4.39
   * - MetisTuner
     - 4.39
   * - TPE
     - 4.67
   * - Random
     - 5.33
   * - DNGOTuner
     - 5.33

Average rankings for metric logloss (for classification tasks). We found that Random performs the best among all NNI built-in tuners.

.. list-table::
   :header-rows: 1

   * - Tuner Name
     - Average Ranking
   * - Random
     - 3.36
   * - DNGOTuner
     - 3.50
   * - SMAC
     - 3.93
   * - GPTuner
     - 4.64
   * - TPE
     - 4.71
   * - Anneal
     - 4.93
   * - Evolution
     - 5.00
   * - MetisTuner
     - 5.93

To view the same data in another way, for each tuner, we present the average rankings on different types of metrics. From the table, we can find that, for example, the DNGOTuner performs better for the tasks whose metric is "logloss" than for the tasks with metric "auc". We hope this information can to some extent guide the choice of tuners given some knowledge of task types.

.. list-table::
   :header-rows: 1

   * - Tuner Name
     - rmse
     - auc
     - logloss
   * - TPE
     - 4.94
     - 4.67
     - 4.71
   * - Random
     - 4.00
     - 5.33
     - 3.36
   * - Anneal
     - 3.75
     - 4.39
     - 4.93
   * - Evolution
     - 4.44
     - 4.22
     - 5.00
   * - GPTuner
     - 4.94
     - 4.00
     - 4.64
   * - MetisTuner
     - 4.94
     - 4.39
     - 5.93
   * - SMAC
     - 4.56
     - 3.67
     - 3.93
   * - DNGOTuner
     - 4.44
     - 5.33
     - 3.50

Besides these reports, our script also generates two graphs for each fold of each task: one graph presents the best score received by each tuner until trial x, and another graph shows the score that each tuner receives in trial x. These two graphs can give some information regarding how the tuners are "converging" to their final solution. We found that for "nnismall", tuners on the random forest model with search space defined in ``/examples/trials/benchmarking/automlbenchmark/nni/extensions/NNI/architectures/run_random_forest.py`` generally converge to the final solution after 40 to 60 trials. As there are too many graphs to include in a single report (96 graphs in total), we only present 10 graphs here.
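
The relationship between the two kinds of graphs can be stated in code: the "best score until trial x" curve is the running maximum (or minimum, for metrics where lower is better) of the per-trial scores. The following sketch uses made-up scores, and ``best_so_far`` is a hypothetical helper rather than part of the reporting script.

```python
from itertools import accumulate


def best_so_far(scores, higher_is_better=True):
    """Turn per-trial scores into the best score seen up to each trial."""
    return list(accumulate(scores, max if higher_is_better else min))


# Made-up per-trial scores for illustration only.
trial_scores = [0.81, 0.86, 0.84, 0.92, 0.90]
print(best_so_far(trial_scores))  # [0.81, 0.86, 0.86, 0.92, 0.92]
```

By construction the resulting curve never decreases when higher scores are better, which is why the first graph looks like a staircase while the second one fluctuates.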

.. image:: ../img/hpo_benchmark/car_fold1_1.jpg
   :target: ../img/hpo_benchmark/car_fold1_1.jpg
   :alt:

.. image:: ../img/hpo_benchmark/car_fold1_2.jpg
   :target: ../img/hpo_benchmark/car_fold1_2.jpg
   :alt:

The previous two graphs are generated for fold 1 of the task "car". In the first graph, we observe that most tuners find a relatively good solution within 40 trials. In this experiment, among all tuners, the DNGOTuner converges fastest to the best solution (within 10 trials). Its best score improved three times over the entire experiment. In the second graph, we observe that most tuners have their scores fluctuate between 0.8 and 1 throughout the experiment. However, the Anneal tuner (green line) appears more unstable (showing more fluctuations), while the GPTuner has a more stable pattern. One possible interpretation is that the Anneal tuner explores more aggressively than the GPTuner, so its scores vary more from trial to trial. Although this pattern can to some extent hint at a tuner's position on the explore-exploit tradeoff, it is not a comprehensive evaluation of a tuner's effectiveness.

.. image:: ../img/hpo_benchmark/christine_fold0_1.jpg
   :target: ../img/hpo_benchmark/christine_fold0_1.jpg
   :alt:

.. image:: ../img/hpo_benchmark/christine_fold0_2.jpg
   :target: ../img/hpo_benchmark/christine_fold0_2.jpg
   :alt:

.. image:: ../img/hpo_benchmark/cnae-9_fold0_1.jpg
   :target: ../img/hpo_benchmark/cnae-9_fold0_1.jpg
   :alt:

.. image:: ../img/hpo_benchmark/cnae-9_fold0_2.jpg
   :target: ../img/hpo_benchmark/cnae-9_fold0_2.jpg
   :alt:

.. image:: ../img/hpo_benchmark/credit-g_fold1_1.jpg
   :target: ../img/hpo_benchmark/credit-g_fold1_1.jpg
   :alt:

.. image:: ../img/hpo_benchmark/credit-g_fold1_2.jpg
   :target: ../img/hpo_benchmark/credit-g_fold1_2.jpg
   :alt:

.. image:: ../img/hpo_benchmark/titanic_2_fold1_1.jpg
   :target: ../img/hpo_benchmark/titanic_2_fold1_1.jpg
   :alt:

.. image:: ../img/hpo_benchmark/titanic_2_fold1_2.jpg
   :target: ../img/hpo_benchmark/titanic_2_fold1_2.jpg
   :alt:

...@@ -25,4 +25,4 @@ according to their needs.
WebUI <Tutorial/WebUI>
How to Debug <Tutorial/HowToDebug>
Advanced <hpo_advanced>
HPO Benchmarks <hpo_benchmark>
---
- name: __defaults__
  folds: 2
  cores: 2
  max_runtime_seconds: 300

- name: Australian
  openml_task_id: 146818

- name: blood-transfusion
  openml_task_id: 10101

- name: christine
  openml_task_id: 168908

- name: credit-g
  openml_task_id: 31

- name: kc1
  openml_task_id: 3917

- name: kr-vs-kp
  openml_task_id: 3

- name: phoneme
  openml_task_id: 9952

- name: sylvine
  openml_task_id: 168912
---
- name: __defaults__
  folds: 2
  cores: 2
  max_runtime_seconds: 300

- name: car
  openml_task_id: 146821

- name: cnae-9
  openml_task_id: 9981

- name: dilbert
  openml_task_id: 168909

- name: fabert
  openml_task_id: 168910

- name: jasmine
  openml_task_id: 168911

- name: mfeat-factors
  openml_task_id: 12

- name: segment
  openml_task_id: 146822

- name: vehicle
  openml_task_id: 53
---
- name: __defaults__
  folds: 2
  cores: 2
  max_runtime_seconds: 300

- name: cholesterol
  openml_task_id: 2295

- name: liver-disorders
  openml_task_id: 52948

- name: kin8nm
  openml_task_id: 2280

- name: cpu_small
  openml_task_id: 4883

- name: titanic_2
  openml_task_id: 211993

- name: boston
  openml_task_id: 4857

- name: stock
  openml_task_id: 2311

- name: space_ga
  openml_task_id: 4835
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

import logging
import sklearn
import time
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.model_selection import cross_val_score

from amlb.benchmark import TaskConfig
from amlb.data import Dataset
from amlb.datautils import impute
from amlb.utils import Timer
from amlb.results import save_predictions_to_file


arch_choices = [(16), (64), (128), (256),
                (16, 16), (64, 64), (128, 128), (256, 256),
                (16, 16, 16), (64, 64, 64), (128, 128, 128), (256, 256, 256),
                (256, 128, 64, 16), (128, 64, 16), (64, 16),
                (16, 64, 128, 256), (16, 64, 128), (16, 64)]

SEARCH_SPACE = {
    "hidden_layer_sizes": {"_type":"choice", "_value": arch_choices},
    "learning_rate_init": {"_type":"choice", "_value": [0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005, 0.00001]},
    "alpha": {"_type":"choice", "_value": [0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001]},
    "momentum": {"_type":"uniform","_value":[0, 1]},
    "beta_1": {"_type":"uniform","_value":[0, 1]},
    "tol": {"_type":"choice", "_value": [0.001, 0.0005, 0.0001, 0.00005, 0.00001]},
    "max_iter": {"_type":"randint", "_value": [2, 256]},
}

def preprocess_mlp(dataset, log):
    '''
    For MLP:
    - For numerical features, normalize them after null imputation.
    - For categorical features, use one-hot encoding after null imputation.
    '''
    cat_columns, num_columns = [], []
    shift_amount = 0
    for i, f in enumerate(dataset.features):
        if f.is_target:
            shift_amount += 1
            continue
        elif f.is_categorical():
            cat_columns.append(i - shift_amount)
        else:
            num_columns.append(i - shift_amount)

    cat_pipeline = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                             ('onehot_encoder', OneHotEncoder()),
                             ])

    num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                             ('standard_scaler', StandardScaler()),
                             ])

    data_pipeline = ColumnTransformer([
        ('categorical', cat_pipeline, cat_columns),
        ('numerical', num_pipeline, num_columns),
    ])

    data_pipeline.fit(np.concatenate([dataset.train.X, dataset.test.X], axis=0))

    X_train = data_pipeline.transform(dataset.train.X)
    X_test = data_pipeline.transform(dataset.test.X)

    return X_train, X_test

def run_mlp(dataset, config, tuner, log):
    """
    Using the given tuner, tune an MLP within the given time or trial limit.
    This function uses the cross-validation score as the feedback score to the tuner.
    The search space that the tuner explores is defined empirically above as a global variable.
    """
    limit_type, trial_limit = config.framework_params['limit_type'], None
    if limit_type == 'ntrials':
        trial_limit = int(config.framework_params['trial_limit'])

    X_train, X_test = preprocess_mlp(dataset, log)
    y_train, y_test = dataset.train.y, dataset.test.y

    is_classification = config.type == 'classification'
    estimator = MLPClassifier if is_classification else MLPRegressor

    best_score, best_params, best_model = None, None, None
    score_higher_better = True

    tuner.update_search_space(SEARCH_SPACE)

    start_time = time.time()
    trial_count = 0
    intermediate_scores = []
    intermediate_best_scores = []           # should be monotonically increasing

    while True:
        try:
            param_idx, cur_params = tuner.generate_parameters()
            if cur_params is not None and cur_params != {}:
                trial_count += 1
                train_params = cur_params.copy()

                if 'TRIAL_BUDGET' in cur_params:
                    train_params.pop('TRIAL_BUDGET')

                log.info("Trial {}: \n{}\n".format(param_idx, train_params))

                cur_model = estimator(random_state=config.seed, **train_params)

                # Here score is the output of score() from the estimator
                cur_score = cross_val_score(cur_model, X_train, y_train)
                cur_score = sum(cur_score) / float(len(cur_score))
                if np.isnan(cur_score):
                    cur_score = 0

                log.info("Score: {}\n".format(cur_score))
                if best_score is None or (score_higher_better and cur_score > best_score) or (not score_higher_better and cur_score < best_score):
                    best_score, best_params, best_model = cur_score, cur_params, cur_model

                intermediate_scores.append(cur_score)
                intermediate_best_scores.append(best_score)
                tuner.receive_trial_result(param_idx, cur_params, cur_score)

            if limit_type == 'time':
                current_time = time.time()
                elapsed_time = current_time - start_time
                if elapsed_time >= config.max_runtime_seconds:
                    break
            elif limit_type == 'ntrials':
                if trial_count >= trial_limit:
                    break
        except:
            break

    # This line is required to fully terminate some advisors
    tuner.handle_terminate()

    log.info("Tuning done, the best parameters are:\n{}\n".format(best_params))

    # retrain on the whole training set
    with Timer() as training:
        best_model.fit(X_train, y_train)
    predictions = best_model.predict(X_test)
    probabilities = best_model.predict_proba(X_test) if is_classification else None

    return probabilities, predictions, training, y_test, intermediate_scores, intermediate_best_scores
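The loop in ``run_mlp`` repeatedly asks the tuner for parameters, scores them, tracks the best score so far, and reports the score back to the tuner. The sketch below illustrates only that protocol with the standard library. ``ToyRandomTuner`` and ``toy_score`` are hypothetical stand-ins for a real NNI tuner and for the cross-validation score, and the search space is a trimmed copy of the one above. The sketch assumes that the upper bound of ``randint`` is exclusive.

```python
import random

# Trimmed copy of the search-space format used above ("_type" / "_value" entries).
SEARCH_SPACE = {
    "alpha": {"_type": "choice", "_value": [0.1, 0.01, 0.001]},
    "momentum": {"_type": "uniform", "_value": [0, 1]},
    "max_iter": {"_type": "randint", "_value": [2, 256]},
}


class ToyRandomTuner:
    """Hypothetical stand-in that mirrors only the methods the tuning loop calls."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.space = {}
        self.next_id = 0
        self.history = []

    def update_search_space(self, space):
        self.space = space

    def generate_parameters(self):
        params = {}
        for name, spec in self.space.items():
            kind, value = spec["_type"], spec["_value"]
            if kind == "choice":
                params[name] = self.rng.choice(value)
            elif kind == "uniform":
                params[name] = self.rng.uniform(value[0], value[1])
            elif kind == "randint":
                # Assumption: lower bound inclusive, upper bound exclusive.
                params[name] = self.rng.randrange(value[0], value[1])
            else:
                raise ValueError("unsupported type: {}".format(kind))
        param_id = self.next_id
        self.next_id += 1
        return param_id, params

    def receive_trial_result(self, param_id, params, score):
        self.history.append((param_id, score))


def toy_score(params):
    """Hypothetical objective standing in for the mean cross-validation score."""
    return -abs(params["momentum"] - 0.9) - params["alpha"]


def tune(tuner, trial_limit):
    """Run a trial-limited loop and record the best score after every trial."""
    tuner.update_search_space(SEARCH_SPACE)
    best_score, best_params = None, None
    intermediate_best_scores = []
    trial_count = 0
    while trial_count < trial_limit:
        param_id, params = tuner.generate_parameters()
        trial_count += 1
        score = toy_score(params)
        if best_score is None or score > best_score:
            best_score, best_params = score, params
        intermediate_best_scores.append(best_score)
        tuner.receive_trial_result(param_id, params, score)
    return best_params, best_score, intermediate_best_scores


best_params, best_score, best_curve = tune(ToyRandomTuner(seed=0), trial_limit=20)
print(best_params, best_score)
```

Because the best score is replaced only when a trial improves on it, the recorded curve never decreases, which matches the comment on ``intermediate_best_scores`` in ``run_mlp``.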
@@ -21,28 +21,38 @@ from amlb.results import save_predictions_to_file
 SEARCH_SPACE = {
-    "n_estimators": {"_type":"randint", "_value": [8, 512]},
+    "n_estimators": {"_type":"randint", "_value": [4, 2048]},
     "max_depth": {"_type":"choice", "_value": [4, 8, 16, 32, 64, 128, 256, 0]}, # 0 for None
     "min_samples_leaf": {"_type":"randint", "_value": [1, 8]},
     "min_samples_split": {"_type":"randint", "_value": [2, 16]},
     "max_leaf_nodes": {"_type":"randint", "_value": [0, 4096]} # 0 for None
 }
-SEARCH_SPACE_CHOICE = {
-    "n_estimators": {"_type":"choice", "_value": [8, 16, 32, 64, 128, 256, 512]},
-    "max_depth": {"_type":"choice", "_value": [4, 8, 16, 32, 64, 128, 0]}, # 0 for None
-    "min_samples_leaf": {"_type":"choice", "_value": [1, 2, 4, 8]},
-    "min_samples_split": {"_type":"choice", "_value": [2, 4, 8, 16]},
-    "max_leaf_nodes": {"_type":"choice", "_value": [8, 32, 128, 512, 0]} # 0 for None
-}
-
-SEARCH_SPACE_SIMPLE = {
-    "n_estimators": {"_type":"choice", "_value": [10]},
-    "max_depth": {"_type":"choice", "_value": [5]},
-    "min_samples_leaf": {"_type":"choice", "_value": [8]},
-    "min_samples_split": {"_type":"choice", "_value": [16]},
-    "max_leaf_nodes": {"_type":"choice", "_value": [64]}
-}
+# change SEARCH_SPACE to the following spaces to experiment on different search spaces
+
+# SEARCH_SPACE_CHOICE = {
+#     "n_estimators": {"_type":"choice", "_value": [4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]},
+#     "max_depth": {"_type":"choice", "_value": [4, 8, 16, 32, 64, 128, 256, 0]}, # 0 for None
+#     "min_samples_leaf": {"_type":"choice", "_value": [1, 2, 4, 8]},
+#     "min_samples_split": {"_type":"choice", "_value": [2, 4, 8, 16]},
+#     "max_leaf_nodes": {"_type":"choice", "_value": [8, 32, 128, 512, 1024, 2048, 4096, 0]} # 0 for None
+# }
+
+# SEARCH_SPACE_LOG = {
+#     "n_estimators": {"_type":"loguniform", "_value": [4, 2048]},
+#     "max_depth": {"_type":"choice", "_value": [4, 8, 16, 32, 64, 128, 256, 0]}, # 0 for None
+#     "min_samples_leaf": {"_type":"randint", "_value": [1, 8]},
+#     "min_samples_split": {"_type":"randint", "_value": [2, 16]},
+#     "max_leaf_nodes": {"_type":"loguniform", "_value": [4, 4096]} # 0 for None
+# }
+
+# SEARCH_SPACE_SIMPLE = {
+#     "n_estimators": {"_type":"choice", "_value": [10]},
+#     "max_depth": {"_type":"choice", "_value": [5]},
+#     "min_samples_leaf": {"_type":"choice", "_value": [8]},
+#     "min_samples_split": {"_type":"choice", "_value": [16]},
+#     "max_leaf_nodes": {"_type":"choice", "_value": [64]}
+# }
 def preprocess_random_forest(dataset, log):
@@ -111,16 +121,18 @@ def run_random_forest(dataset, config, tuner, log):
     while True:
         try:
-            trial_count += 1
             param_idx, cur_params = tuner.generate_parameters()
-            train_params = cur_params.copy()
-            if 'TRIAL_BUDGET' in cur_params:
-                train_params.pop('TRIAL_BUDGET')
-            if cur_params['max_leaf_nodes'] == 0:
-                train_params.pop('max_leaf_nodes')
-            if cur_params['max_depth'] == 0:
-                train_params.pop('max_depth')
-            log.info("Trial {}: \n{}\n".format(param_idx, cur_params))
-            cur_model = estimator(random_state=config.seed, **train_params)
+            if cur_params is not None and cur_params != {}:
+                trial_count += 1
+                train_params = cur_params.copy()
+                train_params = {x: int(train_params[x]) for x in train_params.keys()}
+                if 'TRIAL_BUDGET' in cur_params:
+                    train_params.pop('TRIAL_BUDGET')
+                if cur_params['max_leaf_nodes'] == 0:
+                    train_params.pop('max_leaf_nodes')
+                if cur_params['max_depth'] == 0:
+                    train_params.pop('max_depth')
+                log.info("Trial {}: \n{}\n".format(param_idx, train_params))
+                cur_model = estimator(random_state=config.seed, **train_params)
......
@@ -2,6 +2,7 @@
 # Licensed under the MIT license.
 from .architectures.run_random_forest import *
+from .architectures.run_mlp import *
 def run_experiment(dataset, config, tuner, log):
@@ -11,5 +12,8 @@ def run_experiment(dataset, config, tuner, log):
     if config.framework_params['arch_type'] == 'random_forest':
         return run_random_forest(dataset, config, tuner, log)
+    elif config.framework_params['arch_type'] == 'mlp':
+        return run_mlp(dataset, config, tuner, log)
     else:
         raise RuntimeError('The requested arch type in framework.yaml is unavailable.')
@@ -6,7 +6,7 @@ NNI:
   project: https://github.com/microsoft/nni
 # type in ['TPE', 'Random', 'Anneal', 'Evolution', 'SMAC', 'GPTuner', 'MetisTuner', 'DNGOTuner', 'Hyperband', 'BOHB']
-# arch_type in ['random_forest']
+# arch_type in ['random_forest', 'mlp']
 # limit_type in ['time', 'ntrials']
 # limit must be an integer
......
@@ -3,8 +3,8 @@
 time=$(date "+%Y%m%d%H%M%S")
 installation='automlbenchmark'
 outdir="results_$time"
-benchmark='nnivalid' # 'nnismall'
-serialize=$true # if false, run all experiments together in background
+benchmark='nnivalid' # 'nnismall' 'nnismall-regression' 'nnismall-binary' 'nnismall-multiclass'
+serialize=true # if false, run all experiments together in background
 mkdir $outdir $outdir/scorelogs $outdir/reports
@@ -14,7 +14,7 @@ else
     tuner_array=( "$@" )
 fi
-if [ $serialize ]; then
+if [ "$serialize" = true ]; then
     # run tuners serially
     for tuner in ${tuner_array[*]}; do
         echo "python $installation/runbenchmark.py $tuner $benchmark -o $outdir -u nni"
......