Unverified commit 4c49db12, authored by Bill Wu, committed by GitHub

HPO Benchmark (#3644)

parent 4ccc9402
Benchmark for Tuners
====================
We provide a benchmarking tool to compare the performance of tuners provided by NNI (and users' custom tuners) on different tasks. The implementation of this tool is based on the automlbenchmark repository (https://github.com/openml/automlbenchmark), which provides services for running different *frameworks* against different *benchmarks*, each consisting of multiple *tasks*. The tool is located in ``examples/trials/benchmarking/automlbenchmark``. This document provides a brief introduction to the tool and its usage.
Terminology
^^^^^^^^^^^
* **task**\ : a task can be thought of as a (dataset, evaluator) pair. It provides a dataset split into (train, valid, test), and based on the received predictions, the evaluator computes a given metric (e.g., mse for regression, f1 for classification).
* **benchmark**\ : a benchmark is a set of tasks, along with other external constraints such as time and resources.
* **framework**\ : given a task, a framework addresses the proposed regression or classification problem and produces predictions. Note that the automlbenchmark framework does not pose any restrictions on the hypothesis space of a framework. In our implementation in this folder, each framework is a tuple (tuner, architecture), where the architecture provides the hypothesis space (and the search space for the tuner), and the tuner determines the strategy of hyperparameter optimization.
* **tuner**\ : a tuner or advisor defined in the hpo folder, or a custom tuner provided by the user.
* **architecture**\ : an architecture is a specific method for solving the tasks, along with a set of hyperparameters to optimize (i.e., the search space). In our implementation, the architecture calls the tuner multiple times to obtain possible hyperparameter configurations, and produces the final prediction for a task. See ``./nni/extensions/NNI/architectures`` for examples, and the sketch after this list for an illustration of this interaction.
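To make the interaction between an architecture and a tuner concrete, below is a minimal, simplified sketch (it is not the actual code in ``run_random_forest.py``). It only relies on the standard ``nni.tuner.Tuner`` API (``update_search_space``, ``generate_parameters``, ``receive_trial_result``); the random forest model, the assumption that the sampled parameters map directly to scikit-learn arguments, and the trial budget are all illustrative.

.. code-block:: python

   # Simplified, hypothetical sketch of how an architecture drives a tuner.
   from sklearn.ensemble import RandomForestClassifier
   from sklearn.metrics import f1_score

   def optimize(tuner, search_space, X_train, y_train, X_valid, y_valid, n_trials=60):
       """Ask the tuner for configurations, evaluate each one, and report the results back."""
       tuner.update_search_space(search_space)
       best_score, best_params = float('-inf'), None
       for trial_id in range(n_trials):
           params = tuner.generate_parameters(trial_id)         # ask for a configuration
           model = RandomForestClassifier(**params).fit(X_train, y_train)
           score = f1_score(y_valid, model.predict(X_valid), average='macro')
           tuner.receive_trial_result(trial_id, params, score)  # report the result back
           if score > best_score:
               best_score, best_params = score, params
       # The architecture would then retrain with best_params and predict on the test set.
       return best_params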
Setup
^^^^^
Due to some incompatibilities between automlbenchmark and Python 3.8, Python 3.7 is recommended for running the experiments contained in this folder. First, run the following shell script to clone the automlbenchmark repository. Note: it is recommended to perform the following steps in a separate virtual environment, as the setup code may install several packages.
.. code-block:: bash

   ./setup.sh
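For example, one possible way to prepare such an environment before running the setup script (the environment name ``nni-bench`` is only a placeholder):

.. code-block:: bash

   # create and activate a Python 3.7 virtual environment, then run the setup
   python3.7 -m venv nni-bench
   source nni-bench/bin/activate
   ./setup.sh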
Run predefined benchmarks on existing tuners
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: bash

   ./runbenchmark_nni.sh [tuner-names]
This script runs the benchmark 'nnivalid', which consists of a regression task, a binary classification task, and a multi-class classification task. After the script finishes, you can find a summary of the results in the folder ``results_[time]/reports/``. To run on other predefined benchmarks, change the ``benchmark`` variable in ``runbenchmark_nni.sh``. Some benchmarks are defined in ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks``\ , and others are defined in ``/examples/trials/benchmarking/automlbenchmark/automlbenchmark/resources/benchmarks/``. One example of a larger benchmark is "nnismall", which consists of 8 regression tasks, 8 binary classification tasks, and 8 multi-class classification tasks.
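For instance, to run the larger benchmark instead, one would change the line in ``runbenchmark_nni.sh`` that sets this variable (the surrounding script content is omitted here):

.. code-block:: bash

   # in runbenchmark_nni.sh: choose which benchmark definition to run
   benchmark='nnismall'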
By default, the script runs the benchmark on all built-in tuners in NNI. If a list of tuners is provided in [tuner-names], it only runs the tuners in the list. Currently, the following tuner names are supported: "TPE", "Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "Hyperband", "BOHB". It is also possible to evaluate custom tuners. See the following sections for details.
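For example, assuming the tuner names are passed as space-separated arguments, the following would benchmark only two of the built-in tuners:

.. code-block:: bash

   # run the benchmark with only the TPE and Random tuners
   ./runbenchmark_nni.sh TPE Random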
By default, the script runs the specified tuners against the specified benchmark one by one. To run all the experiments simultaneously in the background, set the "serialize" flag to false in ``runbenchmark_nni.sh``.
Note: the SMAC tuner and the BOHB advisor have to be manually installed before any experiments can be run with them. Please refer to `this page <https://nni.readthedocs.io/en/stable/Tuner/BuiltinTuner.html?highlight=nni>`_ for more details on installing SMAC and BOHB.
Run customized benchmarks on existing tuners
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To run customized benchmarks, add a benchmark_name.yaml file in the folder ``./nni/benchmarks``\ , and change the ``benchmark`` variable in ``runbenchmark_nni.sh``. See ``./automlbenchmark/resources/benchmarks/`` for some examples of defining a custom benchmark.
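As an illustration, a minimal benchmark definition might look like the following sketch; the format mirrors the bundled benchmark files, and the task names and OpenML task ids are taken from the existing predefined benchmarks:

.. code-block:: yaml

   ---
   - name: __defaults__
     folds: 2
     cores: 2
     max_runtime_seconds: 300

   - name: car                  # multi-class classification task
     openml_task_id: 146821

   - name: cholesterol          # regression task
     openml_task_id: 2295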
Run benchmarks on custom tuners
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To use custom tuners, first make sure that the tuner inherits from ``nni.tuner.Tuner`` and correctly implements the required APIs. For more information on implementing a custom tuner, please refer to `here <https://nni.readthedocs.io/en/stable/Tuner/CustomizeTuner.html>`_. Next, perform the following steps:
#. Install the custom tuner with the command ``nnictl algo register``. Check `this document <https://nni.readthedocs.io/en/stable/Tutorial/Nnictl.html>`_ for details.

#. In ``./nni/frameworks.yaml``\ , add a new framework extending the base framework NNI (see the example snippet after these steps). Make sure that the parameter ``tuner_type`` corresponds to the "builtinName" of the tuner installed in step 1.

#. Run the following command:

   .. code-block:: bash

      ./runbenchmark_nni.sh new-tuner-builtinName
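For illustration, a new entry in ``./nni/frameworks.yaml`` might look like the following sketch, where ``MyTuner`` and ``MyTunerBuiltinName`` are placeholders for the tuner registered in step 1, and the ``extends``/``params`` keys follow the automlbenchmark framework-definition convention:

.. code-block:: yaml

   MyTuner:
     extends: NNI
     params:
       tuner_type: 'MyTunerBuiltinName'   # must match the builtinName from step 1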
A Benchmark Example
^^^^^^^^^^^^^^^^^^^
As an example, we ran the "nnismall" benchmark on the following 8 tuners: "TPE", "Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "DNGOTuner". (The DNGOTuner was not available as a built-in tuner at the time this article was written.) As some of the tasks contain a considerable amount of training data, it took about 2 days to run the whole benchmark on one tuner using a single CPU core. For a more detailed description of the tasks, please check ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/nnismall_description.txt``.
After the script finishes, the final scores of each tuner are summarized in the file ``results[time]/reports/performances.txt``. Since the file is large, we only show the following screenshot and summarize other important statistics instead.
.. image:: ../img/hpo_benchmark/performances.png
:target: ../img/hpo_benchmark/performances.png
:alt:
When the results are parsed, the tuners are ranked based on their final performances. ``results[time]/reports/rankings.txt`` presents the average ranking of the tuners for each metric (logloss, rmse, auc), as well as the rankings grouped by tuner (another view of the same data).
Average rankings for metric rmse:
.. list-table::
:header-rows: 1
* - Tuner Name
- Average Ranking
* - Anneal
- 3.75
* - Random
- 4.00
* - Evolution
- 4.44
* - DNGOTuner
- 4.44
* - SMAC
- 4.56
* - TPE
- 4.94
* - GPTuner
- 4.94
* - MetisTuner
- 4.94
Average rankings for metric auc:
.. list-table::
:header-rows: 1
* - Tuner Name
- Average Ranking
* - SMAC
- 3.67
* - GPTuner
- 4.00
* - Evolution
- 4.22
* - Anneal
- 4.39
* - MetisTuner
- 4.39
* - TPE
- 4.67
* - Random
- 5.33
* - DNGOTuner
- 5.33
Average rankings for metric logloss:
.. list-table::
:header-rows: 1
* - Tuner Name
- Average Ranking
* - Random
- 3.36
* - DNGOTuner
- 3.50
* - SMAC
- 3.93
* - GPTuner
- 4.64
* - TPE
- 4.71
* - Anneal
- 4.93
* - Evolution
- 5.00
* - MetisTuner
- 5.93
Average rankings for tuners:
.. list-table::
:header-rows: 1
* - Tuner Name
- rmse
- auc
- logloss
* - TPE
- 4.94
- 4.67
- 4.71
* - Random
- 4.00
- 5.33
- 3.36
* - Anneal
- 3.75
- 4.39
- 4.93
* - Evolution
- 4.44
- 4.22
- 5.00
* - GPTuner
- 4.94
- 4.00
- 4.64
* - MetisTuner
- 4.94
- 4.39
- 5.93
* - SMAC
- 4.56
- 3.67
- 3.93
* - DNGOTuner
- 4.44
- 5.33
- 3.50
Besides these reports, our script also generates two graphs for each fold of each task. The first graph presents the best score seen by each tuner up to trial x, and the second graph shows the score of each tuner in trial x. These two graphs give some information on how the tuners "converge". We found that for "nnismall", tuners using the random forest model with the search space defined in ``/examples/trials/benchmarking/automlbenchmark/nni/extensions/NNI/architectures/run_random_forest.py`` generally converge to their final solutions after 40 to 60 trials. As there are too many graphs to include in a single report (96 graphs in total), we only present 10 graphs here.
.. image:: ../img/hpo_benchmark/car_fold1_1.jpg
:target: ../img/hpo_benchmark/car_fold1_1.jpg
:alt:
.. image:: ../img/hpo_benchmark/car_fold1_2.jpg
:target: ../img/hpo_benchmark/car_fold1_2.jpg
:alt:
.. image:: ../img/hpo_benchmark/christine_fold0_1.jpg
:target: ../img/hpo_benchmark/christine_fold0_1.jpg
:alt:
.. image:: ../img/hpo_benchmark/christine_fold0_2.jpg
:target: ../img/hpo_benchmark/christine_fold0_2.jpg
:alt:
.. image:: ../img/hpo_benchmark/cnae-9_fold0_1.jpg
:target: ../img/hpo_benchmark/cnae-9_fold0_1.jpg
:alt:
.. image:: ../img/hpo_benchmark/cnae-9_fold0_2.jpg
:target: ../img/hpo_benchmark/cnae-9_fold0_2.jpg
:alt:
.. image:: ../img/hpo_benchmark/credit-g_fold1_1.jpg
:target: ../img/hpo_benchmark/credit-g_fold1_1.jpg
:alt:
.. image:: ../img/hpo_benchmark/credit-g_fold1_2.jpg
:target: ../img/hpo_benchmark/credit-g_fold1_2.jpg
:alt:
.. image:: ../img/hpo_benchmark/titanic_2_fold1_1.jpg
:target: ../img/hpo_benchmark/titanic_2_fold1_1.jpg
:alt:
.. image:: ../img/hpo_benchmark/titanic_2_fold1_2.jpg
:target: ../img/hpo_benchmark/titanic_2_fold1_2.jpg
:alt:
@@ -24,4 +24,5 @@ according to their needs.
   Examples <examples>
   WebUI <Tutorial/WebUI>
   How to Debug <Tutorial/HowToDebug>
   Advanced <hpo_advanced>
   Benchmark for Tuners <hpo_benchmark>
# data files
nni/data/
# benchmark repository
automlbenchmark/
# all experiment results
results*
# intermediate outputs of tuners
smac3-output*
param_config_space.pcs
scenario.txt
---
- name: __defaults__
folds: 2
cores: 2
max_runtime_seconds: 300
- name: cholesterol
openml_task_id: 2295
- name: liver-disorders
openml_task_id: 52948
- name: kin8nm
openml_task_id: 2280
- name: cpu_small
openml_task_id: 4883
- name: titanic_2
openml_task_id: 211993
- name: boston
openml_task_id: 4857
- name: stock
openml_task_id: 2311
- name: space_ga
openml_task_id: 4835
- name: Australian
openml_task_id: 146818
- name: blood-transfusion
openml_task_id: 10101
- name: car
openml_task_id: 146821
- name: christine
openml_task_id: 168908
- name: cnae-9
openml_task_id: 9981
- name: credit-g
openml_task_id: 31
- name: dilbert
openml_task_id: 168909
- name: fabert
openml_task_id: 168910
- name: jasmine
openml_task_id: 168911
- name: kc1
openml_task_id: 3917
- name: kr-vs-kp
openml_task_id: 3
- name: mfeat-factors
openml_task_id: 12
- name: phoneme
openml_task_id: 9952
- name: segment
openml_task_id: 146822
- name: sylvine
openml_task_id: 168912
- name: vehicle
openml_task_id: 53
nnismall:
This benchmark contains 24 tasks: 8 tasks each for binary classification, multi-class classification, and regression.
Binary Classification:
- name: Australian
openml_task_id: 146818
Introduction: Australian Credit Approval dataset, originating from the StatLog project. It concerns credit card applications.
Features: 6 numerical and 8 categorical features, all normalized to [-1,1].
Number of instances: 690
- name: blood-transfusion
openml_task_id: 10101
Introduction: Data taken from the Blood Transfusion Service Center in Hsin-Chu City in Taiwan. The target attribute is a binary variable representing whether the person donated blood in March 2007 (2 stands for donating blood; 1 stands for not donating blood).
Features: 4 numerical features.
Number of instances: 748
- name: christine
openml_task_id: 168908
Introduction: An Openml challenge dataset on classification. The identity of the datasets and the type of data is concealed.
Features: 1599 numerical features and 38 categorical features
Number of instances: 5418
- name: credit-g
openml_task_id: 31
Introduction: This dataset classifies people described by a set of attributes as good or bad credit risks.
Features: 7 numerical features and 13 categorical features
Number of instances: 1000
- name: kc1
openml_task_id: 3917
Introduction: One of the NASA Metrics Data Program defect data sets. Data from software for storage management for receiving and processing ground data.
Features: 21 numerical features
Number of instances: 2109
- name: kr-vs-kp
openml_task_id: 3
Introduction: Given a board configuration, predict whether white can win or not.
Features: 37 categorical features
Number of instances: 3196
- name: phoneme
openml_task_id: 9952
Introduction: The aim of this dataset is to distinguish between nasal (class 0) and oral sounds (class 1).
Features: 5 numerical features
Number of instances: 5404
- name: sylvine
openml_task_id: 168912
Introduction: An Openml challenge dataset on classification. The identity of the datasets and the type of data is concealed.
Features: 20 numerical features
Number of instances: 5124
Multi-class Classification
- name: car
openml_task_id: 146821
Introduction: The model evaluates cars using six intermediate concepts.
Features: 6 categorical features
Number of instances: 1728
- name: cnae-9
openml_task_id: 9981
Introduction: This is a data set containing 1080 documents of free text business descriptions of Brazilian companies categorized into a subset of 9 categories.
Features: 856 numerical features (word frequency)
Number of instances: 1080
- name: dilbert
openml_task_id: 168909
Introduction: An Openml challenge dataset on classification. The identity of the datasets and the type of data is concealed.
Features: 2000 numerical features
Number of instances: 10000
- name: fabert
openml_task_id: 168910
Introduction: An Openml challenge dataset on classification. The identity of the datasets and the type of data is concealed.
Features: 800 numerical features
Number of instances: 8237
- name: jasmine
openml_task_id: 168911
Introduction: An Openml challenge dataset on classification. The identity of the datasets and the type of data is concealed.
Features: 8 numerical features and 137 categorical features
Number of instances: 2984
- name: mfeat-factors
openml_task_id: 12
Introduction: Hand-written numeral classification.
Features: 216 numerical features (corresponding to binarized images)
Number of instances: 2000
- name: segment
openml_task_id: 146822
Introduction: Segmentation of outdoor images into 7 classes.
Features: 19 numerical features
Number of instances: 2310 (3x3 patches from 7 images)
- name: vehicle
openml_task_id: 53
Introduction: Classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.
Features: 18 numerical features
Number of instances: 846
Regression
- name: cholesterol
openml_task_id: 2295
Introduction: Predict the cholesterol level of patients.
Features: 6 numerical features and 7 categorical features
Number of instances: 303
- name: liver-disorders
openml_task_id: 52948
Introduction: Predict alcohol consumption based on blood test results.
Features: 5 numerical features
Number of instances: 345
- name: kin8nm
openml_task_id: 2280
Introduction: This dataset is concerned with the forward kinematics of an 8 link robot arm.
Features: 8 numerical features
Number of instances: 8192
- name: cpu_small
openml_task_id: 4883
Introduction: Predict the portion of time that CPUs run in user mode.
Features: 12 numerical features
Number of instances: 8192
- name: titanic_2
openml_task_id: 211993
Introduction: Predict the probability of survival.
Features: 7 numerical features
Number of instances: 891
- name: boston
openml_task_id: 4857
Introduction: Boston house price.
Features: 11 numerical features and 2 categorical features
Number of instances: 506
- name: stock
openml_task_id: 2311
Introduction: This is a dataset obtained from the StatLib repository. The data provided are daily stock prices from January 1988 through October 1991, for ten aerospace companies.
Features: 11 numerical features
Number of instances: 950
- name: space_ga
openml_task_id: 4835
Introduction: Predict the log of the proportion of votes cast for both candidates in the 1980 presidential election.
Features: 6 numerical attributes
Number of instances: 3107
---
#for doc purpose using <placeholder:default_value> syntax when it applies.
#FORMAT: global defaults are defined in config.yaml
- name: __dummy-task
enabled: false # actual default is `true` of course...
openml_task_id: 0
metric: # the first metric in the task list will be optimized against and used for the main result, the other ones are optional and purely informative. Only the metrics annotated with (*) can be used as a performance metric.
- # classification
- acc # (*) accuracy
- auc # (*) area under curve
- logloss # (*) log loss
- f1 # F1 score
- # regression
- mae # (*) mean absolute error
- mse # (*) mean squared error
- rmse # root mean squared error
- rmsle # root mean squared log error
- r2 # R^2 score
folds: 1
max_runtime_seconds: 600
cores: 1
max_mem_size_mb: -1
ec2_instance_type: m5.large
# local defaults (applying only to tasks defined in this file) can be defined in a task named "__defaults__"
- name: __defaults__
folds: 2
cores: 2
max_runtime_seconds: 180
- name: kc2
openml_task_id: 3913
description: "binary test dataset"
- name: iris
openml_task_id: 59
description: "multiclass test dataset"
- name: cholesterol
openml_task_id: 2295
description: "regression test dataset"
---
input_dir: '{user}/data'
frameworks:
definition_file:
- '{root}/resources/frameworks.yaml'
- '{user}/frameworks.yaml'
benchmarks:
definition_dir:
- '{user}/benchmarks'
- '{root}/resources/benchmarks'
constraints_file:
- '{user}/constraints.yaml'
- '{root}/resources/constraints.yaml'
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
def run(*args, **kwargs):
    # Entry point called by automlbenchmark for the NNI framework; the import of
    # the actual benchmark logic in exec.py is deferred until the framework runs.
    from .exec import run
    return run(*args, **kwargs)