"docs/en/git@developer.sourcefind.cn:OpenDAS/mmcv.git" did not exist on "0a2f60ba0198f8d567b536313bfba329588f9c3f"
Unverified Commit 5bf2cb19 authored by Bill Wu's avatar Bill Wu Committed by GitHub
Browse files

HPO Benchmark Fixes and New Features (#3925)

HPO Benchmark Example Statistics
================================
A Benchmark Example
^^^^^^^^^^^^^^^^^^^
As an example, we ran the "nnismall" benchmark with the random forest search space on the following 8 tuners: "TPE",
"Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", and "DNGOTuner". For ease of reference, we also list
the search space we experimented on here. Note that the way a search space is written may significantly affect
hyperparameter optimization performance, and we plan to use this benchmarking tool to conduct further experiments on how
well the NNI built-in tuners adapt to different search space formulations; a sketch of one alternative formulation follows
the search space below.
.. code-block:: json

    {
        "n_estimators": {"_type":"randint", "_value": [8, 512]},
        "max_depth": {"_type":"choice", "_value": [4, 8, 16, 32, 64, 128, 256, 0]},
        "min_samples_leaf": {"_type":"randint", "_value": [1, 8]},
        "min_samples_split": {"_type":"randint", "_value": [2, 16]},
        "max_leaf_nodes": {"_type":"randint", "_value": [0, 4096]}
    }
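For example, the same hyperparameters can also be written on a log scale or as explicit choices, and each formulation may
interact differently with a given tuner. The following sketch mirrors the commented-out ``SEARCH_SPACE_LOG`` in
``run_random_forest.py`` and is illustrative only; it is not the space used for the results below.

.. code-block:: python

    # An alternative, log-scale formulation of the same search space (illustrative only).
    # Values drawn from "loguniform" are floats and must be cast to int before being
    # passed to the scikit-learn estimator (run_random_forest.py performs this cast).
    SEARCH_SPACE_LOG = {
        "n_estimators": {"_type":"loguniform", "_value": [4, 2048]},
        "max_depth": {"_type":"choice", "_value": [4, 8, 16, 32, 64, 128, 256, 0]},   # 0 for None
        "min_samples_leaf": {"_type":"randint", "_value": [1, 8]},
        "min_samples_split": {"_type":"randint", "_value": [2, 16]},
        "max_leaf_nodes": {"_type":"loguniform", "_value": [4, 4096]}                 # 0 for None
    }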
As some of the tasks contain a considerable amount of training data, running the whole benchmark on one tuner took about
2 days. For a more detailed description of the tasks, please check
``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/nnismall_description.txt``. For binary and multi-class
classification tasks, the metrics "auc" and "logloss" were used for evaluation, while for regression, "r2" and "rmse" were used.
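The scores themselves are computed by the automlbenchmark harness. Purely for reference, a minimal sketch of computing the
same four metrics with scikit-learn on toy data might look like this:

.. code-block:: python

    # Minimal illustration of the four metrics on toy data (not the benchmark's own scoring code).
    from sklearn.metrics import roc_auc_score, log_loss, r2_score, mean_squared_error

    y_true_cls, y_proba = [0, 1, 1, 0], [0.1, 0.8, 0.7, 0.3]
    print(roc_auc_score(y_true_cls, y_proba))                  # "auc"
    print(log_loss(y_true_cls, y_proba))                       # "logloss"

    y_true_reg, y_pred_reg = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
    print(r2_score(y_true_reg, y_pred_reg))                    # "r2"
    print(mean_squared_error(y_true_reg, y_pred_reg) ** 0.5)   # "rmse"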
After the script finishes, the final scores of each tuner are summarized in the file ``results[time]/reports/performances.txt``.
Since the file is large, we only show a screenshot of it below and summarize the other important statistics instead.
.. image:: ../img/hpo_benchmark/performances.png
:target: ../img/hpo_benchmark/performances.png
:alt:
When the results are parsed, the tuners are also ranked based on their final performance. The following three tables show
the average ranking of the tuners for each metric (rmse, auc, logloss).
In addition, for every tuner, its performance on each type of metric is summarized in a fourth table (another view of the same data).
This information can also be found in ``results[time]/reports/rankings.txt``; a short sketch of how such average rankings can be computed is given after the tables.
Average rankings for metric rmse (for regression tasks). Among the NNI built-in tuners we evaluated, Anneal performed best on this metric.
.. list-table::
:header-rows: 1
* - Tuner Name
- Average Ranking
* - Anneal
- 3.75
* - Random
- 4.00
* - Evolution
- 4.44
* - DNGOTuner
- 4.44
* - SMAC
- 4.56
* - TPE
- 4.94
* - GPTuner
- 4.94
* - MetisTuner
- 4.94
Average rankings for metric auc (for classification tasks). Among the NNI built-in tuners we evaluated, SMAC performed best on this metric.
.. list-table::
:header-rows: 1
* - Tuner Name
- Average Ranking
* - SMAC
- 3.67
* - GPTuner
- 4.00
* - Evolution
- 4.22
* - Anneal
- 4.39
* - MetisTuner
- 4.39
* - TPE
- 4.67
* - Random
- 5.33
* - DNGOTuner
- 5.33
Average rankings for metric logloss (for classification tasks). Among the NNI built-in tuners we evaluated, Random performed best on this metric.
.. list-table::
:header-rows: 1
* - Tuner Name
- Average Ranking
* - Random
- 3.36
* - DNGOTuner
- 3.50
* - SMAC
- 3.93
* - GPTuner
- 4.64
* - TPE
- 4.71
* - Anneal
- 4.93
* - Evolution
- 5.00
* - MetisTuner
- 5.93
To view the same data from another angle, we present each tuner's average ranking on each type of metric. From this table we can see, for example, that the DNGOTuner performs better on tasks evaluated with "logloss" than on tasks evaluated with "auc". We hope this information can, to some extent, guide the choice of tuner when the task type is known.
.. list-table::
:header-rows: 1
* - Tuner Name
- rmse
- auc
- logloss
* - TPE
- 4.94
- 4.67
- 4.71
* - Random
- 4.00
- 5.33
- 3.36
* - Anneal
- 3.75
- 4.39
- 4.93
* - Evolution
- 4.44
- 4.22
- 5.00
* - GPTuner
- 4.94
- 4.00
- 4.64
* - MetisTuner
- 4.94
- 4.39
- 5.93
* - SMAC
- 4.56
- 3.67
- 3.93
* - DNGOTuner
- 4.44
- 5.33
- 3.50
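The rankings above are read from ``results[time]/reports/rankings.txt``. Conceptually, they can be reproduced by ranking
the tuners within every task and then averaging each tuner's ranks, as in the following sketch (made-up numbers, not the
actual report parser):

.. code-block:: python

    # Rough sketch of the average-ranking computation (illustrative data only).
    import pandas as pd

    scores = pd.DataFrame({
        "task":  ["car", "car", "vehicle", "vehicle"],
        "tuner": ["TPE", "SMAC", "TPE", "SMAC"],
        "auc":   [0.91, 0.93, 0.85, 0.84],
    })
    # Rank tuners within each task (higher auc is better, so rank 1 is best),
    # then average the ranks per tuner.
    scores["rank"] = scores.groupby("task")["auc"].rank(ascending=False)
    print(scores.groupby("tuner")["rank"].mean())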
Besides these reports, our script also generates two graphs for each fold of each task: one graph shows the best score received by each tuner up to trial x, and the other shows the score each tuner receives in trial x. These two graphs give some information on how quickly the tuners "converge" to their final solution. For "nnismall", we found that tuners on the random forest model, with the search space defined in ``/examples/trials/benchmarking/automlbenchmark/nni/extensions/NNI/architectures/run_random_forest.py``, generally converge to their final solution after 40 to 60 trials. As there are too many graphs to include in a single report (96 in total), we only present 10 graphs here, together with a sketch (below) of how such curves can be plotted from the recorded scores.
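The graphs themselves are produced by the benchmark scripts. A rough equivalent of the two plots, built from the per-trial
scores that ``run_random_forest``/``run_mlp`` record (``intermediate_scores`` and ``intermediate_best_scores``), could look
like the following sketch; toy values are used so that it runs standalone.

.. code-block:: python

    # Rough sketch of the two per-fold plots (not the benchmark's actual plotting code).
    import matplotlib.pyplot as plt

    intermediate_scores = [0.80, 0.85, 0.83, 0.90, 0.88]        # score received in trial x
    intermediate_best_scores = [0.80, 0.85, 0.85, 0.90, 0.90]   # best score up to trial x
    trials = range(1, len(intermediate_scores) + 1)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(trials, intermediate_best_scores)
    ax1.set_xlabel("trial")
    ax1.set_ylabel("best score so far")
    ax2.plot(trials, intermediate_scores)
    ax2.set_xlabel("trial")
    ax2.set_ylabel("score in trial")
    plt.tight_layout()
    plt.show()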
.. image:: ../img/hpo_benchmark/car_fold1_1.jpg
:target: ../img/hpo_benchmark/car_fold1_1.jpg
:alt:
.. image:: ../img/hpo_benchmark/car_fold1_2.jpg
:target: ../img/hpo_benchmark/car_fold1_2.jpg
:alt:
The previous two graphs are generated for fold 1 of the task "car". In the first graph, we observe that most tuners find a relatively good solution within 40 trials. In this experiment, among all tuners, the DNGOTuner converges fastest to the best solution (within 10 trials). Its best score improved for three times in the entire experiment. In the second graph, we observe that most tuners have their score flucturate between 0.8 and 1 throughout the experiment. However, it seems that the Anneal tuner (green line) is more unstable (having more fluctuations) while the GPTuner has a more stable pattern. This may be interpreted as the Anneal tuner explores more aggressively than the GPTuner and thus its scores for different trials vary a lot. Regardless, although this pattern can to some extent hint a tuner's position on the explore-exploit tradeoff, it is not a comprehensive evaluation of a tuner's effectiveness.
.. image:: ../img/hpo_benchmark/christine_fold0_1.jpg
:target: ../img/hpo_benchmark/christine_fold0_1.jpg
:alt:
.. image:: ../img/hpo_benchmark/christine_fold0_2.jpg
:target: ../img/hpo_benchmark/christine_fold0_2.jpg
:alt:
.. image:: ../img/hpo_benchmark/cnae-9_fold0_1.jpg
:target: ../img/hpo_benchmark/cnae-9_fold0_1.jpg
:alt:
.. image:: ../img/hpo_benchmark/cnae-9_fold0_2.jpg
:target: ../img/hpo_benchmark/cnae-9_fold0_2.jpg
:alt:
.. image:: ../img/hpo_benchmark/credit-g_fold1_1.jpg
:target: ../img/hpo_benchmark/credit-g_fold1_1.jpg
:alt:
.. image:: ../img/hpo_benchmark/credit-g_fold1_2.jpg
:target: ../img/hpo_benchmark/credit-g_fold1_2.jpg
:alt:
.. image:: ../img/hpo_benchmark/titanic_2_fold1_1.jpg
:target: ../img/hpo_benchmark/titanic_2_fold1_1.jpg
:alt:
.. image:: ../img/hpo_benchmark/titanic_2_fold1_2.jpg
:target: ../img/hpo_benchmark/titanic_2_fold1_2.jpg
:alt:
@@ -25,4 +25,4 @@ according to their needs.
    WebUI <Tutorial/WebUI>
    How to Debug <Tutorial/HowToDebug>
    Advanced <hpo_advanced>
    HPO Benchmarks <hpo_benchmark>
---
- name: __defaults__
folds: 2
cores: 2
max_runtime_seconds: 300
- name: Australian
openml_task_id: 146818
- name: blood-transfusion
openml_task_id: 10101
- name: christine
openml_task_id: 168908
- name: credit-g
openml_task_id: 31
- name: kc1
openml_task_id: 3917
- name: kr-vs-kp
openml_task_id: 3
- name: phoneme
openml_task_id: 9952
- name: sylvine
openml_task_id: 168912
---
- name: __defaults__
folds: 2
cores: 2
max_runtime_seconds: 300
- name: car
openml_task_id: 146821
- name: cnae-9
openml_task_id: 9981
- name: dilbert
openml_task_id: 168909
- name: fabert
openml_task_id: 168910
- name: jasmine
openml_task_id: 168911
- name: mfeat-factors
openml_task_id: 12
- name: segment
openml_task_id: 146822
- name: vehicle
openml_task_id: 53
---
- name: __defaults__
folds: 2
cores: 2
max_runtime_seconds: 300
- name: cholesterol
openml_task_id: 2295
- name: liver-disorders
openml_task_id: 52948
- name: kin8nm
openml_task_id: 2280
- name: cpu_small
openml_task_id: 4883
- name: titanic_2
openml_task_id: 211993
- name: boston
openml_task_id: 4857
- name: stock
openml_task_id: 2311
- name: space_ga
openml_task_id: 4835
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
import logging
import sklearn
import time
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.model_selection import cross_val_score
from amlb.benchmark import TaskConfig
from amlb.data import Dataset
from amlb.datautils import impute
from amlb.utils import Timer
from amlb.results import save_predictions_to_file
# Candidate values for hidden_layer_sizes; single-layer options are written as
# one-element tuples (e.g. (16,)) so that every entry is a tuple.
arch_choices = [(16,), (64,), (128,), (256,),
                (16, 16), (64, 64), (128, 128), (256, 256),
                (16, 16, 16), (64, 64, 64), (128, 128, 128), (256, 256, 256),
                (256, 128, 64, 16), (128, 64, 16), (64, 16),
                (16, 64, 128, 256), (16, 64, 128), (16, 64)]
SEARCH_SPACE = {
"hidden_layer_sizes": {"_type":"choice", "_value": arch_choices},
"learning_rate_init": {"_type":"choice", "_value": [0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005, 0.00001]},
"alpha": {"_type":"choice", "_value": [0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001]},
"momentum": {"_type":"uniform","_value":[0, 1]},
"beta_1": {"_type":"uniform","_value":[0, 1]},
"tol": {"_type":"choice", "_value": [0.001, 0.0005, 0.0001, 0.00005, 0.00001]},
"max_iter": {"_type":"randint", "_value": [2, 256]},
}
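# Example (illustration only): a configuration sampled from SEARCH_SPACE, e.g.
#     {"hidden_layer_sizes": (64, 64), "learning_rate_init": 0.001, "alpha": 0.0001,
#      "momentum": 0.9, "beta_1": 0.9, "tol": 0.0001, "max_iter": 64}
# is passed directly to MLPClassifier/MLPRegressor as keyword arguments in run_mlp() below.
# Note that sklearn only uses "momentum" with the 'sgd' solver and "beta_1" with the
# default 'adam' solver.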
def preprocess_mlp(dataset, log):
'''
For MLP:
- For numerical features, normalize them after null imputation.
- For categorical features, use one-hot encoding after null imputation.
'''
cat_columns, num_columns = [], []
shift_amount = 0
for i, f in enumerate(dataset.features):
if f.is_target:
shift_amount += 1
continue
elif f.is_categorical():
cat_columns.append(i - shift_amount)
else:
num_columns.append(i - shift_amount)
cat_pipeline = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
('onehot_encoder', OneHotEncoder()),
])
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')),
('standard_scaler', StandardScaler()),
])
data_pipeline = ColumnTransformer([
('categorical', cat_pipeline, cat_columns),
('numerical', num_pipeline, num_columns),
])
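    # Note: the transformer is fit on the concatenation of the train and test features,
    # presumably so that categorical levels appearing only in the test split still get
    # a one-hot column.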
data_pipeline.fit(np.concatenate([dataset.train.X, dataset.test.X], axis=0))
X_train = data_pipeline.transform(dataset.train.X)
X_test = data_pipeline.transform(dataset.test.X)
return X_train, X_test
def run_mlp(dataset, config, tuner, log):
"""
Using the given tuner, tune a random forest within the given time constraint.
This function uses cross validation score as the feedback score to the tuner.
The search space on which tuners search on is defined above empirically as a global variable.
"""
limit_type, trial_limit = config.framework_params['limit_type'], None
if limit_type == 'ntrials':
trial_limit = int(config.framework_params['trial_limit'])
X_train, X_test = preprocess_mlp(dataset, log)
y_train, y_test = dataset.train.y, dataset.test.y
is_classification = config.type == 'classification'
estimator = MLPClassifier if is_classification else MLPRegressor
best_score, best_params, best_model = None, None, None
score_higher_better = True
tuner.update_search_space(SEARCH_SPACE)
start_time = time.time()
trial_count = 0
intermediate_scores = []
intermediate_best_scores = [] # should be monotonically increasing
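    # Main tuning loop: repeatedly ask the tuner for a configuration, evaluate it with
    # cross-validation, report the score back, and stop on the time or trial limit.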
while True:
try:
param_idx, cur_params = tuner.generate_parameters()
if cur_params is not None and cur_params != {}:
trial_count += 1
train_params = cur_params.copy()
if 'TRIAL_BUDGET' in cur_params:
train_params.pop('TRIAL_BUDGET')
log.info("Trial {}: \n{}\n".format(param_idx, train_params))
cur_model = estimator(random_state=config.seed, **train_params)
# Here score is the output of score() from the estimator
cur_score = cross_val_score(cur_model, X_train, y_train)
cur_score = sum(cur_score) / float(len(cur_score))
if np.isnan(cur_score):
cur_score = 0
log.info("Score: {}\n".format(cur_score))
if best_score is None or (score_higher_better and cur_score > best_score) or (not score_higher_better and cur_score < best_score):
best_score, best_params, best_model = cur_score, cur_params, cur_model
intermediate_scores.append(cur_score)
intermediate_best_scores.append(best_score)
tuner.receive_trial_result(param_idx, cur_params, cur_score)
if limit_type == 'time':
current_time = time.time()
elapsed_time = current_time - start_time
if elapsed_time >= config.max_runtime_seconds:
break
elif limit_type == 'ntrials':
if trial_count >= trial_limit:
break
except:
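            # Exit the tuning loop if the tuner raises (e.g., when it is stopped or
            # cannot generate more parameters).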
break
# This line is required to fully terminate some advisors
tuner.handle_terminate()
log.info("Tuning done, the best parameters are:\n{}\n".format(best_params))
# retrain on the whole dataset
with Timer() as training:
best_model.fit(X_train, y_train)
predictions = best_model.predict(X_test)
probabilities = best_model.predict_proba(X_test) if is_classification else None
return probabilities, predictions, training, y_test, intermediate_scores, intermediate_best_scores
@@ -21,28 +21,38 @@ from amlb.results import save_predictions_to_file

SEARCH_SPACE = {
    "n_estimators": {"_type":"randint", "_value": [4, 2048]},
    "max_depth": {"_type":"choice", "_value": [4, 8, 16, 32, 64, 128, 256, 0]},     # 0 for None
    "min_samples_leaf": {"_type":"randint", "_value": [1, 8]},
    "min_samples_split": {"_type":"randint", "_value": [2, 16]},
    "max_leaf_nodes": {"_type":"randint", "_value": [0, 4096]}                      # 0 for None
}

# change SEARCH_SPACE to the following spaces to experiment on different search spaces

# SEARCH_SPACE_CHOICE = {
#     "n_estimators": {"_type":"choice", "_value": [4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]},
#     "max_depth": {"_type":"choice", "_value": [4, 8, 16, 32, 64, 128, 256, 0]},   # 0 for None
#     "min_samples_leaf": {"_type":"choice", "_value": [1, 2, 4, 8]},
#     "min_samples_split": {"_type":"choice", "_value": [2, 4, 8, 16]},
#     "max_leaf_nodes": {"_type":"choice", "_value": [8, 32, 128, 512, 1024, 2048, 4096, 0]}   # 0 for None
# }

# SEARCH_SPACE_LOG = {
#     "n_estimators": {"_type":"loguniform", "_value": [4, 2048]},
#     "max_depth": {"_type":"choice", "_value": [4, 8, 16, 32, 64, 128, 256, 0]},   # 0 for None
#     "min_samples_leaf": {"_type":"randint", "_value": [1, 8]},
#     "min_samples_split": {"_type":"randint", "_value": [2, 16]},
#     "max_leaf_nodes": {"_type":"loguniform", "_value": [4, 4096]}                 # 0 for None
# }

# SEARCH_SPACE_SIMPLE = {
#     "n_estimators": {"_type":"choice", "_value": [10]},
#     "max_depth": {"_type":"choice", "_value": [5]},
#     "min_samples_leaf": {"_type":"choice", "_value": [8]},
#     "min_samples_split": {"_type":"choice", "_value": [16]},
#     "max_leaf_nodes": {"_type":"choice", "_value": [64]}
# }

def preprocess_random_forest(dataset, log):
@@ -110,33 +120,35 @@ def run_random_forest(dataset, config, tuner, log):
    intermediate_best_scores = []     # should be monotonically increasing

    while True:
        try:
            param_idx, cur_params = tuner.generate_parameters()
            if cur_params is not None and cur_params != {}:
                trial_count += 1
                train_params = cur_params.copy()
                train_params = {x: int(train_params[x]) for x in train_params.keys()}
                if 'TRIAL_BUDGET' in cur_params:
                    train_params.pop('TRIAL_BUDGET')
                if cur_params['max_leaf_nodes'] == 0:
                    train_params.pop('max_leaf_nodes')
                if cur_params['max_depth'] == 0:
                    train_params.pop('max_depth')
                log.info("Trial {}: \n{}\n".format(param_idx, train_params))

                cur_model = estimator(random_state=config.seed, **train_params)

                # Here score is the output of score() from the estimator
                cur_score = cross_val_score(cur_model, X_train, y_train)
                cur_score = sum(cur_score) / float(len(cur_score))
                if np.isnan(cur_score):
                    cur_score = 0

                log.info("Score: {}\n".format(cur_score))
                if best_score is None or (score_higher_better and cur_score > best_score) or (not score_higher_better and cur_score < best_score):
                    best_score, best_params, best_model = cur_score, cur_params, cur_model

                intermediate_scores.append(cur_score)
                intermediate_best_scores.append(best_score)
                tuner.receive_trial_result(param_idx, cur_params, cur_score)

            if limit_type == 'time':
                current_time = time.time()
@@ -2,6 +2,7 @@
# Licensed under the MIT license.

from .architectures.run_random_forest import *
from .architectures.run_mlp import *

def run_experiment(dataset, config, tuner, log):
@@ -11,5 +12,8 @@ def run_experiment(dataset, config, tuner, log):
    if config.framework_params['arch_type'] == 'random_forest':
        return run_random_forest(dataset, config, tuner, log)
    elif config.framework_params['arch_type'] == 'mlp':
        return run_mlp(dataset, config, tuner, log)
    else:
        raise RuntimeError('The requested arch type in framework.yaml is unavailable.')
@@ -6,7 +6,7 @@ NNI:
  project: https://github.com/microsoft/nni
  # type in ['TPE', 'Random', 'Anneal', 'Evolution', 'SMAC', 'GPTuner', 'MetisTuner', 'DNGOTuner', 'Hyperband', 'BOHB']
  # arch_type in ['random_forest', 'mlp']
  # limit_type in ['time', 'ntrials']
  # limit must be an integer
@@ -3,8 +3,8 @@
time=$(date "+%Y%m%d%H%M%S")
installation='automlbenchmark'
outdir="results_$time"
benchmark='nnivalid'   # 'nnismall' 'nnismall-regression' 'nnismall-binary' 'nnismall-multiclass'
serialize=true         # if false, run all experiments together in background

mkdir $outdir $outdir/scorelogs $outdir/reports

@@ -14,7 +14,7 @@ else
    tuner_array=( "$@" )
fi

if [ "$serialize" = true ]; then
    # run tuners serially
    for tuner in ${tuner_array[*]}; do
        echo "python $installation/runbenchmark.py $tuner $benchmark -o $outdir -u nni"