Unverified Commit 9484efb5 authored by xuehui's avatar xuehui Committed by GitHub

Update the memory usage and time cost in benchmark (#1829)

* change auto-feature-engineering dir

* add dir

* update the README

* add test time and memory

* update docs | update benchmark

* add gitignore

* update benchmark in feature selector(memory and time)

* merge master

* ignore F821 in flake8 | update benchmark number

* update number in benchmark

* fix flake8

* remove the update for the azure-pipeline.yml

* update by comments
parent cb15be49
...@@ -358,7 +358,7 @@ With authors' permission, we listed a set of NNI usage examples and relevant art
* ### **External Repositories** ###
  * Run [ENAS](examples/tuners/enas_nni/README.md) with NNI
  * Run [Neural Network Architecture Search](examples/trials/nas_cifar10/README.md) with NNI
  * [Automatic Feature Engineering](examples/feature_engineering/auto-feature-engineering/README.md) with NNI
  * [Hyperparameter Tuning for Matrix Factorization](https://github.com/microsoft/recommenders/blob/master/notebooks/04_model_select_and_optimize/nni_surprise_svd.ipynb) with NNI
  * [scikit-nni](https://github.com/ksachdeva/scikit-nni) Hyper-parameter search for scikit-learn pipelines using NNI
...@@ -243,13 +243,14 @@ print("Pipeline Score: ", pipeline.score(X_train, y_train))
`Baseline` means no feature selection: the data is passed directly to LogisticRegression. For this benchmark, we hold out 10% of the training data as the test set. For the GradientFeatureSelector, we keep only the top-20 features. The metric is the mean accuracy on the given test data and labels.
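The protocol above can be sketched end-to-end with scikit-learn alone. This is an illustrative sketch, not the benchmark itself: `SelectKBest` stands in for NNI's `GradientFeatureSelector`, and the synthetic dataset is purely for demonstration.

```python
# Sketch of the benchmark protocol: hold out 10% of the data as the test
# split, then compare a plain LogisticRegression baseline against a
# pipeline that keeps only the top-20 features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=1000, n_features=100,
                           n_informative=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1,
                                                    random_state=42)

baseline = make_pipeline(LogisticRegression(max_iter=1000))
top20 = make_pipeline(SelectKBest(f_classif, k=20),  # stand-in selector
                      LogisticRegression(max_iter=1000))

baseline.fit(X_train, y_train)
top20.fit(X_train, y_train)
# The reported metric is mean accuracy on the held-out split.
print("baseline accuracy:", baseline.score(X_test, y_test))
print("top-20 accuracy:  ", top20.score(X_test, y_test))
```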
| Dataset | All Features + LR (acc, time, memory) | GradientFeatureSelector + LR (acc, time, memory) | TreeBasedClassifier + LR (acc, time, memory) | #Train | #Feature |
| ------------- | --------------------- | ---------------------- | ---------------------- | ---------- | --------- |
| colon-cancer | 0.7547, 890ms, 348MiB | 0.7368, 363ms, 286MiB | 0.7223, 171ms, 1171MiB | 62 | 2,000 |
| gisette | 0.9725, 215ms, 584MiB | 0.89416, 446ms, 397MiB | 0.9792, 911ms, 234MiB | 6,000 | 5,000 |
| avazu | 0.8834, N/A, N/A | N/A, N/A, N/A | N/A, N/A, N/A | 40,428,967 | 1,000,000 |
| rcv1 | 0.9644, 557ms, 241MiB | 0.7333, 401ms, 281MiB | 0.9615, 752ms, 284MiB | 20,242 | 47,236 |
| news20.binary | 0.9208, 707ms, 361MiB | 0.6870, 565ms, 371MiB | 0.9070, 904ms, 364MiB | 19,996 | 1,355,191 |
| real-sim | 0.9681, 433ms, 274MiB | 0.7969, 251ms, 274MiB | 0.9591, 643ms, 367MiB | 72,309 | 20,958 |
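The time and memory columns were collected with `datetime` deltas and `memory_profiler` in the benchmark scripts of this commit. A minimal stdlib-only sketch of the same measurement pattern, using Python's `tracemalloc` in place of `memory_profiler` and a stand-in workload instead of a real pipeline:

```python
import datetime
import tracemalloc

def workload():
    # Stand-in for pipeline.fit(...); any memory/CPU-bound call works here.
    data = [i * i for i in range(200000)]
    return sum(data)

tracemalloc.start()
starttime = datetime.datetime.now()
workload()
endtime = datetime.datetime.now()
_, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
tracemalloc.stop()

# total_seconds() handles durations over one second correctly.
elapsed_ms = (endtime - starttime).total_seconds() * 1000
peak_mib = peak / (1024 * 1024)
print("Used time (ms):", elapsed_ms)
print("Peak memory (MiB):", peak_mib)
```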
The benchmark datasets can be downloaded [here](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).
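These LIBSVM files are bz2-compressed svmlight text; after decompression they load directly with `load_svmlight_file`, which is how the benchmark scripts in this commit consume them. A minimal sketch using a tiny inline file in place of a real download:

```python
# Sketch: decompress a .bz2 svmlight file and load it with scikit-learn.
# The inline two-row file stands in for a real dataset download
# (the repo's scripts fetch the .bz2 with urllib.request.urlretrieve).
import bz2
from sklearn.datasets import load_svmlight_file

svm_text = b"1 1:0.5 3:1.2\n-1 2:0.7 3:0.1\n"
with open("toy.bz2", "wb") as f:
    f.write(bz2.compress(svm_text))          # mimic the .bz2 distribution
with bz2.open("toy.bz2", "rb") as f_zip, open("toy.svm", "wb") as f_svm:
    f_svm.write(f_zip.read())                # decompress to plain svmlight

X, y = load_svmlight_file("toy.svm")
print(X.shape, list(y))
```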
...@@ -18,6 +18,10 @@
import bz2
import urllib.request
import numpy as np
import datetime
import line_profiler
profile = line_profiler.LineProfiler()
import os
...@@ -34,7 +38,7 @@ from nni.feature_engineering.gradient_selector import FeatureGradientSelector

class Benchmark():
    def __init__(self, files=None, test_size=0.2):
        self.files = files
        self.test_size = test_size

...@@ -73,40 +77,72 @@ class Benchmark():
        return update_name
@profile
def test_memory(pipeline_name, name, path):
    if pipeline_name == "LR":
        pipeline = make_pipeline(LogisticRegression())
    if pipeline_name == "FGS":
        pipeline = make_pipeline(FeatureGradientSelector(), LogisticRegression())
    if pipeline_name == "Tree":
        pipeline = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression())
    test_benchmark = Benchmark()
    print("Dataset:\t", name)
    print("Pipeline:\t", pipeline_name)
    test_benchmark.run_test(pipeline, name, path)
    print("")

def test_time(pipeline_name, name, path):
    if pipeline_name == "LR":
        pipeline = make_pipeline(LogisticRegression())
    if pipeline_name == "FGS":
        pipeline = make_pipeline(FeatureGradientSelector(), LogisticRegression())
    if pipeline_name == "Tree":
        pipeline = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression())
    test_benchmark = Benchmark()
    print("Dataset:\t", name)
    print("Pipeline:\t", pipeline_name)
    starttime = datetime.datetime.now()
    test_benchmark.run_test(pipeline, name, path)
    endtime = datetime.datetime.now()
    # total_seconds() covers runs longer than one second; .microseconds
    # alone only returns the sub-second component.
    print("Used time (ms): ", (endtime - starttime).total_seconds() * 1000)
    print("")
if __name__ == "__main__":
    LIBSVM_DATA = {
        "rcv1" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2",
        "colon-cancer" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/covtype.libsvm.binary.bz2",
        "gisette" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/gisette_scale.bz2",
        "news20.binary" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/news20.binary.bz2",
        "real-sim" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/real-sim.bz2",
        "webspam" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/webspam_wc_normalized_trigram.svm.bz2",
        "avazu" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.bz2"
    }

    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--pipeline_name', type=str, help='pipeline to test: LR, FGS or Tree.')
    parser.add_argument('--name', type=str, help='dataset name.')
    parser.add_argument('--object', type=str, help='test object: time or memory.')

    args = parser.parse_args()
    pipeline_name = args.pipeline_name
    name = args.name
    test_object = args.object
    path = LIBSVM_DATA[name]

    if test_object == 'time':
        test_time(pipeline_name, name, path)
    elif test_object == 'memory':
        test_memory(pipeline_name, name, path)
    else:
        print("Unsupported test object:\t", test_object)

    print("Done.")
...@@ -30,26 +30,28 @@ from sklearn.feature_selection import SelectFromModel
from nni.feature_engineering.gradient_selector import FeatureGradientSelector

def test():
    url_zip_train = 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2'
    urllib.request.urlretrieve(url_zip_train, filename='train.bz2')

    f_svm = open('train.svm', 'wt')
    with bz2.open('train.bz2', 'rb') as f_zip:
        data = f_zip.read()
        f_svm.write(data.decode('utf-8'))
    f_svm.close()

    X, y = load_svmlight_file('train.svm')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    pipeline = make_pipeline(FeatureGradientSelector(n_epochs=1, n_features=10), LogisticRegression())
    # pipeline = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression())

    pipeline.fit(X_train, y_train)

    print("Pipeline Score: ", pipeline.score(X_train, y_train))

if __name__ == "__main__":
    test()
import os

LIBSVM_DATA = {
    "rcv1" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2",
    "colon-cancer" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/covtype.libsvm.binary.bz2",
    "gisette" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/gisette_scale.bz2",
    "news20.binary" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/news20.binary.bz2",
    "real-sim" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/real-sim.bz2",
    "avazu" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.bz2",
}

pipeline_name = "Tree"
device = "CUDA_VISIBLE_DEVICES=0 "
script = "setsid python -m memory_profiler benchmark_test.py "
test_object = "memory"

for name in LIBSVM_DATA:
    log_name = "_".join([pipeline_name, name, test_object])
    command = device + script + "--pipeline_name " + pipeline_name + " --name " + name + " --object " + test_object + " >" + log_name + " 2>&1 &"
    print("command is\t", command)
    os.system(command)
    print("log is here\t", log_name)

print("Done.")
import os

LIBSVM_DATA = {
    "rcv1" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2",
    "colon-cancer" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/covtype.libsvm.binary.bz2",
    "gisette" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/gisette_scale.bz2",
    "news20.binary" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/news20.binary.bz2",
    "real-sim" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/real-sim.bz2",
    "avazu" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.bz2",
}

pipeline_name = "LR"
device = "CUDA_VISIBLE_DEVICES=0 "
script = "setsid python benchmark_test.py "
test_object = "time"

for name in LIBSVM_DATA:
    log_name = "_".join([pipeline_name, name, test_object])
    command = device + script + "--pipeline_name " + pipeline_name + " --name " + name + " --object " + test_object + " >" + log_name + " 2>&1 &"
    print("command is\t", command)
    os.system(command)
    print("log is here\t", log_name)

print("Done.")