Unverified Commit 9484efb5 authored by xuehui's avatar xuehui Committed by GitHub

Update the memory usage and time cost in benchmark (#1829)

* change auto-feature-engineering dir

* add dir

* update the README

* add test time and memory

* update docs | update benchmark

* add gitignore

* update benchmark in feature selector(memory and time)

* merge master

* ignore F821 in flake8 | update benchmark number

* update number in benchmark

* fix flake8

* remove the update for the azure-pipeline.yml

* update by comments
parent cb15be49
...@@ -358,7 +358,7 @@ With authors' permission, we listed a set of NNI usage examples and relevant art
* ### **External Repositories** ###
  * Run [ENAS](examples/tuners/enas_nni/README.md) with NNI
  * Run [Neural Network Architecture Search](examples/trials/nas_cifar10/README.md) with NNI
  * [Automatic Feature Engineering](examples/feature_engineering/auto-feature-engineering/README.md) with NNI
  * [Hyperparameter Tuning for Matrix Factorization](https://github.com/microsoft/recommenders/blob/master/notebooks/04_model_select_and_optimize/nni_surprise_svd.ipynb) with NNI
  * [scikit-nni](https://github.com/ksachdeva/scikit-nni) Hyper-parameter search for scikit-learn pipelines using NNI
...@@ -243,13 +243,14 @@ print("Pipeline Score: ", pipeline.score(X_train, y_train))
`Baseline` means no feature selection: the data is passed directly to LogisticRegression. For this benchmark, we hold out 10% of the training data as the test set. For the GradientFeatureSelector, we keep only the top-20 features. The metric is the mean accuracy on the given test data and labels.
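The protocol above can be sketched end-to-end with scikit-learn alone. This is an illustrative sketch, not the benchmark itself: `SelectKBest` stands in for NNI's `GradientFeatureSelector`, and the synthetic dataset is purely for demonstration.

```python
# Sketch of the benchmark protocol: hold out 10% of the data as the test
# split, then compare a plain LogisticRegression baseline against a
# pipeline that keeps only the top-20 features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=1000, n_features=100,
                           n_informative=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1,
                                                    random_state=42)

baseline = make_pipeline(LogisticRegression(max_iter=1000))
top20 = make_pipeline(SelectKBest(f_classif, k=20),  # stand-in selector
                      LogisticRegression(max_iter=1000))

baseline.fit(X_train, y_train)
top20.fit(X_train, y_train)
# The reported metric is mean accuracy on the held-out split.
print("baseline accuracy:", baseline.score(X_test, y_test))
print("top-20 accuracy:  ", top20.score(X_test, y_test))
```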
| Dataset | All Features + LR (acc, time, memory) | GradientFeatureSelector + LR (acc, time, memory) | TreeBasedClassifier + LR (acc, time, memory) | #Train | #Feature |
| ------------- | --------------------- | ---------------------- | ---------------------- | ---------- | --------- |
| colon-cancer | 0.7547, 890ms, 348MiB | 0.7368, 363ms, 286MiB | 0.7223, 171ms, 1171MiB | 62 | 2,000 |
| gisette | 0.9725, 215ms, 584MiB | 0.89416, 446ms, 397MiB | 0.9792, 911ms, 234MiB | 6,000 | 5,000 |
| avazu | 0.8834, N/A, N/A | N/A, N/A, N/A | N/A, N/A, N/A | 40,428,967 | 1,000,000 |
| rcv1 | 0.9644, 557ms, 241MiB | 0.7333, 401ms, 281MiB | 0.9615, 752ms, 284MiB | 20,242 | 47,236 |
| news20.binary | 0.9208, 707ms, 361MiB | 0.6870, 565ms, 371MiB | 0.9070, 904ms, 364MiB | 19,996 | 1,355,191 |
| real-sim | 0.9681, 433ms, 274MiB | 0.7969, 251ms, 274MiB | 0.9591, 643ms, 367MiB | 72,309 | 20,958 |
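The time and memory columns were collected with `datetime` deltas and `memory_profiler` in the benchmark scripts of this commit. A minimal stdlib-only sketch of the same measurement pattern, using Python's `tracemalloc` in place of `memory_profiler` and a stand-in workload instead of a real pipeline:

```python
import datetime
import tracemalloc

def workload():
    # Stand-in for pipeline.fit(...); any memory/CPU-bound call works here.
    data = [i * i for i in range(200000)]
    return sum(data)

tracemalloc.start()
starttime = datetime.datetime.now()
workload()
endtime = datetime.datetime.now()
_, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
tracemalloc.stop()

# total_seconds() handles durations over one second correctly.
elapsed_ms = (endtime - starttime).total_seconds() * 1000
peak_mib = peak / (1024 * 1024)
print("Used time (ms):", elapsed_ms)
print("Peak memory (MiB):", peak_mib)
```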
The benchmark datasets can be downloaded [here](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).
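These LIBSVM files are bz2-compressed svmlight text; after decompression they load directly with `load_svmlight_file`, which is how the benchmark scripts in this commit consume them. A minimal sketch using a tiny inline file in place of a real download:

```python
# Sketch: decompress a .bz2 svmlight file and load it with scikit-learn.
# The inline two-row file stands in for a real dataset download
# (the repo's scripts fetch the .bz2 with urllib.request.urlretrieve).
import bz2
from sklearn.datasets import load_svmlight_file

svm_text = b"1 1:0.5 3:1.2\n-1 2:0.7 3:0.1\n"
with open("toy.bz2", "wb") as f:
    f.write(bz2.compress(svm_text))          # mimic the .bz2 distribution
with bz2.open("toy.bz2", "rb") as f_zip, open("toy.svm", "wb") as f_svm:
    f_svm.write(f_zip.read())                # decompress to plain svmlight

X, y = load_svmlight_file("toy.svm")
print(X.shape, list(y))
```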
...@@ -18,6 +18,10 @@
import bz2
import urllib.request
import numpy as np
import datetime
import line_profiler
profile = line_profiler.LineProfiler()
import os
...@@ -34,7 +38,7 @@ from nni.feature_engineering.gradient_selector import FeatureGradientSelector

class Benchmark():
    def __init__(self, files=None, test_size=0.2):
        self.files = files
        self.test_size = test_size

...@@ -73,40 +77,72 @@ class Benchmark():
        return update_name
@profile
def test_memory(pipeline_name, name, path):
    if pipeline_name == "LR":
        pipeline = make_pipeline(LogisticRegression())
    if pipeline_name == "FGS":
        pipeline = make_pipeline(FeatureGradientSelector(), LogisticRegression())
    if pipeline_name == "Tree":
        pipeline = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression())
    test_benchmark = Benchmark()
    print("Dataset:\t", name)
    print("Pipeline:\t", pipeline_name)
    test_benchmark.run_test(pipeline, name, path)
    print("")

def test_time(pipeline_name, name, path):
    if pipeline_name == "LR":
        pipeline = make_pipeline(LogisticRegression())
    if pipeline_name == "FGS":
        pipeline = make_pipeline(FeatureGradientSelector(), LogisticRegression())
    if pipeline_name == "Tree":
        pipeline = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression())
    test_benchmark = Benchmark()
    print("Dataset:\t", name)
    print("Pipeline:\t", pipeline_name)
    starttime = datetime.datetime.now()
    test_benchmark.run_test(pipeline, name, path)
    endtime = datetime.datetime.now()
    # total_seconds() covers runs longer than one second; .microseconds
    # alone only returns the sub-second component.
    print("Used time (ms): ", (endtime - starttime).total_seconds() * 1000)
    print("")
if __name__ == "__main__":
    LIBSVM_DATA = {
        "rcv1" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2",
        "colon-cancer" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/covtype.libsvm.binary.bz2",
        "gisette" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/gisette_scale.bz2",
        "news20.binary" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/news20.binary.bz2",
        "real-sim" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/real-sim.bz2",
        "webspam" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/webspam_wc_normalized_trigram.svm.bz2",
        "avazu" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.bz2"
    }

    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--pipeline_name', type=str, help='pipeline to test: LR, FGS or Tree.')
    parser.add_argument('--name', type=str, help='dataset name.')
    parser.add_argument('--object', type=str, help='test object: time or memory.')

    args = parser.parse_args()
    pipeline_name = args.pipeline_name
    name = args.name
    test_object = args.object
    path = LIBSVM_DATA[name]

    if test_object == 'time':
        test_time(pipeline_name, name, path)
    elif test_object == 'memory':
        test_memory(pipeline_name, name, path)
    else:
        print("Unsupported test object:\t", test_object)

    print("Done.")
...@@ -30,26 +30,28 @@ from sklearn.feature_selection import SelectFromModel
from nni.feature_engineering.gradient_selector import FeatureGradientSelector

def test():
    url_zip_train = 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2'
    urllib.request.urlretrieve(url_zip_train, filename='train.bz2')

    f_svm = open('train.svm', 'wt')
    with bz2.open('train.bz2', 'rb') as f_zip:
        data = f_zip.read()
        f_svm.write(data.decode('utf-8'))
    f_svm.close()

    X, y = load_svmlight_file('train.svm')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    pipeline = make_pipeline(FeatureGradientSelector(n_epochs=1, n_features=10), LogisticRegression())
    # pipeline = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression())

    pipeline.fit(X_train, y_train)

    print("Pipeline Score: ", pipeline.score(X_train, y_train))

if __name__ == "__main__":
    test()
import os

LIBSVM_DATA = {
    "rcv1" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2",
    "colon-cancer" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/covtype.libsvm.binary.bz2",
    "gisette" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/gisette_scale.bz2",
    "news20.binary" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/news20.binary.bz2",
    "real-sim" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/real-sim.bz2",
    "avazu" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.bz2",
}

pipeline_name = "Tree"
device = "CUDA_VISIBLE_DEVICES=0 "
script = "setsid python -m memory_profiler benchmark_test.py "
test_object = "memory"

for name in LIBSVM_DATA:
    log_name = "_".join([pipeline_name, name, test_object])
    command = device + script + "--pipeline_name " + pipeline_name + " --name " + name + " --object " + test_object + " >" + log_name + " 2>&1 &"
    print("command is\t", command)
    os.system(command)
    print("log is here\t", log_name)

print("Done.")
import os

LIBSVM_DATA = {
    "rcv1" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2",
    "colon-cancer" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/covtype.libsvm.binary.bz2",
    "gisette" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/gisette_scale.bz2",
    "news20.binary" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/news20.binary.bz2",
    "real-sim" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/real-sim.bz2",
    "avazu" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.bz2",
}

pipeline_name = "LR"
device = "CUDA_VISIBLE_DEVICES=0 "
script = "setsid python benchmark_test.py "
test_object = "time"

for name in LIBSVM_DATA:
    log_name = "_".join([pipeline_name, name, test_object])
    command = device + script + "--pipeline_name " + pipeline_name + " --name " + name + " --object " + test_object + " >" + log_name + " 2>&1 &"
    print("command is\t", command)
    os.system(command)
    print("log is here\t", log_name)

print("Done.")