Commit df5b6a6f authored by xuehui, committed by chicm-ms

Diff feature selection (#1793)

parent db809da2
......@@ -241,17 +241,17 @@ print("Pipeline Score: ", pipeline.score(X_train, y_train))
# Benchmark
`Baseline` means without any feature selection, we directly pass the data to LogisticRegression. For this benchmark, we only use 10% data from the train as test data.
| Dataset | Baseline | GradientFeatureSelector | TreeBasedClassifier | #Train | #Feature |
| ----------- | ------ | ------ | ------- | ------- | -------- |
| colon-cancer | 0.7547 | 0.7368 | 0.7223 | 62 | 2,000 |
| gisette | 0.9725 | 0.89416 | 0.9792 | 6,000 | 5,000 |
| avazu | 0.8834 | N/A | N/A | 40,428,967 | 1,000,000 |
| rcv1 | 0.9644 | 0.7333 | 0.9615 | 20,242 | 47,236 |
| news20.binary | 0.9208 | 0.6870 | 0.9070 | 19,996 | 1,355,191 |
| real-sim | 0.9681 | 0.7969 | 0.9591 | 72,309 | 20,958 |
The benchmark could be download in [here](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
`Baseline` means no feature selection is applied: the data is passed directly to LogisticRegression. For this benchmark, 10% of the training data is held out as test data. For the GradientFeatureSelector top20 column, only the top 20 features are kept. The metric is the mean accuracy on the held-out test data and labels.
| Dataset | Baseline | GradientFeatureSelector top20 | GradientFeatureSelector auto | TreeBasedClassifier | #Train | #Feature |
| ----------- | ------ | ------ | ------- | ------- | -------- | -------- |
| colon-cancer | 0.7547 | 0.7368 | 0.5389 | 0.7223 | 62 | 2,000 |
| gisette | 0.9725 | 0.9241 | 0.9658 | 0.9792 | 6,000 | 5,000 |
| rcv1 | 0.9644 | 0.7333 | 0.9548 | 0.9615 | 20,242 | 47,236 |
| news20.binary | 0.9208 | 0.8780 | 0.8875 | 0.9070 | 19,996 | 1,355,191 |
| real-sim | 0.9681 | 0.7969 | 0.9439 | 0.9591 | 72,309 | 20,958 |
The benchmark datasets can be downloaded from [here](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
)
For the benchmark code, refer to `/examples/feature_engineering/gradient_feature_selector/benchmark_test.py`.
......@@ -94,7 +94,7 @@ if __name__ == "__main__":
    print()
    test_benchmark.run_all_test(pipeline1)
    pipeline2 = make_pipeline(FeatureGradientSelector(n_features=20), LogisticRegression())
    pipeline2 = make_pipeline(FeatureGradientSelector(), LogisticRegression())
    print("Test data selected by FeatureGradientSelector in LogisticRegression.")
    print()
    test_benchmark.run_all_test(pipeline2)
......@@ -103,5 +103,10 @@ if __name__ == "__main__":
    print("Test data selected by TreeClssifier in LogisticRegression.")
    print()
    test_benchmark.run_all_test(pipeline3)
    pipeline4 = make_pipeline(FeatureGradientSelector(n_features=20), LogisticRegression())
    print("Test data selected by FeatureGradientSelector top 20 in LogisticRegression.")
    print()
    test_benchmark.run_all_test(pipeline4)
    print("Done.")
\ No newline at end of file
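
The evaluation protocol described in the README hunk (hold out 10% of the training data, fit a pipeline, report mean accuracy) can be sketched with scikit-learn alone. This is a minimal illustration, not the contents of `benchmark_test.py`: the synthetic dataset, the `random_state` values, `max_iter`, and the use of `SelectFromModel` with `ExtraTreesClassifier` as a stand-in for the tree-based selector are all assumptions. `FeatureGradientSelector` is left out so that the sketch does not depend on NNI.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Assumption: a small synthetic dataset stands in for the LIBSVM datasets.
X, y = make_classification(
    n_samples=500, n_features=50, n_informative=5, random_state=0
)

# Hold out 10% of the training data as test data, as the README describes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

pipelines = {
    # Baseline: no feature selection, data goes straight to LogisticRegression.
    "baseline": make_pipeline(LogisticRegression(max_iter=1000)),
    # Assumed stand-in for the tree-based selector column of the table.
    "tree_based": make_pipeline(
        SelectFromModel(ExtraTreesClassifier(n_estimators=50, random_state=0)),
        LogisticRegression(max_iter=1000),
    ),
}

scores = {}
for name, pipeline in pipelines.items():
    pipeline.fit(X_train, y_train)
    # Pipeline.score delegates to the final estimator: mean accuracy here.
    scores[name] = pipeline.score(X_test, y_test)
    print(f"{name}: {scores[name]:.4f}")
```

Swapping in another selector only requires adding one more entry to `pipelines`, which mirrors how the diff above adds `pipeline4` next to the existing pipelines.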