Diff feature selection (#1793)

Commit df5b6a6f authored Dec 02, 2019 by xuehui Committed by chicm-ms Dec 02, 2019
2 changed files
--- a/docs/en_US/FeatureEngineering/Overview.md
+++ b/docs/en_US/FeatureEngineering/Overview.md
@@ -241,17 +241,17 @@ print("Pipeline Score: ", pipeline.score(X_train, y_train))

 # Benchmark

-`Baseline` means without any feature selection, we directly pass the data to LogisticRegression. For this benchmark, we only use 10% data from the train as test data.
-
-| Dataset | Baseline | GradientFeatureSelector | TreeBasedClassifier | #Train | #Feature | 
-| ----------- | ------ | ------ | ------- | ------- | -------- |
-| colon-cancer | 0.7547 | 0.7368 | 0.7223 | 62 | 2,000 |
-| gisette | 0.9725 | 0.89416 | 0.9792 | 6,000 | 5,000 |
-| avazu | 0.8834 | N/A | N/A | 40,428,967 | 1,000,000 |
-| rcv1 | 0.9644 | 0.7333 | 0.9615 | 20,242 | 47,236 |
-| news20.binary | 0.9208 | 0.6870 | 0.9070 | 19,996 | 1,355,191 |
-| real-sim | 0.9681 | 0.7969 | 0.9591 | 72,309 | 20,958 |
-
-The benchmark could be download in [here](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
+`Baseline` means no feature selection is applied: the data is passed directly to LogisticRegression. For this benchmark, 10% of the training data is held out as test data. For the GradientFeatureSelector, the `top20` column keeps only the top 20 features, and the `auto` column leaves `n_features` unset. The metric is the mean accuracy on the given test data and labels.
+
+| Dataset | Baseline | GradientFeatureSelector top20 | GradientFeatureSelector auto | TreeBasedClassifier | #Train | #Feature |
+| ----------- | ------ | ------ | ------- | ------- | -------- | -------- |
+| colon-cancer | 0.7547 | 0.7368 | 0.5389 | 0.7223 | 62 | 2,000 |
+| gisette | 0.9725 | 0.9241 | 0.9658 | 0.9792 | 6,000 | 5,000 |
+| rcv1 | 0.9644 | 0.7333 | 0.9548 | 0.9615 | 20,242 | 47,236 |
+| news20.binary | 0.9208 | 0.8780 | 0.8875 | 0.9070 | 19,996 | 1,355,191 |
+| real-sim | 0.9681 | 0.7969 | 0.9439 | 0.9591 | 72,309 | 20,958 |
+
+The benchmark datasets can be downloaded from [here](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
 )

+For the code, refer to `/examples/feature_engineering/gradient_feature_selector/benchmark_test.py`.
--- a/examples/feature_engineering/gradient_feature_selector/benchmark_test.py
+++ b/examples/feature_engineering/gradient_feature_selector/benchmark_test.py
@@ -94,7 +94,7 @@ if __name__ == "__main__":
    print()
    test_benchmark.run_all_test(pipeline1)

-    pipeline2 = make_pipeline(FeatureGradientSelector(n_features=20), LogisticRegression())
+    pipeline2 = make_pipeline(FeatureGradientSelector(), LogisticRegression())
    print("Test data selected by FeatureGradientSelector in LogisticRegression.")
    print()
    test_benchmark.run_all_test(pipeline2)
@@ -103,5 +103,10 @@ if __name__ == "__main__":
    print("Test data selected by TreeClssifier in LogisticRegression.")
    print()
    test_benchmark.run_all_test(pipeline3)
+
+    pipeline4 = make_pipeline(FeatureGradientSelector(n_features=20), LogisticRegression())
+    print("Test data selected by FeatureGradientSelector top 20 in LogisticRegression.")
+    print()
+    test_benchmark.run_all_test(pipeline4)
    
    print("Done.")
\ No newline at end of file
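
The updated benchmark text describes two evaluation details: 10% of the training data is held out as test data, and the reported metric is mean accuracy. The sketch below illustrates those two steps with the Python standard library only. It is not taken from `benchmark_test.py`; the helper names `holdout_split` and `mean_accuracy`, the shuffling, and the fixed seed are assumptions made for illustration.

```python
import random


def holdout_split(rows, labels, test_fraction=0.1, seed=0):
    """Shuffle the sample indices and hold out `test_fraction` of them as test data.

    Hypothetical helper: the actual benchmark script may split differently.
    """
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


def mean_accuracy(predicted, actual):
    """Return the fraction of predictions that equal the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot score an empty test set")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    # Toy data with 62 samples, the same count as the colon-cancer row.
    X = [[i] for i in range(62)]
    y = [i % 2 for i in range(62)]
    X_train, X_test, y_train, y_test = holdout_split(X, y)
    print(len(X_train), len(X_test))
    print(mean_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

With very small datasets such as colon-cancer, a 10% holdout leaves only a handful of test samples, so a single misclassified sample shifts the accuracy noticeably.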