Feature Engineering with NNI
============================

We are glad to announce the alpha release of the Feature Engineering toolkit on top of NNI. It is still experimental and may evolve based on user feedback. We invite you to use it, give feedback, and contribute.

For now, we support the following feature selectors:


* `GradientFeatureSelector <./gradient_feature_selector.rst>`__
* `GBDTSelector <./gbdt_selector.rst>`__

These selectors are suitable for tabular data, that is, data other than images, speech, and text.

In addition, these selectors only perform feature selection. If you want to:

#. generate high-order combined features on NNI while doing feature selection;
#. leverage your distributed resources;

you could try this :githublink:`example <examples/feature_engineering/auto-feature-engineering>`.

How to use?
-----------

.. code-block:: python

   from sklearn.model_selection import train_test_split

   from nni.algorithms.feature_engineering.gradient_selector import FeatureGradientSelector
   # from nni.algorithms.feature_engineering.gbdt_selector import GBDTSelector

   # load data
   ...
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

   # initialize a selector
   fgs = FeatureGradientSelector(...)
   # fit the selector on the training data
   fgs.fit(X_train, y_train)
   # get the important features
   # (returns the indices of the important features)
   print(fgs.get_selected_features(...))

   ...

To use a built-in selector, first ``import`` it and initialize it. Then call its ``fit`` function to pass the data to the selector, and use ``get_selected_features`` to get the important features. Function parameters may differ between selectors, so check the documentation of each selector before using it.
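
As a minimal sketch of what to do with the result: assuming ``get_selected_features`` returns a list of integer column indices (as the comment in the example above states), the indices can be used to keep only the selected columns. The helper ``select_columns`` and the sample data below are hypothetical and not part of NNI; plain Python lists are used so that the sketch runs without any dependency.

.. code-block:: python

   def select_columns(rows, indices):
       """Keep only the columns listed in ``indices`` (hypothetical helper)."""
       return [[row[i] for i in indices] for row in rows]

   # toy data: 2 samples, 3 features
   X_train = [[0.1, 5.0, 3.2], [0.4, 6.0, 1.1]]
   # stand-in for the list returned by fgs.get_selected_features(...)
   selected = [0, 2]

   X_train_selected = select_columns(X_train, selected)
   print(X_train_selected)

With a NumPy array, the same reduction is usually written as ``X_train[:, selected]``.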

How to customize?
-----------------

NNI provides *state-of-the-art* feature selection algorithms as built-in selectors. NNI also lets you build a feature selector of your own.

If you want to implement a customized feature selector, you need to:


#. Inherit the base ``FeatureSelector`` class
#. Implement the ``fit`` and ``get_selected_features`` functions
#. Integrate with sklearn (Optional)

Here is an example:

**1. Inherit the base FeatureSelector Class**

.. code-block:: python

   from nni.feature_engineering.feature_selector import FeatureSelector

   class CustomizedSelector(FeatureSelector):
       def __init__(self, *args, **kwargs):
           ...

**2. Implement the fit and get_selected_features Functions**

.. code-block:: python

   from nni.feature_engineering.feature_selector import FeatureSelector

   class CustomizedSelector(FeatureSelector):
       def __init__(self, *args, **kwargs):
           ...

       def fit(self, X, y, **kwargs):
           """
           Fit the training data to FeatureSelector

           Parameters
           ----------
           X : array-like numpy matrix
               The training input samples, with shape [n_samples, n_features].
           y : array-like numpy matrix
               The target values (class labels in classification, real numbers in regression), with shape [n_samples].
           """
           self.X = X
           self.y = y
           ...

       def get_selected_features(self):
           """
           Get the important features

           Returns
           -------
           list
               The indices of the important features.
           """
           ...
           return self.selected_features_

       ...
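
To make the two functions concrete, here is a toy selector that keeps the columns with the largest variance. It is only a sketch of the ``fit`` / ``get_selected_features`` contract described above: ``FeatureSelectorStub`` is a hypothetical stand-in for the NNI base class so that the code runs without NNI installed, and the variance criterion is chosen purely for illustration (it ignores ``y`` and is not one of the NNI algorithms).

.. code-block:: python

   from statistics import pvariance


   class FeatureSelectorStub:
       """Hypothetical stand-in for NNI's FeatureSelector base class."""

       def fit(self, X, y, **kwargs):
           raise NotImplementedError

       def get_selected_features(self):
           raise NotImplementedError


   class TopVarianceSelector(FeatureSelectorStub):
       """Toy selector: keep the ``n_features`` columns with the largest variance."""

       def __init__(self, n_features=2):
           self.n_features = n_features
           self.selected_features_ = None

       def fit(self, X, y, **kwargs):
           n_columns = len(X[0])
           # population variance of every column
           variances = [pvariance([row[j] for row in X]) for j in range(n_columns)]
           # column indices ordered from highest to lowest variance
           ranked = sorted(range(n_columns), key=lambda j: variances[j], reverse=True)
           self.selected_features_ = sorted(ranked[:self.n_features])
           return self

       def get_selected_features(self):
           return self.selected_features_


   X = [[1.0, 10.0, 0.0], [1.0, 20.0, 1.0], [1.0, 30.0, 5.0]]
   y = [0, 1, 1]
   selector = TopVarianceSelector(n_features=2).fit(X, y)
   print(selector.get_selected_features())

The first column is constant, so the sketch keeps the other two columns.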

**3. Integrate with Sklearn**

``sklearn.pipeline.Pipeline`` can chain steps such as feature selection, normalization, and classification/regression into a typical machine learning workflow.
The following steps integrate the customized feature selector with sklearn, so that it can be used as a step of a pipeline.


#. Inherit the class ``sklearn.base.BaseEstimator``
#. Implement the ``get_params`` and ``set_params`` functions of ``BaseEstimator``
#. Inherit the ``SelectorMixin`` class from ``sklearn.feature_selection``
#. Implement the ``get_support``, ``transform`` and ``inverse_transform`` functions of ``SelectorMixin``

Here is an example:

**1. Inherit the BaseEstimator Class and Implement its Functions**

.. code-block:: python

   from sklearn.base import BaseEstimator
   from nni.feature_engineering.feature_selector import FeatureSelector

   class CustomizedSelector(FeatureSelector, BaseEstimator):
       def __init__(self, *args, **kwargs):
           ...

       def get_params(self, *args, **kwargs):
           """
           Get parameters for this estimator.
           """
           params = self.__dict__
           params = {key: val for (key, val) in params.items() if not key.endswith('_')}
           return params

       def set_params(self, **params):
           """
           Set the parameters of this estimator.
           """
           for param in params:
               if hasattr(self, param):
                   setattr(self, param, params[param])
           return self
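
The ``endswith('_')`` filter in ``get_params`` follows the scikit-learn convention that attributes learned during ``fit`` carry a trailing underscore, while constructor hyperparameters do not. The sketch below illustrates that behaviour in isolation; ``ParamsDemo`` and its attribute names are hypothetical, and the class deliberately has no NNI or sklearn base class so that it runs on its own.

.. code-block:: python

   class ParamsDemo:
       """Hypothetical class mirroring the get_params/set_params pattern above."""

       def __init__(self, n_features=10, learning_rate=0.1):
           self.n_features = n_features
           self.learning_rate = learning_rate
           self.selected_features_ = None  # fitted state, normally set by fit()

       def get_params(self, *args, **kwargs):
           # hyperparameters only: skip attributes with a trailing underscore
           return {k: v for k, v in self.__dict__.items() if not k.endswith('_')}

       def set_params(self, **params):
           for name, value in params.items():
               # names that are not existing attributes are silently ignored
               if hasattr(self, name):
                   setattr(self, name, value)
           return self


   demo = ParamsDemo()
   demo.selected_features_ = [0, 3]           # pretend fit() has run
   demo.set_params(n_features=20, unknown=1)  # 'unknown' is not an attribute
   print(demo.get_params())

Note that, with this pattern, a misspelled parameter name passed to ``set_params`` is dropped without an error.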

**2. Inherit the SelectorMixin Class and Implement its Functions**

.. code-block:: python

   from sklearn.base import BaseEstimator
   # recent sklearn releases no longer provide the sklearn.feature_selection.base module
   from sklearn.feature_selection import SelectorMixin

   from nni.feature_engineering.feature_selector import FeatureSelector

   class CustomizedSelector(FeatureSelector, BaseEstimator, SelectorMixin):
       def __init__(self, *args, **kwargs):
           ...

       def get_params(self, *args, **kwargs):
           """
           Get parameters for this estimator.
           """
           params = self.__dict__
           params = {key: val for (key, val) in params.items()
                     if not key.endswith('_')}
           return params

       def set_params(self, **params):
           """
           Set the parameters of this estimator.
           """
           for param in params:
               if hasattr(self, param):
                   setattr(self, param, params[param])
           return self

       def get_support(self, indices=False):
           """
           Get a mask, or integer index, of the features selected.

           Parameters
           ----------
           indices : bool
               Default False. If True, the return value will be an array of integers, rather than a boolean mask.

           Returns
           -------
           list
               An index that selects the retained features from a feature vector.
               If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention.
               If indices is True, this is an integer array of shape [# output features] whose values
               are indices into the input feature vector.
           """
           ...
           return mask


       def transform(self, X):
           """Reduce X to the selected features.

           Parameters
           ----------
           X : array
               The input samples, with shape [n_samples, n_features].

           Returns
           -------
           X_r : array
               The input samples with only the selected features, with shape [n_samples, n_selected_features].
           """
           ...
           return X_r


       def inverse_transform(self, X):
           """
           Reverse the transformation operation

           Parameters
           ----------
           X : array
               The reduced samples, with shape [n_samples, n_selected_features].

           Returns
           -------
           X_r : array
               The samples with shape [n_samples, n_original_features].
           """
           ...
           return X_r
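
The three methods above are closely related: ``get_support`` describes which columns are kept, ``transform`` applies that description, and ``inverse_transform`` puts the kept columns back in their original positions. The following dependency-free sketch shows one way the three could fit together. The function names and list-based data are hypothetical, and filling the removed columns with zeros in ``inverse_transform`` is an assumption made for this example.

.. code-block:: python

   def support_mask(selected, n_features):
       """Boolean mask of length n_features, True for the selected columns."""
       chosen = set(selected)
       return [j in chosen for j in range(n_features)]


   def transform(X, mask):
       """Keep only the columns whose mask entry is True."""
       return [[value for value, keep in zip(row, mask) if keep] for row in X]


   def inverse_transform(X_r, mask):
       """Restore the original width; removed columns are filled with 0."""
       restored = []
       for row in X_r:
           values = iter(row)
           restored.append([next(values) if keep else 0 for keep in mask])
       return restored


   mask = support_mask([0, 2], n_features=3)
   X = [[1, 2, 3], [4, 5, 6]]
   X_r = transform(X, mask)
   X_back = inverse_transform(X_r, mask)
   print(mask, X_r, X_back)

Applying ``transform`` to ``X_back`` yields ``X_r`` again, which is a quick consistency check for a custom implementation.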

After integrating with Sklearn, we could use the feature selector as follows:

.. code-block:: python

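   from sklearn.ensemble import ExtraTreesClassifier
   from sklearn.feature_selection import SelectFromModel
   from sklearn.linear_model import LogisticRegression
   from sklearn.pipeline import make_pipeline

   # load data
   ...
   X_train, y_train = ...

   # build a pipeline with the customized selector (XXXSelector is a placeholder)
   pipeline = make_pipeline(XXXSelector(...), LogisticRegression())
   # alternatively, build it with a selector that ships with sklearn
   # pipeline = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression())
   pipeline.fit(X_train, y_train)

   # score
   print("Pipeline Score: ", pipeline.score(X_train, y_train))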

Benchmark
---------

``Baseline`` means that no feature selection is applied and the data is passed directly to LogisticRegression. For this benchmark, 10% of the training data is held out as test data. For the GradientFeatureSelector, only the top 20 features are kept. The metric is the mean accuracy on the given test data and labels.

.. list-table::
   :header-rows: 1
   :widths: auto

   * - Dataset
     - All Features + LR (acc, time, memory)
     - GradientFeatureSelector + LR (acc, time, memory)
     - TreeBasedClassifier + LR (acc, time, memory)
     - #Train
     - #Feature
   * - colon-cancer
     - 0.7547, 890ms, 348MiB
     - 0.7368, 363ms, 286MiB
     - 0.7223, 171ms, 1171MiB
     - 62
     - 2,000
   * - gisette
     - 0.9725, 215ms, 584MiB
     - 0.89416, 446ms, 397MiB
     - 0.9792, 911ms, 234MiB
     - 6,000
     - 5,000
   * - avazu
     - 0.8834, N/A, N/A
     - N/A, N/A, N/A
     - N/A, N/A, N/A
     - 40,428,967
     - 1,000,000
   * - rcv1
     - 0.9644, 557ms, 241MiB
     - 0.7333, 401ms, 281MiB
     - 0.9615, 752ms, 284MiB
     - 20,242
     - 47,236
   * - news20.binary
     - 0.9208, 707ms, 361MiB
     - 0.6870, 565ms, 371MiB
     - 0.9070, 904ms, 364MiB
     - 19,996
     - 1,355,191
   * - real-sim
     - 0.9681, 433ms, 274MiB
     - 0.7969, 251ms, 274MiB
     - 0.9591, 643ms, 367MiB
     - 72,309
     - 20,958
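
The accuracy figures above are mean accuracies on a held-out 10% of the training data. The benchmark script itself is not reproduced here. As a rough illustration of that protocol only, the sketch below holds out a fraction of the samples and computes the fraction of correct predictions. ``holdout_split``, ``mean_accuracy``, the fixed seed, and the toy data are all hypothetical and are not taken from the benchmark code.

.. code-block:: python

   import random


   def holdout_split(X, y, test_fraction=0.1, seed=0):
       """Shuffle the sample indices and hold out ``test_fraction`` of them."""
       indices = list(range(len(X)))
       random.Random(seed).shuffle(indices)
       n_test = max(1, round(len(X) * test_fraction))
       test_idx, train_idx = indices[:n_test], indices[n_test:]

       def pick(data, idx):
           return [data[i] for i in idx]

       return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)


   def mean_accuracy(y_true, y_pred):
       """Fraction of predictions that match the true labels."""
       correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
       return correct / len(y_true)


   X = [[i] for i in range(20)]
   y = [i % 2 for i in range(20)]
   X_train, X_test, y_train, y_test = holdout_split(X, y, test_fraction=0.1)
   print(len(X_train), len(X_test))
   print(mean_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))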


The benchmark datasets can be downloaded `here <https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/>`__.

The benchmark code is available at ``/examples/feature_engineering/gradient_feature_selector/benchmark_test.py``.

Reference and Feedback
----------------------


* `Report a bug <https://github.com/microsoft/nni/issues/new?template=bug-report.rst>`__ for this feature on GitHub.
* `File a feature or improvement request <https://github.com/microsoft/nni/issues/new?template=enhancement.rst>`__ for this feature on GitHub.
* Learn more about :githublink:`Neural Architecture Search with NNI <docs/en_US/NAS/Overview.rst>`.
* Learn more about :githublink:`Model Compression with NNI <docs/en_US/Compression/Overview.rst>`.
* Learn more about :githublink:`Hyperparameter Tuning with NNI <docs/en_US/Tuner/BuiltinTuner.rst>`.