# GBDT in nni
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion as other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. 

Gradient boosting decision tree (GBDT) has many popular implementations, such as [lightgbm](https://github.com/Microsoft/LightGBM), [xgboost](https://github.com/dmlc/xgboost), and [catboost](https://github.com/catboost/catboost). GBDT is a great tool for solving traditional machine learning problems, and since it is a robust algorithm, it can be used in many domains. The better the hyper-parameters, the better the performance you can achieve.

NNI is a great platform for tuning hyper-parameters: you can try various built-in search algorithms in NNI and run multiple trials concurrently.

## 1. Search Space in GBDT
There are many hyper-parameters in GBDT, but which of them affect performance or speed? Based on practical experience, here are some suggestions (taking lightgbm as an example); a short sketch of a parameter dict covering these knobs follows the lists below:

> * For better accuracy
* `learning_rate`. The range of `learning_rate` could be [0.001, 0.9].

* `num_leaves`. `num_leaves` is related to `max_depth`, so you don't have to tune both of them.
* `bagging_freq`. `bagging_freq` could be [1, 2, 4, 8, 10].

* `num_iterations`. May be larger if underfitting.

> * For faster speed
* `bagging_fraction`. The range of `bagging_fraction` could be [0.7, 1.0].

* `feature_fraction`. The range of `feature_fraction` could be [0.6, 1.0].
* `max_bin`.

> * To avoid overfitting
* `min_data_in_leaf`. This depends on your dataset.

* `min_sum_hessian_in_leaf`. This depends on your dataset.

* `lambda_l1` and `lambda_l2`.

* `min_gain_to_split`.

* `num_leaves`.
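
For reference, here is a minimal sketch of a LightGBM parameter dict that touches the knobs listed above. The concrete values are illustrative placeholders, not recommendations:

```python
# Illustrative LightGBM parameters covering the knobs discussed above.
# The values are placeholders; tune them for your own dataset.
params = {
    'objective': 'regression',
    'metric': 'rmse',
    # accuracy
    'learning_rate': 0.05,           # usually searched in [0.001, 0.9]
    'num_leaves': 31,                # related to max_depth
    'bagging_freq': 4,               # e.g. one of [1, 2, 4, 8, 10]
    'num_iterations': 100,           # may be larger if underfitting
    # speed
    'bagging_fraction': 0.8,         # usually in [0.7, 1.0]
    'feature_fraction': 0.9,         # usually in [0.6, 1.0]
    'max_bin': 255,
    # overfitting
    'min_data_in_leaf': 20,
    'min_sum_hessian_in_leaf': 1e-3,
    'lambda_l1': 0.0,
    'lambda_l2': 0.0,
    'min_gain_to_split': 0.0,
}
```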

Reference links:
[lightgbm](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html) and [autoxgboost](https://github.com/ja-thomas/autoxgboost/blob/master/poster_2018.pdf)

## 2. Task description
Now we come back to our example "auto-gbdt", which runs with lightgbm and NNI. The data includes [train data](https://github.com/Microsoft/nni/blob/master/examples/trials/auto-gbdt/data/regression.train) and [test data](https://github.com/Microsoft/nni/blob/master/examples/trials/auto-gbdt/data/regression.test).
Given the features and labels in the train data, we train a GBDT regression model and use it to predict on the test data.

## 3. How to run in nni

### 3.1 Prepare your trial code
You need to prepare basic trial code as follows:

```python
import lightgbm as lgb
from sklearn.metrics import mean_squared_error

...

def get_default_parameters():
    ...
    return params


def load_data(train_path='./data/regression.train', test_path='./data/regression.test'):
    '''
    Load or create dataset
    '''
    ...

    return lgb_train, lgb_eval, X_test, y_test

def run(lgb_train, lgb_eval, params, X_test, y_test):
    # train
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=20,
                    valid_sets=lgb_eval,
                    early_stopping_rounds=5)
    # predict
    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)

    # eval
    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    print('The rmse of prediction is:', rmse)

if __name__ == '__main__':
    lgb_train, lgb_eval, X_test, y_test = load_data()

    PARAMS = get_default_parameters()
    # train
    run(lgb_train, lgb_eval, PARAMS, X_test, y_test)
```
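
The bodies of `load_data()` and `get_default_parameters()` are elided above. As a point of reference, a minimal sketch of `load_data()` could look like the following, assuming the data files follow the LightGBM regression example format (tab-separated values with the label in the first column); check the files in `./data/` before reusing it:

```python
import lightgbm as lgb
import pandas as pd


def load_data(train_path='./data/regression.train', test_path='./data/regression.test'):
    '''Load the dataset (sketch, assuming tab-separated files with the label in column 0).'''
    df_train = pd.read_csv(train_path, header=None, sep='\t')
    df_test = pd.read_csv(test_path, header=None, sep='\t')

    y_train = df_train[0].values
    y_test = df_test[0].values
    X_train = df_train.drop(0, axis=1).values
    X_test = df_test.drop(0, axis=1).values

    # Wrap the arrays into LightGBM datasets; the eval set references the train set
    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

    return lgb_train, lgb_eval, X_test, y_test
```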

### 3.2 Prepare your search space.
If you would like to tune `num_leaves`, `learning_rate`, `bagging_fraction` and `bagging_freq`, you could write a [search_space.json](https://github.com/Microsoft/nni/blob/master/examples/trials/auto-gbdt/search_space.json) as follows:

```json
{
    "num_leaves":{"_type":"choice","_value":[31, 28, 24, 20]},
    "learning_rate":{"_type":"choice","_value":[0.01, 0.05, 0.1, 0.2]},
    "bagging_fraction":{"_type":"uniform","_value":[0.7, 1.0]},
    "bagging_freq":{"_type":"choice","_value":[1, 2, 4, 8, 10]}
}
```
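
At runtime, each trial receives one point sampled from this space. For example, the dict returned by `nni.get_next_parameter()` (see section 3.3) could look like the following; the concrete values are only illustrative:

```python
# One possible sample drawn from the search space above (illustrative values)
received_params = {
    "num_leaves": 24,          # sampled from the "choice" list
    "learning_rate": 0.05,     # sampled from the "choice" list
    "bagging_fraction": 0.85,  # sampled uniformly from [0.7, 1.0]
    "bagging_freq": 4          # sampled from the "choice" list
}
```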

More supported variable types can be found [here](https://github.com/Microsoft/nni/blob/master/docs/SearchSpaceSpec.md).

### 3.3 Add SDK of nni into your code.
```diff
+import nni
import lightgbm as lgb
from sklearn.metrics import mean_squared_error

...

def get_default_parameters():
    ...
    return params


def load_data(train_path='./data/regression.train', test_path='./data/regression.test'):
    '''
    Load or create dataset
    '''
    ...

    return lgb_train, lgb_eval, X_test, y_test

def run(lgb_train, lgb_eval, params, X_test, y_test):
    # train
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=20,
                    valid_sets=lgb_eval,
                    early_stopping_rounds=5)
    # predict
    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)

    # eval
    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    print('The rmse of prediction is:', rmse)
+   nni.report_final_result(rmse)

if __name__ == '__main__':
    lgb_train, lgb_eval, X_test, y_test = load_data()
+   RECEIVED_PARAMS = nni.get_next_parameter()
    PARAMS = get_default_parameters()
+   PARAMS.update(RECEIVED_PARAMS)

    # train
    run(lgb_train, lgb_eval, PARAMS, X_test, y_test)
```
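
To show how the tuned values take effect: `get_default_parameters()` returns a plain dict of LightGBM parameters, and `PARAMS.update(RECEIVED_PARAMS)` overwrites only the keys that NNI sampled from the search space. A minimal sketch, with illustrative default values that are not necessarily those of the original example:

```python
def get_default_parameters():
    # Baseline LightGBM parameters; any key that also appears in
    # search_space.json will be overwritten by the value NNI sends to this trial.
    params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': 'rmse',
        'num_leaves': 31,
        'learning_rate': 0.05,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': 0
    }
    return params
```

A parameter that is not part of the search space simply keeps its default value after the merge.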

### 3.4 Write a config file and run it.
In the config file, you could configure settings including:

* Experiment setting: `trialConcurrency`, `maxExecDuration`, `maxTrialNum`, `trial gpuNum`, etc.
* Platform setting: `trainingServicePlatform`, etc.
* Path setting: `searchSpacePath`, `trial codeDir`, etc.
* Algorithm setting: select `tuner` algorithm, `tuner optimize_mode`, etc.

A config.yml example is as follows:
```yaml
authorName: default
experimentName: example_auto-gbdt
trialConcurrency: 1
maxExecDuration: 10h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: local
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: minimize
trial:
  command: python3 main.py
  codeDir: .
  gpuNum: 0
```

Run this experiment with the following command:

```bash
nnictl create --config ./config.yml
```