gbdt_example.md 6.4 KB
Newer Older
Yan Ni's avatar
Yan Ni committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
# GBDT in nni
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion as other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. 

Gradient boosting decision tree has many popular implementations, such as [lightgbm](https://github.com/Microsoft/LightGBM), [xgboost](https://github.com/dmlc/xgboost), and [catboost](https://github.com/catboost/catboost), etc. GBDT is a great tool for solving the problem of traditional machine learning problem. Since GBDT is a robust algorithm, it could use in many domains. The better hyper-parameters for GBDT, the better performance you could achieve.

NNI is a great platform for tuning hyper-parameters, you could try various builtin search algorithm in nni and run multiple trials concurrently. 


## 1. Search Space in GBDT
There are many hyper-parameters in GBDT, but what kind of parameters will affect the performance or speed? Based on some practical experience, some suggestion here(Take lightgbm as example):

> * For better accuracy
* `learning_rate`. The range of `learning rate` could be [0.001, 0.9].

* `num_leaves`. `num_leaves` is related to `max_depth`, you don't have to tune both of them.
    
* `bagging_freq`. `bagging_freq` could be [1, 2, 4, 8, 10]

* `num_iterations`. May larger if underfitting.

> * For speed up
* `bagging_fraction`. The range of `bagging_fraction` could be [0.7, 1.0].

* `feature_fraction`. The range of `feature_fraction` could be [0.6, 1.0].
    
* `max_bin`.

> * To avoid overfitting
* `min_data_in_leaf`. This depends on your dataset.

* `min_sum_hessian_in_leaf`. This depend on your dataset.

* `lambda_l1` and `lambda_l2`.

* `min_gain_to_split`.

* `num_leaves`.

Reference link:
[lightgbm](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html) and 
[autoxgoboost](https://github.com/ja-thomas/autoxgboost/blob/master/poster_2018.pdf)

## 2. Task description
Now we come back to our example "auto-gbdt" which run in lightgbm and nni. The data including [train data](https://github.com/Microsoft/nni/blob/master/examples/trials/auto-gbdt/data/regression.train) and [test data](https://github.com/Microsoft/nni/blob/master/examples/trials/auto-gbdt/data/regression.train). 
Given the features and label in train data, we train a GBDT regression model and use it to predict.

## 3. How to run in nni

### 3.1 Prepare your trial code
You need to prepare a basic code as following:
``` python

...

def get_default_parameters():
    ...
    return params


def load_data(train_path='./data/regression.train', test_path='./data/regression.test'):
    '''
    Load or create dataset
    '''
    ...

    return lgb_train, lgb_eval, X_test, y_test

def run(lgb_train, lgb_eval, params, X_test, y_test):
    # train
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=20,
                    valid_sets=lgb_eval,
                    early_stopping_rounds=5)
    # predict
    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)

    # eval 
    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    print('The rmse of prediction is:', rmse)

if __name__ == '__main__':
    lgb_train, lgb_eval, X_test, y_test = load_data()

    PARAMS = get_default_parameters()
    # train
    run(lgb_train, lgb_eval, PARAMS, X_test, y_test)
```

### 3.2 Prepare your search space.
If you like to tune `num_leaves`, `learning_rate`, `bagging_fraction` and `bagging_freq`, 
you could write a [search_space.json](https://github.com/Microsoft/nni/blob/master/examples/trials/auto-gbdt/search_space.json) as follow:
```
{
    "num_leaves":{"_type":"choice","_value":[31, 28, 24, 20]},
    "learning_rate":{"_type":"choice","_value":[0.01, 0.05, 0.1, 0.2]},
    "bagging_fraction":{"_type":"uniform","_value":[0.7, 1.0]},
    "bagging_freq":{"_type":"choice","_value":[1, 2, 4, 8, 10]}
}
```

More support variable type you could reference [here](https://github.com/Microsoft/nni/blob/master/docs/SearchSpaceSpec.md).

### 3.3 Add SDK of nni into your code.
```diff
+import nni
...

def get_default_parameters():
    ...
    return params


def load_data(train_path='./data/regression.train', test_path='./data/regression.test'):
    '''
    Load or create dataset
    '''
    ...

    return lgb_train, lgb_eval, X_test, y_test

def run(lgb_train, lgb_eval, params, X_test, y_test):
    # train
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=20,
                    valid_sets=lgb_eval,
                    early_stopping_rounds=5)
    # predict
    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)

    # eval 
    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    print('The rmse of prediction is:', rmse)
+   nni.report_final_result(rmse)

if __name__ == '__main__':
    lgb_train, lgb_eval, X_test, y_test = load_data()
+   RECEIVED_PARAMS = nni.get_next_parameter()
    PARAMS = get_default_parameters()
+   PARAMS.update(RECEIVED_PARAMS)
    PARAMS = get_default_parameters()
    PARAMS.update(RECEIVED_PARAMS)

    # train
    run(lgb_train, lgb_eval, PARAMS, X_test, y_test)
```

### 3.4 Write a config file and run it.
In the config file, you could set some settings including:

* Experiment setting: `trialConcurrency`, `maxExecDuration`, `maxTrialNum`, `trial gpuNum`, etc.
* Platform setting: `trainingServicePlatform`, etc.
* Path seeting: `searchSpacePath`, `trial codeDir`, etc.
* Algorithm setting: select `tuner` algorithm, `tuner optimize_mode`, etc.

An config.yml as follow:
Yan Ni's avatar
Yan Ni committed
158
```yaml
Yan Ni's avatar
Yan Ni committed
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
authorName: default
experimentName: example_auto-gbdt
trialConcurrency: 1
maxExecDuration: 10h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: local
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: minimize
trial:
  command: python3 main.py
  codeDir: .
  gpuNum: 0
```

Run this experiment with command as follow:
```
nnictl create --config ./config.yml
```