Python Package Introduction
===========================

This document gives a basic walkthrough of the LightGBM Python package.

***List of other Helpful Links***
* [Python Examples](https://github.com/Microsoft/LightGBM/tree/master/examples/python-guide)
* [Python API](./Python-API.rst)
* [Parameters Tuning](./Parameters-tuning.md)

Install
-------

Install the Python-package dependencies. `setuptools`, `wheel`, `numpy` and `scipy` are required; `scikit-learn` is required for the sklearn interface and is recommended:

```
pip install setuptools wheel numpy scipy scikit-learn -U
```

Refer to the [Python-package](https://github.com/Microsoft/LightGBM/tree/master/python-package) folder for the installation guide.

To verify your installation, try to `import lightgbm` in Python:

```
import lightgbm as lgb
```
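
If the import fails, it can help to see which of the packages named above are actually visible to the interpreter. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper, not part of LightGBM, and `sklearn` is the import name of `scikit-learn`.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of each dependency without importing it.
for name in ("numpy", "scipy", "sklearn", "lightgbm"):
    print(name, "installed" if is_installed(name) else "missing")
```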

Data Interface
--------------

The LightGBM Python module is able to load data from:

- libsvm/tsv/csv text format files
- NumPy 2D array, pandas object
- LightGBM binary file

The data is stored in a ```Dataset``` object.

#### To load a libsvm text file or a LightGBM binary file into ```Dataset```:

```python
train_data = lgb.Dataset('train.svm.bin')
```

#### To load a numpy array into ```Dataset```:

```python
import numpy as np

data = np.random.rand(500, 10)  # 500 entities, each contains 10 features
label = np.random.randint(2, size=500)  # binary target
train_data = lgb.Dataset(data, label=label)
```

#### To load a scipy.sparse.csr_matrix array into ```Dataset```:

```python
import scipy.sparse
csr = scipy.sparse.csr_matrix((dat, (row, col)))  # dat: values; row, col: their indices
train_data = lgb.Dataset(csr)
```

#### Saving ```Dataset``` into a LightGBM binary file will make loading faster:

```python
train_data = lgb.Dataset('train.svm.txt')
train_data.save_binary('train.bin')
```

#### Create validation data:

```python
test_data = train_data.create_valid('test.svm')
```

or 

```python
test_data = lgb.Dataset('test.svm', reference=train_data)
```

In LightGBM, the validation data should be aligned with training data.

#### Specify feature names and categorical features:

```python
train_data = lgb.Dataset(data, label=label, feature_name=['c1', 'c2', 'c3'], categorical_feature=['c3'])
```

LightGBM can use categorical features as input directly. It doesn't need to convert them to one-hot encoding, and is much faster than one-hot encoding (about 8x speed-up).

**Note**: You should convert your categorical features to int type before you construct `Dataset`.

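One way to do the integer conversion is to map each distinct category to a code in order of first appearance. The sketch below uses plain Python; `encode_categories` is a hypothetical helper, and the same mapping would have to be reused for validation and prediction data.

```python
def encode_categories(values):
    """Map each distinct value to an int code, in order of first appearance."""
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


colors = ['red', 'green', 'red', 'blue']
codes, mapping = encode_categories(colors)
print(codes)    # [0, 1, 0, 2]
print(mapping)  # {'red': 0, 'green': 1, 'blue': 2}
```
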
#### Weights can be set when needed:

```python
w = np.random.rand(500, )
train_data = lgb.Dataset(data, label=label, weight=w)
```

or

```python
train_data = lgb.Dataset(data, label=label)
w = np.random.rand(500, )
train_data.set_weight(w)
```

You can also use `Dataset.set_init_score()` to set the initial score, and `Dataset.set_group()` to set group/query data for ranking tasks.

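For ranking tasks, group data is given as the number of rows in each consecutive query. A minimal sketch of deriving those sizes from a per-row query id column follows; it assumes the rows are already ordered so that each query's rows are contiguous, and `group_sizes` and `query_ids` are hypothetical names.

```python
from itertools import groupby


def group_sizes(query_ids):
    """Count consecutive rows that share the same query id."""
    return [len(list(rows)) for _, rows in groupby(query_ids)]


query_ids = ['q1', 'q1', 'q2', 'q3', 'q3', 'q3']
group = group_sizes(query_ids)
print(group)  # [2, 1, 3]
```

The resulting list is the kind of value that would be passed to `Dataset.set_group()`.
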
#### Memory efficient usage

The `Dataset` object in LightGBM is very memory-efficient, because it only needs to save discrete bins.
However, a NumPy array or pandas object is memory-expensive. If you are concerned about memory consumption, you can save memory as follows:

1. Let ```free_raw_data=True``` (default is ```True```) when constructing the ```Dataset```
2. Explicitly set ```raw_data=None``` after the ```Dataset``` has been constructed
3. Call ```gc```

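Steps 2 and 3 above can be sketched without LightGBM. In the block below, `RawData` is a hypothetical stand-in for a large NumPy array or pandas object, and the weak reference exists only to show that the object is really gone once the last reference is dropped. The immediate release relies on CPython's reference counting, which is an assumption about the interpreter.

```python
import gc
import weakref


class RawData:
    """Stand-in for a large NumPy array or pandas object (hypothetical)."""

    def __init__(self, rows):
        self.rows = rows


raw_data = RawData([[0.0] * 10 for _ in range(500)])
probe = weakref.ref(raw_data)  # lets us observe when the object is freed

# ... the Dataset would be constructed from raw_data here (step 1) ...

raw_data = None  # step 2: drop the last reference to the raw data
gc.collect()     # step 3: ask the garbage collector to reclaim memory

print(probe() is None)  # the raw object no longer exists
```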
Setting Parameters
------------------

LightGBM can use either a list of pairs or a dictionary to set [Parameters](./Parameters.md). For instance:

* Booster parameters:

```python
param = {'num_leaves': 31, 'num_trees': 100, 'objective': 'binary'}
param['metric'] = 'auc'
```

* You can also specify multiple eval metrics:

```python
param['metric'] = ['auc', 'binary_logloss']
```
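
The two parameter forms mentioned at the start of this section carry the same information. As a small plain-Python sketch, a list of pairs can be turned into the dictionary form shown above; the name `pairs` is only illustrative.

```python
# The same settings written as a list of (name, value) pairs.
pairs = [('num_leaves', 31), ('num_trees', 100), ('objective', 'binary')]

# Convert to the dictionary form and add the metrics afterwards.
param = dict(pairs)
param['metric'] = ['auc', 'binary_logloss']

print(param)
```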

Training
--------

Training a model requires a parameter list and data set.

```python
num_round = 10
bst = lgb.train(param, train_data, num_round, valid_sets=[test_data])
```

After training, the model can be saved.

```python
bst.save_model('model.txt')
```

The trained model can also be dumped to JSON format.

```python
# dump model
json_model = bst.dump_model()
```
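
Since `dump_model()` returns a Python dictionary, it can be written to disk with the standard `json` module. The sketch below uses a made-up stand-in dictionary so that it runs without a trained model; its keys are hypothetical and do not reflect the real dump layout.

```python
import json
import os
import tempfile

# Stand-in for the dict returned by bst.dump_model(); keys are hypothetical.
json_model = {'example_key': 'example_value', 'trees': [{'leaf_value': 0.1}]}

# Write the dump to a JSON file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'model.json')
with open(path, 'w') as f:
    json.dump(json_model, f, indent=2)

# Read it back to confirm the round trip.
with open(path) as f:
    restored = json.load(f)
```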

A saved model can be loaded.

```python
bst = lgb.Booster(model_file='model.txt')  # init model
```

CV
--

Training with 5-fold CV:

```python
num_round = 10
lgb.cv(param, train_data, num_round, nfold=5)
```
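
To make "5-fold" concrete, here is a plain-Python sketch of how row indices could be partitioned so that each row is used for validation exactly once. It is illustrative only: `kfold_indices` is a hypothetical helper, and `lgb.cv` performs its own splitting, which may differ (for example by shuffling or stratifying).

```python
def kfold_indices(n_rows, nfold):
    """Split range(n_rows) into nfold contiguous (train, valid) index pairs."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // nfold + (1 if i < n_rows % nfold else 0)
                  for i in range(nfold)]
    folds = []
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        folds.append((train, valid))
        start += size
    return folds


for train_idx, valid_idx in kfold_indices(10, 5):
    print(len(train_idx), valid_idx)
```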

Early Stopping
--------------

If you have a validation set, you can use early stopping to find the optimal number of boosting rounds.
Early stopping requires at least one set in `valid_sets`. If there's more than one, it will use all of them.

```python
bst = lgb.train(param, train_data, num_round, valid_sets=valid_sets, early_stopping_rounds=10)
bst.save_model('model.txt', num_iteration=bst.best_iteration)
```

The model will train until the validation score stops improving. The validation error needs to improve at least once every `early_stopping_rounds` rounds for training to continue.

If early stopping occurs, the model will have an additional field: `bst.best_iteration`. Note that `train()` will return a model from the last iteration, not the best one. You can pass `num_iteration=bst.best_iteration` when saving the model to keep only the iterations up to the best one.

This works with both metrics to minimize (L2, log loss, etc.) and to maximize (NDCG, AUC). Note that if you specify more than one evaluation metric, all of them will be used for early stopping.

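The stopping rule described above can be sketched in a few lines of plain Python. This is a simplified illustration for a single metric on a single validation set, not LightGBM's implementation; `best_round`, `scores` and `patience` are hypothetical names, and round indices are 0-based here.

```python
def best_round(scores, patience, higher_is_better=False):
    """Return (best_index, last_index_trained) for per-round validation scores."""
    best_idx = 0
    for i in range(1, len(scores)):
        if higher_is_better:
            improved = scores[i] > scores[best_idx]
        else:
            improved = scores[i] < scores[best_idx]
        if improved:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx, i  # no improvement for `patience` rounds: stop
    return best_idx, len(scores) - 1  # ran through every round


# A metric to minimize (e.g. log loss): best at round 1, stops at round 3.
print(best_round([0.70, 0.65, 0.66, 0.67, 0.64], patience=2))
# A metric to maximize (e.g. AUC): pass higher_is_better=True.
print(best_round([0.60, 0.70, 0.69, 0.68], patience=2, higher_is_better=True))
```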
Prediction
----------

A model that has been trained or loaded can perform predictions on data sets.

```python
# 7 entities, each contains 10 features
data = np.random.rand(7, 10)
ypred = bst.predict(data)
```
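
With the `binary` objective used in this walkthrough, the predictions are probabilities rather than class labels. A minimal sketch of turning them into 0/1 labels follows; the `ypred` values are made up so the block runs on its own, and the 0.5 threshold is an assumption that should be tuned for the task.

```python
# Stand-in for the array returned by bst.predict(data); values are made up.
ypred = [0.91, 0.12, 0.50, 0.73, 0.08, 0.49, 0.66]

threshold = 0.5  # assumed cut-off; probabilities above it become class 1
labels = [1 if p > threshold else 0 for p in ypred]
print(labels)  # [1, 0, 0, 1, 0, 0, 1]
```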

If early stopping is enabled during training, you can get predictions from the best iteration with `bst.best_iteration`:

```python
ypred = bst.predict(data, num_iteration=bst.best_iteration)
```