Quick Start
===========

This is a quick start guide for the CLI (command-line) version of LightGBM.

Follow the `Installation Guide <./Installation-Guide.rst>`__ to install LightGBM first.

**List of other helpful links**

-  `Parameters <./Parameters.rst>`__

-  `Parameters Tuning <./Parameters-Tuning.rst>`__

-  `Python-package Quick Start <./Python-Intro.rst>`__

-  `Python API <./Python-API.rst>`__

Training Data Format
--------------------

LightGBM supports input data files in `CSV`_, `TSV`_ and `LibSVM`_ formats.

By default, the label is the first column, and the file has no header.
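For example, a small CSV training file with the label in the first column might look like the following (the values are made up for illustration):

::

    0,1.2,0.3,5
    1,0.7,1.1,2
    0,2.4,0.8,7

Each row is one instance: the first field is the label, and the remaining fields are feature values.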

Categorical Feature Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Update (12/5/2016):

LightGBM can use categorical features directly (without one-hot encoding).
The experiment on `Expo data`_ shows about an 8x speed-up compared with one-hot encoding.

For the setting details, please refer to `Parameters <./Parameters.rst>`__.
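As a sketch of how this is enabled, categorical features can be declared via the ``categorical_feature`` parameter, by column index (or by column name with the ``name:`` prefix when the file has a header). The column indices below are placeholders for illustration:

::

    # in the config file or on the command line
    categorical_feature=1,3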

Weight and Query/Group Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~

LightGBM also supports weighted training; it requires additional `weight data <./Parameters.rst#io-parameters>`__.
For ranking tasks, it also requires additional `query data <./Parameters.rst#io-parameters>`__.

Update (11/3/2016):

1. support for input files with headers

2. the label column, weight column and query/group id column can be specified,
   by either index or name

3. a list of columns to ignore can be specified
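These options can be combined in a config file. The following is a sketch only: it assumes a headered TSV file whose columns are named ``click``, ``imp_weight``, ``qid`` and ``user_id`` (all of these column names are made-up examples):

::

    has_header=true
    label=name:click
    weight=name:imp_weight
    query=name:qid
    ignore_column=name:user_id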

Parameter Quick Look
--------------------

Parameters take the format ``key1=value1 key2=value2 ...``.
Parameters can be set both in the config file and on the command line.

Some important parameters:

- ``config``, default=\ ``""``, type=string, alias=\ ``config_file``

  - path to config file

-  ``task``, default=\ ``train``, type=enum, options=\ ``train``, ``predict``, ``convert_model``

   -  ``train``, alias=\ ``training``, for training

   -  ``predict``, alias=\ ``prediction``, ``test``, for prediction

   -  ``convert_model``, for converting a model file into if-else format; see more information in `Convert model parameters <./Parameters.rst#convert-model-parameters>`__

-  ``application``, default=\ ``regression``, type=enum,
   options=\ ``regression``, ``regression_l1``, ``huber``, ``fair``, ``poisson``, ``quantile``, ``mape``,
   ``binary``, ``multiclass``, ``multiclassova``, ``xentropy``, ``xentlambda``, ``lambdarank``, ``gamma``, ``tweedie``,
   alias=\ ``objective``, ``app``

   -  regression application

      -  ``regression_l2``, L2 loss, alias=\ ``regression``, ``mean_squared_error``, ``mse``, ``l2_root``, ``root_mean_squared_error``, ``rmse``

      -  ``regression_l1``, L1 loss, alias=\ ``mean_absolute_error``, ``mae``

      -  ``huber``, `Huber loss`_

      -  ``fair``, `Fair loss`_

      -  ``poisson``, `Poisson regression`_

      -  ``quantile``, `Quantile regression`_

      -  ``mape``, `MAPE loss`_, alias=\ ``mean_absolute_percentage_error``

      -  ``gamma``, gamma regression with log-link; it might be useful, e.g., for modeling insurance claims severity, or for any target that might be `gamma-distributed`_

      -  ``tweedie``, Tweedie regression with log-link; it might be useful, e.g., for modeling total loss in insurance, or for any target that might be `tweedie-distributed`_

   -  ``binary``, binary `log loss`_ classification application

   -  multi-class classification application

      -  ``multiclass``, `softmax`_ objective function, alias=\ ``softmax``

      -  ``multiclassova``, `One-vs-All`_ binary objective function, alias=\ ``multiclass_ova``, ``ova``, ``ovr``

      -  ``num_class`` should be set as well

   -  cross-entropy application

      -  ``xentropy``, objective function for cross-entropy (with optional linear weights), alias=\ ``cross_entropy``

      -  ``xentlambda``, alternative parameterization of cross-entropy, alias=\ ``cross_entropy_lambda``

      -  the label is anything in the interval [0, 1]

   -  ``lambdarank``, `lambdarank`_ application

      -  the label should be of ``int`` type in lambdarank tasks, and larger numbers represent higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect)

      -  ``label_gain`` can be used to set the gain (weight) of ``int`` labels

      -  all values in ``label`` must be smaller than the number of elements in ``label_gain``

- ``boosting``, default=\ ``gbdt``, type=enum,
  options=\ ``gbdt``, ``rf``, ``dart``, ``goss``,
  alias=\ ``boost``, ``boosting_type``

  - ``gbdt``, traditional Gradient Boosting Decision Tree

  - ``rf``, Random Forest

  - ``dart``, `Dropouts meet Multiple Additive Regression Trees`_

  - ``goss``, Gradient-based One-Side Sampling

- ``data``, default=\ ``""``, type=string, alias=\ ``train``, ``train_data``

  - training data, LightGBM will train from this data

- ``valid``, default=\ ``""``, type=multi-string, alias=\ ``test``, ``valid_data``, ``test_data``

  - validation/test data, LightGBM will output metrics for these data

  - multiple validation data sets are supported, separated by ``,``

- ``num_iterations``, default=\ ``100``, type=int,
  alias=\ ``num_iteration``, ``num_tree``, ``num_trees``, ``num_round``, ``num_rounds``, ``num_boost_round``

  - number of boosting iterations/trees

- ``learning_rate``, default=\ ``0.1``, type=double, alias=\ ``shrinkage_rate``

  - shrinkage rate

- ``num_leaves``, default=\ ``31``, type=int, alias=\ ``num_leaf``

  - number of leaves in one tree

-  ``tree_learner``, default=\ ``serial``, type=enum, options=\ ``serial``, ``feature``, ``data``, ``voting``, alias=\ ``tree``

   -  ``serial``, single machine tree learner

   -  ``feature``, alias=\ ``feature_parallel``, feature parallel tree learner

   -  ``data``, alias=\ ``data_parallel``, data parallel tree learner

   -  ``voting``, alias=\ ``voting_parallel``, voting parallel tree learner

   -  refer to the `Parallel Learning Guide <./Parallel-Learning-Guide.rst>`__ for more details

- ``num_threads``, default=\ ``OpenMP_default``, type=int, alias=\ ``num_thread``, ``nthread``

  - number of threads for LightGBM

  - for the best speed, set this to the number of **real CPU cores**,
    not the number of threads (most CPUs use `hyper-threading`_ to generate 2 threads per CPU core)

  - for parallel learning, do not use all CPU cores, since this will cause poor performance for the network communication

- ``max_depth``, default=\ ``-1``, type=int

  - limit the max depth of the tree model.
    This is used to deal with overfitting when ``#data`` is small.
    The tree still grows leaf-wise

  - ``< 0`` means no limit

- ``min_data_in_leaf``, default=\ ``20``, type=int, alias=\ ``min_data_per_leaf``, ``min_data``, ``min_child_samples``

  - minimal number of data in one leaf. Can be used to deal with over-fitting

- ``min_sum_hessian_in_leaf``, default=\ ``1e-3``, type=double,
  alias=\ ``min_sum_hessian_per_leaf``, ``min_sum_hessian``, ``min_hessian``, ``min_child_weight``

  - minimal sum of the Hessian in one leaf. Like ``min_data_in_leaf``, it can be used to deal with over-fitting

For all parameters, please refer to `Parameters <./Parameters.rst>`__.
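Putting the parameters above together, a minimal training config file for binary classification might look like the following sketch (the data and model file names are placeholders):

::

    task=train
    objective=binary
    data=binary.train
    valid=binary.test
    num_trees=100
    learning_rate=0.1
    num_leaves=31
    output_model=LightGBM_model.txt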

Run LightGBM
------------

For Windows:

::

    lightgbm.exe config=your_config_file other_args ...

For Unix:

::

    ./lightgbm config=your_config_file other_args ...

Parameters can be set both in the config file and on the command line; command-line parameters have higher priority than those in the config file.
For example, the following command line will keep ``num_trees=10`` and ignore the parameter of the same name in the config file.

::

    ./lightgbm config=train.conf num_trees=10
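After training, prediction can be run the same way. The sketch below assumes the model was saved as ``LightGBM_model.txt``; the data and output file names are placeholders:

::

    ./lightgbm task=predict data=test.data input_model=LightGBM_model.txt output_result=prediction.txt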

Examples
--------

-  `Binary Classification <https://github.com/Microsoft/LightGBM/tree/master/examples/binary_classification>`__

-  `Regression <https://github.com/Microsoft/LightGBM/tree/master/examples/regression>`__

-  `Lambdarank <https://github.com/Microsoft/LightGBM/tree/master/examples/lambdarank>`__

-  `Parallel Learning <https://github.com/Microsoft/LightGBM/tree/master/examples/parallel_learning>`__

.. _CSV: https://en.wikipedia.org/wiki/Comma-separated_values

.. _TSV: https://en.wikipedia.org/wiki/Tab-separated_values

.. _LibSVM: https://www.csie.ntu.edu.tw/~cjlin/libsvm/

.. _Expo data: http://stat-computing.org/dataexpo/2009/

.. _Huber loss: https://en.wikipedia.org/wiki/Huber_loss

.. _Fair loss: https://www.kaggle.com/c/allstate-claims-severity/discussion/24520

.. _Poisson regression: https://en.wikipedia.org/wiki/Poisson_regression

.. _Quantile regression: https://en.wikipedia.org/wiki/Quantile_regression

.. _MAPE loss: https://en.wikipedia.org/wiki/Mean_absolute_percentage_error

.. _log loss: https://en.wikipedia.org/wiki/Cross_entropy

.. _softmax: https://en.wikipedia.org/wiki/Softmax_function

.. _One-vs-All: https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest

.. _lambdarank: https://papers.nips.cc/paper/2971-learning-to-rank-with-nonsmooth-cost-functions.pdf

.. _Dropouts meet Multiple Additive Regression Trees: https://arxiv.org/abs/1505.01866

.. _hyper-threading: https://en.wikipedia.org/wiki/Hyper-threading

.. _gamma-distributed: https://en.wikipedia.org/wiki/Gamma_distribution#Applications

.. _tweedie-distributed: https://en.wikipedia.org/wiki/Tweedie_distribution#Applications