Quick Start
===========

This is a quick start guide for LightGBM CLI version.

Follow the `Installation Guide <./Installation-Guide.rst>`__ to install LightGBM first.

**List of other helpful links**

-  `Parameters <./Parameters.rst>`__

-  `Parameters Tuning <./Parameters-Tuning.rst>`__

-  `Python-package Quick Start <./Python-Intro.rst>`__

-  `Python API <./Python-API.rst>`__

Training Data Format
--------------------

LightGBM supports input data files with `CSV`_, `TSV`_ and `LibSVM`_ formats.

Files can be provided with or without a header.

The label column can be specified either by index or by name.

Some columns can be ignored.
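
For illustration, a tiny CSV training file with a header could look like the following (the file name ``train.csv``, the column names, and the values are invented for this example):

::

    id,label,feat1,feat2
    1,0,0.148,1.2
    2,1,0.317,0.7

With such a file, config entries along the lines of ``header=true``, ``label=name:label`` and ``ignore_column=name:id`` would tell LightGBM to read the header, pick the label column by name, and skip the ID column; see `Parameters <./Parameters.rst>`__ for the exact option names and ``name:`` prefix syntax.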

Categorical Feature Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~

LightGBM can use categorical features directly (without one-hot encoding).
The experiment on `Expo data`_ shows about 8x speed-up compared with one-hot encoding.

For the setting details, please refer to `Parameters <./Parameters.rst>`__.
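As a sketch (the ``categorical_feature`` parameter is described in `Parameters <./Parameters.rst>`__; the column names below are invented for this example), categorical columns can be declared in the config file by index or by name:

::

    # by column index
    categorical_feature=0,1,2

    # or by column name, when the data file has a header
    categorical_feature=name:city,gender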

Weight and Query/Group Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~

LightGBM also supports weighted training; it needs additional `weight data <./Parameters.rst#io-parameters>`__.

A ranking task additionally needs `query data <./Parameters.rst#io-parameters>`_.

Weight and query data can also be specified as columns in the training data, in the same manner as the label.
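
As an illustration of the separate-file convention (behavior described in `Parameters <./Parameters.rst#io-parameters>`__; the file names here are examples, assuming the training data is ``train.txt``), weights are read from a file placed next to the training data with one weight per line, and query data holds one group size per line:

::

    # train.txt.weight -- one weight per data row
    1.0
    0.5
    0.8

    # train.txt.query -- sizes of consecutive query groups
    27
    18
    67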

Parameter Quick Look
--------------------

The parameter format is ``key1=value1 key2=value2 ...``.
Parameters can be set both in the config file and on the command line.

Some important parameters:

-  ``config``, default=\ ``""``, type=string, alias=\ ``config_file``

   -  path to config file

-  ``task``, default=\ ``train``, type=enum, options=\ ``train``, ``predict``, ``convert_model``

   -  ``train``, alias=\ ``training``, for training

   -  ``predict``, alias=\ ``prediction``, ``test``, for prediction

   -  ``convert_model``, for converting model file into if-else format, see more information in `Convert model parameters <./Parameters.rst#convert-model-parameters>`__

-  ``application``, default=\ ``regression``, type=enum,
   options=\ ``regression``, ``regression_l1``, ``huber``, ``fair``, ``poisson``, ``quantile``, ``mape``, ``gamma``, ``tweedie``,
   ``binary``, ``multiclass``, ``multiclassova``, ``xentropy``, ``xentlambda``, ``lambdarank``,
   alias=\ ``objective``, ``app``

   -  regression application

      -  ``regression_l2``, L2 loss, alias=\ ``regression``, ``mean_squared_error``, ``mse``, ``l2_root``, ``root_mean_squared_error``, ``rmse``

      -  ``regression_l1``, L1 loss, alias=\ ``mean_absolute_error``, ``mae``

      -  ``huber``, `Huber loss`_

      -  ``fair``, `Fair loss`_

      -  ``poisson``, `Poisson regression`_

      -  ``quantile``, `Quantile regression`_

      -  ``mape``, `MAPE loss`_, alias=\ ``mean_absolute_percentage_error``

      -  ``gamma``, Gamma regression with log-link. It might be useful, e.g., for modeling insurance claims severity, or for any target that might be `gamma-distributed`_

      -  ``tweedie``, Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any target that might be `tweedie-distributed`_

   -  ``binary``, binary `log loss`_ classification application

   -  multi-class classification application

      -  ``multiclass``, `softmax`_ objective function, alias=\ ``softmax``

      -  ``multiclassova``, `One-vs-All`_ binary objective function, alias=\ ``multiclass_ova``, ``ova``, ``ovr``

      -  ``num_class`` should be set as well

   -  cross-entropy application

      -  ``xentropy``, objective function for cross-entropy (with optional linear weights), alias=\ ``cross_entropy``

      -  ``xentlambda``, alternative parameterization of cross-entropy, alias=\ ``cross_entropy_lambda``

      -  the label is anything in the interval [0, 1]

   -  ``lambdarank``, `lambdarank`_ application

      -  the label should be of ``int`` type in lambdarank tasks, and a larger number represents higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect)

      -  ``label_gain`` can be used to set the gain (weight) of ``int`` labels

      -  all values in ``label`` must be smaller than the number of elements in ``label_gain``

-  ``boosting``, default=\ ``gbdt``, type=enum,
   options=\ ``gbdt``, ``rf``, ``dart``, ``goss``,
   alias=\ ``boost``, ``boosting_type``

   -  ``gbdt``, traditional Gradient Boosting Decision Tree

   -  ``rf``, Random Forest

   -  ``dart``, `Dropouts meet Multiple Additive Regression Trees`_

   -  ``goss``, Gradient-based One-Side Sampling

-  ``data``, default=\ ``""``, type=string, alias=\ ``train``, ``train_data``

   -  training data, LightGBM will train from this data

-  ``valid``, default=\ ``""``, type=multi-string, alias=\ ``test``, ``valid_data``, ``test_data``

   -  validation/test data, LightGBM will output metrics for these data

   -  supports multiple validation data, separated by ``,``

-  ``num_iterations``, default=\ ``100``, type=int,
   alias=\ ``num_iteration``, ``num_tree``, ``num_trees``, ``num_round``, ``num_rounds``, ``num_boost_round``, ``n_estimators``

   -  number of boosting iterations

-  ``learning_rate``, default=\ ``0.1``, type=double, alias=\ ``shrinkage_rate``

   -  shrinkage rate

-  ``num_leaves``, default=\ ``31``, type=int, alias=\ ``num_leaf``

   -  number of leaves in one tree

-  ``tree_learner``, default=\ ``serial``, type=enum, options=\ ``serial``, ``feature``, ``data``, ``voting``, alias=\ ``tree``

   -  ``serial``, single machine tree learner

   -  ``feature``, alias=\ ``feature_parallel``, feature parallel tree learner

   -  ``data``, alias=\ ``data_parallel``, data parallel tree learner

   -  ``voting``, alias=\ ``voting_parallel``, voting parallel tree learner

   -  refer to the `Parallel Learning Guide <./Parallel-Learning-Guide.rst>`__ to get more details

-  ``num_threads``, default=\ ``OpenMP_default``, type=int, alias=\ ``num_thread``, ``nthread``

   -  number of threads for LightGBM

   -  for the best speed, set this to the number of **real CPU cores**,
      not the number of threads (most CPUs use `hyper-threading`_ to generate 2 threads per CPU core)

   -  for parallel learning, do not use all CPU cores, because this will cause poor performance for the network

-  ``max_depth``, default=\ ``-1``, type=int

   -  limit the max depth of the tree model.
      This is used to deal with over-fitting when ``#data`` is small.
      The tree still grows leaf-wise

   -  ``< 0`` means no limit

-  ``min_data_in_leaf``, default=\ ``20``, type=int, alias=\ ``min_data_per_leaf``, ``min_data``, ``min_child_samples``

   -  minimal number of data in one leaf. Can be used to deal with over-fitting

-  ``min_sum_hessian_in_leaf``, default=\ ``1e-3``, type=double,
   alias=\ ``min_sum_hessian_per_leaf``, ``min_sum_hessian``, ``min_hessian``, ``min_child_weight``

   -  minimal sum of the Hessian in one leaf. Like ``min_data_in_leaf``, it can be used to deal with over-fitting

For all parameters, please refer to `Parameters <./Parameters.rst>`__.
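
Putting a few of the parameters above together, a minimal training config file might look like this (the file and data names are placeholders; adjust them to your setup):

::

    # train.conf -- minimal example
    task=train
    application=binary
    data=train.txt
    valid=valid.txt
    num_iterations=100
    learning_rate=0.1
    num_leaves=31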

Run LightGBM
------------

For Windows:

::

    lightgbm.exe config=your_config_file other_args ...

For Unix:

::

    ./lightgbm config=your_config_file other_args ...

Parameters can be set both in the config file and on the command line; parameters set on the command line have higher priority than those in the config file.

For example, the following command line will keep ``num_trees=10`` and ignore the same parameter in the config file.

::

    ./lightgbm config=train.conf num_trees=10
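
After training, prediction can be run in the same style; the sketch below assumes the model was saved under the default name ``LightGBM_model.txt`` (the test data file name is a placeholder):

::

    ./lightgbm task=predict data=test.txt input_model=LightGBM_model.txt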

Examples
--------

-  `Binary Classification <https://github.com/Microsoft/LightGBM/tree/master/examples/binary_classification>`__

-  `Regression <https://github.com/Microsoft/LightGBM/tree/master/examples/regression>`__

-  `Lambdarank <https://github.com/Microsoft/LightGBM/tree/master/examples/lambdarank>`__

-  `Parallel Learning <https://github.com/Microsoft/LightGBM/tree/master/examples/parallel_learning>`__

.. _CSV: https://en.wikipedia.org/wiki/Comma-separated_values

.. _TSV: https://en.wikipedia.org/wiki/Tab-separated_values

.. _LibSVM: https://www.csie.ntu.edu.tw/~cjlin/libsvm/

.. _Expo data: http://stat-computing.org/dataexpo/2009/

.. _Huber loss: https://en.wikipedia.org/wiki/Huber_loss

.. _Fair loss: https://www.kaggle.com/c/allstate-claims-severity/discussion/24520

.. _Poisson regression: https://en.wikipedia.org/wiki/Poisson_regression

.. _Quantile regression: https://en.wikipedia.org/wiki/Quantile_regression

.. _MAPE loss: https://en.wikipedia.org/wiki/Mean_absolute_percentage_error

.. _log loss: https://en.wikipedia.org/wiki/Cross_entropy

.. _softmax: https://en.wikipedia.org/wiki/Softmax_function

.. _One-vs-All: https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest

.. _lambdarank: https://papers.nips.cc/paper/2971-learning-to-rank-with-nonsmooth-cost-functions.pdf

.. _Dropouts meet Multiple Additive Regression Trees: https://arxiv.org/abs/1505.01866

.. _hyper-threading: https://en.wikipedia.org/wiki/Hyper-threading

.. _gamma-distributed: https://en.wikipedia.org/wiki/Gamma_distribution#Applications

.. _tweedie-distributed: https://en.wikipedia.org/wiki/Tweedie_distribution#Applications