Description: Tree based algorithms can be improved by introducing boosting frameworks.
'LightGBM' is one such framework, based on Ke, Guolin et al. (2017) <https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html>.
This package offers an R interface to work with it.
It is designed to be distributed and efficient with the following advantages:
...
To change the compiler used when installing the CRAN package, you can create a file `~/.R/Makevars` which overrides `CC` (`C` compiler) and `CXX` (`C++` compiler).
For example, to use `gcc-14` instead of `clang` on macOS, you could use something like the following:
```make
# ~/.R/Makevars
CC=gcc-14
CC17=gcc-14
CXX=g++-14
CXX17=g++-14
```
To check the values R is using, run the following:
```shell
R CMD config --all
```
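You can also query a single variable, for example:
```shell
# show just the C++17 compiler R will use
R CMD config CXX17
```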
### Installing from Source with CMake <a id="install"></a>
You need to install git and [CMake](https://cmake.org/) first.
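With those installed, a sketch of the typical flow is to clone the repository (with its submodules) and run the build script from the repository root:
```shell
# clone LightGBM with its submodules, then build and install the R package
git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
Rscript build_r.R
```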
...
@@ -215,7 +221,7 @@ These packages do not require compilation, so they will be faster and easier to
...
@@ -215,7 +221,7 @@ These packages do not require compilation, so they will be faster and easier to
CRAN does not prepare precompiled binaries for Linux, and as of this writing neither does this project.
### Installing from a Pre-compiled lib_lightgbm <a id="lib_lightgbm"></a>
Previous versions of LightGBM offered the ability to first compile the C++ library (`lib_lightgbm.{dll,dylib,so}`) and then build an R-package that wraps it.
...
3. Edit `src/Makevars.in`.
Alternatively, GitHub Actions can re-generate this file for you:
1. Navigate to https://github.com/microsoft/LightGBM/actions/workflows/r_configure.yml
2. Click "Run workflow" (drop-down)
3. Enter the branch from the pull request for the `pr-branch` input
4. Click "Run workflow" (button)
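If you prefer the command line, the same workflow dispatch can be done with the GitHub CLI (a sketch, assuming `gh` is installed and authenticated; the branch name is illustrative):
```shell
# dispatch the r_configure workflow for a pull request branch
gh workflow run r_configure.yml \
    --repo microsoft/LightGBM \
    -f pr-branch=my-feature-branch
```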
**Configuring for Windows**
...
| cat
```
These tests can also be triggered on a pull request branch, using GitHub Actions:
1. Navigate to https://github.com/microsoft/LightGBM/actions/workflows/r_valgrind.yml
2. Click "Run workflow" (drop-down)
3. Enter the branch from the pull request for the `pr-branch` input
4. Enter the pull request ID for the `pr-number` input
5. Click "Run workflow" (button)
...
The paper citation has been adjusted as requested. We were using 'glmnet' as a guide on how to include the URL but maybe they are no longer in compliance with CRAN policies: https://github.com/cran/glmnet/blob/b1a4b50de01e0cd24343959d7cf86452bac17b26/DESCRIPTION
All authors from the original LightGBM paper have been added to Authors@R as `"aut"`. We have also added Microsoft and DropBox, Inc. as `"cph"` (copyright holders). These roles were chosen based on the guidance in https://journal.r-project.org/archive/2012/RJ-2012-009/index.html.
lightgbm's code does use `<<-`, but it does not modify the global environment. The uses of `<<-` in R/lgb.interprete.R and R/callback.R are in functions which are called in an environment created by the lightgbm functions that call them, and this operator is used to reach one level up into the calling function's environment.
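As a standalone sketch of that mechanism (not lightgbm's actual code), `<<-` walks up the enclosing environments and assigns to the first match it finds, leaving the global environment untouched:
```shell
Rscript -e '
outer <- function() {
  total <- 0
  bump <- function() {
    total <<- total + 1  # assigns to the total defined in outer(), not a global
  }
  bump()
  bump()
  total
}
print(outer())                                                 # 2
print(exists("total", envir = globalenv(), inherits = FALSE))  # FALSE
'
```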
We chose to wrap our examples in `\donttest{}` because we found, through testing on https://r-hub.github.io/rhub/ and in our own continuous integration environments, that their run time varies a lot between platforms, and we cannot guarantee that all examples will run in under 5 seconds. We intentionally chose `\donttest{}` over `\dontrun{}` because this item in the R 4.0.0 changelog (https://cran.r-project.org/doc/manuals/r-devel/NEWS.html) seems to indicate that \donttest will no longer be ignored by CRAN's automated checks:
> "`R CMD check --as-cran` now runs \donttest examples (which are run by example()) instead of instructing the tester to do so. This can be temporarily circumvented during development by setting environment variable `_R_CHECK_DONTTEST_EXAMPLES_` to a false value."
> "`R CMD check --as-cran` now runs \donttest examples (which are run by example()) instead of instructing the tester to do so. This can be temporarily circumvented during development by setting environment variable `_R_CHECK_DONTTEST_EXAMPLES_` to a false value."
...
COPYRIGHT HOLDER: Microsoft Corporation
```
Added a citation and link for [the main paper](https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html) in `DESCRIPTION`.
...
Reference Papers
----------------
Yu Shi, Guolin Ke, Zhuoming Chen, Shuxin Zheng, Tie-Yan Liu. "Quantized Training of Gradient Boosting Decision Trees" ([link](https://proceedings.neurips.cc/paper/2022/hash/77911ed9e6e864ca1a3d165b2c3cb258-Abstract.html)). Advances in Neural Information Processing Systems 35 (NeurIPS 2022), pp. 18822-18833.
Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu. "[LightGBM: A Highly Efficient Gradient Boosting Decision Tree](https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html)". Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 3149-3157.
Qi Meng, Guolin Ke, Taifeng Wang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, Tie-Yan Liu. "[A Communication-Efficient Parallel Algorithm for Decision Tree](https://proceedings.neurips.cc/paper/2016/hash/10a5ab2db37feedfdeaab192ead4ac0e-Abstract.html)". Advances in Neural Information Processing Systems 29 (NIPS 2016), pp. 1279-1287.
`described here <./Features.rst#optimal-split-for-categorical-features>`_. This often performs better than one-hot encoding.
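As a hedged illustration (the data file and column indices are arbitrary), categorical columns can be declared directly when training from the command line:

.. code:: shell

    # treat columns 0, 1, and 2 of train.csv as categorical features
    ./lightgbm task=train data=train.csv objective=binary categorical_feature=0,1,2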
...
Cost Efficient Gradient Boosting
--------------------------------
`Cost Efficient Gradient Boosting <https://proceedings.neurips.cc/paper/2017/hash/4fac9ba115140ac4f1c22da82aa0bc7f-Abstract.html>`_ (CEGB) makes it possible to penalise boosting based on the cost of obtaining feature values.
CEGB penalises learning in the following ways:
- Each time a tree is split, a penalty of ``cegb_penalty_split`` is applied.
...
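As a hedged sketch of how the ``cegb_penalty_split`` penalty above is supplied (the value and data file are illustrative), CEGB penalties are passed as ordinary training parameters:

.. code:: shell

    # apply a fixed penalty each time a tree is split
    ./lightgbm task=train data=train.csv objective=binary cegb_penalty_split=0.1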
During the training, the compound scoring function ``s(x, pos)`` is fit with a standard ranking algorithm (e.g., LambdaMART) which boils down to jointly learning the relevance component ``f(x)`` (it is later returned as an unbiased model) and the position factors ``g(pos)`` that help better explain the observed (biased) labels.
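Spelled out, this is the GAM-style additive decomposition (a sketch consistent with the description above):

.. math::

    s(x, pos) = f(x) + g(pos)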
Similar score decomposition ideas have previously been applied for classification & pointwise ranking tasks with assumptions of binary labels and binary relevance (a.k.a. "two-tower" models, refer to the papers: `Towards Disentangling Relevance and Bias in Unbiased Learning to Rank <https://arxiv.org/abs/2212.13937>`_, `PAL: a position-bias aware learning framework for CTR prediction in live recommender systems <https://dl.acm.org/doi/10.1145/3298689.3347033>`_, `A General Framework for Debiasing in CTR Prediction <https://arxiv.org/abs/2112.02767>`_).
In LightGBM, we adapt this idea to general pairwise Learning-to-Rank with arbitrary ordinal relevance labels.
In addition, GAMs have been used in the context of explainable ML (`Accurate Intelligible Models with Pairwise Interactions <https://www.cs.cornell.edu/~yinlou/projects/gam/>`_) to linearly decompose the contribution of each feature (and possibly their pairwise interactions) to the overall score, for subsequent analysis and interpretation of their effects in the trained models.
| Higgs | Binary classification | `link <https://archive.ics.uci.edu/dataset/280/higgs>`__ | 10,500,000 | 28 | last 500,000 samples were used as test set |
| MS LTR | Learning to rank | `link <https://www.microsoft.com/en-us/research/project/mslr/>`__ | 2,270,296 | 137 | {S1,S2,S3} as train set, {S5} as test set |
[11] Huan Zhang, Si Si and Cho-Jui Hsieh. "`GPU Acceleration for Large-scale Tree Boosting`_." SysML Conference, 2018.
.. _LightGBM\: A Highly Efficient Gradient Boosting Decision Tree: https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html
.. _On Grouping for Maximum Homogeneity: https://www.jstor.org/stable/2281952
.. _Optimization of collective communication operations in MPICH: https://www.mpich.org/2012/10/24/optimization-of-collective-communication-operations-in-mpich/
.. _A Communication-Efficient Parallel Algorithm for Decision Tree: https://proceedings.neurips.cc/paper/2016/hash/10a5ab2db37feedfdeaab192ead4ac0e-Abstract.html