Commit eded794e authored by James Lamb's avatar James Lamb Committed by Qiwei Ye

[R-package] CRAN fixes (#1499)

* Fixed typos in docs

* Fixed inconsistencies in documentation

* Updated strategy for registering routines

* Fixed issues caused by smashing multiple functions into one Rd

* Fixed issues with documentation

* Removed VignetteBuilder and updated Rbuildignore

* Added R build artefacts to gitignore

* Added namespacing on data.table set function. Updated handling of CMakeLists file to get around CRAN check.

* Updated build instructions

* Added R build script

* Removed build_r.sh script and updated R-package install instructions
parent 80a9a941
......@@ -382,3 +382,11 @@ lightgbm.model
# duplicate version file
python-package/lightgbm/VERSION.txt
.Rproj.user
# R build artefacts
R-package/src/CMakeLists.txt
R-package/src/lib_lightgbm.so.dSYM/
R-package/src/src/
lightgbm_r/*
lightgbm*.tar.gz
lightgbm.Rcheck/
^build_package.R$
\.gitkeep$
# Objects created by compilation
\.o$
\.so$
\.dll$
\.out$
\.bin$
# Code copied in at build time
^src/CMakeLists.txt$
......@@ -7,7 +7,7 @@ Authors@R: c(
person("Guolin", "Ke", email = "guolin.ke@microsoft.com", role = c("aut", "cre")),
person("Damien", "Soukhavong", email = "damien.soukhavong@skema.edu", role = c("ctb")),
person("Yachen", "Yan", role = c("ctb")),
person("James", "Lamb", role = c("ctb"))
person("James", "Lamb", email="james.lamb@uptake.com", role = c("ctb"))
)
Description: Tree based algorithms can be improved by introducing boosting frameworks. LightGBM is one such framework, and this package offers an R interface to work with it.
It is designed to be distributed and efficient with the following advantages:
......@@ -21,7 +21,6 @@ Description: Tree based algorithms can be improved by introducing boosting frame
License: MIT + file LICENSE
URL: https://github.com/Microsoft/LightGBM
BugReports: https://github.com/Microsoft/LightGBM/issues
VignetteBuilder: knitr
Suggests:
Ckmeans.1d.dp (>= 3.3.1),
DiagrammeR (>= 0.8.1),
......@@ -33,7 +32,7 @@ Suggests:
testthat,
vcd (>= 1.3)
Depends:
R (>= 3.0),
R (>= 3.4),
R6 (>= 2.0)
Imports:
data.table (>= 1.9.6),
......
......@@ -49,4 +49,4 @@ importFrom(magrittr,"%T>%")
importFrom(magrittr,extract)
importFrom(magrittr,inset)
importFrom(methods,is)
useDynLib(lib_lightgbm)
useDynLib(lib_lightgbm , .registration = TRUE)
CB_ENV <- R6Class(
#' @importFrom R6 R6Class
CB_ENV <- R6::R6Class(
"lgb.cb_env",
cloneable = FALSE,
public = list(
......
Booster <- R6Class(
#' @importFrom R6 R6Class
Booster <- R6::R6Class(
classname = "lgb.Booster",
cloneable = FALSE,
public = list(
......@@ -654,13 +655,15 @@ Booster <- R6Class(
#'
#' @rdname predict.lgb.Booster
#' @export
predict.lgb.Booster <- function(object, data,
predict.lgb.Booster <- function(object,
data,
num_iteration = NULL,
rawscore = FALSE,
predleaf = FALSE,
predcontrib = FALSE,
header = FALSE,
reshape = FALSE, ...) {
reshape = FALSE,
...) {
# Check booster existence
if (!lgb.is.Booster(object)) {
......
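For reference, a call against the reformatted `predict.lgb.Booster` signature above might look like the following sketch (`model` and `test_matrix` are hypothetical placeholders for a trained `lgb.Booster` and a numeric feature matrix):

```r
# Raw margin scores using only the first 10 iterations;
# 'model' and 'test_matrix' are assumed to already exist.
preds <- predict(model,
                 test_matrix,
                 num_iteration = 10,
                 rawscore = TRUE)
```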
#' @importFrom methods is
Dataset <- R6Class(
#' @importFrom R6 R6Class
Dataset <- R6::R6Class(
classname = "lgb.Dataset",
cloneable = FALSE,
public = list(
......@@ -854,7 +856,7 @@ dimnames.lgb.Dataset <- function(x) {
#' Slice a dataset
#'
#' Get a new \code{lgb.Dataset} containing the specified rows of
#' orginal lgb.Dataset object
#' original lgb.Dataset object
#'
#' @param dataset Object of class "lgb.Dataset"
#' @param idxset an integer vector of indices of rows needed
......
#' @importFrom methods is
Predictor <- R6Class(
#' @importFrom R6 R6Class
Predictor <- R6::R6Class(
classname = "lgb.Predictor",
cloneable = FALSE,
public = list(
......
CVBooster <- R6Class(
#' @importFrom R6 R6Class
CVBooster <- R6::R6Class(
classname = "lgb.CVBooster",
cloneable = FALSE,
public = list(
......@@ -17,46 +18,39 @@ CVBooster <- R6Class(
)
#' @title Main CV logic for LightGBM
#' @description Cross validation logic used by LightGBM
#' @name lgb.cv
#' @param params List of parameters
#' @param data a \code{lgb.Dataset} object, used for CV
#' @param nrounds number of CV rounds
#' @inheritParams lgb_shared_params
#' @param nfold the original dataset is randomly partitioned into \code{nfold} equal size subsamples.
#' @param label vector of response values. Should be provided only when data is an R-matrix.
#' @param weight vector of observation weights. If not NULL, will be set on the dataset
#' @param obj objective function, can be character or custom objective function. Examples include
#' \code{regression}, \code{regression_l1}, \code{huber},
#' \code{binary}, \code{lambdarank}, \code{multiclass}
#' @param boosting boosting type. \code{gbdt}, \code{dart}
#' @param num_leaves number of leaves in one tree. defaults to 127
#' @param max_depth Limit the max depth of the tree model. This is used to deal with overfitting when #data is small.
#' The tree still grows leaf-wise.
#' @param num_threads Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores, not the number of threads (most CPUs use hyper-threading to generate 2 threads per CPU core).
#' @param eval evaluation function, can be (list of) character or custom eval function
#' @param verbose verbosity for output, if <= 0, will also disable the printing of evaluation during training
#' @param record Boolean, TRUE will record iteration message to \code{booster$record_evals}
#' @param eval_freq evaluation output frequency, only effective when verbose > 0
#' @param showsd \code{boolean}, whether to show standard deviation of cross validation
#' @param stratified a \code{boolean} indicating whether sampling of folds should be stratified
#' by the values of outcome labels.
#' @param folds \code{list} provides a possibility to use a list of pre-defined CV folds
#' (each element must be a vector of test fold's indices). When folds are supplied,
#' the \code{nfold} and \code{stratified} parameters are ignored.
#' @param init_model path of model file or \code{lgb.Booster} object, will continue training from this model
#' @param colnames feature names, if not null, will use this to overwrite the names in dataset
#' @param categorical_feature list of str or int
#' type int represents index,
#' type str represents feature names
#' @param early_stopping_rounds int
#' Activates early stopping.
#' CV score needs to improve at least every early_stopping_rounds round(s) to continue.
#' Requires at least one metric.
#' If there's more than one, will check all of them.
#' Returns the model with (best_iter + early_stopping_rounds).
#' If early stopping occurs, the model will have 'best_iter' field
#' @param callbacks list of callback functions that are applied at each iteration
#' @param ... other parameters, see Parameters.rst for more informations
#' @param ... other parameters, see Parameters.rst for more information. A few key parameters:
#' \itemize{
#' \item{boosting}{Boosting type. \code{"gbdt"} or \code{"dart"}}
#' \item{num_leaves}{Number of leaves in one tree. Defaults to 127}
#' \item{max_depth}{Limit the max depth of the tree model. This is used to deal with
#' overfitting when #data is small. The tree still grows leaf-wise.}
#' \item{num_threads}{Number of threads for LightGBM. For the best speed, set this to
#' the number of real CPU cores, not the number of threads (most
#' CPUs use hyper-threading to generate 2 threads per CPU core).}
#' }
#'
#' @return a trained model \code{lgb.CVBooster}.
#'
......@@ -75,7 +69,6 @@ CVBooster <- R6Class(
#' learning_rate = 1,
#' early_stopping_rounds = 10)
#' }
#' @rdname lgb.train
#' @export
lgb.cv <- function(params = list(),
data,
......
......@@ -20,7 +20,7 @@
#' \item \code{leaf_index}: ID of a leaf in a tree (integer)
#' \item \code{leaf_parent}: ID of the parent node for current leaf (integer)
#' \item \code{split_gain}: Split gain of a node
#' \item \code{threshold}: Spliting threshold value of a node
#' \item \code{threshold}: Splitting threshold value of a node
#' \item \code{decision_type}: Decision type of a node
#' \item \code{default_left}: Determine how to handle NA value, TRUE -> Left, FALSE -> Right
#' \item \code{internal_value}: Node value
......@@ -47,7 +47,7 @@
#' }
#'
#' @importFrom magrittr %>%
#' @importFrom data.table := data.table
#' @importFrom data.table := data.table rbindlist
#' @importFrom jsonlite fromJSON
#' @export
lgb.model.dt.tree <- function(model, num_iteration = NULL) {
......@@ -78,6 +78,7 @@ lgb.model.dt.tree <- function(model, num_iteration = NULL) {
}
#' @importFrom data.table data.table rbindlist
single.tree.parse <- function(lgb_tree) {
......
......@@ -68,6 +68,7 @@
#'
#' }
#'
#' @importFrom data.table set
#' @export
lgb.prepare_rules <- function(data, rules = NULL) {
......@@ -80,7 +81,7 @@ lgb.prepare_rules <- function(data, rules = NULL) {
# Loop through rules
for (i in names(rules)) {
set(data, j = i, value = unname(rules[[i]][data[[i]]]))
data.table::set(data, j = i, value = unname(rules[[i]][data[[i]]]))
data[[i]][is.na(data[[i]])] <- 0 # Overwrite NAs by 0s
}
......@@ -119,7 +120,7 @@ lgb.prepare_rules <- function(data, rules = NULL) {
names(rules[[indexed]]) <- mini_unique # Character equivalent
# Apply to real data column
set(data, j = i, value = unname(rules[[indexed]][mini_data]))
data.table::set(data, j = i, value = unname(rules[[indexed]][mini_data]))
}
......
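The namespaced `data.table::set` call above updates a column by reference; a small self-contained sketch of the same pattern (hypothetical column name and rules, assuming only the data.table package is installed):

```r
library(data.table)

# Hypothetical rules mapping character levels to integer codes
rules <- list(colA = c("low" = 1L, "high" = 2L))
dt <- data.table::data.table(colA = c("low", "high", "low"))

# data.table::set() modifies the column in place without copying;
# the explicit data.table:: prefix mirrors the namespacing added for CRAN checks
data.table::set(dt, j = "colA", value = unname(rules[["colA"]][dt[["colA"]]]))
```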
......@@ -68,6 +68,7 @@
#'
#' }
#'
#' @importFrom data.table set
#' @export
lgb.prepare_rules2 <- function(data, rules = NULL) {
......@@ -80,7 +81,7 @@ lgb.prepare_rules2 <- function(data, rules = NULL) {
# Loop through rules
for (i in names(rules)) {
set(data, j = i, value = unname(rules[[i]][data[[i]]]))
data.table::set(data, j = i, value = unname(rules[[i]][data[[i]]]))
data[[i]][is.na(data[[i]])] <- 0L # Overwrite NAs by 0s as integer
}
......@@ -118,7 +119,7 @@ lgb.prepare_rules2 <- function(data, rules = NULL) {
names(rules[[indexed]]) <- mini_unique # Character equivalent
# Apply to real data column
set(data, j = i, value = unname(rules[[indexed]][mini_data]))
data.table::set(data, j = i, value = unname(rules[[indexed]][mini_data]))
}
......
#' @title Main training logic for LightGBM
#' @name lgb.train
#' @param params List of parameters
#' @param data a \code{lgb.Dataset} object, used for training
#' @param nrounds number of training rounds
#' @description Logic to train with LightGBM
#' @inheritParams lgb_shared_params
#' @param valids a list of \code{lgb.Dataset} objects, used for validation
#' @param obj objective function, can be character or custom objective function. Examples include
#' \code{regression}, \code{regression_l1}, \code{huber},
#' \code{binary}, \code{lambdarank}, \code{multiclass}
#' @param boosting boosting type. \code{gbdt}, \code{dart}
#' @param num_leaves number of leaves in one tree. defaults to 127
#' @param max_depth Limit the max depth of the tree model. This is used to deal with overfitting when #data is small.
#' The tree still grows leaf-wise.
#' @param num_threads Number of threads for LightGBM. For the best speed, set this to the number of real CPU cores, not the number of threads (most CPUs use hyper-threading to generate 2 threads per CPU core).
#' @param eval evaluation function, can be (a list of) character or custom eval function
#' @param verbose verbosity for output, if <= 0, will also disable the printing of evaluation during training
#' @param record Boolean, TRUE will record iteration message to \code{booster$record_evals}
#' @param eval_freq evaluation output frequency, only effective when verbose > 0
#' @param init_model path of model file of \code{lgb.Booster} object, will continue training from this model
#' @param colnames feature names, if not null, will use this to overwrite the names in dataset
#' @param categorical_feature list of str or int
#' type int represents index,
#' type str represents feature names
#' @param early_stopping_rounds int
#' Activates early stopping.
#' The model will train until the validation score stops improving.
#' Validation score needs to improve at least every early_stopping_rounds round(s) to continue training.
#' Requires at least one validation data and one metric.
#' If there's more than one, will check all of them. But the training data is ignored anyway.
#' Returns the model with (best_iter + early_stopping_rounds).
#' If early stopping occurs, the model will have 'best_iter' field
#' @param reset_data Boolean, setting it to TRUE (not the default value) will transform the booster model into a predictor model which frees up memory and the original datasets
#' @param callbacks list of callback functions that are applied at each iteration
#' @param ... other parameters, see Parameters.rst for more information
#'
#' @param ... other parameters, see Parameters.rst for more information. A few key parameters:
#' \itemize{
#' \item{boosting}{Boosting type. \code{"gbdt"} or \code{"dart"}}
#' \item{num_leaves}{Number of leaves in one tree. Defaults to 127}
#' \item{max_depth}{Limit the max depth of the tree model. This is used to deal with
#' overfitting when #data is small. The tree still grows leaf-wise.}
#' \item{num_threads}{Number of threads for LightGBM. For the best speed, set this to
#' the number of real CPU cores, not the number of threads (most
#' CPUs use hyper-threading to generate 2 threads per CPU core).}
#' }
#' @return a trained booster model \code{lgb.Booster}.
#'
#' @examples
......@@ -56,8 +45,6 @@
#' early_stopping_rounds = 10)
#' }
#'
#' @rdname lgb.train
#'
#' @export
lgb.train <- function(params = list(),
data,
......
#' Simple interface for training an lightgbm model.
#' Its documentation is combined with lgb.train.
#'
#' @rdname lgb.train
#' @name lgb_shared_params
#' @title Shared parameter docs
#' @description Parameter docs shared by \code{lgb.train}, \code{lgb.cv}, and \code{lightgbm}
#' @param callbacks list of callback functions that are applied at each iteration
#' @param data a \code{lgb.Dataset} object, used for training
#' @param early_stopping_rounds int
#' Activates early stopping.
#' Requires at least one validation data and one metric
#' If there's more than one, will check all of them except the training data
#' Returns the model with (best_iter + early_stopping_rounds)
#' If early stopping occurs, the model will have 'best_iter' field
#' @param eval_freq evaluation output frequency, only effective when verbose > 0
#' @param init_model path of model file of \code{lgb.Booster} object, will continue training from this model
#' @param nrounds number of training rounds
#' @param params List of parameters
#' @param verbose verbosity for output, if <= 0, also will disable the print of evaluation during training
NULL
#' @title Train a LightGBM model
#' @name lightgbm
#' @description Simple interface for training a LightGBM model.
#' @inheritParams lgb_shared_params
#' @param label Vector of labels, used if \code{data} is not an \code{\link{lgb.Dataset}}
#' @param weight vector of observation weights. If not NULL, will be set on the dataset
#' @param save_name File name to use when writing the trained model to disk. Should end in ".model".
#' @param ... Additional arguments passed to \code{\link{lgb.train}}. For example
#' \itemize{
#' \item{valids}{a list of \code{lgb.Dataset} objects, used for validation}
#' \item{obj}{objective function, can be character or custom objective function. Examples include
#' \code{regression}, \code{regression_l1}, \code{huber},
#' \code{binary}, \code{lambdarank}, \code{multiclass}}
#' \item{eval}{evaluation function, can be (a list of) character or custom eval function}
#' \item{record}{Boolean, TRUE will record iteration message to \code{booster$record_evals}}
#' \item{colnames}{feature names, if not null, will use this to overwrite the names in dataset}
#' \item{categorical_feature}{list of str or int. type int represents index, type str represents feature names}
#' \item{reset_data}{Boolean, setting it to TRUE (not the default value) will transform the booster model
#' into a predictor model which frees up memory and the original datasets}
#' \item{boosting}{Boosting type. \code{"gbdt"} or \code{"dart"}}
#' \item{num_leaves}{Number of leaves in one tree. Defaults to 127}
#' \item{max_depth}{Limit the max depth of the tree model. This is used to deal with
#' overfitting when #data is small. The tree still grows leaf-wise.}
#' \item{num_threads}{Number of threads for LightGBM. For the best speed, set this to
#' the number of real CPU cores, not the number of threads (most
#' CPUs use hyper-threading to generate 2 threads per CPU core).}
#' }
#' @export
lightgbm <- function(data,
label = NULL,
......@@ -122,7 +166,7 @@ NULL
# Various imports
#' @import methods
#' @importFrom R6 R6Class
#' @useDynLib lib_lightgbm
#' @useDynLib lib_lightgbm , .registration = TRUE
NULL
# Suppress false positive warnings from R CMD CHECK about
......
#' readRDS for lgb.Booster models
#'
#' Attemps to load a model using RDS.
#' Attempts to load a model using RDS.
#'
#' @param file a connection or the name of the file where the R object is saved to or read from.
#' @param refhook a hook function for handling reference objects.
......
#' saveRDS for lgb.Booster models
#'
#' Attemps to save a model using RDS. Has an additional parameter (\code{raw}) which decides whether to save the raw model or not.
#' Attempts to save a model using RDS. Has an additional parameter (\code{raw}) which decides whether to save the raw model or not.
#'
#' @param object R object to serialize.
#' @param file a connection or the name of the file where the R object is saved to or read from.
......
......@@ -22,46 +22,36 @@ For users who wants to install online with GPU or want to choose a specific comp
**Warning for Windows users**: it is recommended to use *Visual Studio* for its better multi-threading efficiency in Windows for many core systems. For very simple systems (dual core computers or worse), MinGW64 is recommended for maximum performance. If you do not know what to choose, it is recommended to use [Visual Studio](https://visualstudio.microsoft.com/downloads/), the default compiler. **Do not try using MinGW in Windows on many core systems. It may result in 10x slower results than Visual Studio.**
#### macOS Preparation
#### Mac OS Preparation
You can perform installation either with **Apple Clang** or **gcc**. In case you prefer **Apple Clang**, you should install **OpenMP** (details for installation can be found in [Installation Guide](https://github.com/Microsoft/LightGBM/blob/master/docs/Installation-Guide.rst#apple-clang)) first and **CMake** version 3.12 or higher is required. In case you prefer **gcc**, you need to install it (details for installation can be found in [Installation Guide](https://github.com/Microsoft/LightGBM/blob/master/docs/Installation-Guide.rst#gcc)) and specify compilers by running ``export CXX=g++-7 CC=gcc-7`` (replace "7" with version of **gcc** installed on your machine) first.
### Install
Install LightGBM R-package with the following command:
Mac users may need to set some environment variables to tell R to use `gcc` and `g++`. If you install these from Homebrew, your versions of `g++` and `gcc` are most likely in `/usr/local/bin`, as shown below.
```sh
git clone --recursive https://github.com/Microsoft/LightGBM
cd LightGBM/R-package
# export CXX=g++-7 CC=gcc-7 # macOS users, if you decided to compile with gcc, don't forget to specify compilers (replace "7" with version of gcc installed on your machine)
R CMD INSTALL --build . --no-multiarch
```
# replace 8 with version of gcc installed on your machine
export CXX=/usr/local/bin/g++-8 CC=/usr/local/bin/gcc-8
```
### Install
Or build a self-contained R-package which can be installed afterwards:
Build and install R-package with the following commands:
```sh
git clone --recursive https://github.com/Microsoft/LightGBM
cd LightGBM/R-package
Rscript build_package.R
# export CXX=g++-7 CC=gcc-7 # macOS users, if you decided to compile with gcc, don't forget to specify compilers (replace "7" with version of gcc installed on your machine)
R CMD INSTALL lightgbm_2.1.1.tar.gz --no-multiarch
cd LightGBM
Rscript build_r.R
```
The `build_r.R` script builds the package in a temporary directory called `lightgbm_r`. It will destroy and recreate that directory each time you run the script.
Note: for the build with Visual Studio/MSBuild in Windows, you should use the Windows CMD or Powershell.
Windows users may need to run with administrator rights (either R or the command prompt, depending on the way you are installing this package). Linux users might require the appropriate user write permissions for packages.
Set `use_gpu` to `TRUE` in `R-package/src/install.libs.R` to enable the build with GPU support. You will need to install Boost and OpenCL first: details for installation can be found in [Installation-Guide](https://github.com/Microsoft/LightGBM/blob/master/docs/Installation-Guide.rst#build-gpu-version).
You can also install directly from R using the repository with `devtools`:
```r
library(devtools)
options(devtools.install.args = "--no-multiarch") # if you have 64-bit R only, you can skip this
install_github("Microsoft/LightGBM", subdir = "R-package")
```
If you are using a precompiled dll/lib locally, you can move the dll/lib into LightGBM root folder, modify `LightGBM/R-package/src/install.libs.R`'s 2nd line (change `use_precompile <- FALSE` to `use_precompile <- TRUE`), and install R-package as usual. **NOTE: If your R version is not smaller than 3.5.0, you should set `DUSE_R35=ON` in CMake options when build precompiled dll/lib**.
If you are using a precompiled dll/lib locally, you can move the dll/lib into the LightGBM root folder, modify the 2nd line of `LightGBM/R-package/src/install.libs.R` (change `use_precompile <- FALSE` to `use_precompile <- TRUE`), and install the R-package as usual. **NOTE: If your R version is 3.5.0 or newer, you should set `DUSE_R35=ON` in cmake options when building the precompiled dll/lib.**
When your package installation is done, you can check quickly if your LightGBM R-package is working by running the following:
......
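A minimal sanity check along these lines (a sketch using the agaricus data bundled with the package, assuming the install succeeded) confirms that the package loads and trains:

```r
library(lightgbm)

# Load the small agaricus dataset that ships with the package
data(agaricus.train, package = "lightgbm")
train <- agaricus.train

# Build a Dataset and fit a tiny model for two rounds
dtrain <- lgb.Dataset(train$data, label = train$label)
params <- list(objective = "regression", metric = "l2")
model <- lgb.train(params, dtrain, nrounds = 2, min_data = 1, learning_rate = 1)
```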
unlink("./src/include", recursive = TRUE)
unlink("./src/src", recursive = TRUE)
unlink("./src/compute", recursive = TRUE)
unlink("./src/build", recursive = TRUE)
unlink("./src/Release", recursive = TRUE)
if (!file.copy("./../include", "./src/", overwrite = TRUE, recursive = TRUE)) {
stop("Cannot find folder LightGBM/include")
}
if (!file.copy("./../src", "./src/", overwrite = TRUE, recursive = TRUE)) {
stop("Cannot find folder LightGBM/src")
}
if (!file.copy("./../compute", "./src/", overwrite = TRUE, recursive = TRUE)) {
print("Cannot find folder LightGBM/compute, will disable GPU build")
}
if (!file.copy("./../CMakeLists.txt", "./src/", overwrite = TRUE, recursive = TRUE)) {
stop("Cannot find file LightGBM/CMakeLists.txt")
}
if (!file.exists("./src/_IS_FULL_PACKAGE")) {
file.create("./src/_IS_FULL_PACKAGE")
}
system("R CMD build --no-build-vignettes .")
file.remove("./src/_IS_FULL_PACKAGE")
basic_walkthrough Basic feature walkthrough
boost_from_prediction Boosting from existing prediction
categorical_feature_prepare Categorical Feature Preparation
categorical_feature_rules Categorical Feature Preparation with Rules
categorical_features_prepare Categorical Feature Preparation
categorical_features_rules Categorical Feature Preparation with Rules
cross_validation Cross Validation
early_stopping Early Stop in training
efficient_many_training Efficiency for Many Model Trainings
multiclass Multiclass training/prediction
multiclass_custom_objective Multiclass with Custom Objective Function
leaf_stability Leaf (in)Stability example
weight_param Weight-Parameter adjustment relationship
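The entries above are the package's bundled demos; assuming a successful install, any of them can be run by its index name, for example:

```r
# Run one of the bundled demos by name
# (assumes the lightgbm R-package is installed)
demo("basic_walkthrough", package = "lightgbm")
```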