#' @name lgb_shared_params
#' @title Shared parameter docs
#' @description Parameter docs shared by \code{lgb.train}, \code{lgb.cv}, and \code{lightgbm}
#' @param callbacks List of callback functions that are applied at each iteration.
#' @param data a \code{lgb.Dataset} object, used for training. Some functions, such as \code{\link{lgb.cv}},
#'             may allow you to pass other types of data like \code{matrix} and then separately supply
#'             \code{label} as a keyword argument.
#' @param early_stopping_rounds int. Activates early stopping. When this parameter is non-null,
#'                              training will stop if the evaluation of any metric on any validation set
#'                              fails to improve for \code{early_stopping_rounds} consecutive boosting rounds.
#'                              If training stops early, the returned model will have attribute \code{best_iter}
#'                              set to the iteration number of the best iteration.
#' @param eval evaluation function(s). This can be a character vector, function, or list with a mixture of
#'             strings and functions.
#'
#'             \itemize{
#'                 \item{\bold{a. character vector}:
#'                     If you provide a character vector to this argument, it should contain strings with valid
#'                     evaluation metrics.
#'                     See \href{https://lightgbm.readthedocs.io/en/latest/Parameters.html#metric}{
#'                     The "metric" section of the documentation}
#'                     for a list of valid metrics.
#'                 }
#'                 \item{\bold{b. function}:
#'                      You can provide a custom evaluation function. This
#'                      should accept the keyword arguments \code{preds} and \code{dtrain} and should return a named
#'                      list with three elements:
#'                      \itemize{
#'                          \item{\code{name}: A string with the name of the metric, used for printing
#'                              and storing results.
#'                          }
#'                          \item{\code{value}: A single number indicating the value of the metric for the
#'                              given predictions and true values
#'                          }
#'                          \item{
#'                              \code{higher_better}: A boolean indicating whether higher values indicate a better fit.
#'                              For example, this would be \code{FALSE} for metrics like MAE or RMSE.
#'                          }
#'                      }
#'                 }
#'                 \item{\bold{c. list}:
#'                     If a list is given, it should only contain character vectors and functions.
#'                     These should follow the requirements from the descriptions above.
#'                 }
#'             }
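#'
#'             As a sketch, a custom RMSE metric following the interface described in (b) might look
#'             like the following (assuming labels can be retrieved from \code{dtrain} with
#'             \code{get_field}):
#'             \preformatted{
#' rmse_eval <- function(preds, dtrain) {
#'   labels <- get_field(dtrain, "label")
#'   list(
#'     name = "custom_rmse"
#'     , value = sqrt(mean((preds - labels)^2))
#'     , higher_better = FALSE
#'   )
#' }
#'             }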
#' @param eval_freq evaluation output frequency, only effective when verbose > 0 and \code{valids} has been provided
#' @param init_model path of model file or \code{lgb.Booster} object, will continue training from this model
#' @param nrounds number of training rounds
#' @param obj objective function, can be character or custom objective function. Examples include
#'            \code{regression}, \code{regression_l1}, \code{huber},
#'            \code{binary}, \code{lambdarank}, \code{multiclass}
#' @param params a list of parameters. See \href{https://lightgbm.readthedocs.io/en/latest/Parameters.html}{
#'               the "Parameters" section of the documentation} for a list of parameters and valid values.
#' @param verbose verbosity for output, if <= 0 and \code{valids} has been provided, also will disable the
#'                printing of evaluation during training
#' @param serializable whether to make the resulting objects serializable through functions such as
#'                     \code{save} or \code{saveRDS} (see section "Model serialization").
#' @section Early Stopping:
#'
#'          "early stopping" refers to stopping the training process if the model's performance on a given
#'          validation set does not improve for several consecutive iterations.
#'
#'          If multiple arguments are given to \code{eval}, their order will be preserved. If you enable
#'          early stopping by setting \code{early_stopping_rounds} in \code{params}, by default all
#'          metrics will be considered for early stopping.
#'
#'          If you want to only consider the first metric for early stopping, pass
#'          \code{first_metric_only = TRUE} in \code{params}. Note that if you also specify \code{metric}
#'          in \code{params}, that metric will be considered the "first" one. If you omit \code{metric},
#'          a default metric will be used based on your choice for the parameter \code{obj} (keyword argument)
#'          or \code{objective} (passed into \code{params}).
#'
#'          \bold{NOTE:} if using \code{boosting_type="dart"}, any early stopping configuration will be ignored
#'          and early stopping will not be performed.
#' @section Model serialization:
#'
#'          LightGBM model objects can be serialized and de-serialized through functions such as \code{save}
#'          or \code{saveRDS}, but similarly to libraries such as 'xgboost', serialization works a bit differently
#'          from typical R objects. In order to make models serializable in R, a copy of the underlying C++ object
#'          as serialized raw bytes is produced and stored in the R model object, and when this R object is
#'          de-serialized, the underlying C++ model object gets reconstructed from these raw bytes, but will only
#'          do so once some function that uses it is called, such as \code{predict}. In order to forcibly
#'          reconstruct the C++ object after deserialization (e.g. after calling \code{readRDS} or similar), one
#'          can use the function \link{lgb.restore_handle} (for example, if one makes predictions in parallel or in
#'          forked processes, it will be faster to restore the handle beforehand).
#'
#'          Producing and keeping these raw bytes however uses extra memory, and if they are not required,
#'          it is possible to avoid producing them by passing \code{serializable = FALSE}. In such cases, these
#'          raw bytes can be added to the model on demand through the function \link{lgb.make_serializable}.
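#'
#'          As a sketch, a typical save / load round trip (assuming \code{model} is a trained
#'          \code{lgb.Booster} and \code{newdata} is a matrix of features) looks like:
#'          \preformatted{
#' saveRDS(model, "model.rds")
#' model2 <- readRDS("model.rds")
#' lgb.restore_handle(model2)  # optional: eagerly rebuild the C++ handle
#' preds <- predict(model2, newdata)
#'          }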
#'
#'          \emph{New in version 4.0.0}
#'
#' @keywords internal
NULL

#' @name lightgbm
#' @title Train a LightGBM model
#' @description High-level R interface to train a LightGBM model. Unlike \code{\link{lgb.train}}, this function
#'              is focused on compatibility with other statistics and machine learning interfaces in R.
#'              This focus on compatibility means that this interface may experience more frequent breaking API changes
#'              than \code{\link{lgb.train}}.
#'              For efficiency-sensitive applications, or for applications where breaking API changes across releases
#'              is very expensive, use \code{\link{lgb.train}}.
#' @inheritParams lgb_shared_params
#' @param label Vector of labels, used if \code{data} is not an \code{\link{lgb.Dataset}}
#' @param weights Sample / observation weights for rows in the input data. If \code{NULL}, will assume that all
#'                observations / rows have the same importance / weight.
#'
#'                \emph{Changed from 'weight', in version 4.0.0}
#'
#' @param objective Optimization objective (e.g. `"regression"`, `"binary"`, etc.).
#'                  For a list of accepted objectives, see
#'                  \href{https://lightgbm.readthedocs.io/en/latest/Parameters.html#objective}{
#'                  the "objective" item of the "Parameters" section of the documentation}.
#'
#'                  If passing \code{"auto"} and \code{data} is not of type \code{lgb.Dataset}, the objective will
#'                  be determined according to what is passed for \code{label}:\itemize{
#'                  \item If passing a factor with two levels, will use objective \code{"binary"}.
#'                  \item If passing a factor with more than two levels, will use objective \code{"multiclass"}
#'                  (note that parameter \code{num_class} in this case will also be determined automatically from
#'                  \code{label}).
#'                  \item Otherwise (or if passing \code{lgb.Dataset} as input), will use objective \code{"regression"}.
#'                  }
#'
#'                  \emph{New in version 4.0.0}
#'
#' @param init_score initial score (base prediction) that LightGBM will boost from
#'
#'                   \emph{New in version 4.0.0}
#'
#' @param num_threads Number of parallel threads to use. For best speed, this should be set to the number of
#'                    physical cores in the CPU - in a typical x86-64 machine, this corresponds to half the
#'                    number of maximum threads.
#'
#'                    Be aware that using too many threads can result in speed degradation in smaller datasets
#'                    (see the parameters documentation for more details).
#'
#'                    If passing zero, will use the default number of threads configured for OpenMP
#'                    (typically controlled through an environment variable \code{OMP_NUM_THREADS}).
#'
#'                    If passing \code{NULL} (the default), will try to use the number of physical cores in the
#'                    system, but be aware that getting the number of cores detected correctly requires package
#'                    \code{RhpcBLASctl} to be installed.
#'
#'                    This parameter gets overridden by \code{num_threads} and its aliases under \code{params}
#'                    if passed there.
#'
#'                    \emph{New in version 4.0.0}
#'
#' @param colnames Character vector of features. Only used if \code{data} is not an \code{\link{lgb.Dataset}}.
#' @param categorical_feature categorical features. This can either be a character vector of feature
#'                            names or an integer vector with the indices of the features (e.g.
#'                            \code{c(1L, 10L)} to say "the first and tenth columns").
#'                            Only used if \code{data} is not an \code{\link{lgb.Dataset}}.
#'
#' @param ... Additional arguments passed to \code{\link{lgb.train}}. For example
#'     \itemize{
#'        \item{\code{valids}: a list of \code{lgb.Dataset} objects, used for validation}
#'        \item{\code{obj}: objective function, can be character or custom objective function. Examples include
#'                   \code{regression}, \code{regression_l1}, \code{huber},
#'                   \code{binary}, \code{lambdarank}, \code{multiclass}}
#'        \item{\code{eval}: evaluation function, can be (a list of) character or custom eval function}
#'        \item{\code{record}: Boolean, TRUE will record iteration message to \code{booster$record_evals}}
#'        \item{\code{reset_data}: Boolean, setting it to TRUE (not the default value) will transform the booster model
#'                          into a predictor model which frees up memory and the original datasets}
#'     }
#' @inheritSection lgb_shared_params Early Stopping
#' @return a trained \code{lgb.Booster}
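#' @examples
#' \donttest{
#' # A minimal usage sketch: train a small binary classifier on the bundled
#' # agaricus data (parameter values here are illustrative, not tuned)
#' data(agaricus.train, package = "lightgbm")
#' train <- agaricus.train
#' model <- lightgbm(
#'   data = train$data
#'   , label = train$label
#'   , params = list(num_leaves = 4L, learning_rate = 1.0)
#'   , objective = "binary"
#'   , nrounds = 5L
#'   , verbose = -1L
#' )
#' }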
#' @export
lightgbm <- function(data,
                     label = NULL,
                     weights = NULL,
                     params = list(),
                     nrounds = 100L,
                     verbose = 1L,
                     eval_freq = 1L,
                     early_stopping_rounds = NULL,
                     init_model = NULL,
                     callbacks = list(),
                     serializable = TRUE,
                     objective = "auto",
                     init_score = NULL,
                     num_threads = NULL,
                     colnames = NULL,
                     categorical_feature = NULL,
                     ...) {

  # validate inputs early to avoid unnecessary computation
  if (nrounds <= 0L) {
    stop("nrounds should be greater than zero")
  }

  if (is.null(num_threads)) {
    num_threads <- .get_default_num_threads()
  }
  params <- .check_wrapper_param(
    main_param_name = "num_threads"
    , params = params
    , alternative_kwarg_value = num_threads
  )
  params <- .check_wrapper_param(
    main_param_name = "verbosity"
    , params = params
    , alternative_kwarg_value = verbose
  )

  # Process factors as labels and auto-determine objective
  if (!.is_Dataset(data)) {
    data_processor <- DataProcessor$new()
    temp <- data_processor$process_label(
        label = label
        , objective = objective
        , params = params
    )
    label <- temp$label
    objective <- temp$objective
    params <- temp$params
    rm(temp)
  } else {
    data_processor <- NULL
    if (objective == "auto") {
      objective <- "regression"
    }
  }

  # Set data to a temporary variable
  dtrain <- data

  # Check whether data is lgb.Dataset, if not then create lgb.Dataset manually
  if (!.is_Dataset(x = dtrain)) {
    dtrain <- lgb.Dataset(
      data = data
      , label = label
      , weight = weights
      , init_score = init_score
      , categorical_feature = categorical_feature
      , colnames = colnames
    )
  }

  train_args <- list(
    "params" = params
    , "data" = dtrain
    , "nrounds" = nrounds
    , "obj" = objective
    , "verbose" = params[["verbosity"]]
    , "eval_freq" = eval_freq
    , "early_stopping_rounds" = early_stopping_rounds
    , "init_model" = init_model
    , "callbacks" = callbacks
    , "serializable" = serializable
  )
  train_args <- append(train_args, list(...))

  if (! "valids" %in% names(train_args)) {
    train_args[["valids"]] <- list()
  }

  # Train a model using the regular way
  bst <- do.call(
    what = lgb.train
    , args = train_args
  )
  bst$data_processor <- data_processor

  return(bst)
}

#' @name agaricus.train
#' @title Training part from Mushroom Data Set
#' @description This data set is originally from the Mushroom data set,
#'              UCI Machine Learning Repository.
#'              This data set includes the following fields:
#'
#'               \itemize{
#'                   \item{\code{label}: the label for each record}
#'                   \item{\code{data}: a sparse Matrix of \code{dgCMatrix} class, with 126 columns.}
#'                }
#'
#' @references
#' https://archive.ics.uci.edu/ml/datasets/Mushroom
#'
#' Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository
#' [https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
#' School of Information and Computer Science.
#'
#' @docType data
#' @keywords datasets
#' @usage data(agaricus.train)
#' @format A list containing a label vector, and a dgCMatrix object with 6513
#' rows and 126 variables
NULL

#' @name agaricus.test
#' @title Test part from Mushroom Data Set
#' @description This data set is originally from the Mushroom data set,
#'              UCI Machine Learning Repository.
#'              This data set includes the following fields:
#'
#'              \itemize{
#'                  \item{\code{label}: the label for each record}
#'                  \item{\code{data}: a sparse Matrix of \code{dgCMatrix} class, with 126 columns.}
#'              }
#' @references
#' https://archive.ics.uci.edu/ml/datasets/Mushroom
#'
#' Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository
#' [https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
#' School of Information and Computer Science.
#'
#' @docType data
#' @keywords datasets
#' @usage data(agaricus.test)
#' @format A list containing a label vector, and a dgCMatrix object with 1611
#' rows and 126 variables
NULL

#' @name bank
#' @title Bank Marketing Data Set
#' @description This data set is originally from the Bank Marketing data set,
#'              UCI Machine Learning Repository.
#'
#'              It contains only bank.csv, with 10% of the examples and 17 inputs, randomly
#'              selected from bank-full.csv (an older version of this dataset with fewer inputs).
#'
#' @references
#' https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
#'
#' S. Moro, P. Cortez and P. Rita. (2014)
#' A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems
#'
#' @docType data
#' @keywords datasets
#' @usage data(bank)
#' @format A data.table with 4521 rows and 17 variables
NULL

# Various imports
#' @import methods
#' @importFrom Matrix Matrix
#' @importFrom R6 R6Class
#' @useDynLib lightgbm , .registration = TRUE
NULL

# Suppress false positive warnings from R CMD check about
# "no visible binding for global variable"
globalVariables(c(
    "."
    , ".N"
    , ".SD"
    , "abs_contribution"
    , "bar_color"
    , "Contribution"
    , "Cover"
    , "Feature"
    , "Frequency"
    , "Gain"
    , "internal_count"
    , "internal_value"
    , "leaf_index"
    , "leaf_parent"
    , "leaf_value"
    , "node_parent"
    , "split_feature"
    , "split_gain"
    , "split_index"
    , "tree_index"
))