Commit 0e5eb9e3 authored by James Lamb, committed by GitHub

[docs] expand documentation on 'group' for ranking task (#3772)



* [python-package] expand documentation on 'group' for ranking task

* add R package

* update Query Data section

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* fix typo in group example

* regenerate parameters

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* regenerate R docs
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
parent 35612633
...@@ -998,7 +998,11 @@ slice.lgb.Dataset <- function(dataset, idxset, ...) { ...@@ -998,7 +998,11 @@ slice.lgb.Dataset <- function(dataset, idxset, ...) {
#' \itemize{ #' \itemize{
#' \item \code{label}: label lightgbm learn from ; #' \item \code{label}: label lightgbm learn from ;
#' \item \code{weight}: to do a weight rescale ; #' \item \code{weight}: to do a weight rescale ;
#' \item \code{group}: group size ; #' \item{\code{group}: used for learning-to-rank tasks. An integer vector describing how to
#' group rows together as ordered results from the same set of candidate results to be ranked.
#' For example, if you have a 100-document dataset with \code{group = c(10, 20, 40, 10, 10, 10)},
#' that means that you have 6 groups, where the first 10 records are in the first group,
#' records 11-30 are in the second group, etc.}
#' \item \code{init_score}: initial score is the base prediction lightgbm will boost from. #' \item \code{init_score}: initial score is the base prediction lightgbm will boost from.
#' } #' }
#' #'
...@@ -1052,8 +1056,9 @@ getinfo.lgb.Dataset <- function(dataset, name, ...) { ...@@ -1052,8 +1056,9 @@ getinfo.lgb.Dataset <- function(dataset, name, ...) {
#' \item{\code{init_score}: initial score is the base prediction lightgbm will boost from} #' \item{\code{init_score}: initial score is the base prediction lightgbm will boost from}
#' \item{\code{group}: used for learning-to-rank tasks. An integer vector describing how to #' \item{\code{group}: used for learning-to-rank tasks. An integer vector describing how to
#' group rows together as ordered results from the same set of candidate results to be ranked. #' group rows together as ordered results from the same set of candidate results to be ranked.
#' For example, if you have a 1000-row dataset that contains 250 4-document query results, #' For example, if you have a 100-document dataset with \code{group = c(10, 20, 40, 10, 10, 10)},
#' set this to \code{rep(4L, 250L)}} #' that means that you have 6 groups, where the first 10 records are in the first group,
#' records 11-30 are in the second group, etc.}
#' } #' }
#' #'
#' @examples #' @examples
......
...@@ -30,7 +30,11 @@ The \code{name} field can be one of the following: ...@@ -30,7 +30,11 @@ The \code{name} field can be one of the following:
\itemize{ \itemize{
\item \code{label}: label lightgbm learn from ; \item \code{label}: label lightgbm learn from ;
\item \code{weight}: to do a weight rescale ; \item \code{weight}: to do a weight rescale ;
\item \code{group}: group size ; \item{\code{group}: used for learning-to-rank tasks. An integer vector describing how to
group rows together as ordered results from the same set of candidate results to be ranked.
For example, if you have a 100-document dataset with \code{group = c(10, 20, 40, 10, 10, 10)},
that means that you have 6 groups, where the first 10 records are in the first group,
records 11-30 are in the second group, etc.}
\item \code{init_score}: initial score is the base prediction lightgbm will boost from. \item \code{init_score}: initial score is the base prediction lightgbm will boost from.
} }
} }
......
...@@ -35,8 +35,9 @@ The \code{name} field can be one of the following: ...@@ -35,8 +35,9 @@ The \code{name} field can be one of the following:
\item{\code{init_score}: initial score is the base prediction lightgbm will boost from} \item{\code{init_score}: initial score is the base prediction lightgbm will boost from}
\item{\code{group}: used for learning-to-rank tasks. An integer vector describing how to \item{\code{group}: used for learning-to-rank tasks. An integer vector describing how to
group rows together as ordered results from the same set of candidate results to be ranked. group rows together as ordered results from the same set of candidate results to be ranked.
For example, if you have a 1000-row dataset that contains 250 4-document query results, For example, if you have a 100-document dataset with \code{group = c(10, 20, 40, 10, 10, 10)},
set this to \code{rep(4L, 250L)}} that means that you have 6 groups, where the first 10 records are in the first group,
records 11-30 are in the second group, etc.}
} }
} }
\examples{ \examples{
......
...@@ -760,7 +760,7 @@ Dataset Parameters ...@@ -760,7 +760,7 @@ Dataset Parameters
- **Note**: works only in case of loading data directly from file - **Note**: works only in case of loading data directly from file
- **Note**: data should be grouped by query\_id - **Note**: data should be grouped by query\_id, for more information, see `Query Data <#query-data>`__
- **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``, e.g. when label is column\_0 and query\_id is column\_1, the correct parameter is ``query=0`` - **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``, e.g. when label is column\_0 and query\_id is column\_1, the correct parameter is ``query=0``
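The requirement that data be grouped by query\_id can be illustrated with a short sketch. It assumes a per-row list of query ids (the name `query_ids` and the helper `group_sizes_from_query_ids` are hypothetical and not part of LightGBM) and collapses contiguous runs of equal ids into group sizes. A query id that reappears after a different id indicates that the rows are not grouped.

```python
from itertools import groupby


def group_sizes_from_query_ids(query_ids):
    """Collapse contiguous runs of equal query ids into group sizes.

    Hypothetical helper for illustration only; not part of the LightGBM API.
    Raises ValueError if a query id appears in more than one run,
    i.e. the rows are not grouped by query id.
    """
    sizes = []
    seen = set()
    for qid, run in groupby(query_ids):
        if qid in seen:
            raise ValueError(f"rows for query id {qid!r} are not contiguous")
        seen.add(qid)
        # Count the rows in this contiguous run.
        sizes.append(sum(1 for _ in run))
    return sizes


# Six rows: three for q1, two for q2, one for q3.
query_ids = ["q1", "q1", "q1", "q2", "q2", "q3"]
sizes = group_sizes_from_query_ids(query_ids)
print(sizes)  # [3, 2, 1]
```

The resulting list has the same shape as the ``group`` values described in this commit: one size per query, in row order.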
...@@ -1229,6 +1229,7 @@ Query Data ...@@ -1229,6 +1229,7 @@ Query Data
~~~~~~~~~~ ~~~~~~~~~~
For learning to rank, it needs query information for training data. For learning to rank, it needs query information for training data.
LightGBM uses an additional file to store query data, like the following: LightGBM uses an additional file to store query data, like the following:
:: ::
...@@ -1238,7 +1239,13 @@ LightGBM uses an additional file to store query data, like the following: ...@@ -1238,7 +1239,13 @@ LightGBM uses an additional file to store query data, like the following:
67 67
... ...
It means first ``27`` lines samples belong to one query and next ``18`` lines belong to another, and so on. For wrapper libraries like in Python and R, this information can also be provided as an array-like via the Dataset parameter ``group``.
::
[27, 18, 67, ...]
For example, if you have a 112-document dataset with ``group = [27, 18, 67]``, that means that you have 3 groups, where the first 27 records are in the first group, records 28-45 are in the second group, and records 46-112 are in the third group.
**Note**: data should be ordered by the query. **Note**: data should be ordered by the query.
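The mapping from group sizes to row ranges described above can be reproduced with a few lines of plain Python. This is a minimal sketch; the helper name `group_row_ranges` is hypothetical and is not provided by LightGBM.

```python
from itertools import accumulate


def group_row_ranges(group):
    """Return 1-based inclusive (first, last) row ranges, one per group size."""
    # Running totals give the last row of each group.
    ends = list(accumulate(group))
    # Each group starts one row after the previous group ends.
    starts = [1] + [end + 1 for end in ends[:-1]]
    return list(zip(starts, ends))


ranges = group_row_ranges([27, 18, 67])
print(ranges)  # [(1, 27), (28, 45), (46, 112)]
```

The output matches the 112-document example: records 1-27, 28-45, and 46-112.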
......
...@@ -670,7 +670,7 @@ struct Config { ...@@ -670,7 +670,7 @@ struct Config {
// desc = use number for index, e.g. ``query=0`` means column\_0 is the query id // desc = use number for index, e.g. ``query=0`` means column\_0 is the query id
// desc = add a prefix ``name:`` for column name, e.g. ``query=name:query_id`` // desc = add a prefix ``name:`` for column name, e.g. ``query=name:query_id``
// desc = **Note**: works only in case of loading data directly from file // desc = **Note**: works only in case of loading data directly from file
// desc = **Note**: data should be grouped by query\_id // desc = **Note**: data should be grouped by query\_id, for more information, see `Query Data <#query-data>`__
// desc = **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``, e.g. when label is column\_0 and query\_id is column\_1, the correct parameter is ``query=0`` // desc = **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``, e.g. when label is column\_0 and query\_id is column\_1, the correct parameter is ``query=0``
std::string group_column = ""; std::string group_column = "";
......
...@@ -941,7 +941,10 @@ class Dataset: ...@@ -941,7 +941,10 @@ class Dataset:
weight : list, numpy 1-D array, pandas Series or None, optional (default=None) weight : list, numpy 1-D array, pandas Series or None, optional (default=None)
Weight for each instance. Weight for each instance.
group : list, numpy 1-D array, pandas Series or None, optional (default=None) group : list, numpy 1-D array, pandas Series or None, optional (default=None)
Group/query size for Dataset. Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
init_score : list, numpy 1-D array, pandas Series or None, optional (default=None) init_score : list, numpy 1-D array, pandas Series or None, optional (default=None)
Init score for Dataset. Init score for Dataset.
silent : bool, optional (default=False) silent : bool, optional (default=False)
...@@ -1356,7 +1359,10 @@ class Dataset: ...@@ -1356,7 +1359,10 @@ class Dataset:
weight : list, numpy 1-D array, pandas Series or None, optional (default=None) weight : list, numpy 1-D array, pandas Series or None, optional (default=None)
Weight for each instance. Weight for each instance.
group : list, numpy 1-D array, pandas Series or None, optional (default=None) group : list, numpy 1-D array, pandas Series or None, optional (default=None)
Group/query size for Dataset. Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
init_score : list, numpy 1-D array, pandas Series or None, optional (default=None) init_score : list, numpy 1-D array, pandas Series or None, optional (default=None)
Init score for Dataset. Init score for Dataset.
silent : bool, optional (default=False) silent : bool, optional (default=False)
...@@ -1715,7 +1721,10 @@ class Dataset: ...@@ -1715,7 +1721,10 @@ class Dataset:
Parameters Parameters
---------- ----------
group : list, numpy 1-D array, pandas Series or None group : list, numpy 1-D array, pandas Series or None
Group size of each group. Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
Returns Returns
------- -------
...@@ -1830,7 +1839,10 @@ class Dataset: ...@@ -1830,7 +1839,10 @@ class Dataset:
Returns Returns
------- -------
group : numpy array or None group : numpy array or None
Group size of each group. Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
""" """
if self.group is None: if self.group is None:
self.group = self.get_field('group') self.group = self.get_field('group')
......
...@@ -36,7 +36,10 @@ class _ObjectiveFunctionWrapper: ...@@ -36,7 +36,10 @@ class _ObjectiveFunctionWrapper:
y_pred : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task) y_pred : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
The predicted values. The predicted values.
group : array-like group : array-like
Group/query data, used for ranking task. Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
grad : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task) grad : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
The value of the first order derivative (gradient) for each sample point. The value of the first order derivative (gradient) for each sample point.
hess : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task) hess : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
...@@ -122,7 +125,10 @@ class _EvalFunctionWrapper: ...@@ -122,7 +125,10 @@ class _EvalFunctionWrapper:
weight : array-like of shape = [n_samples] weight : array-like of shape = [n_samples]
The weight of samples. The weight of samples.
group : array-like group : array-like
Group/query data, used for ranking task. Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
eval_name : string eval_name : string
The name of evaluation function (without whitespaces). The name of evaluation function (without whitespaces).
eval_result : float eval_result : float
...@@ -266,7 +272,10 @@ class LGBMModel(_LGBMModelBase): ...@@ -266,7 +272,10 @@ class LGBMModel(_LGBMModelBase):
y_pred : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task) y_pred : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
The predicted values. The predicted values.
group : array-like group : array-like
Group/query data, used for ranking task. Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
grad : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task) grad : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
The value of the first order derivative (gradient) for each sample point. The value of the first order derivative (gradient) for each sample point.
hess : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task) hess : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
...@@ -384,7 +393,10 @@ class LGBMModel(_LGBMModelBase): ...@@ -384,7 +393,10 @@ class LGBMModel(_LGBMModelBase):
init_score : array-like of shape = [n_samples] or None, optional (default=None) init_score : array-like of shape = [n_samples] or None, optional (default=None)
Init score of training data. Init score of training data.
group : array-like or None, optional (default=None) group : array-like or None, optional (default=None)
Group data of training data. Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
eval_set : list or None, optional (default=None) eval_set : list or None, optional (default=None)
A list of (X, y) tuple pairs to use as validation sets. A list of (X, y) tuple pairs to use as validation sets.
eval_names : list of strings or None, optional (default=None) eval_names : list of strings or None, optional (default=None)
...@@ -460,7 +472,10 @@ class LGBMModel(_LGBMModelBase): ...@@ -460,7 +472,10 @@ class LGBMModel(_LGBMModelBase):
weight : array-like of shape = [n_samples] weight : array-like of shape = [n_samples]
The weight of samples. The weight of samples.
group : array-like group : array-like
Group/query data, used for ranking task. Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
eval_name : string eval_name : string
The name of evaluation function (without whitespaces). The name of evaluation function (without whitespaces).
eval_result : float eval_result : float
......
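The docstrings added in this commit state that ``sum(group) = n_samples``. A caller could check that condition before building a dataset. The following is a sketch under that assumption; `check_group` is a hypothetical helper, not a LightGBM function.

```python
def check_group(group, n_samples):
    """Verify that the group sizes account for every row exactly once.

    Hypothetical helper for illustration only; not part of the LightGBM API.
    """
    total = sum(group)
    if total != n_samples:
        raise ValueError(
            f"sum(group) is {total}, but the dataset has {n_samples} rows"
        )
    return True


# The 100-document example from the docstrings: 6 groups covering 100 rows.
ok = check_group([10, 20, 40, 10, 10, 10], n_samples=100)
print(ok)  # True
```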