Commit a58aca64 authored by dmitryikh's avatar dmitryikh Committed by Nikita Titov

Docs & Warning on sparse categorical features (#1636)

* warning on categorical feature with sparse values

* [docs] categorical features note
parent 83565f01
......@@ -573,6 +573,8 @@ IO Parameters
- **Note**: all values should be less than ``Int32.MaxValue`` (2147483647)
- **Note**: using large values can be memory consuming. The tree decision rule works best when categorical features are encoded as consecutive integers starting from zero
- **Note**: all negative values will be treated as **missing values**
- ``predict_raw_score`` :raw-html:`<a id="predict_raw_score" title="Permalink to this parameter" href="#predict_raw_score">&#x1F517;&#xFE0E;</a>`, default = ``false``, type = bool, aliases: ``is_predict_raw_score``, ``predict_rawscore``, ``raw_score``
......
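The notes above recommend renumbering sparse category codes to consecutive integers starting from zero before training. A minimal sketch of such a remapping in plain Python (the helper `renumber` is hypothetical, not part of LightGBM):

```python
# Hypothetical helper (not part of LightGBM): map arbitrary category codes
# to consecutive integers starting from zero, in order of first appearance.
def renumber(values):
    mapping = {}   # original code -> dense code
    dense = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)  # next unused dense code
        dense.append(mapping[v])
    return dense, mapping

codes, mapping = renumber([5, 1000000, 5, 42])
print(codes)    # [0, 1, 0, 2]
print(mapping)  # {5: 0, 1000000: 1, 42: 2}
```

With pandas, `pandas.factorize` or converting the column to the `category` dtype achieves the same effect.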
......@@ -539,6 +539,7 @@ public:
// desc = **Note**: only supports categorical with ``int`` type
// desc = **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``
// desc = **Note**: all values should be less than ``Int32.MaxValue`` (2147483647)
// desc = **Note**: using large values can be memory consuming. The tree decision rule works best when categorical features are encoded as consecutive integers starting from zero
// desc = **Note**: all negative values will be treated as **missing values**
std::string categorical_feature = "";
......
......@@ -646,6 +646,7 @@ class Dataset(object):
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
All values in categorical features should be less than int32 max value (2147483647).
Large values can be memory consuming. Consider using consecutive integers starting from zero.
All negative values in categorical features will be treated as missing values.
params : dict or None, optional (default=None)
Other parameters.
......
......@@ -57,6 +57,7 @@ def train(params, train_set, num_boost_round=100,
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
All values in categorical features should be less than int32 max value (2147483647).
Large values can be memory consuming. Consider using consecutive integers starting from zero.
All negative values in categorical features will be treated as missing values.
early_stopping_rounds: int or None, optional (default=None)
Activates early stopping. The model will train until the validation score stops improving.
......@@ -364,6 +365,7 @@ def cv(params, train_set, num_boost_round=100,
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
All values in categorical features should be less than int32 max value (2147483647).
Large values can be memory consuming. Consider using consecutive integers starting from zero.
All negative values in categorical features will be treated as missing values.
early_stopping_rounds: int or None, optional (default=None)
Activates early stopping.
......
......@@ -349,6 +349,7 @@ class LGBMModel(_LGBMModelBase):
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
All values in categorical features should be less than int32 max value (2147483647).
Large values can be memory consuming. Consider using consecutive integers starting from zero.
All negative values in categorical features will be treated as missing values.
callbacks : list of callback functions or None, optional (default=None)
List of callback functions that are applied at each iteration.
......
......@@ -320,6 +320,11 @@ namespace LightGBM {
num_bin_ = 0;
int rest_cnt = static_cast<int>(total_sample_cnt - na_cnt);
if (rest_cnt > 0) {
const int SPARSE_RATIO = 100;
if (distinct_values_int.back() / SPARSE_RATIO > static_cast<int>(distinct_values_int.size())) {
        Log::Warning("Met a categorical feature that contains sparse values. "
                     "Consider renumbering it to consecutive integers starting from zero");
}
// sort by counts
Common::SortForPair<int, int>(counts_int, distinct_values_int, 0, true);
// avoid first bin is zero
......
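The C++ hunk above warns when the largest distinct category code is more than `SPARSE_RATIO` (100) times the number of distinct codes. A sketch of the same heuristic in Python, assuming the input is the sorted list of distinct non-negative codes for one feature:

```python
# Mirror of the commit's sparsity check (sketch, not LightGBM's API).
SPARSE_RATIO = 100

def is_sparse_categorical(distinct_values):
    """Return True when the largest code exceeds SPARSE_RATIO times the
    number of distinct codes, i.e. the code space is mostly unused."""
    if not distinct_values:
        return False
    # Integer division, matching the C++ `back() / SPARSE_RATIO` comparison.
    return distinct_values[-1] // SPARSE_RATIO > len(distinct_values)

print(is_sparse_categorical([0, 1, 2, 3]))            # False (dense)
print(is_sparse_categorical([0, 10000, 2000000]))     # True (sparse)
```

The threshold is a heuristic: dense encodings keep the per-feature bin mapper small, while sparse codes force LightGBM to track a large, mostly empty value range.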