[docs] clarify that categorical features will be converted to integers internally (#4959)

* clarify that categoricals will be converted to ints and not that they should be ints in the input data * update remaining sections * update config.h * add suggestions

[docs] clarify that categorical features will be converted to integers internally (#4959)
* clarify that categoricals will be converted to ints and not that they should be ints in the input data * update remaining sections * update config.h * add suggestions
820ae7e6 · José Morales · GitHub · 0db573c3 · 820ae7e6 · 820ae7e6
Unverified Commit 820ae7e6 authored Feb 20, 2022 by José Morales Committed by GitHub Feb 20, 2022
6 changed files
--- a/docs/Advanced-Topics.rst
+++ b/docs/Advanced-Topics.rst
@@ -23,7 +23,8 @@ Categorical Feature Support
 -  Use ``categorical_feature`` to specify the categorical features.
   Refer to the parameter ``categorical_feature`` in `Parameters <./Parameters.rst#categorical_feature>`__.

-  Categorical features must be encoded as non-negative integers (``int``) less than ``Int32.MaxValue`` (2147483647).
+-  Categorical features will be cast to ``int32`` so they must be encoded as non-negative integers (negative values will be treated as missing)
+   less than ``Int32.MaxValue`` (2147483647).
   It is best to use a contiguous range of integers started from zero.
   Floating point numbers in categorical features will be rounded towards 0.


--- a/docs/Parameters.rst
+++ b/docs/Parameters.rst
@@ -816,7 +816,7 @@ Dataset Parameters

   -  add a prefix ``name:`` for column name, e.g. ``categorical_feature=name:c1,c2,c3`` means c1, c2 and c3 are categorical features

-   -  **Note**: only supports categorical with ``int`` type (not applicable for data represented as pandas DataFrame in Python-package)
+   -  **Note**: all values will be cast to ``int32`` (integer codes will be extracted from pandas categoricals in the Python-package)

   -  **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``


--- a/include/LightGBM/config.h
+++ b/include/LightGBM/config.h
@@ -697,7 +697,7 @@ struct Config {
  // desc = used to specify categorical features
  // desc = use number for index, e.g. ``categorical_feature=0,1,2`` means column\_0, column\_1 and column\_2 are categorical features
  // desc = add a prefix ``name:`` for column name, e.g. ``categorical_feature=name:c1,c2,c3`` means c1, c2 and c3 are categorical features
-  // desc = **Note**: only supports categorical with ``int`` type (not applicable for data represented as pandas DataFrame in Python-package)
+  // desc = **Note**: all values will be cast to ``int32`` (integer codes will be extracted from pandas categoricals in the Python-package)
  // desc = **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``
  // desc = **Note**: all values should be less than ``Int32.MaxValue`` (2147483647)
  // desc = **Note**: using large values could be memory consuming. Tree decision rule works best when categorical features are presented by consecutive integers starting from zero

--- a/python-package/lightgbm/basic.py
+++ b/python-package/lightgbm/basic.py
@@ -1155,7 +1155,7 @@ class Dataset:
            If list of int, interpreted as indices.
            If list of str, interpreted as feature names (need to specify ``feature_name`` as well).
            If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used.
-            All values in categorical features should be less than int32 max value (2147483647).
+            All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647).
            Large values could be memory consuming. Consider using consecutive integers starting from zero.
            All negative values in categorical features will be treated as missing values.
            The output cannot be monotonically constrained with respect to a categorical feature.
@@ -3560,7 +3560,7 @@ class Booster:
            If list of int, interpreted as indices.
            If list of str, interpreted as feature names (need to specify ``feature_name`` as well).
            If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used.
-            All values in categorical features should be less than int32 max value (2147483647).
+            All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647).
            Large values could be memory consuming. Consider using consecutive integers starting from zero.
            All negative values in categorical features will be treated as missing values.
            The output cannot be monotonically constrained with respect to a categorical feature.

--- a/python-package/lightgbm/engine.py
+++ b/python-package/lightgbm/engine.py
@@ -105,7 +105,7 @@ def train(
        If list of int, interpreted as indices.
        If list of str, interpreted as feature names (need to specify ``feature_name`` as well).
        If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used.
-        All values in categorical features should be less than int32 max value (2147483647).
+        All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647).
        Large values could be memory consuming. Consider using consecutive integers starting from zero.
        All negative values in categorical features will be treated as missing values.
        The output cannot be monotonically constrained with respect to a categorical feature.
@@ -460,7 +460,7 @@ def cv(params, train_set, num_boost_round=100,
        If list of int, interpreted as indices.
        If list of str, interpreted as feature names (need to specify ``feature_name`` as well).
        If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used.
-        All values in categorical features should be less than int32 max value (2147483647).
+        All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647).
        Large values could be memory consuming. Consider using consecutive integers starting from zero.
        All negative values in categorical features will be treated as missing values.
        The output cannot be monotonically constrained with respect to a categorical feature.

--- a/python-package/lightgbm/sklearn.py
+++ b/python-package/lightgbm/sklearn.py
@@ -258,7 +258,7 @@ _lgbmmodel_doc_fit = (
        If list of int, interpreted as indices.
        If list of str, interpreted as feature names (need to specify ``feature_name`` as well).
        If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used.
-        All values in categorical features should be less than int32 max value (2147483647).
+        All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647).
        Large values could be memory consuming. Consider using consecutive integers starting from zero.
        All negative values in categorical features will be treated as missing values.
        The output cannot be monotonically constrained with respect to a categorical feature.