Unverified commit a0c69417, authored by Nikita Titov, committed by GitHub

[docs] documented crash in case categorical values is bigger max int32 (#1376)

* added checks for categorical features > max_int32

* added tests

* fixed pylint

* removed warnings about overridden categorical features

* Revert "removed warnings about overridden categorical features"

This reverts commit 289a426c700ce8934a526cc456a1b1cd5c621db9.

* a little bit more efficient checks

* added notes about max values in categorical features

* Revert "a little bit more efficient checks"

This reverts commit bed88830243da21a2db454873c0e308126e05732.

* Revert "fixed pylint"

This reverts commit a229e1563b0abc1b13de6358577abf90bd529015.

* Revert "added tests"

This reverts commit 299e001b7550111555b80730d673d4f225cf5f74.

* Revert "added checks for categorical features > max_int32"

This reverts commit 2cc7afacde7c6366644f6988ccedc344752b68c7.
parent 3f54429c
@@ -105,12 +105,11 @@ Microsoft Open Source Code of Conduct
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
Reference Papers
----------------
Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. "[LightGBM: A Highly Efficient Gradient Boosting Decision Tree](https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree)". In Advances in Neural Information Processing Systems (NIPS), pp. 3149-3157. 2017.
Qi Meng, Guolin Ke, Taifeng Wang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, and Tie-Yan Liu. "[A Communication-Efficient Parallel Algorithm for Decision Tree](http://papers.nips.cc/paper/6380-a-communication-efficient-parallel-algorithm-for-decision-tree)". In Advances in Neural Information Processing Systems 29 (NIPS 2016).
Huan Zhang, Si Si and Cho-Jui Hsieh. "[GPU Acceleration for Large-scale Tree Boosting](https://arxiv.org/abs/1706.08359)". arXiv:1706.08359, 2017.
@@ -15,13 +15,13 @@ Missing Value Handle
Categorical Feature Support
---------------------------
- LightGBM can offer good accuracy when using native categorical features. Unlike simple one-hot encoding, LightGBM can find the optimal split of categorical features.
  Such an optimal split can provide much better accuracy than a one-hot encoding solution.
- Use ``categorical_feature`` to specify the categorical features.
  Refer to the parameter ``categorical_feature`` in `Parameters <./Parameters.rst>`__.
- Converting to ``int`` type is needed first, and only non-negative numbers are supported. Also, all values should be less than ``Int32.MaxValue`` (2147483647).
  It is better to convert them into contiguous ranges.
- Use ``min_data_per_group`` and ``cat_smooth`` to deal with over-fitting
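As a minimal plain-Python sketch of the re-encoding step above (the helper name is illustrative, not part of LightGBM's API), an arbitrary column can be mapped onto a contiguous non-negative range like this:

```python
# Illustrative helper (not part of LightGBM): map raw category values,
# which may be huge ID-like numbers, onto the contiguous range [0, n_categories).
def encode_contiguous(values):
    mapping = {}
    encoded = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)  # assign the next unused code
        encoded.append(mapping[v])
    return encoded, mapping

raw = [900000001, 42, 900000001, 7, 42]  # sparse, ID-like categories
codes, mapping = encode_contiguous(raw)
assert codes == [0, 1, 0, 2, 1]  # contiguous and far below Int32.MaxValue
```

The inverse ``mapping`` must be kept around so that the same encoding can be applied to validation and prediction data.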
@@ -107,6 +107,14 @@ LightGBM
--------------
- **Question 9**: When I try to specify a column as categorical with the ``categorical_feature`` parameter, I get a segmentation fault in LightGBM.
- **Solution 9**: You are probably passing a column with very large values (for instance, some IDs) via the ``categorical_feature`` parameter.
  In LightGBM, categorical features are limited to the int32 range, so you cannot pass values greater than ``Int32.MaxValue`` (2147483647) as categorical features
  (see `Microsoft/LightGBM#1359 <https://github.com/Microsoft/LightGBM/issues/1359>`__). You should first convert them to integers in the range from zero to the number of categories.
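A hedged sketch of a pre-check for this situation (``INT32_MAX`` and the function below are illustrative helpers, not LightGBM API): fail fast with a clear error instead of hitting the segmentation fault.

```python
INT32_MAX = 2147483647  # Int32.MaxValue

# Illustrative pre-check (not part of LightGBM): reject categorical
# columns whose values do not fit in int32.
def check_categorical_values(values):
    for i, v in enumerate(values):
        if v > INT32_MAX:
            raise ValueError(
                "value %d at row %d does not fit in int32; "
                "re-encode the column to [0, n_categories) first" % (v, i))

user_ids = [3000000000, 1, 2]  # e.g. 64-bit IDs mistakenly used as categories
try:
    check_categorical_values(user_ids)
except ValueError as err:
    print("refusing to use column as categorical:", err)
```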
--------------
R-package
~~~~~~~~~
@@ -63,7 +63,7 @@ So, LightGBM can use an additional parameter ``max_depth`` to limit depth of tree
Optimal Split for Categorical Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We often convert categorical features into one-hot encoding.
However, this is not a good solution for the tree learner.
The reason is that, for high-cardinality categorical features, the tree will grow very unbalanced and needs to be very deep to achieve good accuracy.
@@ -441,6 +441,8 @@ IO Parameters
- **Note**: only supports categorical features with ``int`` type. Index starts from ``0`` and it doesn't count the label column
- **Note**: all values should be less than ``Int32.MaxValue`` (2147483647)
- **Note**: the negative values will be treated as **missing values**
- ``predict_raw_score``, default=\ ``false``, type=bool, alias=\ ``raw_score``, ``is_predict_raw_score``, ``predict_rawscore``
@@ -29,7 +29,7 @@ Some columns could be ignored.
Categorical Feature Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~
LightGBM can use categorical features directly (without one-hot encoding).
The experiment on `Expo data`_ shows about 8x speed-up compared with one-hot encoding.
For the setting details, please refer to `Parameters <./Parameters.rst>`__.
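For illustration only, a parameter setup following the notes above might look like the sketch below; the column indices are placeholders for whichever columns are categorical in your data (0-based, label column not counted), and the two regularization values are arbitrary examples, not recommendations.

```python
# Hypothetical parameter setup; indices 0, 2 and 5 are illustrative.
params = {
    "objective": "binary",
    "categorical_feature": "0,2,5",   # categorical columns, already int-encoded
    "min_data_per_group": 100,        # guards against over-fitting on small categories
    "cat_smooth": 10.0,               # smoothing for categorical splits
}
print(params["categorical_feature"])
```

Remember that the values in those columns must already be non-negative integers below ``Int32.MaxValue``.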
@@ -603,6 +603,7 @@ class Dataset(object):
If list of int, interpreted as indices.
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
All values should be less than int32 max value (2147483647).
params: dict or None, optional (default=None)
Other parameters.
free_raw_data: bool, optional (default=True)
@@ -53,6 +53,7 @@ def train(params, train_set, num_boost_round=100,
If list of int, interpreted as indices.
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
All values should be less than int32 max value (2147483647).
early_stopping_rounds: int or None, optional (default=None)
Activates early stopping. The model will train until the validation score stops improving.
Requires at least one validation data and one metric. If there's more than one, will check all of them.
@@ -354,6 +355,7 @@ def cv(params, train_set, num_boost_round=100,
If list of int, interpreted as indices.
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
All values should be less than int32 max value (2147483647).
early_stopping_rounds: int or None, optional (default=None)
Activates early stopping. CV error needs to decrease at least
every ``early_stopping_rounds`` round(s) to continue.
@@ -341,6 +341,7 @@ class LGBMModel(_LGBMModelBase):
If list of int, interpreted as indices.
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
All values should be less than int32 max value (2147483647).
callbacks : list of callback functions or None, optional (default=None)
List of callback functions that are applied at each iteration.
See Callbacks in Python API for more information.