[docs] negative values in category columns (#1567)

* broadcast info about negative values in categorical features to python package
* update link to categorical_feature parameter

Commit 93764fda authored Aug 08, 2018 by Nikita Titov, committed by GitHub Aug 08, 2018
7 changed files
--- a/docs/Advanced-Topics.rst
+++ b/docs/Advanced-Topics.rst
@@ -21,7 +21,7 @@ Categorical Feature Support
   `described here <./Features.rst#optimal-split-for-categorical-features>`_. This often performs better than one-hot encoding.

 -  Use ``categorical_feature`` to specify the categorical features.
-   Refer to the parameter ``categorical_feature`` in `Parameters <./Parameters.rst>`__.
+   Refer to the parameter ``categorical_feature`` in `Parameters <./Parameters.rst#categorical_feature>`__.

 -  Categorical features must be encoded as non-negative integers (``int``) less than ``Int32.MaxValue`` (2147483647).
   It is best to use a contiguous range of integers.

--- a/docs/Parameters.rst
+++ b/docs/Parameters.rst
@@ -567,7 +567,7 @@ IO Parameters

   -  **Note**: all values should be less than ``Int32.MaxValue`` (2147483647)

-   -  **Note**: the negative values will be treated as **missing values**
+   -  **Note**: all negative values will be treated as **missing values**

 -  ``predict_raw_score`` :raw-html:`<a id="predict_raw_score" title="Permalink to this parameter" href="#predict_raw_score">&#x1F517;&#xFE0E;</a>`, default = ``false``, type = bool, aliases: ``is_predict_raw_score``, ``predict_rawscore``, ``raw_score``


--- a/docs/Quick-Start.rst
+++ b/docs/Quick-Start.rst
@@ -32,7 +32,7 @@ Categorical Feature Support
 LightGBM can use categorical features directly (without one-hot encoding).
 The experiment on `Expo data`_ shows about 8x speed-up compared with one-hot encoding.

-For the setting details, please refer to `Parameters <./Parameters.rst>`__.
+For the setting details, please refer to `Parameters <./Parameters.rst#categorical_feature>`__.

 Weight and Query/Group Data
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- a/include/LightGBM/config.h
+++ b/include/LightGBM/config.h
@@ -533,7 +533,7 @@ public:
  // desc = **Note**: only supports categorical with ``int`` type
  // desc = **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``
  // desc = **Note**: all values should be less than ``Int32.MaxValue`` (2147483647)
-  // desc = **Note**: the negative values will be treated as **missing values**
+  // desc = **Note**: all negative values will be treated as **missing values**
  std::string categorical_feature = "";

  // alias = is_predict_raw_score, predict_rawscore, raw_score

--- a/python-package/lightgbm/basic.py
+++ b/python-package/lightgbm/basic.py
@@ -605,7 +605,8 @@ class Dataset(object):
            If list of int, interpreted as indices.
            If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
            If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
-            All values should be less than int32 max value (2147483647).
+            All values in categorical features should be less than int32 max value (2147483647).
+            All negative values in categorical features will be treated as missing values.
        params: dict or None, optional (default=None)
            Other parameters.
        free_raw_data: bool, optional (default=True)

--- a/python-package/lightgbm/engine.py
+++ b/python-package/lightgbm/engine.py
@@ -56,7 +56,8 @@ def train(params, train_set, num_boost_round=100,
        If list of int, interpreted as indices.
        If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
        If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
-        All values should be less than int32 max value (2147483647).
+        All values in categorical features should be less than int32 max value (2147483647).
+        All negative values in categorical features will be treated as missing values.
    early_stopping_rounds: int or None, optional (default=None)
        Activates early stopping. The model will train until the validation score stops improving.
        Requires at least one validation data and one metric. If there's more than one, will check all of them except the training data.
@@ -365,7 +366,8 @@ def cv(params, train_set, num_boost_round=100,
        If list of int, interpreted as indices.
        If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
        If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
-        All values should be less than int32 max value (2147483647).
+        All values in categorical features should be less than int32 max value (2147483647).
+        All negative values in categorical features will be treated as missing values.
    early_stopping_rounds: int or None, optional (default=None)
        Activates early stopping. CV error needs to decrease at least
        every ``early_stopping_rounds`` round(s) to continue.

--- a/python-package/lightgbm/sklearn.py
+++ b/python-package/lightgbm/sklearn.py
@@ -346,7 +346,8 @@ class LGBMModel(_LGBMModelBase):
            If list of int, interpreted as indices.
            If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
            If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
-            All values should be less than int32 max value (2147483647).
+            All values in categorical features should be less than int32 max value (2147483647).
+            All negative values in categorical features will be treated as missing values.
        callbacks : list of callback functions or None, optional (default=None)
            List of callback functions that are applied at each iteration.
            See Callbacks in Python API for more information.
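
The rules this commit documents for categorical columns (integer values, below ``Int32.MaxValue``, negatives treated as missing) can be illustrated with a small pre-check. This is only a sketch: ``classify_category_value`` and ``summarize_column`` are hypothetical helpers written for this note, not part of LightGBM, and the "invalid" label is an assumption about how a caller might want to flag values that the docs say should not be used.

```python
INT32_MAX = 2147483647  # Int32.MaxValue, as quoted in the docs touched by this commit


def classify_category_value(value):
    """Classify one raw categorical value according to the documented rules.

    Hypothetical helper for illustration only; it does not call LightGBM.
    """
    # Docs: only categorical features with ``int`` type are supported.
    if isinstance(value, bool) or not isinstance(value, int):
        return "invalid"
    # Docs: all negative values will be treated as missing values.
    if value < 0:
        return "missing"
    # Docs: all values should be less than Int32.MaxValue.
    if value >= INT32_MAX:
        return "invalid"
    return "category"


def summarize_column(values):
    """Count how many values fall into each class for one column."""
    counts = {"category": 0, "missing": 0, "invalid": 0}
    for value in values:
        counts[classify_category_value(value)] += 1
    return counts


if __name__ == "__main__":
    print(summarize_column([0, 3, -1, -7, 2147483647]))
```

Running such a check before building a ``Dataset`` makes it visible when a sentinel such as ``-1`` would end up as a missing value rather than as its own category.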