Commit d121ac7e authored by Guolin Ke

update documents.

parent cc11525d
@@ -25,13 +25,15 @@ For more details, please refer to [Features](https://github.com/Microsoft/LightG
News
----
08/15/2017 : Optimal split for categorical features.
07/13/2017 : [Gitter](https://gitter.im/Microsoft/LightGBM) is available.
06/20/2017 : Python-package is on [PyPI](https://pypi.python.org/pypi/lightgbm) now.
06/09/2017 : [LightGBM Slack team](https://lightgbm.slack.com) is available.
05/03/2017 : LightGBM v2 stable release.
04/10/2017 : LightGBM supports GPU-accelerated tree learning now. Please read our [GPU Tutorial](./docs/GPU-Tutorial.md) and [Performance Comparison](./docs/GPU-Performance.md).
@@ -41,7 +43,7 @@ News
01/08/2017 : Release [**R-package**](https://github.com/Microsoft/LightGBM/tree/master/R-package) beta version; you are welcome to try it and provide feedback.
12/05/2016 : **Categorical Features as input directly** (without one-hot coding).
12/02/2016 : Release [**python-package**](https://github.com/Microsoft/LightGBM/tree/master/python-package) beta version; you are welcome to try it and provide feedback.
# Advanced Topics
## Missing value handling
* LightGBM enables missing value handling by default; you can disable it by setting `use_missing=false`.
* LightGBM uses NA (NaN) to represent missing values by default; you can change this to zero by setting `zero_as_missing=true` (a sketch of these switches follows this list).
* When `zero_as_missing=false` (default), unrecorded values in sparse matrices (and LibSVM files) are treated as zeros.
* When `zero_as_missing=true`, NA and zeros (including unrecorded values in sparse matrices and LibSVM files) are treated as missing.
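
A minimal sketch of these switches using the Python package; the toy data below is made up, and `use_missing` / `zero_as_missing` are the parameters described above:

```python
import numpy as np
import lightgbm as lgb

# Toy data with NaN used as the missing value marker.
X = np.array([[1.0, np.nan], [2.0, 0.0], [3.0, 1.0], [4.0, np.nan]] * 10)
y = np.array([0, 1, 0, 1] * 10)

params = {
    "objective": "binary",
    "use_missing": True,       # default; set to False to disable missing value handling
    "zero_as_missing": False,  # default; zeros stay zeros, only NaN is treated as missing
    "verbose": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=5)
```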
## Categorical feature support
* LightGBM can offer good accuracy when using native categorical features. Unlike simple one-hot coding, LightGBM can find the optimal split of a categorical feature, which can provide much better accuracy than the one-hot coding solution.
* Use `categorical_feature` to specify the categorical features. Refer to the parameter `categorical_feature` in [Parameters](./Parameters.md); a usage sketch follows this list.
* The features need to be converted to `int` type first, and only non-negative values are supported. It is better to convert them into a contiguous range.
* Use `max_cat_group` and `cat_smooth_ratio` to deal with over-fitting (when #data is small or #categories is large).
* For categorical features with high cardinality (#categories is large), it is better to convert them to numerical features.
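
A hedged sketch using the Python package; the `city`, `price`, and `label` columns are hypothetical, and `city` is assumed to be already encoded as contiguous non-negative `int` codes:

```python
import pandas as pd
import lightgbm as lgb

# Hypothetical data: "city" holds contiguous non-negative int codes, not one-hot columns.
df = pd.DataFrame({
    "city": [0, 1, 2, 0, 1, 2, 0, 1] * 5,
    "price": [1.0, 2.5, 3.0, 1.2, 2.4, 3.1, 0.9, 2.6] * 5,
    "label": [0, 1, 1, 0, 1, 1, 0, 1] * 5,
})

train_set = lgb.Dataset(
    df[["city", "price"]],
    label=df["label"],
    categorical_feature=["city"],  # LightGBM finds the optimal split over these codes
)
booster = lgb.train({"objective": "binary", "verbose": -1}, train_set, num_boost_round=5)
```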
## LambdaRank
* The label should be of `int` type, where a larger number represents higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect).
* Use `label_gain` to set the gain (weight) of each `int` label.
* Use `max_position` to set the position at which NDCG is optimized (a setup sketch follows this list).
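
A minimal LambdaRank setup sketch using the Python package; the data, query grouping, and gain values below are made up:

```python
import numpy as np
import lightgbm as lgb

X = np.random.rand(12, 3)
y = np.array([0, 1, 2, 3] * 3)  # int relevance labels; larger = more relevant
group = [4, 4, 4]               # three queries with four documents each

params = {
    "objective": "lambdarank",
    "label_gain": [0, 1, 3, 7],  # gain per relevance level 0..3
    "max_position": 3,           # optimize NDCG at the top 3 positions
    "verbose": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y, group=group), num_boost_round=5)
```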
## Parameters Tuning
* Refer to [Parameters tuning](./Parameters-tuning.md).
## GPU support
* Refer to [GPU Tutorial](./GPU-Tutorial.md) and [GPU Targets](./GPU-Targets.md).
## Parallel Learning
* Refer to the [Parallel Learning Guide](https://github.com/Microsoft/LightGBM/wiki/Parallel-Learning-Guide).
# Parameters
This page contains all parameters in LightGBM.
***List of other Helpful Links***
* [Python API Reference](./Python-API.md)
@@ -125,6 +125,21 @@ The parameter format is `key1=value1 key2=value2 ... ` . And parameters can be s
  * only used in `goss`, the retain ratio of large gradient data
* `other_rate`, default=`0.1`, type=double
  * only used in `goss`, the retain ratio of small gradient data
* `max_cat_group`, default=`64`, type=int
  * used for categorical features.
  * When #categories is large, finding the split point on it easily over-fits. So LightGBM merges the categories into `max_cat_group` groups and finds the split points on the group boundaries.
* `min_data_per_group`, default=`10`, type=int
  * minimal number of data per categorical group.
* `max_cat_threshold`, default=`256`, type=int
  * used for categorical features. Limits the maximal number of threshold points considered for categorical features.
* `min_cat_smooth`, default=`5`, type=double
  * used for categorical features. Refer to the description of the parameter `cat_smooth_ratio`.
* `max_cat_smooth`, default=`100`, type=double
  * used for categorical features. Refer to the description of the parameter `cat_smooth_ratio`.
* `cat_smooth_ratio`, default=`0.01`, type=double
  * used for categorical features. This can reduce the effect of noise in categorical features, especially for categories with little data.
  * The smooth denominator is `a = min(max_cat_smooth, max(min_cat_smooth, num_data/num_category*cat_smooth_ratio))`.
  * The smooth numerator is `b = a * sum_gradient / sum_hessian`. (A worked sketch of these two terms follows this list.)
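
A worked sketch of the two smoothing terms above; `num_data`, `num_category`, and the gradient/hessian sums are illustrative values only:

```python
# Defaults from the parameters above.
min_cat_smooth = 5.0
max_cat_smooth = 100.0
cat_smooth_ratio = 0.01

# Hypothetical dataset statistics for one categorical feature.
num_data = 100000
num_category = 50
sum_gradient = 250.0  # illustrative gradient sum
sum_hessian = 500.0   # illustrative hessian sum

# Smooth denominator: clamped between min_cat_smooth and max_cat_smooth.
a = min(max_cat_smooth, max(min_cat_smooth, num_data / num_category * cat_smooth_ratio))
# Smooth numerator: proportional to the overall gradient/hessian ratio.
b = a * sum_gradient / sum_hessian

print(a, b)  # 20.0 10.0
```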
## IO parameters
@@ -181,7 +196,7 @@ The parameter format is `key1=value1 key2=value2 ... ` . And parameters can be s
  * used to specify categorical features
  * Use numbers for indices, e.g. `categorical_feature=0,1,2` means column_0, column_1 and column_2 are categorical features.
  * Add a prefix `name:` for column names, e.g. `categorical_feature=name:c1,c2,c3` means c1, c2 and c3 are categorical features.
  * Note: Only categorical features of `int` type are supported (negative values will be treated as missing values). Indices start from `0` and do not count the label column.
* `predict_raw_score`, default=`false`, type=bool, alias=`raw_score`,`is_predict_raw_score`
  * only used in the prediction task
  * Set to `true` to predict only the raw scores (see the sketch below).
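
A minimal sketch using the Python package, where the `raw_score=True` keyword of `Booster.predict` plays the role of `predict_raw_score=true` (toy data, made up):

```python
import numpy as np
import lightgbm as lgb

X = np.random.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)
booster = lgb.train({"objective": "binary", "verbose": -1}, lgb.Dataset(X, label=y),
                    num_boost_round=10)

prob = booster.predict(X)                 # transformed scores (probabilities for binary)
raw = booster.predict(X, raw_score=True)  # untransformed raw scores
```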