Commit d121ac7e authored by Guolin Ke

update documents.

parent cc11525d
@@ -25,13 +25,15 @@ For more details, please refer to [Features](https://github.com/Microsoft/LightG
News
----
08/15/2017 : Optimal split for categorical features.
07/13/2017 : [Gitter](https://gitter.im/Microsoft/LightGBM) is available.
06/20/2017 : Python-package is on [PyPI](https://pypi.python.org/pypi/lightgbm) now.
06/09/2017 : [LightGBM Slack team](https://lightgbm.slack.com) is available.
05/03/2017 : LightGBM v2 stable release.
04/10/2017 : LightGBM supports GPU-accelerated tree learning now. Please read our [GPU Tutorial](./docs/GPU-Tutorial.md) and [Performance Comparison](./docs/GPU-Performance.md).
@@ -41,7 +43,7 @@ News
01/08/2017 : Release [**R-package**](https://github.com/Microsoft/LightGBM/tree/master/R-package) beta version; you are welcome to try it and provide feedback.
12/05/2016 : **Categorical Features as input directly** (without one-hot coding).
12/02/2016 : Release [**python-package**](https://github.com/Microsoft/LightGBM/tree/master/python-package) beta version; you are welcome to try it and provide feedback.
# Advanced Topics
## Missing value handling
* LightGBM enables missing value handling by default; you can disable it by setting `use_missing=false`.
* LightGBM uses NA (NaN) to represent missing values by default; you can change this to zero by setting `zero_as_missing=true` (a sketch of these switches follows this list).
* When `zero_as_missing=false` (default), unrecorded values in sparse matrices (and LibSVM files) are treated as zeros.
* When `zero_as_missing=true`, NA and zeros (including unrecorded values in sparse matrices and LibSVM files) are treated as missing.
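
A minimal sketch of these switches using the Python package; the toy data below is made up, and `use_missing` / `zero_as_missing` are the parameters described above:

```python
import numpy as np
import lightgbm as lgb

# Toy data with NaN used as the missing value marker.
X = np.array([[1.0, np.nan], [2.0, 0.0], [3.0, 1.0], [4.0, np.nan]] * 10)
y = np.array([0, 1, 0, 1] * 10)

params = {
    "objective": "binary",
    "use_missing": True,       # default; set to False to disable missing value handling
    "zero_as_missing": False,  # default; zeros stay zeros, only NaN is treated as missing
    "verbose": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=5)
```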
## Categorical feature support
* LightGBM can offer good accuracy when using native categorical features. Unlike simple one-hot coding, LightGBM can find the optimal split of a categorical feature, which can provide much better accuracy than the one-hot coding solution.
* Use `categorical_feature` to specify the categorical features. Refer to the parameter `categorical_feature` in [Parameters](./Parameters.md); a usage sketch follows this list.
* The features need to be converted to `int` type first, and only non-negative values are supported. It is better to convert them into a contiguous range.
* Use `max_cat_group` and `cat_smooth_ratio` to deal with over-fitting (when #data is small or #categories is large).
* For categorical features with high cardinality (#categories is large), it is better to convert them to numerical features.
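
A hedged sketch using the Python package; the `city`, `price`, and `label` columns are hypothetical, and `city` is assumed to be already encoded as contiguous non-negative `int` codes:

```python
import pandas as pd
import lightgbm as lgb

# Hypothetical data: "city" holds contiguous non-negative int codes, not one-hot columns.
df = pd.DataFrame({
    "city": [0, 1, 2, 0, 1, 2, 0, 1] * 5,
    "price": [1.0, 2.5, 3.0, 1.2, 2.4, 3.1, 0.9, 2.6] * 5,
    "label": [0, 1, 1, 0, 1, 1, 0, 1] * 5,
})

train_set = lgb.Dataset(
    df[["city", "price"]],
    label=df["label"],
    categorical_feature=["city"],  # LightGBM finds the optimal split over these codes
)
booster = lgb.train({"objective": "binary", "verbose": -1}, train_set, num_boost_round=5)
```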
## LambdaRank
* The label should be of `int` type, where a larger number represents higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect).
* Use `label_gain` to set the gain (weight) of each `int` label.
* Use `max_position` to set the position at which NDCG is optimized (a setup sketch follows this list).
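
A minimal LambdaRank setup sketch using the Python package; the data, query grouping, and gain values below are made up:

```python
import numpy as np
import lightgbm as lgb

X = np.random.rand(12, 3)
y = np.array([0, 1, 2, 3] * 3)  # int relevance labels; larger = more relevant
group = [4, 4, 4]               # three queries with four documents each

params = {
    "objective": "lambdarank",
    "label_gain": [0, 1, 3, 7],  # gain per relevance level 0..3
    "max_position": 3,           # optimize NDCG at the top 3 positions
    "verbose": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y, group=group), num_boost_round=5)
```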
## Parameters Tuning
* Refer to [Parameters tuning](./Parameters-tuning.md).
## GPU support
* Refer to [GPU Tutorial](./GPU-Tutorial.md) and [GPU Targets](./GPU-Targets.md).
## Parallel Learning
* Refer to the [Parallel Learning Guide](https://github.com/Microsoft/LightGBM/wiki/Parallel-Learning-Guide).
# Parameters
This page contains all parameters in LightGBM.
***List of other Helpful Links***
* [Python API Reference](./Python-API.md)
@@ -125,6 +125,21 @@ The parameter format is `key1=value1 key2=value2 ... ` . And parameters can be s
  * only used in `goss`, the retain ratio of large gradient data
* `other_rate`, default=`0.1`, type=double
  * only used in `goss`, the retain ratio of small gradient data
* `max_cat_group`, default=`64`, type=int
  * used for categorical features.
  * When #categories is large, finding the split point on it easily over-fits. So LightGBM merges the categories into `max_cat_group` groups and finds the split points on the group boundaries.
* `min_data_per_group`, default=`10`, type=int
  * minimal number of data per categorical group.
* `max_cat_threshold`, default=`256`, type=int
  * used for categorical features. Limits the maximal number of threshold points considered for categorical features.
* `min_cat_smooth`, default=`5`, type=double
  * used for categorical features. Refer to the description of the parameter `cat_smooth_ratio`.
* `max_cat_smooth`, default=`100`, type=double
  * used for categorical features. Refer to the description of the parameter `cat_smooth_ratio`.
* `cat_smooth_ratio`, default=`0.01`, type=double
  * used for categorical features. This can reduce the effect of noise in categorical features, especially for categories with little data.
  * The smooth denominator is `a = min(max_cat_smooth, max(min_cat_smooth, num_data/num_category*cat_smooth_ratio))`.
  * The smooth numerator is `b = a * sum_gradient / sum_hessian`. (A worked sketch of these two terms follows this list.)
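
A worked sketch of the two smoothing terms above; `num_data`, `num_category`, and the gradient/hessian sums are illustrative values only:

```python
# Defaults from the parameters above.
min_cat_smooth = 5.0
max_cat_smooth = 100.0
cat_smooth_ratio = 0.01

# Hypothetical dataset statistics for one categorical feature.
num_data = 100000
num_category = 50
sum_gradient = 250.0  # illustrative gradient sum
sum_hessian = 500.0   # illustrative hessian sum

# Smooth denominator: clamped between min_cat_smooth and max_cat_smooth.
a = min(max_cat_smooth, max(min_cat_smooth, num_data / num_category * cat_smooth_ratio))
# Smooth numerator: proportional to the overall gradient/hessian ratio.
b = a * sum_gradient / sum_hessian

print(a, b)  # 20.0 10.0
```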
## IO parameters
@@ -181,7 +196,7 @@ The parameter format is `key1=value1 key2=value2 ... ` . And parameters can be s
  * used to specify categorical features
  * Use numbers for indices, e.g. `categorical_feature=0,1,2` means column_0, column_1 and column_2 are categorical features.
  * Add a prefix `name:` for column names, e.g. `categorical_feature=name:c1,c2,c3` means c1, c2 and c3 are categorical features.
  * Note: Only categorical features of `int` type are supported (negative values will be treated as missing values). Indices start from `0` and do not count the label column.
* `predict_raw_score`, default=`false`, type=bool, alias=`raw_score`,`is_predict_raw_score`
  * only used in the prediction task
  * Set to `true` to predict only the raw scores (see the sketch below).
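
A minimal sketch using the Python package, where the `raw_score=True` keyword of `Booster.predict` plays the role of `predict_raw_score=true` (toy data, made up):

```python
import numpy as np
import lightgbm as lgb

X = np.random.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)
booster = lgb.train({"objective": "binary", "verbose": -1}, lgb.Dataset(X, label=y),
                    num_boost_round=10)

prob = booster.predict(X)                 # transformed scores (probabilities for binary)
raw = booster.predict(X, raw_score=True)  # untransformed raw scores
```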