News
----
12/05/2016 : **Categorical features as input directly** (without one-hot encoding). An experiment on [Expo data](http://stat-computing.org/dataexpo/2009/) shows about an 8x speed-up with the same accuracy compared with one-hot encoding (refer to the [categorical log](https://github.com/guolinke/boosting_tree_benchmarks/blob/master/lightgbm/lightgbm_dataexpo_speed.log) and [one-hot log](https://github.com/guolinke/boosting_tree_benchmarks/blob/master/lightgbm/lightgbm_dataexpo_onehot_speed.log)).
For the setting details, please refer to [IO Parameters](https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.md#io-parameters).
12/02/2016 : Released the [**python-package**](https://github.com/Microsoft/LightGBM/tree/master/python-package) beta version. You are welcome to try it out and provide issues and feedback.
This page contains all parameters in LightGBM.
## Parameter format
The parameter format is ```key1=value1 key2=value2 ...```. Parameters can be set both in a config file and on the command line. On the command line, parameters must not have spaces before or after ```=```. In a config file, each line may contain only one parameter, and you can use ```#``` for comments. If a parameter appears both on the command line and in the config file, LightGBM will use the one from the command line.
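As an illustration of these rules (this is a sketch, not LightGBM's actual parser), the following shows how config-file lines with ```#``` comments and command-line ```key=value``` pairs combine, with the command line winning on conflicts:

```python
# Illustrative sketch of the parameter rules described above; the
# function name and behavior are assumptions, not LightGBM internals.

def parse_params(config_lines, cli_args):
    """Merge config-file lines and command-line args into one dict."""
    params = {}
    for line in config_lines:
        line = line.split("#", 1)[0].strip()  # drop "#" comments
        if not line:
            continue
        key, _, value = line.partition("=")   # one parameter per line
        params[key.strip()] = value.strip()
    for arg in cli_args:                      # command line overrides
        key, _, value = arg.partition("=")
        params[key] = value
    return params

config = ["num_leaves = 127  # model complexity", "learning_rate = 0.05"]
cli = ["learning_rate=0.1"]
print(parse_params(config, cli))  # learning_rate from the command line wins
```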
* For the best speed, set this to the number of **real CPU cores**, not the number of threads (most CPUs use [hyper-threading](https://en.wikipedia.org/wiki/Hyper-threading) to run 2 threads per CPU core).
* For parallel learning, do not use all CPU cores, since this will cause poor performance for the network.
## Learning control parameters
* ```max_depth```, default=```-1```, type=int
* Limits the max depth of the tree model. This is used to deal with over-fitting when #data is small. The tree still grows leaf-wise.
* LightGBM will randomly select a subset of features on each iteration if ```feature_fraction``` is smaller than ```1.0```. For example, if it is set to ```0.8```, 80% of features will be selected before training each tree.
* By default, LightGBM maps the data file to memory and loads features from memory. This provides faster data loading, but it may run out of memory when the data file is very big.
* Set this to ```true``` if the data file is too big to fit in memory.
* Use a number for an index, e.g. ```weight=0``` means column_0 is the weight
* Add a prefix ```name:``` for column name, e.g. ```weight=name:weight```
* Note: Indices start from ```0```, and the label column is not counted when the passed type is an index. E.g. when the label is column_0 and the weight is column_1, the correct parameter is ```weight=0```.
* Use a number for an index, e.g. ```query=0``` means column_0 is the query id
* Add a prefix ```name:``` for column name, e.g. ```query=name:query_id```
* Note: Data should be grouped by query_id. Indices start from ```0```, and the label column is not counted when the passed type is an index. E.g. when the label is column_0 and the query_id is column_1, the correct parameter is ```query=0```.
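Since the label-skipping index convention above is easy to get wrong, here is a small sketch of it (the helper name is hypothetical, not part of LightGBM):

```python
# Hypothetical helper illustrating the index convention: column indices
# start at 0 and the label column is not counted.

def param_index(columns, label_col, target_col):
    """Return the index LightGBM expects for target_col, skipping label_col."""
    non_label = [c for c in columns if c != label_col]
    return non_label.index(target_col)

cols = ["label", "weight", "f0", "f1"]
print(param_index(cols, "label", "weight"))  # -> 0, so pass weight=0
print(param_index(cols, "label", "f1"))      # -> 2
```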
* ```metric```, default={```l2``` for regression}, {```binary_logloss``` for binary classification}, {```ndcg``` for lambdarank}, type=multi-enum, options=```l1```, ```l2```, ```ndcg```, ```auc```, ```binary_logloss```, ```binary_error```
* File that lists the machines for this parallel learning application
* Each line contains one IP and one port for one machine. The format is ```ip port```, separated by a space.
## Tuning Parameters
### Convert parameters from XGBoost
LightGBM uses the [leaf-wise](https://github.com/Microsoft/LightGBM/wiki/Features#optimization-in-accuracy) tree growth algorithm, while other popular tools, e.g. XGBoost, use depth-wise tree growth. So LightGBM uses ```num_leaves``` to control the complexity of the tree model, while other tools usually use ```max_depth```. The following table shows the correspondence between leaves and depth. The relation is ```num_leaves = 2^(max_depth)```.
| max_depth | num_leaves |
| --------- | ---------- |
| 1 | 2 |
| 2 | 4 |
| 3 | 8 |
| 7 | 128 |
| 10 | 1024 |
### For faster speed
* Use bagging by setting ```bagging_fraction``` and ```bagging_freq```
* Use feature sub-sampling by setting ```feature_fraction```
* Use small ```max_bin```
* Use ```save_binary``` to speed up data loading in future learning
* Use parallel learning, refer to [parallel learning guide](https://github.com/Microsoft/LightGBM/wiki/Parallel-Learning-Guide).
### For better accuracy
* Use large ```max_bin``` (may be slower)
* Use small ```learning_rate``` with large ```num_iterations```
* Use large ```num_leaves``` (may cause over-fitting)
* Use bigger training data
* Try ```dart```
### Deal with over-fitting
* Use small ```max_bin```
* Use small ```num_leaves```
* Use ```min_data_in_leaf``` and ```min_sum_hessian_in_leaf```
* Use bagging by setting ```bagging_fraction``` and ```bagging_freq```
* Use feature sub-sampling by setting ```feature_fraction```
* Use bigger training data
* Try ```lambda_l1```, ```lambda_l2``` and ```min_gain_to_split``` for regularization
* Try ```max_depth``` to avoid growing a deep tree
## Others
### Continued training with input score
LightGBM supports continued training with an initial score. It uses an additional file to store these initial scores, like the following:
```
0.5
-0.1
0.9
...
```
It means the initial score of the first data row is ```0.5```, the second is ```-0.1```, and so on. The initial score file corresponds with the data file line by line, with one score per line. If the data file is named "train.txt", the initial score file should be named "train.txt.init" and placed in the same folder as the data file. LightGBM will auto-load the initial score file if it exists.
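A minimal sketch of producing such a file, assuming the data file is named "train.txt" and the scores are already computed:

```python
# Write one initial score per line to "<data file>.init", next to the
# data file, following the naming convention described above.

data_file = "train.txt"
initial_scores = [0.5, -0.1, 0.9]  # one score per data row

with open(data_file + ".init", "w") as f:
    for score in initial_scores:
        f.write(f"{score}\n")
```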
### Weight data
LightGBM supports weighted training. It uses an additional file to store weight data, like the following:
```
1.0
0.5
0.8
...
```
It means the weight of the first data row is ```1.0```, the second is ```0.5```, and so on. The weight file corresponds with the data file line by line, with one weight per line. If the data file is named "train.txt", the weight file should be named "train.txt.weight" and placed in the same folder as the data file. LightGBM will auto-load the weight file if it exists.
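A minimal sketch of writing such a weight file, again assuming the data file is named "train.txt":

```python
# Write one weight per line to "<data file>.weight", following the
# naming convention described above.

weights = [1.0, 0.5, 0.8]  # one weight per data row
with open("train.txt" + ".weight", "w") as f:
    f.write("\n".join(str(w) for w in weights) + "\n")
```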
Update:
You can now specify a weight column in the data file. Please refer to the ```weight``` parameter above.
### Query data
For LambdaRank learning, query information is needed for the training data. LightGBM uses an additional file to store query data. The following is an example:
```
27
18
67
...
```
It means the first ```27``` lines of samples belong to one query, the next ```18``` lines belong to another, and so on. (**Note: data should be ordered by query.**) If the data file is named "train.txt", the query file should be named "train.txt.query" and placed in the same folder as the training data. LightGBM will load the query file automatically if it exists.
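If you start from a per-row query id instead of per-query counts, the counts for the query file can be derived as follows (a sketch; the rows must already be grouped/ordered by query id, as the note above requires):

```python
# Derive per-query counts (query-file lines) from per-row query ids.
from itertools import groupby

query_ids = [3, 3, 3, 7, 7, 9]  # one id per data row, already grouped
counts = [len(list(g)) for _, g in groupby(query_ids)]
print(counts)  # -> [3, 2, 1]: 3 rows for query 3, 2 for query 7, 1 for query 9
```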
Update:
You can now specify a query/group id in the data file. Please refer to the ```group``` parameter above.