Commit 6d34fb86 authored by Nikita Titov's avatar Nikita Titov Committed by Guolin Ke

[docs] move wiki to Read the Docs (#945)

* fixed Python-API references

* moved Features section to ReadTheDocs

* fixed index of ReadTheDocs

* moved Experiments section to ReadTheDocs

* fixed capital letter

* fixed citing

* moved Parallel Learning section to ReadTheDocs

* fixed markdown

* fixed Python-API

* fixed link to Quick-Start

* fixed gpu docker README

* moved Installation Guide from wiki to ReadTheDocs

* removed references to wiki

* fixed capital letters in headings

* hotfixes

* fixed non-Unicode symbols and reference to Python API

* fixed citing references

* fixed links in .md files

* fixed links in .rst files

* store images locally in the repo

* fixed missed word

* fixed indent in Experiments.rst

* fixed 'Duplicate implicit target name' message which is successfully
resolved by adding anchors

* less verbose

* prevented mailto: ref creation

* fixed indents

* fixed 404

* fixed 403

* fixed 301

* fixed fake anchors

* fixed file extensions

* fixed Sphinx warnings

* added StrikerRUS profile link to FAQ

* added henry0312 profile link to FAQ
parent 4d15e4ff
......@@ -4,11 +4,11 @@ This is a page contains all parameters in LightGBM.
***List of other Helpful Links***
* [Parameters](./Parameters.md)
* [Python API Reference](./Python-API.md)
* [Python API](./Python-API.rst)
## Tune parameters for the leaf-wise(best-first) tree
## Tune Parameters for the Leaf-wise (Best-first) Tree
LightGBM uses the [leaf-wise](https://github.com/Microsoft/LightGBM/wiki/Features#optimization-in-accuracy) tree growth algorithm, while many other popular tools use depth-wise tree growth. Compared with depth-wise growth, the leaf-wise algorithm can converge much faster. However, leaf-wise growth may cause over-fitting if not used with the appropriate parameters.
LightGBM uses the [leaf-wise](./Features.md) tree growth algorithm, while many other popular tools use depth-wise tree growth. Compared with depth-wise growth, the leaf-wise algorithm can converge much faster. However, leaf-wise growth may cause over-fitting if not used with the appropriate parameters.
To get good results using a leaf-wise tree, these are some important parameters:
......@@ -19,15 +19,15 @@ To get good results using a leaf-wise tree, these are some important parameters:
3. ```max_depth```. You can also use ```max_depth``` to limit the tree depth explicitly.
## For faster speed
## For Faster Speed
* Use bagging by setting ```bagging_fraction``` and ```bagging_freq```
* Use feature sub-sampling by setting ```feature_fraction```
* Use small ```max_bin```
* Use ```save_binary``` to speed up data loading in future learning
* Use parallel learning, refer to [parallel learning guide](./Parallel-Learning-Guide.md).
* Use parallel learning, refer to [Parallel Learning Guide](./Parallel-Learning-Guide.rst).
## For better accuracy
## For Better Accuracy
* Use large ```max_bin``` (may be slower)
* Use small ```learning_rate``` with large ```num_iterations```
......@@ -35,7 +35,7 @@ To get good results using a leaf-wise tree, these are some important parameters:
* Use bigger training data
* Try ```dart```
## Deal with over-fitting
## Deal with Over-fitting
* Use small ```max_bin```
* Use small ```num_leaves```
......
......@@ -3,7 +3,7 @@
This is a page that contains all parameters in LightGBM.
***List of other Helpful Links***
* [Python API Reference](./Python-API.md)
* [Python API](./Python-API.rst)
* [Parameters Tuning](./Parameters-tuning.md)
***External Links***
......@@ -18,7 +18,7 @@ Default values for the following parameters have changed:
* num_leaves = 127 => 31
* num_iterations = 10 => 100
## Parameter format
## Parameter Format
The parameter format is `key1=value1 key2=value2 ...`. Parameters can be set both in a config file and on the command line. On the command line, parameters must not have spaces before or after `=`. In a config file, one line can contain only one parameter, and you can use `#` to comment. If a parameter appears in both the command line and the config file, LightGBM will use the one from the command line.
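As a hedged illustration of these rules (the file name and values here are invented for the example), a config file might look like:

```
# train.conf -- one parameter per line, '#' starts a comment
task = train
objective = binary
num_leaves = 31
metric = auc
```

The same parameters could be passed on the command line as `num_leaves=31 metric=auc` (no spaces around `=`); a value given on the command line overrides the one in the config file.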
......@@ -65,7 +65,7 @@ The parameter format is `key1=value1 key2=value2 ... ` . And parameters can be s
* `serial`, single machine tree learner
* `feature`, feature parallel tree learner
* `data`, data parallel tree learner
* Refer to [Parallel Learning Guide](./Parallel-Learning-Guide.md) to get more details.
* Refer to [Parallel Learning Guide](./Parallel-Learning-Guide.rst) to get more details.
* `num_threads`, default=OpenMP_default, type=int, alias=`num_thread`,`nthread`
* Number of threads for LightGBM.
* For the best speed, set this to the number of **real CPU cores**, not the number of threads (most CPUs use [hyper-threading](https://en.wikipedia.org/wiki/Hyper-threading) to generate 2 threads per CPU core).
......@@ -74,10 +74,11 @@ The parameter format is `key1=value1 key2=value2 ... ` . And parameters can be s
* For parallel learning, do not use all CPU cores, since this will cause poor network performance.
* `device`, default=`cpu`, options=`cpu`,`gpu`
* Choose the device for tree learning; you can use GPU to achieve faster learning.
* Note: 1. It is recommended to use a smaller `max_bin` (e.g. `63`) to get a better speedup. 2. For faster speed, the GPU uses 32-bit floating point to sum up by default, which may affect accuracy for some tasks. You can set `gpu_use_dp=true` to enable 64-bit floating point, but it will slow down training. 3. Refer to [Installation Guide](https://github.com/Microsoft/LightGBM/wiki/Installation-Guide#with-gpu-support) to build with GPU support.
* Note: 1. It is recommended to use a smaller `max_bin` (e.g. `63`) to get a better speedup. 2. For faster speed, the GPU uses 32-bit floating point to sum up by default, which may affect accuracy for some tasks. You can set `gpu_use_dp=true` to enable 64-bit floating point, but it will slow down training. 3. Refer to [Installation Guide](./Installation-Guide.rst) to build with GPU support.
## Learning control parameters
## Learning Control Parameters
* `max_depth`, default=`-1`, type=int
* Limit the max depth for the tree model. This is used to deal with over-fitting when #data is small. The tree still grows leaf-wise.
* `< 0` means no limit
......@@ -142,7 +143,7 @@ The parameter format is `key1=value1 key2=value2 ... ` . And parameters can be s
* The smooth numerator is `b = a * sum_gradient / sum_hessian`.
## IO parameters
## IO Parameters
* `max_bin`, default=`255`, type=int
* Max number of bins that feature values will be bucketed into. A small number of bins may reduce training accuracy but may increase generalization power (deal with over-fitting).
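The idea behind `max_bin` can be sketched in plain Python. This is a hypothetical illustration of value bucketing only, not LightGBM's actual histogram-binning algorithm; the function name is invented:

```python
# Rough sketch: cap the number of distinct bin boundaries per feature
# at `max_bin` by sampling evenly over the sorted unique values.
def bin_boundaries(values, max_bin):
    vals = sorted(set(values))
    if len(vals) <= max_bin:
        return vals  # few enough unique values: each gets its own bin
    step = len(vals) / max_bin
    return [vals[int(i * step)] for i in range(max_bin)]

# 4 unique values squeezed into at most 3 bins
bounds = bin_boundaries([0.1, 0.2, 0.2, 0.5, 0.9], max_bin=3)
```

With fewer boundaries, each feature histogram is smaller and faster to build, at the cost of coarser splits.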
......@@ -231,7 +232,8 @@ The parameter format is `key1=value1 key2=value2 ... ` . And parameters can be s
* Path of validation initial score file, `""` will use `valid_data_file+".init"` (if exists).
* Separate by `,` for multiple validation sets
## Objective parameters
## Objective Parameters
* `sigmoid`, default=`1.0`, type=double
* parameter for sigmoid function. Will be used in binary classification and lambdarank.
......@@ -257,7 +259,8 @@ The parameter format is `key1=value1 key2=value2 ... ` . And parameters can be s
* `num_class`, default=`1`, type=int, alias=`num_classes`
* only used in multi-class classification
## Metric parameters
## Metric Parameters
* `metric`, default={`l2` for regression}, {`binary_logloss` for binary classification},{`ndcg` for lambdarank}, type=multi-enum, options=`l1`,`l2`,`ndcg`,`auc`,`binary_logloss`,`binary_error`...
* `l1`, absolute loss, alias=`mean_absolute_error`, `mae`
......@@ -281,7 +284,8 @@ The parameter format is `key1=value1 key2=value2 ... ` . And parameters can be s
* `ndcg_at`, default=`1,2,3,4,5`, type=multi-int, alias=`ndcg_eval_at`,`eval_at`
* NDCG evaluation position, separate by `,`
## Network parameters
## Network Parameters
The following parameters are used for parallel learning, and are only used in the socket version.
......@@ -297,7 +301,8 @@ Following parameters are used for parallel learning, and only used for base(sock
* File that lists the machines for this parallel learning application
* Each line contains one IP and one port for one machine. The format is `ip port`, separated by a space.
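For illustration only (the file name, IPs, and ports here are invented), such a machine list file might look like:

```
# mlist.txt -- one machine per line, format: "ip port"
192.168.0.1 12400
192.168.0.2 12400
```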
## GPU parameters
## GPU Parameters
* `gpu_platform_id`, default=`-1`, type=int
* OpenCL platform ID. Usually each GPU vendor exposes one OpenCL platform.
......@@ -308,7 +313,8 @@ Following parameters are used for parallel learning, and only used for base(sock
* `gpu_use_dp`, default=`false`, type=bool
* Set to true to use double precision math on GPU (default using single precision).
## Convert model parameters
## Convert Model Parameters
This feature is currently only supported in the command line version.
......@@ -321,7 +327,8 @@ This feature is only supported in command line version yet.
## Others
### Continued training with input score
### Continued Training with Input Score
LightGBM supports continued training with an initial score. It uses an additional file to store these initial scores, like the following:
```
......@@ -334,7 +341,8 @@ LightGBM support continued train with initial score. It uses an additional file
It means the initial score of the first data row is `0.5`, the second is `-0.1`, and so on. The initial score file corresponds with the data file line by line, with one score per line. If the name of the data file is `train.txt`, the initial score file should be named `train.txt.init` and placed in the same folder as the data file. LightGBM will automatically load the initial score file if it exists.
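The naming convention above can be sketched in a few lines of Python (the score values here are invented for the example):

```python
# Hypothetical sketch: write "train.txt.init" next to "train.txt",
# one initial score per line, aligned row-by-row with the data file.
init_scores = [0.5, -0.1, 0.9]
with open("train.txt.init", "w") as f:
    for score in init_scores:
        f.write(f"{score}\n")
```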
### Weight data
### Weight Data
LightGBM supports weighted training. It uses an additional file to store weight data, like the following:
```
......@@ -349,7 +357,8 @@ It means the weight of first data is `1.0`, second is `0.5`, and so on. The weig
Update:
You can now specify a weight column in the data file. Please refer to the parameter `weight` above.
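The weight file follows the same side-by-side convention as the initial score file; a short sketch (weight values invented for the example):

```python
# Hypothetical sketch: write "train.txt.weight" next to "train.txt",
# one weight per line, aligned row-by-row with the data file.
weights = [1.0, 0.5, 0.8]
with open("train.txt.weight", "w") as f:
    for w in weights:
        f.write(f"{w}\n")
```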
### Query data
### Query Data
For LambdaRank learning, query information is needed for training data. LightGBM uses an additional file to store query data. The following is an example:
......
# Python API Reference
Move to [Read The Docs](http://lightgbm.readthedocs.io/en/latest/python/lightgbm.html#lightgbm-package).
\ No newline at end of file
lightgbm package
================
Python API
==========
Data Structure API
------------------
......@@ -64,4 +63,3 @@ Plotting
.. autofunction:: lightgbm.plot_tree
.. autofunction:: lightgbm.create_tree_digraph
Python Package Introduction
===========================
This document gives a basic walkthrough of LightGBM python package.
This document gives a basic walkthrough of LightGBM Python-package.
***List of other Helpful Links***
* [Python Examples](https://github.com/Microsoft/LightGBM/tree/master/examples/python-guide)
* [Python API Reference](./Python-API.md)
* [Python API](./Python-API.rst)
* [Parameters Tuning](./Parameters-tuning.md)
Install
-------
* Install the library first, follow the wiki [here](./Installation-Guide.md).
* Install python-package dependencies, `setuptools`, `numpy` and `scipy` is required, `scikit-learn` is required for sklearn interface and recommended. Run:
```
pip install setuptools numpy scipy scikit-learn -U
```
* In the `python-package` directory, run
Install Python-package dependencies, `setuptools`, `wheel`, `numpy` and `scipy` are required, `scikit-learn` is required for sklearn interface and recommended:
```
python setup.py install
pip install setuptools wheel numpy scipy scikit-learn -U
```
* To verify your installation, try to `import lightgbm` in Python.
Refer to [Python-package](https://github.com/Microsoft/LightGBM/tree/master/python-package) folder for the installation guide.
To verify your installation, try to `import lightgbm` in Python:
```
import lightgbm as lgb
```
Data Interface
--------------
The LightGBM python module is able to load data from:
The LightGBM Python module is able to load data from:
- libsvm/tsv/csv txt format file
- Numpy 2D array, pandas object
- LightGBM binary file
......@@ -36,27 +37,35 @@ The LightGBM python module is able to load data from:
The data is stored in a ```Dataset``` object.
#### To load a libsvm text file or a LightGBM binary file into ```Dataset```:
```python
train_data = lgb.Dataset('train.svm.bin')
```
#### To load a numpy array into ```Dataset```:
```python
data = np.random.rand(500, 10) # 500 entities, each contains 10 features
label = np.random.randint(2, size=500) # binary target
train_data = lgb.Dataset(data, label=label)
```
#### To load a scipy.sparse.csr_matrix array into ```Dataset```:
```python
csr = scipy.sparse.csr_matrix((dat, (row, col)))
train_data = lgb.Dataset(csr)
```
#### Saving ```Dataset``` into a LightGBM binary file will make loading faster:
```python
train_data = lgb.Dataset('train.svm.txt')
train_data.save_binary('train.bin')
```
#### Create validation data
#### Create validation data:
```python
test_data = train_data.create_valid('test.svm')
```
......@@ -69,20 +78,25 @@ test_data = lgb.Dataset('test.svm', reference=train_data)
In LightGBM, the validation data should be aligned with training data.
#### Specific feature names and categorical features
#### Specific feature names and categorical features:
```python
train_data = lgb.Dataset(data, label=label, feature_name=['c1', 'c2', 'c3'], categorical_feature=['c3'])
```
LightGBM can use categorical features as input directly. It doesn't need to convert them to one-hot encoding, and is much faster than one-hot encoding (about 8x speed-up).
**Note: You should convert your categorical features to int type before you construct `Dataset`.**
LightGBM can use categorical features as input directly. It doesn't need to convert them to one-hot encoding, and is much faster than one-hot encoding (about 8x speed-up).
**Note**: You should convert your categorical features to int type before you construct `Dataset`.
#### Weights can be set when needed:
```python
w = np.random.rand(500, )
train_data = lgb.Dataset(data, label=label, weight=w)
```
or
```python
train_data = lgb.Dataset(data, label=label)
w = np.random.rand(500, )
......@@ -102,43 +116,56 @@ However, Numpy/Array/Pandas object is memory cost. If you concern about your mem
Setting Parameters
------------------
LightGBM can use either a list of pairs or a dictionary to set [parameters](./Parameters.md). For instance:
* Booster parameters
LightGBM can use either a list of pairs or a dictionary to set [Parameters](./Parameters.md). For instance:
* Booster parameters:
```python
param = {'num_leaves':31, 'num_trees':100, 'objective':'binary'}
param['metric'] = 'auc'
```
* You can also specify multiple eval metrics:
```python
param['metric'] = ['auc', 'binary_logloss']
```
Training
--------
Training a model requires a parameter list and data set.
```python
num_round = 10
bst = lgb.train(param, train_data, num_round, valid_sets=[test_data])
```
After training, the model can be saved.
```python
bst.save_model('model.txt')
```
The trained model can also be dumped to JSON format
The trained model can also be dumped to JSON format.
```python
# dump model
json_model = bst.dump_model()
```
A saved model can be loaded as follows:
A saved model can be loaded.
```python
bst = lgb.Booster(model_file='model.txt') #init model
```
CV
--
Training with 5-fold CV:
```python
num_round = 10
lgb.cv(param, train_data, num_round, nfold=5)
......@@ -146,6 +173,7 @@ lgb.cv(param, train_data, num_round, nfold=5)
Early Stopping
--------------
If you have a validation set, you can use early stopping to find the optimal number of boosting rounds.
Early stopping requires at least one set in `valid_sets`. If there's more than one, it will use all of them.
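The stopping rule itself can be sketched in plain Python. This is a hypothetical illustration of the logic, not LightGBM's internal code, and it assumes a metric to minimize: training stops once the validation metric has failed to improve for `patience` consecutive rounds.

```python
# Sketch of the early-stopping rule: return the round with the best
# validation score, stopping `patience` rounds after improvement ceases.
def best_num_rounds(val_scores, patience):
    best, best_round, since_improved = float("inf"), 0, 0
    for i, score in enumerate(val_scores, start=1):
        if score < best:
            best, best_round, since_improved = score, i, 0
        else:
            since_improved += 1
            if since_improved >= patience:
                break  # no improvement for `patience` rounds: stop
    return best_round

# the loss stops improving after round 3
rounds = best_num_rounds([0.9, 0.7, 0.6, 0.61, 0.62, 0.63], patience=2)
```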
......@@ -162,7 +190,9 @@ This works with both metrics to minimize (L2, log loss, etc.) and to maximize (N
Prediction
----------
A model that has been trained or loaded can perform predictions on data sets.
```python
# 7 entities, each contains 10 features
data = np.random.rand(7, 10)
......@@ -170,6 +200,7 @@ ypred = bst.predict(data)
```
If early stopping is enabled during training, you can get predictions from the best iteration with `bst.best_iteration`:
```python
ypred = bst.predict(data, num_iteration=bst.best_iteration)
```
......@@ -2,21 +2,21 @@
This is a quick start guide for the CLI version of LightGBM.
Follow the [Installation Guide](./Installation-Guide.md) to install LightGBM first.
Follow the [Installation Guide](./Installation-Guide.rst) to install LightGBM first.
***List of other Helpful Links***
* [Parameters](./Parameters.md)
* [Parameters Tuning](./Parameters-tuning.md)
* [Python Package quick start guide](./Python-intro.md)
* [Python API Reference](./Python-API.md)
* [Python-package Quick Start](./Python-intro.md)
* [Python API](./Python-API.rst)
## Training data format
## Training Data Format
LightGBM supports input data files in [CSV](https://en.wikipedia.org/wiki/Comma-separated_values), [TSV](https://en.wikipedia.org/wiki/Tab-separated_values) and [LibSVM](https://www.csie.ntu.edu.tw/~cjlin/libsvm/) formats.
The label is the first column of the data, and there is no header in the file.
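As a sketch of this layout (the file name, labels, and feature values here are invented for the example):

```python
# Hypothetical sketch: a minimal CSV training file -- label in the
# first column, no header row.
rows = [(1, 0.5, 1.2), (0, 0.3, 0.7)]
with open("train.csv", "w") as f:
    for label, feat1, feat2 in rows:
        f.write(f"{label},{feat1},{feat2}\n")
```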
### Categorical feature support
### Categorical Feature Support
update 12/5/2016:
......@@ -24,7 +24,8 @@ LightGBM can use categorical feature directly (without one-hot coding). The expe
For the setting details, please refer to [Parameters](./Parameters.md).
### Weight and query/group data
### Weight and Query/Group Data
LightGBM also supports weighted training; it needs an additional [weight data](./Parameters.md) file. And it needs an additional [query data](./Parameters.md) file for ranking tasks.
update 11/3/2016:
......@@ -33,9 +34,7 @@ update 11/3/2016:
2. can specify label column, weight column and query/group id column. Both index and column name are supported
3. can specify a list of ignored columns
For the detailed usage, please refer to [Configuration](./Parameters.md).
## Parameter quick look
## Parameter Quick Look
The parameter format is ```key1=value1 key2=value2 ...```. Parameters can be set in both the config file and the command line.
......@@ -78,7 +77,7 @@ Some important parameters:
* ```serial```, single machine tree learner
* ```feature```, feature parallel tree learner
* ```data```, data parallel tree learner
* Refer to [Parallel Learning Guide](./Parallel-Learning-Guide.md) to get more details.
* Refer to [Parallel Learning Guide](./Parallel-Learning-Guide.rst) to get more details.
* ```num_threads```, default=OpenMP_default, type=int, alias=```num_thread```,```nthread```
* Number of threads for LightGBM.
* For the best speed, set this to the number of **real CPU cores**, not the number of threads (most CPUs use [hyper-threading](https://en.wikipedia.org/wiki/Hyper-threading) to generate 2 threads per CPU core).
......@@ -93,7 +92,6 @@ Some important parameters:
For all parameters, please refer to [Parameters](./Parameters.md).
## Run LightGBM
For Windows:
......@@ -101,7 +99,7 @@ For Windows:
lightgbm.exe config=your_config_file other_args ...
```
For unix:
For Unix:
```
./lightgbm config=your_config_file other_args ...
```
......
......@@ -2,17 +2,13 @@
Documentation for LightGBM is generated using [Sphinx](http://www.sphinx-doc.org/) and [recommonmark](https://recommonmark.readthedocs.io/).
After each commit on `master`, documentation is updated and published to https://lightgbm.readthedocs.io/
After each commit on `master`, documentation is updated and published to [https://lightgbm.readthedocs.io/](https://lightgbm.readthedocs.io/).
## Build
You can build the documentation locally. Just run:
You can build the documentation locally. Just run the following in the `docs` folder:
```sh
pip install -r requirements.txt
make html
```
## Wiki
In addition to our documentation hosted on Read the Docs, some additional topics are explained at https://github.com/Microsoft/LightGBM/wiki.