Commit d292512e authored by Nikita Titov's avatar Nikita Titov Committed by Guolin Ke

[docs] updates and improvements for documentation (#940)

* added python version badge

* fixed typos

* fixed links

* readthedocs doesn't support links with anchor out of box

* fixed table rendering at ReadTheDocs (#776, issuecomment-319851551)

* fixed table rendering at ReadTheDocs

* added link to Key-Events page

* fixed links

* hotfix

* fixed markdown
LightGBM, Light Gradient Boosting Machine
[![Documentation Status](https://readthedocs.org/projects/lightgbm/badge/?version=latest)](https://lightgbm.readthedocs.io/)
[![GitHub Issues](https://img.shields.io/github/issues/Microsoft/LightGBM.svg)](https://github.com/Microsoft/LightGBM/issues)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/Microsoft/LightGBM/blob/master/LICENSE)
[![Python Versions](https://img.shields.io/pypi/pyversions/lightgbm.svg)](https://pypi.python.org/pypi/lightgbm)
[![PyPI Version](https://badge.fury.io/py/lightgbm.svg)](https://badge.fury.io/py/lightgbm)

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages:
News
----

05/03/2017 : LightGBM v2 stable release.

04/10/2017 : LightGBM supports GPU-accelerated tree learning now. Please read our [GPU Tutorial](./docs/GPU-Tutorial.md) and [Performance Comparison](./docs/GPU-Performance.rst).

02/20/2017 : Update to LightGBM v2.

12/02/2016 : Release [**python-package**](https://github.com/Microsoft/LightGBM/tree/master/python-package) beta version, welcome to have a try and provide feedback.
More detailed update logs : [Key Events](https://github.com/Microsoft/LightGBM/blob/master/docs/Key-Events.md).
External (unofficial) Repositories
----------------------------------
# Advanced Topics

## Missing Value Handle

* LightGBM enables the missing value handle by default; you can disable it by setting ```use_missing=false```.
* LightGBM uses NA (NaN) to represent the missing value by default; you can change it to use zero by setting ```zero_as_missing=true```.
* When ```zero_as_missing=false``` (default), unshown values in sparse matrices (and LibSVM files) are treated as zeros.
* When ```zero_as_missing=true```, NA and zeros (including unshown values in sparse matrices (and LibSVM files)) are treated as missing.
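The settings above can be combined in an ordinary parameter dict; a minimal sketch (plain Python, using only the parameter names documented above):

```python
# Sketch of missing-value handling parameters (names from the docs above).
params = {
    "objective": "binary",
    "use_missing": True,        # default: NA/NaN is treated as missing
    "zero_as_missing": False,   # default: zeros in sparse input are real zeros
}

# To treat zeros (including unshown sparse entries) as missing instead:
params_zero_missing = dict(params, zero_as_missing=True)

# To disable missing-value handling entirely:
params_no_missing = dict(params, use_missing=False)
```

Such a dict would then be passed as the `params` argument when training.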
## Categorical Feature Support

* LightGBM offers good accuracy when using native categorical features. Unlike simple one-hot encoding, LightGBM can find the optimal split of categorical features. Such an optimal split can provide much better accuracy than a one-hot encoding solution.
* Use `categorical_feature` to specify the categorical features. Refer to the parameter `categorical_feature` in [Parameters](./Parameters.md).
* The features need to be converted to `int` type first, and only non-negative numbers are supported. It is better to convert them into continuous ranges.
* Use `max_cat_group`, `cat_smooth_ratio` to deal with over-fitting (when #data is small or #category is large).
* For categorical features with high cardinality (#category is large), it is better to convert them to numerical features.
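For illustration, converting string categories into the required non-negative `int` codes in a continuous range might look like this (plain Python on hypothetical data; the resulting column is then marked via `categorical_feature` as described above):

```python
# Map raw string categories to non-negative integer codes (hypothetical data).
raw_column = ["red", "green", "blue", "green", "red"]

# Build a stable code for each distinct category.
codes = {cat: i for i, cat in enumerate(sorted(set(raw_column)))}
encoded = [codes[cat] for cat in raw_column]  # -> [2, 1, 0, 1, 2]

# All codes are non-negative ints in a continuous range, as required.
assert all(isinstance(v, int) and v >= 0 for v in encoded)
```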
## LambdaRank

* The label should be `int` type, and larger numbers represent higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect).
* Use `label_gain` to set the gain (weight) of `int` labels.
* Use `max_position` to set the NDCG optimization position.
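As an illustration (an assumption, not prescribed by the docs above), a common choice for `label_gain` is the NDCG-style gain `2^label - 1`:

```python
# Gains for int labels 0..3 (0:bad, 1:fair, 2:good, 3:perfect),
# using the common NDCG-style gain 2^label - 1.
max_label = 3
label_gain = [2 ** i - 1 for i in range(max_label + 1)]
# label_gain == [0, 1, 3, 7], which could be passed as "label_gain=0,1,3,7"
```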
## Parameters Tuning

* Refer to [Parameters Tuning](./Parameters-tuning.md).

## GPU Support

* Refer to [GPU Tutorial](./GPU-Tutorial.md) and [GPU Targets](./GPU-Targets.rst).

## Parallel Learning

* Refer to [Parallel Learning Guide](https://github.com/Microsoft/LightGBM/wiki/Parallel-Learning-Guide).
LightGBM FAQ
============

### Catalog

- [Critical](#critical)
- [LightGBM](#lightgbm)
- [R-package](#r-package)
- [Python-package](#python-package)

---
You encountered a critical issue when using LightGBM (crash, prediction error, nonsense outputs...). Who should you contact?

If your issue is not critical, just post an issue in the [Microsoft/LightGBM repository](https://github.com/Microsoft/LightGBM/issues).

If it is a critical issue, first identify what error you have:
- **Question 1**: Where do I find more details about LightGBM parameters?

- **Solution 1**: Look at [Parameters](./Parameters.md) and the [Laurae++/Parameters](https://sites.google.com/view/lauraepp/parameters) website.
---

- **Question 3**: When running LightGBM on a large dataset, my computer runs out of RAM.

- **Solution 3**: Multiple solutions: set the `histogram_pool_size` parameter to the MB you want to use for LightGBM (histogram_pool_size + dataset size = approximately RAM used), lower `num_leaves` or lower `max_bin` (see [Microsoft/LightGBM#562](https://github.com/Microsoft/LightGBM/issues/562)).
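The three RAM-reducing options from Solution 3 can be sketched as a parameter dict (the values here are illustrative assumptions, not recommendations):

```python
# Illustrative settings to cap LightGBM memory use (see Solution 3 above).
low_ram_params = {
    "histogram_pool_size": 1024,  # MB reserved for the histogram pool
    "num_leaves": 63,             # lower than a large value such as 255
    "max_bin": 63,                # fewer bins -> smaller histograms
}

# Per the FAQ: approximate RAM used = histogram_pool_size + dataset size.
dataset_size_mb = 2048  # hypothetical 2 GB dataset
approx_ram_mb = low_ram_params["histogram_pool_size"] + dataset_size_mb
```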
---

- **Question 5**: When using LightGBM GPU, I cannot reproduce results over several runs.

- **Solution 5**: It is a normal issue; there is nothing we/you can do about it. You may try to use `gpu_use_dp = true` for reproducibility (see [Microsoft/LightGBM#560](https://github.com/Microsoft/LightGBM/pull/560#issuecomment-304561654)). You may also use the CPU version.
---

- **Question 2**: I see error messages like

```
Cannot get/set label/weight/init_score/group/num_data/num_feature before construct dataset
```
but I already construct dataset by some code like
```
train = lightgbm.Dataset(X_train, y_train)
```
or error messages like
```
Cannot set predictor/reference/categorical feature after freed raw data, set free_raw_data=False when construct Dataset to avoid this.
```
- **Solution 2**: Because LightGBM constructs bin mappers to build trees, and the train and valid Datasets within one Booster share the same bin mappers, categorical features, feature names, etc., the Dataset objects are constructed when a Booster is constructed. If you set `free_raw_data=True` (default), the raw data (stored in Python data structures) will be freed. So, if you want to:
GPU Tuning Guide and Performance Comparison
===========================================
How It Works
------------
In LightGBM, the main computation cost during training is building the feature histograms. We use an efficient algorithm on GPU to accelerate this process.
The implementation is highly modular, and works for all learning tasks (classification, ranking, regression, etc). GPU acceleration also works in distributed learning settings.
GPU algorithm implementation is based on OpenCL and can work with a wide range of GPUs.
Supported Hardware
------------------
We target AMD Graphics Core Next (GCN) architecture and NVIDIA Maxwell and Pascal architectures.
Most AMD GPUs released after 2012 and NVIDIA GPUs released after 2014 should be supported. We have tested the GPU implementation on the following GPUs:
- AMD RX 480 with AMDGPU-pro driver 16.60 on Ubuntu 16.10
- AMD R9 280X (aka Radeon HD 7970) with fglrx driver 15.302.2301 on
Ubuntu 16.10
- NVIDIA GTX 1080 with driver 375.39 and CUDA 8.0 on Ubuntu 16.10
- NVIDIA Titan X (Pascal) with driver 367.48 and CUDA 8.0 on Ubuntu
16.04
- NVIDIA Tesla M40 with driver 375.39 and CUDA 7.5 on Ubuntu 16.04
Using the following hardware is discouraged:
- NVIDIA Kepler (K80, K40, K20, most GeForce GTX 700 series GPUs) or earlier NVIDIA GPUs. They don't support hardware atomic operations in local memory space and thus histogram construction will be slow.
- AMD VLIW4-based GPUs, including Radeon HD 6xxx series and earlier GPUs. These GPUs have been discontinued for years and are rarely seen nowadays.
How to Achieve Good Speedup on GPU
----------------------------------
#. You want to run a few datasets that we have verified with good speedup (including Higgs, epsilon, Bosch, etc.) to ensure your setup is correct.
   If you have multiple GPUs, make sure to set ``gpu_platform_id`` and ``gpu_device_id`` to use the desired GPU.
   Also make sure your system is idle (especially when using a shared computer) to get accurate performance measurements.
#. GPU works best on large-scale and dense datasets. If the dataset is too small, computing it on GPU is inefficient as the data transfer overhead can be significant.
   For datasets with a mixture of sparse and dense features, you can control the ``sparse_threshold`` parameter to make sure there are enough dense features to process on the GPU.
   If you have categorical features, use the ``categorical_column`` option and input them into LightGBM directly; do not convert them into one-hot variables.
   Make sure to check the run log and look at the reported number of sparse and dense features.
#. To get good speedup with GPU, it is suggested to use a smaller number of bins.
   Setting ``max_bin=63`` is recommended, as it usually does not noticeably affect training accuracy on large datasets, but GPU training can be significantly faster than using the default bin size of 255.
   For some datasets, even 15 bins are enough (``max_bin=15``); using 15 bins will maximize GPU performance. Make sure to check the run log and verify that the desired number of bins is used.
#. Try to use single precision training (``gpu_use_dp=false``) when possible, because most GPUs (especially NVIDIA consumer GPUs) have poor double-precision performance.
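Putting the four tips together, a GPU parameter set might look like the following sketch (a plain Python dict; the values follow the recommendations above, not a definitive configuration):

```python
# Sketch of a GPU training configuration following the tips above.
gpu_params = {
    "device": "gpu",
    "gpu_platform_id": 0,      # tip 1: select the desired OpenCL platform
    "gpu_device_id": 0,        # tip 1: select the desired GPU on that platform
    "max_bin": 63,             # tip 3: fewer bins for better GPU speedup
    "sparse_threshold": 1.0,   # tip 2: control the dense/sparse feature split
    "gpu_use_dp": False,       # tip 4: single precision where possible
}
```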
Performance Comparison
----------------------
We evaluate the training performance of GPU acceleration on the following datasets:
+-----------+----------------+----------+------------+-----------+------------+
| Data | Task | Link | #Examples | #Features | Comments |
+===========+================+==========+============+===========+============+
| Higgs | Binary | `link1`_ | 10,500,000 | 28 | use last |
| | classification | | | | 500,000 |
| | | | | | samples |
| | | | | | as test |
| | | | | | set |
+-----------+----------------+----------+------------+-----------+------------+
| Epsilon | Binary | `link2`_ | 400,000 | 2,000 | use the |
| | classification | | | | provided |
| | | | | | test set |
+-----------+----------------+----------+------------+-----------+------------+
| Bosch | Binary | `link3`_ | 1,000,000 | 968 | use the |
| | classification | | | | provided |
| | | | | | test set |
+-----------+----------------+----------+------------+-----------+------------+
| Yahoo LTR | Learning to | `link4`_ | 473,134 | 700 | set1.train |
| | rank | | | | as train, |
| | | | | | set1.test |
| | | | | | as test |
+-----------+----------------+----------+------------+-----------+------------+
| MS LTR | Learning to | `link5`_ | 2,270,296 | 137 | {S1,S2,S3} |
| | rank | | | | as train |
| | | | | | set, {S5} |
| | | | | | as test |
| | | | | | set |
+-----------+----------------+----------+------------+-----------+------------+
| Expo | Binary | `link6`_ | 11,000,000 | 700 | use last |
| | classification | | | | 1,000,000 |
| | (Categorical) | | | | as test |
| | | | | | set |
+-----------+----------------+----------+------------+-----------+------------+
We used the following hardware to evaluate the performance of LightGBM GPU training.
Our CPU reference is **a high-end dual socket Haswell-EP Xeon server with 28 cores**;
GPUs include a budget GPU (RX 480) and a mainstream (GTX 1080) GPU installed on the same server.
It is worth mentioning that **the GPUs used are not the best GPUs in the market**;
if you are using a better GPU (like AMD RX 580, NVIDIA GTX 1080 Ti, Titan X Pascal, Titan Xp, Tesla P100, etc), you are likely to get a better speedup.
+--------------------------------+----------------+------------------+---------------+
| Hardware | Peak FLOPS | Peak Memory BW | Cost (MSRP) |
+================================+================+==================+===============+
| AMD Radeon RX 480 | 5,161 GFLOPS | 256 GB/s | $199 |
+--------------------------------+----------------+------------------+---------------+
| NVIDIA GTX 1080 | 8,228 GFLOPS | 320 GB/s | $499 |
+--------------------------------+----------------+------------------+---------------+
| 2x Xeon E5-2683v3 (28 cores) | 1,792 GFLOPS | 133 GB/s | $3,692 |
+--------------------------------+----------------+------------------+---------------+
During benchmarking on CPU we used only 28 physical cores of the CPU, and did not use hyper-threading cores,
because we found that using too many threads actually makes performance worse.
The following shows the training configuration we used:
::
max_bin = 63
num_leaves = 255
num_iterations = 500
learning_rate = 0.1
tree_learner = serial
task = train
is_training_metric = false
min_data_in_leaf = 1
min_sum_hessian_in_leaf = 100
ndcg_eval_at = 1,3,5,10
sparse_threshold=1.0
device = gpu
gpu_platform_id = 0
gpu_device_id = 0
num_thread = 28
We use the configuration shown above, except for the Bosch dataset, where we use a smaller ``learning_rate=0.015`` and set ``min_sum_hessian_in_leaf=5``.
For all GPU training we set ``sparse_threshold=1``, and vary the max number of bins (255, 63 and 15).
The GPU implementation is from commit `0bb4a82`_ of LightGBM, when the GPU support was just merged in.
The following table lists the test-set accuracy that the CPU and GPU learners can achieve after 500 iterations.
The GPU learner with the same number of bins can achieve a similar level of accuracy as the CPU, despite using single-precision arithmetic.
For most datasets, using 63 bins is sufficient.
+---------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| | CPU 255 bins | CPU 63 bins | CPU 15 bins | GPU 255 bins | GPU 63 bins | GPU 15 bins |
+=====================+================+===============+===============+================+===============+===============+
| Higgs AUC | 0.845612 | 0.845239 | 0.841066 | 0.845612 | 0.845209 | 0.840748 |
+---------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| Epsilon AUC | 0.950243 | 0.949952 | 0.948365 | 0.950057 | 0.949876 | 0.948365 |
+---------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| Yahoo-LTR NDCG@1 | 0.730824 | 0.730165 | 0.729647 | 0.730936 | 0.732257 | 0.73114 |
+---------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| Yahoo-LTR NDCG@3 | 0.738687 | 0.737243 | 0.736445 | 0.73698 | 0.739474 | 0.735868 |
+---------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| Yahoo-LTR NDCG@5 | 0.756609 | 0.755729 | 0.754607 | 0.756206 | 0.757007 | 0.754203 |
+---------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| Yahoo-LTR NDCG@10 | 0.79655 | 0.795827 | 0.795273 | 0.795894 | 0.797302 | 0.795584 |
+---------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| Expo AUC | 0.776217 | 0.771566 | 0.743329 | 0.776285 | 0.77098 | 0.744078 |
+---------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| MS-LTR NDCG@1 | 0.521265 | 0.521392 | 0.518653 | 0.521789 | 0.522163 | 0.516388 |
+---------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| MS-LTR NDCG@3 | 0.503153 | 0.505753 | 0.501697 | 0.503886 | 0.504089 | 0.501691 |
+---------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| MS-LTR NDCG@5 | 0.509236 | 0.510391 | 0.507193 | 0.509861 | 0.510095 | 0.50663 |
+---------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| MS-LTR NDCG@10 | 0.527835 | 0.527304 | 0.524603 | 0.528009 | 0.527059 | 0.524722 |
+---------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| Bosch AUC | 0.718115 | 0.721791 | 0.716677 | 0.717184 | 0.724761 | 0.717005 |
+---------------------+----------------+---------------+---------------+----------------+---------------+---------------+
We record the wall clock time after 500 iterations, as shown in the figure below:
|Performance Comparison|
When using a GPU, it is advisable to use a bin size of 63 rather than 255, because it can speed up training significantly without noticeably affecting accuracy.
On CPU, using a smaller bin size only marginally improves performance, and sometimes even slows down training,
as on Higgs (we can reproduce the same slowdown on two different machines with different GCC versions).
We found that the GPU can achieve impressive acceleration on large and dense datasets like Higgs and Epsilon.
Even on smaller and sparser datasets, a *budget* GPU can still compete with, and even outperform, a 28-core Haswell server.
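As a concrete illustration, the bin-size advice above is a one-line change in the training configuration. Below is a minimal sketch of a LightGBM CLI config file (the data file name is a placeholder; ``task``, ``objective``, ``data``, ``device`` and ``max_bin`` are standard LightGBM parameters):

.. code::

    task = train
    objective = binary
    data = binary.train
    device = gpu
    max_bin = 63

The same parameters can equally be passed on the command line as ``key=value`` pairs.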
Memory Usage
------------
The next table shows the GPU memory usage reported by ``nvidia-smi`` during training with 63 bins.
We can see that even the largest dataset uses only about 1 GB of GPU memory,
indicating that our GPU implementation can scale to huge datasets over 10x larger than Bosch or Epsilon.
Also, we can observe that larger datasets (which use more GPU memory, like Epsilon or Bosch) generally achieve a better speedup,
because the overhead of invoking GPU functions becomes significant when the dataset is small.
+-------------------------+---------+-----------+---------+----------+--------+-------------+
| Datasets | Higgs | Epsilon | Bosch | MS-LTR | Expo | Yahoo-LTR |
+=========================+=========+===========+=========+==========+========+=============+
| GPU Memory Usage (MB) | 611 | 901 | 1067 | 413 | 405 | 291 |
+-------------------------+---------+-----------+---------+----------+--------+-------------+
Further Reading
---------------
You can find more details about the GPU algorithm and benchmarks in the
following article:
Huan Zhang, Si Si and Cho-Jui Hsieh. `GPU Acceleration for Large-scale Tree Boosting`_. arXiv:1706.08359, 2017.
.. _link1: https://archive.ics.uci.edu/ml/datasets/HIGGS
.. _link2: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
.. _link3: https://www.kaggle.com/c/bosch-production-line-performance/data
.. _link4: https://webscope.sandbox.yahoo.com/catalog.php?datatype=c
.. _link5: http://research.microsoft.com/en-us/projects/mslr/
.. _link6: http://stat-computing.org/dataexpo/2009/
.. _0bb4a82: https://github.com/Microsoft/LightGBM/commit/0bb4a82
.. |Performance Comparison| image:: http://www.huan-zhang.com/images/upload/lightgbm-gpu/compare_0bb4a825.png
.. _GPU Acceleration for Large-scale Tree Boosting: https://arxiv.org/abs/1706.08359
GPU Targets Table
=================
When using OpenCL SDKs, targeting CPU and GPU at the same time is
sometimes possible. This is especially true for Intel OpenCL SDK and AMD
APP SDK.
You can find below a table of correspondence:
+---------------------------+-----------------+-----------------+-----------------+--------------+
| SDK | CPU Intel/AMD | GPU Intel | GPU AMD | GPU NVIDIA |
+===========================+=================+=================+=================+==============+
| `Intel SDK for OpenCL`_ | Supported | Supported \* | Supported | Untested |
+---------------------------+-----------------+-----------------+-----------------+--------------+
| `AMD APP SDK`_ | Supported | Untested \* | Supported | Untested |
+---------------------------+-----------------+-----------------+-----------------+--------------+
| `NVIDIA CUDA Toolkit`_ | Untested \*\* | Untested \*\* | Untested \*\* | Supported |
+---------------------------+-----------------+-----------------+-----------------+--------------+
Legend:
- \* Not usable directly.
- \*\* Reported as unsupported in public forums.
Support for AMD GPUs via the Intel SDK for OpenCL is not a typo, nor is
AMD APP SDK compatibility with CPUs.
--------------
Targeting Table
===============
We present the following scenarios:
- CPU, no GPU
- Single CPU and GPU (even with integrated graphics)
- Multiple CPU/GPU
We provide test R code below, but you can use the language of your
choice with the example of your choice:
.. code:: r
library(lightgbm)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
train$data[, 1] <- 1:6513
dtrain <- lgb.Dataset(train$data, label = train$label)
data(agaricus.test, package = "lightgbm")
test <- agaricus.test
dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label)
valids <- list(test = dtest)
params <- list(objective = "regression",
metric = "rmse",
device = "gpu",
gpu_platform_id = 0,
gpu_device_id = 0,
nthread = 1,
boost_from_average = FALSE,
num_tree_per_iteration = 10,
max_bin = 32)
model <- lgb.train(params,
dtrain,
2,
valids,
min_data = 1,
learning_rate = 1,
early_stopping_rounds = 10)
Using a bad ``gpu_device_id`` is not critical, as it will fall back to:
- ``gpu_device_id = 0`` if using ``gpu_platform_id = 0``
- ``gpu_device_id = 1`` if using ``gpu_platform_id = 1``
However, using a bad combination of ``gpu_platform_id`` and
``gpu_device_id`` will lead to a **crash** (you will lose your entire
session content). Beware of it.
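The fallback rules above can be summarized in a small sketch. This is purely illustrative Python encoding the documented behaviour, not LightGBM's actual device-selection code; ``resolve_opencl_target`` and its arguments are hypothetical names:

```python
# Illustrative model of the documented fallback behaviour only --
# NOT LightGBM's real device-selection logic.
def resolve_opencl_target(platform_id, device_id, platforms):
    """platforms maps an OpenCL platform id to its number of devices."""
    if platform_id not in platforms:
        # A bad platform id cannot be recovered from: LightGBM crashes.
        raise RuntimeError("invalid gpu_platform_id: crash, no fallback")
    if device_id >= platforms[platform_id]:
        # A bad device id is not critical: fall back to the default
        # device for that platform, as described above.
        return platform_id, 0 if platform_id == 0 else 1
    return platform_id, device_id

# One platform exposing two devices; gpu_device_id = 9999 falls back:
print(resolve_opencl_target(0, 9999, {0: 2}))
```

A bad device id degrades gracefully, while a bad platform id does not, which is why the platform id is the value to get right first.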
CPU only architectures
----------------------
When you have a single device (one CPU), OpenCL usage is
straightforward: ``gpu_platform_id = 0``, ``gpu_device_id = 0``
This will use the CPU with OpenCL, even though it says GPU.
Example:
.. code:: r
> params <- list(objective = "regression",
+ metric = "rmse",
+ device = "gpu",
+ gpu_platform_id = 0,
+ gpu_device_id = 0,
+ nthread = 1,
+ boost_from_average = FALSE,
+ num_tree_per_iteration = 10,
+ max_bin = 32)
> model <- lgb.train(params,
+ dtrain,
+ 2,
+ valids,
+ min_data = 1,
+ learning_rate = 1,
+ early_stopping_rounds = 10)
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 232
[LightGBM] [Info] Number of data: 6513, number of used features: 116
[LightGBM] [Info] Using requested OpenCL platform 0 device 1
[LightGBM] [Info] Using GPU Device: Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 16 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 40 dense feature groups (0.12 MB) transfered to GPU in 0.004540 secs. 76 sparse feature groups.
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=16 and max_depth=8
[1]: test's rmse:1.10643e-17
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=7 and max_depth=5
[2]: test's rmse:0
Single CPU and GPU (even with integrated graphics)
--------------------------------------------------
If you have an integrated graphics card (Intel HD Graphics) and a
dedicated graphics card (AMD, NVIDIA), the dedicated graphics card will
automatically override the integrated graphics card. The workaround is
to disable your dedicated graphics card so that your integrated
graphics card can be used.
When you have multiple devices (one CPU and one GPU), the order is
usually the following:
- GPU: ``gpu_platform_id = 0``, ``gpu_device_id = 0``; sometimes it is
  usable with ``gpu_platform_id = 1``, ``gpu_device_id = 1``, but at
  your own risk!
- CPU: ``gpu_platform_id = 0``, ``gpu_device_id = 1``
Example of GPU (``gpu_platform_id = 0``, ``gpu_device_id = 0``):
.. code:: r
> params <- list(objective = "regression",
+ metric = "rmse",
+ device = "gpu",
+ gpu_platform_id = 0,
+ gpu_device_id = 0,
+ nthread = 1,
+ boost_from_average = FALSE,
+ num_tree_per_iteration = 10,
+ max_bin = 32)
> model <- lgb.train(params,
+ dtrain,
+ 2,
+ valids,
+ min_data = 1,
+ learning_rate = 1,
+ early_stopping_rounds = 10)
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 232
[LightGBM] [Info] Number of data: 6513, number of used features: 116
[LightGBM] [Info] Using GPU Device: Oland, Vendor: Advanced Micro Devices, Inc.
[LightGBM] [Info] Compiling OpenCL Kernel with 16 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 40 dense feature groups (0.12 MB) transfered to GPU in 0.004211 secs. 76 sparse feature groups.
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=16 and max_depth=8
[1]: test's rmse:1.10643e-17
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=7 and max_depth=5
[2]: test's rmse:0
Example of CPU (``gpu_platform_id = 0``, ``gpu_device_id = 1``):
.. code:: r
> params <- list(objective = "regression",
+ metric = "rmse",
+ device = "gpu",
+ gpu_platform_id = 0,
+ gpu_device_id = 1,
+ nthread = 1,
+ boost_from_average = FALSE,
+ num_tree_per_iteration = 10,
+ max_bin = 32)
> model <- lgb.train(params,
+ dtrain,
+ 2,
+ valids,
+ min_data = 1,
+ learning_rate = 1,
+ early_stopping_rounds = 10)
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 232
[LightGBM] [Info] Number of data: 6513, number of used features: 116
[LightGBM] [Info] Using requested OpenCL platform 0 device 1
[LightGBM] [Info] Using GPU Device: Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz, Vendor: GenuineIntel
[LightGBM] [Info] Compiling OpenCL Kernel with 16 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 40 dense feature groups (0.12 MB) transfered to GPU in 0.004540 secs. 76 sparse feature groups.
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=16 and max_depth=8
[1]: test's rmse:1.10643e-17
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=7 and max_depth=5
[2]: test's rmse:0
When using a wrong ``gpu_device_id``, it will automatically fall back to
``gpu_device_id = 0``:
.. code:: r
> params <- list(objective = "regression",
+ metric = "rmse",
+ device = "gpu",
+ gpu_platform_id = 0,
+ gpu_device_id = 9999,
+ nthread = 1,
+ boost_from_average = FALSE,
+ num_tree_per_iteration = 10,
+ max_bin = 32)
> model <- lgb.train(params,
+ dtrain,
+ 2,
+ valids,
+ min_data = 1,
+ learning_rate = 1,
+ early_stopping_rounds = 10)
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 232
[LightGBM] [Info] Number of data: 6513, number of used features: 116
[LightGBM] [Info] Using GPU Device: Oland, Vendor: Advanced Micro Devices, Inc.
[LightGBM] [Info] Compiling OpenCL Kernel with 16 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 12
[LightGBM] [Info] 40 dense feature groups (0.12 MB) transfered to GPU in 0.004211 secs. 76 sparse feature groups.
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=16 and max_depth=8
[1]: test's rmse:1.10643e-17
[LightGBM] [Info] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Trained a tree with leaves=7 and max_depth=5
[2]: test's rmse:0
Never run under the following scenario, as it is known to crash: even
though it claims to be using the CPU, it is NOT actually using the CPU.
- One CPU and one GPU
- ``gpu_platform_id = 1``, ``gpu_device_id = 0``
.. code:: r
> params <- list(objective = "regression",
+ metric = "rmse",
+ device = "gpu",
+ gpu_platform_id = 1,
+ gpu_device_id = 0,
+ nthread = 1,
+ boost_from_average = FALSE,
+ num_tree_per_iteration = 10,
+ max_bin = 32)
> model <- lgb.train(params,
+ dtrain,
+ 2,
+ valids,
+ min_data = 1,
+ learning_rate = 1,
+ early_stopping_rounds = 10)
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 232
[LightGBM] [Info] Number of data: 6513, number of used features: 116
[LightGBM] [Info] Using requested OpenCL platform 1 device 0
[LightGBM] [Info] Using GPU Device: Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz, Vendor: Intel(R) Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 16 bins...
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::compute::opencl_error> >'
what(): Invalid Program
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
Multiple CPU and GPU
--------------------
If you have multiple devices (multiple CPUs and multiple GPUs), you will
have to test different ``gpu_device_id`` and ``gpu_platform_id`` values
to find the ones that suit the CPU/GPU you want to use. Keep in mind
that using the integrated graphics card is not directly possible without
disabling every dedicated graphics card.
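Because a bad id combination crashes the whole process, it is safest to probe candidate pairs from disposable child processes rather than from your interactive session. A hedged sketch follows; ``Rscript`` and the ``probe_gpu.R`` training script are assumptions (any small training command that reads the two ids would work):

```python
# Hypothetical sketch -- NOT part of LightGBM. Probes each
# (gpu_platform_id, gpu_device_id) pair in a throwaway child process,
# so a crash kills only the child and never your interactive session.
# "probe_gpu.R" is a placeholder for any small training script that
# uses the two ids passed on its command line.
import subprocess

def probe(platform_id, device_id):
    """Return True if a child training run with these ids succeeds."""
    try:
        result = subprocess.run(
            ["Rscript", "probe_gpu.R", str(platform_id), str(device_id)],
            capture_output=True,
        )
    except OSError:  # e.g. Rscript not installed
        return False
    return result.returncode == 0

for p in (0, 1):
    for d in (0, 1):
        status = "OK" if probe(p, d) else "failed"
        print(f"gpu_platform_id={p} gpu_device_id={d}: {status}")
```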
.. _Intel SDK for OpenCL: https://software.intel.com/en-us/articles/opencl-drivers
.. _AMD APP SDK: http://developer.amd.com/amd-accelerated-parallel-processing-app-sdk/
.. _NVIDIA CUDA Toolkit: https://developer.nvidia.com/cuda-downloads
Further Reading
---------------

[GPU Tuning Guide and Performance Comparison](./GPU-Performance.rst)

[GPU SDK Correspondence and Device Targeting Table](./GPU-Targets.rst)

[GPU Windows Tutorial](./GPU-Windows.md)
This guide is for the MinGW build.
For the MSVC (Visual Studio) build with GPU, please refer to the [Installation Guide](https://github.com/Microsoft/LightGBM/wiki/Installation-Guide#windows-2). (We recommend this build since it is much easier.)
# Install LightGBM GPU version in Windows (CLI / R / Python), using MinGW/gcc
Installing the appropriate OpenCL SDK requires you to download the correct vendor source SDK. You need to know which hardware you are going to run LightGBM on:
* For running on Intel, get [Intel SDK for OpenCL](https://software.intel.com/en-us/articles/opencl-drivers) (NOT RECOMMENDED)
* For running on AMD, get [AMD APP SDK](http://developer.amd.com/amd-accelerated-parallel-processing-app-sdk/)
* For running on NVIDIA, get [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)
Further reading and correspondence table (especially if you intend to use cross-platform devices, like Intel CPU with AMD APP SDK): [GPU SDK Correspondence and Device Targeting Table](./GPU-Targets.rst).
Warning: using Intel OpenCL is not recommended and may crash your machine due to being non-compliant with OpenCL standards. If your objective is to use LightGBM + OpenCL on CPU, please use AMD APP SDK instead (it can also run on Intel CPUs without any issues).
## MinGW correct compiler selection
If you are expecting to use LightGBM without R, you need to install MinGW. Installing MinGW is straightforward, download [this](http://iweb.dl.sourceforge.net/project/mingw-w64/Toolchains%20targetting%20Win32/Personal%20Builds/mingw-builds/installer/mingw-w64-install.exe).
Make sure you are using the x86_64 architecture, and do not modify anything else. You may choose a version other than the most recent one if you need a previous MinGW version.
To build the Boost libraries, you have two choices for command prompt:
* If you have only one single core, you can use the default

  ```
  b2 install --build_dir="C:\boost\boost-build" --prefix="C:\boost\boost-build" toolset=gcc --with=filesystem,system threading=multi --layout=system release
  ```

* If you want to do a multithreaded library building (faster), add `-j N` by replacing N by the number of cores/threads you have. For instance, for 2 cores, you would do

  ```
  b2 install --build_dir="C:\boost\boost-build" --prefix="C:\boost\boost-build" toolset=gcc --with=filesystem,system threading=multi --layout=system release -j 2
  ```
Ignore all the errors popping up, like Python, etc.; they do not matter for us.
## Git Installation
Installing Git for Windows is straightforward, use the following [link](https://git-for-windows.github.io/).
![git for Windows](https://cloud.githubusercontent.com/assets/9083669/24919716/e2612ea6-1ee4-11e7-9eca-d30997b911ff.png)
Congratulations on reaching this stage!
To learn how to target a correct CPU or GPU for training, please see: [GPU SDK Correspondence and Device Targeting Table](./GPU-Targets.rst).
---
Once you have installed LightGBM CLI, assuming your LightGBM is in `C:\github_repos\LightGBM`, open a command prompt and run the following:
```
gdb --args "../../lightgbm.exe" config=train.conf data=binary.train valid=binary.test objective=binary device=gpu
```
![Debug run](https://cloud.githubusercontent.com/assets/9083669/25041067/8fdbee66-210d-11e7-8adb-79b688c051d5.png)
#14 main (argc=6, argv=0x1f21e90) at C:\LightGBM\src\main.cpp:7
```
And open an issue in GitHub [here](https://github.com/Microsoft/LightGBM/issues) with that log.
Refer to [Installation Guide](https://github.com/Microsoft/LightGBM/wiki/Installation-Guide).

Refer to [Parallel Learning Guide](https://github.com/Microsoft/LightGBM/wiki/Parallel-Learning-Guide).
* `task`, default=`train`, type=enum, options=`train`,`prediction`,`convert_model`
  * `train` for training
  * `prediction` for prediction
  * `convert_model` for converting model file into if-else format, see more information in [Convert model parameters](#convert-model-parameters)
* `application`, default=`regression`, type=enum, options=`regression`,`regression_l1`,`huber`,`fair`,`poisson`,`binary`,`lambdarank`,`multiclass`, alias=`objective`,`app`
  * `regression`, regression application
  * `regression_l2`, L2 loss, alias=`mean_squared_error`,`mse`
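To give an intuition for the `convert_model` task mentioned above, the sketch below renders a toy decision tree as nested if-else statements. This is illustrative only: the real task emits code for trained LightGBM trees, while this toy works on a hypothetical dict-based tree:

```python
def tree_to_if_else(node, indent=0):
    """Render a toy decision tree as nested if-else statements.

    Illustrative only: mirrors the idea behind LightGBM's convert_model
    task on a hypothetical dict-based tree, not the real model format.
    """
    pad = "    " * indent
    if "leaf_value" in node:
        return f"{pad}return {node['leaf_value']};"
    lines = [f"{pad}if (feature[{node['split_feature']}] <= {node['threshold']}) {{"]
    lines.append(tree_to_if_else(node["left"], indent + 1))
    lines.append(f"{pad}}} else {{")
    lines.append(tree_to_if_else(node["right"], indent + 1))
    lines.append(pad + "}")
    return "\n".join(lines)

toy_tree = {
    "split_feature": 0, "threshold": 0.5,
    "left": {"leaf_value": 1.0},
    "right": {"leaf_value": -1.0},
}
print(tree_to_if_else(toy_tree))
```

The generated if-else text can be dropped into a scoring function that has no dependency on the training library.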
Python Package Introduction
===========================
This document gives a basic walkthrough of the LightGBM Python package.
***List of other Helpful Links***
* [Python Examples](https://github.com/Microsoft/LightGBM/tree/master/examples/python-guide)
* [Python API Reference](./Python-API.md)
* [Parameters Tuning](./Parameters-tuning.md)
update 12/5/2016:

LightGBM can use categorical features directly (without one-hot encoding). The experiment on [Expo data](http://stat-computing.org/dataexpo/2009/) shows about 8x speed-up compared with one-hot encoding.

For the setting details, please refer to [Parameters](./Parameters.md).
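LightGBM consumes a categorical feature as a single column of non-negative integer codes rather than many one-hot columns. A minimal sketch of that pre-processing step, assuming string category labels; the helper is hypothetical and not part of the LightGBM API:

```python
def encode_categories(values):
    """Map category labels to integer codes in first-seen order.

    LightGBM consumes categorical features as non-negative integers, so a
    single coded column replaces the many columns of a one-hot encoding.
    Hypothetical pre-processing helper, not LightGBM API.
    """
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)  # assign the next unused integer code
        encoded.append(codes[v])
    return encoded, codes

column = ["UA", "AA", "DL", "UA", "AA"]
encoded, mapping = encode_categories(column)
print(encoded)  # [0, 1, 2, 0, 1]
```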
### Weight and query/group data
LightGBM also supports weighted training; it needs additional [weight data](./Parameters.md). For ranking tasks, it also needs additional [query data](./Parameters.md).
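The query data for ranking is a list of group sizes: how many consecutive rows belong to each query. A sketch of deriving those sizes from a per-row query-id column, assuming rows of the same query are contiguous (as the ranking task requires); the helper itself is hypothetical:

```python
def query_sizes(query_ids):
    """Collapse a per-row query-id column into LightGBM-style group sizes.

    Assumes rows belonging to the same query are contiguous.
    E.g. ids [1, 1, 1, 2, 2] -> sizes [3, 2].
    """
    sizes = []
    previous = object()  # sentinel that never equals a real query id
    for qid in query_ids:
        if qid != previous:
            sizes.append(1)  # start a new group
            previous = qid
        else:
            sizes[-1] += 1   # extend the current group
    return sizes

print(query_sizes([1, 1, 1, 2, 2, 7]))  # [3, 2, 1]
```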
update 11/3/2016:

2. can specify label column, weight column and query/group id column. Both index and column name are supported
3. can specify a list of ignored columns

For the detailed usage, please refer to [Configuration](./Parameters.md).
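Since label, weight and query columns can be given either by index or by name, a small resolver illustrates the two forms. The `name:` prefix follows LightGBM's config syntax (e.g. `label=name:Label`); the helper itself is a hypothetical sketch:

```python
def resolve_column(spec, header):
    """Resolve a column given either a zero-based index or a 'name:' spec.

    Mirrors the idea that LightGBM accepts both forms (e.g. ``label=0`` or
    ``label=name:Label``); this resolver is hypothetical, not LightGBM API.
    """
    if isinstance(spec, str) and spec.startswith("name:"):
        return header.index(spec[len("name:"):])  # look up by column name
    return int(spec)                              # already an index

header = ["Label", "Weight", "Query", "f0", "f1"]
print(resolve_column("name:Weight", header))  # 1
print(resolve_column(2, header))              # 2
```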
## Parameter quick look
For example, the following command line will keep `num_trees=10` and ignore the same parameter in the config file.
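The precedence rule above can be sketched as a merge where command-line parameters override the config file. A hypothetical illustration, not LightGBM's actual config code:

```python
def effective_params(cmdline, config_file):
    """Merge parameters so that command-line values override the config file.

    Hypothetical sketch of the documented precedence: a parameter given on
    the command line (e.g. num_trees=10) wins over the same key in the
    config file.
    """
    merged = dict(config_file)  # start from the config file
    merged.update(cmdline)      # command-line values take precedence
    return merged

print(effective_params({"num_trees": "10"},
                       {"num_trees": "100", "learning_rate": "0.1"}))
# {'num_trees': '10', 'learning_rate': '0.1'}
```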
## Examples
* [Binary Classification](https://github.com/Microsoft/LightGBM/tree/master/examples/binary_classification)
* [Regression](https://github.com/Microsoft/LightGBM/tree/master/examples/regression)
* [Lambdarank](https://github.com/Microsoft/LightGBM/tree/master/examples/lambdarank)
* [Parallel Learning](https://github.com/Microsoft/LightGBM/tree/master/examples/parallel_learning)
Refer to the comments in ``c_api.h`` in the LightGBM repository.
High Level Language Package
---------------------------
See the implementations at `python-package <https://github.com/Microsoft/LightGBM/tree/master/python-package>`__ and `R-package <https://github.com/Microsoft/LightGBM/tree/master/R-package>`__.
Ask Questions
-------------
LightGBM Python Package
=======================

|License| |Python Versions| |PyPI Version|
Installation
------------
E501 can be ignored (line too long).
.. |Python Versions| image:: https://img.shields.io/pypi/pyversions/lightgbm.svg
   :target: https://pypi.python.org/pypi/lightgbm
.. |PyPI Version| image:: https://badge.fury.io/py/lightgbm.svg
   :target: https://badge.fury.io/py/lightgbm