Unverified Commit 71715584 authored by James Lamb, committed by GitHub

[doc] Reorganize documentation on distributed learning (fixes #3596) (#3951)



* rework distributed learning page

* more references

* more changes

* more changes

* add anchors for olds links

* revert changes from #4000

* fix links

* more links

* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update docs/Parallel-Learning-Guide.rst
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
parent 605c97b5
@@ -59,8 +59,10 @@ Parameters Tuning

 - Refer to `Parameters Tuning <./Parameters-Tuning.rst>`__.

-Parallel Learning
------------------
+.. _Parallel Learning:
+
+Distributed Learning
+--------------------

 - Refer to `Distributed Learning Guide <./Parallel-Learning-Guide.rst>`__.
......
@@ -72,8 +72,10 @@ It only needs to use some collective communication algorithms, like "All reduce"

 LightGBM implements state-of-art algorithms\ `[9] <#references>`__.
 These collective communication algorithms can provide much better performance than point-to-point communication.

-Optimization in Parallel Learning
---------------------------------
+.. _Optimization in Parallel Learning:
+
+Optimization in Distributed Learning
+------------------------------------

 LightGBM provides the following distributed learning algorithms.
......
-Parallel Learning Guide
-=======================
+Distributed Learning Guide
+==========================
+
+.. _Parallel Learning Guide:

 This guide describes distributed learning in LightGBM. Distributed learning allows the use of multiple machines to produce a single model.

 Follow the `Quick Start <./Quick-Start.rst>`__ to know how to use LightGBM first.

-**List of external libraries in which LightGBM can be used in a distributed fashion**
-
-- `Dask API of LightGBM <./Python-API.rst#dask-api>`__ (formerly it was a separate package) allows to create ML workflow on Dask distributed data structures.
-- `MMLSpark`_ integrates LightGBM into Apache Spark ecosystem.
-  `The following example`_ demonstrates how easy it's possible to utilize the great power of Spark.
-- `Kubeflow Fairing`_ suggests using LightGBM in a Kubernetes cluster.
-  `These examples`_ help to get started with LightGBM in a hybrid cloud environment.
-  Also you can use `Kubeflow XGBoost Operator`_ to train LightGBM model.
-  Please check `this example`_ for how to do this.
+How Distributed LightGBM Works
+------------------------------
+
+This section describes how distributed learning in LightGBM works. To learn how to do this in various programming languages and frameworks, please see `Integrations <#integrations>`__.

 Choose Appropriate Parallel Algorithm
--------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 LightGBM provides 3 distributed learning algorithms now.
@@ -42,20 +37,58 @@ These algorithms are suited for different scenarios, which is listed in the following table:

 | **#feature is large**   | Feature Parallel  | Voting Parallel |
 +-------------------------+-------------------+-----------------+

-More details about these parallel algorithms can be found in `optimization in parallel learning <./Features.rst#optimization-in-parallel-learning>`__.
+More details about these parallel algorithms can be found in `optimization in distributed learning <./Features.rst#optimization-in-distributed-learning>`__.
+Integrations
+------------
+
+This section describes how to run distributed LightGBM training in various programming languages and frameworks. To learn how distributed learning in LightGBM works generally, please see `How Distributed LightGBM Works <#how-distributed-lightgbm-works>`__.
+
+Apache Spark
+^^^^^^^^^^^^
+
+Apache Spark users can use `MMLSpark`_ for machine learning workflows with LightGBM.
+
+See `this MMLSpark example`_ and `the MMLSpark documentation`_ for additional information on using LightGBM on Spark.
+
+.. note::
+
+  ``MMLSpark`` is not maintained by LightGBM's maintainers. Bug reports or feature requests should be directed to https://github.com/Azure/mmlspark/issues.
+
+Dask
+^^^^
+
+.. versionadded:: 3.2.0
+
+LightGBM's Python package supports distributed learning via `Dask`_. This integration is maintained by LightGBM's maintainers.
+Kubeflow
+^^^^^^^^
+
+`Kubeflow Fairing`_ supports LightGBM distributed training. `These examples`_ show how to get started with LightGBM and Kubeflow Fairing in a hybrid cloud environment.
+
+Kubeflow users can also use the `Kubeflow XGBoost Operator`_ for machine learning workflows with LightGBM. You can see `this example`_ for more details.
+
+.. note::
+
+  The Kubeflow integrations for LightGBM are not maintained by LightGBM's maintainers. Bug reports or feature requests should be directed to https://github.com/kubeflow/fairing/issues or https://github.com/kubeflow/xgboost-operator/issues.

-Build Parallel Version
-----------------------
+LightGBM CLI
+^^^^^^^^^^^^
+
+.. _Build Parallel Version:

-Default build version support parallel learning based on the socket.
-
-If you need to build parallel version with MPI support, please refer to `Installation Guide <./Installation-Guide.rst#build-mpi-version>`__.

 Preparation
------------
+'''''''''''
+
+By default, distributed learning with LightGBM uses socket-based communication.
+
+If you need to build distributed version with MPI support, please refer to `Installation Guide <./Installation-Guide.rst#build-mpi-version>`__.

 Socket Version
-^^^^^^^^^^^^^^
+**************
 It needs to collect IP of all machines that want to run distributed learning in and allocate one TCP port (assume 12345 here) for all machines,
 and change firewall rules to allow income of this port (12345). Then write these IP and ports in one file (assume ``mlist.txt``), like following:

@@ -66,7 +99,7 @@ and change firewall rules to allow income of this port (12345). Then write these

   machine2_ip 12345
 MPI Version
-^^^^^^^^^^^
+***********

 It needs to collect IP (or hostname) of all machines that want to run distributed learning in.
 Then write these IP in one file (assume ``mlist.txt``) like following:
@@ -78,11 +111,13 @@ Then write these IP in one file (assume ``mlist.txt``) like following:

 **Note**: For Windows users, need to start "smpd" to start MPI service. More details can be found `here`_.
-Run Parallel Learning
----------------------
+Run Distributed Learning
+''''''''''''''''''''''''
+
+.. _Run Parallel Learning:

 Socket Version
-^^^^^^^^^^^^^^
+**************
 1. Edit following parameters in config file:

@@ -103,7 +138,7 @@ Socket Version

    For Linux: ``./lightgbm config=your_config_file``
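The parameters referenced in step 1 are collapsed out of this hunk. As a sketch based on LightGBM's documented parameter names (the data path and all values here are illustrative assumptions, not the repository's actual example config), a socket-version config file could look like:

```ini
# hypothetical train.conf for the socket version
# data path and values are example assumptions
task = train
data = binary.train

# distributed-learning settings
tree_learner = data            # or feature / voting, per the algorithm table
num_machines = 2
machine_list_filename = mlist.txt
local_listen_port = 12345
```

Every machine runs the same command with the same config; training starts once all ``num_machines`` machines have connected on the listed ports.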
 MPI Version
-^^^^^^^^^^^
+***********

 1. Edit following parameters in config file:

@@ -130,13 +165,17 @@ MPI Version

    mpiexec --machinefile mlist.txt ./lightgbm config=your_config_file
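For the MPI version, ``mpiexec`` itself distributes the processes from the machinefile, so the config file needs fewer settings than the socket version. A hedged sketch (data path and values are assumptions):

```ini
# hypothetical train.conf for the MPI version
# machines come from the mpiexec machinefile, so no
# machine list or listen port is set here
task = train
data = binary.train
tree_learner = data
num_machines = 2
```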
 Example
-^^^^^^^
+'''''''

 - `A simple distributed learning example`_
+.. _Dask: https://docs.dask.org/en/latest/
+
 .. _MMLSpark: https://aka.ms/spark

-.. _The following example: https://github.com/Azure/mmlspark/blob/master/notebooks/samples/LightGBM%20-%20Quantile%20Regression%20for%20Drug%20Discovery.ipynb
+.. _this MMLSpark example: https://github.com/Azure/mmlspark/blob/master/notebooks/samples/LightGBM%20-%20Quantile%20Regression%20for%20Drug%20Discovery.ipynb
+
+.. _the MMLSpark documentation: https://github.com/Azure/mmlspark/blob/master/docs/lightgbm.md

 .. _Kubeflow Fairing: https://www.kubeflow.org/docs/components/fairing/fairing-overview
......
-Parallel Learning Example
-=========================
+Distributed Learning Example
+============================
+
+<a name="parallel-learning-example"></a>

 Here is an example for LightGBM to perform distributed learning for 2 machines.
......