.. _Parallel Learning Guide:

Distributed Learning Guide
==========================

This guide describes distributed learning in LightGBM. Distributed learning allows the use of multiple machines to produce a single model.

Follow the `Quick Start <./Quick-Start.rst>`__ to learn how to use LightGBM first.

How Distributed LightGBM Works
------------------------------

This section describes how distributed learning in LightGBM works. To learn how to do this in various programming languages and frameworks, please see `Integrations <#integrations>`__.

Choose Appropriate Parallel Algorithm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

LightGBM currently provides 3 distributed learning algorithms.

+--------------------+---------------------------+
| Parallel Algorithm | How to Use                |
+====================+===========================+
| Data parallel      | ``tree_learner=data``     |
+--------------------+---------------------------+
| Feature parallel   | ``tree_learner=feature``  |
+--------------------+---------------------------+
| Voting parallel    | ``tree_learner=voting``   |
+--------------------+---------------------------+

These algorithms are suited to different scenarios, as listed in the following table:

+-------------------------+-------------------+-----------------+
|                         | #data is small    | #data is large  |
+=========================+===================+=================+
| **#feature is small**   | Feature Parallel  | Data Parallel   |
+-------------------------+-------------------+-----------------+
| **#feature is large**   | Feature Parallel  | Voting Parallel |
+-------------------------+-------------------+-----------------+

More details about these parallel algorithms can be found in `optimization in distributed learning <./Features.rst#optimization-in-distributed-learning>`__.
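
For example, selecting an algorithm from the CLI is a one-line change in the config file. Here is a minimal sketch (``tree_learner`` and ``num_machines`` are the documented parameter names; the values shown are just examples):

.. code::

    # use the data-parallel algorithm across 2 machines
    tree_learner = data
    num_machines = 2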

Integrations
------------

This section describes how to run distributed LightGBM training in various programming languages and frameworks. To learn how distributed learning in LightGBM works generally, please see `How Distributed LightGBM Works <#how-distributed-lightgbm-works>`__.

Apache Spark
^^^^^^^^^^^^

Apache Spark users can use `MMLSpark`_ for machine learning workflows with LightGBM. This project is not maintained by LightGBM's maintainers.

See `this MMLSpark example`_ and `the MMLSpark documentation`_ for additional information on using LightGBM on Spark.

.. note::

  ``MMLSpark`` is not maintained by LightGBM's maintainers. Bug reports or feature requests should be directed to https://github.com/Azure/mmlspark/issues.

Dask
^^^^

.. versionadded:: 3.2.0

LightGBM's Python package supports distributed learning via `Dask`_. This integration is maintained by LightGBM's maintainers.
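
The sketch below shows one way to use the Dask interface (illustrative only: it assumes ``lightgbm>=3.2.0`` with ``dask`` and ``distributed`` installed, and uses a local cluster with random data purely for demonstration):

.. code:: python

    import dask.array as da
    from dask.distributed import Client, LocalCluster

    import lightgbm as lgb

    # start a two-worker cluster on this machine; in a real deployment you
    # would connect the Client to an existing distributed Dask cluster
    cluster = LocalCluster(n_workers=2)
    client = Client(cluster)

    # toy data, split into chunks that Dask spreads across the workers
    X = da.random.random((10000, 20), chunks=(1000, 20))
    y = da.random.random((10000,), chunks=(1000,))

    # DaskLGBMRegressor mirrors the scikit-learn estimator API, but training
    # is coordinated across the Dask workers that hold the data
    model = lgb.DaskLGBMRegressor(n_estimators=50)
    model.fit(X, y)

    # predictions come back as a Dask array
    preds = model.predict(X)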

Kubeflow
^^^^^^^^

`Kubeflow Fairing`_ supports LightGBM distributed training. `These examples`_ show how to get started with LightGBM and Kubeflow Fairing in a hybrid cloud environment.

Kubeflow users can also use the `Kubeflow XGBoost Operator`_ for machine learning workflows with LightGBM. You can see `this example`_ for more details.

Kubeflow integrations for LightGBM are not maintained by LightGBM's maintainers.

.. note::

  The Kubeflow integrations for LightGBM are not maintained by LightGBM's maintainers. Bug reports or feature requests should be directed to https://github.com/kubeflow/fairing/issues or https://github.com/kubeflow/xgboost-operator/issues.

LightGBM CLI
^^^^^^^^^^^^

.. _Build Parallel Version:

Preparation
'''''''''''

By default, distributed learning with LightGBM uses socket-based communication.

If you need to build a distributed version with MPI support, please refer to the `Installation Guide <./Installation-Guide.rst#build-mpi-version>`__.

Socket Version
**************

You need to collect the IP addresses of all machines that will run distributed learning and allocate one TCP port (assume 12345 here) on all of them,
then change firewall rules to allow incoming connections on this port (12345). Next, write these IP addresses and ports in one file (assume ``mlist.txt``), like the following:

.. code::

    machine1_ip 12345
    machine2_ip 12345
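
As for the firewall step mentioned above, the exact commands depend on your system. For example, on a Linux machine whose firewall is managed by ``ufw``, it might look like the following (an illustrative sketch, not the only way to do it):

.. code::

    # allow incoming TCP connections on the port LightGBM will listen on
    sudo ufw allow 12345/tcp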

MPI Version
***********

You need to collect the IP addresses (or hostnames) of all machines that will run distributed learning.
Then write them in one file (assume ``mlist.txt``), like the following:

.. code::

    machine1_ip
    machine2_ip
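
As an optional sanity check before training, you can confirm that MPI can reach every machine in the list (this assumes ``mpiexec`` is already on your ``PATH`` and passwordless access between machines is configured):

.. code::

    # should print each machine's hostname once
    mpiexec --machinefile mlist.txt hostname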

**Note**: Windows users need to start "smpd" to enable the MPI service. More details can be found `here`_.

.. _Run Parallel Learning:

Run Distributed Learning
''''''''''''''''''''''''

Socket Version
**************

1. Edit the following parameters in the config file (a complete example config sketch is shown after this list):

   ``tree_learner=your_parallel_algorithm``, replace ``your_parallel_algorithm`` with one of ``data``, ``feature`` or ``voting``.

   ``num_machines=your_num_machines``, replace ``your_num_machines`` with the actual number of machines (e.g. 4).

   ``machine_list_file=mlist.txt``, where ``mlist.txt`` is the file created in the `Preparation section <#preparation>`__.

   ``local_listen_port=12345``, where ``12345`` is the port allocated in the `Preparation section <#preparation>`__.

2. Copy data file, executable file, config file and ``mlist.txt`` to all machines.

3. Run the following command on all machines, changing ``your_config_file`` to the path of your actual config file.

   For Windows: ``lightgbm.exe config=your_config_file``

   For Linux: ``./lightgbm config=your_config_file``
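
Putting it together, a config file for socket-based training might look like the following (an illustrative sketch: ``task``, ``objective`` and ``data`` are ordinary LightGBM settings shown with example values; the last four lines come from step 1):

.. code::

    # ordinary training settings
    task = train
    objective = binary
    data = train.txt

    # distributed learning settings
    tree_learner = data
    num_machines = 2
    machine_list_file = mlist.txt
    local_listen_port = 12345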

MPI Version
***********

1. Edit the following parameters in the config file:

   ``tree_learner=your_parallel_algorithm``, replace ``your_parallel_algorithm`` with one of ``data``, ``feature`` or ``voting``.

   ``num_machines=your_num_machines``, replace ``your_num_machines`` with the actual number of machines (e.g. 4).

2. Copy data file, executable file, config file and ``mlist.txt`` to all machines.

   **Note**: MPI needs to be run in the **same path on all machines**.

3. Run the following command on one machine only (there is no need to run it on all machines), changing ``your_config_file`` to the path of your actual config file.

   For Windows:
   
   .. code::

       mpiexec.exe /machinefile mlist.txt lightgbm.exe config=your_config_file

   For Linux:

   .. code::

       mpiexec --machinefile mlist.txt ./lightgbm config=your_config_file

Example
'''''''

-  `A simple distributed learning example`_

.. _Dask: https://docs.dask.org/en/latest/

.. _MMLSpark: https://aka.ms/spark

.. _this MMLSpark example: https://github.com/Azure/mmlspark/blob/master/notebooks/samples/LightGBM%20-%20Quantile%20Regression%20for%20Drug%20Discovery.ipynb

.. _the MMLSpark documentation: https://github.com/Azure/mmlspark/blob/master/docs/lightgbm.md

.. _Kubeflow Fairing: https://www.kubeflow.org/docs/components/fairing/fairing-overview

.. _These examples: https://github.com/kubeflow/fairing/tree/master/examples/lightgbm

.. _Kubeflow XGBoost Operator: https://github.com/kubeflow/xgboost-operator

.. _this example: https://github.com/kubeflow/xgboost-operator/tree/master/config/samples/lightgbm-dist

.. _here: https://www.youtube.com/watch?v=iqzXhp5TxUY

.. _A simple distributed learning example: https://github.com/microsoft/lightgbm/tree/master/examples/parallel_learning