Parallel Learning Guide
=======================

This is a guide for parallel learning of LightGBM.

Follow the `Quick Start <./Quick-Start.rst>`__ to learn how to use LightGBM first.

**List of external libraries in which LightGBM can be used in a distributed fashion**

-  `Dask API of LightGBM <./Python-API.rst#dask-api>`__ (formerly a separate package) allows you to create ML workflows on Dask distributed data structures.

-  `MMLSpark`_ integrates LightGBM into the Apache Spark ecosystem.
   `The following example`_ demonstrates how easy it is to utilize the power of Spark with LightGBM.

-  `Kubeflow Fairing`_ suggests using LightGBM in a Kubernetes cluster.
   `These examples`_ help to get started with LightGBM in a hybrid cloud environment.

   You can also use `Kubeflow XGBoost Operator`_ to train LightGBM models.
   Please check `this example`_ for how to do this.

Choose Appropriate Parallel Algorithm
-------------------------------------

LightGBM currently provides 3 parallel learning algorithms.

+--------------------+---------------------------+
| Parallel Algorithm | How to Use                |
+====================+===========================+
| Data parallel      | ``tree_learner=data``     |
+--------------------+---------------------------+
| Feature parallel   | ``tree_learner=feature``  |
+--------------------+---------------------------+
| Voting parallel    | ``tree_learner=voting``   |
+--------------------+---------------------------+

These algorithms are suited for different scenarios, as listed in the following table:

+-------------------------+-------------------+-----------------+
|                         | #data is small    | #data is large  |
+=========================+===================+=================+
| **#feature is small**   | Feature Parallel  | Data Parallel   |
+-------------------------+-------------------+-----------------+
| **#feature is large**   | Feature Parallel  | Voting Parallel |
+-------------------------+-------------------+-----------------+
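
For instance, with a large ``#data`` and a large ``#feature``, the table above points to voting parallel; expressed as a config fragment (a minimal sketch, not a complete config), that choice looks like this:

.. code::

    # large #data and large #feature -> voting parallel
    tree_learner = voting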

More details about these parallel algorithms can be found in `optimization in parallel learning <./Features.rst#optimization-in-parallel-learning>`__.

Build Parallel Version
----------------------

The default build version supports parallel learning based on sockets.

If you need to build the parallel version with MPI support, please refer to `Installation Guide <./Installation-Guide.rst#build-mpi-version>`__.
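
For reference, building the MPI version on Linux looks roughly like the following sketch; treat the `Installation Guide <./Installation-Guide.rst#build-mpi-version>`__ as the authoritative steps:

.. code::

    git clone --recursive https://github.com/microsoft/LightGBM
    cd LightGBM
    mkdir build
    cd build
    # USE_MPI=ON switches communication from sockets to MPI
    cmake -DUSE_MPI=ON ..
    make -j4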

Preparation
-----------

Socket Version
^^^^^^^^^^^^^^

You need to collect the IP addresses of all machines that will run parallel learning, allocate one TCP port (assume 12345 here) on all of them,
and change firewall rules to allow incoming traffic on this port (12345). Then write these IP addresses and ports into one file (assume ``mlist.txt``), like the following:

.. code::

    machine1_ip 12345
    machine2_ip 12345
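
How firewall rules are changed depends on the system; as one sketch, on a Linux machine running ``firewalld`` the allocated port could be opened like this (your environment may use a different firewall tool):

.. code::

    # allow incoming TCP traffic on the allocated port (12345)
    sudo firewall-cmd --permanent --add-port=12345/tcp
    sudo firewall-cmd --reload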

MPI Version
^^^^^^^^^^^

You need to collect the IP addresses (or hostnames) of all machines that will run parallel learning.
Then write them into one file (assume ``mlist.txt``) like the following:

.. code::

    machine1_ip
    machine2_ip

**Note**: Windows users need to start "smpd" to launch the MPI service. More details can be found `here`_.

Run Parallel Learning
---------------------

Socket Version
^^^^^^^^^^^^^^

1. Edit the following parameters in the config file (a complete sample config is sketched after this list):

   ``tree_learner=your_parallel_algorithm``, edit ``your_parallel_algorithm`` (e.g. ``feature`` or ``data``) here.

   ``num_machines=your_num_machines``, edit ``your_num_machines`` (e.g. 4) here.

   ``machine_list_file=mlist.txt``, where ``mlist.txt`` is the file created in the `Preparation section <#preparation>`__.

   ``local_listen_port=12345``, where ``12345`` is the port allocated in the `Preparation section <#preparation>`__.

2. Copy the data file, executable file, config file and ``mlist.txt`` to all machines.

3. Run the following command on all machines, changing ``your_config_file`` to the real config file name.

   For Windows: ``lightgbm.exe config=your_config_file``

   For Linux: ``./lightgbm config=your_config_file``
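
Putting these parameters together, a minimal sample config for 2 machines running data parallel learning could look like the following sketch (``train.txt`` and the ``binary`` objective are placeholders for your own data and task):

.. code::

    task = train
    objective = binary
    data = train.txt
    tree_learner = data
    num_machines = 2
    machine_list_file = mlist.txt
    local_listen_port = 12345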

MPI Version
^^^^^^^^^^^

1. Edit the following parameters in the config file:

   ``tree_learner=your_parallel_algorithm``, edit ``your_parallel_algorithm`` (e.g. ``feature`` or ``data``) here.

   ``num_machines=your_num_machines``, edit ``your_num_machines`` (e.g. 4) here.

2. Copy the data file, executable file, config file and ``mlist.txt`` to all machines (one way to do this is sketched after this list).

   **Note**: MPI needs to be run in the **same path on all machines**.

3. Run the following command on one machine (no need to run it on all machines), changing ``your_config_file`` to the real config file name.

   For Windows:

   .. code::

       mpiexec.exe /machinefile mlist.txt lightgbm.exe config=your_config_file

   For Linux:

   .. code::

       mpiexec --machinefile mlist.txt ./lightgbm config=your_config_file
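
As referenced in step 2, one way to distribute the files from the first machine is a loop over ``mlist.txt`` (a sketch; ``train.txt``, ``train.conf`` and the ``lightgbm-run`` directory are placeholders, and SSH access between machines is assumed):

.. code::

    # copy training data, executable, config and machine list to every worker;
    # MPI requires the same path on all machines
    for host in $(cat mlist.txt); do
        scp train.txt lightgbm train.conf mlist.txt ${host}:lightgbm-run/
    done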

Example
^^^^^^^

-  `A simple parallel example`_

.. _MMLSpark: https://aka.ms/spark

.. _The following example: https://github.com/Azure/mmlspark/blob/master/notebooks/samples/LightGBM%20-%20Quantile%20Regression%20for%20Drug%20Discovery.ipynb

.. _Kubeflow Fairing: https://www.kubeflow.org/docs/components/fairing/fairing-overview

.. _These examples: https://github.com/kubeflow/fairing/tree/master/examples/lightgbm

.. _Kubeflow XGBoost Operator: https://github.com/kubeflow/xgboost-operator

.. _this example: https://github.com/kubeflow/xgboost-operator/tree/master/config/samples/lightgbm-dist

.. _here: https://www.youtube.com/watch?v=iqzXhp5TxUY

.. _A simple parallel example: https://github.com/microsoft/lightgbm/tree/master/examples/parallel_learning