distributed package
===================

This package contains utilities to finalize model weight gradients
on each rank before the optimizer step. It includes a distributed
data-parallelism wrapper that all-reduces or reduce-scatters gradients across
data-parallel replicas, and a ``finalize_model_grads`` function that
synchronizes gradients across the other parallelism modes (e.g., 'tied'
layers on different pipeline stages, or gradients for experts in an MoE spread
over different ranks due to expert parallelism).
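
Conceptually, the flow is: run backprop to produce local gradients, reduce those
gradients across the data-parallel group, then take the optimizer step. The sketch
below illustrates this with plain PyTorch primitives only; it does not use this
package's classes, and the helper name and the explicit per-parameter loop are
simplifying assumptions.

.. code-block:: python

   import torch
   import torch.distributed as dist


   def reduce_grads_across_data_parallel_ranks(model: torch.nn.Module,
                                               dp_group: dist.ProcessGroup) -> None:
       """Illustrative only: average every gradient across data-parallel replicas."""
       dp_world_size = dist.get_world_size(group=dp_group)
       for param in model.parameters():
           if param.grad is not None:
               dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=dp_group)
               param.grad.div_(dp_world_size)


   # Typical ordering inside a training step:
   #   loss.backward()                                           # local gradients
   #   reduce_grads_across_data_parallel_ranks(model, dp_group)  # sync across replicas
   #   optimizer.step()                                          # synchronized update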

Submodules
----------

distributed.distributed\_data\_parallel
---------------------------------------

Model wrapper for distributed data parallelism. Stores gradients in a
contiguous buffer, and can overlap communication (all-reduce or reduce-scatter)
with backprop computation by breaking the full model's gradients into smaller
buckets and running the all-reduce / reduce-scatter on each bucket asynchronously.
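
A rough sketch of the bucketing idea follows; it is not the wrapper's actual
implementation. It groups parameters into buckets, registers a
post-accumulate-grad hook on each parameter (PyTorch >= 2.1), and launches an
asynchronous all-reduce as soon as every gradient in a bucket is ready, so
communication overlaps with the remainder of backprop. The contiguous gradient
buffer, the reduce-scatter path, and gradient averaging are omitted, and all
names are assumptions.

.. code-block:: python

   import torch.distributed as dist


   class BucketedGradReducer:
       """Illustrative only: overlap gradient all-reduce with backprop via buckets."""

       def __init__(self, params, dp_group, bucket_size=8):
           self.dp_group = dp_group
           self.handles = []
           params = [p for p in params if p.requires_grad]
           # Fixed-size buckets for simplicity (real implementations bucket by bytes).
           self.buckets = [params[i:i + bucket_size]
                           for i in range(0, len(params), bucket_size)]
           for bucket in self.buckets:
               remaining = {"count": len(bucket)}
               for p in bucket:
                   # Fires once this parameter's .grad is fully accumulated.
                   p.register_post_accumulate_grad_hook(
                       self._make_hook(bucket, remaining))

       def _make_hook(self, bucket, remaining):
           def hook(_param):
               remaining["count"] -= 1
               if remaining["count"] == 0:  # every grad in this bucket is ready
                   for p in bucket:
                       # async_op=True returns a handle, so backprop keeps running.
                       self.handles.append(
                           dist.all_reduce(p.grad, group=self.dp_group, async_op=True))
                   remaining["count"] = len(bucket)  # reset for the next iteration
           return hook

       def wait(self):
           """Block until outstanding all-reduces finish (call before optimizer.step())."""
           for handle in self.handles:
               handle.wait()
           self.handles.clear()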

.. automodule:: core.distributed.distributed_data_parallel
   :members:
   :undoc-members:
   :show-inheritance:

distributed.finalize\_model\_grads
----------------------------------

Finalizes model gradients for the optimizer step across all parallelism modes in use.
Synchronizes the all-reduce / reduce-scatter of model gradients across data-parallel
replicas, all-reduces layernorm gradients for sequence parallelism, all-reduces
embedding gradients across the first and last pipeline stages (when the input
embedding and output weights are tied), and reduces expert gradients for expert
parallelism.
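
The sketch below illustrates the kind of per-parallelism-mode synchronization this
step performs; it is not the function's actual implementation, and the process-group
arguments, the name-based layernorm check, and the ``shared_embedding`` attribute
are assumptions for illustration.

.. code-block:: python

   import torch
   import torch.distributed as dist


   def finalize_grads_sketch(model: torch.nn.Module,
                             dp_group: dist.ProcessGroup,
                             tp_group: dist.ProcessGroup,
                             embedding_group: dist.ProcessGroup,
                             on_first_or_last_pipeline_stage: bool) -> None:
       """Illustrative only: sync gradients per parallelism mode before optimizer.step()."""
       # 1) Data parallelism: reduce every gradient across data-parallel replicas.
       for param in model.parameters():
           if param.grad is not None:
               dist.all_reduce(param.grad, group=dp_group)

       # 2) Sequence parallelism: layernorm weights are replicated across
       #    tensor-parallel ranks, so their gradients are all-reduced over the TP group.
       for name, param in model.named_parameters():
           if "layernorm" in name.lower() and param.grad is not None:
               dist.all_reduce(param.grad, group=tp_group)

       # 3) Pipeline parallelism with tied embeddings: the first and last stages both
       #    hold the embedding weight, so its gradient is all-reduced across them.
       if on_first_or_last_pipeline_stage:
           embedding = getattr(model, "shared_embedding", None)  # hypothetical attribute
           if embedding is not None and embedding.weight.grad is not None:
               dist.all_reduce(embedding.weight.grad, group=embedding_group)

       # 4) Expert parallelism: expert weights live on different ranks, so their
       #    gradients are reduced over a separate data-parallel group (omitted here).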

.. automodule:: core.distributed.finalize_model_grads
   :members:
   :undoc-members:
   :show-inheritance:


Module contents
---------------

Contains functionality to synchronize gradients across different ranks before the
optimizer step.

.. automodule:: core.distributed
   :members:
   :undoc-members:
   :show-inheritance: