FastMoE currently works with both the `v2.0` and `v2.1` releases of
[Megatron-LM](https://github.com/nvidia/megatron-lm).

The patches in this directory make it easy to enable MoE in different
versions of Megatron-LM for training BERT. Their usage is the same in other
training scripts.

Each patch works in the following way.

### Building the model in FastMoE style

In `pretrain_bert.py`, the `fmoe.megatron.fmoefy` function serves as a
single entry point that replaces the MLP layers in the transformer language
model with FastMoE layers.

```python
from fmoe.megatron import fmoefy
model = fmoefy(model, num_experts=4)
```

Note that the `fmoefy` function currently only takes a standard Megatron-LM
top-level raw model as input, i.e., the MLP layers should be available at
`model.language_model.transformer.layers[i].mlp`.
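
As a rough illustration, the sketch below shows what this replacement amounts
to, assuming the model layout above; `build_moe_mlp` is a hypothetical factory
standing in for FastMoE's internal logic, not part of its public API.

```python
# Conceptual sketch only, not FastMoE's actual implementation.
def fmoefy_sketch(model, num_experts=4, build_moe_mlp=None):
    # Walk the standard Megatron-LM layout and swap each transformer
    # layer's dense MLP for an MoE layer with `num_experts` experts.
    for layer in model.language_model.transformer.layers:
        layer.mlp = build_moe_mlp(layer.mlp, num_experts)
    return model
```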

### Using FastMoE's model parallelization

In `megatron/training.py`, the `LocalDDP` module is replaced by the one in
`fmoe.megatron` to enable the sophisticated data parallel strategies that can
parallelize the experts across both the data parallel group and the (tensor)
model parallel group.

```python
# from megatron.model import DistributedDataParallel as LocalDDP
from fmoe.megatron import DistributedDataParallel as LocalDDP
```
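
The general idea can be sketched as follows, assuming expert parameters can be
distinguished from ordinary ones (here via a hypothetical `is_expert`
attribute): ordinary gradients are all-reduced over the data parallel group as
usual, while expert gradients are reduced over a process group of their own.
This is only a conceptual sketch, not the actual
`fmoe.megatron.DistributedDataParallel` implementation.

```python
import torch.distributed as dist

def reduce_gradients_sketch(model, data_parallel_group, expert_group):
    # Conceptual sketch only: pick a process group per parameter and
    # all-reduce its gradient over that group.
    for param in model.parameters():
        if param.grad is None:
            continue
        # Hypothetical marker: assume expert parameters were tagged with
        # an `is_expert` attribute when the MoE layers were built.
        group = expert_group if getattr(param, "is_expert", False) \
            else data_parallel_group
        dist.all_reduce(param.grad, group=group)
```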

### Train as usual

Start training with FastMoE using the scripts provided by Megatron-LM.