FastMoE works with different versions of
[Megatron-LM](https://github.com/nvidia/megatron-lm).
See `fmoe/megatron/utils.py` for arguments of FastMoE.

An example patch is provided for the `v2.2` release.
The patch can be directly applied to add FastMoE support if you are using
Megatron-LM v2.2.
Otherwise, you may need to manually enable FastMoE in your codebase.
The patch includes the following modifications.

### Add arguments to Megatron's argparser

In `megatron/arguments.py`, add `_add_fmoe_args` to the parser.
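
A minimal sketch of the wiring, following Megatron's `_add_*_args(parser)` convention. How `_add_fmoe_args` is brought into scope is an assumption here; the actual arguments it registers are defined in `fmoe/megatron/utils.py`.

```python
# megatron/arguments.py -- a sketch, not the exact patch; the import path is an
# assumption, and the argument definitions live in fmoe/megatron/utils.py
from fmoe.megatron.utils import add_fmoe_args as _add_fmoe_args

# inside parse_args(), next to Megatron's other _add_*_args(parser) calls:
parser = _add_fmoe_args(parser)  # register FastMoE's command-line arguments
```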

### Patch checkpoint

In `megatron/training.py`, replace `load_checkpoint` and `save_checkpoint` by
functions with the same name in `fmoe.megatron.checkpointing`.
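
A minimal sketch of the swap, assuming the two functions are imported from `megatron.checkpointing` in the stock v2.2 `training.py`:

```python
# megatron/training.py
# from megatron.checkpointing import load_checkpoint, save_checkpoint
from fmoe.megatron.checkpointing import load_checkpoint, save_checkpoint
```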

### Building the model in FastMoE style

In `megatron/training.py`, the `fmoe.megatron.fmoefy` function is used as an
entry point that replaces the MLP layers of the transformer language model
with FastMoE layers in a single call.

```python
from fmoe.megatron import fmoefy
model = fmoefy(model, num_experts=4)
```

Note that the `fmoefy` function currently only takes a standard Megatron-LM
top-level raw model as input, i.e. the MLP layers should be available at
`model.language_model.transformer.layers[i].mlp`.
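
One possible placement is inside `get_model` in `megatron/training.py`, right after the raw model is constructed and before it is moved to GPU and wrapped in DDP. This is a sketch only; the `args.fmoefy` and `args.num_experts` names below are hypothetical.

```python
# megatron/training.py, inside get_model() -- a sketch; the flag and argument
# names (args.fmoefy, args.num_experts) are hypothetical
model = model_provider_func()  # standard Megatron-LM raw model
if args.fmoefy:
    from fmoe.megatron import fmoefy
    model = fmoefy(model, num_experts=args.num_experts)
# ... continue with the usual fp16 conversion and DDP wrapping ...
```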

### Using FastMoE's model parallelization

In `megatron/training.py`, the `LocalDDP` module is replaced by the one in
`fmoe.megatron` to enable the sophisticated data parallel strategies that can
parallelize the experts across both the data parallel group and the (tensor)
model parallel group.

```python
# from megatron.model import DistributedDataParallel as LocalDDP
from fmoe.megatron import DistributedDataParallel as LocalDDP
```

### Train as usual

Start training with FastMoE by using the scripts provided by Megatron-LM.