Commit 59b27103 authored by Rick Ho

update instructions for megatron

parent d6e7a429
...@@ -11,10 +11,10 @@ model for PyTorch.
### Prerequisites
PyTorch with CUDA is required. The repository is currently tested with PyTorch
v1.8.0 and CUDA 10, with designed compatibility to older versions.

If the distributed expert feature is enabled, NCCL with P2P communication
support, typically versions `>=2.7.5`, is needed.
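If you are unsure whether a local environment meets these requirements, the versions can be checked from Python with standard PyTorch introspection calls, as sketched below; nothing here is specific to Fast MoE.

```python
import torch

# Report the versions relevant to the prerequisites above: PyTorch itself,
# the CUDA toolkit it was built against, and the bundled NCCL (only needed
# when the distributed expert feature is enabled).
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("NCCL version:", torch.cuda.nccl.version())
```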
### Installing
...@@ -22,6 +22,9 @@ Fast MoE contains a set of PyTorch customized operators, including both C and
Python components. Use `python setup.py install` to easily install and enjoy
using Fast MoE for training.
The distributed expert feature is enabled by default. If you want to disable
it, pass the environment variable `USE_NCCL=0` to the setup script.
## Usage
### FMoEfy a transformer model
...@@ -30,27 +33,33 @@ Transformer is currently the most popular model to be extended by MoE. Using
Fast MoE, a transformer-based model can be extended as MoE by a one-key plugin
shown as follows.
For example, when using [Megatron-LM](https://github.com/nvidia/megatron-lm),
the following lines can help you easily scale up the MLP layers to multiple
experts.
```python
from fmoe.megatron import fmoefy
model = fmoefy(model, num_experts=<number of experts per worker>)
```
A detailed tutorial to _moefy_ Megatron-LM can be found
[here](examples/megatron).
### Using Fast MoE as a PyTorch module
An example MoE transformer model can be seen in the
[Transformer-XL](examples/transformer-xl) example. The easiest way is to
replace the MLP layers with the `FMoE` layers.
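As a rough illustration of that replacement, the sketch below drops an `FMoETransformerMLP` into a toy transformer block in place of a dense feed-forward layer. The constructor arguments (`num_expert`, `d_model`, `d_hidden`) and the plain `forward(x)` call are assumptions made for illustration; check `fmoe/layers.py` and the Transformer-XL example for the actual interface.

```python
import torch
import torch.nn as nn
from fmoe import FMoETransformerMLP

class ToyBlock(nn.Module):
    """A toy transformer block whose feed-forward part is an MoE layer."""

    def __init__(self, d_model=768, d_hidden=3072, num_expert=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=12)
        # Instead of a dense nn.Sequential(nn.Linear(d_model, d_hidden),
        # nn.GELU(), nn.Linear(d_hidden, d_model)), use the MoE feed-forward
        # layer. The argument names here are assumptions, not the verified
        # signature.
        self.mlp = FMoETransformerMLP(num_expert=num_expert,
                                      d_model=d_model,
                                      d_hidden=d_hidden)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        # The MoE MLP is assumed to be callable like the dense MLP it replaces.
        return self.mlp(x + attn_out)
```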
### Using Fast MoE in Parallel

For data parallel, no extra coding is needed.
For expert parallel, in which experts are located separately across workers,
sophisticated data-parallel strategies are needed that neither PyTorch nor
Megatron-LM provides. The `fmoe.DistributedGroupedDataParallel` module is
introduced to replace PyTorch's DDP module.
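A minimal sketch of the intended replacement is shown below, assuming that `DistributedGroupedDataParallel` wraps a module the way `torch.nn.parallel.DistributedDataParallel` does; any additional constructor arguments (for example, how the expert and data-parallel groups are specified) are not shown, so refer to `fmoe/distributed.py` for the real signature. The helper `build_moe_model` is hypothetical.

```python
import torch
import torch.distributed as dist
from fmoe import DistributedGroupedDataParallel

# Assumed setup: one process per GPU, launched e.g. via torch.distributed.launch.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_moe_model().cuda()  # hypothetical helper returning an FMoE-fied model

# Wrap with Fast MoE's grouped DDP instead of torch's DistributedDataParallel,
# so that expert parameters are synchronized within the appropriate groups
# rather than blindly across all workers.
model = DistributedGroupedDataParallel(model)
```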
Fast MoE currently works with the `v2.0` release of
[Megatron-LM](https://github.com/nvidia/megatron-lm).
A [patch](moefy.patch) is used to easily enable MoE in Megatron-LM for training
Bert.

The patch works in the following way.
### Building the model
In `pretrain_bert.py`, the `fmoe.megatron.fmoefy` function is used as a
one-key entry point to replace the MLP layers in the transformer language
model with Fast MoE layers.
```python
from fmoe.megatron import fmoefy
model = fmoefy(model, num_experts=4)
```
Note that the `fmoefy` function currently only takes a standard Megatron-LM
top-level raw model as input, i.e. the MLP layers should be available at
`model.language_model.transformer.layers[i].mlp`.
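For context, the sketch below shows how such a hook could look inside the model provider of `pretrain_bert.py`. The `model_provider`/`BertModel` details follow Megatron-LM v2.0 conventions and are illustrative assumptions, not the exact contents of the patch.

```python
from fmoe.megatron import fmoefy


def model_provider():
    """Build a BERT model and moefy its MLP layers (illustrative sketch)."""
    # `BertModel` and its arguments follow Megatron-LM v2.0; adjust them to
    # the actual pretrain_bert.py in your checkout.
    from megatron.model import BertModel
    model = BertModel(num_tokentypes=2, add_binary_head=True)
    # fmoefy expects the raw top-level model, with MLP layers reachable at
    # model.language_model.transformer.layers[i].mlp
    return fmoefy(model, num_experts=4)
```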
### Using expert parallelization
In `megatron/training.py`, the `LocalDDP` module is replaced by the one in
`fmoe.megatron` to enable the sophisticated data-parallel strategies that
parallelize the experts across both the data parallel group and the (tensor)
model parallel group.
```python
# from megatron.model import DistributedDataParallel as LocalDDP
from fmoe.megatron import DistributedDataParallel as LocalDDP
```
### Train as usual
Start training with Fast MoE by using the scripts provided by Megatron-LM.
...@@ -3,3 +3,4 @@ The fmoe package contains MoE Layers only.
"""
from .layers import FMoELinear, FMoENaiveGate, FMoETransformerMLP
from .distributed import DistributedGroupedDataParallel