Commit 59b27103 authored by Rick Ho

update instructions for megatron

parent d6e7a429
...@@ -11,10 +11,10 @@ model for PyTorch.
### Prerequisites
PyTorch with CUDA is required. The repository is currently tested with PyTorch
v1.8.0 and CUDA 10, with designed compatibility to older versions.

If the distributed expert feature is enabled, NCCL with P2P communication
support, typically versions `>=2.7.5`, is needed.
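If you are unsure whether a local environment meets these requirements, the versions can be checked from Python with standard PyTorch introspection calls, as sketched below; nothing here is specific to Fast MoE.

```python
import torch

# Report the versions relevant to the prerequisites above: PyTorch itself,
# the CUDA toolkit it was built against, and the bundled NCCL (only needed
# when the distributed expert feature is enabled).
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("NCCL version:", torch.cuda.nccl.version())
```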
### Installing
...@@ -22,6 +22,9 @@ Fast MoE contains a set of PyTorch customized operators, including both C and
Python components. Use `python setup.py install` to easily install and enjoy
using Fast MoE for training.
The distributed expert feature is enabled by default. If you want to disable
it, pass the environment variable `USE_NCCL=0` to the setup script.
## Usage
### FMoEfy a transformer model
...@@ -30,27 +33,33 @@ Transformer is currently the most popular model to be extended by MoE. Using
Fast MoE, a transformer-based model can be extended as MoE by a one-key plugin
shown as follows.
For example, when using [Megatron-LM](https://github.com/nvidia/megatron-lm),
the following lines can help you easily scale up the MLP layers to multiple
experts.
```python
from fmoe.megatron import fmoefy
model = fmoefy(model, num_experts=<number of experts per worker>)
```
A detailed tutorial to _moefy_ Megatron-LM can be found
[here](examples/megatron).
### Using Fast MoE as a PyTorch module
An example MoE transformer model can be seen in the
[Transformer-XL](examples/transformer-xl) example. The easiest way is to
replace the MLP layers with the `FMoE` layers.
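As a rough illustration of that replacement, the sketch below drops an `FMoETransformerMLP` into a toy transformer block in place of a dense feed-forward layer. The constructor arguments (`num_expert`, `d_model`, `d_hidden`) and the plain `forward(x)` call are assumptions made for illustration; check `fmoe/layers.py` and the Transformer-XL example for the actual interface.

```python
import torch
import torch.nn as nn
from fmoe import FMoETransformerMLP

class ToyBlock(nn.Module):
    """A toy transformer block whose feed-forward part is an MoE layer."""

    def __init__(self, d_model=768, d_hidden=3072, num_expert=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=12)
        # Instead of a dense nn.Sequential(nn.Linear(d_model, d_hidden),
        # nn.GELU(), nn.Linear(d_hidden, d_model)), use the MoE feed-forward
        # layer. The argument names here are assumptions, not the verified
        # signature.
        self.mlp = FMoETransformerMLP(num_expert=num_expert,
                                      d_model=d_model,
                                      d_hidden=d_hidden)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        # The MoE MLP is assumed to be callable like the dense MLP it replaces.
        return self.mlp(x + attn_out)
```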
### Using Fast MoE in Parallel

For data parallel, no extra coding is needed.
For expert parallel, in which experts are located separately across workers,
sophisticated data-parallel strategies are needed that neither PyTorch nor
Megatron-LM provides. The `fmoe.DistributedGroupedDataParallel` module is
introduced to replace PyTorch's DDP module.
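A minimal sketch of the intended replacement is shown below, assuming that `DistributedGroupedDataParallel` wraps a module the way `torch.nn.parallel.DistributedDataParallel` does; any additional constructor arguments (for example, how the expert and data-parallel groups are specified) are not shown, so refer to `fmoe/distributed.py` for the real signature. The helper `build_moe_model` is hypothetical.

```python
import torch
import torch.distributed as dist
from fmoe import DistributedGroupedDataParallel

# Assumed setup: one process per GPU, launched e.g. via torch.distributed.launch.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_moe_model().cuda()  # hypothetical helper returning an FMoE-fied model

# Wrap with Fast MoE's grouped DDP instead of torch's DistributedDataParallel,
# so that expert parameters are synchronized within the appropriate groups
# rather than blindly across all workers.
model = DistributedGroupedDataParallel(model)
```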
Fast MoE currently works with the `v2.0` release of
[Megatron-LM](https://github.com/nvidia/megatron-lm).
A [patch](moefy.patch) is used to easily enable MoE in Megatron-LM for training
Bert.

The patch works in the following way.
### Building the model
In `pretrain_bert.py`, the `fmoe.megatron.fmoefy` function is used as a
one-key entry point to replace the MLP layers in the transformer language
model with Fast MoE layers.
```python
from fmoe.megatron import fmoefy
model = fmoefy(model, num_experts=4)
```
Note that the `fmoefy` function currently only takes a standard Megatron-LM
top-level raw model as input, i.e. the MLP layers should be available at
`model.language_model.transformer.layers[i].mlp`.
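For context, the sketch below shows how such a hook could look inside the model provider of `pretrain_bert.py`. The `model_provider`/`BertModel` details follow Megatron-LM v2.0 conventions and are illustrative assumptions, not the exact contents of the patch.

```python
from fmoe.megatron import fmoefy


def model_provider():
    """Build a BERT model and moefy its MLP layers (illustrative sketch)."""
    # `BertModel` and its arguments follow Megatron-LM v2.0; adjust them to
    # the actual pretrain_bert.py in your checkout.
    from megatron.model import BertModel
    model = BertModel(num_tokentypes=2, add_binary_head=True)
    # fmoefy expects the raw top-level model, with MLP layers reachable at
    # model.language_model.transformer.layers[i].mlp
    return fmoefy(model, num_experts=4)
```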
### Using expert parallelization
In `megatron/training.py`, the `LocalDDP` module is replaced by the one in
`fmoe.megatron` to enable the sophisticated data-parallel strategies that
parallelize the experts across both the data parallel group and the (tensor)
model parallel group.
```python
# from megatron.model import DistributedDataParallel as LocalDDP
from fmoe.megatron import DistributedDataParallel as LocalDDP
```
### Train as usual
Start training with Fast MoE by using the scripts provided by Megatron-LM.
...@@ -3,3 +3,4 @@ The fmoe package contains MoE Layers only.
"""
from .layers import FMoELinear, FMoENaiveGate, FMoETransformerMLP
from .distributed import DistributedGroupedDataParallel