OpenDAS / FastMoE

Commit 53b5b8c3
authored Mar 01, 2021 by Rick Ho
update megatron example
parent 2d067240
Showing 3 changed files with 35 additions and 7 deletions:

examples/megatron/README.md          +8  -7
examples/megatron/fmoefy-v2.0.patch  +0  -0
examples/megatron/fmoefy-v2.1.patch  +27 -0
examples/megatron/README.md

-Fast MoE currently works with the `v2.0` release of
+FastMoE currently works with both the `v2.0` and `v2.1` releases of
 [Megatron-LM](https://github.com/nvidia/megatron-lm).
-A [patch](moefy.patch) is used to easily enable MoE in Megatron-LM for training
-Bert.
+Patches, which you can find in this directory, are used to easily enable MoE in
+different versions of Megatron-LM for training Bert. The usage is the same in
+other training scripts.
 The patch works in the following way.
-### Building the model
+### Building the model in FastMoE style
 In `pretrain_bert.py`, the `fmoe.megatron.fmoefy` function is used as an
-entrance to introduce Fast MoE layers in one call, replacing the MLP layers in the
+entrance to introduce FastMoE layers in one call, replacing the MLP layers in the
 transformer language models.
 ```python
 ...
 ```
@@ -21,7 +22,7 @@ Note that the `fmoefy` function currently only takes a standard Megatron-LM's
 top-level raw model as input, i.e. the MLP layers should be available at
 `model.language_model.transformer.layers[i].mlp`.
-### Using expert parallelization
+### Using FastMoE's model parallelization
 In `megatron/training.py`, the `LocalDDP` module is replaced by the one in
 `fmoe.megatron` to enable the sophisticated data parallel strategies that can
 ...
@@ -35,4 +36,4 @@ from fmoe.megatron import DistributedDataParallel as LocalDDP
 ### Train as usual
-Start training with Fast MoE by using the scripts provided by Megatron-LM.
+Start training with FastMoE by using the scripts provided by Megatron-LM.
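To make the README's description concrete, here is a minimal sketch of what the patched `model_provider` in `pretrain_bert.py` ends up doing. It is assembled from the README text above and the `fmoefy-v2.1.patch` shown further down, not copied from Megatron-LM itself: only the `BertModel` arguments visible in the patch are shown, and `num_experts=4` is simply the value the patch happens to use.

```python
# Illustrative sketch only: a condensed view of pretrain_bert.py's
# model_provider after the fmoefy patch is applied. Arguments not visible in
# the patch are omitted, so this is not the actual Megatron-LM code.
from megatron.model import BertModel
from fmoe.megatron import fmoefy
# The companion change in megatron/training.py swaps Megatron's DDP wrapper
# for FastMoE's, as shown in fmoefy-v2.1.patch below:
from fmoe.megatron import DistributedDataParallel as LocalDDP  # noqa: F401


def model_provider():
    # Build the standard Megatron-LM Bert model first.
    model = BertModel(num_tokentypes=2,
                      add_binary_head=True,
                      parallel_output=True)
    # One call replaces each model.language_model.transformer.layers[i].mlp
    # with a FastMoE layer; 4 experts is the value used in fmoefy-v2.1.patch.
    model = fmoefy(model, num_experts=4)
    return model
```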
examples/megatron/moefy.patch → examples/megatron/fmoefy-v2.0.patch

File moved
examples/megatron/fmoefy-v2.1.patch (new file, mode 100644)

diff --git a/megatron/training.py b/megatron/training.py
index 56d1c7c..9c624d2 100644
--- a/megatron/training.py
+++ b/megatron/training.py
@@ -43,7 +43,8 @@ from megatron.optimizer import get_megatron_optimizer
 from megatron.initialize import initialize_megatron
 from megatron.initialize import write_args_to_tensorboard
 from megatron.learning_rates import AnnealingLR
-from megatron.model import DistributedDataParallel as LocalDDP
+# from megatron.model import DistributedDataParallel as LocalDDP
+from fmoe.megatron import DistributedDataParallel as LocalDDP
 from megatron.model.realm_model import ICTBertModel
 from megatron.utils import check_adlr_autoresume_termination
 from megatron.data.data_loaders import build_pretraining_data_loader
diff --git a/pretrain_bert.py b/pretrain_bert.py
index 48bc6ad..48628ce 100644
--- a/pretrain_bert.py
+++ b/pretrain_bert.py
@@ -52,6 +52,8 @@ def model_provider():
         num_tokentypes=2,
         add_binary_head=True,
         parallel_output=True)
+    from fmoe.megatron import fmoefy
+    model = fmoefy(model, num_experts=4)
     return model