"vscode:/vscode.git/clone" did not exist on "93074dd3583a765608ec5bd30978898cef872551"
Commit 53b5b8c3 authored by Rick Ho

update megatron example

parent 2d067240
FastMoE currently works with both the `v2.0` and `v2.1` releases of
[Megatron-LM](https://github.com/nvidia/megatron-lm).
The patches in this directory make it easy to enable MoE in different
versions of Megatron-LM for training Bert. The usage is the same in other
training scripts.
The patches work in the following way.
### Building the model in FastMoE style
In `pretrain_bert.py`, the `fmoe.megatron.fmoefy` function is used as a
one-line entry point that replaces the MLP layers in the transformer language
model with FastMoE layers.
```python
from fmoe.megatron import fmoefy
model = fmoefy(model, num_experts=4)
```

Note that the `fmoefy` function currently only takes a standard Megatron-LM's
top-level raw model as input, i.e. the MLP layers should be available at
`model.language_model.transformer.layers[i].mlp`.
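If you adapt a different training script, a quick sanity check like the sketch
below can confirm the model exposes that layout before calling `fmoefy`. The
helper name and the check itself are illustrative, not part of FastMoE:

```python
def check_fmoefy_compatible(model):
    # Hypothetical helper: verify the raw Megatron-LM model exposes the
    # layout that `fmoefy` expects before replacing the MLP layers.
    layers = model.language_model.transformer.layers
    assert len(layers) > 0, "expected at least one transformer layer"
    assert all(hasattr(layer, "mlp") for layer in layers), \
        "fmoefy expects an `mlp` submodule on every transformer layer"
```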
### Using FastMoE's model parallelization
In `megatron/training.py`, the `LocalDDP` module is replaced by the one in
`fmoe.megatron` to enable FastMoE's sophisticated data-parallel strategies,
as the snippet below shows.
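Concretely, the swap is a one-line import change, mirroring the
`megatron/training.py` hunk in the patch shown further down:

```python
# Use FastMoE's drop-in replacement for Megatron-LM's local DDP wrapper.
# from megatron.model import DistributedDataParallel as LocalDDP
from fmoe.megatron import DistributedDataParallel as LocalDDP
```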
### Train as usual
Start training with FastMoE by using the scripts provided by Megatron-LM.
diff --git a/megatron/training.py b/megatron/training.py
index 56d1c7c..9c624d2 100644
--- a/megatron/training.py
+++ b/megatron/training.py
@@ -43,7 +43,8 @@ from megatron.optimizer import get_megatron_optimizer
from megatron.initialize import initialize_megatron
from megatron.initialize import write_args_to_tensorboard
from megatron.learning_rates import AnnealingLR
-from megatron.model import DistributedDataParallel as LocalDDP
+# from megatron.model import DistributedDataParallel as LocalDDP
+from fmoe.megatron import DistributedDataParallel as LocalDDP
from megatron.model.realm_model import ICTBertModel
from megatron.utils import check_adlr_autoresume_termination
from megatron.data.data_loaders import build_pretraining_data_loader
diff --git a/pretrain_bert.py b/pretrain_bert.py
index 48bc6ad..48628ce 100644
--- a/pretrain_bert.py
+++ b/pretrain_bert.py
@@ -52,6 +52,8 @@ def model_provider():
num_tokentypes=2,
add_binary_head=True,
parallel_output=True)
+ from fmoe.megatron import fmoefy
+ model = fmoefy(model, num_experts=4)
return model
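
After the patch is applied, `model_provider()` in `pretrain_bert.py` takes
roughly the following shape. This is a sketch reconstructed from the hunk
above; the parts outside the hunk context (the `BertModel` import and call
site) are assumptions about the surrounding Megatron-LM code:

```python
from megatron.model import BertModel  # assumed import from Megatron-LM

def model_provider():
    # Build the standard Megatron-LM BERT model (kwargs from the hunk context).
    model = BertModel(num_tokentypes=2,
                      add_binary_head=True,
                      parallel_output=True)
    # Lines added by the FastMoE patch: replace each MLP with 4 experts.
    from fmoe.megatron import fmoefy
    model = fmoefy(model, num_experts=4)
    return model
```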