OpenDAS / FastMoE

Commit 59b27103
authored Feb 05, 2021 by Rick Ho

update instructions for megatron

parent d6e7a429

Showing 3 changed files with 57 additions and 31 deletions:

- README.md (+24, -15)
- examples/megatron/README.md (+32, -16)
- fmoe/__init__.py (+1, -0)
README.md (view file @ 59b27103)

````diff
@@ -11,10 +11,10 @@ model for PyTorch.
 ### Prerequisites
 
 PyTorch with CUDA is required. The repository is currently tested with PyTorch
-v1.6.0 and CUDA 10, with designed compatibility to other versions.
+v1.8.0 and CUDA 10, with designed compatibility to older versions.
 
-If distributed version is enabled, NCCL with P2P communication support,
-typically versions >= 2.7.5 is needed.
+If the distributed expert feature is enabled, NCCL with P2P communication
+support, typically versions `>=2.7.5`, is needed.
 
 ### Installing
@@ -22,6 +22,9 @@ Fast MoE contains a set of PyTorch customized opearators, including both C and
 Python components. Use `python setup.py install` to easily install and enjoy
 using Fast MoE for training.
 
+The distributed expert feature is enabled by default. If you want to disable
+it, pass environment variable `USE_NCCL=0` to the setup script.
+
 ## Usage
 
 ### FMoEfy a transformer model
````
````diff
@@ -30,27 +33,33 @@ Transformer is currently the most popular model to be extended by MoE. Using
 Fast MoE, a transformer-based model can be extended as MoE by an one-key plugin
 shown as follow.
 
-Assume that there is a PyTorch model `model` with MLP layers located at
-`model.language_model.transformer.layers[<idx>].mlp`, use the following two
-lines to easily scale up the MLP layers to multiple experts.
+For example, when using [Megatron-LM](https://github.com/nvidia/megatron-lm),
+using the following lines can help you easily scale up the MLP layers to
+multiple experts.
 
 ```python
+model = ...
+
 from fmoe.megatron import fmoefy
 model = fmoefy(model, num_experts=<number of experts per worker>)
+
+train(model, ...)
 ```
 
+A detailed tutorial to _moefy_ Megatron-LM can be found [here](examples/megatron).
+
 ### Using Fast MoE as a PyTorch module
 
-Examples can be seen in [examples](examples/). The easist way is to replace the
-feed forward layer by the `FMoE` layer.
+An example MoE transformer model can be seen in the
+[Transformer-XL](examples/transformer-xl) example. The easist way is to replace
+the MLP layer by the `FMoE` layers.
````
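The updated "Using Fast MoE as a PyTorch module" text above describes the swap only in prose. As a rough sketch (not part of this commit), replacing a dense feed-forward block with the `FMoETransformerMLP` layer exported by `fmoe` might look like the following; the keyword arguments `num_expert`, `d_model`, and `d_hidden` are assumptions about the layer's interface, not a documented signature.

```python
import torch
from torch import nn

# FMoETransformerMLP is exported by the fmoe package (see the fmoe/__init__.py
# diff below). The constructor arguments here are illustrative assumptions.
from fmoe import FMoETransformerMLP

d_model, d_hidden = 512, 2048

# A stand-in dense feed-forward block, as found in a typical transformer layer.
dense_mlp = nn.Sequential(
    nn.Linear(d_model, d_hidden),
    nn.GELU(),
    nn.Linear(d_hidden, d_model),
)

# The MoE replacement: same input/output width, several experts inside.
moe_mlp = FMoETransformerMLP(num_expert=4, d_model=d_model, d_hidden=d_hidden)

x = torch.randn(16, d_model)
y = moe_mlp(x)  # assumed to return a tensor shaped like its input
```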
````diff
 ### Using Fast MoE in Parallel
 
-For data parallel, nothing else is needed.
+For data parallel, no extra coding is needed.
 
 For expert parallel, in which experts are located separately across workers,
-NCCL backend is required to be built with PyTorch. Use environment variable
-`USE_NCCL=1` to `setup.py` to enable distributing experts across workers. Note
-that the arguments of the MoE layers should then be excluded from the data
-parallel parameter synchronization list.
+which requires sophiscated data-parallel strategies that neither PyTorch nor
+Megatron-LM provides. The `fmoe.DistributedGroupedDataParallel` module is
+introduced to replace PyTorch's DDP module.
````
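The rewritten parallel section above drops the old `USE_NCCL=1` build note in favour of `fmoe.DistributedGroupedDataParallel`. A minimal sketch of how that wrapper might be used in place of PyTorch's DDP, with `build_moe_model` and `train` as hypothetical placeholders:

```python
import torch.distributed as dist
from fmoe import DistributedGroupedDataParallel

# Assumed usage, mirroring torch.nn.parallel.DistributedDataParallel: wrap the
# FMoEfy-ed model so that non-expert parameters are synchronized across
# workers while each worker keeps its own experts.
dist.init_process_group(backend="nccl")

model = build_moe_model()                      # hypothetical model constructor
model = DistributedGroupedDataParallel(model)  # replaces PyTorch's DDP wrapper
train(model)                                   # hypothetical training loop
```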
examples/megatron/README.md (view file @ 59b27103)

````diff
-A modified version of Megatron-LM that can cope with FastMoE can be found in
-[this repository](https://github.com/laekov/fmoe-megatron).
+Fast MoE currently works with the `v2.0` release of
+[Megatron-LM](https://github.com/nvidia/megatron-lm).
 
-Using `fmoe.megatron.create_moe_mlp` to replace the `ParallelMLP` module in
-Megatron's transformer model is all you need.
+A [patch](moefy.patch) is used to easily enable MoE in Megatron-LM for training
+Bert.
 
-In our fork, the required modifications are located at line 425 of
-`megatron/model/transformer.py` as follow.
+The patch works in the following way.
 
-```Python
-    # MLP
-    if args.num_experts == 1:
-        self.mlp = ParallelMLP(init_method,
-            output_layer_init_method)
-    else:
-        from fmoe.megatron import create_moe_mlp
-        self.mlp = create_moe_mlp(args)
-```
+### Building the model
+
+In `pretrain_bert.py`, the `fmoe.megatron.fmoefy` function is used as an
+entrance to one-key introduce Fast MoE layer to replace the MLP layers in the
+transformer language models.
+
+```python
+from fmoe.megatron import fmoefy
+model = fmoefy(model, num_experts=4)
+```
 
-When properly added `--num-experts` argument to `megatron/arguments.py`, FastMoE
-is enabled without extra burden.
+Note that the `fmoefy` function currently only takes a standard Megatron-LM's
+top-level raw model as input, i.e. the MLP layers should be available at
+`model.language_model.transformer.layers[i].mlp`.
````
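As an aside (not part of the patch), that constraint can be checked on the raw model before calling `fmoefy`; `get_raw_megatron_model` below is a hypothetical stand-in for however the unwrapped Megatron-LM model is built:

```python
from fmoe.megatron import fmoefy

model = get_raw_megatron_model()  # hypothetical: the model before DDP/FP16 wrappers

# fmoefy expects the MLP sub-modules to live at this path on the raw model.
assert hasattr(model.language_model.transformer.layers[0], "mlp")

model = fmoefy(model, num_experts=4)
```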
````diff
+### Using expert parallellization
+
+In `megatron/training.py`, the `LocalDDP` module is replaced by the one in
+`fmoe.megatron` to enable the sophiscated data parallel strategies that can
+parallelize the experts across both the data parallel group and the (tensor)
+model parallel model group.
+
+```python
+# from megatron.model import DistributedDataParallel as LocalDDP
+from fmoe.megatron import DistributedDataParallel as LocalDDP
+```
+
+### Train as usual
+
+Start traning with Fast MoE by using the scripts provided by Megatron-LM.
````
fmoe/__init__.py (view file @ 59b27103)

````diff
@@ -3,3 +3,4 @@ The fmoe package contains MoE Layers only.
 """
 from .layers import FMoELinear, FMoENaiveGate, FMoETransformerMLP
+from .distributed import DistributedGroupedDataParallel
````
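The one-line change above re-exports the DDP replacement at the package root, so both import paths below should now refer to the same class (a quick check, assuming fmoe is installed):

```python
from fmoe import DistributedGroupedDataParallel
from fmoe.distributed import DistributedGroupedDataParallel as _direct

# The package-level name is just a re-export of the class in fmoe.distributed.
assert DistributedGroupedDataParallel is _direct
```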