OpenDAS / FastMoE

Commit 8bb58982, authored Jul 24, 2023 by Rick Ho

spell check

parent e70726a9
Showing 1 changed file with 4 additions and 4 deletions (+4, -4).

doc/parallelism/README.md
@@ -15,7 +15,7 @@ In a group of data-parallel processes, models, including the experts, are replic
 To have experts replicated, first, assign `expert_dp_comm="dp"` at `mark_parallel_comm` function of an `FMoE` instance.
 (The string `"dp"` can be replaced by another name if you wish).
 Then, wrap the MoE module with `fmoe.distributed.DistributedGroupedDataParallel`,
-nd set `dp_group` in the constructor to the process group in PyTorch that you wish to perform data parallelism.
+and set `dp_group` in the constructor to the process group in PyTorch that you wish to perform data parallelism.
 By default, the parameters are initially synchronized, unless disabled by `need_sync=False`.
 Run `model.allreduce_params` every iteration after backward propagation.
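
For reference, a minimal sketch of the data-parallel setup this hunk documents. The `mark_parallel_comm`, `DistributedGroupedDataParallel`, `dp_group`, `need_sync`, and `allreduce_params` usages follow the text above; the process-group creation, the choice of `FMoETransformerMLP`, and all sizes are illustrative assumptions, not part of the commit.

```python
# Sketch only: data parallelism with replicated experts in FastMoE.
import torch.distributed as dist
from fmoe import FMoETransformerMLP                      # assumed module choice
from fmoe.distributed import DistributedGroupedDataParallel

dist.init_process_group(backend="nccl")                  # assumed backend
dp_group = dist.new_group(ranks=list(range(dist.get_world_size())))  # assumed: all ranks

moe_layer = FMoETransformerMLP(num_expert=4, d_model=1024, d_hidden=4096)
moe_layer.mark_parallel_comm(expert_dp_comm="dp")        # replicate experts in the "dp" group

model = DistributedGroupedDataParallel(moe_layer, dp_group=dp_group)  # need_sync=True by default

# Each training iteration (schematic):
#   loss.backward()
#   model.allreduce_params()   # as documented above: run every iteration after backward
```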
@@ -25,10 +25,10 @@ Run `model.allreduce_params` every iteration after backward propagation.
 In typical model parallelism (maybe called tensor-model parallelism), every single expert is split up.
 FastMoE requires the external codebase to implement it by properly splitting the expert module that is provided to FastMoE.
-An official example using Megatron-LM can be seen in our adaptor.
+An official example using Megatron-LM can be seen in our adapter.
 The `hidden_hidden_size` of FastMoE's transformer module is divided by `k` which denotes the number of model-parallel processes.
 In this way, each expert is split into `k` pieces.
-Then, an `all-reduce` is performed over the feature matrix externally in the adaptor, so that output of the experts is merged.
+Then, an `all-reduce` is performed over the feature matrix externally in the adapter, so that output of the experts is merged.

 #### Expert Parallel (MoE Group and Slice Group)
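
A schematic of the split described in this hunk. Dividing the hidden size by `k` and performing the `all-reduce` externally follow the text above; the group creation, the use of `FMoETransformerMLP`, and the sizes are illustrative assumptions. In a real adapter (such as the Megatron-LM one mentioned above) the merge is wrapped in an autograd-aware operation; the sketch only shows the forward-time merge.

```python
# Sketch only: tensor-model parallelism by splitting each expert's hidden dimension.
import torch
import torch.distributed as dist
from fmoe import FMoETransformerMLP                      # assumed module choice

dist.init_process_group(backend="nccl")                  # assumed backend
mp_group = dist.new_group(ranks=[0, 1])                  # hypothetical model-parallel group
k = dist.get_world_size(group=mp_group)                  # number of model-parallel processes
full_hidden = 4096

# Each process builds experts holding 1/k of the hidden dimension.
moe_layer = FMoETransformerMLP(num_expert=4, d_model=1024, d_hidden=full_hidden // k)

@torch.no_grad()
def moe_forward_inference(x: torch.Tensor) -> torch.Tensor:
    y = moe_layer(x)                                     # partial output from this slice of every expert
    dist.all_reduce(y, group=mp_group)                   # merge the k partial outputs externally
    return y
```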
@@ -48,7 +48,7 @@ and perform `all-gather` after the expert-parallel NN operations to produce repl
 An MoE layer is a part of any stage.
 The external codebase shall handle the communication across stages.
 Notice that the `gate` module is replicated across all the process of the above three ways of intra-layer parallelism.
-So, for the inter-layer paralleism, users should specify `gate_group` in `DistributedGroupedDataParallel` as all processes in the same stage.
+So, for the inter-layer parallelism, users should specify `gate_group` in `DistributedGroupedDataParallel` as all processes in the same stage.

 #### Hybrid Parallel
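
A minimal sketch of combining inter-layer (pipeline) parallelism with the rule in this hunk. Passing `gate_group` to `DistributedGroupedDataParallel` follows the text above; the rank layout, the data-parallel group, and the per-stage model are hypothetical.

```python
# Sketch only: set gate_group to the process group of the current pipeline stage.
import torch.distributed as dist
from fmoe import FMoETransformerMLP                      # assumed module choice
from fmoe.distributed import DistributedGroupedDataParallel

dist.init_process_group(backend="nccl")                  # assumed backend

# Hypothetical layout: 8 ranks split into 2 pipeline stages of 4 ranks each.
# new_group must be called on every rank for every group, so build all groups first.
stage_groups = [dist.new_group(ranks=r) for r in ([0, 1, 2, 3], [4, 5, 6, 7])]
gate_group = stage_groups[dist.get_rank() // 4]          # all processes in the same stage

dp_group = dist.new_group(ranks=list(range(dist.get_world_size())))  # placeholder data-parallel group

moe_model = FMoETransformerMLP(num_expert=4, d_model=1024, d_hidden=4096)  # this stage's MoE layer
model = DistributedGroupedDataParallel(
    moe_model,
    dp_group=dp_group,
    gate_group=gate_group,                               # synchronize the gate within the stage
)
```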