Commit 8bb58982 authored by Rick Ho

spell check

parent e70726a9
@@ -15,7 +15,7 @@ In a group of data-parallel processes, models, including the experts, are replic
To have experts replicated, first, assign `expert_dp_comm="dp"` in the `mark_parallel_comm` function of an `FMoE` instance.
(The string `"dp"` can be replaced by another name if you wish).
Then, wrap the MoE module with `fmoe.distributed.DistributedGroupedDataParallel`,
and set `dp_group` in the constructor to the PyTorch process group in which you wish to perform data parallelism.
By default, the parameters are initially synchronized, unless disabled by `need_sync=False`.
Run `model.allreduce_params` every iteration after backward propagation.
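For instance, a minimal sketch of this data-parallel setup could look as follows. The process-group layout, the `FMoETransformerMLP(num_expert, d_model, d_hidden)` construction, and the training loop are illustrative assumptions; only `mark_parallel_comm`, `dp_group`, `need_sync`, and `allreduce_params` come from the description above.

```python
# Hedged sketch: assumes a launch via torchrun and that fmoe exposes
# FMoETransformerMLP(num_expert, d_model, d_hidden); adapt to your own model.
import torch
import torch.distributed as dist
from fmoe import FMoETransformerMLP
from fmoe.distributed import DistributedGroupedDataParallel

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())

# Process group used for data parallelism (here: all processes).
dp_group = dist.new_group(ranks=list(range(dist.get_world_size())))

moe = FMoETransformerMLP(num_expert=4, d_model=512, d_hidden=1024).cuda()
# Tag the expert parameters with the "dp" communication name.
moe.mark_parallel_comm(expert_dp_comm="dp")

# Parameters are synchronized at construction unless need_sync=False is passed.
model = DistributedGroupedDataParallel(moe, dp_group=dp_group)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(10):
    x = torch.randn(8, 512, device="cuda")
    loss = model(x).sum()      # placeholder forward pass and loss
    loss.backward()
    model.allreduce_params()   # all-reduce after backward, as described above
    optimizer.step()
    optimizer.zero_grad()
```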
@@ -25,10 +25,10 @@ Run `model.allreduce_params` every iteration after backward propagation.
In typical model parallelism (sometimes called tensor-model parallelism), every single expert is split up.
FastMoE requires the external codebase to implement it by properly splitting the expert module that is provided to FastMoE.
An official example using Megatron-LM can be seen in our adapter.
The `hidden_hidden_size` of FastMoE's transformer module is divided by `k`, which denotes the number of model-parallel processes.
In this way, each expert is split into `k` pieces.
Then, an `all-reduce` is performed over the feature matrix externally in the adapter, so that the output of the experts is merged.
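As a rough illustration of this split, the sketch below builds one expert slice per model-parallel process and merges the partial outputs with an `all-reduce`. The group setup, the sizes, and the `moe_forward` wrapper are assumptions for illustration, not the actual adapter code.

```python
# Hedged sketch of tensor-model parallelism for the expert MLP.
import torch
import torch.distributed as dist
from fmoe import FMoETransformerMLP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())

# Model-parallel group of k processes (here: all processes).
mp_group = dist.new_group(ranks=list(range(dist.get_world_size())))
k = dist.get_world_size(group=mp_group)

d_model, hidden_hidden_size = 1024, 4096
# Each process holds a slice of every expert: the hidden size is divided by k.
# (In a real setup, each rank must hold a *different* slice of the full weights.)
moe_slice = FMoETransformerMLP(
    num_expert=4, d_model=d_model, d_hidden=hidden_hidden_size // k
).cuda()

def moe_forward(x):
    # Each slice produces a partial result; summing the k partial results via
    # all-reduce merges them into the output of the full, unsplit expert.
    # A production adapter (e.g. for Megatron-LM) wraps the collective in an
    # autograd function so that gradients flow through it correctly.
    y = moe_slice(x)
    dist.all_reduce(y, op=dist.ReduceOp.SUM, group=mp_group)
    return y
```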
#### Expert Parallel (MoE Group and Slice Group)
@@ -48,7 +48,7 @@ and perform `all-gather` after the expert-parallel NN operations to produce repl
An MoE layer is a part of any stage.
The external codebase shall handle the communication across stages.
Notice that the `gate` module is replicated across all the processes in the above three ways of intra-layer parallelism.
So, for inter-layer parallelism, users should specify `gate_group` in `DistributedGroupedDataParallel` as all processes in the same stage.
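For instance, a hedged sketch of building per-stage groups and passing one as `gate_group` might look like the following. The two-stage layout and all names below are hypothetical; only the `gate_group` argument comes from the description above.

```python
# Hedged sketch: 8 processes split into 2 pipeline stages of 4 processes each.
import torch
import torch.distributed as dist
from fmoe import FMoETransformerMLP
from fmoe.distributed import DistributedGroupedDataParallel

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())

stage_size = 4
num_stages = dist.get_world_size() // stage_size
stage_id = dist.get_rank() // stage_size

# Every process must call new_group for every group, even ones it is not in.
stage_groups = [
    dist.new_group(ranks=list(range(s * stage_size, (s + 1) * stage_size)))
    for s in range(num_stages)
]

# A stand-in for the MoE layers owned by this pipeline stage.
stage_model = FMoETransformerMLP(num_expert=4, d_model=512, d_hidden=1024).cuda()

# Replicate the gate only across the processes of this layer's own stage;
# the remaining groups (e.g. dp_group) should follow the overall parallel layout.
model = DistributedGroupedDataParallel(
    stage_model,
    gate_group=stage_groups[stage_id],
)
```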
#### Hybrid Parallel