OpenDAS / FastMoE

Commit 8bb58982, authored Jul 24, 2023 by Rick Ho

spell check

parent e70726a9
Showing 1 changed file with 4 additions and 4 deletions (+4, -4).

doc/parallelism/README.md
@@ -15,7 +15,7 @@ In a group of data-parallel processes, models, including the experts, are replic
 To have experts replicated, first, assign `expert_dp_comm="dp"` at `mark_parallel_comm` function of an `FMoE` instance.
 (The string `"dp"` can be replaced by another name if you wish).
 Then, wrap the MoE module with `fmoe.distributed.DistributedGroupedDataParallel`,
-nd set `dp_group` in the constructor to the process group in PyTorch that you wish to perform data parallelism.
+and set `dp_group` in the constructor to the process group in PyTorch that you wish to perform data parallelism.
 By default, the parameters are initially synchronized, unless disabled by `need_sync=False`.
 Run `model.allreduce_params` every iteration after backward propagation.
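
For reference, a minimal sketch of the data-parallel setup this hunk documents. The `mark_parallel_comm`, `DistributedGroupedDataParallel`, `dp_group`, `need_sync`, and `allreduce_params` usages follow the text above; the process-group creation, the choice of `FMoETransformerMLP`, and all sizes are illustrative assumptions, not part of the commit.

```python
# Sketch only: data parallelism with replicated experts in FastMoE.
import torch.distributed as dist
from fmoe import FMoETransformerMLP                      # assumed module choice
from fmoe.distributed import DistributedGroupedDataParallel

dist.init_process_group(backend="nccl")                  # assumed backend
dp_group = dist.new_group(ranks=list(range(dist.get_world_size())))  # assumed: all ranks

moe_layer = FMoETransformerMLP(num_expert=4, d_model=1024, d_hidden=4096)
moe_layer.mark_parallel_comm(expert_dp_comm="dp")        # replicate experts in the "dp" group

model = DistributedGroupedDataParallel(moe_layer, dp_group=dp_group)  # need_sync=True by default

# Each training iteration (schematic):
#   loss.backward()
#   model.allreduce_params()   # as documented above: run every iteration after backward
```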
@@ -25,10 +25,10 @@ Run `model.allreduce_params` every iteration after backward propagation.
 In typical model parallelism (maybe called tensor-model parallelism), every single expert is split up.
 FastMoE requires the external codebase to implement it by properly splitting the expert module that is provided to FastMoE.
-An official example using Megatron-LM can be seen in our adaptor.
+An official example using Megatron-LM can be seen in our adapter.
 The `hidden_hidden_size` of FastMoE's transformer module is divided by `k` which denotes the number of model-parallel processes.
 In this way, each expert is split into `k` pieces.
-Then, an `all-reduce` is performed over the feature matrix externally in the adaptor, so that output of the experts is merged.
+Then, an `all-reduce` is performed over the feature matrix externally in the adapter, so that output of the experts is merged.

 #### Expert Parallel (MoE Group and Slice Group)
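
A schematic of the split described in this hunk. Dividing the hidden size by `k` and performing the `all-reduce` externally follow the text above; the group creation, the use of `FMoETransformerMLP`, and the sizes are illustrative assumptions. In a real adapter (such as the Megatron-LM one mentioned above) the merge is wrapped in an autograd-aware operation; the sketch only shows the forward-time merge.

```python
# Sketch only: tensor-model parallelism by splitting each expert's hidden dimension.
import torch
import torch.distributed as dist
from fmoe import FMoETransformerMLP                      # assumed module choice

dist.init_process_group(backend="nccl")                  # assumed backend
mp_group = dist.new_group(ranks=[0, 1])                  # hypothetical model-parallel group
k = dist.get_world_size(group=mp_group)                  # number of model-parallel processes
full_hidden = 4096

# Each process builds experts holding 1/k of the hidden dimension.
moe_layer = FMoETransformerMLP(num_expert=4, d_model=1024, d_hidden=full_hidden // k)

@torch.no_grad()
def moe_forward_inference(x: torch.Tensor) -> torch.Tensor:
    y = moe_layer(x)                                     # partial output from this slice of every expert
    dist.all_reduce(y, group=mp_group)                   # merge the k partial outputs externally
    return y
```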
@@ -48,7 +48,7 @@ and perform `all-gather` after the expert-parallel NN operations to produce repl
 An MoE layer is a part of any stage.
 The external codebase shall handle the communication across stages.
 Notice that the `gate` module is replicated across all the process of the above three ways of intra-layer parallelism.
-So, for the inter-layer paralleism, users should specify `gate_group` in `DistributedGroupedDataParallel` as all processes in the same stage.
+So, for the inter-layer parallelism, users should specify `gate_group` in `DistributedGroupedDataParallel` as all processes in the same stage.

 #### Hybrid Parallel
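
A minimal sketch of combining inter-layer (pipeline) parallelism with the rule in this hunk. Passing `gate_group` to `DistributedGroupedDataParallel` follows the text above; the rank layout, the data-parallel group, and the per-stage model are hypothetical.

```python
# Sketch only: set gate_group to the process group of the current pipeline stage.
import torch.distributed as dist
from fmoe import FMoETransformerMLP                      # assumed module choice
from fmoe.distributed import DistributedGroupedDataParallel

dist.init_process_group(backend="nccl")                  # assumed backend

# Hypothetical layout: 8 ranks split into 2 pipeline stages of 4 ranks each.
# new_group must be called on every rank for every group, so build all groups first.
stage_groups = [dist.new_group(ranks=r) for r in ([0, 1, 2, 3], [4, 5, 6, 7])]
gate_group = stage_groups[dist.get_rank() // 4]          # all processes in the same stage

dp_group = dist.new_group(ranks=list(range(dist.get_world_size())))  # placeholder data-parallel group

moe_model = FMoETransformerMLP(num_expert=4, d_model=1024, d_hidden=4096)  # this stage's MoE layer
model = DistributedGroupedDataParallel(
    moe_model,
    dp_group=dp_group,
    gate_group=gate_group,                               # synchronize the gate within the stage
)
```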