Commit 02341276 authored by Rick Ho

update readme and inline docstring

parent a6d202a6
......@@ -75,7 +75,9 @@ the MLP layer by the `FMoE` layers.
### Using FastMoE in Parallel
FastMoE supports both data parallel and model parallel.
FastMoE supports multiple ways of parallel training. See [a comprehensive
document on parallelism](doc/parallelism) for details. The two simplest ways of
using FastMoE in parallel are shown below.
#### Data Parallel
......@@ -83,27 +85,28 @@ In FastMoE's data parallel mode, both the gate and the experts are replicated on
The following figure shows the forward pass of a 3-expert MoE with 2-way data parallel.
<p align="center">
<img src="doc/fastmoe_data_parallel.png" width="600">
<img src="doc/parallelism/fastmoe_data_parallel.png" width="600">
</p>
For data parallel, no extra coding is needed. FastMoE works seamlessly with PyTorch's `DataParallel` or `DistributedDataParallel`.
The only drawback of data parallel is that the number of experts is constrained by each worker's memory.
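As a rough illustration (not taken verbatim from the FastMoE examples; the `FMoETransformerMLP` arguments and the torchrun-style launch details below are assumptions), a data-parallel setup might look like this:

```python
# A minimal sketch: wrapping an FMoE-based layer with PyTorch's standard
# DistributedDataParallel. Layer sizes and launch details are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from fmoe import FMoETransformerMLP  # the MoE replacement for an MLP layer

dist.init_process_group(backend="nccl")          # assumes a torchrun launch
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# Every worker holds a full replica of the gate and of all 3 experts.
moe_mlp = FMoETransformerMLP(num_expert=3, d_model=512, d_hidden=2048).cuda()
model = DDP(moe_mlp, device_ids=[local_rank])

x = torch.randn(8, 16, 512).cuda()  # (batch, seq_len, d_model)
y = model(x)                        # DDP all-reduces gradients during backward
```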
#### Model Parallel
#### Expert Parallel (also called Model Parallel in some previous versions)
In FastMoE's model parallel mode, the gate network is still replicated on each worker but
In FastMoE's expert parallel mode, the gate network is still replicated on each worker but
experts are placed separately across workers.
Thus, at the cost of additional communication, FastMoE provides an expert pool whose size is proportional to the number of workers.
The following figure shows the forward pass of a 6-expert MoE with 2-way expert parallelism. Note that experts 1-3 are located on worker 1 while experts 4-6 are located on worker 2.
<p align="center">
<img src="doc/fastmoe_model_parallel.png" width="600">
<img src="doc/parallelism/fastmoe_expert_parallel.png" width="600">
</p>
FastMoE's model parallel requires sophisticated parallel strategies that neither PyTorch nor
Megatron-LM provides. The `fmoe.DistributedGroupedDataParallel` module is
introduced to replace PyTorch's DDP module.
FastMoE's expert parallel mode requires sophisticated parallel strategies that
neither PyTorch nor Megatron-LM provided when FastMoE was created. The
`fmoe.DistributedGroupedDataParallel` module is therefore introduced to replace
PyTorch's DDP module.
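A hedged sketch of what an expert-parallel setup might look like (the exact constructor arguments and gradient-synchronization details depend on the FastMoE version; see the parallelism document):

```python
# Each worker builds only its local experts, and the model is wrapped with
# fmoe.DistributedGroupedDataParallel instead of PyTorch's DDP so that expert
# weights are not averaged across workers. Assumed API; verify against fmoe.
import os
import torch
import torch.distributed as dist
from fmoe import FMoETransformerMLP, DistributedGroupedDataParallel

dist.init_process_group(backend="nccl")          # assumes a torchrun launch
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
world_size = dist.get_world_size()

# num_expert counts the *local* experts; the global pool then holds
# num_expert * world_size experts (6 in the 2-worker figure above).
moe_mlp = FMoETransformerMLP(num_expert=3, d_model=512, d_hidden=2048,
                             world_size=world_size).cuda()
model = DistributedGroupedDataParallel(moe_mlp)

y = model(torch.randn(8, 16, 512).cuda())
# Depending on the FastMoE version, gradient synchronization for non-expert
# parameters may need to be triggered explicitly after backward.
```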
#### Faster Performance Features
......
Multi-Dimensional Parallelism Supported by FastMoE
===
A Chinese version of this document has not been written yet. Until a community contribution provides one, please use Google Translate.
......@@ -64,7 +64,8 @@ train(model, ...)
### Using FastMoE in Parallel
FastMoE supports data parallelism and model parallelism.
FastMoE supports multiple ways of parallel training. See [the detailed
documentation on parallelism](doc/parallelism) for details.
The two easiest-to-use parallel modes are briefly introduced below.
#### Data Parallel
......@@ -73,29 +74,30 @@ FastMoE supports data parallelism and model parallelism.
The following figure shows the forward pass of a 3-expert MoE model with 2-way data parallelism.
<p align="center">
<img src="fastmoe_data_parallel.png" width="600">
<img src="parallelism/fastmoe_data_parallel.png" width="600">
</p>
For data parallelism, no extra code is needed. FastMoE works seamlessly with
both PyTorch's `DataParallel` and `DistributedDataParallel` modules. The only
drawback of this mode is that the number of experts is limited by the memory of
a single compute unit (e.g., a GPU).
#### Model Parallel
#### Expert Parallel (previously called Model Parallel)
In FastMoE's model parallel mode, the gate network is still replicated on every compute unit,
In FastMoE's expert parallel mode, the gate network is still replicated on every compute unit,
while the expert networks are placed separately on different compute units.
Thus, by introducing extra communication operations, FastMoE allows more experts
to be trained at the same time, with the limit on their number growing in
proportion to the number of compute units.
The following figure shows a model with six experts being trained with 2-way model parallelism.
The following figure shows a model with six experts being trained with 2-way expert parallelism.
Note that experts 1-3 are placed on the first compute unit, while experts 4-6 are placed on the second.
<p align="center">
<img src="fastmoe_model_parallel.png" width="600">
<img src="parallelism/fastmoe_expert_parallel.png" width="600">
</p>
FastMoE's model parallel mode requires dedicated parallel strategies that neither
PyTorch nor Megatron-LM supports. Therefore, the `fmoe.DistributedGroupedDataParallel`
FastMoE's expert parallel mode requires dedicated parallel strategies that neither
PyTorch nor Megatron-LM supported (at the time FastMoE was created). Therefore,
the `fmoe.DistributedGroupedDataParallel`
module must be used in place of PyTorch's DDP module.
### How to Train Faster
......
......@@ -97,6 +97,10 @@ class FMoE(nn.Module):
the output. For each worker, FMoE only computes the output of a certain
slice of the input batch, and will all-gather the outputs after
computation.
* `mp_group` is a deprecated alias of `slice_group`.
* `moe_group` stands for the group of processes that perform expert
parallelism. The default value `None` means all processes. See the
parallelism document for more details on the groups.
* `top_k` stands for the number of experts each token is routed to.
* `gate` is a gate class, which can be found in `fmoe.gates`.
* `expert` can be specified as a module class; it is used to generate
......
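To make these arguments concrete, the following is an illustrative sketch only: the `SimpleExpert` class is made up, and the exact keyword names and expert-call convention should be checked against `fmoe/layers.py`.

```python
# Illustrative sketch: building an FMoE layer with the arguments documented
# above. SimpleExpert is a made-up two-layer MLP expert.
import torch
import torch.nn as nn
from fmoe import FMoE
from fmoe.gates import NaiveGate

class SimpleExpert(nn.Module):
    """A toy expert: d_model -> 4*d_model -> d_model."""
    def __init__(self, d_model):
        super().__init__()
        self.fc1 = nn.Linear(d_model, 4 * d_model)
        self.fc2 = nn.Linear(4 * d_model, d_model)

    def forward(self, x, expert_count=None):  # some versions pass a count arg
        return self.fc2(torch.relu(self.fc1(x)))

layer = FMoE(
    num_expert=4,         # experts held by this worker
    d_model=512,
    top_k=2,              # each token is routed to 2 experts
    gate=NaiveGate,       # a gate class from fmoe.gates
    expert=SimpleExpert,  # a module class used to generate the expert networks
    # moe_group / slice_group are left at their default None (all processes)
)
out = layer(torch.randn(10, 512))  # (num_tokens, d_model) in, same shape out
```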