For data parallel, no extra coding is needed. FastMoE works seamlessly with PyTorch's `DataParallel` or `DistributedDataParallel`.
The only drawback of data parallel is that the number of experts is constrained by each worker's memory.
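
As a minimal sketch, wrapping a model that contains a FastMoE layer with `DistributedDataParallel` is all that is required. The `FMoETransformerMLP` import path and constructor arguments below are assumptions made for illustration; consult the FastMoE API documentation for the exact names.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumed import path and constructor arguments; see the FastMoE API docs.
from fmoe.transformer import FMoETransformerMLP


class Block(torch.nn.Module):
    """A toy block whose feed-forward part is an MoE layer."""

    def __init__(self, d_model=512, d_hidden=1024, num_expert=4):
        super().__init__()
        self.norm = torch.nn.LayerNorm(d_model)
        # In data parallel mode, every worker holds all `num_expert` experts.
        self.moe = FMoETransformerMLP(
            num_expert=num_expert, d_model=d_model, d_hidden=d_hidden
        )

    def forward(self, x):
        return x + self.moe(self.norm(x))


dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = Block().cuda()
# No FastMoE-specific wrapper is needed: DDP replicates the whole module,
# experts included, on every worker.
model = DDP(model, device_ids=[local_rank])
```

Launched with `torchrun --nproc_per_node=N`, each worker then trains the full replica on its own shard of the data.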
In model parallel mode, the gate network is still replicated on each worker, while the experts are placed on different workers. Thus, by introducing additional communication cost, FastMoE enjoys a large expert pool whose size is proportional to the number of workers.
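
The following is a rough sketch of this mode: each worker constructs only its local experts and the model is wrapped with FastMoE's distributed wrapper instead of plain DDP. The import paths and the `world_size` / `num_expert` arguments are assumptions about FastMoE's distributed interface; check the documentation for the exact API.

```python
import torch
import torch.distributed as dist

# Assumed import paths; see the FastMoE API docs for the exact names.
from fmoe.transformer import FMoETransformerMLP
from fmoe.distributed import DistributedGroupedDataParallel

dist.init_process_group("nccl")
world_size = dist.get_world_size()          # e.g. 2 workers
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# `num_expert` counts experts *per worker*: with 3 local experts on each of
# 2 workers, the global pool has 6 experts, while the gate is replicated.
moe = FMoETransformerMLP(
    num_expert=3, d_model=512, d_hidden=1024, world_size=world_size
).cuda()

# FastMoE's DDP-like wrapper is meant to avoid all-reducing expert
# parameters, which differ from worker to worker, while still keeping
# shared parameters synchronized.
model = DistributedGroupedDataParallel(moe)
```

Here the per-worker `num_expert=3` together with `world_size=2` reproduces the 6-expert layout shown in the figure below.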
The following figure shows the forward pass of a 6-expert MoE with 2-way model parallel. Note that experts 1-3 are located in worker 1 while experts 4-6 are located in worker 2.
