"tests/models/mpnet/test_tokenization_mpnet.py" did not exist on "00aa9dbca29dcf0e3a624354ef5c80a8e5226339"
Unverified Commit 027074f4 authored by Stas Bekman, committed by GitHub

[doc] document MoE model approach and current solutions (#14725)

* document MoE model approach

* additional info from Samyam

* fix
parent 7cb1fdd4
@@ -55,7 +55,7 @@ Software:
- fp16/bf16 (smaller data/faster throughput)
- tf32 (faster throughput)
- Gradient checkpointing
- Sparsity
## Hardware
@@ -490,6 +490,38 @@ One of the important requirements to reach great training speed is the ability t
pytorch-nightly introduced `torch.optim._multi_tensor` which should significantly speed up the optimizers for situations with lots of small feature tensors. It should eventually become the default, but if you want to experiment with it sooner and don't mind using the bleeding edge, see: https://github.com/huggingface/transformers/issues/9965
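As a rough sketch of what experimenting with it might look like (assuming a nightly build that ships the experimental `torch.optim._multi_tensor` namespace - the import path may change before it stabilizes):

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for a real model

try:
    # experimental multi-tensor implementation from pytorch-nightly
    from torch.optim._multi_tensor import AdamW
except ImportError:
    # fall back to the standard single-tensor implementation
    from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=1e-3)
```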
### Sparsity
#### Mixture of Experts
Quite a few of the recent papers report a 4-5x training speedup and faster inference from integrating Mixture of Experts (MoE) into Transformer models.
Since it has been discovered that more parameters lead to better performance, this technique makes it possible to increase the number of parameters by an order of magnitude without increasing training costs.
In this approach every other FFN layer is replaced with an MoE layer consisting of many experts, with a gating function that trains each expert in a balanced way depending on the input token's position in the sequence.
![MoE Transformer 2x block](/imgs/perf-moe-transformer.png)
(source: [GLaM](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html))
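To make the layer structure concrete, here is a deliberately simplified, illustrative-only PyTorch sketch of an MoE feed-forward block with top-1 routing (no capacity limits or load-balancing loss; all names are made up for the example and don't come from the papers below):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # router / gating network
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten tokens for routing
        tokens = x.reshape(-1, x.size(-1))
        gate_probs = F.softmax(self.gate(tokens), dim=-1)  # (num_tokens, num_experts)
        top_prob, top_idx = gate_probs.max(dim=-1)          # pick one expert per token
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                # scale by the gate probability so the router receives gradients too
                out[mask] = expert(tokens[mask]) * top_prob[mask].unsqueeze(-1)
        return out.reshape_as(x)
```

A real implementation additionally caps each expert's token capacity, adds an auxiliary load-balancing loss and shards the experts across GPUs.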
You can find exhaustive details and comparison tables in the papers listed at the end of this section.
The main drawback of this approach is that it requires staggering amounts of GPU memory - almost an order of magnitude larger than its dense equivalent. Various distillation and other approaches have been proposed to overcome the much higher memory requirements.
There is a direct trade-off though: you can use just a few experts with a 2-3x smaller base model instead of dozens or hundreds of experts, leading to a 5x smaller model, and thus moderately increase the training speed while moderately increasing the memory requirements as well.
Most related papers and implementations are built around TensorFlow/TPUs:
- [GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding](https://arxiv.org/abs/2006.16668)
- [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961)
- [GLaM: Generalist Language Model](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html)
And for PyTorch, DeepSpeed has built one as well: [Mixture of Experts](https://www.deepspeed.ai/tutorials/mixture-of-experts/) - blog posts: [1](https://www.microsoft.com/en-us/research/blog/deepspeed-powers-8x-larger-moe-model-training-with-high-performance/), [2](https://www.microsoft.com/en-us/research/publication/scalable-and-efficient-moe-training-for-multitask-multilingual-models/) and specific deployment with large transformer-based natural language generation models: [blog post](https://www.deepspeed.ai/news/2021/12/09/deepspeed-moe-nlg.html), [Megatron-Deepspeed branch](https://github.com/microsoft/Megatron-DeepSpeed/tree/moe-training).
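As a rough outline of how plugging DeepSpeed's MoE layer into a model can look (based on the tutorial linked above; argument names and return values may differ between DeepSpeed versions, so treat this as a sketch rather than a drop-in recipe):

```python
import torch.nn as nn
from deepspeed.moe.layer import MoE

hidden_size = 1024

# the dense FFN block that each expert replicates
expert_ffn = nn.Sequential(
    nn.Linear(hidden_size, 4 * hidden_size),
    nn.GELU(),
    nn.Linear(4 * hidden_size, hidden_size),
)

# replace a Transformer block's FFN with an MoE layer of 8 experts and top-1 routing
moe_ffn = MoE(hidden_size=hidden_size, expert=expert_ffn, num_experts=8, k=1)

# inside the model's forward pass the layer returns the output plus auxiliary
# values (e.g. the load-balancing loss) in recent DeepSpeed versions:
# hidden_states, aux_loss, expert_counts = moe_ffn(hidden_states)
```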
## Contribute
This document is far from being complete and a lot more needs to be added, so if you have additions or corrections to make, please don't hesitate to open a PR, or if you aren't sure, start an Issue and we can discuss the details there.