Unverified Commit 9333bf07 authored by Maria Khalusova, committed by GitHub

[docs] Performance docs refactor p.2 (#26791)



* initial edits

* improvements for clarity and flow

* improvements for clarity and flow, removed the repeated section

* removed two docs that had no content

* Revert "removed two docs that had no content"

This reverts commit e98fa2fa0d8e171163f15cb8a04bdada1053543b.

* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* feedback addressed

* more feedback addressed

* feedback addressed

---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
parent 13ef14e1
@@ -112,6 +112,12 @@ A type of layer in a neural network where the input matrix is multiplied element
## D
### DataParallel (DP)
Parallelism technique for training on multiple GPUs where the same setup is replicated multiple times, with each instance
receiving a distinct data slice. The processing is done in parallel and all setups are synchronized at the end of each training step.
Learn more about how DataParallel works [here](perf_train_gpu_many#dataparallel-vs-distributeddataparallel).
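For illustration, here is a minimal sketch of the idea using PyTorch's `torch.nn.DataParallel` (the model and batch sizes are made up):

```python
import torch
import torch.nn as nn

# A toy model; the sizes are illustrative only.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    # Replicates the model on every visible GPU; each replica
    # receives a distinct slice of the input batch.
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# The batch of 64 samples is split across the replicas, and the
# outputs are gathered back onto the default device.
inputs = torch.randn(64, 512, device=device)
outputs = model(inputs)  # shape: (64, 10)
```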
### decoder input IDs
This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. These
@@ -340,6 +346,12 @@ A pipeline in 🤗 Transformers is an abstraction referring to a series of steps
For more details, see [Pipelines for inference](https://huggingface.co/docs/transformers/pipeline_tutorial).
### PipelineParallel (PP)
Parallelism technique in which the model is split vertically (at the layer level) across multiple GPUs, so that only one or
several layers of the model are placed on a single GPU. Each GPU processes a different stage of the pipeline in parallel,
working on a small chunk of the batch. Learn more about how PipelineParallel works [here](perf_train_gpu_many#from-naive-model-parallelism-to-pipeline-parallelism).
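As a rough sketch of the concept (assuming two GPUs named `cuda:0` and `cuda:1`; the layer split and micro-batch size are made up), the stages live on different devices and micro-batches flow through them:

```python
import torch
import torch.nn as nn

# Vertical split: the first layers on one GPU, the rest on another.
stage_1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
stage_2 = nn.Linear(512, 10).to("cuda:1")

batch = torch.randn(64, 512)
outputs = []
# Feed small micro-batches so that a real pipeline engine could keep
# both GPUs busy on different chunks at the same time.
for micro_batch in batch.split(16):
    hidden = stage_1(micro_batch.to("cuda:0"))
    outputs.append(stage_2(hidden.to("cuda:1")))
outputs = torch.cat(outputs)  # shape: (64, 10)
```

Note that this naive loop still runs the stages one after another; actual pipeline-parallel engines overlap the stages across micro-batches.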
### pixel values
A tensor of the numerical representations of an image that is passed to a model. The pixel values have a shape of [`batch_size`, `num_channels`, `height`, `width`], and are generated from an image processor.
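For example (a sketch using a blank dummy image; the checkpoint name is only an example), an image processor produces `pixel_values` with exactly that shape:

```python
from PIL import Image
from transformers import AutoImageProcessor

# A blank 224x224 RGB image stands in for real input data.
image = Image.new("RGB", (224, 224))

image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
inputs = image_processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])
```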
@@ -410,6 +422,10 @@ An example of a semi-supervised learning approach is "self-training", in which a
Models that generate a new sequence from an input, like translation models, or summarization models (such as
[Bart](model_doc/bart) or [T5](model_doc/t5)).
### Sharded DDP
Another name for the foundational [ZeRO](#zero-redundancy-optimizer--zero-) concept as used by various other implementations of ZeRO.
### stride
In [convolution](#convolution) or [pooling](#pooling), the stride refers to the distance the kernel is moved over a matrix. A stride of 1 means the kernel is moved one pixel over at a time, and a stride of 2 means the kernel is moved two pixels over at a time.
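A quick sketch of the effect (the tensor and kernel sizes are made up): with padding held fixed, doubling the stride of a convolution roughly halves each spatial dimension of the output.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # one 3-channel 32x32 image

conv_stride_1 = nn.Conv2d(3, 8, kernel_size=3, stride=1, padding=1)
conv_stride_2 = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)

print(conv_stride_1(x).shape)  # torch.Size([1, 8, 32, 32]); kernel moves 1 pixel at a time
print(conv_stride_2(x).shape)  # torch.Size([1, 8, 16, 16]); kernel moves 2 pixels at a time
```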
@@ -420,6 +436,14 @@ A form of model training that directly uses labeled data to correct and instruct
## T
### Tensor Parallelism (TP)
Parallelism technique for training on multiple GPUs in which each tensor is split up into multiple chunks, so instead of
having the whole tensor reside on a single GPU, each shard of the tensor resides on its designated GPU. Shards get
processed separately and in parallel on different GPUs, and the results are synced at the end of the processing step.
This is sometimes called horizontal parallelism, as the splitting happens at a horizontal level.
Learn more about Tensor Parallelism [here](perf_train_gpu_many#tensor-parallelism).
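A toy sketch of the core idea on CPU (the tensor sizes are made up): splitting a weight matrix column-wise, computing each shard's matmul independently, and syncing the partial results reproduces the unsharded computation.

```python
import torch

x = torch.randn(4, 8)       # a batch of activations
weight = torch.randn(8, 6)  # the full weight tensor

# Split the weight column-wise into two shards; in real tensor
# parallelism each shard would live on its own GPU.
shard_a, shard_b = weight.chunk(2, dim=1)

# Each "GPU" computes its partial result separately and in parallel...
partial_a = x @ shard_a
partial_b = x @ shard_b

# ...and the results are synced (here, concatenated) afterwards.
combined = torch.cat([partial_a, partial_b], dim=1)
assert torch.allclose(combined, x @ weight)
```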
### token
A part of a sentence, usually a word, but can also be a subword (non-common words are often split in subwords) or a
@@ -489,3 +513,12 @@ Self-attention based deep learning model architecture.
### unsupervised learning
A form of model training in which data provided to the model is not labeled. Unsupervised learning techniques leverage statistical information of the data distribution to find patterns useful for the task at hand.
## Z
### Zero Redundancy Optimizer (ZeRO)
Parallelism technique which performs sharding of the tensors somewhat similarly to [TensorParallel](#tensorparallel--tp-),
except the whole tensor gets reconstructed in time for a forward or backward computation, so the model doesn't need
to be modified. This method also supports various offloading techniques to compensate for limited GPU memory.
Learn more about ZeRO [here](perf_train_gpu_many#zero-data-parallelism).
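As a sketch of how this might look in practice (assuming DeepSpeed as the ZeRO implementation; the values are illustrative, not recommendations), a configuration enables sharding and CPU offloading:

```python
# Illustrative DeepSpeed configuration: ZeRO stage 2 shards the
# optimizer states and gradients across GPUs, and offloading moves
# the optimizer states to CPU memory when GPU memory is tight.
ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
}

# One possible way to use it: pass the dict (or a path to an
# equivalent JSON file) to TrainingArguments(deepspeed=ds_config).
```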