Unverified Commit 9333bf07 authored by Maria Khalusova, committed by GitHub

[docs] Performance docs refactor p.2 (#26791)



* initial edits

* improvements for clarity and flow

* improvements for clarity and flow, removed the repeated section

* removed two docs that had no content

* Revert "removed two docs that had no content"

This reverts commit e98fa2fa0d8e171163f15cb8a04bdada1053543b.

* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* feedback addressed

* more feedback addressed

* feedback addressed

---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
parent 13ef14e1
@@ -112,6 +112,12 @@ A type of layer in a neural network where the input matrix is multiplied element
## D
### DataParallel (DP)
Parallelism technique for training on multiple GPUs where the same setup is replicated multiple times, with each instance
receiving a distinct data slice. The processing is done in parallel and all setups are synchronized at the end of each training step.
Learn more about how DataParallel works [here](perf_train_gpu_many#dataparallel-vs-distributeddataparallel).
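For illustration, here is a minimal sketch of the idea using PyTorch's `torch.nn.DataParallel` (the model and batch sizes are made up):

```python
import torch
import torch.nn as nn

# A toy model; the sizes are illustrative only.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    # Replicates the model on every visible GPU; each replica
    # receives a distinct slice of the input batch.
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# The batch of 64 samples is split across the replicas, and the
# outputs are gathered back onto the default device.
inputs = torch.randn(64, 512, device=device)
outputs = model(inputs)  # shape: (64, 10)
```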
### decoder input IDs
This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. These
@@ -340,6 +346,12 @@ A pipeline in 🤗 Transformers is an abstraction referring to a series of steps
For more details, see [Pipelines for inference](https://huggingface.co/docs/transformers/pipeline_tutorial).
### PipelineParallel (PP)
Parallelism technique in which the model is split vertically (at the layer level) across multiple GPUs, so that only one or
several layers of the model are placed on a single GPU. Each GPU processes a different stage of the pipeline in parallel,
working on a small chunk of the batch. Learn more about how PipelineParallel works [here](perf_train_gpu_many#from-naive-model-parallelism-to-pipeline-parallelism).
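As a rough sketch of the concept (assuming two GPUs named `cuda:0` and `cuda:1`; the layer split and micro-batch size are made up), the stages live on different devices and micro-batches flow through them:

```python
import torch
import torch.nn as nn

# Vertical split: the first layers on one GPU, the rest on another.
stage_1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
stage_2 = nn.Linear(512, 10).to("cuda:1")

batch = torch.randn(64, 512)
outputs = []
# Feed small micro-batches so that a real pipeline engine could keep
# both GPUs busy on different chunks at the same time.
for micro_batch in batch.split(16):
    hidden = stage_1(micro_batch.to("cuda:0"))
    outputs.append(stage_2(hidden.to("cuda:1")))
outputs = torch.cat(outputs)  # shape: (64, 10)
```

Note that this naive loop still runs the stages one after another; actual pipeline-parallel engines overlap the stages across micro-batches.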
### pixel values
A tensor of the numerical representations of an image that is passed to a model. The pixel values have a shape of [`batch_size`, `num_channels`, `height`, `width`], and are generated from an image processor.
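For example (a sketch using a blank dummy image; the checkpoint name is only an example), an image processor produces `pixel_values` with exactly that shape:

```python
from PIL import Image
from transformers import AutoImageProcessor

# A blank 224x224 RGB image stands in for real input data.
image = Image.new("RGB", (224, 224))

image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
inputs = image_processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])
```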
@@ -410,6 +422,10 @@ An example of a semi-supervised learning approach is "self-training", in which a
Models that generate a new sequence from an input, like translation models, or summarization models (such as
[Bart](model_doc/bart) or [T5](model_doc/t5)).
### Sharded DDP
Another name for the foundational [ZeRO](#zero-redundancy-optimizer--zero-) concept as used by various other implementations of ZeRO.
### stride
In [convolution](#convolution) or [pooling](#pooling), the stride refers to the distance the kernel is moved over a matrix. A stride of 1 means the kernel is moved one pixel over at a time, and a stride of 2 means the kernel is moved two pixels over at a time.
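A quick sketch of the effect (the tensor and kernel sizes are made up): with padding held fixed, doubling the stride of a convolution roughly halves each spatial dimension of the output.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # one 3-channel 32x32 image

conv_stride_1 = nn.Conv2d(3, 8, kernel_size=3, stride=1, padding=1)
conv_stride_2 = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)

print(conv_stride_1(x).shape)  # torch.Size([1, 8, 32, 32]); kernel moves 1 pixel at a time
print(conv_stride_2(x).shape)  # torch.Size([1, 8, 16, 16]); kernel moves 2 pixels at a time
```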
@@ -420,6 +436,14 @@ A form of model training that directly uses labeled data to correct and instruct
## T
### Tensor Parallelism (TP)
Parallelism technique for training on multiple GPUs in which each tensor is split up into multiple chunks, so instead of
having the whole tensor reside on a single GPU, each shard of the tensor resides on its designated GPU. Shards get
processed separately and in parallel on different GPUs, and the results are synced at the end of the processing step.
This is sometimes called horizontal parallelism, as the splitting happens at a horizontal level.
Learn more about Tensor Parallelism [here](perf_train_gpu_many#tensor-parallelism).
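A toy sketch of the core idea on CPU (the tensor sizes are made up): splitting a weight matrix column-wise, computing each shard's matmul independently, and syncing the partial results reproduces the unsharded computation.

```python
import torch

x = torch.randn(4, 8)       # a batch of activations
weight = torch.randn(8, 6)  # the full weight tensor

# Split the weight column-wise into two shards; in real tensor
# parallelism each shard would live on its own GPU.
shard_a, shard_b = weight.chunk(2, dim=1)

# Each "GPU" computes its partial result separately and in parallel...
partial_a = x @ shard_a
partial_b = x @ shard_b

# ...and the results are synced (here, concatenated) afterwards.
combined = torch.cat([partial_a, partial_b], dim=1)
assert torch.allclose(combined, x @ weight)
```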
### token
A part of a sentence, usually a word, but can also be a subword (non-common words are often split in subwords) or a
@@ -489,3 +513,12 @@ Self-attention based deep learning model architecture.
### unsupervised learning
A form of model training in which data provided to the model is not labeled. Unsupervised learning techniques leverage statistical information of the data distribution to find patterns useful for the task at hand.
## Z
### Zero Redundancy Optimizer (ZeRO)
Parallelism technique which performs sharding of the tensors somewhat similarly to [TensorParallel](#tensorparallel--tp-),
except the whole tensor gets reconstructed in time for a forward or backward computation, so the model doesn't need
to be modified. This method also supports various offloading techniques to compensate for limited GPU memory.
Learn more about ZeRO [here](perf_train_gpu_many#zero-data-parallelism).
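As a sketch of how this might look in practice (assuming DeepSpeed as the ZeRO implementation; the values are illustrative, not recommendations), a configuration enables sharding and CPU offloading:

```python
# Illustrative DeepSpeed configuration: ZeRO stage 2 shards the
# optimizer states and gradients across GPUs, and offloading moves
# the optimizer states to CPU memory when GPU memory is tight.
ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
}

# One possible way to use it: pass the dict (or a path to an
# equivalent JSON file) to TrainingArguments(deepspeed=ds_config).
```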