[docs] Performance docs tidy up, part 1 (#23963)

* first pass at the single gpu doc
* overview: improved clarity and navigation
* WIP
* updated intro and deepspeed sections
* improved torch.compile section
* more improvements
* minor improvements
* make style
* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* feedback addressed
* mdx -> md
* link fix
* feedback addressed

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

Unverified Commit 75317aef authored Jul 24, 2023 by Maria Khalusova Committed by GitHub Jul 24, 2023
4 changed files
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -111,36 +111,40 @@
 - sections:
    - local: performance
      title: Overview
-    - local: perf_train_gpu_one
-      title: Training on one GPU
-    - local: perf_train_gpu_many
-      title: Training on many GPUs
-    - local: perf_train_cpu
-      title: Training on CPU
-    - local: perf_train_cpu_many
-      title: Training on many CPUs
-    - local: perf_train_tpu
-      title: Training on TPUs
-    - local: perf_train_tpu_tf
-      title: Training on TPU with TensorFlow
-    - local: perf_train_special
-      title: Training on Specialized Hardware
-    - local: perf_infer_cpu
-      title: Inference on CPU
-    - local: perf_infer_gpu_one
-      title: Inference on one GPU
-    - local: perf_infer_gpu_many
-      title: Inference on many GPUs
-    - local: perf_infer_special
-      title: Inference on Specialized Hardware
-    - local: perf_hardware
-      title: Custom hardware for training
+    - sections:
+        - local: perf_train_gpu_one
+          title: Methods and tools for efficient training on a single GPU
+        - local: perf_train_gpu_many
+          title: Multiple GPUs and parallelism
+        - local: perf_train_cpu
+          title: Efficient training on CPU
+        - local: perf_train_cpu_many
+          title: Distributed CPU training
+        - local: perf_train_tpu
+          title: Training on TPUs
+        - local: perf_train_tpu_tf
+          title: Training on TPU with TensorFlow
+        - local: perf_train_special
+          title: Training on Specialized Hardware
+        - local: perf_hardware
+          title: Custom hardware for training
+        - local: hpo_train
+          title: Hyperparameter Search using Trainer API
+      title: Efficient training techniques
+    - sections:
+        - local: perf_infer_cpu
+          title: Inference on CPU
+        - local: perf_infer_gpu_one
+          title: Inference on one GPU
+        - local: perf_infer_gpu_many
+          title: Inference on many GPUs
+        - local: perf_infer_special
+          title: Inference on Specialized Hardware
+      title: Optimizing inference
    - local: big_models
      title: Instantiating a big model
    - local: debugging
-      title: Debugging
-    - local: hpo_train
-      title: Hyperparameter Search using Trainer API
+      title: Troubleshooting
    - local: tf_xla
      title: XLA Integration for TensorFlow Models
  title: Performance and scalability
@@ -182,6 +186,8 @@
    title: Perplexity of fixed-length models
  - local: pipeline_webserver
    title: Pipelines for webserver inference
+  - local: model_memory_anatomy
+    title: Model training anatomy
  title: Conceptual guides
 - sections:
  - sections:

--- a/docs/source/en/model_memory_anatomy.md
+++ b/docs/source/en/model_memory_anatomy.md
+<!---
+Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Model training anatomy
+
+To understand the performance optimization techniques that improve training speed and memory utilization, 
+it helps to know how the GPU is utilized during training and how compute intensity varies with the operation 
+being performed.
+
+Let's start by exploring a motivating example of GPU utilization and the training run of a model. For the demonstration, 
+we'll need to install a few libraries: 
+
+```bash
+pip install transformers datasets accelerate nvidia-ml-py3
+```
+
+The `nvidia-ml-py3` library allows us to monitor the memory usage of the models from within Python. You might be familiar 
+with the `nvidia-smi` command in the terminal. This library gives you access to the same information directly in Python.
+
+Then, we create some dummy data: random token IDs between 100 and 30000 and binary labels for a classifier. 
+In total, we get 512 sequences each with length 512 and store them in a [`~datasets.Dataset`] with PyTorch format.
+
+
+```py
+>>> import numpy as np
+>>> from datasets import Dataset
+
+
+>>> seq_len, dataset_size = 512, 512
+>>> dummy_data = {
+...     "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
+...     "labels": np.random.randint(0, 2, (dataset_size)),  # upper bound is exclusive, so this yields 0 or 1
+... }
+>>> ds = Dataset.from_dict(dummy_data)
+>>> ds.set_format("pt")
+```
+
+To print summary statistics for the GPU utilization and the training run with the [`Trainer`], we define two helper functions:
+
+```py
+>>> from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlInit
+
+
+>>> def print_gpu_utilization():
+...     nvmlInit()
+...     handle = nvmlDeviceGetHandleByIndex(0)
+...     info = nvmlDeviceGetMemoryInfo(handle)
+...     print(f"GPU memory occupied: {info.used//1024**2} MB.")
+
+
+>>> def print_summary(result):
+...     print(f"Time: {result.metrics['train_runtime']:.2f}")
+...     print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
+...     print_gpu_utilization()
+```
+
+Let's verify that we start with free GPU memory:
+
+```py
+>>> print_gpu_utilization()
+GPU memory occupied: 0 MB.
+```
+
+That looks good: the GPU memory is not occupied, as we would expect before we load any models. If that's not the case on 
+your machine, make sure to stop all processes that are using GPU memory. However, not all free GPU memory can be used by 
+the user. When a model is loaded to the GPU, the kernels are loaded as well, which can take up 1-2GB of memory. To see how 
+much that is, we load a tiny tensor onto the GPU, which triggers the kernels to be loaded too.
+
+```py
+>>> import torch
+
+
+>>> torch.ones((1, 1)).to("cuda")
+>>> print_gpu_utilization()
+GPU memory occupied: 1343 MB.
+```
+
+We see that the kernels alone take up 1.3GB of GPU memory. Now let's see how much space the model uses.
+
+## Load Model
+
+First, we load the `bert-large-uncased` model. We load the model weights directly to the GPU so that we can check 
+how much space just the weights use.
+
+
+```py
+>>> from transformers import AutoModelForSequenceClassification
+
+
+>>> model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased").to("cuda")
+>>> print_gpu_utilization()
+GPU memory occupied: 2631 MB.
+```
+
+We can see that the model weights alone take up 1.3 GB of the GPU memory (2631 MB minus the 1343 MB already used by 
+the kernels). The exact number depends on the specific GPU you are using. Note that on newer GPUs a model can sometimes 
+take up more space since the weights are loaded in an optimized fashion that speeds up the usage of the model. Now we 
+can also quickly check if we get the same result as with the `nvidia-smi` CLI:
+
+
+```bash
+nvidia-smi
+```
+
+```bash
+Tue Jan 11 08:58:05 2022
++-----------------------------------------------------------------------------+
+| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
+|-------------------------------+----------------------+----------------------+
+| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
+|                               |                      |               MIG M. |
+|===============================+======================+======================|
+|   0  Tesla V100-SXM2...  On   | 00000000:00:04.0 Off |                    0 |
+| N/A   37C    P0    39W / 300W |   2631MiB / 16160MiB |      0%      Default |
+|                               |                      |                  N/A |
++-------------------------------+----------------------+----------------------+
+
++-----------------------------------------------------------------------------+
+| Processes:                                                                  |
+|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
+|        ID   ID                                                   Usage      |
+|=============================================================================|
+|    0   N/A  N/A      3721      C   ...nvs/codeparrot/bin/python     2629MiB |
++-----------------------------------------------------------------------------+
+```
+
+We get the same number as before and you can also see that we are using a V100 GPU with 16GB of memory. So now we can 
+start training the model and see how the GPU memory consumption changes. First, we set up a few standard training 
+arguments:
+
+```py
+default_args = {
+    "output_dir": "tmp",
+    "evaluation_strategy": "steps",
+    "num_train_epochs": 1,
+    "log_level": "error",
+    "report_to": "none",
+}
+```
+
+<Tip>
+
+ If you plan to run multiple experiments, restart the Python kernel between them so that the GPU memory is properly 
+ cleared.
+
+</Tip>
+
+## Memory utilization at vanilla training
+
+Let's use the [`Trainer`] to train the model with a batch size of 4 and without any GPU performance optimization techniques:
+
+```py
+>>> from transformers import TrainingArguments, Trainer, logging
+
+>>> logging.set_verbosity_error()
+
+
+>>> training_args = TrainingArguments(per_device_train_batch_size=4, **default_args)
+>>> trainer = Trainer(model=model, args=training_args, train_dataset=ds)
+>>> result = trainer.train()
+>>> print_summary(result)
+```
+
+```
+Time: 57.82
+Samples/second: 8.86
+GPU memory occupied: 14949 MB.
+```
+
+We see that even a relatively small batch size almost fills up our GPU's entire memory. However, a larger batch size 
+can often result in faster model convergence or better end performance. So ideally we want to tune the batch size to our
+model's needs and not to the GPU limitations. What's interesting is that we use much more memory than the size of the model. 
+To understand why this is the case, let's have a look at a model's operations and memory needs.
+
+## Anatomy of Model's Operations
+
+The Transformer architecture includes 3 main groups of operations, listed below from most to least compute-intensive.
+
+1. **Tensor Contractions**
+
+    Linear layers and components of Multi-Head Attention all do batched **matrix-matrix multiplications**. These operations are the most compute-intensive part of training a transformer.
+
+2. **Statistical Normalizations**
+
+    Softmax and layer normalization are less compute-intensive than tensor contractions, and involve one or more **reduction operations**, the result of which is then applied via a map.
+
+3. **Element-wise Operators**
+
+    These are the remaining operators: **biases, dropout, activations, and residual connections**. These are the least compute-intensive operations.
+
+Knowing these groups is helpful when analyzing performance bottlenecks.
+
+This summary is derived from [Data Movement Is All You Need: A Case Study on Optimizing Transformers (2020)](https://arxiv.org/abs/2007.00072).
+
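One way to make the ranking above concrete is to compare floating-point operations per byte of memory traffic for a matrix multiplication and for an element-wise addition. The following is a rough sketch only: the function names, the fp32 assumption (4 bytes per element), the simplified traffic model (each tensor read or written exactly once), and the example shapes are illustrative assumptions, not figures from the paper.

```python
BYTES_PER_ELEMENT = 4  # assumption: fp32 tensors


def matmul_intensity(m, k, n):
    """FLOPs per byte for an (m, k) @ (k, n) matrix multiplication."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = BYTES_PER_ELEMENT * (m * k + k * n + m * n)  # read A and B, write C
    return flops / bytes_moved


def elementwise_add_intensity(num_elements):
    """FLOPs per byte for adding two tensors element by element."""
    flops = num_elements  # one add per element
    bytes_moved = BYTES_PER_ELEMENT * 3 * num_elements  # read two inputs, write one output
    return flops / bytes_moved


# Hypothetical shapes: 2048 tokens (batch 4 x sequence 512) through a 1024 -> 4096 linear layer.
linear = matmul_intensity(2048, 1024, 4096)
residual = elementwise_add_intensity(2048 * 1024)
print(f"linear layer: {linear:.1f} FLOPs/byte")
print(f"residual add: {residual:.3f} FLOPs/byte")
```

Under these assumptions the matrix multiplication performs hundreds of operations per byte moved, while the element-wise addition performs a fraction of one. This matches the ranking above: tensor contractions are limited mostly by compute, and element-wise operators mostly by memory bandwidth.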
+
+## Anatomy of Model's Memory
+
+We've seen that training the model uses much more memory than just putting the model on the GPU. This is because there 
+are many components during training that use GPU memory. The components on GPU memory are the following:
+
+1. model weights
+2. optimizer states
+3. gradients
+4. forward activations saved for gradient computation
+5. temporary buffers
+6. functionality-specific memory
+
+A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter (6 for weights, 8 for 
+optimizer states, and 4 for gradients) plus activation memory. For inference there are no optimizer states and 
+gradients, so we can subtract those. We end up with 6 bytes per model parameter for mixed precision inference, plus 
+activation memory.
+
+Let's look at the details.
+
+**Model Weights:**
+
+- 4 bytes * number of parameters for fp32 training
+- 6 bytes * number of parameters for mixed precision training (maintains a model in fp32 and one in fp16 in memory)
+
+**Optimizer States:**
+
+- 8 bytes * number of parameters for normal AdamW (maintains 2 states)
+- 2 bytes * number of parameters for 8-bit AdamW optimizers like [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
+- 4 bytes * number of parameters for optimizers like SGD with momentum (maintains only 1 state)
+
+**Gradients**
+
+- 4 bytes * number of parameters for either fp32 or mixed precision training (gradients are always kept in fp32)
+
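The per-parameter figures above can be combined into a rough estimate of the memory needed for weights, optimizer states, and gradients. This is a minimal sketch: the helper names are hypothetical, and the parameter count of roughly 336 million for `bert-large-uncased` is an approximation assumed for illustration.

```python
def bytes_per_parameter(mixed_precision=True, optimizer_bytes=8):
    """Bytes of GPU memory per parameter for weights, optimizer states, and gradients."""
    weights = 6 if mixed_precision else 4  # fp32 copy plus fp16 copy, or fp32 only
    gradients = 4  # kept in fp32 in both modes
    return weights + optimizer_bytes + gradients


def estimate_gib(num_parameters, **kwargs):
    """Estimated memory in GiB, excluding activations and temporary buffers."""
    return num_parameters * bytes_per_parameter(**kwargs) / 1024**3


# Assumption: roughly 336 million parameters, the approximate size of bert-large-uncased.
NUM_PARAMETERS = 336_000_000

print(bytes_per_parameter())                       # mixed precision with AdamW: 18
print(bytes_per_parameter(mixed_precision=False))  # fp32 with AdamW: 16
print(bytes_per_parameter(optimizer_bytes=2))      # mixed precision with 8-bit AdamW: 12
print(f"{estimate_gib(NUM_PARAMETERS, mixed_precision=False):.1f} GiB")  # fp32 with AdamW
```

For the fp32 run shown earlier, this estimate comes to about 5 GiB. That is well below the roughly 15 GB observed, which suggests that activations, temporary buffers, and the CUDA kernels account for the rest.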
+**Forward Activations**
+
+- size depends on many factors, the key ones being sequence length, hidden size and batch size.
+
+This includes the inputs and outputs passed to and returned by the forward and backward functions, as well as the 
+forward activations saved for gradient computation.
+
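To illustrate how batch size, sequence length, and hidden size drive activation memory, the sketch below computes the size of two individual activation tensors. The helper name is hypothetical. The batch size and sequence length are taken from the example run, a hidden size of 1024 with 16 attention heads is the bert-large configuration, and fp32 storage (4 bytes per element) is assumed.

```python
def tensor_mib(*shape, bytes_per_element=4):
    """Size in MiB of a dense tensor with the given shape (fp32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / 1024**2


# Batch size and sequence length from the example run; hidden size and head count
# from the bert-large configuration.
batch, seq_len, hidden, heads = 4, 512, 1024, 16

hidden_states = tensor_mib(batch, seq_len, hidden)             # output of one layer
attention_scores = tensor_mib(batch, heads, seq_len, seq_len)  # one attention map
print(f"hidden states:    {hidden_states} MiB")
print(f"attention scores: {attention_scores} MiB")
```

A single tensor is small, but many such tensors are saved in every layer. Hidden states grow linearly with sequence length, while attention scores grow quadratically.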
+**Temporary Memory**
+
+Additionally, there are all kinds of temporary variables that get released once the calculation is done, but in the 
+meantime they can require additional memory and push the process into an out-of-memory (OOM) error. Therefore, when 
+coding it's crucial to think strategically about such temporary variables and sometimes to explicitly free them as 
+soon as they are no longer needed.
+
+**Functionality-specific memory**
+
+Then, your software could have special memory needs. For example, when generating text using beam search, the software 
+needs to maintain multiple copies of inputs and outputs.
+
+**`forward` vs `backward` Execution Speed**
+
+For convolutions and linear layers there are 2x flops in the backward compared to the forward, which generally translates 
+into ~2x slower (sometimes more, because sizes in the backward tend to be more awkward). Activations are usually 
+bandwidth-limited, and it’s typical for an activation to have to read more data in the backward than in the forward 
+(e.g. activation forward reads once, writes once, activation backward reads twice, gradOutput and output of the forward, 
+and writes once, gradInput).
+
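The 2x figure for linear layers can be checked by counting operations. The sketch below is a simplified count that ignores the bias term, and the function name and example shapes are hypothetical. The backward pass needs one matrix multiplication for the gradient with respect to the input and another for the gradient with respect to the weights, each as large as the forward multiplication.

```python
def linear_flops(batch_tokens, in_features, out_features):
    """Approximate FLOPs for the forward and backward pass of a linear layer (bias ignored)."""
    forward = 2 * batch_tokens * in_features * out_features
    grad_input = 2 * batch_tokens * out_features * in_features   # grad_output @ W
    grad_weight = 2 * in_features * batch_tokens * out_features  # x^T @ grad_output
    return forward, grad_input + grad_weight


forward, backward = linear_flops(2048, 1024, 4096)
print(backward / forward)              # 2.0
print((forward + backward) / forward)  # 3.0
```

By this count, a full training step through a linear layer costs about three times the operations of the forward pass alone.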
+As you can see, there are potentially a few places where we could save GPU memory or speed up operations. 
+Now that you understand what affects GPU utilization and computation speed, refer to 
+the [Methods and tools for efficient training on a single GPU](perf_train_gpu_one) documentation page to learn about 
+performance optimization techniques. 
--- a/docs/source/en/perf_train_gpu_one.md
+++ b/docs/source/en/perf_train_gpu_one.md
--- a/docs/source/en/performance.md
+++ b/docs/source/en/performance.md
@@ -20,77 +20,54 @@ rendered properly in your Markdown viewer.

 # Performance and Scalability

-Training larger and larger transformer models and deploying them to production comes with a range of challenges. During training your model can require more GPU memory than is available or be very slow to train and when you deploy it for inference it can be overwhelmed with the throughput that is required in the production environment. This documentation is designed to help you navigate these challenges and find the best setting for your use-case. We split the guides into training and inference as they come with different challenges and solutions. Then within each of them we have separate guides for different kinds of hardware setting (e.g. single vs. multi-GPU for training or CPU vs. GPU for infrence).
+Training large transformer models and deploying them to production present various challenges.  
+During training, the model may require more GPU memory than available or exhibit slow training speed. In the deployment 
+phase, the model can struggle to handle the required throughput in a production environment.

-![perf_overview](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perf_overview.png)
+This documentation aims to assist you in overcoming these challenges and finding the optimal setting for your use-case. 
+The guides are divided into training and inference sections, as each comes with different challenges and solutions. 
+Within each section you'll find separate guides for different hardware configurations, such as single GPU vs. multi-GPU 
+for training or CPU vs. GPU for inference.

-This document serves as an overview and entry point for the methods that could be useful for your scenario.
+Use this document as your starting point to navigate further to the methods that match your scenario.

 ## Training

-Training transformer models efficiently requires an accelerator such as a GPU or TPU. The most common case is where you only have a single GPU, but there is also a section about multi-GPU and CPU training (with more coming soon).
+Training large transformer models efficiently requires an accelerator such as a GPU or TPU. The most common case is where 
+you have a single GPU. The methods that you can apply to improve training efficiency on a single GPU extend to other setups 
+such as multiple GPUs. However, there are also techniques that are specific to multi-GPU or CPU training. We cover them in 
+separate sections.

-<Tip>
-
- Note: Most of the strategies introduced in the single GPU sections (such as mixed precision training or gradient accumulation) are generic and apply to training models in general so make sure to have a look at it before diving into the following sections such as multi-GPU or CPU training.
-
-</Tip>
-
-### Single GPU
-
-Training large models on a single GPU can be challenging but there are a number of tools and methods that make it feasible. In this section methods such as mixed precision training, gradient accumulation and checkpointing, efficient optimizers, as well as strategies to determine the best batch size are discussed.
-
-[Go to single GPU training section](perf_train_gpu_one)
-
-### Multi-GPU
-
-In some cases training on a single GPU is still too slow or won't fit the large model. Moving to a multi-GPU setup is the logical step, but training on multiple GPUs at once comes with new decisions: does each GPU have a full copy of the model or is the model itself also distributed? In this section we look at data, tensor, and pipeline parallism.
-
-[Go to multi-GPU training section](perf_train_gpu_many)
-
-### CPU
-
-
-[Go to CPU training section](perf_train_cpu)
-
-
-### TPU
-
-[_Coming soon_](perf_train_tpu)
-
-### Specialized Hardware
-
-[_Coming soon_](perf_train_special)
+* [Methods and tools for efficient training on a single GPU](perf_train_gpu_one): start here to learn common approaches that can help optimize GPU memory utilization, speed up the training, or both. 
+* [Multi-GPU training section](perf_train_gpu_many): explore this section to learn about further optimization methods that apply to multi-GPU settings, such as data, tensor, and pipeline parallelism.
+* [CPU training section](perf_train_cpu): learn about mixed precision training on CPU.
+* [Efficient Training on Multiple CPUs](perf_train_cpu_many): learn about distributed CPU training.
+* [Training on TPU with TensorFlow](perf_train_tpu_tf): if you are new to TPUs, refer to this section for an opinionated introduction to training on TPUs and using XLA. 
+* [Custom hardware for training](perf_hardware): find tips and tricks when building your own deep learning rig.
+* [Hyperparameter Search using Trainer API](hpo_train)

 ## Inference

-Efficient inference with large models in a production environment can be as challenging as training them. In the following sections we go through the steps to run inference on CPU and single/multi-GPU setups.
-
-### CPU
-
-[Go to CPU inference section](perf_infer_cpu)
-
-### Single GPU
-
-[Go to single GPU inference section](perf_infer_gpu_one)
-
-### Multi-GPU
-
-[Go to multi-GPU inference section](perf_infer_gpu_many)
-
-### Specialized Hardware
+Efficient inference with large models in a production environment can be as challenging as training them. In the following 
+sections we go through the steps to run inference on CPU and single/multi-GPU setups.

-[_Coming soon_](perf_infer_special)
+* [Inference on a single CPU](perf_infer_cpu)
+* [Inference on a single GPU](perf_infer_gpu_one)
+* [Multi-GPU inference](perf_infer_gpu_many)
+* [XLA Integration for TensorFlow Models](tf_xla)

-## Hardware

-In the hardware section you can find tips and tricks when building your own deep learning rig.
+## Training and inference

-[Go to hardware section](perf_hardware)
+Here you'll find techniques, tips and tricks that apply whether you are training a model, or running inference with it.

+* [Instantiating a big model](big_models)
+* [Troubleshooting performance issues](debugging)

 ## Contribute

-This document is far from being complete and a lot more needs to be added, so if you have additions or corrections to make please don't hesitate to open a PR or if you aren't sure start an Issue and we can discuss the details there.
+This document is far from complete and a lot more needs to be added. If you have additions or corrections to 
+make, please don't hesitate to open a PR, or if you aren't sure, start an Issue and we can discuss the details there.

-When making contributions that A is better than B, please try to include a reproducible benchmark and/or a link to the source of that information (unless it comes directly from you).
+When contributing a claim that A is better than B, please try to include a reproducible benchmark and/or a link to the 
+source of that information (unless it comes directly from you).