Improve performance docs (#17750)

* add skeleton files
* fix cpu inference link
* add hint to make clear that single gpu section contains general info
* add new files to ToC
* update toctree to have subsection for performance
* add "coming soon" to the still empty sections
* fix missing title
* fix typo
* add reference to empty documents
* Apply suggestions from code review

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

Commit 6f29029b, authored Jun 23, 2022 by Leandro von Werra, committed by GitHub Jun 23, 2022
8 changed files
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -59,7 +59,29 @@
    title: Converting TensorFlow Checkpoints
  - local: serialization
    title: Export 🤗 Transformers models
-  - local: performance
+  - sections:
+    - local: performance
+      title: Overview
+    - local: perf_train_gpu_one
+      title: Training on one GPU
+    - local: perf_train_gpu_many
+      title: Training on many GPUs
+    - local: perf_train_cpu
+      title: Training on CPU
+    - local: perf_train_tpu
+      title: Training on TPUs
+    - local: perf_train_special
+      title: Training on Specialized Hardware
+    - local: perf_infer_cpu
+      title: Inference on CPU
+    - local: perf_infer_gpu_one
+      title: Inference on one GPU
+    - local: perf_infer_gpu_many
+      title: Inference on many GPUs
+    - local: perf_infer_special
+      title: Inference on Specialized Hardware
+    - local: perf_hardware
+      title: Custom hardware for training
    title: Performance and scalability
  - local: big_models
    title: Instantiating a big model
@@ -81,16 +103,6 @@
    title: "How to add a model to 🤗 Transformers?"
  - local: add_new_pipeline
    title: "How to add a pipeline to 🤗 Transformers?"
-  - local: perf_train_gpu_one
-    title: Training on one GPU
-  - local: perf_train_gpu_many
-    title: Training on many GPUs
-  - local: perf_train_cpu
-    title: Training on CPU
-  - local: perf_infer_cpu
-    title: Inference on CPU
-  - local: perf_hardware
-    title: Custom hardware for training
  - local: testing
    title: Testing
  - local: pr_checks

--- a/docs/source/en/perf_infer_gpu_many.mdx
+++ b/docs/source/en/perf_infer_gpu_many.mdx
+<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+-->
+
+# Efficient Inference on Multiple GPUs
+
+This document will be completed soon with information on how to run inference on multiple GPUs. In the meantime you can check out [the guide for training on a single GPU](perf_train_gpu_one) and [the guide for inference on CPUs](perf_infer_cpu).
\ No newline at end of file
--- a/docs/source/en/perf_infer_gpu_one.mdx
+++ b/docs/source/en/perf_infer_gpu_one.mdx
+<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+-->
+
+# Efficient Inference on a Single GPU
+
+This document will be completed soon with information on how to run inference on a single GPU. In the meantime you can check out [the guide for training on a single GPU](perf_train_gpu_one) and [the guide for inference on CPUs](perf_infer_cpu).
\ No newline at end of file
--- a/docs/source/en/perf_infer_special.mdx
+++ b/docs/source/en/perf_infer_special.mdx
+<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+-->
+
+# Inference on Specialized Hardware
+
+This document will be completed soon with information on how to run inference on specialized hardware. In the meantime you can check out [the guide for inference on CPUs](perf_infer_cpu).
\ No newline at end of file
--- a/docs/source/en/perf_train_gpu_many.mdx
+++ b/docs/source/en/perf_train_gpu_many.mdx
@@ -13,6 +13,12 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o

 When training on a single GPU is too slow or the model weights don't fit in a single GPU's memory, we use a multi-GPU setup. Switching from a single GPU to multiple GPUs requires some form of parallelism, as the work needs to be distributed. There are several techniques to achieve parallelism, such as data, tensor, or pipeline parallelism. However, there is no one solution to fit them all, and which settings work best depends on the hardware you are running on. While the main concepts most likely will apply to any other framework, this article is focused on PyTorch-based implementations.

+<Tip>
+
+ Note: Most of the strategies introduced in the [single GPU section](perf_train_gpu_one) (such as mixed precision training or gradient accumulation) are generic and apply to training models in general, so make sure to have a look at it before diving into the following sections such as multi-GPU or CPU training.
+
+</Tip>
+
 We will first discuss in depth various 1D parallelism techniques and their pros and cons, and then look at how they can be combined into 2D and 3D parallelism to enable even faster training and to support even bigger models. Various other powerful alternative approaches will be presented.

 ## Concepts

--- a/docs/source/en/perf_train_special.mdx
+++ b/docs/source/en/perf_train_special.mdx
+<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+-->
+
+# Training on Specialized Hardware
+
+<Tip>
+
+ Note: Most of the strategies introduced in the [single GPU section](perf_train_gpu_one) (such as mixed precision training or gradient accumulation) and [multi-GPU section](perf_train_gpu_many) are generic and apply to training models in general, so make sure to have a look at them before diving into this section.
+
+</Tip>
+
+This document will be completed soon with information on how to train on specialized hardware.
\ No newline at end of file
--- a/docs/source/en/perf_train_tpu.mdx
+++ b/docs/source/en/perf_train_tpu.mdx
+<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+-->
+
+# Training on TPUs
+
+<Tip>
+
+ Note: Most of the strategies introduced in the [single GPU section](perf_train_gpu_one) (such as mixed precision training or gradient accumulation) and [multi-GPU section](perf_train_gpu_many) are generic and apply to training models in general, so make sure to have a look at them before diving into this section.
+
+</Tip>
+
+This document will be completed soon with information on how to train on TPUs.
\ No newline at end of file
--- a/docs/source/en/performance.mdx
+++ b/docs/source/en/performance.mdx
@@ -24,7 +24,13 @@ This document serves as an overview and entry point for the methods that could b

 ## Training

-Training transformer models efficiently requires an accelerator such as a GPU or TPU. The most common case is where you only have a single GPU.
+Training transformer models efficiently requires an accelerator such as a GPU or TPU. The most common case is where you only have a single GPU, but there are also sections on multi-GPU and CPU training (with more coming soon).
+
+<Tip>
+
+ Note: Most of the strategies introduced in the single GPU section (such as mixed precision training or gradient accumulation) are generic and apply to training models in general, so make sure to have a look at it before diving into the following sections such as multi-GPU or CPU training.
+
+</Tip>

 ### Single GPU

@@ -46,11 +52,11 @@ In some cases training on a single GPU is still too slow or won't fit the large

 ### TPU

-_Coming soon_
+[_Coming soon_](perf_train_tpu)

 ### Specialized Hardware

-_Coming soon_
+[_Coming soon_](perf_train_special)

 ## Inference

@@ -58,19 +64,19 @@ Efficient inference with large models in a production environment can be as chal

 ### CPU

-[Go to CPU inference section](perf_infer_cpu.mdx)
+[Go to CPU inference section](perf_infer_cpu)

 ### Single GPU

-_Coming soon_
+[Go to single GPU inference section](perf_infer_gpu_one)

 ### Multi-GPU

-_Coming soon_
+[Go to multi-GPU inference section](perf_infer_gpu_many)

 ### Specialized Hardware

-_Coming soon_
+[_Coming soon_](perf_infer_special)

 ## Hardware