Tensor Parallelism conceptual guide (#886)

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> Co-authored-by: Omar Sanseviero <osanseviero@gmail.com> Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

Unverified Commit 1f69fb9e authored Sep 12, 2023 by Merve Noyan Committed by GitHub Sep 12, 2023

Showing with 16 additions and 0 deletions

docs/source/_toctree.yml +2 -0

docs/source/conceptual/tensor_parallelism.md +14 -0

--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -21,6 +21,8 @@
 - sections:
  - local: conceptual/streaming
    title: Streaming
+  - local: conceptual/tensor_parallelism
+    title: Tensor Parallelism
  - local: conceptual/paged_attention
    title: PagedAttention
  - local: conceptual/safetensors

--- a/docs/source/conceptual/tensor_parallelism.md
+++ b/docs/source/conceptual/tensor_parallelism.md
+# Tensor Parallelism
+Tensor parallelism is a technique used to fit a large model across multiple GPUs. For example, when multiplying the input tensors with the first weight tensor, the matrix multiplication is equivalent to splitting the weight tensor column-wise, multiplying each block of columns with the input separately, and then concatenating the separate outputs. Each GPU holds one block of columns and computes its part of the output; the partial outputs are then gathered from the GPUs and concatenated to get the final result, as shown below 👇
+![Image courtesy of Anton Lozkhov](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/TP.png)
+<Tip warning={true}>
+Tensor Parallelism only works for [officially supported models](../supported_models); it will not work when falling back to `transformers`. You can get more information about unsupported models [here](../basic_tutorials/non_core_models).
+</Tip>
+You can learn more about tensor parallelism from [the `transformers` docs](https://huggingface.co/docs/transformers/main/en/perf_train_gpu_many#tensor-parallelism).
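
The column-wise equivalence described in the guide can be checked numerically. The following is a minimal sketch in plain Python, not part of the commit and not how the library implements sharding: each "shard" stands in for one GPU, and the helper names (`matmul`, `split_columns`, `concat_columns`) and the constant `NUM_SHARDS` are hypothetical, chosen only for illustration.

```python
# Pure-Python sketch of column-wise tensor parallelism (no GPUs involved).
# Assumption: each shard plays the role of one GPU holding a block of columns.

def matmul(x, w):
    """Multiply x (n x k) by w (k x m); both are lists of rows."""
    return [
        [sum(x_row[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]
        for x_row in x
    ]

def split_columns(w, num_shards):
    """Split w column-wise into num_shards blocks of equal width."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    width = cols // num_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(num_shards)]

def concat_columns(parts):
    """Concatenate the per-shard outputs along the column axis."""
    return [sum((part[r] for part in parts), []) for r in range(len(parts[0]))]

NUM_SHARDS = 2  # hypothetical number of GPUs

x = [[1, 2, 3],
     [4, 5, 6]]            # input, shape (2, 3)
w = [[1, 0, 2, 1],
     [0, 1, 1, 2],
     [3, 1, 0, 1]]         # weight, shape (3, 4)

shards = split_columns(w, NUM_SHARDS)                     # one column block per shard
partial_outputs = [matmul(x, shard) for shard in shards]  # computed independently
sharded_result = concat_columns(partial_outputs)          # gather and concatenate
full_result = matmul(x, w)                                # unsharded reference

assert sharded_result == full_result
```

Every shard receives the full input `x` but only its own block of weight columns, so the per-shard products are independent; the only step that needs data from all shards is the final concatenation.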