Unverified Commit a52e180a authored by Steven Liu, committed by GitHub

[docs] General doc fixes (#28087)

* doc fix friday

* deprecated objects

* update not_doctested

* update toctree
parent 08a6e7a7
@@ -146,8 +146,6 @@
    title: Efficient training on CPU
  - local: perf_train_cpu_many
    title: Distributed CPU training
-  - local: perf_train_tpu
-    title: Training on TPUs
  - local: perf_train_tpu_tf
    title: Training on TPU with TensorFlow
  - local: perf_train_special
...
@@ -400,12 +400,6 @@ Pipelines available for natural language processing tasks include the following.
- __call__
- all
-### NerPipeline
-[[autodoc]] NerPipeline
-See [`TokenClassificationPipeline`] for all details.
### QuestionAnsweringPipeline
[[autodoc]] QuestionAnsweringPipeline
...
@@ -56,9 +56,24 @@ FlashAttention-2 is currently supported for the following architectures:
You can request to add FlashAttention-2 support for another model by opening a GitHub Issue or Pull Request.
-Before you begin, make sure you have FlashAttention-2 installed. For NVIDIA GPUs, the library is installable through pip: `pip install flash-attn --no-build-isolation`. We strongly suggest to refer to the [detailed installation instructions](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features).
-FlashAttention-2 is also supported on AMD GPUs, with the current support limited to **Instinct MI210 and Instinct MI250**. We strongly suggest to use the following [Dockerfile](https://github.com/huggingface/optimum-amd/tree/main/docker/transformers-pytorch-amd-gpu-flash/Dockerfile) to use FlashAttention-2 on AMD GPUs.
+Before you begin, make sure you have FlashAttention-2 installed.
+<hfoptions id="install">
+<hfoption id="NVIDIA">
+```bash
+pip install flash-attn --no-build-isolation
+```
+We strongly suggest referring to the detailed [installation instructions](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features) to learn more about supported hardware and data types!
+</hfoption>
+<hfoption id="AMD">
+FlashAttention-2 is also supported on AMD GPUs and current support is limited to **Instinct MI210** and **Instinct MI250**. We strongly suggest using this [Dockerfile](https://github.com/huggingface/optimum-amd/tree/main/docker/transformers-pytorch-amd-gpu-flash/Dockerfile) to use FlashAttention-2 on AMD GPUs.
+</hfoption>
+</hfoptions>
To enable FlashAttention-2, pass the argument `attn_implementation="flash_attention_2"` to [`~AutoModelForCausalLM.from_pretrained`]:
@@ -80,7 +95,9 @@ model = AutoModelForCausalLM.from_pretrained(
FlashAttention-2 can only be used when the model's dtype is `fp16` or `bf16`. Make sure to cast your model to the appropriate dtype and load them on a supported device before using FlashAttention-2.
-Note that `use_flash_attention_2=True` can also be used to enable Flash Attention 2, but is deprecated in favor of `attn_implementation="flash_attention_2"`.
+<br>
+You can also set `use_flash_attention_2=True` to enable FlashAttention-2 but it is deprecated in favor of `attn_implementation="flash_attention_2"`.
</Tip>
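For quick reference, here is a minimal sketch of the loading call described above; the checkpoint name and `device_map` value are illustrative rather than prescribed by this change:

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# example checkpoint; any architecture with FlashAttention-2 support works
model_id = "tiiuae/falcon-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# FlashAttention-2 requires fp16 or bf16 weights on a supported GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```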
@@ -144,11 +161,11 @@ FlashAttention is more memory efficient, meaning you can train on much larger se
<img src="https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/llama-2-large-seqlen-padding.png">
</div>
-## FlashAttention and memory-efficient attention through PyTorch's scaled_dot_product_attention
+## PyTorch scaled dot product attention
-PyTorch's [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA) can also call FlashAttention and memory-efficient attention kernels under the hood. SDPA support is currently being added natively in Transformers, and is used by default for `torch>=2.1.1` when an implementation is available.
+PyTorch's [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA) can also call FlashAttention and memory-efficient attention kernels under the hood. SDPA support is currently being added natively in Transformers and is used by default for `torch>=2.1.1` when an implementation is available.
-For now, Transformers supports inference and training through SDPA for the following architectures:
+For now, Transformers supports SDPA inference and training for the following architectures:
* [Bart](https://huggingface.co/docs/transformers/model_doc/bart#transformers.BartModel)
* [GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode#transformers.GPTBigCodeModel)
* [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel)
@@ -156,9 +173,13 @@ For now, Transformers supports inference and training through SDPA for the follo
* [Idefics](https://huggingface.co/docs/transformers/model_doc/idefics#transformers.IdeficsModel)
* [Whisper](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperModel)
-Note that FlashAttention can only be used for models with the `fp16` or `bf16` torch type, so make sure to cast your model to the appropriate type before using it.
+<Tip>
+FlashAttention can only be used for models with the `fp16` or `bf16` torch type, so make sure to cast your model to the appropriate type first.
+</Tip>
-By default, `torch.nn.functional.scaled_dot_product_attention` selects the most performant kernel available, but to check whether a backend is available in a given setting (hardware, problem size), you can use [`torch.backends.cuda.sdp_kernel`](https://pytorch.org/docs/master/backends.html#torch.backends.cuda.sdp_kernel) as a context manager:
+By default, SDPA selects the most performant kernel available but you can check whether a backend is available in a given setting (hardware, problem size) with [`torch.backends.cuda.sdp_kernel`](https://pytorch.org/docs/master/backends.html#torch.backends.cuda.sdp_kernel) as a context manager:
```diff
import torch
@@ -178,7 +199,7 @@ inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
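Besides probing kernel availability with the context manager above, you can also request the SDPA code path explicitly when loading a model. A minimal sketch, assuming a Transformers release where `from_pretrained` accepts `attn_implementation="sdpa"`; the checkpoint is only an example from the supported list above:

```py
import torch
from transformers import AutoModel

# ask for the SDPA attention implementation explicitly
model = AutoModel.from_pretrained(
    "facebook/bart-base",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)
```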
-If you see a bug with the traceback below, try using nightly version of PyTorch which may have broader coverage for FlashAttention:
+If you see a bug with the traceback below, try using the nightly version of PyTorch which may have broader coverage for FlashAttention:
```bash
RuntimeError: No available kernel. Aborting execution.
@@ -191,11 +212,10 @@ pip3 install -U --pre torch torchvision torchaudio --index-url https://download.
<Tip warning={true}>
-Part of BetterTransformer features are being upstreamed in Transformers, with native `torch.nn.scaled_dot_product_attention` default support. BetterTransformer still has a wider coverage than the Transformers SDPA integration, but you can expect more and more architectures to support natively SDPA in Transformers.
+Some BetterTransformer features are being upstreamed to Transformers with default support for native `torch.nn.scaled_dot_product_attention`. BetterTransformer still has a wider coverage than the Transformers SDPA integration, but you can expect more and more architectures to natively support SDPA in Transformers.
</Tip>
<Tip>
Check out our benchmarks with BetterTransformer and scaled dot product attention in the [Out of the box acceleration and memory savings of 🤗 decoder models with PyTorch 2.0](https://pytorch.org/blog/out-of-the-box-acceleration/) and learn more about the fastpath execution in the [BetterTransformer](https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2) blog post.
...
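For readers who want to try the BetterTransformer path referenced in the Tips above, here is a minimal sketch; it assumes the `optimum` package is installed (for example via `pip install optimum`) and uses `gpt2` purely as an example checkpoint:

```py
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# convert the model to use BetterTransformer fastpath kernels
model = model.to_bettertransformer()

# revert to the canonical Transformers modeling code, e.g. before saving
model = model.reverse_bettertransformer()
```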
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Training on TPUs
<Tip>
Note: Most of the strategies introduced in the [single GPU section](perf_train_gpu_one) (such as mixed precision training or gradient accumulation) and [multi-GPU section](perf_train_gpu_many) are generic and apply to training models in general so make sure to have a look at it before diving into this section.
</Tip>
This document will be completed soon with information on how to train on TPUs.
@@ -548,6 +548,7 @@ The benchmarks indicate AWQ quantization is the fastest for inference, text gene
The [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model was benchmarked with `batch_size=1` with and without fused modules.
<figcaption class="text-center text-gray-500 text-lg">Unfused module</figcaption>
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------|
| 1 | 32 | 32 | 60.0984 | 38.4537 | 4.50 GB (5.68%) |
@@ -559,6 +560,7 @@ The [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7
| 1 | 2048 | 2048 | 2927.33 | 35.2676 | 5.73 GB (7.23%) |
<figcaption class="text-center text-gray-500 text-lg">Fused module</figcaption>
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------|
| 1 | 32 | 32 | 81.4899 | 80.2569 | 4.00 GB (5.05%) |
...
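The "fused module" rows above correspond to loading the AWQ checkpoint with fused layers enabled through the quantization config. A minimal sketch, assuming the `AwqConfig` fusing options (`do_fuse`, `fuse_max_seq_len`) available in recent Transformers releases; the `fuse_max_seq_len` value is illustrative:

```py
from transformers import AutoModelForCausalLM, AwqConfig

model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ"

# fuse_max_seq_len should cover the longest prefill + decode length you expect
quantization_config = AwqConfig(
    bits=4,
    do_fuse=True,
    fuse_max_seq_len=512,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="cuda:0",
)
```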
@@ -913,6 +913,7 @@ DEPRECATED_OBJECTS = [
    "LineByLineTextDataset",
    "LineByLineWithRefDataset",
    "LineByLineWithSOPTextDataset",
+    "NerPipeline",
    "PretrainedBartModel",
    "PretrainedFSMTModel",
    "SingleSentenceClassificationProcessor",
...
@@ -282,7 +282,6 @@ docs/source/en/perf_train_cpu_many.md
docs/source/en/perf_train_gpu_many.md
docs/source/en/perf_train_gpu_one.md
docs/source/en/perf_train_special.md
-docs/source/en/perf_train_tpu.md
docs/source/en/perf_train_tpu_tf.md
docs/source/en/performance.md
docs/source/en/perplexity.md
...