"examples/git@developer.sourcefind.cn:OpenDAS/nni.git" did not exist on "73a6990b1b4e74b7d6d837d2864b35317de5fd70"
Unverified Commit c13a43aa authored by fxmarty, committed by GitHub

Reflect RoCm support in the documentation (#27636)



* reflect RoCm support in the documentation

* Update docs/source/en/main_classes/trainer.md
Co-authored-by: Lysandre Debut <hi@lysand.re>

* fix review comments

* use ROCm instead of RoCm

---------
Co-authored-by: Lysandre Debut <hi@lysand.re>
parent a6d178e2
@@ -26,7 +26,7 @@ If you're looking to fine-tune a language model like Llama-2 or Mistral on a tex
Before instantiating your [`Trainer`], create a [`TrainingArguments`] to access all the points of customization during training.
-The API supports distributed training on multiple GPUs/TPUs, mixed precision through [NVIDIA Apex](https://github.com/NVIDIA/apex) and Native AMP for PyTorch.
+The API supports distributed training on multiple GPUs/TPUs, mixed precision through [NVIDIA Apex](https://github.com/NVIDIA/apex) for NVIDIA GPUs, [ROCm APEX](https://github.com/ROCmSoftwarePlatform/apex) for AMD GPUs, and Native AMP for PyTorch.
The [`Trainer`] contains the basic training loop which supports the above features. To inject custom behavior you can subclass them and override the following methods:
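As a minimal sketch of the mixed-precision setup described above (assuming a machine with an NVIDIA or ROCm GPU; the output directory name is a placeholder): with `fp16=True`, native AMP is used by default, while `half_precision_backend="apex"` selects an installed Apex build.

```py
# Minimal sketch: enabling mixed precision through TrainingArguments.
# `fp16=True` with the default backend uses native PyTorch AMP; setting
# `half_precision_backend="apex"` opts into an installed Apex build
# (NVIDIA Apex on NVIDIA GPUs, ROCm Apex on AMD GPUs).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="my-finetuned-model",  # placeholder output directory
    fp16=True,                        # mixed-precision training
    half_precision_backend="auto",    # or "apex" if Apex is installed
)
```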
@@ -272,7 +272,7 @@ but this approach can be confusing since you may forget you set up the environme
There is an additional environment variable `CUDA_DEVICE_ORDER` that controls how the physical devices are ordered. The two choices are:
-1. ordered by PCIe bus IDs (matches `nvidia-smi`'s order) - this is the default.
+1. ordered by PCIe bus IDs (matches `nvidia-smi` and `rocm-smi`'s order) - this is the default.
```bash
export CUDA_DEVICE_ORDER=PCI_BUS_ID
@@ -284,7 +284,7 @@ export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_DEVICE_ORDER=FASTEST_FIRST
```
-Most of the time you don't need to care about this environment variable, but it's very helpful if you have a lopsided setup where you have an old and a new GPUs physically inserted in such a way so that the slow older card appears to be first. One way to fix that is to swap the cards. But if you can't swap the cards (e.g., if the cooling of the devices gets impacted) then setting `CUDA_DEVICE_ORDER=FASTEST_FIRST` will always put the newer faster card first. It'll be somewhat confusing though since `nvidia-smi` will still report them in the PCIe order.
+Most of the time you don't need to care about this environment variable, but it's very helpful if you have a lopsided setup where you have an old and a new GPUs physically inserted in such a way so that the slow older card appears to be first. One way to fix that is to swap the cards. But if you can't swap the cards (e.g., if the cooling of the devices gets impacted) then setting `CUDA_DEVICE_ORDER=FASTEST_FIRST` will always put the newer faster card first. It'll be somewhat confusing though since `nvidia-smi` (or `rocm-smi`) will still report them in the PCIe order.
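The same ordering can also be set from Python; a minimal sketch (assuming PyTorch), keeping in mind the variable must be set before the first CUDA/ROCm call:

```py
# Minimal sketch: set the device order before torch initializes any device.
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # or "FASTEST_FIRST"

import torch

# Print the devices in the order this process will now enumerate them.
for index in range(torch.cuda.device_count()):
    print(index, torch.cuda.get_device_name(index))
```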
The other solution to swapping the order is to use:
...
@@ -314,7 +314,7 @@ The predicted tokens will then be placed between the sentinel tokens.
## Performance
-If you'd like a faster training and inference performance, install [apex](https://github.com/NVIDIA/apex#quick-start) and then the model will automatically use `apex.normalization.FusedRMSNorm` instead of `T5LayerNorm`. The former uses an optimized fused kernel which is several times faster than the latter.
+If you'd like a faster training and inference performance, install [NVIDIA APEX](https://github.com/NVIDIA/apex#quick-start) for NVIDIA GPUs, or [ROCm APEX](https://github.com/ROCmSoftwarePlatform/apex) for AMD GPUs and then the model will automatically use `apex.normalization.FusedRMSNorm` instead of `T5LayerNorm`. The former uses an optimized fused kernel which is several times faster than the latter.
## Resources
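A quick way to check whether the fused kernel will be picked up is to try the Apex import directly; a minimal sketch:

```py
# Minimal sketch: verify that an Apex build exposing the fused RMSNorm kernel
# is importable in the current environment.
try:
    from apex.normalization import FusedRMSNorm
    print("Apex FusedRMSNorm is available:", FusedRMSNorm)
except ImportError:
    print("Apex is not installed; the model falls back to T5LayerNorm.")
```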
...
@@ -15,7 +15,7 @@ rendered properly in your Markdown viewer.
# GPU inference
-GPUs are the standard choice of hardware for machine learning, unlike CPUs, because they are optimized for memory bandwidth and parallelism. To keep up with the larger sizes of modern models or to run these large models on existing and older hardware, there are several optimizations you can use to speed up GPU inference. In this guide, you'll learn how to use FlashAttention-2 (a more memory-efficient attention mechanism), BetterTransformer (a PyTorch native fastpath execution), and bitsandbytes to quantize your model to a lower precision. Finally, learn how to use 🤗 Optimum to accelerate inference with ONNX Runtime on Nvidia GPUs.
+GPUs are the standard choice of hardware for machine learning, unlike CPUs, because they are optimized for memory bandwidth and parallelism. To keep up with the larger sizes of modern models or to run these large models on existing and older hardware, there are several optimizations you can use to speed up GPU inference. In this guide, you'll learn how to use FlashAttention-2 (a more memory-efficient attention mechanism), BetterTransformer (a PyTorch native fastpath execution), and bitsandbytes to quantize your model to a lower precision. Finally, learn how to use 🤗 Optimum to accelerate inference with ONNX Runtime on Nvidia and AMD GPUs.
<Tip>
@@ -276,13 +276,13 @@ Feel free to try running a 11 billion parameter [T5 model](https://colab.researc
<Tip>
-Learn more details about using ORT with 🤗 Optimum in the [Accelerated inference on NVIDIA GPUs](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu#accelerated-inference-on-nvidia-gpus) guide. This section only provides a brief and simple example.
+Learn more details about using ORT with 🤗 Optimum in the [Accelerated inference on NVIDIA GPUs](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu#accelerated-inference-on-nvidia-gpus) and [Accelerated inference on AMD GPUs](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/amdgpu#accelerated-inference-on-amd-gpus) guides. This section only provides a brief and simple example.
</Tip>
-ONNX Runtime (ORT) is a model accelerator that supports accelerated inference on Nvidia GPUs. ORT uses optimization techniques like fusing common operations into a single node and constant folding to reduce the number of computations performed and speedup inference. ORT also places the most computationally intensive operations on the GPU and the rest on the CPU to intelligently distribute the workload between the two devices.
+ONNX Runtime (ORT) is a model accelerator that supports accelerated inference on Nvidia GPUs, and on AMD GPUs that use the [ROCm](https://www.amd.com/en/products/software/rocm.html) stack. ORT uses optimization techniques like fusing common operations into a single node and constant folding to reduce the number of computations performed and speedup inference. ORT also places the most computationally intensive operations on the GPU and the rest on the CPU to intelligently distribute the workload between the two devices.
-ORT is supported by 🤗 Optimum which can be used in 🤗 Transformers. You'll need to use an [`~optimum.onnxruntime.ORTModel`] for the task you're solving, and specify the `provider` parameter which can be set to either [`CUDAExecutionProvider`](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu#cudaexecutionprovider) or [`TensorrtExecutionProvider`](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu#tensorrtexecutionprovider). If you want to load a model that was not yet exported to ONNX, you can set `export=True` to convert your model on-the-fly to the ONNX format:
+ORT is supported by 🤗 Optimum which can be used in 🤗 Transformers. You'll need to use an [`~optimum.onnxruntime.ORTModel`] for the task you're solving, and specify the `provider` parameter which can be set to [`CUDAExecutionProvider`](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu#cudaexecutionprovider), [`ROCMExecutionProvider`](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/amdgpu) or [`TensorrtExecutionProvider`](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu#tensorrtexecutionprovider). If you want to load a model that was not yet exported to ONNX, you can set `export=True` to convert your model on-the-fly to the ONNX format:
```py
from optimum.onnxruntime import ORTModelForSequenceClassification
...
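For context, a minimal sketch of how the `provider` argument is typically used (assuming `optimum` with an ONNX Runtime GPU or ROCm build is installed; the checkpoint name is only an example):

```py
# Minimal sketch: load a model through Optimum's ORTModel, exporting it to ONNX
# on the fly and selecting the execution provider that matches the hardware.
from optimum.onnxruntime import ORTModelForSequenceClassification
from optimum.pipelines import pipeline
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
ort_model = ORTModelForSequenceClassification.from_pretrained(
    model_id,
    export=True,                       # convert the PyTorch weights to ONNX on the fly
    provider="CUDAExecutionProvider",  # use "ROCMExecutionProvider" on AMD GPUs
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

classifier = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0")
print(classifier("ONNX Runtime makes GPU inference noticeably faster."))
```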
@@ -237,7 +237,7 @@ You can speedup the training throughput by using Flash Attention 2 integration i
The most common optimizer used to train transformer models is Adam or AdamW (Adam with weight decay). Adam achieves
good convergence by storing the rolling average of the previous gradients; however, it adds an additional memory
footprint of the order of the number of model parameters. To remedy this, you can use an alternative optimizer.
-For example if you have [NVIDIA/apex](https://github.com/NVIDIA/apex) installed, `adamw_apex_fused` will give you the
+For example if you have [NVIDIA/apex](https://github.com/NVIDIA/apex) installed for NVIDIA GPUs, or [ROCmSoftwarePlatform/apex](https://github.com/ROCmSoftwarePlatform/apex) for AMD GPUs, `adamw_apex_fused` will give you the
fastest training experience among all supported AdamW optimizers.
[`Trainer`] integrates a variety of optimizers that can be used out of box: `adamw_hf`, `adamw_torch`, `adamw_torch_fused`,
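Selecting one of these optimizers is a single argument; a minimal sketch (the output directory is a placeholder, and `adamw_apex_fused` assumes an Apex build is installed):

```py
# Minimal sketch: pick the fused Apex AdamW implementation via the `optim`
# argument of TrainingArguments; it requires NVIDIA Apex or ROCm Apex.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="my-finetuned-model",  # placeholder output directory
    optim="adamw_apex_fused",
)
```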
@@ -529,4 +529,4 @@ By default, in training mode, the BetterTransformer integration **drops the mask
</Tip>
Check out this [blogpost](https://pytorch.org/blog/out-of-the-box-acceleration/) to learn more about acceleration and memory-savings with SDPA.
\ No newline at end of file
@@ -74,6 +74,12 @@ To follow this guide you will need a GPU. If you're working in a notebook, run t
!nvidia-smi
```
+or alternatively for AMD GPUs:
+```bash
+!rocm-smi
+```
</Tip>
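If you'd rather check from Python than from the shell, a minimal sketch (assuming PyTorch; on ROCm builds the devices are still exposed through `torch.cuda`):

```py
# Minimal sketch: vendor-agnostic check that a GPU is visible to PyTorch.
import torch

print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))
```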
We encourage you to log in to your Hugging Face account to upload and share your model with the community. When prompted, enter your token to log in:
@@ -630,4 +636,4 @@ see if this improves the results.
Finally, it is essential to consider ethical considerations. Although TTS technology has numerous useful applications, it
may also be used for malicious purposes, such as impersonating someone's voice without their knowledge or consent. Please
use TTS judiciously and responsibly.
\ No newline at end of file