"git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "17981faf6791c1549e7dcb62970c9c699e619ea7"
Unverified Commit 5cbf1fa8 authored by Dhruv Singal, committed by GitHub

fixed typo in fp16 training section for perf_train_gpu_one (#19736)

parent 8db92dbe
@@ -311,7 +311,7 @@ We can see that this saved some more memory but at the same time training became

## Floating Data Types

The idea of mixed precision training is that not all variables need to be stored in full (32-bit) floating point precision. If we can reduce the precision, the variables and their computations are faster. Here are the commonly used floating point data types, the choice of which impacts both memory usage and throughput:

- fp32 (`float32`)
- fp16 (`float16`)
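
A rough way to see what this choice means in practice is to inspect the data types with plain PyTorch. The sketch below is only illustrative (it is not part of the patched file), and the tensor shape is arbitrary; it prints the per-element size and numeric range of fp32, fp16, and bf16, and shows that a 16-bit tensor takes half the memory of its 32-bit counterpart:

```py
import torch

# Per-element cost and numeric range of the common floating point types.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    size = torch.tensor([], dtype=dtype).element_size()
    print(f"{str(dtype):16} {size} bytes/element, max ~{info.max:.2e}, smallest normal ~{info.tiny:.2e}")

# The same tensor takes half the memory in 16-bit precision.
x_fp32 = torch.randn(1024, 1024)     # 1024 * 1024 * 4 bytes = 4 MiB
x_fp16 = x_fp32.to(torch.float16)    # 1024 * 1024 * 2 bytes = 2 MiB
print(x_fp32.nelement() * x_fp32.element_size(), x_fp16.nelement() * x_fp16.element_size())
```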
@@ -328,7 +328,7 @@ While fp16 and fp32 have been around for quite some time, bf16 and tf32 are only

### FP16 Training

The idea of mixed precision training is that not all variables need to be stored in full (32-bit) floating point precision. If we can reduce the precision, the variables and their computations are faster. The main advantage comes from saving the activations in half (16-bit) precision. Although the gradients are also computed in half precision, they are converted back to full precision for the optimization step, so no memory is saved there. Since the model is present on the GPU in both 16-bit and 32-bit precision, this can use more GPU memory (1.5x the original model is on the GPU), especially for small batch sizes. Since some computations are performed in full and some in half precision, this approach is also called mixed precision training. Enabling mixed precision training is also just a matter of setting the `fp16` flag to `True`:

```py
training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args)
```
...
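
Under the hood this corresponds, roughly, to PyTorch's automatic mixed precision recipe. The sketch below shows that general pattern (half precision forward pass, loss scaling, full precision optimizer step) on a toy model; it is an illustration only, not the actual `Trainer` implementation, and the model, optimizer, and data are made-up placeholders:

```py
import torch
from torch import nn

# Toy setup; the model, optimizer, and data are placeholders for the example.
model = nn.Linear(512, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    inputs = torch.randn(4, 512, device="cuda")
    labels = torch.randint(0, 2, (4,), device="cuda")
    optimizer.zero_grad()

    # Forward pass and loss run selected ops in half precision.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.cross_entropy(model(inputs), labels)

    # Scale the loss so small fp16 gradients do not underflow, then unscale
    # before the full (32-bit) precision optimizer step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```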