"git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "17981faf6791c1549e7dcb62970c9c699e619ea7"
Unverified Commit 5cbf1fa8 authored by Dhruv Singal, committed by GitHub

fixed typo in fp16 training section for perf_train_gpu_one (#19736)

parent 8db92dbe
@@ -311,7 +311,7 @@ We can see that this saved some more memory but at the same time training became

## Floating Data Types

The idea of mixed precision training is that not all variables need to be stored in full (32-bit) floating point precision. If we can reduce the precision, the variables and their computations are faster. Here are the commonly used floating point data types, the choice of which impacts both memory usage and throughput:

- fp32 (`float32`)
- fp16 (`float16`)
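
A rough way to see what this choice means in practice is to inspect the data types with plain PyTorch. The sketch below is only illustrative (it is not part of the patched file), and the tensor shape is arbitrary; it prints the per-element size and numeric range of fp32, fp16, and bf16, and shows that a 16-bit tensor takes half the memory of its 32-bit counterpart:

```py
import torch

# Per-element cost and numeric range of the common floating point types.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    size = torch.tensor([], dtype=dtype).element_size()
    print(f"{str(dtype):16} {size} bytes/element, max ~{info.max:.2e}, smallest normal ~{info.tiny:.2e}")

# The same tensor takes half the memory in 16-bit precision.
x_fp32 = torch.randn(1024, 1024)     # 1024 * 1024 * 4 bytes = 4 MiB
x_fp16 = x_fp32.to(torch.float16)    # 1024 * 1024 * 2 bytes = 2 MiB
print(x_fp32.nelement() * x_fp32.element_size(), x_fp16.nelement() * x_fp16.element_size())
```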
@@ -328,7 +328,7 @@ While fp16 and fp32 have been around for quite some time, bf16 and tf32 are only

### FP16 Training

The idea of mixed precision training is that not all variables need to be stored in full (32-bit) floating point precision. If we can reduce the precision, the variables and their computations are faster. The main advantage comes from saving the activations in half (16-bit) precision. Although the gradients are also computed in half precision, they are converted back to full precision for the optimization step, so no memory is saved there. Since the model is present on the GPU in both 16-bit and 32-bit precision, this can use more GPU memory (1.5x the original model is on the GPU), especially for small batch sizes. Since some computations are performed in full and some in half precision, this approach is also called mixed precision training. Enabling mixed precision training is also just a matter of setting the `fp16` flag to `True`:

```py
training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args)
```
...
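
Under the hood this corresponds, roughly, to PyTorch's automatic mixed precision recipe. The sketch below shows that general pattern (half precision forward pass, loss scaling, full precision optimizer step) on a toy model; it is an illustration only, not the actual `Trainer` implementation, and the model, optimizer, and data are made-up placeholders:

```py
import torch
from torch import nn

# Toy setup; the model, optimizer, and data are placeholders for the example.
model = nn.Linear(512, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    inputs = torch.randn(4, 512, device="cuda")
    labels = torch.randint(0, 2, (4,), device="cuda")
    optimizer.zero_grad()

    # Forward pass and loss run selected ops in half precision.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.cross_entropy(model(inputs), labels)

    # Scale the loss so small fp16 gradients do not underflow, then unscale
    # before the full (32-bit) precision optimizer step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```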