Remove typo in perf_train_gpu_many.mdx (#23144)

- Excess `w` in the word `bottom`

3b74889e · Victor Geislinger · GitHub · 5eeb5564 · 3b74889e
Unverified Commit 3b74889e authored May 04, 2023 by Victor Geislinger Committed by GitHub May 04, 2023

Showing with 1 addition and 1 deletion

docs/source/en/perf_train_gpu_many.mdx +1 -1

--- a/docs/source/en/perf_train_gpu_many.mdx
+++ b/docs/source/en/perf_train_gpu_many.mdx
@@ -272,7 +272,7 @@ It's easy to see from the bottom diagram how PP has less dead zones, where GPUs

 Both parts of the diagram show a parallelism that is of degree 4. That is 4 GPUs are participating in the pipeline. So there is the forward path of 4 pipe stages F0, F1, F2 and F3 and then the return reverse order backward path of B3, B2, B1 and B0.

-PP introduces a new hyper-parameter to tune and it's `chunks` which defines how many chunks of data are sent in a sequence through the same pipe stage. For example, in the bottomw diagram you can see that `chunks=4`. GPU0 performs the same forward path on chunk 0, 1, 2 and 3 (F0,0, F0,1, F0,2, F0,3) and then it waits for other GPUs to do their work and only when their work is starting to be complete, GPU0 starts to work again doing the backward path for chunks 3, 2, 1 and 0 (B0,3, B0,2, B0,1, B0,0).
+PP introduces a new hyper-parameter to tune and it's `chunks` which defines how many chunks of data are sent in a sequence through the same pipe stage. For example, in the bottom diagram you can see that `chunks=4`. GPU0 performs the same forward path on chunk 0, 1, 2 and 3 (F0,0, F0,1, F0,2, F0,3) and then it waits for other GPUs to do their work and only when their work is starting to be complete, GPU0 starts to work again doing the backward path for chunks 3, 2, 1 and 0 (B0,3, B0,2, B0,1, B0,0).

 Note that conceptually this is the same concept as gradient accumulation steps (GAS). Pytorch uses `chunks`, whereas DeepSpeed refers to the same hyper-parameter as GAS.
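
The ordering described in the changed paragraph can be sketched in a few lines of plain Python. This is only an illustration of the schedule for one pipe stage and of splitting a batch into `chunks` pieces. The helper names `stage_schedule` and `split_into_chunks` are hypothetical and are not part of PyTorch or DeepSpeed.

```python
from typing import List, Sequence


def stage_schedule(stage: int, chunks: int) -> List[str]:
    """Return the order of work items for one pipe stage (hypothetical helper).

    Forward passes run over chunks 0..chunks-1, then backward passes
    run over the same chunks in reverse order, as in the diagram text.
    """
    if chunks <= 0:
        raise ValueError("chunks must be a positive integer")
    forward = [f"F{stage},{c}" for c in range(chunks)]
    backward = [f"B{stage},{c}" for c in reversed(range(chunks))]
    return forward + backward


def split_into_chunks(batch: Sequence, chunks: int) -> List[list]:
    """Split a batch into at most `chunks` contiguous pieces (hypothetical helper).

    Assumption: pieces are sized by ceiling division, so the last
    piece may be smaller when the batch does not divide evenly.
    """
    if chunks <= 0:
        raise ValueError("chunks must be a positive integer")
    size = max(1, -(-len(batch) // chunks))  # ceiling division, at least 1
    return [list(batch[i:i + size]) for i in range(0, len(batch), size)]
```

With `chunks=4`, `stage_schedule(0, 4)` gives the GPU0 sequence from the paragraph: F0,0, F0,1, F0,2, F0,3 followed by B0,3, B0,2, B0,1, B0,0. Each piece returned by `split_into_chunks` plays the role of one gradient accumulation step in the GAS view.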