Unverified Commit 46d2e287 authored by brett koonce, committed by GitHub

docs: minor spelling tweaks (#623)


Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
parent d38ad6a1
@@ -79,7 +79,7 @@ DeepSpeed.
### Optimizer State and Gradient Partitioning
Optimizer State and Gradient Partitioning in ZeRO reduces the memory consumption of the
model states (optimizer states, gradients and parameters) by 8x compared to standard
data parallelism by partitioning these states across the data parallel processes instead of
replicating them.
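To see where the 8x figure comes from, here is a back-of-the-envelope sketch (not DeepSpeed code), assuming mixed-precision Adam, i.e. roughly 16 bytes of model states per parameter as in the ZeRO paper's accounting:

```python
# Rough per-GPU memory accounting for model states, assuming mixed-precision
# Adam: 2 bytes fp16 parameters + 2 bytes fp16 gradients + 12 bytes fp32
# optimizer states (master weights, momentum, variance) per parameter.
def model_state_bytes_per_param(num_gpus: int, partitioned: bool) -> float:
    fp16_params, fp16_grads, optim_states = 2.0, 2.0, 12.0
    if partitioned:
        # Optimizer states and gradients are sharded across data-parallel ranks;
        # only the fp16 parameters remain replicated.
        return fp16_params + (fp16_grads + optim_states) / num_gpus
    return fp16_params + fp16_grads + optim_states  # standard data parallelism

baseline = model_state_bytes_per_param(64, partitioned=False)  # 16.0
sharded = model_state_bytes_per_param(64, partitioned=True)    # ~2.2
print(f"reduction: {baseline / sharded:.1f}x")  # ~7.2x at 64 GPUs, approaching 8x
```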
@@ -150,8 +150,8 @@ Please see the [core API doc](https://deepspeed.readthedocs.io/) for more detail
### Activation Checkpointing API
DeepSpeed's Activation Checkpointing API supports activation checkpoint partitioning,
CPU checkpointing, and contiguous memory optimizations, while also allowing layerwise
profiling. Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
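For orientation, a minimal sketch of how these knobs could be wired up through `deepspeed.checkpointing`; the exact argument names are assumptions based on the features listed above, so please verify them against the core API doc:

```python
import deepspeed

# Configure activation checkpointing once at startup (argument names assumed).
deepspeed.checkpointing.configure(
    mpu_=None,                   # model-parallel unit, if one is used
    partition_activations=True,  # partition checkpointed activations across GPUs
    checkpoint_in_cpu=True,      # offload activation checkpoints to CPU memory
    profile=True,                # enable layerwise forward/backward profiling
)

# Then use DeepSpeed's checkpoint function in place of
# torch.utils.checkpoint.checkpoint inside the model's forward pass, e.g.:
#   hidden = deepspeed.checkpointing.checkpoint(transformer_layer, hidden)
```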
@@ -190,7 +190,7 @@ NVIDIA, or any training optimizer that extends torch's `torch.optim.Optimizer` c
We introduce an efficient implementation of the Adam optimizer on CPU that improves the parameter-update
performance by nearly an order of magnitude. We use the AVX SIMD instructions on the Intel x86 architecture
for the CPU-Adam implementation. We support both the AVX-512 and AVX-2 instruction sets. DeepSpeed uses
AVX-2 by default, which can be switched to AVX-512 by setting the build flag `DS_BUILD_AVX512` to 1 when
installing DeepSpeed. Using AVX-512, we observe 5.1x to 6.5x speedups over torch-adam for model sizes between
1 and 10 billion parameters.
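To try the CPU-Adam optimizer directly, here is a minimal sketch; the `deepspeed.ops.adam.DeepSpeedCPUAdam` import path and the build command in the first comment are assumptions worth double-checking:

```python
# Assumed build step: DS_BUILD_AVX512=1 pip install deepspeed
# switches the CPU-Adam kernel from the default AVX-2 path to AVX-512.
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam  # assumed import path

model = torch.nn.Linear(1024, 1024)  # toy CPU model
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-4, weight_decay=0.01)

loss = model(torch.randn(8, 1024)).sum()
loss.backward()
optimizer.step()  # parameter update runs in the vectorized CPU kernel
```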
@@ -12,4 +12,4 @@ DeepSpeed offers sparse attention kernels, an instrumental technology to support
* Brief overview, see our [press release]({{ site.press_release_v3 }}).
* Detailed technology deep dive, see our [blog post](https://www.deepspeed.ai/news/2020/09/08/sparse-attention.html).
* Tutorial on how to use sparse attention, see our [Sparse attention tutorial](https://www.deepspeed.ai/tutorials/sparse-attention/).
* The source code for our sparse attention kernels can be found in the [DeepSpeed repo](https://github.com/microsoft/deepspeed) and BERT pre-training code using sparse attention can be found in the [DeepSpeedExamples repo](https://github.com/microsoft/deepspeedexamples).
@@ -120,7 +120,7 @@ Alternatively, we show how the standard `mpirun` launcher can be used for launch
mpirun -np [#processes] -ppn [#GPUs on each node] -hostfile [hostfile] [MPI flags] bash run_squad_mpi_onebitadam.sh
```
For example, in order to use 32 GPUs (4 GPUs/node, 8 nodes in total) with InfiniBand support, you can use the `mpirun` launcher packaged with the MVAPICH2 library. Please run the following command:
```shell
mpirun -np 32 -ppn 4 -hostfile hosts -env MV2_USE_CUDA=1 -env MV2_SUPPORT_DL=1 -env MV2_ENABLE_AFFINITY=0 -env MV2_SMP_USE_CMA=0 bash run_squad_mpi_onebitadam.sh
@@ -166,7 +166,7 @@ We fixed the learning rate to 3e-5. The table below shows the F1 and the EM scor
***Training Speed and Scalability:***
1-bit Adam enables up to a 2.7x overall speedup in training speed for SQuAD fine-tuning. This is made possible by up to 6.2x higher throughput during the compressed stage of the algorithm, as shown in Figure 1.
![SQuAD Finetuning](/assets/images/squad-scaling.png){: .align-center}
@@ -75,7 +75,7 @@ net = PipelineModule(layers=net, num_stages=2)
```
`PipelineModule` uses its `layers` argument as the sequence of layers that
comprise the model. After initialization, `net` is divided into two pipeline
stages and its layers are moved to the corresponding GPUs. If more than two GPUs
are present, DeepSpeed will also use hybrid data parallelism.
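For context, a toy sketch of the pattern described above (layer sizes are illustrative, and the script is assumed to run under the `deepspeed` launcher with distributed initialization already done, as in the pipeline parallelism tutorial):

```python
import torch.nn as nn
from deepspeed.pipe import PipelineModule

# Express the model as a flat sequence of layers; DeepSpeed splits this
# sequence into pipeline stages and places each stage on its own GPU.
layers = [
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1024),
]
net = PipelineModule(layers=layers, num_stages=2)
# With 2 GPUs, each stage owns one GPU; with 4 GPUs, DeepSpeed also
# replicates the two-stage pipeline for hybrid data parallelism.
```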
**Note:** The total number of GPUs must be divisible by the number of pipeline stages.
@@ -95,7 +95,7 @@ Note that the above configuration assumes training on 64 X 32GB V100 GPUs. Each
Table 1. Pre-training hyperparameters
**Note:** DeepSpeed now supports PreLayerNorm as the default way for training BERT, because of its ability to avoid vanishing gradients, stabilize optimization, and improve performance, as described in our fastest BERT training [blog post](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html). We therefore support the switchable Transformer block directly on BERT with PreLayerNorm. The implementation can be found at "example\bing_bert\nvidia\modelingpreln_layerdrop.py".
## Fine-tuning with DeepSpeed on GLUE Tasks
@@ -79,7 +79,7 @@ Next, we need to update the DeepSpeed json configuration, as shown below, to ena
}
```
In the above changes, we have set the _stage_ field to 2, and configured other optimization knobs that are available in ZeRO stage 2. For example, we have enabled _contiguous_gradients_ to reduce memory fragmentation during the backward pass. A full description of these optimization knobs is available [here](/docs/config-json/#zero-optimizations-for-fp16-training). With these changes, we can now launch the training run.
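For orientation, the ZeRO section being described might look roughly like the sketch below, written here as a Python dict rather than the JSON file; the key names follow the config-json reference linked above, and the bucket sizes are placeholder values:

```python
# Sketch of a DeepSpeed config enabling ZeRO stage 2 (illustrative values).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                    # partition optimizer states and gradients
        "contiguous_gradients": True,  # reduce memory fragmentation in backward
        "overlap_comm": True,          # overlap gradient reduction with backward
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8,
    },
}
```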
Here is a screenshot of the training log: