Unverified Commit 46d2e287 authored by brett koonce, committed by GitHub

docs: minor spelling tweaks (#623)


Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
parent d38ad6a1
@@ -79,7 +79,7 @@ DeepSpeed.
### Optimizer State and Gradient Partitioning
Optimizer State and Gradient Partitioning in ZeRO reduces the memory consumption of the
model states (optimizer states, gradients and parameters) by 8x compared to standard
data parallelism by partitioning these states across the data parallel processes instead of
replicating them.
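To see where the 8x figure comes from, here is a back-of-the-envelope sketch (not DeepSpeed code), assuming mixed-precision Adam, i.e. roughly 16 bytes of model states per parameter as in the ZeRO paper's accounting:

```python
# Rough per-GPU memory accounting for model states, assuming mixed-precision
# Adam: 2 bytes fp16 parameters + 2 bytes fp16 gradients + 12 bytes fp32
# optimizer states (master weights, momentum, variance) per parameter.
def model_state_bytes_per_param(num_gpus: int, partitioned: bool) -> float:
    fp16_params, fp16_grads, optim_states = 2.0, 2.0, 12.0
    if partitioned:
        # Optimizer states and gradients are sharded across data-parallel ranks;
        # only the fp16 parameters remain replicated.
        return fp16_params + (fp16_grads + optim_states) / num_gpus
    return fp16_params + fp16_grads + optim_states  # standard data parallelism

baseline = model_state_bytes_per_param(64, partitioned=False)  # 16.0
sharded = model_state_bytes_per_param(64, partitioned=True)    # ~2.2
print(f"reduction: {baseline / sharded:.1f}x")  # ~7.2x at 64 GPUs, approaching 8x
```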
@@ -150,8 +150,8 @@ Please see the [core API doc](https://deepspeed.readthedocs.io/) for more detail
### Activation Checkpointing API
DeepSpeed's Activation Checkpointing API supports activation checkpoint partitioning,
CPU checkpointing, and contiguous memory optimizations, while also allowing layerwise
profiling. Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
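For orientation, a minimal sketch of how these knobs could be wired up through `deepspeed.checkpointing`; the exact argument names are assumptions based on the features listed above, so please verify them against the core API doc:

```python
import deepspeed

# Configure activation checkpointing once at startup (argument names assumed).
deepspeed.checkpointing.configure(
    mpu_=None,                   # model-parallel unit, if one is used
    partition_activations=True,  # partition checkpointed activations across GPUs
    checkpoint_in_cpu=True,      # offload activation checkpoints to CPU memory
    profile=True,                # enable layerwise forward/backward profiling
)

# Then use DeepSpeed's checkpoint function in place of
# torch.utils.checkpoint.checkpoint inside the model's forward pass, e.g.:
#   hidden = deepspeed.checkpointing.checkpoint(transformer_layer, hidden)
```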
@@ -190,7 +190,7 @@ NVIDIA, or any training optimizer that extends torch's `torch.optim.Optimizer` c
We introduce an efficient implementation of the Adam optimizer on CPU that improves the parameter-update
performance by nearly an order of magnitude. We use the AVX SIMD instructions on the Intel x86 architecture
for the CPU-Adam implementation. We support both the AVX-512 and AVX-2 instruction sets. DeepSpeed uses
AVX-2 by default, which can be switched to AVX-512 by setting the build flag `DS_BUILD_AVX512` to 1 when
installing DeepSpeed. Using AVX-512, we observe 5.1x to 6.5x speedups over torch-adam for model sizes between
1 and 10 billion parameters.
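To try the CPU-Adam optimizer directly, here is a minimal sketch; the `deepspeed.ops.adam.DeepSpeedCPUAdam` import path and the build command in the first comment are assumptions worth double-checking:

```python
# Assumed build step: DS_BUILD_AVX512=1 pip install deepspeed
# switches the CPU-Adam kernel from the default AVX-2 path to AVX-512.
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam  # assumed import path

model = torch.nn.Linear(1024, 1024)  # toy CPU model
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-4, weight_decay=0.01)

loss = model(torch.randn(8, 1024)).sum()
loss.backward()
optimizer.step()  # parameter update runs in the vectorized CPU kernel
```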
@@ -12,4 +12,4 @@ DeepSpeed offers sparse attention kernels, an instrumental technology to support
* Brief overview, see our [press release]({{ site.press_release_v3 }}).
* Detailed technology deep dive, see our [blog post](https://www.deepspeed.ai/news/2020/09/08/sparse-attention.html).
* Tutorial on how to use sparse attention, see our [Sparse attention tutorial](https://www.deepspeed.ai/tutorials/sparse-attention/).
* The source code for our sparse attention kernels can be found in the [DeepSpeed repo](https://github.com/microsoft/deepspeed) and BERT pre-training code using sparse attention can be found in the [DeepSpeedExamples repo](https://github.com/microsoft/deepspeedexamples).
@@ -120,7 +120,7 @@ Alternatively, we show how the standard `mpirun` launcher can be used for launch
mpirun -np [#processes] -ppn [#GPUs on each node] -hostfile [hostfile] [MPI flags] bash run_squad_mpi_onebitadam.sh
```
For example, in order to use 32 GPUs (4 GPUs/node, 8 nodes in total) with InfiniBand support, you can use the `mpirun` launcher packaged with the MVAPICH2 library. Please run the following command:
```shell
mpirun -np 32 -ppn 4 -hostfile hosts -env MV2_USE_CUDA=1 -env MV2_SUPPORT_DL=1 -env MV2_ENABLE_AFFINITY=0 -env MV2_SMP_USE_CMA=0 bash run_squad_mpi_onebitadam.sh
@@ -166,7 +166,7 @@ We fixed the learning rate to 3e-5. The table below shows the F1 and the EM scor
***Training Speed and Scalability:***
1-bit Adam enables up to a 2.7x overall speedup in training speed for SQuAD fine-tuning. This is made possible by up to 6.2x higher throughput during the compressed stage of the algorithm, as shown in Figure 1.
![SQuAD Finetuning](/assets/images/squad-scaling.png){: .align-center}
@@ -75,7 +75,7 @@ net = PipelineModule(layers=net, num_stages=2)
```
`PipelineModule` uses its `layers` argument as the sequence of layers that
comprise the model. After initialization, `net` is divided into two pipeline
stages and its layers are moved to the corresponding GPUs. If more than two GPUs
are present, DeepSpeed will also use hybrid data parallelism.
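For context, a toy sketch of the pattern described above (layer sizes are illustrative, and the script is assumed to run under the `deepspeed` launcher with distributed initialization already done, as in the pipeline parallelism tutorial):

```python
import torch.nn as nn
from deepspeed.pipe import PipelineModule

# Express the model as a flat sequence of layers; DeepSpeed splits this
# sequence into pipeline stages and places each stage on its own GPU.
layers = [
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1024),
]
net = PipelineModule(layers=layers, num_stages=2)
# With 2 GPUs, each stage owns one GPU; with 4 GPUs, DeepSpeed also
# replicates the two-stage pipeline for hybrid data parallelism.
```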
**Note:** The total number of GPUs must be divisible by the number of pipeline stages.
@@ -95,7 +95,7 @@ Note that the above configuration assumes training on 64 X 32GB V100 GPUs. Each
Table 1. Pre-training hyperparameters
**Note:** DeepSpeed now supports PreLayerNorm as the default way for training BERT, because of its ability to avoid vanishing gradients, stabilize optimization, and improve performance, as described in our fastest BERT training [blog post](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html). We therefore support the switchable Transformer block directly on BERT with PreLayerNorm. The implementation can be found at "example\bing_bert\nvidia\modelingpreln_layerdrop.py".
## Fine-tuning with DeepSpeed on GLUE Tasks
@@ -79,7 +79,7 @@ Next, we need to update the DeepSpeed json configuration, as shown below, to ena
}
```
In the above changes, we have set the _stage_ field to 2, and configured other optimization knobs that are available in ZeRO stage 2. For example, we have enabled _contiguous_gradients_ to reduce memory fragmentation during the backward pass. A full description of these optimization knobs is available [here](/docs/config-json/#zero-optimizations-for-fp16-training). With these changes, we can now launch the training run.
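For orientation, the ZeRO section being described might look roughly like the sketch below, written here as a Python dict rather than the JSON file; the key names follow the config-json reference linked above, and the bucket sizes are placeholder values:

```python
# Sketch of a DeepSpeed config enabling ZeRO stage 2 (illustrative values).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                    # partition optimizer states and gradients
        "contiguous_gradients": True,  # reduce memory fragmentation in backward
        "overlap_comm": True,          # overlap gradient reduction with backward
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8,
    },
}
```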
Here is a screenshot of the training log: