## Reproducing Fastest BERT Training Results with DeepSpeed

We achieve the fastest BERT training time while remaining competitive across the industry in terms of achieving an F1 score of 90.5 or better on the SQuAD 1.1 dev set. Please follow the [BERT fine-tuning](/tutorials/bert-finetuning/) tutorial to fine-tune your model that was pre-trained using the transformer kernel and reproduce the SQuAD F1 score.

- We complete BERT pre-training in 44 minutes using 1024 V100 GPUs (64 NVIDIA DGX-2 nodes). In comparison, the previous SOTA from NVIDIA takes 47 minutes using 1472 V100 GPUs. DeepSpeed is not only faster but also uses 30% fewer resources. Using the same 1024 GPUs, NVIDIA BERT is 52% slower than DeepSpeed, taking 67 minutes to train.
- Compared with the original BERT training from Google, which took about 96 hours to reach parity on 64 TPU2 chips, we train in less than 9 hours on 4 DGX-2 nodes (64 V100 GPUs).
- On 256 GPUs, it took us 2.4 hours, faster than the state-of-the-art result (3.9 hours) from NVIDIA using their SuperPOD on the same number of GPUs.

Training transformer-based networks needs to be highly efficient in terms of performance, in order to allow scientists to explore different models across various application domains in a reasonable amount of time. To this end, we have developed a new kernel for transformer networks which includes several optimizations specific to these layers. These optimizations boost the training throughput on a single GPU and scale well as we increase the number of GPUs. For more information on the details of the transformer kernel, please visit our recent blog post on the [fastest BERT training](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html).

The environment parameters of the transformer kernel include:

1. `local_rank`: The rank of the current GPU running the transformer kernel
2. `seed`: The random seed for the dropout layer
3. `fp16`: Enable half-precision computation
4. `initializer_range`: BERT's initializer range

High-performance optimization flag:

1. `stochastic_mode`: By turning on this flag, training can run about 2% faster on average. Note that this flag introduces some level of non-determinism and can produce different results on different runs. However, we have seen that pre-training tasks such as BERT are not affected by enabling it and can still obtain a high accuracy level. On the other hand, for downstream tasks such as fine-tuning, we recommend turning it off in order to be able to reproduce the same result through the regular kernel execution.

The memory-optimization flags consist of:

1. `attn_dropout_checkpoint`: Enable checkpointing of attention dropout to save memory
2. `normalize_invertible`: Enable invertible LayerNorm execution (dropping the input activation)
3. `gelu_checkpoint`: Enable checkpointing of Gelu activation output to save memory

To illustrate the required model configuration changes to use the transformer kernel in model training, we use a BERT model and go through the different configurations in order to support the different sequence lengths and batch sizes. Please see the instructions in the [BERT pre-training tutorial](/tutorials/bert-pretraining/).

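As a rough sketch of what this looks like in code, the snippet below builds a transformer-kernel configuration using the environment parameters listed above and stacks the custom layers. It assumes the `DeepSpeedTransformerConfig`/`DeepSpeedTransformerLayer` API of recent DeepSpeed releases; the model dimensions (batch size, hidden size, heads, layer count) are illustrative values, not settings prescribed by this tutorial.

```python
import torch
from deepspeed import DeepSpeedTransformerConfig, DeepSpeedTransformerLayer

# Illustrative BERT-Large-like dimensions; adjust to your model and setup.
num_layers = 24
config = DeepSpeedTransformerConfig(
    batch_size=64,            # micro-batch size per GPU
    hidden_size=1024,
    intermediate_size=4096,
    heads=16,
    attn_dropout_ratio=0.1,
    hidden_dropout_ratio=0.1,
    num_hidden_layers=num_layers,
    initializer_range=0.02,   # BERT's initializer range
    local_rank=0,             # rank of the GPU running the kernel
    seed=42,                  # random seed for the dropout layers
    fp16=True,                # enable half-precision computation
    stochastic_mode=True,     # ~2% faster; turn off for reproducible fine-tuning
)

# Replace the standard transformer encoder blocks with the custom kernel layers.
encoder_layers = torch.nn.ModuleList(
    [DeepSpeedTransformerLayer(config) for _ in range(num_layers)]
)
```
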
### **Memory Optimization Flags**

By setting the `normalize_invertible` flag, we force the kernel to drop the input activations to the transformer's LayerNorm layers; the invertible execution lets the backward pass proceed from the layer outputs instead, reducing activation memory.

The `attn_dropout_checkpoint` and `gelu_checkpoint` flags refer to the checkpointing approach, in which we drop the inputs to some parts of the transformer layer (attention dropout and Gelu) in order to save an important part of the activation memory. Based on our performance profiling, the cost of rematerializing these two is negligible, and the performance benefit we gain from running with larger batch sizes compensates for it.

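For illustration, these memory-optimization flags would be switched on through the same configuration object as above. This is a sketch under the assumption that the flags map directly to `DeepSpeedTransformerConfig` keyword arguments; the sizes are placeholders.

```python
from deepspeed import DeepSpeedTransformerConfig

# Trade a small amount of recomputation in the backward pass for activation
# memory, so that a larger micro-batch or a longer sequence fits on the GPU.
memory_tight_config = DeepSpeedTransformerConfig(
    batch_size=16,                 # placeholder micro-batch size
    hidden_size=1024,
    heads=16,
    num_hidden_layers=24,
    fp16=True,
    attn_dropout_checkpoint=True,  # recompute attention dropout during backward
    normalize_invertible=True,     # drop the LayerNorm input activations
    gelu_checkpoint=True,          # recompute the Gelu output during backward
)
```
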
The following table shows which memory optimization flags need to be turned on when running BERT-Large on an NVIDIA V100 GPU with 32GB of memory, considering different micro-batch sizes and sequence lengths. For the two sequence lengths used in our experiments, 128 and 512, we have seen that a larger batch size improves the overall training performance. Please see our [blog post](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html) for more information regarding the performance evaluation of these configurations.

### **Enable Transformer Kernel**

As mentioned earlier, in order to run the transformer network using the custom DeepSpeed kernel, we only need to pass the `deepspeed_transformer_kernel` option when running the training script. Below, we show an example of how we pass this parameter to the `deepspeed` launcher, along with the rest of the parameters for the BERT pre-training task.

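For reference, a minimal sketch of such an invocation is shown here; the training script name and the DeepSpeed configuration file are placeholders for your own BERT pre-training setup, and only `--deepspeed_transformer_kernel` is the option discussed in this tutorial.

```bash
deepspeed deepspeed_train.py \
  --deepspeed \
  --deepspeed_config bert_pretraining_config.json \
  --deepspeed_transformer_kernel
```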