Unverified commit 6379292c authored by Conglong Li, committed by GitHub

Improving deepspeed.ai website (#269)

* syntax/typo fix

* add README for documentation

* fix links

* update navigation

* typo fix

* docs readme fix
parent 88c319aa
# DeepSpeed Documentation
This directory contains the DeepSpeed documentation. There are three ways to read it:
## 1. Access [deepspeed.ai](https://www.deepspeed.ai/)
This is the recommended way to read the documentation.
## 2. Directly read files in this directory
We do not recommend this because this directory is organized for building the [deepspeed.ai](https://www.deepspeed.ai/) website with Jekyll, so some of the files are not actually DeepSpeed documentation. In addition, some of the URL links in the documentation only work through the web pages generated by Jekyll.
## 3. Build [deepspeed.ai](https://www.deepspeed.ai/) website locally using Jekyll
This is recommended for local website development or when you do not have internet access. You can follow the instructions [here](https://help.github.com/en/github/working-with-github-pages/testing-your-github-pages-site-locally-with-jekyll) to install Ruby, Bundler, and Jekyll. Then run `bundle exec jekyll serve` in this directory and view the website in your web browser at `http://localhost:4000`.
......@@ -34,6 +34,8 @@ collections:
- azure.md
- cifar-10.md
- bert-pretraining.md
- bert-finetuning.md
- transformer_kernel.md
- megatron.md
- 1Cycle.md
- lrrt.md
......
......@@ -54,6 +54,10 @@ lnav:
url: /tutorials/cifar-10/
- title: "BERT Pre-training"
url: /tutorials/bert-pretraining/
- title: "BingBertSQuAD Fine-tuning"
url: /tutorials/bert-finetuning/
- title: "DeepSpeed Transformer Kernel"
url: /tutorials/transformer_kernel/
- title: "Megatron-LM GPT2"
url: /tutorials/megatron/
- title: "1-Cycle Schedule"
......
......@@ -34,10 +34,10 @@ The script `<client_entry.py>` will execute on the resources specified in `<host
## Model Parallelism
### Support for Custom Model Parallelism
DeepSpeed supports all forms of model parallelism including tensor slicing based
approaches such as the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), or
pipelined parallelism approaches such as
[PipeDream](https://github.com/msr-fiddle/pipedream) and
[GPipe](https://github.com/kakaobrain/torchgpipe). It does so by only requiring the model
parallelism framework to provide a *model parallelism unit* (`mpu`) that implements a few
bookkeeping functionalities:
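For illustration, in the degenerate case where no tensor slicing is used, such an `mpu` can be sketched as below. This is a minimal sketch, not DeepSpeed's official API: the method names mirror Megatron-LM's `mpu` module, while the class name and single-rank group setup are assumptions made for this example.

```python
# Minimal sketch of a "model parallelism unit" for the degenerate case of
# model-parallel size 1. Method names mirror Megatron-LM's mpu module; the
# class name and group construction here are illustrative assumptions.
import torch.distributed as dist

class TrivialMPU:
    def __init__(self):
        # new_group is a collective, so every rank creates every singleton
        # model-parallel group in the same order.
        groups = [dist.new_group([rank]) for rank in range(dist.get_world_size())]
        self._mp_group = groups[dist.get_rank()]

    def get_model_parallel_rank(self):
        return 0

    def get_model_parallel_world_size(self):
        return 1

    def get_model_parallel_group(self):
        return self._mp_group

    def get_data_parallel_rank(self):
        return dist.get_rank()

    def get_data_parallel_world_size(self):
        return dist.get_world_size()

    def get_data_parallel_group(self):
        return dist.group.WORLD
```

A real tensor-slicing framework would return its own process groups instead; the resulting object is then handed to DeepSpeed at initialization time (for example via the `mpu` argument of `deepspeed.initialize`).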
......@@ -118,7 +118,7 @@ micro-batch, specially when the number of micro-batches per effective batch is l
### Communication Overlapping
During back propagation, DeepSpeed can overlap the communication required for averaging
parameter gradients that have already been computed with the ongoing gradient computation.
This computation-communication overlap allows DeepSpeed to achieve higher throughput even
at modest batch sizes.
## Training Features
......@@ -177,9 +177,9 @@ DeepSpeed makes it easy to train with large batch sizes by enabling the LAMB Opt
For more details on LAMB, see the [LAMB paper](https://arxiv.org/pdf/1904.00962.pdf).
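For illustration, a minimal `deepspeed_config` selecting LAMB might look like the sketch below (written here as a Python dict). The batch size and hyperparameter values are placeholders rather than recommendations; consult the DeepSpeed configuration docs for the full schema.

```python
# Hedged sketch of a deepspeed_config that selects the LAMB optimizer.
# All numeric values are placeholders, not tuned recommendations.
import json

ds_config = {
    "train_batch_size": 4096,   # the large effective batch LAMB is meant to enable
    "optimizer": {
        "type": "Lamb",
        "params": {
            "lr": 2e-3,
            "weight_decay": 0.01
        }
    },
    "fp16": {
        "enabled": True
    }
}

with open("deepspeed_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```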
### Memory-Efficient Training with ZeRO Optimizer
DeepSpeed can train models with up to 13 billion parameters without model parallelism, and
models with up to 200 billion parameters with 16-way model parallelism. This leap in
model size is possible through the memory efficiency achieved via the ZeRO Optimizer. For
more details, see the [ZeRO paper](https://arxiv.org/abs/1910.02054).
......@@ -189,8 +189,7 @@ DeepSpeed can simplify checkpointing for you regardless of whether you are using
parallel training, model parallel training, mixed-precision training, a mix of these
three, or using the ZeRO optimizer to enable larger model sizes.
Please see the [Getting Started](/getting-started/) guide
and the [core API doc](https://deepspeed.readthedocs.io/) for more details.
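As a quick sketch of the pattern (assuming `model_engine` is the engine returned by `deepspeed.initialize`, with placeholder directory and tag names):

```python
# Checkpointing through the DeepSpeed engine; model_engine is assumed to be
# the object returned by deepspeed.initialize(), and the directory/tag names
# below are placeholders.
ckpt_dir, ckpt_tag = "./checkpoints", "step_1000"

# Save model, optimizer, and (if enabled) fp16/ZeRO state in one call.
model_engine.save_checkpoint(ckpt_dir, ckpt_tag)

# Load returns the path that was restored plus any client state that was
# saved alongside the engine state.
load_path, client_state = model_engine.load_checkpoint(ckpt_dir, ckpt_tag)
```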
## Advanced parameter search
DeepSpeed supports multiple Learning Rate Schedules to enable faster convergence for
......@@ -210,7 +209,7 @@ can automatically handle batch creation appropriately.
## Performance Analysis and Debugging
For performance debugging, DeepSpeed can give you a detailed breakdown of the time spent
in different parts of the training by simply enabling it in the `deepspeed_config`
file.
Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
```json
......
......@@ -13,8 +13,8 @@ If you don't already have an Azure account please see more details here: [https:
To help with launching Azure instances we suggest using the [Azure
CLI](https://docs.microsoft.com/en-us/cli/azure/?view=azure-cli-latest). We have created
several helper scripts to get you quickly started using DeepSpeed with Azure.
* Install Azure CLI on your local box: [https://docs.microsoft.com/en-us/cli/azure/install-azure-cli](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli).
* Alternatively you can use the Azure in-browser shell: [https://shell.azure.com/](https://shell.azure.com/).
## Create an SSH key
Generate an SSH key that will be used across this tutorial to SSH into your VMs and
......
......@@ -343,7 +343,12 @@ about 96 hours to reach parity on 64 TPU2 chips, we train in less than 9 hours o
hours) from NVIDIA using their superpod on the same number of GPUs
([link](https://devblogs.nvidia.com/training-bert-with-gpus/)).
![DeepSpeed BERT Training Time](/assets/images/end-to-end-bert-training.png){: .align-center}
| Number of nodes | Number of V100 GPUs | Time |
| --------------- | ------------------- | ------------ |
| 1 DGX-2 | 16 | 33 hr 13 min |
| 4 DGX-2 | 64 | 8 hr 41 min |
| 16 DGX-2 | 256 | 144 min |
| 64 DGX-2 | 1024 | 44 min |
Our configuration for the BERT training result above can be reproduced with
the scripts/json configs in our DeepSpeedExamples repo. Below is a table containing a
......@@ -377,7 +382,7 @@ for more details in
![DeepSpeed Single GPU Bert Training Throughput 128](/assets/images/transformer_kernel_perf_seq128.PNG){: .align-center}
![DeepSpeed Single GPU Bert Training Throughput 512](/assets/images/transformer_kernel_perf_seq512.PNG){: .align-center}
Compared to SOTA, DeepSpeed significantly improves single-GPU performance for transformer-based models like BERT. The figure above shows the single-GPU throughput of training BERT-Large optimized through DeepSpeed, compared with two well-known PyTorch implementations, NVIDIA BERT and HuggingFace BERT. DeepSpeed reaches throughputs as high as 64 and 53 teraflops (corresponding to 272 and 52 samples/second) for sequence lengths of 128 and 512, respectively, exhibiting up to 28% throughput improvement over NVIDIA BERT and up to 62% over HuggingFace BERT. We also support up to 1.8x larger batch sizes without running out of memory.
......
......@@ -3,8 +3,8 @@ title: "CIFAR-10 Tutorial"
excerpt: "Train your first model with DeepSpeed!"
---
If you haven't already, we advise you to first read through the
[Getting Started](/getting-started/) guide before stepping through this
tutorial.
In this tutorial we will be adding DeepSpeed to the CIFAR-10 model, which is a small image classification model.
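At a high level, the integration amounts to wrapping the model with `deepspeed.initialize` and letting the returned engine drive the training loop, roughly as sketched below; `args`, `net`, `trainset`, and `criterion` are placeholders for the tutorial's own objects.

```python
# Rough sketch of adding DeepSpeed to a CIFAR-10 style training script.
# args, net, trainset, and criterion stand in for the tutorial's own objects.
import deepspeed

model_engine, optimizer, trainloader, _ = deepspeed.initialize(
    args=args,                          # parsed command-line args (with DeepSpeed flags)
    model=net,                          # the plain PyTorch model
    model_parameters=net.parameters(),
    training_data=trainset)             # DeepSpeed builds the distributed data loader

for inputs, labels in trainloader:
    inputs = inputs.to(model_engine.local_rank)
    labels = labels.to(model_engine.local_rank)

    outputs = model_engine(inputs)
    loss = criterion(outputs, labels)

    model_engine.backward(loss)   # replaces loss.backward()
    model_engine.step()           # replaces optimizer.step()
```

The script is then launched with the `deepspeed` launcher together with a JSON config file, as described in the Getting Started guide.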
......
......@@ -46,7 +46,7 @@ tests. Note that [pytest-forked](https://github.com/pytest-dev/pytest-forked) an
Model tests require four GPUs and training data downloaded for
[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/).
To execute model tests, first [install DeepSpeed](/getting-started/#installation). The
[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/) repository is cloned
as part of this process. Next, execute the model test driver:
```bash
......@@ -59,7 +59,7 @@ Note that the `--forked` flag is not necessary for the model tests.
This project welcomes contributions and suggestions. Most contributions require you to
agree to a Contributor License Agreement (CLA) declaring that you have the right to, and
actually do, grant us the rights to use your contribution. For details, visit
[https://cla.opensource.microsoft.com](https://cla.opensource.microsoft.com).
When you submit a pull request, a CLA bot will automatically determine whether you need
to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply
......