Unverified commit 6379292c authored by Conglong Li, committed by GitHub

Improving deepspeed.ai website (#269)

* syntax/typo fix

* add README for documentation

* fix links

* update navigation

* typo fix

* docs readme fix
parent 88c319aa
# DeepSpeed Documentation
This directory contains the DeepSpeed documentation. There are three ways to read it:
## 1. Access [deepspeed.ai](https://www.deepspeed.ai/)
This is the recommended way to read the documentation.
## 2. Directly read files in this directory
We do not recommend this because this directory is organized for building the [deepspeed.ai](https://www.deepspeed.ai/) website with Jekyll, so some of the files are not actually DeepSpeed documentation. In addition, some of the URL links in the documentation only work through the web pages generated by Jekyll.
## 3. Build [deepspeed.ai](https://www.deepspeed.ai/) website locally using Jekyll
This is recommended for local website development or when you do not have internet access. You can follow the instructions [here](https://help.github.com/en/github/working-with-github-pages/testing-your-github-pages-site-locally-with-jekyll) to install Ruby, Bundler, and Jekyll. Then run `bundle exec jekyll serve` in this directory and view the website in your web browser at `http://localhost:4000`.
......@@ -34,6 +34,8 @@ collections:
- azure.md
- cifar-10.md
- bert-pretraining.md
- bert-finetuning.md
- transformer_kernel.md
- megatron.md
- 1Cycle.md
- lrrt.md
......
......@@ -54,6 +54,10 @@ lnav:
url: /tutorials/cifar-10/
- title: "BERT Pre-training"
url: /tutorials/bert-pretraining/
- title: "BingBertSQuAD Fine-tuning"
url: /tutorials/bert-finetuning/
- title: "DeepSpeed Transformer Kernel"
url: /tutorials/transformer_kernel/
- title: "Megatron-LM GPT2"
url: /tutorials/megatron/
- title: "1-Cycle Schedule"
......
......@@ -34,10 +34,10 @@ The script `<client_entry.py>` will execute on the resources specified in `<host
## Model Parallelism
### Support for Custom Model Parallelism
DeepSpeed supports all forms of model parallelism including tensor slicing based
approaches such as the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), or
pipelined parallelism approaches such as
[PipeDream](https://github.com/msr-fiddle/pipedream) and
[GPipe](https://github.com/kakaobrain/torchgpipe). It does so by only requiring the model
parallelism framework to provide a *model parallelism unit* (`mpu`) that implements a few
bookkeeping functionalities:
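For illustration, in the degenerate case where no tensor slicing is used, such an `mpu` can be sketched as below. This is a minimal sketch, not DeepSpeed's official API: the method names mirror Megatron-LM's `mpu` module, while the class name and single-rank group setup are assumptions made for this example.

```python
# Minimal sketch of a "model parallelism unit" for the degenerate case of
# model-parallel size 1. Method names mirror Megatron-LM's mpu module; the
# class name and group construction here are illustrative assumptions.
import torch.distributed as dist

class TrivialMPU:
    def __init__(self):
        # new_group is a collective, so every rank creates every singleton
        # model-parallel group in the same order.
        groups = [dist.new_group([rank]) for rank in range(dist.get_world_size())]
        self._mp_group = groups[dist.get_rank()]

    def get_model_parallel_rank(self):
        return 0

    def get_model_parallel_world_size(self):
        return 1

    def get_model_parallel_group(self):
        return self._mp_group

    def get_data_parallel_rank(self):
        return dist.get_rank()

    def get_data_parallel_world_size(self):
        return dist.get_world_size()

    def get_data_parallel_group(self):
        return dist.group.WORLD
```

A real tensor-slicing framework would return its own process groups instead; the resulting object is then handed to DeepSpeed at initialization time (for example via the `mpu` argument of `deepspeed.initialize`).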
......@@ -118,7 +118,7 @@ micro-batch, specially when the number of micro-batches per effective batch is l
### Communication Overlapping
During back propagation, DeepSpeed can overlap the communication required for averaging
parameter gradients that have already been computed with the ongoing gradient computation.
This computation-communication overlap allows DeepSpeed to achieve higher throughput even
at modest batch sizes.
## Training Features
......@@ -177,9 +177,9 @@ DeepSpeed makes it easy to train with large batch sizes by enabling the LAMB Opt
For more details on LAMB, see the [LAMB paper](https://arxiv.org/pdf/1904.00962.pdf).
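For illustration, a minimal `deepspeed_config` selecting LAMB might look like the sketch below (written here as a Python dict). The batch size and hyperparameter values are placeholders rather than recommendations; consult the DeepSpeed configuration docs for the full schema.

```python
# Hedged sketch of a deepspeed_config that selects the LAMB optimizer.
# All numeric values are placeholders, not tuned recommendations.
import json

ds_config = {
    "train_batch_size": 4096,   # the large effective batch LAMB is meant to enable
    "optimizer": {
        "type": "Lamb",
        "params": {
            "lr": 2e-3,
            "weight_decay": 0.01
        }
    },
    "fp16": {
        "enabled": True
    }
}

with open("deepspeed_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```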
### Memory-Efficient Training with ZeRO Optimizer
DeepSpeed can train models with up to 13 billion parameters without model parallelism, and
models with up to 200 billion parameters with 16-way model parallelism. This leap in
model size is possible through the memory efficiency achieved via the ZeRO Optimizer. For
more details, see the [ZeRO paper](https://arxiv.org/abs/1910.02054).
......@@ -189,8 +189,7 @@ DeepSpeed can simplify checkpointing for you regardless of whether you are using
parallel training, model parallel training, mixed-precision training, a mix of these
three, or using the ZeRO optimizer to enable larger model sizes.
Please see the [Getting Started](/getting-started/) guide
and the [core API doc](https://deepspeed.readthedocs.io/) for more details.
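As a quick sketch of the pattern (assuming `model_engine` is the engine returned by `deepspeed.initialize`, with placeholder directory and tag names):

```python
# Checkpointing through the DeepSpeed engine; model_engine is assumed to be
# the object returned by deepspeed.initialize(), and the directory/tag names
# below are placeholders.
ckpt_dir, ckpt_tag = "./checkpoints", "step_1000"

# Save model, optimizer, and (if enabled) fp16/ZeRO state in one call.
model_engine.save_checkpoint(ckpt_dir, ckpt_tag)

# Load returns the path that was restored plus any client state that was
# saved alongside the engine state.
load_path, client_state = model_engine.load_checkpoint(ckpt_dir, ckpt_tag)
```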
## Advanced parameter search
DeepSpeed supports multiple Learning Rate Schedules to enable faster convergence for
......@@ -210,7 +209,7 @@ can automatically handle batch creation appropriately.
## Performance Analysis and Debugging
For performance debugging, DeepSpeed can give you a detailed breakdown of the time spent
in different parts of the training by simply enabling it in the `deepspeed_config`
file.
Please see the [core API doc](https://deepspeed.readthedocs.io/) for more details.
```json
......
......@@ -13,8 +13,8 @@ If you don't already have an Azure account please see more details here: [https:
To help with launching Azure instances we suggest using the [Azure
CLI](https://docs.microsoft.com/en-us/cli/azure/?view=azure-cli-latest). We have created
several helper scripts to get you quickly started using DeepSpeed with Azure.
* Install Azure CLI on your local box: [https://docs.microsoft.com/en-us/cli/azure/install-azure-cli](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli).
* Alternatively you can use the Azure in-browser shell: [https://shell.azure.com/](https://shell.azure.com/).
## Create an SSH key
Generate an SSH key that will be used across this tutorial to SSH into your VMs and
......
......@@ -343,7 +343,12 @@ about 96 hours to reach parity on 64 TPU2 chips, we train in less than 9 hours o
hours) from NVIDIA using their superpod on the same number of GPUs
([link](https://devblogs.nvidia.com/training-bert-with-gpus/)).
![DeepSpeed BERT Training Time](/assets/images/end-to-end-bert-training.png){: .align-center}
| Number of nodes | Number of V100 GPUs | Time |
| --------------- | ------------------- | ------------ |
| 1 DGX-2 | 16 | 33 hr 13 min |
| 4 DGX-2 | 64 | 8 hr 41 min |
| 16 DGX-2 | 256 | 144 min |
| 64 DGX-2 | 1024 | 44 min |
Our configuration for the BERT training result above can be reproduced with
the scripts/json configs in our DeepSpeedExamples repo. Below is a table containing a
......@@ -377,7 +382,7 @@ for more details in
![DeepSpeed Single GPU Bert Training Throughput 128](/assets/images/transformer_kernel_perf_seq128.PNG){: .align-center}
![DeepSpeed Single GPU Bert Training Throughput 512](/assets/images/transformer_kernel_perf_seq512.PNG){: .align-center}
Compared to SOTA, DeepSpeed significantly improves single-GPU performance for transformer-based models like BERT. The figure above shows the single-GPU throughput of training BERT-Large optimized through DeepSpeed, compared with two well-known PyTorch implementations, NVIDIA BERT and HuggingFace BERT. DeepSpeed reaches throughputs as high as 64 and 53 teraflops (corresponding to 272 and 52 samples/second) for sequence lengths of 128 and 512, respectively, exhibiting up to 28% throughput improvement over NVIDIA BERT and up to 62% over HuggingFace BERT. We also support up to 1.8x larger batch sizes without running out of memory.
......
......@@ -3,8 +3,8 @@ title: "CIFAR-10 Tutorial"
excerpt: "Train your first model with DeepSpeed!"
---
If you haven't already, we advise you to first read through the
[Getting Started](/getting-started/) guide before stepping through this
tutorial.
In this tutorial we will be adding DeepSpeed to the CIFAR-10 model, which is a small image classification model.
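At a high level, the integration amounts to wrapping the model with `deepspeed.initialize` and letting the returned engine drive the training loop, roughly as sketched below; `args`, `net`, `trainset`, and `criterion` are placeholders for the tutorial's own objects.

```python
# Rough sketch of adding DeepSpeed to a CIFAR-10 style training script.
# args, net, trainset, and criterion stand in for the tutorial's own objects.
import deepspeed

model_engine, optimizer, trainloader, _ = deepspeed.initialize(
    args=args,                          # parsed command-line args (with DeepSpeed flags)
    model=net,                          # the plain PyTorch model
    model_parameters=net.parameters(),
    training_data=trainset)             # DeepSpeed builds the distributed data loader

for inputs, labels in trainloader:
    inputs = inputs.to(model_engine.local_rank)
    labels = labels.to(model_engine.local_rank)

    outputs = model_engine(inputs)
    loss = criterion(outputs, labels)

    model_engine.backward(loss)   # replaces loss.backward()
    model_engine.step()           # replaces optimizer.step()
```

The script is then launched with the `deepspeed` launcher together with a JSON config file, as described in the Getting Started guide.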
......
......@@ -46,7 +46,7 @@ tests. Note that [pytest-forked](https://github.com/pytest-dev/pytest-forked) an
Model tests require four GPUs and training data downloaded for
[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/).
To execute model tests, first [install DeepSpeed](/getting-started/#installation). The
[DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/) repository is cloned
as part of this process. Next, execute the model test driver:
```bash
......@@ -59,7 +59,7 @@ Note that the `--forked` flag is not necessary for the model tests.
This project welcomes contributions and suggestions. Most contributions require you to
agree to a Contributor License Agreement (CLA) declaring that you have the right to, and
actually do, grant us the rights to use your contribution. For details, visit
[https://cla.opensource.microsoft.com](https://cla.opensource.microsoft.com).
When you submit a pull request, a CLA bot will automatically determine whether you need
to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply
......