update v0.9.2

Commit 5bcc463d authored May 29, 2023 by aiss
20 changed files
--- a/docs/_posts/2021-03-08-zero3-offload.md
+++ b/docs/_posts/2021-03-08-zero3-offload.md
@@ -2,7 +2,7 @@
 title: "DeepSpeed ZeRO-3 Offload"
 excerpt: ""
 date: 2021-03-08 00:00:00
-tags: training ZeRO
+tags: training ZeRO English
 ---
 Today we are announcing the release of ZeRO-3 Offload, a highly efficient and easy to use implementation of ZeRO Stage 3 and ZeRO Offload combined, geared towards our continued goal of democratizing AI by making efficient large-scale DL training available to everyone.  The key benefits of ZeRO-3 Offload are:


--- a/docs/_posts/2021-05-05-MoQ.md
+++ b/docs/_posts/2021-05-05-MoQ.md
@@ -2,7 +2,7 @@
 title: "Mixture-of-Quantization: A novel quantization approach for reducing model size with minimal accuracy impact"
 excerpt: ""
 date: 2021-05-05 00:00:00
-tags: inference
+tags: inference English
 ---

 ## A unified suite for quantization-aware training and inference

--- a/docs/_posts/2021-05-05-inference-kernel-optimization.md
+++ b/docs/_posts/2021-05-05-inference-kernel-optimization.md
@@ -2,7 +2,7 @@
 title: "DeepSpeed Inference: Multi-GPU inference with customized inference kernels and quantization support"
 excerpt: ""
 date: 2021-03-16 00:00:00
-tags: inference
+tags: inference English
 ---
 While DeepSpeed supports training advanced large-scale models, using these trained models in the desired application scenarios is still challenging due to three major limitations in existing inference solutions: 1) lack of support for multi-GPU inference to fit large models and meet latency requirements, 2) limited GPU kernel performance when running inference with small batch sizes, and 3) difficulties in exploiting quantization, which includes both quantizing the model to reduce the model size and latency as well as supporting high-performance inference of quantized models without specialized hardware.


--- a/docs/_posts/2021-05-14-inference-release.md
+++ b/docs/_posts/2021-05-14-inference-release.md
@@ -3,5 +3,5 @@ title: "DeepSpeed: Accelerating large-scale model inference and training via sys
 date:   2021-05-14
 link: https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/
 excerpt: ""
-tags: inference
+tags: inference English
 ---
--- a/docs/_posts/2021-08-18-deepspeed-moe.md
+++ b/docs/_posts/2021-08-18-deepspeed-moe.md
@@ -3,5 +3,5 @@ title: "DeepSpeed powers 8x larger MoE model training with high performance"
 excerpt: ""
 link: https://www.microsoft.com/en-us/research/blog/deepspeed-powers-8x-larger-moe-model-training-with-high-performance/
 date: 2021-08-18 00:00:00
-tags: training
+tags: training English
 ---
--- a/docs/_posts/2021-11-15-autotuning.md
+++ b/docs/_posts/2021-11-15-autotuning.md
@@ -2,7 +2,7 @@
 title: "Autotuning: Automatically discover the optimal DeepSpeed configuration that delivers good training speed"
 excerpt: ""
 date: 2021-11-16 10:00:00
-tags: training
+tags: training English
 toc: false
 ---


--- a/docs/_posts/2021-12-09-deepspeed-moe-nlg.md
+++ b/docs/_posts/2021-12-09-deepspeed-moe-nlg.md
@@ -2,7 +2,7 @@
 title: "DeepSpeed-MoE for NLG: Reducing the training cost of language models by 5 times"
 excerpt: ""
 date: 2021-12-09 22:00:00
-tags: training
+tags: training English
 ---

 Autoregressive transformer-based natural language generation (referred to as

--- a/docs/_posts/2022-01-19-moe-inference.md
+++ b/docs/_posts/2022-01-19-moe-inference.md
@@ -3,5 +3,5 @@ title: "DeepSpeed: Advancing MoE inference and training to power next-generation
 excerpt: ""
 link: https://www.microsoft.com/en-us/research/blog/deepspeed-advancing-moe-inference-and-training-to-power-next-generation-ai-scale/
 date: 2022-01-19 00:00:00
-tags: inference
+tags: inference English
 ---
--- a/docs/_posts/2022-03-21-amd-support.md
+++ b/docs/_posts/2022-03-21-amd-support.md
@@ -3,5 +3,5 @@ title: "Supporting efficient large model training on AMD Instinct GPUs with Deep
 excerpt: ""
 link: https://cloudblogs.microsoft.com/opensource/2022/03/21/supporting-efficient-large-model-training-on-amd-instinct-gpus-with-deepspeed/
 date: 2022-03-21 00:00:00
-tags: training ZeRO
+tags: training ZeRO English
 ---
--- a/docs/_posts/2022-07-26-deepspeed-azure.md
+++ b/docs/_posts/2022-07-26-deepspeed-azure.md
@@ -2,7 +2,7 @@
 title: "Azure empowers easy-to-use, high-performance, and hyperscale model training using DeepSpeed"
 excerpt: ""
 date: 2022-07-26 00:09:00
-tags: training azure
+tags: training azure English
 ---

 ## Introduction

--- a/docs/_posts/2022-09-10-zero-inference.md
+++ b/docs/_posts/2022-09-10-zero-inference.md
@@ -2,7 +2,7 @@
 title: "ZeRO-Inference: Democratizing massive model inference"
 excerpt: ""
 date: 2022-09-10 00:09:00
-tags: inference ZeRO
+tags: inference ZeRO English
 ---

 ## Introduction

--- a/docs/_posts/2022-10-11-mii.md
+++ b/docs/_posts/2022-10-11-mii.md
@@ -2,7 +2,7 @@
 title: "DeepSpeed-MII: instant speedup on 24,000+ open-source DL models with up to 40x cheaper inference"
 excerpt: ""
 date: 2022-10-11 00:09:00
-tags: inference
+tags: inference English
 ---

 [ ![Text Generation Models](/assets/images/mii/hero.png) ](/assets/images/mii/hero.png){: .align-center}

--- a/docs/_posts/2022-12-12-data-efficiency.md
+++ b/docs/_posts/2022-12-12-data-efficiency.md
@@ -2,7 +2,7 @@
 title: "DeepSpeed Data Efficiency: A composable library that makes better use of data, increases training efficiency, and improves model quality"
 excerpt: ""
 date: 2022-12-12 00:09:00
-tags: training
+tags: training English
 ---

 [ ![DeepSpeed Data Efficiency](/assets/images/data_efficiency/data_efficiecy_fig0.png) ](/assets/images/data_efficiency/data_efficiecy_fig0.png){: .align-center}

--- a/docs/_posts/2023-03-31-multi-modal.md
+++ b/docs/_posts/2023-03-31-multi-modal.md
+---
+title: "Scaling Large-Scale Generative Mixture-of-Expert Multimodal Model With VL-MoE"
+excerpt: ""
+date: 2023-03-31 00:09:00
+tags: training English
+---
+
+The field of Artificial Intelligence-Generated Content (AIGC) is rapidly growing, with the goal of making content creation more efficient and accessible. One of the most exciting areas of AIGC is the development of large-scale multi-modal models like [Flamingo](https://arxiv.org/abs/2204.14198), [BLIP](https://arxiv.org/abs/2301.12597), and [GPT-4](https://arxiv.org/abs/2303.08774), which can accept inputs from multiple sources, e.g., image, text, audio, etc., and generate outputs in a variety of formats. For example, images can be created from prompt text with Stable Diffusion and DALL-E, and a new feature in the upcoming Office can create slides with text, images, animations, etc., by leveraging the new Microsoft Office Copilot.
+
+Scaling up the model size is one common approach to boost the usability and capability of AIGC tasks. However, simply scaling up dense architectures (e.g., from GPT-1 to GPT-3) is usually extremely resource-intensive and time-consuming for both model training and inference. One effective way to tackle this challenge is to apply mixture of experts (MoE). In particular, recent [text-based MoE](https://arxiv.org/abs/2201.05596) and [vision-based MoE](https://arxiv.org/abs/2106.05974) studies have demonstrated that MoE models can significantly reduce the training and resource cost compared to a quality-equivalent dense model, or produce a higher quality model under the same training budget. To date, however, the effectiveness of jointly training MoE for multi-modal models is still not well understood. To explore this important capability, the [DeepSpeed team](https://www.deepspeed.ai/) is proud to announce our first large-scale generative mixture-of-experts (MoE) multimodal model, named [VL-MoE](https://arxiv.org/abs/2303.07226).
+
+[ ![Model architecture](/assets/images/vl_moe.png) ](/assets/images/vl_moe.png){: .align-center}
+
+*Figure 1: The new encoding process in our VL-MoE for various modality inputs, for which gray and colored blocks indicate non-activated and activated modules, respectively.*
+
+Specifically, we incorporate the MoE structure into the classical single-tower multi-modal model using the following components: (1) a shared self-attention module across modalities, (2) a pool of modality-specific experts in the feed-forward network (FFN), and (3) a sparse gated MoE extended from the dense FFN. Subsequently, under the same amount of training resources as that used in [VLMO](https://arxiv.org/abs/2111.02358) (200k training steps), we demonstrate VL-MoE's advantages over the state-of-the-art dense counterparts in the following two aspects:
+
+(1) **VL-MoE can achieve significant accuracy improvement in comparison to its dense counterparts.** Table 1 demonstrates that under the same training budget (i.e., with the same number of activated parameters for each token), VL-MoE Base with 32 experts achieves better accuracy than the VLMO-Base dense model on all four vision-language datasets.
+
+(2) **VL-MoE achieves similar model quality with a much smaller number of activated parameters compared to its dense counterparts.** Our results show that the finetuning performance of our VL-MoE is similar to that of the 3.1X larger VLMO-Large dense model (i.e., 3.1X more activated parameters per token). This can directly translate to an approximately 3.1X training cost reduction, as the training FLOPs for transformers are proportional to the activated model size per token.
+
+
+
+|                               | Param per Token (# Total Param) |       VQA      |     NLVR2     |     COCO    |  Flickr30K  |
+|                               |                                 | test-dev / std |  dev / test-P |   TR / IR   |   TR / IR   |
+|-------------------------------|:-------------------------------:|:--------------:|:-------------:|:-----------:|:-----------:|
+| Dense Counterparts            |                                 |                |               |             |             |
+| VLMO-dense Base               |           180M (180M)           |  76.64 / 76.89 | 82.77 / 83.34 | 74.8 / 57.2 | 92.3 / 79.3 |
+| VLMO-dense Large              |           560M (560M)           |  79.94 / 79.98 | 85.64 / 86.86 | 78.2 / 60.6 | 95.3 / 84.5 |
+| Ours (VL-MoE with 32 Experts) |                                 |                |               |             |             |
+| VL-MoE                        |           180M (1.9B)           |  78.23 / 78.65 | 85.54 / 86.77 | 79.4 / 61.2 | 96.1 / 84.9 |
+
+*Table 1: Comparison of finetuning accuracy results for different models used in vision-language classification tasks and image-text retrieval tasks.*
+
+A sophisticated MoE model design requires a highly efficient and scalable training system that can support multi-dimensional parallelism and efficient memory management. The [DeepSpeed MoE](https://www.microsoft.com/en-us/research/blog/deepspeed-advancing-moe-inference-and-training-to-power-next-generation-ai-scale/) training system offers such advanced capabilities, including easy-to-use APIs enabling flexible combinations of data, tensor, and expert parallelism. Furthermore, DeepSpeed MoE enables larger model scale than state-of-the-art systems by exploiting expert parallelism and [ZeRO optimizations](https://arxiv.org/abs/1910.02054) together. By leveraging the DeepSpeed MoE system, VL-MoE Base with 32 experts achieves model quality similar to VLMO-dense Large with about a 2.5x training speedup.
+
+The [DeepSpeed MoE](https://www.microsoft.com/en-us/research/blog/deepspeed-advancing-moe-inference-and-training-to-power-next-generation-ai-scale/) system is already open-sourced and can be easily used as a plug-and-play component to achieve high-performance, low-cost training for any large-scale MoE model. A tutorial on how to use DeepSpeed MoE is available [here](https://www.deepspeed.ai/tutorials/mixture-of-experts/). VL-MoE is currently being integrated as a model example in [DeepSpeed Examples](https://github.com/microsoft/DeepSpeedExamples). Please stay tuned for our upcoming updates on this thread.
--- a/docs/_posts/2023-04-24-deepspeed-chat-chinese.md
+++ b/docs/_posts/2023-04-24-deepspeed-chat-chinese.md
+---
+title: "DeepSpeed Chat: 一键式RLHF训练，让你的类ChatGPT千亿大模型提速省钱15倍"
+excerpt: ""
+link: https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-chat/chinese/README.md
+date: 2023-04-24 00:00:00
+tags: training ZeRO RLHF Chinese
+---
--- a/docs/_posts/2023-04-24-deepspeed-chat-japanese.md
+++ b/docs/_posts/2023-04-24-deepspeed-chat-japanese.md
+---
+title: "DeepSpeed Chat: ChatGPTライクなモデルを簡単・高速・低コストに、あらゆるスケールで学習"
+excerpt: ""
+link: https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-chat/japanese/README.md
+date: 2023-04-24 00:00:00
+tags: training ZeRO RLHF Japanese
+---
--- a/docs/_posts/2023-04-24-deepspeed-chat.md
+++ b/docs/_posts/2023-04-24-deepspeed-chat.md
+---
+title: "DeepSpeed Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales"
+excerpt: ""
+link: https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-chat/README.md
+date: 2023-04-24 00:00:00
+tags: training ZeRO RLHF English
+---
--- a/docs/_tutorials/automatic-tensor-parallelism.md
+++ b/docs/_tutorials/automatic-tensor-parallelism.md
@@ -7,6 +7,7 @@ tags: inference
   * [Introduction](#introduction)
   * [Example Script](#example-script)
        * [Launching](#launching)
+        * [T5 11B Inference Performance Comparison](#t5-11b-inference-performance-comparison)
        * [OPT 13B Inference Performance Comparison](#opt-13b-inference-performance-comparison)
   * [Supported Models](#supported-models)
   * [Unsupported Models](#unsupported-models)
@@ -65,7 +66,7 @@ With automatic tensor parallelism, we do not need to provide the injection polic

 # Example Script

-We can observe performance improvement with automatic tensor parallelism using the [inference test suite](https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/text-generation/inference-test.py). The script includes per token latency, bandwidth, throughput and memory checks for comparison. See the [README](https://github.com/microsoft/DeepSpeedExamples/tree/master/inference/huggingface/text-generation#deepspeed-huggingface-text-generation-examples) for more information.
+We can observe performance improvement with automatic tensor parallelism using the [inference test suite](https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/text-generation/inference-test.py). This script is for testing text-generation models and includes per token latency, bandwidth, throughput and memory checks for comparison. See the [README](https://github.com/microsoft/DeepSpeedExamples/tree/master/inference/huggingface/text-generation#deepspeed-huggingface-text-generation-examples) for more information.


 ## Launching
@@ -83,19 +84,31 @@ To enable tensor parallelism, you need to use the flag `ds_inference` for the co
 deepspeed --num_gpus <num_gpus> DeepSpeedExamples/inference/huggingface/text-generation/inference-test.py --name <model> --batch_size <batch_size> --test_performance --ds_inference
 ```

-## OPT 13B Inference Performance Comparison
+## T5 11B Inference Performance Comparison

 The following results were collected using V100 SXM2 32GB GPUs.

-### Max New Tokens = 50
+### Latency

-| Test       | Memory Allocated per GPU   | Max Batch Size   | Max Throughput per GPU   |
-| ---------- | -------------------------- | ---------------- | ------------------------ |
-| No TP      | 23.94 GB                   | 64               | 18.84 TFlops             |
-| 2 GPU TP   | 12.23 GB                   | 320              | 27.17 TFlops             |
-| 4 GPU TP   | 6.36 GB                    | 664              | 27.63 TFlops             |
+![T5 Latency Graph](/assets/images/auto-tp-chart-latency.png){: .align-center}
+
+### Throughput
+
+![T5 Throughput Graph](/assets/images/auto-tp-chart-throughput.png){: .align-center}
+
+### Memory
+
+| Test           | Memory Allocated per GPU   | Max Batch Size | Max Throughput per GPU |
+| -------------- | -------------------------- | -------------- | ---------------------- |
+| No TP or 1 GPU | 21.06 GB                   | 64             | 9.29 TFLOPS            |
+| 2 GPU TP       | 10.56 GB                   | 320            | 13.04 TFLOPS           |
+| 4 GPU TP       | 5.31 GB                    | 768            | 14.04 TFLOPS           |
+
+## OPT 13B Inference Performance Comparison
+
+The following results were collected using V100 SXM2 32GB GPUs.

-### Max New Tokens = 1024
+![OPT Throughput Graph](/assets/images/auto-tp-chart-opt-throughput.png){: .align-center}

 | Test       | Memory Allocated per GPU   | Max Batch Size   | Max Throughput per GPU   |
 | ---------- | -------------------------- | ---------------- | ------------------------ |

--- a/docs/_tutorials/bert-finetuning.md
+++ b/docs/_tutorials/bert-finetuning.md
@@ -201,7 +201,7 @@ the `--predict_batch_size` should also be 8.

 For further details about the transformer kernel, please see our [usage
 tutorial](/tutorials/transformer_kernel/) and [technical deep
-dive](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html) on
+dive](https://www.deepspeed.ai/2020/05/27/fastest-bert-training.html) on
 the fastest BERT training.


@@ -302,7 +302,7 @@ Table 4. The setting of memory-optimization flags for a range of micro-batch siz

 ### FineTuning model pre-trained with DeepSpeed Transformer Kernels

-Fine-tuning the model pre-trained using DeepSpeed Transformer and the recipe in [DeepSpeed Fast-Bert Training](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html) should yield F1 score of 90.5 and is expected to increase if you let the pre-training longer than suggested in the tutorial.
+Fine-tuning the model pre-trained using DeepSpeed Transformer and the recipe in [DeepSpeed Fast-Bert Training](https://www.deepspeed.ai/2020/05/27/fastest-bert-training.html) should yield F1 score of 90.5 and is expected to increase if you let the pre-training longer than suggested in the tutorial.

 To get these results, we do require some tuning of the dropout settings as described below:


--- a/docs/_tutorials/bert-pretraining.md
+++ b/docs/_tutorials/bert-pretraining.md
@@ -130,7 +130,7 @@ The `model` returned by `deepspeed.initialize` is the DeepSpeed _model
 engine_ that we will use to train the model using the forward, backward and
 step API. Since the model engine exposes the same forward pass API as
 `nn.Module` objects, there is no change in the forward pass.
-Thus, we only modify the the backward pass and optimizer/scheduler steps.
+Thus, we only modify the backward pass and optimizer/scheduler steps.

 Backward propagation is performed by calling `backward(loss)` directly with
 the model engine.
@@ -308,7 +308,7 @@ Note:

 For more details about the transformer kernel, please see [DeepSpeed
 Transformer Kernel](/tutorials/transformer_kernel/) and [DeepSpeed Fast-Bert
-Training](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html).
+Training](https://www.deepspeed.ai/2020/05/27/fastest-bert-training.html).


 ### Start Training
@@ -391,4 +391,4 @@ for more details in

 Compared to SOTA, DeepSpeed significantly improves single GPU performance for transformer-based models like BERT. The figure above shows the single GPU throughput of training BERT-Large optimized through DeepSpeed, compared with two well-known PyTorch implementations, NVIDIA BERT and HuggingFace BERT. DeepSpeed reaches throughputs as high as 64 and 53 teraflops (corresponding to 272 and 52 samples/second) for sequence lengths of 128 and 512, respectively, exhibiting up to 28% throughput improvements over NVIDIA BERT and up to 62% over HuggingFace BERT.  We also support up to 1.8x larger batch size without running out of memory.

-For more details on how we achieve the record breaking BERT training time please check out deep dive into DeepSpeed BERT [Fastest BERT Training](https://www.deepspeed.ai/news/2020/05/18/bert-record.html)
+For more details on how we achieve the record breaking BERT training time please check out deep dive into DeepSpeed BERT [Fastest BERT Training](https://www.deepspeed.ai/2020/05/18/bert-record.html)