[![Build Status](https://dev.azure.com/DeepSpeedMSFT/DeepSpeed/_apis/build/status/microsoft.DeepSpeed?branchName=master)](https://dev.azure.com/DeepSpeedMSFT/DeepSpeed/_build/latest?definitionId=1&branchName=master)
[![Documentation Status](https://readthedocs.org/projects/deepspeed/badge/?version=latest)](https://deepspeed.readthedocs.io/en/latest/?badge=latest)
[![License MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/Microsoft/DeepSpeed/blob/master/LICENSE)
[![Docker Pulls](https://img.shields.io/docker/pulls/deepspeed/deepspeed)](https://hub.docker.com/r/deepspeed/deepspeed)

[DeepSpeed](https://www.deepspeed.ai/) is a deep learning optimization
library that makes distributed training easy, efficient, and effective.

<p align="center"><i><b>10x Larger Models</b></i></p>
<p align="center"><i><b>10x Faster Training</b></i></p>
<p align="center"><i><b>Minimal Code Change</b></i></p>

DeepSpeed can train deep learning models with over a hundred billion parameters on the
current generation of GPU clusters, while achieving over 10x gains in system performance
compared to the state of the art. Early adopters of DeepSpeed have already produced
a language model (LM) with over 17B parameters called
[Turing-NLG](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft),
establishing a new SOTA in the LM category.

DeepSpeed is an important part of Microsoft’s new
[AI at Scale](https://www.microsoft.com/en-us/research/project/ai-at-scale/)
initiative to enable next-generation AI capabilities at scale; more
information is available [here](https://innovation.microsoft.com/en-us/exploring-ai-at-scale).

**_For further documentation, tutorials, and technical deep-dives please see [deepspeed.ai](https://www.deepspeed.ai/)!_**
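
To make the *Minimal Code Change* claim concrete, here is a minimal sketch of a
DeepSpeed training loop. It assumes `args` (parsed command-line arguments carrying
the DeepSpeed config), `model` (an ordinary PyTorch module), and `train_loader`
are defined elsewhere, and that the model returns its loss directly; all of these
names are illustrative.

```python
import deepspeed

# deepspeed.initialize wraps the model in a DeepSpeed engine that owns
# distributed data parallelism, mixed precision, and the optimizer.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=model.parameters())

for inputs, labels in train_loader:
    inputs = inputs.to(model_engine.device)
    labels = labels.to(model_engine.device)

    loss = model_engine(inputs, labels)  # forward (model returns loss here)
    model_engine.backward(loss)          # handles loss scaling under fp16
    model_engine.step()                  # optimizer step + LR schedule
```

Such a script is typically launched with the `deepspeed` launcher, e.g.
`deepspeed train.py --deepspeed_config ds_config.json` (script and config file
names are again illustrative).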

# News

* [2020/07/24] [DeepSpeed webinar](https://note.microsoft.com/MSR-Webinar-DeepSpeed-Registration-Live.html) on August 6th, 2020
  [![DeepSpeed webinar](docs/assets/images/webinar-aug2020.png)](https://note.microsoft.com/MSR-Webinar-DeepSpeed-Registration-Live.html)
Shaden Smith's avatar
Shaden Smith committed
33
34
* [2020/05/19] [ZeRO-2 & DeepSpeed: Shattering Barriers of Deep Learning Speed & Scale](https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/) <span style="color:dodgerblue">**[_NEW_]**</span>
* [2020/05/19] [An Order-of-Magnitude Larger and Faster Training with ZeRO-2](https://www.deepspeed.ai/news/2020/05/18/zero-stage2.html) <span style="color:dodgerblue">**[_NEW_]**</span>
* [2020/05/19] [The Fastest and Most Efficient BERT Training through Optimized Transformer Kernels](https://www.deepspeed.ai/news/2020/05/18/bert-record.html) <span style="color:dodgerblue">**[_NEW_]**</span>
* [2020/02/13] [Turing-NLG: A 17-billion-parameter language model by Microsoft](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/)
* [2020/02/13] [ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/)

# Table of Contents
| Section                                 | Description                                 |
| --------------------------------------- | ------------------------------------------- |
| [Why DeepSpeed?](#why-deepspeed)        |  DeepSpeed overview                         |
| [Features](#features)                   |  DeepSpeed features                         |
| [Further Reading](#further-reading)     |  DeepSpeed documentation, tutorials, etc.   |
| [Contributing](#contributing)           |  Instructions for contributing to DeepSpeed |
| [Publications](#publications)           |  DeepSpeed publications                     |

# Why DeepSpeed?
Training advanced deep learning models is challenging. Beyond model design,
model scientists also need to set up state-of-the-art training techniques
such as distributed training, mixed precision, gradient accumulation, and
checkpointing. Even then, scientists may not achieve the desired system
performance and convergence rate. Large model sizes are even more challenging:
a large model easily runs out of memory with pure data parallelism, and it is
difficult to use model parallelism. DeepSpeed addresses these challenges to
accelerate model development *and* training.
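
Much of this setup burden moves out of training code and into a single JSON
configuration file. Below is a rough sketch of such a configuration, written as
a Python dict for illustration; the keys follow the DeepSpeed config schema,
but the values are examples rather than recommendations.

```python
import json

# Illustrative DeepSpeed configuration; tune values for your own workload.
ds_config = {
    "train_batch_size": 256,
    "gradient_accumulation_steps": 4,     # gradient accumulation
    "gradient_clipping": 1.0,             # gradient clipping
    "fp16": {"enabled": True},            # 16-bit mixed precision
    "zero_optimization": {"stage": 2},    # ZeRO memory optimizations
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# DeepSpeed consumes the configuration as JSON, so write it to disk
# for the launcher (file name is illustrative).
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```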

# Features

Below we provide a brief feature list; see our detailed [feature
overview](https://www.deepspeed.ai/features/) for descriptions and usage.

* [Distributed Training with Mixed Precision](https://www.deepspeed.ai/features/#distributed-training-with-mixed-precision)
  * 16-bit mixed precision
  * Single-GPU/Multi-GPU/Multi-Node
* [Model Parallelism](https://www.deepspeed.ai/features/#model-parallelism)
  * Support for Custom Model Parallelism
  * Integration with Megatron-LM
* [Memory and Bandwidth Optimizations](https://www.deepspeed.ai/features/#memory-and-bandwidth-optimizations)
  * The Zero Redundancy Optimizer (ZeRO)
  * Constant Buffer Optimization (CBO)
  * Smart Gradient Accumulation
* [Training Features](https://www.deepspeed.ai/features/#training-features)
  * Simplified training API
  * Gradient Clipping
  * Automatic loss scaling with mixed precision
* [Training Optimizers](https://www.deepspeed.ai/features/#training-optimizers)
  * Fused Adam optimizer and arbitrary `torch.optim.Optimizer`
  * Memory bandwidth optimized FP16 Optimizer
  * Large Batch Training with LAMB Optimizer
  * Memory efficient Training with ZeRO Optimizer
* [Training Agnostic Checkpointing](https://www.deepspeed.ai/features/#training-agnostic-checkpointing) (see the sketch after this list)
* [Advanced Parameter Search](https://www.deepspeed.ai/features/#advanced-parameter-search)
  * Learning Rate Range Test
  * 1Cycle Learning Rate Schedule
* [Simplified Data Loader](https://www.deepspeed.ai/features/#simplified-data-loader)
* [Performance Analysis and Debugging](https://www.deepspeed.ai/features/#performance-analysis-and-debugging)
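
For the training-agnostic checkpointing feature above, the engine returned by
`deepspeed.initialize` exposes save/load helpers. A minimal sketch follows, with
`model_engine` as in the earlier example; the directory, tag, and `client_state`
contents are illustrative.

```python
# Save module, optimizer, and arbitrary user state under a directory and tag.
model_engine.save_checkpoint("checkpoints", tag="step_1000",
                             client_state={"step": 1000})

# Restore engine state; saved user state comes back as client_state.
load_path, client_state = model_engine.load_checkpoint("checkpoints",
                                                       tag="step_1000")
```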

# Further Reading

All DeepSpeed documentation can be found on our website: [deepspeed.ai](https://www.deepspeed.ai/)

| Article                                                                                        | Description                                  |
| ---------------------------------------------------------------------------------------------- | -------------------------------------------- |
| [DeepSpeed Features](https://www.deepspeed.ai/features/)                                       |  DeepSpeed features                          |
| [Getting Started](https://www.deepspeed.ai/getting-started/)                                   |  First steps with DeepSpeed                         |
| [DeepSpeed JSON Configuration](https://www.deepspeed.ai/docs/config-json/)                     |  Configuring DeepSpeed                       |
| [API Documentation](https://deepspeed.readthedocs.io/en/latest/)                               |  Generated DeepSpeed API documentation       |
| [CIFAR-10 Tutorial](https://www.deepspeed.ai/tutorials/cifar-10)                               |  Getting started with CIFAR-10 and DeepSpeed |
| [Megatron-LM Tutorial](https://www.deepspeed.ai/tutorials/megatron/)                           |  Train GPT2 with DeepSpeed and Megatron-LM   |
| [BERT Pre-training Tutorial](https://www.deepspeed.ai/tutorials/bert-pretraining/)             |  Pre-train BERT with DeepSpeed |
| [Learning Rate Range Test Tutorial](https://www.deepspeed.ai/tutorials/lrrt/)                  |  Faster training with large learning rates   |
| [1Cycle Tutorial](https://www.deepspeed.ai/tutorials/1Cycle/)                                  |  SOTA learning schedule in DeepSpeed         |

# Contributing
DeepSpeed welcomes your contributions! Please see our
[contributing](CONTRIBUTING.md) guide for more details on formatting, testing,
etc.

## Contributor License Agreement
This project welcomes contributions and suggestions. Most contributions require you to
agree to a Contributor License Agreement (CLA) declaring that you have the right to, and
actually do, grant us the rights to use your contribution. For details, visit
https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need
to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply
follow the instructions provided by the bot. You will only need to do this once across
all repos using our CLA.

## Code of Conduct
This project has adopted the [Microsoft Open Source Code of
Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the
[Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact
[opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or
comments.

# Publications
1. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019) ZeRO: Memory Optimization Towards Training A Trillion Parameter Models. [arXiv:1910.02054](https://arxiv.org/abs/1910.02054)