---
title: "The Fastest and Most Efficient BERT Training through Optimized Transformer Kernels"
excerpt: ""
date: 2020-05-19 00:00:00
toc: false
tags: training English
---

We introduce new technology to accelerate single GPU performance via kernel
optimizations. These optimizations not only create a strong foundation for
scaling out large models, but also improve the single GPU performance of
highly tuned and moderately sized models like BERT by more than 30%, reaching
a staggering performance of 66 teraflops per V100 GPU, which is 52% of the
hardware peak. **Using optimized transformer kernels as the building block,
DeepSpeed achieves the fastest BERT training record: 44 minutes on 1,024
NVIDIA V100 GPUs**, compared with the best published result of 67 minutes on
the same number and generation of GPUs.
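The headline numbers above can be sanity-checked with a little arithmetic. A minimal sketch, assuming the commonly cited 125 teraflops fp16 tensor-core peak for the V100 SXM2 (the post states only the 52%-of-peak ratio, not the peak itself):

```python
# Back-of-the-envelope check of the numbers quoted in this announcement.
V100_FP16_PEAK_TFLOPS = 125.0  # assumed: NVIDIA V100 SXM2 fp16 tensor-core peak

achieved_tflops = 66.0  # per-GPU throughput reported above
utilization = achieved_tflops / V100_FP16_PEAK_TFLOPS  # ~0.528, i.e. ~52% of peak

prev_minutes = 67.0  # best previously published result on 1,024 V100s
ds_minutes = 44.0    # DeepSpeed record on the same number of GPUs
speedup = prev_minutes / ds_minutes  # ~1.52x faster end-to-end

print(f"utilization: {utilization:.1%}, speedup: {speedup:.2f}x")
```

Both figures line up with the claims in the paragraph: roughly 52% of hardware peak per GPU, and about a 1.5x end-to-end training speedup over the prior record.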

* For a brief overview, see our [press release](https://www.microsoft.com/en-us/research/blog/ZeRO-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/).
* For a detailed technology deep dive, see our [blog post](https://www.deepspeed.ai/2020/05/27/fastest-bert-training.html).
* For a tutorial on how to reproduce our results, see our [BERT pre-training tutorial](https://www.deepspeed.ai/tutorials/bert-pretraining/).
* The source code for our transformer kernels can be found in the [DeepSpeed repo](https://github.com/microsoft/deepspeed), and the BERT pre-training code can be found in the [DeepSpeedExamples repo](https://github.com/microsoft/deepspeedexamples).