Unverified Commit 4eade17a authored by Shaden Smith, committed by GitHub

News edits (#220)

* BERT title
parent 0c824830
@@ -22,7 +22,7 @@ establishing a new SOTA in the LM category.
# News
* [2020/05/19] [An Order-of-Magnitude Larger and Faster Training with ZeRO-2](https://www.deepspeed.ai/news/2020/05/19/zero-stage2.html)
<span style="color:dodgerblue">**[_NEW_]**</span>
-* [2020/05/19] [DeepSpeed optimizes transformer kernels to achieve the world’s fastest and most efficient BERT training record: 44 minutes on 1024 NVIDIA V100 GPUs](https://www.deepspeed.ai/news/2020/05/19/bert-record.html)
+* [2020/05/19] [The Fastest and Most Efficient BERT Training through Optimized Transformer Kernels](https://www.deepspeed.ai/news/2020/05/19/bert-record.html)
<span style="color:dodgerblue">**[_NEW_]**</span>
* [2020/02/13] [Turing-NLG: A 17-billion-parameter language model by Microsoft](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/)
* [2020/02/13] [ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/)
---
layout: single
title: "DeepSpeed optimizes transformer kernels to achieve the world's fastest and most efficient BERT training record: 44 minutes on 1024 NVIDIA V100 GPUs"
title: "The Fastest and Most Efficient BERT Training through Optimized Transformer Kernels"
excerpt: ""
categories: news
new_post: true
date: 2020-05-19 00:00:00
---
-We introduce new technology to accelerate single GPU performance via
-kernel optimizations. These optimizations not only create a strong
-foundation for scaling out large models, but also improve the single GPU
-performance of highly tuned and moderately sized models like BERT by more
-than 30%, reaching a staggering performance of 66 teraflops per V100 GPU,
-which is 52% of the hardware peak. **Using these optimizations as the building
-block, DeepSpeed achieves the fastest BERT training record: 44 minutes on
-1,024 NVIDIA V100 GPUs**, compared with the best published result
-of 67 minutes on the same number and generation of GPUs.
-**Code and tutorials are coming soon!**
+We introduce new technology to accelerate single GPU performance via kernel
+optimizations. These optimizations not only create a strong foundation for
+scaling out large models, but also improve the single GPU performance of
+highly tuned and moderately sized models like BERT by more than 30%, reaching
+a staggering performance of 66 teraflops per V100 GPU, which is 52% of the
+hardware peak. **Using optimized transformer kernels as the building block,
+DeepSpeed achieves the fastest BERT training record: 44 minutes on 1,024
+NVIDIA V100 GPUs**, compared with the best published result of 67 minutes on
+the same number and generation of GPUs.
+For a technical overview, see our [blog post](linklink).
+**Code and tutorials are coming soon!**
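As a quick sanity check on the figures quoted in the post (an editor's sketch, not part of the commit, assuming NVIDIA's published 125-teraflop FP16 tensor-core peak for the V100):

```latex
% Hardware efficiency: achieved throughput over peak, assuming the
% V100's 125 TFLOPS mixed-precision tensor-core peak (NVIDIA spec).
\[ \frac{66\ \text{TFLOPS}}{125\ \text{TFLOPS}} \approx 0.528 \]
% which truncates to the quoted 52% of peak.

% End-to-end speedup over the best previously published result:
\[ \frac{67\ \text{min}}{44\ \text{min}} \approx 1.52\times \]
```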