Unverified Commit 6e87251c authored by RezaYazdaniAminabadi, committed by GitHub

add the fine-tuning results (#260)

* add the fine-tuning results

* updating tutorial and blog-post

* updated the tutorials and links
parent 96c4daab
4. Layer-norm reordering for training stability and faster convergence
These optimizations not only benefit BERT; they are also applicable to many
other transformer-based models such as RoBERTa, XLNet, and UniLM. Furthermore, in addition to the pre-training improvements, DeepSpeed achieves up to a 1.5x speedup on downstream tasks, such as fine-tuning Bing-BERT SQuAD.
## Performance Results for BERT Pretraining
Compared to SOTA, DeepSpeed significantly improves single-GPU performance for
transformer-based models like BERT. Figure 1 shows the single-GPU throughput of
BERT-Large training in teraflops (Tflops). DeepSpeed boosts throughput and allows for higher batch
sizes without running out-of-memory.
Looking at distributed training across GPUs, Table 1 shows our end-to-end
BERT-Large pre-training time (F1 score of 90.5 for SQuAD) using 16 to 1024 GPUs.
We complete BERT pre-training in 44 minutes using 1024 V100 GPUs (64 NVIDIA
DGX-2 nodes). In comparison, the previous SOTA from NVIDIA takes 47 minutes using
1472 V100 GPUs. DeepSpeed is not only faster but also uses 30% fewer resources.
Using the same 1024 GPUs, NVIDIA BERT takes 67 minutes while DeepSpeed takes 44 minutes (34% faster).
Similarly, on 256 GPUs, NVIDIA BERT takes 236 minutes while DeepSpeed takes 144
minutes (39% faster).
| Number of nodes | Number of V100 GPUs | Time |
| --------------- | ------------------- | ------------ |
| 1 DGX-2 | 16 | 33 hr 13 min |
| 4 DGX-2 | 64 | 8 hr 41 min |
| 16 DGX-2 | 256 | 144 min |
We expect a further increase in throughput by combining our software optimizations with the new hardware. We
project it would reduce BERT training time further to less than 25 minutes on a
cluster of 1024 A100 GPUs.
## Performance Results for Fine-Tuning Tasks
In addition to the performance benefits shown for pre-training,
we have evaluated the performance of our customized kernels for fine-tuning on
downstream tasks. Tables 2 and 3 show the samples-per-second achieved when running
Bing-BERT SQuAD on NVIDIA V100 GPUs with 16 GB and 32 GB of memory, using the PyTorch and DeepSpeed transformer kernels.
On the 16-GB V100, we achieve up to a 1.5x speedup while supporting a 2x larger batch size per GPU.
On the 32-GB V100, we support a batch size as large as 32 (2.6x larger than PyTorch), while providing a 1.3x speedup for end-to-end fine-tuning. Note that we use PyTorch's best
samples-per-second to compute the speedup for the cases where PyTorch runs out of memory (OOM).
| Micro batch size | PyTorch | DeepSpeed | Speedup (x) |
| ---------------- | ------- | --------- | ----------- |
| 4 | 36.34 | 50.76 | 1.4 |
| 6 | OOM | 54.28 | 1.5 |
| 8 | OOM | 54.16 | 1.5 |
Table 2. Samples/second for running SQuAD fine-tuning on NVIDIA V100 (16-GB) using PyTorch and DeepSpeed transformer kernels.
| Micro batch size | PyTorch | DeepSpeed | Speedup (x) |
| ---------------- | ------- | --------- | ----------- |
| 4 | 37.78 | 50.82 | 1.3 |
| 6 | 43.81 | 55.97 | 1.3 |
| 12 | 49.32 | 61.41 | 1.2 |
| 24 | OOM | 60.70 | 1.2 |
| 32 | OOM | 63.01 | 1.3 |
Table 3. Samples/second for running SQuAD fine-tuning on NVIDIA V100 (32-GB) using PyTorch and DeepSpeed transformer kernels.
## BERT Highly Optimized Transformer Kernels
GPUs have very high peak floating-point throughput, but the default Transformer implementation falls well short of this peak.
Users can choose between the two versions depending on their usage scenarios: the stochastic version
pursues the ultimate training performance, while the deterministic version may save
development time by better facilitating experimentation and debugging.
In our experiments, we use the stochastic kernels for BERT pre-training, while
using non-stochastic kernels for fine-tuning to achieve fully reproducible
results. We recommend using stochastic kernels for training tasks involving
massive amounts of data such as pre-training, while using non-stochastic kernels for fine-tuning tasks.
In this tutorial we will be adding DeepSpeed to the BingBert model for the SQuAD fine-tuning task.
## Overview
If you don't already have a copy of the DeepSpeed repository, please clone it
now and check out the DeepSpeedExamples submodule that contains the BingBertSquad
example (DeepSpeedExamples/BingBertSquad) we will be going over in the rest of
this tutorial.
```shell
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
git submodule update --init --recursive
cd DeepSpeedExamples/BingBertSquad
```
### Pre-requisites
* Training set: [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
* Validation set: [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
You also need a pre-trained BERT model checkpoint from either DeepSpeed, [HuggingFace](https://github.com/huggingface/transformers), or [TensorFlow](https://github.com/google-research/bert#pre-trained-models) to run the fine-tuning. For the DeepSpeed model, we will use checkpoint 160 from the BERT pre-training [tutorial](/tutorials/bert-pretraining/).
### Running BingBertSquad
- **DeepSpeed-enabled:** We provide a shell script that you can invoke to start training with DeepSpeed; it takes four arguments: `bash run_squad_deepspeed.sh <NUM_GPUS> <PATH_TO_CHECKPOINT> <PATH_TO_DATA_DIR> <PATH_TO_OUTPUT_DIR>`. The first argument is the number of GPUs to train with, the second is the path to the pre-training checkpoint, the third is the path to the training and validation sets (e.g., train-v1.1.json), and the fourth is the path to an output folder where the results will be saved. This script will invoke `nvidia_run_squad_deepspeed.py`.
- **Unmodified baseline:** If you would like to run a non-DeepSpeed-enabled version of fine-tuning, we provide a shell script named `run_squad_baseline.sh` that takes the same arguments as the DeepSpeed one. This script will invoke `nvidia_run_squad_baseline.py`.
## DeepSpeed Integration
The main part of training is done in `nvidia_run_squad_deepspeed.py`, which has
already been modified to use DeepSpeed. The `run_squad_deepspeed.sh` script
helps to invoke training and set up several hyperparameters relevant
to the training process. In the next few sections we will cover the changes we
made to the baseline in order to enable DeepSpeed; you don't have to make these
changes yourself since we have already made them for you.
### Configuration
The `deepspeed_bsz24_config.json` file gives the user the ability to specify DeepSpeed
options in terms of batch size, micro batch size, learning rate, and other parameters.
When running `nvidia_run_squad_deepspeed.py`, in addition to the
`--deepspeed` flag to enable DeepSpeed, the appropriate DeepSpeed configuration
file must be specified using `--deepspeed_config
deepspeed_bsz24_config.json`. Table 1 shows the fine-tuning configuration
used in our experiments.
| Parameters | Value |
| ------------------------------ | ----- |
| Total batch size | 24 |
| Train micro batch size per GPU | 3 |
| Optimizer | Adam |
| Learning rate | 3e-5 |
| Sequence-length | 384 |
| Weight-decay | 0.0 |
| Epoch count | 2 |
Table 1. Fine-tuning configuration
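For reference, the sketch below writes out a DeepSpeed configuration matching Table 1. The field names follow the standard DeepSpeed JSON schema, but the exact contents of the `deepspeed_bsz24_config.json` shipped with the example may differ; the sequence length and epoch count are passed as script arguments rather than through this file.
```python
import json

# Sketch of a DeepSpeed config matching Table 1 (assumed, not the shipped file).
# Sequence length and epoch count are command-line arguments of the fine-tuning
# script, so they do not appear here.
ds_config = {
    "train_batch_size": 24,               # total batch size
    "train_micro_batch_size_per_gpu": 3,  # micro batch size per GPU
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 3e-5,
            "weight_decay": 0.0
        }
    },
    "fp16": {
        "enabled": True                   # set to False to fine-tune in FP32
    }
}

with open("deepspeed_bsz24_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```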
### Argument Parsing
#### Initialization
DeepSpeed has an initialization function to wrap the model, optimizer, LR
scheduler, and data loader. For BingBertSquad, we simply augment the baseline
script with the initialize function to wrap the model and create the optimizer as follows:
```python
model, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=optimizer_grouped_parameters
)
```
Another feature of DeepSpeed is its convenient `step()` function, called as `model.step()`, which hides the `fp16_optimizer` from the user; in contrast, `optimizer.step()` in the baseline code (as in the other models in this tutorial series) needs explicit handling of the FP16 computation case.
#### Forward pass
This is identical in both Baseline and DeepSpeed, and is performed by `loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)`.
#### Weight updates
In the baseline script, you are required to explicitly specify the optimizer as
`FusedAdam` (along with the handling of dynamic loss scaling) in FP16 and
`BertAdam` in FP32, followed by the calls `optimizer.step()` and
`optimizer.zero_grad()`. DeepSpeed handles this internally (by setting the
optimizer via the JSON config) when `initialize()` is called, so you
don't need to write this code explicitly; just call `model.step()`.
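Putting these pieces together, a minimal sketch of the resulting training step is shown below; it assumes the standard DeepSpeed engine API (`model.backward()` and `model.step()` on the object returned by `deepspeed.initialize()`), with the forward call taken from the section above.
```python
def train_step(model, batch):
    # `model` is the engine returned by deepspeed.initialize(); it owns FP16 loss
    # scaling, gradient all-reduce, and the optimizer state.
    input_ids, segment_ids, input_mask, start_positions, end_positions = batch
    loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)
    model.backward(loss)  # replaces loss.backward() and loss scaling in the baseline
    model.step()          # replaces optimizer.step() and optimizer.zero_grad()
    return loss
```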
Congratulations! Porting to DeepSpeed is complete.
### Evaluation
Once training is complete, the EM and F1 scores may be obtained from the following command:
```shell
python evaluate-v1.1.py <PATH_TO_DATA_DIR>/dev-v1.1.json <PATH_TO_DATA_DIR>/predictions.json
```
### Fine-tuning Results
The table summarizing the results is given below. In all cases (unless
otherwise noted), the total batch size is set to 24 and training is conducted
on 4 GPUs for 2 epochs on a DGX-2 node. A set of parameters (seeds and
learning rates) were tried and the best ones were selected. All learning rates
were 3e-5; we set the seeds to 9041 and 19068 for the HuggingFace and TensorFlow
models, respectively. The checkpoints used for each case are linked in the
table below.
| Case | Model | Precision | EM | F1 |
| ----------- | ------------------------------------- | --------- | ----- | ----- |
| TensorFlow | [Bert-large-uncased-L-24_H-1024_A-16](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip) | FP16 | 84.13 | 91.03 |
| HuggingFace | [Bert-large-uncased-whole-word-masking](https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-pytorch_model.bin) | FP16 | 87.27 | 93.33 |
### Enabling DeepSpeed's Transformer Kernel
DeepSpeed's optimized transformer kernel can be enabled during fine-tuning to
increase the training throughput. In addition to supporting the models
pre-trained with DeepSpeed, the kernel can be used with TensorFlow and
HuggingFace checkpoints.
An argument `--deepspeed_transformer_kernel` is already created in `utils.py`; we enable the transformer kernel by adding this flag in the shell script.
```python
parser.add_argument(
    '--deepspeed_transformer_kernel',
    default=False,
    action='store_true',
    help='Use DeepSpeed transformer kernel to accelerate.'
)
```
In the `BertEncoder` class of the modeling source file, the DeepSpeed transformer kernel is created as shown below when it is enabled by the `--deepspeed_transformer_kernel` argument.
```python
if args.deepspeed_transformer_kernel:
    from deepspeed import DeepSpeedTransformerLayer, \
        DeepSpeedTransformerConfig, DeepSpeedConfig

    if hasattr(args, 'deepspeed_config') and args.deepspeed_config:
        ds_config = DeepSpeedConfig(args.deepspeed_config)
    else:
        raise RuntimeError('deepspeed_config is not found in args.')

    cuda_config = DeepSpeedTransformerConfig(
        batch_size=ds_config.train_micro_batch_size_per_gpu,
        max_seq_length=args.max_seq_length,
        hidden_size=config.hidden_size,
        heads=config.num_attention_heads,
        attn_dropout_ratio=config.attention_probs_dropout_prob,
        hidden_dropout_ratio=config.hidden_dropout_prob,
        num_hidden_layers=config.num_hidden_layers,
        initializer_range=config.initializer_range,
        seed=args.seed,
        fp16=ds_config.fp16_enabled
    )
    self.layer = nn.ModuleList([
        copy.deepcopy(DeepSpeedTransformerLayer(i, cuda_config))
        for i in range(config.num_hidden_layers)
    ])
else:
    layer = BertLayer(config)
    self.layer = nn.ModuleList([
        copy.deepcopy(layer)
        for _ in range(config.num_hidden_layers)
    ])
```
All configuration settings come from the DeepSpeed configuration file and
command-line arguments, and thus we must pass the `args` variable here in this model.
Note: `batch_size` is the maximum batch size of the input data; no fine-tuning
training data or prediction data should exceed this threshold, otherwise an
exception will be thrown. In the DeepSpeed configuration file the micro batch size
is defined as `train_micro_batch_size_per_gpu`; e.g., if it is set to 8, then
`--predict_batch_size` should also be 8.
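To illustrate that constraint, a hypothetical helper along the following lines could be used to fail fast before the kernel is built (the function name and check are ours, not part of the example scripts):
```python
import json

def check_kernel_batch_size(deepspeed_config_path, batch_size):
    # The transformer kernel is built for a fixed maximum batch size taken from
    # train_micro_batch_size_per_gpu; feeding it larger training or prediction
    # batches raises an exception inside the kernel.
    with open(deepspeed_config_path) as f:
        micro_batch = json.load(f)["train_micro_batch_size_per_gpu"]
    if batch_size > micro_batch:
        raise ValueError(
            f"Batch size {batch_size} exceeds the kernel maximum of {micro_batch}; "
            "lower --predict_batch_size or raise train_micro_batch_size_per_gpu.")
```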
For further details about the transformer kernel, please see our [usage
tutorial](/tutorials/transformer_kernel/) and [technical deep
dive](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html) on
the fastest BERT training.
### Loading HuggingFace and TensorFlow Pretrained Models
BingBertSquad supports both HuggingFace and TensorFlow pretrained models. Here,
we show two example models:
1. `test/huggingface` which includes the checkpoint
[Bert-large-uncased-whole-word-masking](https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-pytorch_model.bin) and [bert json config](https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-config.json).
2. `test/tensorflow` which comes from a checkpoint zip from Google
[Bert-large-uncased-L-24_H-1024_A-16](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip).
```shell
[test/huggingface]
bert-large-uncased-whole-word-masking-config.json
bert-large-uncased-whole-word-masking-pytorch_model.bin
```
```shell
[test/tensorflow]
bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
bert_model.ckpt.meta
```
There are three arguments used for loading these two types of checkpoints:
1. `--model_file`, which points to the pretrained model file.
2. `--ckpt_type`, which indicates the checkpoint type: `TF` for TensorFlow, `HF` for HuggingFace; the default value is `DS` for DeepSpeed.
3. `--origin_bert_config_file`, which points to the BERT config file, usually saved in the same folder as `model_file`.
We can add the following to our fine-tuning shell script, `run_squad_deepspeed.sh`, to run the above HuggingFace and TensorFlow examples.
```shell
[HuggingFace]
--model_file test/huggingface/bert-large-uncased-whole-word-masking-pytorch_model.bin \
--ckpt_type HF \
--origin_bert_config_file test/huggingface/bert-large-uncased-whole-word-masking-config.json \
```
```shell
[TensorFlow]
--model_file /test/tensorflow/bert_model.ckpt \
--ckpt_type TF \
--origin_bert_config_file /test/tensorflow/bert_config.json \
```
Notes:
1. The `--deepspeed_transformer_kernel` flag is required for using HuggingFace or TensorFlow pretrained models.
2. The `--preln` flag cannot be used with HuggingFace or TensorFlow pretrained models, since they use a post-layer-norm.
3. BingBertSquad checks that the pretrained model has the same vocabulary size and won't be able to run if there is any mismatch. We advise that you use a model checkpoint of the style described above or a DeepSpeed bing\_bert checkpoint.
### Tuning Performance
In order to perform fine-tuning, we set the total batch size to 24 as shown in Table 1. However, we can tune the micro-batch size per GPU to get high-performance training. In this regard, we have tried different micro-batch sizes on NVIDIA V100 GPUs with either 16GB or 32GB of memory. As Tables 2 and 3 show, we can improve performance by increasing the micro-batch size. Compared with PyTorch, we achieve up to a 1.5x speedup on the 16GB V100 while supporting a 2x larger batch size per GPU. On the 32GB V100, we support a batch size as large as 32 (2.6x larger than PyTorch), while providing a 1.3x speedup for end-to-end fine-tuning. Note that we use PyTorch's best samples-per-second to compute the speedup for the cases where PyTorch runs out of memory (OOM).
| Micro batch size | PyTorch | DeepSpeed | Speedup (x) |
| ---------------- | ------- | --------- | ----------- |
| 4 | 36.34 | 50.76 | 1.4 |
| 6 | OOM | 54.28 | 1.5 |
| 8 | OOM | 54.16 | 1.5 |
Table 2. Samples/second for running SQuAD fine-tuning on NVIDIA V100 (16GB) using PyTorch and DeepSpeed transformer kernels.
| Micro batch size | PyTorch | DeepSpeed | Speedup (x) |
| ---------------- | ------- | --------- | ----------- |
| 4 | 37.78 | 50.82 | 1.3 |
| 6 | 43.81 | 55.97 | 1.3 |
| 12 | 49.32 | 61.41 | 1.2 |
| 24 | OOM | 60.70 | 1.2 |
| 32 | OOM | 63.01 | 1.3 |
Table 3. Samples/second for running SQuAD fine-tuning on NVIDIA V100 (32GB) using PyTorch and DeepSpeed transformer kernels.
As mentioned, we can increase the micro-batch size per GPU from 3 to 24 or even
higher if a larger batch size is desired. In order to support a larger
micro-batch size, we may need to enable different memory-optimization flags for our
transformer kernel, as described in the [DeepSpeed Transformer
Kernel](/tutorials/transformer_kernel/) tutorial. Table 4 shows which
optimization flags are required for running different ranges of micro-batch
sizes.
| Micro batch size | NVIDIA V100 (32-GB) | NVIDIA V100 (16-GB) |
| :--------------: | :--------------------------------------: | :--------------------------------------: |
| > 4 | - | `normalize_invertible` |
| > 6 | - | `attn_dropout_checkpoint`, `gelu_checkpoint` |
| > 12 | `normalize_invertible`, `attn_dropout_checkpoint` | OOM |
| > 24 | `gelu_checkpoint` | OOM |
Table 4. The setting of memory-optimization flags for a range of micro-batch sizes on 16-GB and 32-GB V100.
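For example, per Table 4, a micro-batch size of 24 on a 32-GB V100 requires `normalize_invertible` and `attn_dropout_checkpoint`. The sketch below shows how these flags could be passed when constructing the kernel configuration, reusing the fields shown earlier; all values other than the two flags are illustrative BERT-Large placeholders.
```python
from deepspeed import DeepSpeedTransformerConfig

# Sketch: kernel configuration for a micro-batch size of 24 on a 32-GB V100.
# Per Table 4, this range needs normalize_invertible and attn_dropout_checkpoint.
# Values other than the two flags are illustrative BERT-Large placeholders.
cuda_config = DeepSpeedTransformerConfig(
    batch_size=24,
    max_seq_length=384,
    hidden_size=1024,
    heads=16,
    attn_dropout_ratio=0.1,
    hidden_dropout_ratio=0.1,
    num_hidden_layers=24,
    initializer_range=0.02,
    seed=42,
    fp16=True,
    normalize_invertible=True,      # recompute layer-norm inputs instead of storing them
    attn_dropout_checkpoint=True    # checkpoint the attention dropout activations
)
```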
### Fine-tuning a Model Pre-trained with DeepSpeed Transformer Kernels
Fine-tuning the model pre-trained using the DeepSpeed Transformer kernel and the recipe in [DeepSpeed Fast-Bert Training](/fast_bert/) should yield an F1 score of 90.5, which is expected to increase if you run the pre-training longer than suggested in the tutorial.
To get these results, we do require some tuning of the dropout settings as described below:
### Dropout Setting
For fine-tuning, we only use the deterministic transformer kernel to obtain reproducible fine-tuning results. However, we choose different dropout values depending on whether pre-training was done with the deterministic or the stochastic transformer (please see the [Transformer tutorial](/tutorials/transformer_kernel/) for more details on selecting these two modes).
For models pre-trained with the deterministic transformer, we use the same dropout ratio used in pre-training (0.1). However, we slightly increase the dropout ratio when fine-tuning a model pre-trained using the stochastic transformer, to compensate for the lack of stochastic noise during fine-tuning.
| Pre-training mode | Dropout ratio |
| ----------------- | ------------- |
| Deterministic | 0.1 |
| Stochastic | 0.12 - 0.14 |