Unverified Commit 6e87251c authored by RezaYazdaniAminabadi, committed by GitHub

add the fine-tuning results (#260)

* add the fine-tuning results

* updating tutorial and blog-post

* updated the tutorials and links
parent 96c4daab
4. Layer-norm reordering for training stability and faster convergence
These optimizations not only benefit BERT; they are also applicable to many
other transformer-based models such as RoBERTa, XLNet, and UniLM. Furthermore, in addition to the pre-training improvements, DeepSpeed achieves up to a 1.5x speedup on downstream tasks, such as fine-tuning Bing-BERT SQuAD.
## Performance Results for BERT Pretraining
Compared to SOTA, DeepSpeed significantly improves single-GPU performance for
transformer-based models like BERT. Figure 1 shows the single-GPU throughput of
BERT-Large training in teraflops (Tflops). DeepSpeed boosts throughput and allows for higher batch
sizes without running out-of-memory.
Looking at distributed training across GPUs, Table 1 shows our end-to-end
BERT-Large pre-training time (F1 score of 90.5 for SQuAD) using 16 to 1024 GPUs.
We complete BERT pre-training in 44 minutes using 1024 V100 GPUs (64 NVIDIA
DGX-2 nodes). In comparison, the previous SOTA from NVIDIA takes 47 minutes using
1472 V100 GPUs. DeepSpeed is not only faster but also uses 30% fewer resources.
Using the same 1024 GPUs, NVIDIA BERT takes 67 minutes while DeepSpeed takes 44 minutes (34% faster).
Similarly, on 256 GPUs, NVIDIA BERT takes 236 minutes while DeepSpeed takes 144
minutes (39% faster).
| Number of nodes | Number of V100 GPUs | Time |
| --------------- | ------------------- | ------------ |
| 1 DGX-2 | 16 | 33 hr 13 min |
| 4 DGX-2 | 64 | 8 hr 41 min |
| 16 DGX-2 | 256 | 144 min |
We expect a further increase in throughput by combining our software optimizations with the new hardware. We
project it would reduce BERT training time further to less than 25 minutes on a
cluster of 1024 A100 GPUs.
## Performance Results for Fine-Tuning Tasks
In addition to the performance benefits shown for pre-training,
we have evaluated the performance of our customized kernels for fine-tuning on
downstream tasks. Tables 2 and 3 show the samples-per-second achieved when running
Bing-BERT SQuAD on NVIDIA V100 GPUs with 16 GB and 32 GB of memory, using the PyTorch and DeepSpeed transformer kernels.
On the 16-GB V100, we achieve up to a 1.5x speedup while supporting a 2x larger batch size per GPU.
On the 32-GB V100, we support a batch size as large as 32 (2.6x larger than PyTorch), while providing a 1.3x speedup for end-to-end fine-tuning. Note that we use PyTorch's best
samples-per-second to compute the speedup for the cases where PyTorch runs out of memory (OOM).
| Micro batch size | PyTorch | DeepSpeed | Speedup (x) |
| ---------------- | ------- | --------- | ----------- |
| 4 | 36.34 | 50.76 | 1.4 |
| 6 | OOM | 54.28 | 1.5 |
| 8 | OOM | 54.16 | 1.5 |
Table 2. Samples/second for running SQuAD fine-tuning on NVIDIA V100 (16-GB) using PyTorch and DeepSpeed transformer kernels.
| Micro batch size | PyTorch | DeepSpeed | Speedup (x) |
| ---------------- | ------- | --------- | ----------- |
| 4 | 37.78 | 50.82 | 1.3 |
| 6 | 43.81 | 55.97 | 1.3 |
| 12 | 49.32 | 61.41 | 1.2 |
| 24 | OOM | 60.70 | 1.2 |
| 32 | OOM | 63.01 | 1.3 |
Table 3. Samples/second for running SQuAD fine-tuning on NVIDIA V100 (32-GB) using PyTorch and DeepSpeed transformer kernels.
## BERT Highly Optimized Transformer Kernels
GPUs have very high peak floating-point throughput, but the default Transformer implementation falls well short of this peak.
Users can choose between the two versions depending on their usage scenarios: the stochastic version
pursues the ultimate training performance, while the deterministic version may save
development time by better facilitating experimentation and debugging.
In our experiments, we use the stochastic kernels for BERT pre-training, while
using non-stochastic kernels for fine-tuning to achieve fully reproducible
results. We recommend using stochastic kernels for training tasks involving
massive amounts of data such as pre-training, while using non-stochastic kernels for fine-tuning tasks.
In this tutorial we will be adding DeepSpeed to the BingBert model for the SQuAD fine-tuning task.
## Overview
If you don't already have a copy of the DeepSpeed repository, please clone it
now and check out the DeepSpeedExamples submodule that contains the BingBertSquad
example (DeepSpeedExamples/BingBertSquad) we will be going over in the rest of
this tutorial.
```shell
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
git submodule update --init --recursive
cd DeepSpeedExamples/BingBertSquad
```
### Pre-requisites
* Training set: [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
* Validation set: [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
You also need a pre-trained BERT model checkpoint from either DeepSpeed, [HuggingFace](https://github.com/huggingface/transformers), or [TensorFlow](https://github.com/google-research/bert#pre-trained-models) to run the fine-tuning. For the DeepSpeed model, we will use checkpoint 160 from the BERT pre-training [tutorial](/tutorials/bert-pretraining/).
### Running BingBertSquad
- **DeepSpeed-enabled:** We provide a shell script that you can invoke to start training with DeepSpeed; it takes four arguments: `bash run_squad_deepspeed.sh <NUM_GPUS> <PATH_TO_CHECKPOINT> <PATH_TO_DATA_DIR> <PATH_TO_OUTPUT_DIR>`. The first argument is the number of GPUs to train with, the second is the path to the pre-training checkpoint, the third is the path to the training and validation sets (e.g., train-v1.1.json), and the fourth is the path to an output folder where the results will be saved. This script will invoke `nvidia_run_squad_deepspeed.py`.
- **Unmodified baseline:** If you would like to run a non-DeepSpeed-enabled version of fine-tuning, we provide a shell script named `run_squad_baseline.sh` that takes the same arguments as the DeepSpeed one. This script will invoke `nvidia_run_squad_baseline.py`.
## DeepSpeed Integration
The main part of training is done in `nvidia_run_squad_deepspeed.py`, which has
already been modified to use DeepSpeed. The `run_squad_deepspeed.sh` script
helps to invoke training and set up several hyperparameters relevant
to the training process. In the next few sections we will cover the changes we
made to the baseline in order to enable DeepSpeed; you don't have to make these
changes yourself since we have already made them for you.
### Configuration
The `deepspeed_bsz24_config.json` file gives the user the ability to specify DeepSpeed
options in terms of batch size, micro batch size, learning rate, and other parameters.
When running `nvidia_run_squad_deepspeed.py`, in addition to the
`--deepspeed` flag to enable DeepSpeed, the appropriate DeepSpeed configuration
file must be specified using `--deepspeed_config
deepspeed_bsz24_config.json`. Table 1 shows the fine-tuning configuration
used in our experiments.
| Parameters | Value |
| ------------------------------ | ----- |
| Total batch size | 24 |
| Train micro batch size per GPU | 3 |
| Optimizer | Adam |
| Learning rate | 3e-5 |
| Sequence-length | 384 |
| Weight-decay | 0.0 |
| Epoch count | 2 |
Table 1. Fine-tuning configuration
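For reference, the sketch below writes out a DeepSpeed configuration matching Table 1. The field names follow the standard DeepSpeed JSON schema, but the exact contents of the `deepspeed_bsz24_config.json` shipped with the example may differ; the sequence length and epoch count are passed as script arguments rather than through this file.
```python
import json

# Sketch of a DeepSpeed config matching Table 1 (assumed, not the shipped file).
# Sequence length and epoch count are command-line arguments of the fine-tuning
# script, so they do not appear here.
ds_config = {
    "train_batch_size": 24,               # total batch size
    "train_micro_batch_size_per_gpu": 3,  # micro batch size per GPU
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 3e-5,
            "weight_decay": 0.0
        }
    },
    "fp16": {
        "enabled": True                   # set to False to fine-tune in FP32
    }
}

with open("deepspeed_bsz24_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```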
### Argument Parsing
#### Initialization
DeepSpeed has an initialization function to wrap the model, optimizer, LR
scheduler, and data loader. For BingBertSquad, we simply augment the baseline
script with the initialize function to wrap the model and create the optimizer as follows:
```python
model, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=optimizer_grouped_parameters
)
```
Another feature of DeepSpeed is its convenient `step()` function, called as `model.step()`, which hides the `fp16_optimizer` from the user; in contrast, `optimizer.step()` in the baseline code (as in the other models in this tutorial series) needs explicit handling of the FP16 computation case.
#### Forward pass
This is identical in both Baseline and DeepSpeed, and is performed by `loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)`.
#### Weight updates
In the baseline script, you are required to explicitly specify the optimizer as
`FusedAdam` (along with the handling of dynamic loss scaling) in FP16 and
`BertAdam` in FP32, followed by the calls `optimizer.step()` and
`optimizer.zero_grad()`. DeepSpeed handles this internally (by setting the
optimizer via the JSON config) when `initialize()` is called, so you
don't need to write this code explicitly; just call `model.step()`.
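Putting these pieces together, a minimal sketch of the resulting training step is shown below; it assumes the standard DeepSpeed engine API (`model.backward()` and `model.step()` on the object returned by `deepspeed.initialize()`), with the forward call taken from the section above.
```python
def train_step(model, batch):
    # `model` is the engine returned by deepspeed.initialize(); it owns FP16 loss
    # scaling, gradient all-reduce, and the optimizer state.
    input_ids, segment_ids, input_mask, start_positions, end_positions = batch
    loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)
    model.backward(loss)  # replaces loss.backward() and loss scaling in the baseline
    model.step()          # replaces optimizer.step() and optimizer.zero_grad()
    return loss
```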
Congratulations! Porting to DeepSpeed is complete.
### Evaluation
Once training is complete, the EM and F1 scores may be obtained from the following command:
```shell
python evaluate-v1.1.py <PATH_TO_DATA_DIR>/dev-v1.1.json <PATH_TO_DATA_DIR>/predictions.json
```
### Fine-tuning Results
The table summarizing the results is given below. In all cases (unless
otherwise noted), the total batch size is set to 24 and training is conducted
on 4 GPUs for 2 epochs on a DGX-2 node. A set of parameters (seeds and
learning rates) were tried and the best ones were selected. All learning rates
were 3e-5; we set the seeds to 9041 and 19068 for the HuggingFace and TensorFlow
models, respectively. The checkpoints used for each case are linked in the
table below.
| Case | Model | Precision | EM | F1 |
| ----------- | ------------------------------------- | --------- | ----- | ----- |
| TensorFlow | [Bert-large-uncased-L-24_H-1024_A-16](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip) | FP16 | 84.13 | 91.03 |
| HuggingFace | [Bert-large-uncased-whole-word-masking](https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-pytorch_model.bin) | FP16 | 87.27 | 93.33 |
### Enabling DeepSpeed's Transformer Kernel
DeepSpeed's optimized transformer kernel can be enabled during fine-tuning to
increase the training throughput. In addition to supporting the models
pre-trained with DeepSpeed, the kernel can be used with TensorFlow and
HuggingFace checkpoints.
An argument `--deepspeed_transformer_kernel` is already created in `utils.py`; we enable the transformer kernel by adding this flag in the shell script.
```python
parser.add_argument(
    '--deepspeed_transformer_kernel',
    default=False,
    action='store_true',
    help='Use DeepSpeed transformer kernel to accelerate.'
)
```
In the `BertEncoder` class of the modeling source file, the DeepSpeed transformer kernel is created as shown below when it is enabled by the `--deepspeed_transformer_kernel` argument.
```python
if args.deepspeed_transformer_kernel:
    from deepspeed import DeepSpeedTransformerLayer, \
        DeepSpeedTransformerConfig, DeepSpeedConfig

    if hasattr(args, 'deepspeed_config') and args.deepspeed_config:
        ds_config = DeepSpeedConfig(args.deepspeed_config)
    else:
        raise RuntimeError('deepspeed_config is not found in args.')

    cuda_config = DeepSpeedTransformerConfig(
        batch_size=ds_config.train_micro_batch_size_per_gpu,
        max_seq_length=args.max_seq_length,
        hidden_size=config.hidden_size,
        heads=config.num_attention_heads,
        attn_dropout_ratio=config.attention_probs_dropout_prob,
        hidden_dropout_ratio=config.hidden_dropout_prob,
        num_hidden_layers=config.num_hidden_layers,
        initializer_range=config.initializer_range,
        seed=args.seed,
        fp16=ds_config.fp16_enabled
    )
    self.layer = nn.ModuleList([
        copy.deepcopy(DeepSpeedTransformerLayer(i, cuda_config))
        for i in range(config.num_hidden_layers)
    ])
else:
    layer = BertLayer(config)
    self.layer = nn.ModuleList([
        copy.deepcopy(layer)
        for _ in range(config.num_hidden_layers)
    ])
```
All configuration settings come from the DeepSpeed configuration file and
command-line arguments, and thus we must pass the `args` variable here in this model.
Note: `batch_size` is the maximum batch size of the input data; no fine-tuning
training data or prediction data should exceed this threshold, otherwise an
exception will be thrown. In the DeepSpeed configuration file the micro batch size
is defined as `train_micro_batch_size_per_gpu`; e.g., if it is set to 8, then
`--predict_batch_size` should also be 8.
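To illustrate that constraint, a hypothetical helper along the following lines could be used to fail fast before the kernel is built (the function name and check are ours, not part of the example scripts):
```python
import json

def check_kernel_batch_size(deepspeed_config_path, batch_size):
    # The transformer kernel is built for a fixed maximum batch size taken from
    # train_micro_batch_size_per_gpu; feeding it larger training or prediction
    # batches raises an exception inside the kernel.
    with open(deepspeed_config_path) as f:
        micro_batch = json.load(f)["train_micro_batch_size_per_gpu"]
    if batch_size > micro_batch:
        raise ValueError(
            f"Batch size {batch_size} exceeds the kernel maximum of {micro_batch}; "
            "lower --predict_batch_size or raise train_micro_batch_size_per_gpu.")
```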
For further details about the transformer kernel, please see our [usage
tutorial](/tutorials/transformer_kernel/) and [technical deep
dive](https://www.deepspeed.ai/news/2020/05/27/fastest-bert-training.html) on
the fastest BERT training.
### Loading HuggingFace and TensorFlow Pretrained Models
BingBertSquad supports both HuggingFace and TensorFlow pretrained models. Here,
we show two example models:
1. `test/huggingface` which includes the checkpoint
[Bert-large-uncased-whole-word-masking](https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-pytorch_model.bin) and [bert json config](https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-config.json).
2. `test/tensorflow` which comes from a checkpoint zip from Google
[Bert-large-uncased-L-24_H-1024_A-16](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip).
```shell
[test/huggingface]
bert-large-uncased-whole-word-masking-config.json
bert-large-uncased-whole-word-masking-pytorch_model.bin
```
```shell
[test/tensorflow]
bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
bert_model.ckpt.meta
```
There are three arguments used for loading these two types of checkpoints:
1. `--model_file`, which points to the pretrained model file.
2. `--ckpt_type`, which indicates the checkpoint type: `TF` for TensorFlow, `HF` for HuggingFace; the default value is `DS` for DeepSpeed.
3. `--origin_bert_config_file`, which points to the BERT config file, usually saved in the same folder as `model_file`.
We can add the following to our fine-tuning shell script, `run_squad_deepspeed.sh`, to run the above HuggingFace and TensorFlow examples.
```shell
[HuggingFace]
--model_file test/huggingface/bert-large-uncased-whole-word-masking-pytorch_model.bin \
--ckpt_type HF \
--origin_bert_config_file test/huggingface/bert-large-uncased-whole-word-masking-config.json \
```
```shell
[TensorFlow]
--model_file /test/tensorflow/bert_model.ckpt \
--ckpt_type TF \
--origin_bert_config_file /test/tensorflow/bert_config.json \
```
Notes:
1. The `--deepspeed_transformer_kernel` flag is required for using HuggingFace or TensorFlow pretrained models.
2. The `--preln` flag cannot be used with HuggingFace or TensorFlow pretrained models, since they use a post-layer-norm.
3. BingBertSquad checks that the pretrained model has the same vocabulary size and won't be able to run if there is any mismatch. We advise that you use a model checkpoint of the style described above or a DeepSpeed bing\_bert checkpoint.
### Tuning Performance
In order to perform fine-tuning, we set the total batch size to 24 as shown in Table 1. However, we can tune the micro-batch size per GPU to get high-performance training. In this regard, we have tried different micro-batch sizes on NVIDIA V100 GPUs with either 16GB or 32GB of memory. As Tables 2 and 3 show, we can improve performance by increasing the micro-batch size. Compared with PyTorch, we achieve up to a 1.5x speedup on the 16GB V100 while supporting a 2x larger batch size per GPU. On the 32GB V100, we support a batch size as large as 32 (2.6x larger than PyTorch), while providing a 1.3x speedup for end-to-end fine-tuning. Note that we use PyTorch's best samples-per-second to compute the speedup for the cases where PyTorch runs out of memory (OOM).
| Micro batch size | PyTorch | DeepSpeed | Speedup (x) |
| ---------------- | ------- | --------- | ----------- |
| 4 | 36.34 | 50.76 | 1.4 |
| 6 | OOM | 54.28 | 1.5 |
| 8 | OOM | 54.16 | 1.5 |
Table 2. Samples/second for running SQuAD fine-tuning on NVIDIA V100 (16GB) using PyTorch and DeepSpeed transformer kernels.
| Micro batch size | PyTorch | DeepSpeed | Speedup (x) |
| ---------------- | ------- | --------- | ----------- |
| 4 | 37.78 | 50.82 | 1.3 |
| 6 | 43.81 | 55.97 | 1.3 |
| 12 | 49.32 | 61.41 | 1.2 |
| 24 | OOM | 60.70 | 1.2 |
| 32 | OOM | 63.01 | 1.3 |
Table 3. Samples/second for running SQuAD fine-tuning on NVIDIA V100 (32GB) using PyTorch and DeepSpeed transformer kernels.
As mentioned, we can increase the micro-batch size per GPU from 3 to 24 or even
higher if a larger batch size is desired. In order to support a larger
micro-batch size, we may need to enable different memory-optimization flags for our
transformer kernel, as described in the [DeepSpeed Transformer
Kernel](/tutorials/transformer_kernel/) tutorial. Table 4 shows which
optimization flags are required for running different ranges of micro-batch
sizes.
| Micro batch size | NVIDIA V100 (32-GB) | NVIDIA V100 (16-GB) |
| :--------------: | :--------------------------------------: | :--------------------------------------: |
| > 4 | - | `normalize_invertible` |
| > 6 | - | `attn_dropout_checkpoint`, `gelu_checkpoint` |
| > 12 | `normalize_invertible`, `attn_dropout_checkpoint` | OOM |
| > 24 | `gelu_checkpoint` | OOM |
Table 4. The setting of memory-optimization flags for a range of micro-batch sizes on 16-GB and 32-GB V100.
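For example, per Table 4, a micro-batch size of 24 on a 32-GB V100 requires `normalize_invertible` and `attn_dropout_checkpoint`. The sketch below shows how these flags could be passed when constructing the kernel configuration, reusing the fields shown earlier; all values other than the two flags are illustrative BERT-Large placeholders.
```python
from deepspeed import DeepSpeedTransformerConfig

# Sketch: kernel configuration for a micro-batch size of 24 on a 32-GB V100.
# Per Table 4, this range needs normalize_invertible and attn_dropout_checkpoint.
# Values other than the two flags are illustrative BERT-Large placeholders.
cuda_config = DeepSpeedTransformerConfig(
    batch_size=24,
    max_seq_length=384,
    hidden_size=1024,
    heads=16,
    attn_dropout_ratio=0.1,
    hidden_dropout_ratio=0.1,
    num_hidden_layers=24,
    initializer_range=0.02,
    seed=42,
    fp16=True,
    normalize_invertible=True,      # recompute layer-norm inputs instead of storing them
    attn_dropout_checkpoint=True    # checkpoint the attention dropout activations
)
```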
### Fine-tuning a Model Pre-trained with DeepSpeed Transformer Kernels
Fine-tuning the model pre-trained using the DeepSpeed Transformer kernel and the recipe in [DeepSpeed Fast-Bert Training](/fast_bert/) should yield an F1 score of 90.5, which is expected to increase if you run the pre-training longer than suggested in the tutorial.
To get these results, we do require some tuning of the dropout settings as described below:
### Dropout Setting
For fine-tuning, we only use the deterministic transformer kernel to obtain reproducible fine-tuning results. However, we choose different dropout values depending on whether pre-training was done with the deterministic or the stochastic transformer (please see the [Transformer tutorial](/tutorials/transformer_kernel/) for more details on selecting these two modes).
For models pre-trained with the deterministic transformer, we use the same dropout ratio used in pre-training (0.1). However, we slightly increase the dropout ratio when fine-tuning a model pre-trained using the stochastic transformer, to compensate for the lack of stochastic noise during fine-tuning.
| Pre-training mode | Dropout ratio |
| ----------------- | ------------- |
| Deterministic | 0.1 |
| Stochastic | 0.12 - 0.14 |