---
title: "1-bit Adam: Up to 5x less communication volume and up to 2x faster training"
---

In this tutorial, we introduce the 1-bit Adam optimizer in DeepSpeed. 1-bit Adam can improve model training speed on communication-constrained clusters, especially for communication-intensive large models, by reducing the overall communication volume by up to 5x. A detailed description of the 1-bit Adam algorithm, its implementation in DeepSpeed, and a performance evaluation is available in our [blog post](https://www.deepspeed.ai/news/2020/09/08/onebit-adam-blog-post.html). We also have a [paper](https://arxiv.org/abs/2102.02888) that provides the most complete details, including the algorithm, system implementation, theoretical analysis, and more evaluations.

To illustrate the benefits and usage of the 1-bit Adam optimizer in DeepSpeed, we use the following two training tasks as examples:

1. BingBertSQuAD Fine-tuning
2. BERT Pre-training

For more details on these tasks, please refer to the tutorial posts on [BingBertSQuAD Fine-tuning](/tutorials/bert-finetuning/) and [BERT Pre-training](/tutorials/bert-pretraining/).

## 1. Overview

### Pre-requisites for installing DeepSpeed

If you don't already have a copy of the DeepSpeed repository, please clone it now and check out the DeepSpeedExamples submodule that contains the BingBertSQuAD and BERT Pre-training examples.

```shell
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
git submodule update --init --recursive
cd DeepSpeedExamples/
```

### Pre-requisites for 1-bit Adam

1-bit Adam uses advanced communication schemes that are not yet supported by PyTorch distributed and NCCL. We rely on Message Passing Interface (MPI) for these advanced communication primitives.

We package the necessary dependencies in the DeepSpeed Docker images. However, if you are using a different build system, please install MPI and mpi4py on your system. To install the prerequisites, run:

```shell
pip install deepspeed[1bit_adam]
```

We have tested CUDA-Aware MPI communication using the [MVAPICH2-GDR](http://mvapich.cse.ohio-state.edu/userguide/gdr/) library. However, any CUDA-Aware communication library including [OpenMPI](https://www.open-mpi.org/) should work fine with these examples.
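
As an optional sanity check, you can verify that mpi4py and the underlying MPI library work across your nodes before launching training. The small script below is not part of the DeepSpeed examples, and its file name is just a placeholder:

```python
# check_mpi.py -- optional sanity check that mpi4py and the MPI runtime work.
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Each rank reports its position in the world communicator and its host name.
print(f"rank {comm.Get_rank()} of {comm.Get_size()} on {MPI.Get_processor_name()}")
```

Launching it with, for example, `mpirun -np 4 python check_mpi.py` should print one line per process; if this fails, the MPI-based communication used by 1-bit Adam will not work either.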

An example launch command for 1-bit Adam using the `deepspeed` launcher is as follows:

```shell
deepspeed --launcher=[mvapich|openmpi] script.py
```

Please note that because 1-bit Adam uses the MPI backend to communicate during the compression stage, the `--launcher=[mvapich|openmpi]` flag is required when using the `deepspeed` launcher.

Alternatively, the standard `mpirun` launcher can also be used as follows:

```shell
mpirun -np [#processes] -ppn [#GPUs on each node] -hostfile [hostfile] [MPI flags] bash [training_script.sh]
```

### 1-bit Algorithm

A detailed description of the 1-bit Adam algorithm can be found in our [blog post](https://www.deepspeed.ai/news/2020/09/09/onebit-adam-blog-post.html).
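
Roughly speaking, the algorithm runs vanilla Adam with full-precision communication during a warm-up stage, then freezes the variance term and communicates the momentum in a compressed, error-compensated 1-bit form. The snippet below is only a minimal, illustrative sketch of that error-compensated compression idea; it is not DeepSpeed's implementation, and the per-tensor scaling choice is an assumption:

```python
import torch

def onebit_compress(tensor: torch.Tensor, error: torch.Tensor):
    """Illustrative 1-bit compression with error feedback (not DeepSpeed's code)."""
    corrected = tensor + error             # fold in the residual from the previous step
    scale = corrected.abs().mean()         # one scale per tensor (an assumed choice)
    compressed = scale * corrected.sign()  # what travels: 1 bit per element plus one scale
    new_error = corrected - compressed     # residual kept locally for the next step
    return compressed, new_error

# Toy usage: compress a momentum chunk on one worker over a few steps.
momentum = torch.randn(8)
error = torch.zeros_like(momentum)
for _ in range(3):
    sent, error = onebit_compress(momentum, error)
```

Because the residual is carried over instead of discarded, the compression error is compensated in later steps, which is what makes the aggressive 1-bit representation workable during the compressed stage.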

### Configuration of 1-bit Adam
The 1-bit Adam feature can be used by setting the optimizer configuration options as follows. An example JSON config file is shown below.

```json
{
  "train_batch_size": 4096,
  "train_micro_batch_size_per_gpu": 64,
  "optimizer": {
    "type": "OneBitAdam",
    "params": {
      "lr": 2e-4,
      "freeze_step": 400,
      "cuda_aware": true
    }
  },
  "fp16": {
    "enabled": true
  }
}
```
Please note the two new parameters, `freeze_step` and `cuda_aware`, that have been added to support the 1-bit Adam feature.

`cuda_aware` is used to indicate that the underlying MPI library supports CUDA-Aware communication.
This feature is only supported on systems with an InfiniBand interconnect and a CUDA-Aware MPI library like [MVAPICH2-GDR](http://mvapich.cse.ohio-state.edu/userguide/gdr/) or OpenMPI built with CUDA-Aware support. Setting `cuda_aware` to false allows training on Ethernet-based systems; however, the communication will then involve sender- and receiver-side memory copies between CPU and GPU buffers before and after communication.

`freeze_step` is the number of warm-up steps before 1-bit compression gets applied to the communication. To determine the number of warm-up steps, one strategy is to set it to 15-25% of the total training steps for a given model. If that provides the desired outcome, one can try to extract more performance by reducing the number of steps systematically. In the future, we plan to introduce a threshold that can automatically search for and decide the number of warm-up steps for different models. The examples below have been tuned for the number of warm-up steps, and the `freeze_step` parameter has already been set to the best number we found in the corresponding run scripts.
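
For reference, a config file like the one above is consumed the same way as any other DeepSpeed config: the training script passes it to `deepspeed.initialize`, which constructs the `OneBitAdam` optimizer from the `optimizer` section. The sketch below is not the tutorial's actual script; the model and file names are placeholders and only show where the JSON file enters the picture:

```python
import argparse

import torch
import deepspeed

parser = argparse.ArgumentParser()
# Adds the standard DeepSpeed arguments, including --deepspeed and --deepspeed_config.
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()

model = torch.nn.Linear(1024, 1024)  # placeholder model for illustration only

# DeepSpeed reads the JSON passed via --deepspeed_config and builds OneBitAdam
# (including freeze_step and cuda_aware) from its "optimizer" section.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
)
```

Such a script would then be launched as described above, e.g. `deepspeed --launcher=openmpi train.py --deepspeed --deepspeed_config deepspeed_onebitadam_bsz96_config.json` (the script name here is a placeholder).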

## 2. BingBertSQuAD Fine-tuning with 1-bit Adam

* Download the SQuAD dataset:
  * Training set: [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
  * Validation set: [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
* Download the HuggingFace checkpoint and config files:
  * [bert-large-uncased-whole-word-masking](https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-pytorch_model.bin)
  * [bert json config](https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-config.json)

You can also use a pre-trained BERT model checkpoint from either DeepSpeed, [HuggingFace](https://github.com/huggingface/transformers), or [TensorFlow](https://github.com/google-research/bert#pre-trained-models) to run the fine-tuning.
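
If you prefer to script the download, the SQuAD files listed above can also be fetched with Python's standard library. This is only a convenience sketch and is not part of the tutorial scripts:

```python
import urllib.request

# SQuAD v1.1 files referenced above; the HuggingFace checkpoint and config URLs
# can be added to this dictionary in the same way if desired.
files = {
    "train-v1.1.json": "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json",
    "dev-v1.1.json": "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json",
}
for name, url in files.items():
    urllib.request.urlretrieve(url, name)
    print(f"downloaded {name}")
```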

### 2.1 Running BingBertSQuAD with DeepSpeed and 1-bit Adam

The main part of training is done in `nvidia_run_squad_deepspeed.py`, which has
already been modified to use DeepSpeed. The `run_squad_deepspeed.sh` script
helps to invoke training and set up several different hyperparameters relevant
to the training process.

- **DeepSpeed-enabled:** Start training with DeepSpeed by providing the following 4 arguments to this script:

```shell
bash run_squad_deepspeed.sh <NUM_GPUS> <PATH_TO_CHECKPOINT> <PATH_TO_DATA_DIR> <PATH_TO_OUTPUT_DIR>
```

The first argument is the number of GPUs to train with, the second is the path to the pre-training checkpoint, the third is the path to the training and validation sets (e.g., train-v1.1.json), and the fourth is the path to an output folder where the results will be saved. This script will invoke `nvidia_run_squad_deepspeed.py`.

- **DeepSpeed with 1-bit Adam enabled:** In order to run with the 1-bit Adam feature enabled, the same script (`nvidia_run_squad_deepspeed.py`) can be used, but there are two options for launching it properly: 1) launching with the `deepspeed` launcher and 2) launching with `mpirun`.

To enable the 1-bit compressed training, 1-bit Adam uses an MPI library (e.g., MVAPICH2-GDR, OpenMPI) as the communication backend, which means that we can use `mpirun` to launch the training job. However, our user-friendly launcher called `deepspeed` has been enhanced to launch MPI jobs as well.

### Launch with deepspeed

The following helper script in `DeepSpeedExamples/BingBertSQuAD` will launch the training without the need to set any `mpirun` parameters. The number of nodes and GPUs will be automatically detected and the job will be launched on all the available resources.

```shell
bash run_squad_deepspeed_onebitadam.sh <PATH_TO_OUTPUT_DIR>
```

### Launch with mpirun

Alternatively, the standard `mpirun` launcher can be used to launch the fine-tuning job as follows.

```shell
mpirun -np [#processes] -ppn [#GPUs on each node] -hostfile [hostfile] [MPI flags] bash run_squad_mpi_onebitadam.sh
```

For example, in order to use 32 GPUs (4 GPUs/node, 8 nodes in total) with InfiniBand support, you can use the `mpirun` launcher packaged with the MVAPICH2 library and run the following command:

```shell
mpirun -np 32 -ppn 4 -hostfile hosts -env MV2_USE_CUDA=1 -env MV2_SUPPORT_DL=1 -env MV2_ENABLE_AFFINITY=0 -env MV2_SMP_USE_CMA=0 bash run_squad_mpi_onebitadam.sh
```

### 2.2 Configuration for BingBertSQuAD with DeepSpeed and 1-bit Adam enabled

The `deepspeed_onebitadam_bsz96_config.json` file gives the user the ability to specify DeepSpeed
options in terms of batch size, micro batch size, optimizer, learning rate, and other parameters.
When running `nvidia_run_squad_deepspeed.py`, in addition to the
`--deepspeed` flag to enable DeepSpeed, the appropriate DeepSpeed configuration
file must be specified using `--deepspeed_config deepspeed_onebitadam_bsz96_config.json`.

Table 1 shows the fine-tuning configuration we used in our experiments.

| Parameters                     | Value          |
| ------------------------------ | -------------- |
| Total batch size               | 96             |
| Train micro batch size per GPU | 3              |
| Optimizer                      | **OneBitAdam** |
| Learning rate                  | 3e-5           |
| Sequence-length                | 384            |
| Weight-decay                   | 0.0            |
| Epoch count                    | 2              |
| **freeze_step**                | 400            |
| **cuda_aware**                 | True           |

Table 1. Fine-tuning configuration

**Note:** For more details about checkpoint loading, argument parsing, initialization, forward pass, backward pass, weight update, and evaluation, please refer to the [BingBertSQuAD Fine-tuning](/tutorials/bert-finetuning/) tutorial.

### 2.3 Performance Results for BingBertSQuAD Fine-tuning

***Accuracy:***
The results are summarized in the table below. The total batch size is set to 96 and training is conducted
on 32 GPUs for 2 epochs. A set of parameters (seeds and learning rates) was tried and the best ones were selected; we fixed the learning rate to 3e-5. The table below shows the F1 and EM scores we achieved, which are on par with or better than the [HuggingFace results](https://github.com/huggingface/transformers/tree/master/examples/question-answering).

| Case        | Model                                 | Precision | EM    | F1    |
| ----------- | ------------------------------------- | --------- | ----- | ----- |
| HuggingFace | [Bert-large-uncased-whole-word-masking](https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-pytorch_model.bin) | FP16      | 87.26 | 93.32 |


***Training Speed and Scalability:***

1-bit Adam enables an overall training speedup of up to 2.7x for SQuAD fine-tuning. This is made possible by up to 6.2x faster throughput during the compressed stage of the algorithm, as shown in Figure 1.

![SQuAD Finetuning](/assets/images/squad-scaling.png){: .align-center}

Figure 1: Scalability of 1-bit Adam for SQuAD fine-tuning on V100 GPUs with a batch size of 3/GPU.


## 3. BERT Pre-training with 1-bit Adam
For data downloading and pre-processing, please refer to the [BERT Pre-training](/tutorials/bert-pretraining/) post.

### 3.1 Running Pre-training with DeepSpeed and 1-bit Adam

The main part of training is done in `deepspeed_train.py`, which has
already been modified to use DeepSpeed. The `ds_train_bert_onebit_bsz4k_seq128.sh` and `ds_train_bert_bsz64k_seq128.sh`
are the shell scripts that help to invoke training and set up several different hyperparameters relevant
to the training process.

- **DeepSpeed-enabled:** Start training with DeepSpeed by running the command below:

```shell
bash ds_train_bert_bsz64k_seq128.sh
```

- **DeepSpeed with 1-bit Adam enabled:** In order to run with the 1-bit Adam feature enabled, the same script (`deepspeed_train.py`) can be used, but there are two options for launching it properly:

### Launch with deepspeed

As discussed for BingBertSQuAD fine-tuning, we can simply use the `deepspeed` launcher to launch our BERT pre-training jobs as follows.

```shell
bash ds_train_bert_onebit_bsz4k_seq128.sh
```

### Launch with mpirun

Alternatively, use the following command to launch using `mpirun`.

```shell
mpirun -np [#processes] -ppn [#GPUs on each node] -hostfile [hostfile] [MPI flags] bash mpi_train_bert_onebit_bsz4k_seq128.sh
```

For example, in order to use 32 GPUs (4 GPUs/node, 8 nodes in total) with InfiniBand support, you can use MVAPICH2 as the launcher and run the following command:

```shell
mpirun -np 32 -ppn 4 -hostfile hosts -env MV2_USE_CUDA=1 -env MV2_SUPPORT_DL=1 -env MV2_ENABLE_AFFINITY=0 -env MV2_SMP_USE_CMA=0 bash ds_train_bert_onebit_bsz4k_seq128.sh
```

### 3.2 Configuration for BERT Pre-training with DeepSpeed and 1-bit Adam enabled

The `deepspeed_bsz4k_onebit_config_seq128.json` file gives the user the ability to specify DeepSpeed
options in terms of batch size, micro batch size, optimizer, learning rate, and other parameters.

Below is the DeepSpeed configuration file for running BERT-large pre-training with a sequence length of 128 using the 1-bit Adam optimizer.

```json
{
  "train_batch_size": 4096,
  "train_micro_batch_size_per_gpu": 16,
  "steps_per_print": 100,
  "prescale_gradients": false,
  "optimizer": {
    "type": "OneBitAdam",
    "params": {
      "lr": 4e-4,
      "weight_decay": 0.01,
      "bias_correction": false,
      "freeze_step": 23000,
      "cuda_aware": true
    }
  },
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  }
}
```
The above file is for BERT-large. For BERT-base training (sequence length 128), the suggested `freeze_step` needs to be changed to 16000. For the rest of the pre-training with sequence length 512, we suggest using a `freeze_step` of 1500. Also make sure to set `cuda_aware` correctly as described above.

### 3.3 Performance Results for BERT Pre-training

Performance results of BERT pre-training can be found in our detailed [blog post](https://www.deepspeed.ai/news/2020/09/09/onebit-adam-blog-post.html).