# PyTorch implementation of Google AI's BERT model with a script to load Google's pre-trained models

## Introduction

This repository contains an op-for-op PyTorch reimplementation of [Google's TensorFlow repository for the BERT model](https://github.com/google-research/bert) that was released together with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.

This implementation can load any pre-trained TensorFlow checkpoint for BERT (in particular [Google's pre-trained models](https://github.com/google-research/bert)) and a conversion script is provided (see below).

The code to also use [the Multilingual and Chinese models](https://github.com/google-research/bert/blob/master/multilingual.md) will be added later this week (only the tokenization code needs to be updated).

## Installation, requirements, test

This code was tested on Python 3.5+. The requirements are:

- PyTorch (>= 0.4.1)
- tqdm

To install the dependencies:

```bash
pip install -r ./requirements.txt
```

A series of tests is included in the [tests folder](https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/tests) and can be run using `pytest` (install pytest if needed: `pip install pytest`).

You can run the tests with the command:
```bash
python -m pytest -sv tests/
```

## PyTorch models for BERT

We include three PyTorch models in this repository, which you will find in [`modeling.py`](modeling.py):

- `BertModel` - the basic BERT Transformer model
- `BertForSequenceClassification` - the BERT model with a sequence classification head on top
- `BertForQuestionAnswering` - the BERT model with a token classification head on top

Here are some details on each class.

### 1. `BertModel`

`BertModel` is the basic BERT Transformer model with a layer of summed token, position and token-type (segment) embeddings followed by a series of identical self-attention blocks (12 for BERT-base, 24 for BERT-large).

The inputs and output are **identical to the TensorFlow model inputs and outputs**.

We detail them here. This model takes as inputs:

- `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary (see the tokens preprocessing logic in the scripts `extract_features.py`, `run_classifier.py` and `run_squad.py`), and
- `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token type indices selected in [0, 1]. Type 0 corresponds to a `sentence A` token and type 1 corresponds to a `sentence B` token (see the BERT paper for more details).
- `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch. It's the mask that we typically use for attention when a batch has varying length sentences.
- `output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`.

This model outputs a tuple composed of:

- `encoded_layers`: controlled by the value of the `output_all_encoded_layers` argument:

  - `output_all_encoded_layers=True`: outputs a list of the encoded-hidden-states at the end of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each encoded-hidden-state being a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
  - `output_all_encoded_layers=False`: outputs only the encoded-hidden-states corresponding to the last attention block,

- `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a classifier pretrained on top of the hidden state associated with the first token of the input (`[CLS]`) to train on the Next-Sentence task (see BERT's paper).

An example of how to use this class is given in the `extract_features.py` script, which can be used to extract the hidden states of the model for a given input.
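
For illustration, here is a minimal sketch of a forward pass through `BertModel` on a converted checkpoint. The paths and the exact loading convention are assumptions; `extract_features.py` remains the reference usage:

```python
import torch
from modeling import BertConfig, BertModel

# Assumed paths to a converted checkpoint (see the conversion section below).
config = BertConfig.from_json_file("bert_config.json")
model = BertModel(config)
model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model.eval()

# Two toy sequences of token indices, padded to the same length.
input_ids = torch.LongTensor([[31, 51, 99, 12], [15, 5, 0, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1, 1], [0, 0, 0, 0]])  # sentence A / sentence B
attention_mask = torch.LongTensor([[1, 1, 1, 1], [1, 1, 0, 0]])  # 0 marks padding

with torch.no_grad():
    encoded_layers, pooled_output = model(input_ids, token_type_ids, attention_mask)
# With output_all_encoded_layers=True (the default), encoded_layers is a list of
# [batch_size, sequence_length, hidden_size] tensors, one per attention block.
```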

### 2. `BertForSequenceClassification`

`BertForSequenceClassification` is a fine-tuning model that includes `BertModel` and a sequence-level (sequence or pair of sequences) classifier on top of the `BertModel`.

The sequence-level classifier is a linear layer that takes as input the last hidden state of the first token in the input sequence (see Figures 3a and 3b in the BERT paper).

An example of how to use this class is given in the `run_classifier.py` script, which can be used to fine-tune a single sequence (or pair of sequences) classifier using BERT, for example for the MRPC task.
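
As a rough sketch of the usage (the constructor and forward signatures are assumptions inferred from this README; `run_classifier.py` remains the reference, and loading of the converted pre-trained weights is omitted here):

```python
import torch
from modeling import BertConfig, BertForSequenceClassification

config = BertConfig.from_json_file("bert_config.json")  # assumed path
model = BertForSequenceClassification(config, 2)  # e.g. 2 labels for MRPC
model.eval()

# Toy batch of two tokenized sequence pairs (segment ids mark sentence A vs B).
input_ids = torch.LongTensor([[31, 51, 12, 23, 99, 0], [15, 5, 12, 8, 99, 0]])
token_type_ids = torch.LongTensor([[0, 0, 0, 1, 1, 0], [0, 0, 0, 1, 1, 0]])
attention_mask = (input_ids != 0).long()

with torch.no_grad():
    # Without labels, the model is assumed to return the classification logits.
    logits = model(input_ids, token_type_ids, attention_mask)
predictions = logits.argmax(dim=-1)  # one predicted class per input pair
```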

### 3. `BertForQuestionAnswering`

`BertForQuestionAnswering` is a fine-tuning model that includes `BertModel` with a token-level classifier on top of the full sequence of last hidden states.

The token-level classifier takes as input the full sequence of last hidden states and computes several (e.g. two) scores for each token, which can, for example, respectively be the score that a given token is a `start_span` or an `end_span` token (see Figures 3c and 3d in the BERT paper).

An example of how to use this class is given in the `run_squad.py` script, which can be used to fine-tune a token classifier using BERT, for example for the SQuAD task.
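
As a rough sketch (the signatures are again assumptions inferred from this README; `run_squad.py` remains the reference, and checkpoint loading is omitted):

```python
import torch
from modeling import BertConfig, BertForQuestionAnswering

config = BertConfig.from_json_file("bert_config.json")  # assumed path
model = BertForQuestionAnswering(config)
model.eval()

# Toy (question, passage) pair: segment ids 0 for the question, 1 for the passage.
input_ids = torch.LongTensor([[31, 12, 99, 7, 45, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1, 1, 1, 0]])
attention_mask = torch.LongTensor([[1, 1, 1, 1, 1, 0]])

with torch.no_grad():
    # Without start/end positions, the model is assumed to return the two sets of logits.
    start_logits, end_logits = model(input_ids, token_type_ids, attention_mask)
# start_logits and end_logits have shape [batch_size, sequence_length]; the predicted
# answer span is taken from the highest scoring (start, end) token positions.
```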


## Converting a TensorFlow checkpoint to a PyTorch checkpoint

You can convert any TensorFlow checkpoint for BERT (in particular [the pre-trained models released by Google](https://github.com/google-research/bert#pre-trained-models)) into a PyTorch save file by using the [`convert_tf_checkpoint_to_pytorch.py`](convert_tf_checkpoint_to_pytorch.py) script.

This script takes as input a TensorFlow checkpoint (three files starting with `bert_model.ckpt`) and the associated configuration file (`bert_config.json`), creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint into the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using `torch.load()` (see examples in `extract_features.py`, `run_classifier.py` and `run_squad.py`).

You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow checkpoint (the three files starting with `bert_model.ckpt`) but be sure to keep the configuration file (`bert_config.json`) and the vocabulary file (`vocab.txt`) as these are needed for the PyTorch model too.

To run this specific conversion script you will need to have TensorFlow and PyTorch installed (`pip install tensorflow`). The rest of the repository only requires PyTorch.

Here is an example of the conversion process for a pre-trained `BERT-Base Uncased` model:

```shell
export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12

python convert_tf_checkpoint_to_pytorch.py \
  --tf_checkpoint_path $BERT_BASE_DIR/bert_model.ckpt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --pytorch_dump_path $BERT_BASE_DIR/pytorch_model.bin
```

You can download Google's pre-trained models for the conversion [here](https://github.com/google-research/bert#pre-trained-models).

## Training on large batches: gradient accumulation, multi-GPU and distributed training

BERT-base and BERT-large have 110M and 340M parameters respectively, and it can be difficult to fine-tune them on a single GPU with the batch size recommended for good performance (in most cases a batch size of 32).

To help with fine-tuning these models, we have included five techniques that you can activate in the fine-tuning scripts `run_classifier.py` and `run_squad.py`: gradient accumulation, multi-GPU training, distributed training, optimization on CPU and 16-bit training. For more details on how to use these techniques you can read [the tips on training large batches in PyTorch](https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255) that I published earlier this month.

Here is how to use these techniques in our scripts:

- **Gradient Accumulation**: Gradient accumulation can be used by supplying an integer greater than 1 to the `--gradient_accumulation_steps` argument. The batch at each step will be divided by this integer and the gradient will be accumulated over `gradient_accumulation_steps` steps (a minimal sketch is given after this list).
- **Multi-GPU**: Multi-GPU training is automatically activated when several GPUs are detected; the batches are split across the GPUs.
- **Distributed training**: Distributed training can be activated by supplying an integer greater than or equal to 0 to the `--local_rank` argument (see below).
- **Optimize on CPU**: The Adam optimizer stores two moving averages of the weights of the model. If you keep them on GPU 1 (the typical behavior), your first GPU will have to store 3 times the size of the model. This is not optimal for large models like `BERT-large` and means your batch size is a lot lower than it could be. This option performs the optimization step and stores the averages in CPU RAM to free more room on the GPU(s). As the most computationally intensive operation is usually the backward pass, this doesn't have a significant impact on the training time. Activate this option with `--optimize_on_cpu` on the `run_squad.py` script.
- **16-bit training**: 16-bit training, also called mixed-precision training, can reduce the memory requirement of your model on the GPU by using half-precision training, basically allowing you to double the batch size. If you have a recent GPU (starting from the NVIDIA Volta architecture) you should see no decrease in speed. A good introduction to mixed-precision training can be found [here](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) and full documentation is [here](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html). In our scripts, this option can be activated by setting the `--fp16` flag and you can play with loss scaling using the `--loss_scale` flag (see the previously linked documentation for details on loss scaling). If the loss scaling is too high (`NaN` in the gradients) it will be automatically scaled down until the value is acceptable. The default loss scaling is 128, which behaved nicely in our tests.
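
For reference, here is a minimal, generic sketch of what gradient accumulation amounts to in PyTorch (this is not the exact code of the scripts; the real logic lives in `run_classifier.py` and `run_squad.py`):

```python
import torch
from torch import nn

# A tiny stand-in model and some random batches, just to illustrate the pattern.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
gradient_accumulation_steps = 2
batches = [(torch.randn(8, 10), torch.randint(0, 2, (8,), dtype=torch.long)) for _ in range(4)]

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(batches):
    loss = criterion(model(inputs), labels)
    # Scale the loss so the accumulated gradient matches one large batch.
    (loss / gradient_accumulation_steps).backward()
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```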

Note: to use *Distributed Training*, you will need to run one training script on each of your machines. This can be done for example by running the following command on each server (see [the above-mentioned blog post](https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255) for more details):
```bash
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=$THIS_MACHINE_INDEX --master_addr="192.168.1.1" --master_port=1234 run_classifier.py (--arg1 --arg2 --arg3 and all other arguments of the run_classifier script)
```
where `$THIS_MACHINE_INDEX` is a sequential index assigned to each of your machines (0, 1, 2...), and the machine with rank 0 has the IP address `192.168.1.1` and an open port `1234`.

## TPU support and pretraining scripts

TPUs are not supported by the current stable release of PyTorch (0.4.1). However, the next version of PyTorch (v1.0) should support training on TPU and is expected to be released soon (see the recent [official announcement](https://cloud.google.com/blog/products/ai-machine-learning/introducing-pytorch-across-google-cloud)).

We will add TPU support when this next release is published.

The original TensorFlow code further comprises two scripts for pre-training BERT: [create_pretraining_data.py](https://github.com/google-research/bert/blob/master/create_pretraining_data.py) and [run_pretraining.py](https://github.com/google-research/bert/blob/master/run_pretraining.py).

Since pre-training BERT is a particularly expensive operation that basically requires one or several TPUs to be completed in a reasonable amount of time (see details [here](https://github.com/google-research/bert#pre-training-with-bert)), we have decided to wait for the inclusion of TPU support in PyTorch before converting these pre-training scripts.

## Comparing the PyTorch model and the TensorFlow model predictions

We also include [two Jupyter Notebooks](https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks) that can be used to check that the predictions of the PyTorch model are identical to the predictions of the original TensorFlow model.

- The first NoteBook ([Comparing TF and PT models.ipynb](https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing%20TF%20and%20PT%20models.ipynb)) extracts the hidden states of a full sequence on each layer of the TensorFlow and the PyTorch models and computes the standard deviation between them. In the given example, we get a standard deviation of 1.5e-7 to 9e-7 on the various hidden states of the models.

- The second NoteBook ([Comparing TF and PT models SQuAD predictions.ipynb](https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing%20TF%20and%20PT%20models%20SQuAD%20predictions.ipynb)) compares the loss computed by the TensorFlow and the PyTorch models for identical initialization of the fine-tuning layer of the `BertForQuestionAnswering` model and computes the standard deviation between them. In the given example, we get a standard deviation of 2.5e-7 between the models.

Please follow the instructions given in the notebooks to run and modify them. They can also serve as nice examples of how to use the models in a simpler way than the full fine-tuning scripts we provide.
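
As an illustration of the kind of check these notebooks perform, here is a tiny sketch (the file names are hypothetical; the notebooks contain the full model-loading and extraction code):

```python
import numpy as np

# Hidden states extracted for the same input from the TensorFlow and PyTorch models,
# e.g. arrays of shape [sequence_length, hidden_size] dumped from each run (hypothetical files).
tf_hidden = np.load("tf_hidden_states.npy")
pt_hidden = np.load("pt_hidden_states.npy")

# The notebooks report the standard deviation of the element-wise differences,
# which should be on the order of 1e-7 if the two implementations match.
print(np.std(tf_hidden - pt_hidden))
```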

## Fine-tuning with BERT: running the examples

We showcase the same examples as [the original implementation](https://github.com/google-research/bert/): fine-tuning a sequence-level classifier on the MRPC classification corpus and a token-level classifier on the question answering dataset SQuAD.

Before running these examples you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`. Please also download the `BERT-Base`
checkpoint, unzip it to some directory `$BERT_BASE_DIR`, and convert it to its PyTorch version as explained in the previous section.

This example code fine-tunes `BERT-Base` on the Microsoft Research Paraphrase Corpus (MRPC) and runs in less than 10 minutes on a single K-80.

```shell
export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/MRPC/ \
  --vocab_file $BERT_BASE_DIR/vocab.txt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --init_checkpoint $BERT_PYTORCH_DIR/pytorch_model.bin \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/
```

Our tests, run on a few seeds with [the original implementation hyper-parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks), gave evaluation results between 84% and 88%.

The second example fine-tunes `BERT-Base` on the SQuAD question answering task.

The data for SQuAD can be downloaded with the following links and should be saved in a `$SQUAD_DIR` directory.

*   [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
*   [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
*   [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)

```shell
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --vocab_file $BERT_BASE_DIR/vocab.txt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --init_checkpoint $BERT_PYTORCH_DIR/pytorch_model.bin \
  --do_train \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ../debug_squad/
```

Training with the previous hyper-parameters gave us the following results:
```bash
{"f1": 88.52381567990474, "exact_match": 81.22043519394512}
```

## Fine-tuning BERT-large on GPUs

The options we list above allow you to fine-tune BERT-large rather easily on GPU(s) instead of the TPU used by the original implementation.

For example, fine-tuning BERT-large on SQuAD can be done on a server with 4 K-80s (these are pretty old now) in 18 hours. Our results are similar to the TensorFlow implementation results (actually slightly higher):
```bash
{"exact_match": 84.56953642384106, "f1": 91.04028647786927}
```
To get these results we used a combination of:
- multi-GPU training (automatically activated on a multi-GPU server),
- 2 steps of gradient accumulation, and
- performing the optimization step on CPU to store Adam's averages in RAM.

Here is the full list of hyper-parameters for this run:
```bash
python ./run_squad.py \
  --vocab_file $BERT_LARGE_DIR/vocab.txt \
  --bert_config_file $BERT_LARGE_DIR/bert_config.json \
  --init_checkpoint $BERT_LARGE_DIR/pytorch_model.bin \
  --do_lower_case \
  --do_train \
  --do_predict \
  --train_file $SQUAD_TRAIN \
  --predict_file $SQUAD_EVAL \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir $OUTPUT_DIR \
  --train_batch_size 24 \
  --gradient_accumulation_steps 2 \
  --optimize_on_cpu
```

If you have a recent GPU (starting from NVIDIA Volta series), you should try **16-bit fine-tuning** (FP16).

Here is an example of hyper-parameters for an FP16 run we tried:
```bash
python ./run_squad.py \
  --vocab_file $BERT_LARGE_DIR/vocab.txt \
  --bert_config_file $BERT_LARGE_DIR/bert_config.json \
  --init_checkpoint $BERT_LARGE_DIR/pytorch_model.bin \
  --do_lower_case \
  --do_train \
  --do_predict \
  --train_file $SQUAD_TRAIN \
  --predict_file $SQUAD_EVAL \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir $OUTPUT_DIR \
  --train_batch_size 24 \
  --fp16 \
  --loss_scale 128
```

The results were similar to the above FP32 results (actually slightly higher):
```bash
{"exact_match": 84.65468306527909, "f1": 91.238669287002}
```