# PyTorch implementation of Google AI's BERT model with Google's pre-trained models

This repository contains an op-for-op PyTorch reimplementation of [Google's TensorFlow repository for the BERT model](https://github.com/google-research/bert) that was released together with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.

This implementation can load any pre-trained TensorFlow checkpoint for BERT (in particular [Google's pre-trained models](https://github.com/google-research/bert)) and a conversion script is provided (see below).

Code for additionally using [the Multilingual and Chinese models](https://github.com/google-research/bert/blob/master/multilingual.md) will be added later this week (only the tokenization code needs to be updated).

# Documentation

| Section | Content |
|-|-|
| [Installation](#installation) | How to install the package |
| [Content](#content) | Overview of the package |
| [Usage](#usage) | Quickstart examples |
| [Doc](#doc) |  Detailed documentation |
| [Examples](#examples) | Detailed examples on how to fine-tune Bert |
| [Notebooks](#notebooks) | Introduction on the provided Jupyter Notebooks |
| [TPU](#tpu) | Notes on TPU support and pretraining scripts |
| [Command-line interface](#Command-line-interface) | Convert a TensorFlow checkpoint into a PyTorch dump |

# Installation

This repo was tested on Python 3.5+ and PyTorch 0.4.1.

## From pip

`pytorch_pretrained_bert` can be installed with pip as follows:
```bash
pip install pytorch_pretrained_bert
```

## From source

Clone the repository and run:
```bash
pip install [--editable] .
```

A series of tests is included in the [tests folder](https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/tests) and can be run using `pytest` (install pytest if needed: `pip install pytest`).

You can run the tests with the command:
```bash
python -m pytest -sv tests/
```

# Content

This package comprises the following classes that can be imported in Python and are detailed in the [Doc](#doc) section of this readme:

- Six PyTorch models (`torch.nn.Module`) for Bert with pre-trained weights:
  - `BertModel` - raw BERT Transformer model (**fully pre-trained**),
  - `BertForMaskedLM` - BERT Transformer with the pre-trained masked language modeling head on top (**fully pre-trained**),
  - `BertForNextSentencePrediction` - BERT Transformer with the pre-trained next sentence prediction classifier on top  (**fully pre-trained**),
  - `BertForPreTraining` - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (**fully pre-trained**),
  - `BertForSequenceClassification` - BERT Transformer with a sequence classification head on top (BERT Transformer is **pre-trained**, the sequence classification head **is only initialized and has to be trained**),
  - `BertForQuestionAnswering` - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**, the token classification head **is only initialized and has to be trained**).

- Three tokenizers:
  - `BasicTokenizer` - basic tokenization (punctuation splitting, lower casing, etc.),
  - `WordpieceTokenizer` - WordPiece tokenization,
  - `BertTokenizer` - perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.

- One optimizer:
  - `BERTAdam` - Bert version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.

- A configuration class:
  - `BertConfig` - Configuration class to store the configuration of a `BertModel` with utilities to read and write from JSON configuration files.

The repository further comprises:

- Three examples on how to use Bert (in the [`examples` folder](./examples)):
  - [`extract_features.py`](./examples/extract_features.py) - Show how to extract hidden states from an instance of `BertModel`,
  - [`run_classifier.py`](./examples/run_classifier.py) - Show how to fine-tune an instance of `BertForSequenceClassification` on GLUE's MRPC task,
  - [`run_squad.py`](./examples/run_squad.py) - Show how to fine-tune an instance of `BertForQuestionAnswering` on the SQuAD v1.1 task.

  These examples are detailed in the [Examples](#examples) section of this readme.

- Three notebooks that were used to check that the TensorFlow and PyTorch models behave identically (in the [`notebooks` folder](./notebooks)):
  - [`Comparing-TF-and-PT-models.ipynb`](./notebooks/Comparing-TF-and-PT-models.ipynb) - Compare the hidden states predicted by `BertModel`,
  - [`Comparing-TF-and-PT-models-SQuAD.ipynb`](./notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb) - Compare the spans predicted by  `BertForQuestionAnswering` instances,
  - [`Comparing-TF-and-PT-models-MLM-NSP.ipynb`](./notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb) - Compare the predictions of the `BertForPreTraining` instances.

  These notebooks are detailed in the [Notebooks](#notebooks) section of this readme.

- A command-line interface to convert any TensorFlow checkpoint into a PyTorch dump.

  This CLI is detailed in the [Command-line interface](#Command-line-interface) section of this readme.

# Usage

Here is a quick-start example using the `BertForMaskedLM` class with Google AI's pre-trained `Bert base uncased` model:

```python
import torch
from pytorch_pretrained_bert import BertForMaskedLM, BertTokenizer

# Load pre-trained model and tokenizer (weights and vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Prepare tokenized input with a masked token
text = "Who was Jim Henson ? Jim Henson was a puppeteer"
tokenized_text = tokenizer.tokenize(text)
masked_index = 6
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['who', 'was', 'jim', 'henson', '?', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer']

# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Assign sentence A and sentence B indices to 1st (resp 2nd) sentences
segments_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Predict masked tokens with model
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
model.eval()
predictions = model(tokens_tensor, segments_tensors)

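# Read off the most likely token at the masked position
predicted_index = torch.argmax(predictions[0, masked_index]).item()
# convert_ids_to_tokens returns a list, so take its single element
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'henson'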
```
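The hand-written `segments_ids` list in the example above can also be derived from the two sentences. The following is a minimal sketch, assuming each sentence is tokenized separately; `build_segment_ids` is a hypothetical helper and not part of the package:

```python
def build_segment_ids(tokens_a, tokens_b):
    """Return one segment id per token: 0 for sentence A, 1 for sentence B."""
    return [0] * len(tokens_a) + [1] * len(tokens_b)


# Tokens taken from the quick-start example, split at the sentence boundary.
tokens_a = ['who', 'was', 'jim', 'henson', '?']
tokens_b = ['jim', '[MASK]', 'was', 'a', 'puppet', '##eer']
segments_ids = build_segment_ids(tokens_a, tokens_b)
print(segments_ids)
```

The resulting list can be passed to `torch.tensor([segments_ids])` exactly as in the quick-start example.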

# Doc

Here is detailed documentation of the classes in the package.

## Loading pre-trained weights

To load Google AI's pre-trained weights, the PyTorch model classes and the tokenizer can be instantiated as

```python
model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH)
```

where

- `BERT_CLASS` is either the `BertTokenizer` class (to load the vocabulary) or one of the six PyTorch model classes: `BertModel`, `BertForMaskedLM`, `BertForNextSentencePrediction`, `BertForPreTraining`, `BertForSequenceClassification` or `BertForQuestionAnswering` (to load the pre-trained weights), and

- `PRE_TRAINED_MODEL_NAME_OR_PATH` is either:

  - the shortcut name of one of Google AI's pre-trained models, selected from the list:

    - `bert-base-uncased`: 12-layer, 768-hidden, 12-heads, 110M parameters
    - `bert-large-uncased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
    - `bert-base-cased`: 12-layer, 768-hidden, 12-heads, 110M parameters
    - `bert-base-multilingual`: 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
    - `bert-base-chinese`: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

  - a path or URL to a pre-trained model archive containing:
    - `bert_config.json`, a configuration file for the model, and
    - `pytorch_model.bin`, a PyTorch dump of a pre-trained instance of `BertForPreTraining` (saved with the usual `torch.save()`).

If `PRE_TRAINED_MODEL_NAME_OR_PATH` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links [here](pytorch_pretrained_bert/modeling.py)) and stored in a cache folder to avoid future downloads (the cache folder can be found at `~/.pytorch_pretrained_bert/`).

Example:
```python
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
```

## PyTorch models

### 1. `BertModel`

`BertModel` is the basic BERT Transformer model with a layer of summed token, position and sequence embeddings followed by a series of identical self-attention blocks (12 for BERT-base, 24 for BERT-large).

The inputs and outputs are **identical to the TensorFlow model inputs and outputs**.

We detail them here. This model takes as *inputs*:

- `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary (see the token preprocessing logic in the scripts `extract_features.py`, `run_classifier.py` and `run_squad.py`).
- `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token type indices selected in [0, 1]. Type 0 corresponds to a `sentence A` token and type 1 to a `sentence B` token (see the BERT paper for more details).
- `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch. It's the mask that we typically use for attention when a batch has varying length sentences.
- `output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`.

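As an illustration of how `input_ids` and `attention_mask` relate, here is a minimal sketch that pads a batch of variable-length sequences. `pad_batch` is a hypothetical helper, the token ids are arbitrary, and using 0 as the padding id is an assumption; the nested lists would still need to be wrapped in `torch.tensor(...)` before being passed to the model:

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that attention should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Two sequences of different lengths (arbitrary example ids).
input_ids, attention_mask = pad_batch([[101, 2040, 102], [101, 102]])
print(input_ids)
print(attention_mask)
```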
This model *outputs* a tuple composed of:

- `encoded_layers`: controlled by the value of the `output_all_encoded_layers` argument:

  - `output_all_encoded_layers=True`: outputs a list of the encoded hidden states at the end of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large); each encoded hidden state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
  - `output_all_encoded_layers=False`: outputs only the encoded hidden states corresponding to the last attention block,

- `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a classifier pre-trained on top of the hidden state associated with the first token of the input (`[CLS]`) to train on the Next-Sentence task (see the BERT paper).

An example on how to use this class is given in the `extract_features.py` script which can be used to extract the hidden states of the model for a given input.

### 2. `BertForPreTraining`

`BertForPreTraining` includes the `BertModel` Transformer followed by the two pre-training heads:

- the masked language modeling head, and
- the next sentence classification head.

*Inputs* comprise the inputs of the [`BertModel`](###-1.-`BertModel`) class plus two optional labels:

- `masked_lm_labels`: masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked); the loss is only computed for the labels set in [0, ..., vocab_size].
- `next_sentence_label`: next sentence classification labels: torch.LongTensor of shape [batch_size] with indices selected in [0, 1]. 0 => next sentence is the continuation, 1 => next sentence is a random sentence.

*Outputs*:

- if `masked_lm_labels` and `next_sentence_label` are not `None`: Outputs the total_loss which is the sum of the masked language modeling loss and the next sentence classification loss.
- if `masked_lm_labels` or `next_sentence_label` is `None`: Outputs a tuple comprising
  - the masked language modeling logits, and
  - the next sentence classification logits.

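To make the role of the -1 labels concrete, here is a plain-Python sketch of a cross-entropy that skips ignored positions. It illustrates the idea only and is not the package's implementation; `masked_lm_loss` and its `ignore_index` parameter are hypothetical names, and at least one position is assumed to carry a real label:

```python
import math


def masked_lm_loss(logits, labels, ignore_index=-1):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    losses = []
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # ignored position: contributes nothing to the loss
        # log-sum-exp of the logits minus the logit of the true class
        log_norm = math.log(sum(math.exp(x) for x in token_logits))
        losses.append(log_norm - token_logits[label])
    return sum(losses) / len(losses)


# Three positions over a toy vocabulary of size 3; only the middle one is labeled.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
labels = [-1, 2, -1]
loss = masked_lm_loss(logits, labels)
print(loss)
```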
### 3. `BertForMaskedLM`

`BertForMaskedLM` includes the `BertModel` Transformer followed by the (possibly) pre-trained masked language modeling head.

*Inputs* comprise the inputs of the [`BertModel`](###-1.-`BertModel`) class plus an optional label:

- `masked_lm_labels`: masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked); the loss is only computed for the labels set in [0, ..., vocab_size].

*Outputs*:

- if `masked_lm_labels` is not `None`: Outputs the masked language modeling loss.
- if `masked_lm_labels` is `None`: Outputs the masked language modeling logits.

### 4. `BertForNextSentencePrediction`

`BertForNextSentencePrediction` includes the `BertModel` Transformer followed by the next sentence classification head.

*Inputs* comprise the inputs of the [`BertModel`](###-1.-`BertModel`) class plus an optional label:

- `next_sentence_label`: next sentence classification labels: torch.LongTensor of shape [batch_size] with indices selected in [0, 1]. 0 => next sentence is the continuation, 1 => next sentence is a random sentence.

*Outputs*:

- if `next_sentence_label` is not `None`: Outputs the next sentence classification loss.
- if `next_sentence_label` is `None`: Outputs the next sentence classification logits.

### 5. `BertForSequenceClassification`

`BertForSequenceClassification` is a fine-tuning model that includes `BertModel` and a sequence-level (sequence or pair of sequences) classifier on top of the `BertModel`.

The sequence-level classifier is a linear layer that takes as input the last hidden state of the first token in the input sequence (see Figures 3a and 3b in the BERT paper).

An example of how to use this class is given in the `run_classifier.py` script, which can be used to fine-tune a single sequence (or pair of sequences) classifier using BERT, for example for the MRPC task.

### 6. `BertForQuestionAnswering`

`BertForQuestionAnswering` is a fine-tuning model that includes `BertModel` with a token-level classifier on top of the full sequence of last hidden states.

The token-level classifier takes as input the full sequence of last hidden states and computes several (e.g. two) scores for each token, which can for example be the scores that a given token is a `start_span` token and an `end_span` token (see Figures 3c and 3d in the BERT paper).

An example on how to use this class is given in the `run_squad.py` script which can be used to fine-tune a token classifier using BERT, for example for the SQuAD task.

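One simple way to turn the per-token `start_span` and `end_span` scores into an answer is to pick the pair of positions with the highest summed score. The sketch below assumes that decoding rule; `best_span` and `max_answer_length` are hypothetical names, and `run_squad.py` may post-process the scores differently:

```python
def best_span(start_scores, end_scores, max_answer_length=30):
    """Return (start, end) maximizing start_scores[start] + end_scores[end]."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit.
        last = min(len(end_scores), start + max_answer_length)
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Toy scores for a 4-token sequence.
start_scores = [0.1, 2.5, 0.3, 0.2]
end_scores = [0.0, 0.4, 3.0, 0.1]
span = best_span(start_scores, end_scores)
print(span)
```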
## Tokenizers

### `BertTokenizer`

`BertTokenizer` performs end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.

This class has two arguments:

- `vocab_file`: path to a vocabulary file.
- `do_lower_case`: convert text to lower-case while tokenizing. **Default = True**.

and three methods:

- `tokenize(text)`: convert a `str` into a list of `str` tokens by (1) performing basic tokenization and (2) WordPiece tokenization.
- `convert_tokens_to_ids(tokens)`: convert a list of `str` tokens into a list of `int` indices in the vocabulary.
- `convert_ids_to_tokens(ids)`: convert a list of `int` indices into a list of `str` tokens in the vocabulary.

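The WordPiece step can be pictured as a greedy longest-match-first search over the vocabulary. The sketch below illustrates that idea on a toy vocabulary; `wordpiece_tokenize` and `toy_vocab` are hypothetical, and the real `WordpieceTokenizer` may differ in details:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lower-cased word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"puppet", "##eer", "jim", "henson"}
pieces = wordpiece_tokenize("puppeteer", toy_vocab)
print(pieces)
```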
### `BasicTokenizer` and `WordpieceTokenizer`

Please refer to the doc strings and code in [`tokenization.py`](./pytorch_pretrained_bert/tokenization.py) for the details of these classes. In general it is recommended to use `BertTokenizer` unless you know what you are doing.

## Optimizer

### `BERTAdam`

`BERTAdam` is a `torch.optim.Optimizer` adapted to be closer to the optimizer used in the TensorFlow implementation of Bert. The differences with the PyTorch Adam optimizer are the following:

- BERTAdam implements weight decay fix,
- BERTAdam doesn't compensate for bias as in the regular Adam optimizer.

The optimizer accepts the following arguments:

- `lr` : learning rate
- `warmup` : portion of `t_total` used for the warmup, -1 means no warmup. Default : -1
- `t_total` : total number of training steps for the learning rate schedule, -1 means constant learning rate. Default : -1
- `schedule` : schedule to use for the warmup (see above). Default : 'warmup_linear'
- `b1` : Adam's b1. Default : 0.9
- `b2` : Adam's b2. Default : 0.999
- `e` : Adam's epsilon. Default : 1e-6
- `weight_decay_rate` : Weight decay. Default : 0.01
- `max_grad_norm` : Maximum norm for the gradients (-1 means no clipping). Default : 1.0

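For intuition about `warmup` and `t_total`, here is a sketch of a linear warmup followed by a linear decay. It assumes the multiplier ramps from 0 to 1 over the first `warmup` fraction of training and is `1 - progress` afterwards; the function names and the exact formula are assumptions, so check `optimization.py` for the schedule actually used:

```python
def warmup_linear(progress, warmup=0.1):
    """Learning-rate multiplier for a training progress in [0, 1]."""
    if progress < warmup:
        return progress / warmup  # linear ramp from 0 to 1 during warmup
    return 1.0 - progress  # linear decay towards 0 afterwards


def learning_rate(step, t_total, lr=5e-5, warmup=0.1):
    """Scheduled learning rate at a given step (hypothetical helper)."""
    return lr * warmup_linear(step / t_total, warmup)


# Halfway through warmup, then halfway through training.
print(learning_rate(5, 100))
print(learning_rate(50, 100))
```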
# Examples

Fine-tuning the models

## Training large models: introduction, tools and examples

BERT-base and BERT-large are respectively 110M- and 340M-parameter models, and it can be difficult to fine-tune them on a single GPU with the recommended batch size for good performance (in most cases a batch size of 32).

To help with fine-tuning these models, we have included five techniques that you can activate in the fine-tuning scripts `run_classifier.py` and `run_squad.py`: gradient accumulation, multi-GPU training, distributed training, optimization on CPU and 16-bits training. For more details on how to use these techniques you can read [the tips on training large batches in PyTorch](https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255) that I published earlier this month.

Here is how to use these techniques in our scripts:

- **Gradient Accumulation**: Gradient accumulation can be used by supplying an integer greater than 1 to the `--gradient_accumulation_steps` argument. The batch at each step will be divided by this integer and the gradient will be accumulated over `gradient_accumulation_steps` steps.
- **Multi-GPU**: Multi-GPU is automatically activated when several GPUs are detected and the batches are split over the GPUs.
- **Distributed training**: Distributed training can be activated by supplying an integer greater than or equal to 0 to the `--local_rank` argument (see below).
- **Optimize on CPU**: The Adam optimizer stores two moving averages for each weight of the model. If you keep them on GPU 1 (typical behavior), your first GPU will have to store three times the size of the model. This is not optimal for large models like `BERT-large` and means your batch size is a lot lower than it could be. This option will perform the optimization and store the averages on the CPU/RAM to free more room on the GPU(s). As the most computationally intensive operation is usually the backward pass, this doesn't have a significant impact on the training time. Activate this option with `--optimize_on_cpu` on the `run_squad.py` script.
- **16-bits training**: 16-bits training, also called mixed-precision training, can reduce the memory requirement of your model on the GPU by using half-precision training, basically allowing you to double the batch size. If you have a recent GPU (starting from NVIDIA Volta architecture) you should see no decrease in speed. A good introduction to mixed-precision training can be found [here](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) and full documentation is [here](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html). In our scripts, this option can be activated by setting the `--fp16` flag and you can play with loss scaling using the `--loss_scale` flag (see the previously linked documentation for details on loss scaling). If the loss scaling is too high (`NaN` in the gradients) it will be automatically scaled down until the value is acceptable. The default loss scaling is 128, which behaved nicely in our tests.

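The reason gradient accumulation can stand in for a larger batch is that averaging the gradients of equally sized micro-batches reproduces the full-batch gradient. A toy sketch with a one-parameter model `y = w * x` (all names here are hypothetical, and the batch size is assumed to be divisible by the number of accumulation steps):

```python
def grad_mse(w, batch):
    """Gradient w.r.t. w of the mean squared error of y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, accumulation_steps):
    """Accumulate micro-batch gradients as a training loop would."""
    micro_size = len(batch) // accumulation_steps
    total = 0.0
    for i in range(accumulation_steps):
        micro = batch[i * micro_size:(i + 1) * micro_size]
        # Scale each micro-batch gradient so the sum matches the full-batch mean.
        total += grad_mse(w, micro) / accumulation_steps
    return total


batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
full = grad_mse(0.5, batch)
accumulated = accumulated_grad(0.5, batch, accumulation_steps=2)
print(full, accumulated)
```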
Note: To use *Distributed Training*, you will need to run one training script on each of your machines. This can be done for example by running the following command on each server (see [the above-mentioned blog post](https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255) for more details):
```bash
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=$THIS_MACHINE_INDEX --master_addr="192.168.1.1" --master_port=1234 run_classifier.py (--arg1 --arg2 --arg3 and all other arguments of the run_classifier script)
```
Where `$THIS_MACHINE_INDEX` is a sequential index assigned to each of your machines (0, 1, 2...) and the machine with rank 0 has the IP address `192.168.1.1` and an open port `1234`.

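The launch command above starts `nproc_per_node` processes on each of the `nnodes` machines. The sketch below shows how a process's global rank follows from the machine index and the local rank, assuming the usual convention of `torch.distributed.launch`; `global_rank` is a hypothetical helper:

```python
def global_rank(node_rank, local_rank, nproc_per_node):
    """Global rank of a process from its machine index and local rank."""
    return node_rank * nproc_per_node + local_rank


# Values taken from the example command: 2 machines with 4 processes each.
nnodes, nproc_per_node = 2, 4
world_size = nnodes * nproc_per_node
ranks = [
    global_rank(node, local, nproc_per_node)
    for node in range(nnodes)
    for local in range(nproc_per_node)
]
print(world_size, ranks)
```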
## Fine-tuning with BERT: running the examples

We showcase the same examples as [the original implementation](https://github.com/google-research/bert/): fine-tuning a sequence-level classifier on the MRPC classification corpus and a token-level classifier on the question answering dataset SQuAD.

Before running these examples you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`. Please also download the `BERT-Base`
checkpoint, unzip it to some directory `$BERT_BASE_DIR`, and convert it to its PyTorch version as explained in the [Command-line interface](#Command-line-interface) section.

This example code fine-tunes `BERT-Base` on the Microsoft Research Paraphrase
Corpus (MRPC) and runs in less than 10 minutes on a single K80.

```shell
export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/MRPC/ \
  --vocab_file $BERT_BASE_DIR/vocab.txt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --init_checkpoint $BERT_PYTORCH_DIR/pytorch_model.bin \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/
```

Our tests, run on a few seeds with [the original implementation hyper-parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks), gave evaluation results between 84% and 88%.

The second example fine-tunes `BERT-Base` on the SQuAD question answering task.

The data for SQuAD can be downloaded from the following links and should be saved in a `$SQUAD_DIR` directory.

*   [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
*   [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
*   [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)

```shell
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --vocab_file $BERT_BASE_DIR/vocab.txt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --init_checkpoint $BERT_PYTORCH_DIR/pytorch_model.bin \
  --do_train \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ../debug_squad/
```

Training with the previous hyper-parameters gave us the following results:
```bash
{"f1": 88.52381567990474, "exact_match": 81.22043519394512}
```

## Fine-tuning BERT-large on GPUs

The options listed above make it fairly easy to fine-tune BERT-large on GPU(s) instead of the TPU used by the original implementation.

For example, fine-tuning BERT-large on SQuAD can be done on a server with 4 K80 GPUs (these are pretty old now) in 18 hours. Our results are similar to the TensorFlow implementation results (actually slightly higher):
```bash
{"exact_match": 84.56953642384106, "f1": 91.04028647786927}
```
To get these results we used a combination of:
- multi-GPU training (automatically activated on a multi-GPU server),
- 2 steps of gradient accumulation and
- performing the optimization step on CPU to store Adam's averages in RAM.

Here is the full list of hyper-parameters for this run:
```bash
python ./run_squad.py \
  --vocab_file $BERT_LARGE_DIR/vocab.txt \
  --bert_config_file $BERT_LARGE_DIR/bert_config.json \
  --init_checkpoint $BERT_LARGE_DIR/pytorch_model.bin \
  --do_lower_case \
  --do_train \
  --do_predict \
  --train_file $SQUAD_TRAIN \
  --predict_file $SQUAD_EVAL \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir $OUTPUT_DIR \
  --train_batch_size 24 \
  --gradient_accumulation_steps 2 \
  --optimize_on_cpu
```
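To see what these flags mean for memory, the sketch below works out how many examples each GPU processes per forward/backward pass. It assumes four visible GPU devices and that the batch is first divided by `gradient_accumulation_steps` and then split evenly across GPUs, as described in the techniques above; `per_gpu_batch` is a hypothetical helper:

```python
def per_gpu_batch(train_batch_size, gradient_accumulation_steps, n_gpu):
    """Return (examples per accumulation step, examples per GPU per step)."""
    step_batch = train_batch_size // gradient_accumulation_steps
    return step_batch, step_batch // n_gpu


# Values from the command above, with an assumed count of 4 GPUs.
step_batch, gpu_batch = per_gpu_batch(
    train_batch_size=24, gradient_accumulation_steps=2, n_gpu=4
)
print(step_batch, gpu_batch)
```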

If you have a recent GPU (starting from NVIDIA Volta series), you should try **16-bit fine-tuning** (FP16).

Here is an example of hyper-parameters for an FP16 run we tried:
```bash
python ./run_squad.py \
  --vocab_file $BERT_LARGE_DIR/vocab.txt \
  --bert_config_file $BERT_LARGE_DIR/bert_config.json \
  --init_checkpoint $BERT_LARGE_DIR/pytorch_model.bin \
  --do_lower_case \
  --do_train \
  --do_predict \
  --train_file $SQUAD_TRAIN \
  --predict_file $SQUAD_EVAL \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir $OUTPUT_DIR \
  --train_batch_size 24 \
  --fp16 \
  --loss_scale 128
```

The results were similar to the above FP32 results (actually slightly higher):
```bash
{"exact_match": 84.65468306527909, "f1": 91.238669287002}
```
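The automatic scaling-down of the loss scale described in the 16-bits training notes can be pictured as follows. This is a simulation only: the halving factor and the helper names are assumptions, and overflow is modelled by comparing against the largest finite half-precision value (65504) instead of running real FP16 arithmetic:

```python
FP16_MAX = 65504.0  # largest finite half-precision value


def has_overflow(grads, loss_scale):
    """True if any scaled gradient would not fit in half precision."""
    return any(abs(g * loss_scale) > FP16_MAX for g in grads)


def fit_loss_scale(grads, loss_scale=128.0, factor=2.0):
    """Reduce the loss scale until the scaled gradients no longer overflow."""
    while loss_scale > 1.0 and has_overflow(grads, loss_scale):
        loss_scale /= factor
    return loss_scale


# Unscaled toy gradients; the middle one overflows at the default scale of 128.
scale = fit_loss_scale([0.5, 1000.0, -3.0])
print(scale)
```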

# Notebooks

Comparing the PyTorch model and the TensorFlow model predictions

We also include [three Jupyter Notebooks](https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks) that can be used to check that the predictions of the PyTorch model are identical to the predictions of the original TensorFlow model.

- The first notebook ([Comparing TF and PT models.ipynb](https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing%20TF%20and%20PT%20models.ipynb)) extracts the hidden states of a full sequence on each layer of the TensorFlow and the PyTorch models and computes the standard deviation between them. In the given example, we get a standard deviation of 1.5e-7 to 9e-7 on the various hidden states of the models.

- The second notebook ([Comparing TF and PT models SQuAD predictions.ipynb](https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing%20TF%20and%20PT%20models%20SQuAD%20predictions.ipynb)) compares the loss computed by the TensorFlow and the PyTorch models for identical initialization of the fine-tuning layer of `BertForQuestionAnswering` and computes the standard deviation between them. In the given example, we get a standard deviation of 2.5e-7 between the models.
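
As a rough idea of the kind of comparison described above, the following sketch computes the root-mean-square of the element-wise difference between two sets of hidden states (that is, a standard deviation around zero). This is an assumption about the metric, and the notebooks may compute it differently; the function name and the sample values are hypothetical.

```python
import math

def rms_difference(a, b):
    """Root-mean-square of the element-wise difference of two nested lists."""
    diffs = [x - y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Hypothetical hidden states (tokens x features) from each framework.
tf_hidden = [[0.1234567, -0.7654321], [0.5000000, 0.2500000]]
pt_hidden = [[0.1234569, -0.7654318], [0.5000001, 0.2499998]]

deviation = rms_difference(tf_hidden, pt_hidden)
print(f"deviation between models: {deviation:.2e}")
```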

Please follow the instructions given in the notebooks to run and modify them. They can also serve as nice examples of how to use the models in a simpler way than the full fine-tuning scripts we provide.

# Command-line interface

A command-line interface is provided to convert a TensorFlow checkpoint into a PyTorch checkpoint.

You can convert any TensorFlow checkpoint for BERT (in particular [the pre-trained models released by Google](https://github.com/google-research/bert#pre-trained-models)) into a PyTorch save file by using the [`convert_tf_checkpoint_to_pytorch.py`](convert_tf_checkpoint_to_pytorch.py) script.

This script takes as input a TensorFlow checkpoint (three files starting with `bert_model.ckpt`) and the associated configuration file (`bert_config.json`). It creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint into the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using `torch.load()` (see examples in `extract_features.py`, `run_classifier.py` and `run_squad.py`).

You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow checkpoint (the three files starting with `bert_model.ckpt`) but be sure to keep the configuration file (`bert_config.json`) and the vocabulary file (`vocab.txt`) as these are needed for the PyTorch model too.
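
As a small convenience, the following sketch checks that a model directory still contains the files the PyTorch model needs after the TensorFlow checkpoint has been removed. The file names come from the text above and from the conversion example (`pytorch_model.bin` is simply the dump path used there); the helper itself is hypothetical and is not part of this repository.

```python
from pathlib import Path

# File names taken from this README; the helper itself is hypothetical.
REQUIRED_FILES = ("pytorch_model.bin", "bert_config.json", "vocab.txt")

def missing_model_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    directory = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]

if __name__ == "__main__":
    import sys

    # Check the directory given on the command line, or the current one.
    model_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_model_files(model_dir)
    print("missing:", ", ".join(missing) if missing else "none")
```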

To run this specific conversion script you will need to have TensorFlow and PyTorch installed (`pip install tensorflow`). The rest of the repository only requires PyTorch.

Here is an example of the conversion process for a pre-trained `BERT-Base Uncased` model:

```shell
export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12

python convert_tf_checkpoint_to_pytorch.py \
  --tf_checkpoint_path $BERT_BASE_DIR/bert_model.ckpt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --pytorch_dump_path $BERT_BASE_DIR/pytorch_model.bin
```

You can download Google's pre-trained models for the conversion [here](https://github.com/google-research/bert#pre-trained-models).

# TPU

TPU support and pretraining scripts

TPUs are not supported by the current stable release of PyTorch (0.4.1). However, the next version of PyTorch (v1.0) should support training on TPUs and is expected to be released soon (see the recent [official announcement](https://cloud.google.com/blog/products/ai-machine-learning/introducing-pytorch-across-google-cloud)).

We will add TPU support when this next release is published.

The original TensorFlow code further comprises two scripts for pre-training BERT: [create_pretraining_data.py](https://github.com/google-research/bert/blob/master/create_pretraining_data.py) and [run_pretraining.py](https://github.com/google-research/bert/blob/master/run_pretraining.py).

Since pre-training BERT is a particularly expensive operation that basically requires one or several TPUs to be completed in a reasonable amount of time (see details [here](https://github.com/google-research/bert#pre-training-with-bert)), we have decided to wait for the inclusion of TPU support in PyTorch to convert these pre-training scripts.