README.md 75.2 KB
Newer Older
Thomas Wolf's avatar
Thomas Wolf committed
1
# PyTorch Pretrained BERT: The Big & Extending Repository of pretrained Transformers
VictorSanh's avatar
VictorSanh committed
2

Julien Chaumond's avatar
Julien Chaumond committed
3
4
[![CircleCI](https://circleci.com/gh/huggingface/pytorch-pretrained-BERT.svg?style=svg)](https://circleci.com/gh/huggingface/pytorch-pretrained-BERT)

thomwolf's avatar
thomwolf committed
5
This repository contains op-for-op PyTorch reimplementations, pre-trained models and fine-tuning examples for:
VictorSanh's avatar
VictorSanh committed
6

thomwolf's avatar
thomwolf committed
7
- [Google's BERT model](https://github.com/google-research/bert),
thomwolf's avatar
thomwolf committed
8
9
- [OpenAI's GPT model](https://github.com/openai/finetune-transformer-lm),
- [Google/CMU's Transformer-XL model](https://github.com/kimiyoung/transformer-xl), and
Thomas Wolf's avatar
Thomas Wolf committed
10
- [OpenAI's GPT-2 model](https://blog.openai.com/better-language-models/).
thomwolf's avatar
thomwolf committed
11
12
13
14
15
16

These implementations have been tested on several datasets (see the examples) and should match the performances of the associated TensorFlow implementations (e.g. ~91 F1 on SQuAD for BERT, ~88 F1 on RocStories for OpenAI GPT and ~18.3 perplexity on WikiText 103 for the Transformer-XL). You can find more details in the [Examples](#examples) section below.

Here are some information on these models:

**BERT** was released together with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
thomwolf's avatar
thomwolf committed
17
18
This PyTorch implementation of BERT is provided with [Google's pre-trained models](https://github.com/google-research/bert), examples, notebooks and a command-line interface to load any pre-trained TensorFlow checkpoint for BERT is also provided.

thomwolf's avatar
thomwolf committed
19
**OpenAI GPT** was released together with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
thomwolf's avatar
thomwolf committed
20
This PyTorch implementation of OpenAI GPT is an adaptation of the [PyTorch implementation by HuggingFace](https://github.com/huggingface/pytorch-openai-transformer-lm) and is provided with [OpenAI's pre-trained model](https://github.com/openai/finetune-transformer-lm) and a command-line interface that was used to convert the pre-trained NumPy checkpoint in PyTorch.
thomwolf's avatar
thomwolf committed
21
22

**Google/CMU's Transformer-XL** was released together with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](http://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
Davide Fiocco's avatar
Davide Fiocco committed
23
This PyTorch implementation of Transformer-XL is an adaptation of the original [PyTorch implementation](https://github.com/kimiyoung/transformer-xl) which has been slightly modified to match the performances of the TensorFlow implementation and allow to re-use the pretrained weights. A command-line interface is provided to convert TensorFlow checkpoints in PyTorch models.
24

Davide Fiocco's avatar
Davide Fiocco committed
25
**OpenAI GPT-2** was released together with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
Thomas Wolf's avatar
Thomas Wolf committed
26
This PyTorch implementation of OpenAI GPT-2 is an adaptation of the [OpenAI's implementation](https://github.com/openai/gpt-2) and is provided with [OpenAI's pre-trained model](https://github.com/openai/gpt-2) and a command-line interface that was used to convert the TensorFlow checkpoint in PyTorch.
thomwolf's avatar
thomwolf committed
27
28


thomwolf's avatar
thomwolf committed
29
## Content
30

thomwolf's avatar
thomwolf committed
31
| Section | Description |
thomwolf's avatar
thomwolf committed
32
|-|-|
thomwolf's avatar
thomwolf committed
33
34
35
36
37
38
| [Installation](#installation) | How to install the package |
| [Overview](#overview) | Overview of the package |
| [Usage](#usage) | Quickstart examples |
| [Doc](#doc) |  Detailed documentation |
| [Examples](#examples) | Detailed examples on how to fine-tune Bert |
| [Notebooks](#notebooks) | Introduction on the provided Jupyter Notebooks |
thomwolf's avatar
thomwolf committed
39
| [TPU](#tpu) | Notes on TPU support and pretraining scripts |
thomwolf's avatar
thomwolf committed
40
| [Command-line interface](#Command-line-interface) | Convert a TensorFlow checkpoint in a PyTorch dump |
thomwolf's avatar
thomwolf committed
41

thomwolf's avatar
thomwolf committed
42
## Installation
VictorSanh's avatar
VictorSanh committed
43

thomwolf's avatar
thomwolf committed
44
This repo was tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 0.4.1/1.0.0
VictorSanh's avatar
VictorSanh committed
45

thomwolf's avatar
thomwolf committed
46
### With pip
thomwolf's avatar
thomwolf committed
47

thomwolf's avatar
thomwolf committed
48
49
PyTorch pretrained bert can be installed by pip as follows:
```bash
Joel Grus's avatar
Joel Grus committed
50
pip install pytorch-pretrained-bert
thomwolf's avatar
thomwolf committed
51
```
VictorSanh's avatar
VictorSanh committed
52

thomwolf's avatar
thomwolf committed
53
If you want to reproduce the original tokenization process of the `OpenAI GPT` paper, you will need to install `ftfy` (limit to version 4.4.3 if you are using Python 2) and `SpaCy` :
thomwolf's avatar
thomwolf committed
54
55
56
57
58
```bash
pip install spacy ftfy==4.4.3
python -m spacy download en
```

thomwolf's avatar
thomwolf committed
59
60
If you don't install `ftfy` and `SpaCy`, the `OpenAI GPT` tokenizer will default to tokenize using BERT's `BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).

thomwolf's avatar
thomwolf committed
61
### From source
thomwolf's avatar
thomwolf committed
62
63
64
65
66

Clone the repository and run:
```bash
pip install [--editable] .
```
VictorSanh's avatar
VictorSanh committed
67

thomwolf's avatar
thomwolf committed
68
Here also, if you want to reproduce the original tokenization process of the `OpenAI GPT` model, you will need to install `ftfy` (limit to version 4.4.3 if you are using Python 2) and `SpaCy` :
thomwolf's avatar
thomwolf committed
69
70
71
72
73
```bash
pip install spacy ftfy==4.4.3
python -m spacy download en
```

thomwolf's avatar
thomwolf committed
74
Again, if you don't install `ftfy` and `SpaCy`, the `OpenAI GPT` tokenizer will default to tokenize using BERT's `BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage).
thomwolf's avatar
thomwolf committed
75

thomwolf's avatar
thomwolf committed
76
A series of tests is included in the [tests folder](https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/tests) and can be run using `pytest` (install pytest if needed: `pip install pytest`).
VictorSanh's avatar
VictorSanh committed
77

thomwolf's avatar
thomwolf committed
78
79
80
You can run the tests with the command:
```bash
python -m pytest -sv tests/
VictorSanh's avatar
VictorSanh committed
81
82
```

thomwolf's avatar
thomwolf committed
83
## Overview
thomwolf's avatar
thomwolf committed
84

thomwolf's avatar
thomwolf committed
85
This package comprises the following classes that can be imported in Python and are detailed in the [Doc](#doc) section of this readme:
thomwolf's avatar
thomwolf committed
86

thomwolf's avatar
thomwolf committed
87
- Eight **Bert** PyTorch models (`torch.nn.Module`) with pre-trained weights (in the [`modeling.py`](./pytorch_pretrained_bert/modeling.py) file):
88
89
90
91
92
93
94
95
  - [`BertModel`](./pytorch_pretrained_bert/modeling.py#L639) - raw BERT Transformer model (**fully pre-trained**),
  - [`BertForMaskedLM`](./pytorch_pretrained_bert/modeling.py#L793) - BERT Transformer with the pre-trained masked language modeling head on top (**fully pre-trained**),
  - [`BertForNextSentencePrediction`](./pytorch_pretrained_bert/modeling.py#L854) - BERT Transformer with the pre-trained next sentence prediction classifier on top  (**fully pre-trained**),
  - [`BertForPreTraining`](./pytorch_pretrained_bert/modeling.py#L722) - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (**fully pre-trained**),
  - [`BertForSequenceClassification`](./pytorch_pretrained_bert/modeling.py#L916) - BERT Transformer with a sequence classification head on top (BERT Transformer is **pre-trained**, the sequence classification head **is only initialized and has to be trained**),
  - [`BertForMultipleChoice`](./pytorch_pretrained_bert/modeling.py#L982) - BERT Transformer with a multiple choice head on top (used for task like Swag) (BERT Transformer is **pre-trained**, the multiple choice classification head **is only initialized and has to be trained**),
  - [`BertForTokenClassification`](./pytorch_pretrained_bert/modeling.py#L1051) - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**, the token classification head **is only initialized and has to be trained**),
  - [`BertForQuestionAnswering`](./pytorch_pretrained_bert/modeling.py#L1124) - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**, the token classification head **is only initialized and has to be trained**).
Thomas Wolf's avatar
Thomas Wolf committed
96

thomwolf's avatar
thomwolf committed
97
- Three **OpenAI GPT** PyTorch models (`torch.nn.Module`) with pre-trained weights (in the [`modeling_openai.py`](./pytorch_pretrained_bert/modeling_openai.py) file):
98
99
100
  - [`OpenAIGPTModel`](./pytorch_pretrained_bert/modeling_openai.py#L536) - raw OpenAI GPT Transformer model (**fully pre-trained**),
  - [`OpenAIGPTLMHeadModel`](./pytorch_pretrained_bert/modeling_openai.py#L643) - OpenAI GPT Transformer with the tied language modeling head on top (**fully pre-trained**),
  - [`OpenAIGPTDoubleHeadsModel`](./pytorch_pretrained_bert/modeling_openai.py#L722) - OpenAI GPT Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT Transformer is **pre-trained**, the multiple choice classification head **is only initialized and has to be trained**),
thomwolf's avatar
thomwolf committed
101

thomwolf's avatar
thomwolf committed
102
- Two **Transformer-XL** PyTorch models (`torch.nn.Module`) with pre-trained weights (in the [`modeling_transfo_xl.py`](./pytorch_pretrained_bert/modeling_transfo_xl.py) file):
103
104
  - [`TransfoXLModel`](./pytorch_pretrained_bert/modeling_transfo_xl.py#L983) - Transformer-XL model which outputs the last hidden state and memory cells (**fully pre-trained**),
  - [`TransfoXLLMHeadModel`](./pytorch_pretrained_bert/modeling_transfo_xl.py#L1260) - Transformer-XL with the tied adaptive softmax head on top for language modeling which outputs the logits/loss and memory cells (**fully pre-trained**),
thomwolf's avatar
thomwolf committed
105

thomwolf's avatar
thomwolf committed
106
- Three **OpenAI GPT-2** PyTorch models (`torch.nn.Module`) with pre-trained weights (in the [`modeling_gpt2.py`](./pytorch_pretrained_bert/modeling_gpt2.py) file):
107
108
109
  - [`GPT2Model`](./pytorch_pretrained_bert/modeling_gpt2.py#L479) - raw OpenAI GPT-2 Transformer model (**fully pre-trained**),
  - [`GPT2LMHeadModel`](./pytorch_pretrained_bert/modeling_gpt2.py#L559) - OpenAI GPT-2 Transformer with the tied language modeling head on top (**fully pre-trained**),
  - [`GPT2DoubleHeadsModel`](./pytorch_pretrained_bert/modeling_gpt2.py#L624) - OpenAI GPT-2 Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT-2 Transformer is **pre-trained**, the multiple choice classification head **is only initialized and has to be trained**),
thomwolf's avatar
thomwolf committed
110

thomwolf's avatar
thomwolf committed
111
- Tokenizers for **BERT** (using word-piece) (in the [`tokenization.py`](./pytorch_pretrained_bert/tokenization.py) file):
thomwolf's avatar
thomwolf committed
112
113
114
115
  - `BasicTokenizer` - basic tokenization (punctuation splitting, lower casing, etc.),
  - `WordpieceTokenizer` - WordPiece tokenization,
  - `BertTokenizer` - perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.

thomwolf's avatar
thomwolf committed
116
- Tokenizer for **OpenAI GPT** (using Byte-Pair-Encoding) (in the [`tokenization_openai.py`](./pytorch_pretrained_bert/tokenization_openai.py) file):
thomwolf's avatar
thomwolf committed
117
118
119
120
  - `OpenAIGPTTokenizer` - perform Byte-Pair-Encoding (BPE) tokenization.

- Tokenizer for **Transformer-XL** (word tokens ordered by frequency for adaptive softmax) (in the [`tokenization_transfo_xl.py`](./pytorch_pretrained_bert/tokenization_transfo_xl.py) file):
  - `OpenAIGPTTokenizer` - perform word tokenization and can order words by frequency in a corpus for use in an adaptive softmax.
thomwolf's avatar
thomwolf committed
121

thomwolf's avatar
thomwolf committed
122
123
124
- Tokenizer for **OpenAI GPT-2** (using byte-level Byte-Pair-Encoding) (in the [`tokenization_gpt2.py`](./pytorch_pretrained_bert/tokenization_gpt2.py) file):
  - `GPT2Tokenizer` - perform byte-level Byte-Pair-Encoding (BPE) tokenization.

thomwolf's avatar
thomwolf committed
125
- Optimizer for **BERT** (in the [`optimization.py`](./pytorch_pretrained_bert/optimization.py) file):
thomwolf's avatar
thomwolf committed
126
  - `BertAdam` - Bert version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
thomwolf's avatar
thomwolf committed
127

thomwolf's avatar
thomwolf committed
128
- Optimizer for **OpenAI GPT** (in the [`optimization_openai.py`](./pytorch_pretrained_bert/optimization_openai.py) file):
thomwolf's avatar
thomwolf committed
129
130
  - `OpenAIGPTAdam` - OpenAI GPT version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.

thomwolf's avatar
thomwolf committed
131
- Configuration classes for BERT, OpenAI GPT and Transformer-XL (in the respective [`modeling.py`](./pytorch_pretrained_bert/modeling.py), [`modeling_openai.py`](./pytorch_pretrained_bert/modeling_openai.py), [`modeling_transfo_xl.py`](./pytorch_pretrained_bert/modeling_transfo_xl.py) files):
Julien Chaumond's avatar
Julien Chaumond committed
132
  - `BertConfig` - Configuration class to store the configuration of a `BertModel` with utilities to read and write from JSON configuration files.
thomwolf's avatar
thomwolf committed
133
  - `OpenAIGPTConfig` - Configuration class to store the configuration of a `OpenAIGPTModel` with utilities to read and write from JSON configuration files.
thomwolf's avatar
thomwolf committed
134
  - `TransfoXLConfig` - Configuration class to store the configuration of a `TransfoXLModel` with utilities to read and write from JSON configuration files.
thomwolf's avatar
thomwolf committed
135

thomwolf's avatar
thomwolf committed
136
137
The repository further comprises:

thomwolf's avatar
thomwolf committed
138
- Five examples on how to use **BERT** (in the [`examples` folder](./examples)):
thomwolf's avatar
thomwolf committed
139
140
  - [`extract_features.py`](./examples/extract_features.py) - Show how to extract hidden states from an instance of `BertModel`,
  - [`run_classifier.py`](./examples/run_classifier.py) - Show how to fine-tune an instance of `BertForSequenceClassification` on GLUE's MRPC task,
thomwolf's avatar
thomwolf committed
141
  - [`run_squad.py`](./examples/run_squad.py) - Show how to fine-tune an instance of `BertForQuestionAnswering` on SQuAD v1.0 and SQuAD v2.0 tasks.
142
  - [`run_swag.py`](./examples/run_swag.py) - Show how to fine-tune an instance of `BertForMultipleChoice` on Swag task.
Sepehr Sameni's avatar
Sepehr Sameni committed
143
  - [`simple_lm_finetuning.py`](./examples/lm_finetuning/simple_lm_finetuning.py) - Show how to fine-tune an instance of `BertForPretraining` on a target text corpus.
thomwolf's avatar
thomwolf committed
144
145

- One example on how to use **OpenAI GPT** (in the [`examples` folder](./examples)):
Thomas Wolf's avatar
Thomas Wolf committed
146
  - [`run_openai_gpt.py`](./examples/run_openai_gpt.py) - Show how to fine-tune an instance of `OpenGPTDoubleHeadsModel` on the RocStories task.
thomwolf's avatar
thomwolf committed
147

Thomas Wolf's avatar
Thomas Wolf committed
148
149
- One example on how to use **Transformer-XL** (in the [`examples` folder](./examples)):
  - [`run_transfo_xl.py`](./examples/run_transfo_xl.py) - Show how to load and evaluate a pre-trained model of `TransfoXLLMHeadModel` on WikiText 103.
thomwolf's avatar
thomwolf committed
150

thomwolf's avatar
thomwolf committed
151
152
153
- One example on how to use **OpenAI GPT-2** in the unconditional and interactive mode (in the [`examples` folder](./examples)):
  - [`run_gpt2.py`](./examples/run_gpt2.py) - Show how to use OpenAI GPT-2 an instance of `GPT2LMHeadModel` to generate text (same as the original OpenAI GPT-2 examples).

thomwolf's avatar
thomwolf committed
154
  These examples are detailed in the [Examples](#examples) section of this readme.
thomwolf's avatar
thomwolf committed
155
156
157
158
159
160

- Three notebooks that were used to check that the TensorFlow and PyTorch models behave identically (in the [`notebooks` folder](./notebooks)):
  - [`Comparing-TF-and-PT-models.ipynb`](./notebooks/Comparing-TF-and-PT-models.ipynb) - Compare the hidden states predicted by `BertModel`,
  - [`Comparing-TF-and-PT-models-SQuAD.ipynb`](./notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb) - Compare the spans predicted by  `BertForQuestionAnswering` instances,
  - [`Comparing-TF-and-PT-models-MLM-NSP.ipynb`](./notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb) - Compare the predictions of the `BertForPretraining` instances.

thomwolf's avatar
thomwolf committed
161
  These notebooks are detailed in the [Notebooks](#notebooks) section of this readme.
thomwolf's avatar
thomwolf committed
162

thomwolf's avatar
thomwolf committed
163
- A command-line interface to convert TensorFlow checkpoints (BERT, Transformer-XL) or NumPy checkpoint (OpenAI) in a PyTorch save of the associated PyTorch model:
thomwolf's avatar
thomwolf committed
164

thomwolf's avatar
thomwolf committed
165
  This CLI is detailed in the [Command-line interface](#Command-line-interface) section of this readme.
thomwolf's avatar
thomwolf committed
166
167

## Usage
thomwolf's avatar
thomwolf committed
168

thomwolf's avatar
thomwolf committed
169
170
### BERT

thomwolf's avatar
thomwolf committed
171
Here is a quick-start example using `BertTokenizer`, `BertModel` and `BertForMaskedLM` class with Google AI's pre-trained `Bert base uncased` model. See the [doc section](#doc) below for all the details on these classes.
thomwolf's avatar
thomwolf committed
172

thomwolf's avatar
thomwolf committed
173
First let's prepare a tokenized input with `BertTokenizer`
thomwolf's avatar
thomwolf committed
174
175
176

```python
import torch
thomwolf's avatar
thomwolf committed
177
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
thomwolf's avatar
thomwolf committed
178

thomwolf's avatar
thomwolf committed
179
180
181
182
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

thomwolf's avatar
thomwolf committed
183
# Load pre-trained model tokenizer (vocabulary)
thomwolf's avatar
thomwolf committed
184
185
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

thomwolf's avatar
thomwolf committed
186
# Tokenized input
thomwolf's avatar
thomwolf committed
187
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
thomwolf's avatar
thomwolf committed
188
tokenized_text = tokenizer.tokenize(text)
thomwolf's avatar
thomwolf committed
189
190

# Mask a token that we will try to predict back with `BertForMaskedLM`
Liang Niu's avatar
Liang Niu committed
191
masked_index = 8
thomwolf's avatar
thomwolf committed
192
tokenized_text[masked_index] = '[MASK]'
thomwolf's avatar
thomwolf committed
193
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
thomwolf's avatar
thomwolf committed
194
195
196

# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
thomwolf's avatar
thomwolf committed
197
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
thomwolf's avatar
thomwolf committed
198
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
thomwolf's avatar
thomwolf committed
199

thomwolf's avatar
thomwolf committed
200
# Convert inputs to PyTorch tensors
thomwolf's avatar
thomwolf committed
201
202
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
thomwolf's avatar
thomwolf committed
203
204
205
206
207
208
209
```

Let's see how to use `BertModel` to get hidden states

```python
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
thomwolf's avatar
thomwolf committed
210
model.eval()
thomwolf's avatar
thomwolf committed
211

thomwolf's avatar
thomwolf committed
212
213
214
215
216
# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')

thomwolf's avatar
thomwolf committed
217
# Predict hidden states features for each layer
thomwolf's avatar
thomwolf committed
218
219
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, segments_tensors)
thomwolf's avatar
thomwolf committed
220
221
222
223
224
225
226
227
228
229
230
# We have a hidden states for each of the 12 layers in model bert-base-uncased
assert len(encoded_layers) == 12
```

And how to use `BertForMaskedLM`

```python
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

thomwolf's avatar
thomwolf committed
231
232
233
234
235
# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')

thomwolf's avatar
thomwolf committed
236
# Predict all tokens
thomwolf's avatar
thomwolf committed
237
238
with torch.no_grad():
    predictions = model(tokens_tensor, segments_tensors)
thomwolf's avatar
thomwolf committed
239

thomwolf's avatar
thomwolf committed
240
# confirm we were able to predict 'henson'
thomwolf's avatar
thomwolf committed
241
predicted_index = torch.argmax(predictions[0, masked_index]).item()
242
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
thomwolf's avatar
thomwolf committed
243
244
245
assert predicted_token == 'henson'
```

thomwolf's avatar
thomwolf committed
246
247
248
249
250
251
252
253
254
255
### OpenAI GPT

Here is a quick-start example using `OpenAIGPTTokenizer`, `OpenAIGPTModel` and `OpenAIGPTLMHeadModel` class with OpenAI's pre-trained  model. See the [doc section](#doc) below for all the details on these classes.

First let's prepare a tokenized input with `OpenAIGPTTokenizer`

```python
import torch
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel, OpenAIGPTLMHeadModel

thomwolf's avatar
thomwolf committed
256
257
258
259
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

thomwolf's avatar
thomwolf committed
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
# Load pre-trained model tokenizer (vocabulary)
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')

# Tokenized input
text = "Who was Jim Henson ? Jim Henson was a puppeteer"
tokenized_text = tokenizer.tokenize(text)

# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
```

Let's see how to use `OpenAIGPTModel` to get hidden states
thomwolf's avatar
thomwolf committed
275
276
277
278
279
280

```python
# Load pre-trained model (weights)
model = OpenAIGPTModel.from_pretrained('openai-gpt')
model.eval()

thomwolf's avatar
thomwolf committed
281
282
283
284
# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')

thomwolf's avatar
thomwolf committed
285
# Predict hidden states features for each layer
thomwolf's avatar
thomwolf committed
286
287
with torch.no_grad():
    hidden_states = model(tokens_tensor)
thomwolf's avatar
thomwolf committed
288
289
290
291
292
293
294
295
296
```

And how to use `OpenAIGPTLMHeadModel`

```python
# Load pre-trained model (weights)
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
model.eval()

thomwolf's avatar
thomwolf committed
297
298
299
300
# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')

thomwolf's avatar
thomwolf committed
301
# Predict all tokens
thomwolf's avatar
thomwolf committed
302
303
with torch.no_grad():
    predictions = model(tokens_tensor)
thomwolf's avatar
thomwolf committed
304
305

# get the predicted last token
thomwolf's avatar
thomwolf committed
306
predicted_index = torch.argmax(predictions[0, -1, :]).item()
thomwolf's avatar
thomwolf committed
307
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
thomwolf's avatar
thomwolf committed
308
assert predicted_token == '.</w>'
thomwolf's avatar
thomwolf committed
309
310
311
312
```

### Transformer-XL

thomwolf's avatar
thomwolf committed
313
Here is a quick-start example using `TransfoXLTokenizer`, `TransfoXLModel` and `TransfoXLModelLMHeadModel` class with the Transformer-XL model pre-trained on WikiText-103. See the [doc section](#doc) below for all the details on these classes.
thomwolf's avatar
thomwolf committed
314

thomwolf's avatar
thomwolf committed
315
First let's prepare a tokenized input with `TransfoXLTokenizer`
thomwolf's avatar
thomwolf committed
316
317
318

```python
import torch
thomwolf's avatar
thomwolf committed
319
from pytorch_pretrained_bert import TransfoXLTokenizer, TransfoXLModel, TransfoXLLMHeadModel
thomwolf's avatar
thomwolf committed
320

thomwolf's avatar
thomwolf committed
321
322
323
324
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

thomwolf's avatar
thomwolf committed
325
326
# Load pre-trained model tokenizer (vocabulary from wikitext 103)
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
thomwolf's avatar
thomwolf committed
327
328

# Tokenized input
thomwolf's avatar
thomwolf committed
329
330
331
332
text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"
tokenized_text_1 = tokenizer.tokenize(text_1)
tokenized_text_2 = tokenizer.tokenize(text_2)
thomwolf's avatar
thomwolf committed
333
334

# Convert token to vocabulary indices
thomwolf's avatar
thomwolf committed
335
336
indexed_tokens_1 = tokenizer.convert_tokens_to_ids(tokenized_text_1)
indexed_tokens_2 = tokenizer.convert_tokens_to_ids(tokenized_text_2)
thomwolf's avatar
thomwolf committed
337
338

# Convert inputs to PyTorch tensors
thomwolf's avatar
thomwolf committed
339
340
tokens_tensor_1 = torch.tensor([indexed_tokens_1])
tokens_tensor_2 = torch.tensor([indexed_tokens_2])
thomwolf's avatar
thomwolf committed
341
342
```

thomwolf's avatar
thomwolf committed
343
Let's see how to use `TransfoXLModel` to get hidden states
thomwolf's avatar
thomwolf committed
344
345
346

```python
# Load pre-trained model (weights)
thomwolf's avatar
thomwolf committed
347
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
thomwolf's avatar
thomwolf committed
348
349
model.eval()

thomwolf's avatar
thomwolf committed
350
351
352
353
354
355
356
357
358
359
# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')

with torch.no_grad():
    # Predict hidden states features for each layer
    hidden_states_1, mems_1 = model(tokens_tensor_1)
    # We can re-use the memory cells in a subsequent call to attend a longer context
    hidden_states_2, mems_2 = model(tokens_tensor_2, mems=mems_1)
thomwolf's avatar
thomwolf committed
360
361
```

thomwolf's avatar
thomwolf committed
362
And how to use `TransfoXLLMHeadModel`
thomwolf's avatar
thomwolf committed
363
364
365

```python
# Load pre-trained model (weights)
thomwolf's avatar
thomwolf committed
366
model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')
thomwolf's avatar
thomwolf committed
367
368
model.eval()

thomwolf's avatar
thomwolf committed
369
370
371
372
373
374
375
376
377
378
# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')

with torch.no_grad():
    # Predict all tokens
    predictions_1, mems_1 = model(tokens_tensor_1)
    # We can re-use the memory cells in a subsequent call to attend a longer context
    predictions_2, mems_2 = model(tokens_tensor_2, mems=mems_1)
thomwolf's avatar
thomwolf committed
379
380

# get the predicted last token
thomwolf's avatar
thomwolf committed
381
predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
thomwolf's avatar
thomwolf committed
382
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
thomwolf's avatar
thomwolf committed
383
assert predicted_token == 'who'
thomwolf's avatar
thomwolf committed
384
385
```

thomwolf's avatar
thomwolf committed
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
### OpenAI GPT-2

Here is a quick-start example using `GPT2Tokenizer`, `GPT2Model` and `GPT2LMHeadModel` class with OpenAI's pre-trained  model. See the [doc section](#doc) below for all the details on these classes.

First let's prepare a tokenized input with `GPT2Tokenizer`

```python
import torch
from pytorch_pretrained_bert import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

403
404
405
406
407
# Encode some inputs
text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"
indexed_tokens_1 = tokenizer.encode(text_1)
indexed_tokens_2 = tokenizer.encode(text_2)
thomwolf's avatar
thomwolf committed
408
409

# Convert inputs to PyTorch tensors
410
411
tokens_tensor_1 = torch.tensor([indexed_tokens_1])
tokens_tensor_2 = torch.tensor([indexed_tokens_2])
thomwolf's avatar
thomwolf committed
412
413
414
415
416
417
418
419
420
421
```

Let's see how to use `GPT2Model` to get hidden states

```python
# Load pre-trained model (weights)
model = GPT2Model.from_pretrained('gpt2')
model.eval()

# If you have a GPU, put everything on cuda
422
423
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
thomwolf's avatar
thomwolf committed
424
425
426
427
model.to('cuda')

# Predict hidden states features for each layer
with torch.no_grad():
428
429
    hidden_states_1, past = model(tokens_tensor_1)
    # past can be used to reuse precomputed hidden state in a subsequent predictions
430
    # (see beam-search examples in the run_gpt2.py example).
431
    hidden_states_2, past = model(tokens_tensor_2, past=past)
thomwolf's avatar
thomwolf committed
432
433
434
435
436
437
438
439
440
441
```

And how to use `GPT2LMHeadModel`

```python
# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

# If you have a GPU, put everything on cuda
442
443
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
thomwolf's avatar
thomwolf committed
444
445
446
447
model.to('cuda')

# Predict all tokens
with torch.no_grad():
448
449
    predictions_1, past = model(tokens_tensor_1)
    # past can be used to reuse precomputed hidden state in a subsequent predictions
450
    # (see beam-search examples in the run_gpt2.py example).
451
    predictions_2, past = model(tokens_tensor_2, past=past)
thomwolf's avatar
thomwolf committed
452
453

# get the predicted last token
454
predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
thomwolf's avatar
thomwolf committed
455
456
457
predicted_token = tokenizer.decode([predicted_index])
```

thomwolf's avatar
thomwolf committed
458
## Doc
thomwolf's avatar
thomwolf committed
459

thomwolf's avatar
thomwolf committed
460
461
462
463
Here is a detailed documentation of the classes in the package and how to use them:

| Sub-section | Description |
|-|-|
Desiree Vogt-Lee's avatar
Desiree Vogt-Lee committed
464
| [Loading Google AI's/OpenAI's pre-trained weights](#loading-google-ai-or-openai-pre-trained-weights-or-pytorch-dump) | How to load Google AI/OpenAI's pre-trained weight or a PyTorch saved instance |
Thomas Wolf's avatar
Thomas Wolf committed
465
466
467
| [PyTorch models](#PyTorch-models) | API of the BERT, GPT, GPT-2 and Transformer-XL PyTorch model classes |
| [Tokenizers](#tokenizers) | API of the tokenizers class for BERT, GPT, GPT-2 and Transformer-XL|
| [Optimizers](#optimizerss) |  API of the optimizers |
thomwolf's avatar
thomwolf committed
468

Desiree Vogt-Lee's avatar
Desiree Vogt-Lee committed
469
### Loading Google AI or OpenAI pre-trained weights or PyTorch dump
thomwolf's avatar
thomwolf committed
470

thomwolf's avatar
thomwolf committed
471
To load one of Google AI's, OpenAI's pre-trained models or a PyTorch saved model (an instance of `BertForPreTraining` saved with `torch.save()`), the PyTorch model classes and the tokenizer can be instantiated as
thomwolf's avatar
thomwolf committed
472
473

```python
Thomas Wolf's avatar
Thomas Wolf committed
474
model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH, cache_dir=None)
thomwolf's avatar
thomwolf committed
475
476
477
478
```

where

thomwolf's avatar
thomwolf committed
479
- `BERT_CLASS` is either a tokenizer to load the vocabulary (`BertTokenizer` or `OpenAIGPTTokenizer` classes) or one of the eight BERT or three OpenAI GPT PyTorch model classes (to load the pre-trained weights): `BertModel`, `BertForMaskedLM`, `BertForNextSentencePrediction`, `BertForPreTraining`, `BertForSequenceClassification`, `BertForTokenClassification`, `BertForMultipleChoice`, `BertForQuestionAnswering`, `OpenAIGPTModel`, `OpenAIGPTLMHeadModel` or `OpenAIGPTDoubleHeadsModel`, and
Thomas Wolf's avatar
Thomas Wolf committed
480
- `PRE_TRAINED_MODEL_NAME_OR_PATH` is either:
thomwolf's avatar
thomwolf committed
481

thomwolf's avatar
thomwolf committed
482
  - the shortcut name of a Google AI's or OpenAI's pre-trained model selected in the list:
thomwolf's avatar
thomwolf committed
483

thomwolf's avatar
thomwolf committed
484
485
486
    - `bert-base-uncased`: 12-layer, 768-hidden, 12-heads, 110M parameters
    - `bert-large-uncased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
    - `bert-base-cased`: 12-layer, 768-hidden, 12-heads , 110M parameters
thomwolf's avatar
thomwolf committed
487
488
    - `bert-large-cased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
    - `bert-base-multilingual-uncased`: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
thomwolf's avatar
thomwolf committed
489
    - `bert-base-multilingual-cased`: **(New, recommended)** 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
thomwolf's avatar
thomwolf committed
490
    - `bert-base-chinese`: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
thomwolf's avatar
thomwolf committed
491
    - `openai-gpt`: OpenAI English model, 12-layer, 768-hidden, 12-heads, 110M parameters
thomwolf's avatar
thomwolf committed
492
    - `transfo-xl-wt103`: Transformer-XL English model trained on wikitext-103, 18-layer, 1024-hidden, 16-heads, 257M parameters
thomwolf's avatar
thomwolf committed
493
    - `gpt2`: OpenAI GPT-2 English model, 12-layer, 768-hidden, 12-heads, 117M parameters
thomwolf's avatar
thomwolf committed
494

thomwolf's avatar
thomwolf committed
495
  - a path or url to a pretrained model archive containing:
thomwolf's avatar
thomwolf committed
496

thomwolf's avatar
thomwolf committed
497
    - `bert_config.json` or `openai_gpt_config.json` a configuration file for the model, and
thomwolf's avatar
thomwolf committed
498
    - `pytorch_model.bin` a PyTorch dump of a pre-trained instance of `BertForPreTraining`, `OpenAIGPTModel`, `TransfoXLModel`, `GPT2LMHeadModel` (saved with the usual `torch.save()`)
thomwolf's avatar
thomwolf committed
499

500
  If `PRE_TRAINED_MODEL_NAME_OR_PATH` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links [here](pytorch_pretrained_bert/modeling.py)) and stored in a cache folder to avoid future download (the cache folder can be found at `~/.pytorch_pretrained_bert/`).
Thomas Wolf's avatar
Thomas Wolf committed
501
- `cache_dir` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example `cache_dir='./pretrained_model_{}'.format(args.local_rank)` (see the section on distributed training for more information).
thomwolf's avatar
thomwolf committed
502

503
504
`Uncased` means that the text has been lowercased before WordPiece tokenization, e.g., `John Smith` becomes `john smith`. The Uncased model also strips out any accent markers. `Cased` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the [Multilingual README](https://github.com/google-research/bert/blob/master/multilingual.md) or the original TensorFlow repository.

Thomas Wolf's avatar
Thomas Wolf committed
505
**When using an `uncased model`, make sure to pass `--do_lower_case` to the example training scripts (or pass `do_lower_case=True` to FullTokenizer if you're using your own script and loading the tokenizer your-self.).**
506

thomwolf's avatar
thomwolf committed
507
Examples:
thomwolf's avatar
thomwolf committed
508
```python
thomwolf's avatar
thomwolf committed
509
# BERT
510
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, do_basic_tokenize=True)
thomwolf's avatar
thomwolf committed
511
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
thomwolf's avatar
thomwolf committed
512
513
514
515

# OpenAI GPT
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTModel.from_pretrained('openai-gpt')
thomwolf's avatar
thomwolf committed
516
517
518
519

# Transformer-XL
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
thomwolf's avatar
thomwolf committed
520
521
522
523
524

# OpenAI GPT-2
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

thomwolf's avatar
thomwolf committed
525
526
```

thomwolf's avatar
thomwolf committed
527
### PyTorch models
thomwolf's avatar
thomwolf committed
528

thomwolf's avatar
thomwolf committed
529
#### 1. `BertModel`
thomwolf's avatar
thomwolf committed
530

thomwolf's avatar
thomwolf committed
531
532
533
534
`BertModel` is the basic BERT Transformer model with a layer of summed token, position and sequence embeddings followed by a series of identical self-attention blocks (12 for BERT-base, 24 for BERT-large).

The inputs and output are **identical to the TensorFlow model inputs and outputs**.

thomwolf's avatar
thomwolf committed
535
We detail them here. This model takes as *inputs*:
536
[`modeling.py`](./pytorch_pretrained_bert/modeling.py)
537
- `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary (see the tokens preprocessing logic in the scripts [`extract_features.py`](./examples/extract_features.py), [`run_classifier.py`](./examples/run_classifier.py) and [`run_squad.py`](./examples/run_squad.py)), and
Clement's avatar
typos  
Clement committed
538
- `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to a `sentence B` token (see BERT paper for more details).
Thomas Wolf's avatar
Thomas Wolf committed
539
- `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It's a mask to be used if some input sequence lengths are smaller than the max input sequence length of the current batch. It's the mask that we typically use for attention when a batch has varying length sentences.
thomwolf's avatar
thomwolf committed
540
- `output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`.
thomwolf's avatar
thomwolf committed
541

thomwolf's avatar
thomwolf committed
542
This model *outputs* a tuple composed of:
thomwolf's avatar
thomwolf committed
543

thomwolf's avatar
thomwolf committed
544
545
- `encoded_layers`: controled by the value of the `output_encoded_layers` argument:

Thomas Wolf's avatar
Thomas Wolf committed
546
547
  - `output_all_encoded_layers=True`: outputs a list of the encoded-hidden-states at the end of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
  - `output_all_encoded_layers=False`: outputs only the encoded-hidden-states corresponding to the last attention block, i.e. a single torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
thomwolf's avatar
thomwolf committed
548

thomwolf's avatar
thomwolf committed
549
- `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a classifier pretrained on top of the hidden state associated to the first character of the input (`CLF`) to train on the Next-Sentence task (see BERT's paper).
thomwolf's avatar
thomwolf committed
550

551
An example on how to use this class is given in the [`extract_features.py`](./examples/extract_features.py) script which can be used to extract the hidden states of the model for a given input.
thomwolf's avatar
thomwolf committed
552

thomwolf's avatar
thomwolf committed
553
#### 2. `BertForPreTraining`
thomwolf's avatar
thomwolf committed
554
555
556
557
558
559

`BertForPreTraining` includes the `BertModel` Transformer followed by the two pre-training heads:

- the masked language modeling head, and
- the next sentence classification head.

thomwolf's avatar
thomwolf committed
560
*Inputs* comprises the inputs of the [`BertModel`](#-1.-`BertModel`) class plus two optional labels:
thomwolf's avatar
thomwolf committed
561
562
563
564
565
566
567
568

- `masked_lm_labels`: masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size]
- `next_sentence_label`: next sentence classification loss: torch.LongTensor of shape [batch_size] with indices selected in [0, 1]. 0 => next sentence is the continuation, 1 => next sentence is a random sentence.

*Outputs*:

- if `masked_lm_labels` and `next_sentence_label` are not `None`: Outputs the total_loss which is the sum of the masked language modeling loss and the next sentence classification loss.
- if `masked_lm_labels` or `next_sentence_label` is `None`: Outputs a tuple comprising
Thomas Wolf's avatar
Thomas Wolf committed
569

thomwolf's avatar
thomwolf committed
570
571
  - the masked language modeling logits, and
  - the next sentence classification logits.
Joel Grus's avatar
Joel Grus committed
572

tholor's avatar
tholor committed
573
574
An example on how to use this class is given in the [`run_lm_finetuning.py`](./examples/run_lm_finetuning.py) script which can be used to fine-tune the BERT language model on your specific different text corpus. This should improve model performance, if the language style is different from the original BERT training corpus (Wiki + BookCorpus).

thomwolf's avatar
thomwolf committed
575

thomwolf's avatar
thomwolf committed
576
#### 3. `BertForMaskedLM`
thomwolf's avatar
thomwolf committed
577
578
579

`BertForMaskedLM` includes the `BertModel` Transformer followed by the (possibly) pre-trained  masked language modeling head.

thomwolf's avatar
thomwolf committed
580
*Inputs* comprises the inputs of the [`BertModel`](#-1.-`BertModel`) class plus optional label:
thomwolf's avatar
thomwolf committed
581
582
583
584
585
586
587
588

- `masked_lm_labels`: masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size]

*Outputs*:

- if `masked_lm_labels` is not `None`: Outputs the masked language modeling loss.
- if `masked_lm_labels` is `None`: Outputs the masked language modeling logits.

thomwolf's avatar
thomwolf committed
589
#### 4. `BertForNextSentencePrediction`
thomwolf's avatar
thomwolf committed
590
591
592

`BertForNextSentencePrediction` includes the `BertModel` Transformer followed by the next sentence classification head.

thomwolf's avatar
thomwolf committed
593
*Inputs* comprises the inputs of the [`BertModel`](#-1.-`BertModel`) class plus an optional label:
thomwolf's avatar
thomwolf committed
594
595
596
597
598
599
600
601

- `next_sentence_label`: next sentence classification loss: torch.LongTensor of shape [batch_size] with indices selected in [0, 1]. 0 => next sentence is the continuation, 1 => next sentence is a random sentence.

*Outputs*:

- if `next_sentence_label` is not `None`: Outputs the next sentence classification loss.
- if `next_sentence_label` is `None`: Outputs the next sentence classification logits.

thomwolf's avatar
thomwolf committed
602
#### 5. `BertForSequenceClassification`
thomwolf's avatar
thomwolf committed
603

Thomas Wolf's avatar
typos  
Thomas Wolf committed
604
`BertForSequenceClassification` is a fine-tuning model that includes `BertModel` and a sequence-level (sequence or pair of sequences) classifier on top of the `BertModel`.
thomwolf's avatar
thomwolf committed
605

Thomas Wolf's avatar
Thomas Wolf committed
606
The sequence-level classifier is a linear layer that takes as input the last hidden state of the first character in the input sequence (see Figures 3a and 3b in the BERT paper).
thomwolf's avatar
thomwolf committed
607

608
An example on how to use this class is given in the [`run_classifier.py`](./examples/run_classifier.py) script which can be used to fine-tune a single sequence (or pair of sequence) classifier using BERT, for example for the MRPC task.
thomwolf's avatar
thomwolf committed
609

610
611
612
613
#### 6. `BertForMultipleChoice`

`BertForMultipleChoice` is a fine-tuning model that includes `BertModel` and a linear layer on top of the `BertModel`.

Gr茅gory Ch芒tel's avatar
Gr茅gory Ch芒tel committed
614
The linear layer outputs a single value for each choice of a multiple choice problem, then all the outputs corresponding to an instance are passed through a softmax to get the model choice.
615
616
617
618
619
620

This implementation is largely inspired by the work of OpenAI in [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) and the answer of Jacob Devlin in the following [issue](https://github.com/google-research/bert/issues/38).

An example on how to use this class is given in the [`run_swag.py`](./examples/run_swag.py) script which can be used to fine-tune a multiple choice classifier using BERT, for example for the Swag task.

#### 7. `BertForTokenClassification`
621
622
623
624
625

`BertForTokenClassification` is a fine-tuning model that includes `BertModel` and a token-level classifier on top of the `BertModel`.

The token-level classifier is a linear layer that takes as input the last hidden state of the sequence.

626
#### 8. `BertForQuestionAnswering`
thomwolf's avatar
thomwolf committed
627

Knut Ole Sj酶li's avatar
Knut Ole Sj酶li committed
628
`BertForQuestionAnswering` is a fine-tuning model that includes `BertModel` with a token-level classifiers on top of the full sequence of last hidden states.
thomwolf's avatar
thomwolf committed
629

Thomas Wolf's avatar
Thomas Wolf committed
630
The token-level classifier takes as input the full sequence of the last hidden state and compute several (e.g. two) scores for each tokens that can for example respectively be the score that a given token is a `start_span` and a `end_span` token (see Figures 3c and 3d in the BERT paper).
thomwolf's avatar
thomwolf committed
631

632
An example on how to use this class is given in the [`run_squad.py`](./examples/run_squad.py) script which can be used to fine-tune a token classifier using BERT, for example for the SQuAD task.
thomwolf's avatar
thomwolf committed
633

thomwolf's avatar
thomwolf committed
634
635
636
637
#### 9. `OpenAIGPTModel`

`OpenAIGPTModel` is the basic OpenAI GPT Transformer model with a layer of summed token and position embeddings followed by a series of 12 identical self-attention blocks.

638
639
640
641
642
643
OpenAI GPT use a single embedding matrix to store the word and special embeddings.
Special tokens embeddings are additional tokens that are not pre-trained: `[SEP]`, `[CLS]`...
Special tokens need to be trained during the fine-tuning if you use them.
The number of special embeddings can be controled using the `set_num_special_tokens(num_special_tokens)` function.

The embeddings are ordered as follow in the token embeddings matrice:
thomwolf's avatar
thomwolf committed
644

645
```python
thomwolf's avatar
thomwolf committed
646
647
648
649
650
    [0,                                                         ----------------------
      ...                                                        -> word embeddings
      config.vocab_size - 1,                                     ______________________
      config.vocab_size,
      ...                                                        -> special embeddings
651
652
      config.vocab_size + config.n_special - 1]                  ______________________
```
thomwolf's avatar
thomwolf committed
653

654
655
where total_tokens_embeddings can be obtained as config.total_tokens_embeddings and is:
    `total_tokens_embeddings = config.vocab_size + config.n_special`
thomwolf's avatar
thomwolf committed
656
657
658
659
660
661
You should use the associate indices to index the embeddings.

The inputs and output are **identical to the TensorFlow model inputs and outputs**.

We detail them here. This model takes as *inputs*:
[`modeling_openai.py`](./pytorch_pretrained_bert/modeling_openai.py)
662
663
664
665
666
667
- `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] (or more generally [d_1, ..., d_n, sequence_length] were d_1 ... d_n are arbitrary dimensions) with the word BPE token indices selected in the range [0, total_tokens_embeddings[
- `position_ids`: an optional torch.LongTensor with the same shape as input_ids
    with the position indices (selected in the range [0, config.n_positions - 1[.
- `token_type_ids`: an optional torch.LongTensor with the same shape as input_ids
    You can use it to add a third type of embedding to each input token in the sequence
    (the previous two being the word and position embeddings). The input, position and token_type embeddings are summed inside the Transformer before the first self-attention block.
thomwolf's avatar
thomwolf committed
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682

This model *outputs*:
- `hidden_states`: the encoded-hidden-states at the top of the model as a torch.FloatTensor of size [batch_size, sequence_length, hidden_size] (or more generally [d_1, ..., d_n, hidden_size] were d_1 ... d_n are the dimension of input_ids)

#### 10. `OpenAIGPTLMHeadModel`

`OpenAIGPTLMHeadModel` includes the `OpenAIGPTModel` Transformer followed by a language modeling head with weights tied to the input embeddings (no additional parameters).

*Inputs* are the same as the inputs of the [`OpenAIGPTModel`](#-9.-`OpenAIGPTModel`) class plus optional labels:
- `lm_labels`: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].

*Outputs*:
- if `lm_labels` is not `None`:
  Outputs the language modeling loss.
- else:
683
  Outputs `lm_logits`: the language modeling logits as a torch.FloatTensor of size [batch_size, sequence_length, total_tokens_embeddings] (or more generally [d_1, ..., d_n, total_tokens_embeddings] were d_1 ... d_n are the dimension of input_ids)
thomwolf's avatar
thomwolf committed
684
685
686
687
688

#### 11. `OpenAIGPTDoubleHeadsModel`

`OpenAIGPTDoubleHeadsModel` includes the `OpenAIGPTModel` Transformer followed by two heads:
- a language modeling head with weights tied to the input embeddings (no additional parameters) and:
689
- a multiple choice classifier (linear layer that take as input a hidden state in a sequence to compute a score, see details in paper).
thomwolf's avatar
thomwolf committed
690
691

*Inputs* are the same as the inputs of the [`OpenAIGPTModel`](#-9.-`OpenAIGPTModel`) class plus a classification mask and two optional labels:
692
- `multiple_choice_token_ids`: a torch.LongTensor of shape [batch_size, num_choices] with the index of the token whose hidden state should be used as input for the multiple choice classifier (usually the [CLS] token for each choice).
thomwolf's avatar
thomwolf committed
693
694
695
696
697
698
699
- `lm_labels`: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].
- `multiple_choice_labels`: optional multiple choice labels: torch.LongTensor of shape [batch_size] with indices selected in [0, ..., num_choices].

*Outputs*:
- if `lm_labels` and `multiple_choice_labels` are not `None`:
  Outputs a tuple of losses with the language modeling loss and the multiple choice loss.
- else Outputs a tuple with:
700
  - `lm_logits`: the language modeling logits as a torch.FloatTensor of size [batch_size, num_choices, sequence_length, total_tokens_embeddings]
thomwolf's avatar
thomwolf committed
701
702
  - `multiple_choice_logits`: the multiple choice logits as a torch.FloatTensor of size [batch_size, num_choices]

thomwolf's avatar
thomwolf committed
703
704
705
706
707
708
709
710
711
712
713
#### 12. `TransfoXLModel`

The Transformer-XL model is described in "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context".

Transformer XL use a relative positioning with sinusiodal patterns and adaptive softmax inputs which means that:

- you don't need to specify positioning embeddings indices
- the tokens in the vocabulary have to be sorted to decreasing frequency.

This model takes as *inputs*:
[`modeling_transfo_xl.py`](./pytorch_pretrained_bert/modeling_transfo_xl.py)
714
715
- `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the token indices selected in the range [0, self.config.n_token[
- `mems`: an optional memory of hidden states from previous forward passes as a list (num layers) of hidden states at the entry of each layer. Each hidden states has shape [self.config.mem_len, bsz, self.config.d_model]. Note that the first two dimensions are transposed in `mems` with regards to `input_ids`.
thomwolf's avatar
thomwolf committed
716
717

This model *outputs* a tuple of (last_hidden_state, new_mems)
718
719
- `last_hidden_state`: the encoded-hidden-states at the top of the model as a torch.FloatTensor of size [batch_size, sequence_length, self.config.d_model]
- `new_mems`: list (num layers) of updated mem states at the entry of each layer each mem state is a torch.FloatTensor of size [self.config.mem_len, batch_size, self.config.d_model]. Note that the first two dimensions are transposed in `mems` with regards to `input_ids`.
thomwolf's avatar
thomwolf committed
720

Thomas Wolf's avatar
Thomas Wolf committed
721
722
723
724
725
726
727
728
729
730
731
732
##### Extracting a list of the hidden states at each layer of the Transformer-XL from `last_hidden_state` and `new_mems`:
The `new_mems` contain all the hidden states PLUS the output of the embeddings (`new_mems[0]`). `new_mems[-1]` is the output of the hidden state of the layer below the last layer and `last_hidden_state` is the output of the last layer (i.E. the input of the softmax when we have a language modeling head on top).

There are two differences between the shapes of `new_mems` and `last_hidden_state`: `new_mems` have transposed first dimensions and are longer (of size `self.config.mem_len`). Here is how to extract the full list of hidden states from the model output:

```python
hidden_states, mems = model(tokens_tensor)
seq_length = hidden_states.size(1)
lower_hidden_states = list(t[-seq_length:, ...].transpose(0, 1) for t in mems)
all_hidden_states = lower_hidden_states + [hidden_states]
```

thomwolf's avatar
thomwolf committed
733
734
735
736
737
#### 13. `TransfoXLLMHeadModel`

`TransfoXLLMHeadModel` includes the `TransfoXLModel` Transformer followed by an (adaptive) softmax head with weights tied to the input embeddings.

*Inputs* are the same as the inputs of the [`TransfoXLModel`](#-12.-`TransfoXLModel`) class plus optional labels:
738
- `target`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the target token indices selected in the range [0, self.config.n_token[
thomwolf's avatar
thomwolf committed
739
740
741

*Outputs* a tuple of (last_hidden_state, new_mems)
- `softmax_output`: output of the (adaptive) softmax:
thomwolf's avatar
thomwolf committed
742
743
  - if target is None: log probabilities of tokens, shape [batch_size, sequence_length, n_tokens] 
  - else: Negative log likelihood of target tokens with shape [batch_size, sequence_length]
744
- `new_mems`: list (num layers) of updated mem states at the entry of each layer each mem state is a torch.FloatTensor of size [self.config.mem_len, batch_size, self.config.d_model]. Note that the first two dimensions are transposed in `mems` with regards to `input_ids`.
thomwolf's avatar
thomwolf committed
745

thomwolf's avatar
thomwolf committed
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
#### 14. `GPT2Model`

`GPT2Model` is the OpenAI GPT-2 Transformer model with a layer of summed token and position embeddings followed by a series of 12 identical self-attention blocks.

The inputs and output are **identical to the TensorFlow model inputs and outputs**.

We detail them here. This model takes as *inputs*:
[`modeling_gpt2.py`](./pytorch_pretrained_bert/modeling_gpt2.py)
- `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] (or more generally [d_1, ..., d_n, sequence_length] were d_1 ... d_n are arbitrary dimensions) with the word BPE token indices selected in the range [0, vocab_size[
- `position_ids`: an optional torch.LongTensor with the same shape as input_ids
    with the position indices (selected in the range [0, config.n_positions - 1[.
- `token_type_ids`: an optional torch.LongTensor with the same shape as input_ids
    You can use it to add a third type of embedding to each input token in the sequence
    (the previous two being the word and position embeddings). The input, position and token_type embeddings are summed inside the Transformer before the first self-attention block.
- `past`: an optional list of torch.LongTensor that contains pre-computed hidden-states (key and values in the attention blocks) to speed up sequential decoding (this is the `presents` output of the model, cf. below).

This model *outputs*:
- `hidden_states`: the encoded-hidden-states at the top of the model as a torch.FloatTensor of size [batch_size, sequence_length, hidden_size] (or more generally [d_1, ..., d_n, hidden_size] were d_1 ... d_n are the dimension of input_ids)
- `presents`: a list of pre-computed hidden-states (key and values in each attention blocks) as a torch.FloatTensors. They can be reused to speed up sequential decoding (see the `run_gpt2.py` example).

#### 15. `GPT2LMHeadModel`

`GPT2LMHeadModel` includes the `GPT2Model` Transformer followed by a language modeling head with weights tied to the input embeddings (no additional parameters).

*Inputs* are the same as the inputs of the [`GPT2Model`](#-14.-`GPT2Model`) class plus optional labels:
- `lm_labels`: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].

*Outputs*:
- if `lm_labels` is not `None`:
  Outputs the language modeling loss.
Joel Grus's avatar
Joel Grus committed
776
- else: a tuple of
thomwolf's avatar
thomwolf committed
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
  - `lm_logits`: the language modeling logits as a torch.FloatTensor of size [batch_size, sequence_length, total_tokens_embeddings] (or more generally [d_1, ..., d_n, total_tokens_embeddings] were d_1 ... d_n are the dimension of input_ids)
  - `presents`: a list of pre-computed hidden-states (key and values in each attention blocks) as a torch.FloatTensors. They can be reused to speed up sequential decoding (see the `run_gpt2.py` example).

#### 16. `GPT2DoubleHeadsModel`

`GPT2DoubleHeadsModel` includes the `GPT2Model` Transformer followed by two heads:
- a language modeling head with weights tied to the input embeddings (no additional parameters) and:
- a multiple choice classifier (linear layer that take as input a hidden state in a sequence to compute a score, see details in paper).

*Inputs* are the same as the inputs of the [`GPT2Model`](#-14.-`GPT2Model`) class plus a classification mask and two optional labels:
- `multiple_choice_token_ids`: a torch.LongTensor of shape [batch_size, num_choices] with the index of the token whose hidden state should be used as input for the multiple choice classifier (usually the [CLS] token for each choice).
- `lm_labels`: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].
- `multiple_choice_labels`: optional multiple choice labels: torch.LongTensor of shape [batch_size] with indices selected in [0, ..., num_choices].

*Outputs*:
- if `lm_labels` and `multiple_choice_labels` are not `None`:
  Outputs a tuple of losses with the language modeling loss and the multiple choice loss.
- else Outputs a tuple with:
  - `lm_logits`: the language modeling logits as a torch.FloatTensor of size [batch_size, num_choices, sequence_length, total_tokens_embeddings]
  - `multiple_choice_logits`: the multiple choice logits as a torch.FloatTensor of size [batch_size, num_choices]
  - `presents`: a list of pre-computed hidden-states (key and values in each attention blocks) as a torch.FloatTensors. They can be reused to speed up sequential decoding (see the `run_gpt2.py` example).


thomwolf's avatar
thomwolf committed
800
801
802
### Tokenizers:

#### `BertTokenizer`
thomwolf's avatar
thomwolf committed
803

thomwolf's avatar
thomwolf committed
804
`BertTokenizer` perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.
thomwolf's avatar
thomwolf committed
805

806
This class has five arguments:
thomwolf's avatar
thomwolf committed
807

thomwolf's avatar
thomwolf committed
808
809
- `vocab_file`: path to a vocabulary file.
- `do_lower_case`: convert text to lower-case while tokenizing. **Default = True**.
thomwolf's avatar
thomwolf committed
810
- `max_len`: max length to filter the input of the Transformer. Default to pre-trained value for the model if `None`. **Default = None**
811
- `do_basic_tokenize`: Do basic tokenization before wordpice tokenization. Set to false if text is pre-tokenized. **Default = True**.
thomwolf's avatar
thomwolf committed
812
- `never_split`: a list of tokens that should not be splitted during tokenization. **Default = `["[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"]`**
thomwolf's avatar
thomwolf committed
813

thomwolf's avatar
thomwolf committed
814
and three methods:
Thomas Wolf's avatar
typos  
Thomas Wolf committed
815

thomwolf's avatar
thomwolf committed
816
817
818
- `tokenize(text)`: convert a `str` in a list of `str` tokens by (1) performing basic tokenization and (2) WordPiece tokenization.
- `convert_tokens_to_ids(tokens)`: convert a list of `str` tokens in a list of `int` indices in the vocabulary.
- `convert_ids_to_tokens(tokens)`: convert a list of `int` indices in a list of `str` tokens in the vocabulary.
thomwolf's avatar
thomwolf committed
819

thomwolf's avatar
thomwolf committed
820
Please refer to the doc strings and code in [`tokenization.py`](./pytorch_pretrained_bert/tokenization.py) for the details of the `BasicTokenizer` and `WordpieceTokenizer` classes. In general it is recommended to use `BertTokenizer` unless you know what you are doing.
thomwolf's avatar
thomwolf committed
821

thomwolf's avatar
thomwolf committed
822
823
824
825
#### `OpenAIGPTTokenizer`

`OpenAIGPTTokenizer` perform Byte-Pair-Encoding (BPE) tokenization.

thomwolf's avatar
thomwolf committed
826
This class has four arguments:
thomwolf's avatar
thomwolf committed
827
828
829

- `vocab_file`: path to a vocabulary file.
- `merges_file`: path to a file containing the BPE merges.
thomwolf's avatar
thomwolf committed
830
831
- `max_len`: max length to filter the input of the Transformer. Default to pre-trained value for the model if `None`. **Default = None**
- `special_tokens`: a list of tokens to add to the vocabulary for fine-tuning. If SpaCy is not installed and BERT's `BasicTokenizer` is used as the pre-BPE tokenizer, these tokens are not split. **Default= None**
thomwolf's avatar
thomwolf committed
832

thomwolf's avatar
thomwolf committed
833
and five methods:
thomwolf's avatar
thomwolf committed
834
835
836
837

- `tokenize(text)`: convert a `str` in a list of `str` tokens by (1) performing basic tokenization and (2) WordPiece tokenization.
- `convert_tokens_to_ids(tokens)`: convert a list of `str` tokens in a list of `int` indices in the vocabulary.
- `convert_ids_to_tokens(tokens)`: convert a list of `int` indices in a list of `str` tokens in the vocabulary.
thomwolf's avatar
thomwolf committed
838
839
- `set_special_tokens(self, special_tokens)`: update the list of special tokens (see above arguments)
- `decode(ids, skip_special_tokens=False, clean_up_tokenization_spaces=False)`: decode a list of `int` indices in a string and do some post-processing if needed: (i) remove special tokens from the output and (ii) clean up tokenization spaces.
thomwolf's avatar
thomwolf committed
840
841
842

Please refer to the doc strings and code in [`tokenization_openai.py`](./pytorch_pretrained_bert/tokenization_openai.py) for the details of the `OpenAIGPTTokenizer`.

thomwolf's avatar
thomwolf committed
843
844
#### `TransfoXLTokenizer`

845
`TransfoXLTokenizer` perform word tokenization. This tokenizer can be used for adaptive softmax and has utilities for counting tokens in a corpus to create a vocabulary ordered by toekn frequency (for adaptive softmax). See the adaptive softmax paper ([Efficient softmax approximation for GPUs](http://arxiv.org/abs/1609.04309)) for more details.
thomwolf's avatar
thomwolf committed
846

847
Please refer to the doc strings and code in [`tokenization_transfo_xl.py`](./pytorch_pretrained_bert/tokenization_transfo_xl.py) for the details of these additional methods in `TransfoXLTokenizer`.
thomwolf's avatar
thomwolf committed
848

thomwolf's avatar
thomwolf committed
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
#### `GPT2Tokenizer`

`GPT2Tokenizer` perform byte-level Byte-Pair-Encoding (BPE) tokenization.

This class has three arguments:

- `vocab_file`: path to a vocabulary file.
- `merges_file`: path to a file containing the BPE merges.
- `errors`: How to handle unicode decoding errors. **Default = `replace`**

and two methods:

- `encode(text)`: convert a `str` in a list of `int` tokens by performing byte-level BPE.
- `decode(tokens)`: convert back a list of `int` tokens in a `str`.

Please refer to [`tokenization_gpt2.py`](./pytorch_pretrained_bert/tokenization_gpt2.py) for more details on the `GPT2Tokenizer`.


thomwolf's avatar
thomwolf committed
867
868
869
### Optimizers:

#### `BertAdam`
thomwolf's avatar
thomwolf committed
870

thomwolf's avatar
thomwolf committed
871
`BertAdam` is a `torch.optimizer` adapted to be closer to the optimizer used in the TensorFlow implementation of Bert. The differences with PyTorch Adam optimizer are the following:
thomwolf's avatar
thomwolf committed
872

thomwolf's avatar
thomwolf committed
873
874
- BertAdam implements weight decay fix,
- BertAdam doesn't compensate for bias as in the regular Adam optimizer.
thomwolf's avatar
thomwolf committed
875
876
877
878

The optimizer accepts the following arguments:

- `lr` : learning rate
Thomas Wolf's avatar
Thomas Wolf committed
879
- `warmup` : portion of `t_total` for the warmup, `-1`  means no warmup. Default : `-1`
thomwolf's avatar
thomwolf committed
880
- `t_total` : total number of training steps for the learning
Thomas Wolf's avatar
Thomas Wolf committed
881
882
883
884
885
    rate schedule, `-1`  means constant learning rate. Default : `-1`
- `schedule` : schedule to use for the warmup (see above). Default : `'warmup_linear'`
- `b1` : Adams b1. Default : `0.9`
- `b2` : Adams b2. Default : `0.999`
- `e` : Adams epsilon. Default : `1e-6`
886
- `weight_decay:` Weight decay. Default : `0.01`
Thomas Wolf's avatar
Thomas Wolf committed
887
- `max_grad_norm` : Maximum norm for the gradients (`-1` means no clipping). Default : `1.0`
thomwolf's avatar
thomwolf committed
888

thomwolf's avatar
thomwolf committed
889
890
891
892
893
894
895
#### `OpenAIGPTAdam`

`OpenAIGPTAdam` is similar to `BertAdam`.
The differences with `BertAdam` is that `OpenAIGPTAdam` compensate for bias as in the regular Adam optimizer.

`OpenAIGPTAdam` accepts the same arguments as `BertAdam`.

thomwolf's avatar
thomwolf committed
896
## Examples
thomwolf's avatar
thomwolf committed
897

thomwolf's avatar
thomwolf committed
898
899
900
| Sub-section | Description |
|-|-|
| [Training large models: introduction, tools and examples](#Training-large-models-introduction,-tools-and-examples) | How to use gradient-accumulation, multi-gpu training, distributed training, optimize on CPU and 16-bits training to train Bert models |
tholor's avatar
tholor committed
901
| [Fine-tuning with BERT: running the examples](#Fine-tuning-with-BERT-running-the-examples) | Running the examples in [`./examples`](./examples/): `extract_classif.py`, `run_classifier.py`, `run_squad.py` and `run_lm_finetuning.py` |
902
| [Fine-tuning with OpenAI GPT, Transformer-XL and GPT-2](#Fine-tuning-with-OpenAI-GPT-Transformer-XL-and-GPT-2) | Running the examples in [`./examples`](./examples/): `run_openai_gpt.py`, `run_transfo_xl.py` and `run_gpt2.py` |
thomwolf's avatar
thomwolf committed
903
904
| [Fine-tuning BERT-large on GPUs](#Fine-tuning-BERT-large-on-GPUs) | How to fine tune `BERT large`|

thomwolf's avatar
thomwolf committed
905
### Training large models: introduction, tools and examples
thomwolf's avatar
thomwolf committed
906

Thomas Wolf's avatar
Thomas Wolf committed
907
BERT-base and BERT-large are respectively 110M and 340M parameters models and it can be difficult to fine-tune them on a single GPU with the recommended batch size for good performance (in most case a batch size of 32).
thomwolf's avatar
thomwolf committed
908

909
To help with fine-tuning these models, we have included several techniques that you can activate in the fine-tuning scripts [`run_classifier.py`](./examples/run_classifier.py) and [`run_squad.py`](./examples/run_squad.py): gradient-accumulation, multi-gpu training, distributed training and 16-bits training . For more details on how to use these techniques you can read [the tips on training large batches in PyTorch](https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255) that I published earlier this month.
thomwolf's avatar
thomwolf committed
910

thomwolf's avatar
thomwolf committed
911
Here is how to use these techniques in our scripts:
thomwolf's avatar
thomwolf committed
912

thomwolf's avatar
thomwolf committed
913
914
- **Gradient Accumulation**: Gradient accumulation can be used by supplying a integer greater than 1 to the `--gradient_accumulation_steps` argument. The batch at each step will be divided by this integer and gradient will be accumulated over `gradient_accumulation_steps` steps.
- **Multi-GPU**: Multi-GPU is automatically activated when several GPUs are detected and the batches are splitted over the GPUs.
thomwolf's avatar
thomwolf committed
915
- **Distributed training**: Distributed training can be activated by supplying an integer greater or equal to 0 to the `--local_rank` argument (see below).
Julien Chaumond's avatar
Julien Chaumond committed
916
- **16-bits training**: 16-bits training, also called mixed-precision training, can reduce the memory requirement of your model on the GPU by using half-precision training, basically allowing to double the batch size. If you have a recent GPU (starting from NVIDIA Volta architecture) you should see no decrease in speed. A good introduction to Mixed precision training can be found [here](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) and a full documentation is [here](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html). In our scripts, this option can be activated by setting the `--fp16` flag and you can play with loss scaling using the `--loss_scale` flag (see the previously linked documentation for details on loss scaling). The loss scale can be zero in which case the scale is dynamically adjusted or a positive power of two in which case the scaling is static.
917

Julien Chaumond's avatar
Julien Chaumond committed
918
To use 16-bits training and distributed training, you need to install NVIDIA's apex extension [as detailed here](https://github.com/nvidia/apex). You will find more information regarding the internals of `apex` and how to use `apex` in [the doc and the associated repository](https://github.com/nvidia/apex). The results of the tests performed on pytorch-BERT by the NVIDIA team (and my trials at reproducing them) can be consulted in [the relevant PR of the present repository](https://github.com/huggingface/pytorch-pretrained-BERT/pull/116).
thomwolf's avatar
thomwolf committed
919

thomwolf's avatar
thomwolf committed
920
Note: To use *Distributed Training*, you will need to run one training script on each of your machines. This can be done for example by running the following command on each server (see [the above mentioned blog post]((https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255)) for more details):
thomwolf's avatar
thomwolf committed
921
922
923
```bash
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=$THIS_MACHINE_INDEX --master_addr="192.168.1.1" --master_port=1234 run_classifier.py (--arg1 --arg2 --arg3 and all other arguments of the run_classifier script)
```
924
Where `$THIS_MACHINE_INDEX` is an sequential index assigned to each of your machine (0, 1, 2...) and the machine with rank 0 has an IP address `192.168.1.1` and an open port `1234`.
thomwolf's avatar
thomwolf committed
925

thomwolf's avatar
thomwolf committed
926
### Fine-tuning with BERT: running the examples
VictorSanh's avatar
VictorSanh committed
927

928
We showcase several fine-tuning examples based on (and extended from) [the original implementation](https://github.com/google-research/bert/):
VictorSanh's avatar
VictorSanh committed
929

930
- a *sequence-level classifier* on nine different GLUE tasks,
thomwolf's avatar
thomwolf committed
931
932
- a *token-level classifier* on the question answering dataset SQuAD, and
- a *sequence-level multiple-choice classifier* on the SWAG classification corpus.
tholor's avatar
tholor committed
933
- a *BERT language model* on another target corpus
Joel Grus's avatar
Joel Grus committed
934

935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
#### GLUE results on dev set

We get the following results on the dev set of GLUE benchmark with an uncased BERT base 
model. All experiments were run on a P100 GPU with a batch size of 32.

| Task | Metric | Result |
|-|-|-|
| CoLA | Matthew's corr. | 57.29 |
| SST-2 | accuracy | 93.00 |
| MRPC | F1/accuracy | 88.85/83.82 |
| STS-B | Pearson/Spearman corr. | 89.70/89.37 |
| QQP | accuracy/F1 | 90.72/87.41 |
| MNLI | matched acc./mismatched acc.| 83.95/84.39 |
| QNLI | accuracy | 89.04 |
| RTE | accuracy | 61.01 |
| WNLI | accuracy | 53.52 |

Some of these results are significantly different from the ones reported on the test set
of GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the webite.

Before running anyone of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.

```shell
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC

python run_classifier.py \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --bert_model bert-base-uncased \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/$TASK_NAME/
```

where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.

The dev set results will be present within the text file 'eval_results.txt' in the specified output_dir. In case of MNLI, since there are two separate dev sets, matched and mismatched, there will be a separate output folder called '/tmp/MNLI-MM/' in addition to '/tmp/MNLI/'.

The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI, CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being said, there shouldn't be any issues in running half-precision training with the remaining GLUE tasks as well, since the data processor for each task inherits from the base class DataProcessor.

984
985
986
987
988
989
#### MRPC

This example code fine-tunes BERT on the Microsoft Research Paraphrase
Corpus (MRPC) corpus and runs in less than 10 minutes on a single K-80 and in 27 seconds (!) on single tesla V100 16GB with apex installed.

Before running this example you should download the
VictorSanh's avatar
VictorSanh committed
990
991
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
992
and unpack it to some directory `$GLUE_DIR`.
VictorSanh's avatar
VictorSanh committed
993
994
995
996

```shell
export GLUE_DIR=/path/to/glue

997
python run_classifier.py \
VictorSanh's avatar
VictorSanh committed
998
999
1000
  --task_name MRPC \
  --do_train \
  --do_eval \
1001
  --do_lower_case \
VictorSanh's avatar
VictorSanh committed
1002
  --data_dir $GLUE_DIR/MRPC/ \
thomwolf's avatar
thomwolf committed
1003
  --bert_model bert-base-uncased \
VictorSanh's avatar
VictorSanh committed
1004
1005
1006
1007
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
1008
  --output_dir /tmp/mrpc_output/
VictorSanh's avatar
VictorSanh committed
1009
1010
```

Thomas Wolf's avatar
Thomas Wolf committed
1011
Our test ran on a few seeds with [the original implementation hyper-parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks) gave evaluation results between 84% and 88%.
thomwolf's avatar
thomwolf committed
1012

1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
**Fast run with apex and 16 bit precision: fine-tuning on MRPC in 27 seconds!**
First install apex as indicated [here](https://github.com/NVIDIA/apex).
Then run
```shell
export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/MRPC/ \
  --bert_model bert-base-uncased \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
1030
1031
  --output_dir /tmp/mrpc_output/ \
  --fp16
1032
1033
1034
1035
```

#### SQuAD

thomwolf's avatar
thomwolf committed
1036
This example code fine-tunes BERT on the SQuAD dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large) on a single tesla V100 16GB.
VictorSanh's avatar
VictorSanh committed
1037

VictorSanh's avatar
VictorSanh committed
1038
The data for SQuAD can be downloaded with the following links and should be saved in a `$SQUAD_DIR` directory.
1039

VictorSanh's avatar
VictorSanh committed
1040
1041
1042
1043
*   [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
*   [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
*   [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)

VictorSanh's avatar
VictorSanh committed
1044
```shell
VictorSanh's avatar
VictorSanh committed
1045
export SQUAD_DIR=/path/to/SQUAD
VictorSanh's avatar
VictorSanh committed
1046

1047
python run_squad.py \
thomwolf's avatar
thomwolf committed
1048
  --bert_model bert-base-uncased \
VictorSanh's avatar
VictorSanh committed
1049
1050
  --do_train \
  --do_predict \
1051
  --do_lower_case \
Thomas Wolf's avatar
Thomas Wolf committed
1052
  --train_file $SQUAD_DIR/train-v1.1.json \
thomwolf's avatar
thomwolf committed
1053
1054
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --train_batch_size 12 \
Thomas Wolf's avatar
Thomas Wolf committed
1055
  --learning_rate 3e-5 \
thomwolf's avatar
thomwolf committed
1056
1057
1058
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
thomwolf's avatar
thomwolf committed
1059
  --output_dir /tmp/debug_squad/
thomwolf's avatar
thomwolf committed
1060
```
1061

Thomas Wolf's avatar
Thomas Wolf committed
1062
Training with the previous hyper-parameters gave us the following results:
1063
```bash
Thomas Wolf's avatar
Thomas Wolf committed
1064
{"f1": 88.52381567990474, "exact_match": 81.22043519394512}
1065
```
1066

thomwolf's avatar
thomwolf committed
1067
1068
1069
#### SWAG

The data for SWAG can be downloaded by cloning the following [repository](https://github.com/rowanz/swagaf)
1070
1071
1072
1073
1074
1075
1076

```shell
export SWAG_DIR=/path/to/SWAG

python run_swag.py \
  --bert_model bert-base-uncased \
  --do_train \
thomwolf's avatar
thomwolf committed
1077
  --do_lower_case \
1078
  --do_eval \
thomwolf's avatar
thomwolf committed
1079
  --data_dir $SWAG_DIR/data \
1080
  --train_batch_size 16 \
1081
1082
1083
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --max_seq_length 80 \
thomwolf's avatar
thomwolf committed
1084
  --output_dir /tmp/swag_output/ \
1085
  --gradient_accumulation_steps 4
1086
1087
```

1088
Training with the previous hyper-parameters on a single GPU gave us the following results:
1089
```
1090
1091
1092
1093
eval_accuracy = 0.8062081375587323
eval_loss = 0.5966546792367169
global_step = 13788
loss = 0.06423990014260186
1094
1095
```

tholor's avatar
tholor committed
1096
1097
1098
#### LM Fine-tuning

The data should be a text file in the same format as [sample_text.txt](./samples/sample_text.txt)  (one sentence per line, docs separated by empty line).
Joel Grus's avatar
Joel Grus committed
1099
You can download an [exemplary training corpus](https://ext-bert-sample.obs.eu-de.otc.t-systems.com/small_wiki_sentence_corpus.txt) generated from wikipedia articles and splitted into ~500k sentences with spaCy.
1100
Training one epoch on this corpus takes about 1:20h on 4 x NVIDIA Tesla P100 with `train_batch_size=200` and `max_seq_length=128`:
tholor's avatar
tholor committed
1101
1102


thomwolf's avatar
thomwolf committed
1103
Thank to the work of @Rocketknight1 and @tholor there are now **several scripts** that can be used to fine-tune BERT using the pretraining objective (combination of masked-language modeling and next sentence prediction loss). These scripts are detailed in the [`README`](./examples/lm_finetuning/README.md) of the [`examples/lm_finetuning/`](./examples/lm_finetuning/) folder.
tholor's avatar
tholor committed
1104

thomwolf's avatar
thomwolf committed
1105
### OpenAI GPT, Transformer-XL and GPT-2: running the examples
thomwolf's avatar
thomwolf committed
1106

thomwolf's avatar
thomwolf committed
1107
We provide three examples of scripts for OpenAI GPT, Transformer-XL and OpenAI GPT-2 based on (and extended from) the respective original implementations:
thomwolf's avatar
thomwolf committed
1108
1109
1110

- fine-tuning OpenAI GPT on the ROCStories dataset
- evaluating Transformer-XL on Wikitext 103
thomwolf's avatar
thomwolf committed
1111
- unconditional and conditional generation from a pre-trained OpenAI GPT-2 model
thomwolf's avatar
thomwolf committed
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122

#### Fine-tuning OpenAI GPT on the RocStories dataset

This example code fine-tunes OpenAI GPT on the RocStories dataset.

Before running this example you should download the
[RocStories dataset](https://github.com/snigdhac/StoryComprehension_EMNLP/tree/master/Dataset/RoCStories) and unpack it to some directory `$ROC_STORIES_DIR`.

```shell
export ROC_STORIES_DIR=/path/to/RocStories

thomwolf's avatar
thomwolf committed
1123
1124
python run_openai_gpt.py \
  --model_name openai-gpt \
thomwolf's avatar
thomwolf committed
1125
1126
  --do_train \
  --do_eval \
thomwolf's avatar
thomwolf committed
1127
1128
1129
1130
  --train_dataset $ROC_STORIES_DIR/cloze_test_val__spring2016\ -\ cloze_test_ALL_val.csv \
  --eval_dataset $ROC_STORIES_DIR/cloze_test_test__spring2016\ -\ cloze_test_ALL_test.csv \
  --output_dir ../log \
  --train_batch_size 16 \
thomwolf's avatar
thomwolf committed
1131
1132
```

1133
This command runs in about 10 min on a single K-80 an gives an evaluation accuracy of about 87.7% (the authors report a median accuracy with the TensorFlow code of 85.8% and the OpenAI GPT paper reports a best single run accuracy of 86.5%).
thomwolf's avatar
thomwolf committed
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143

#### Evaluating the pre-trained Transformer-XL on the WikiText 103 dataset

This example code evaluate the pre-trained Transformer-XL on the WikiText 103 dataset.
This command will download a pre-processed version of the WikiText 103 dataset in which the vocabulary has been computed.

```shell
python run_transfo_xl.py --work_dir ../log
```

1144
This command runs in about 1 min on a V100 and gives an evaluation perplexity of 18.22 on WikiText-103 (the authors report a perplexity of about 18.3 on this dataset with the TensorFlow code).
thomwolf's avatar
thomwolf committed
1145

thomwolf's avatar
thomwolf committed
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
#### Unconditional and conditional generation from OpenAI's GPT-2 model

This example code is identical to the original unconditional and conditional generation codes.

Conditional generation:
```shell
python run_gpt2.py
```

Unconditional generation:
```shell
python run_gpt2.py --unconditional
```

The same option as in the original scripts are provided, please refere to the code of the example and the original repository of OpenAI.

thomwolf's avatar
thomwolf committed
1162
## Fine-tuning BERT-large on GPUs
1163
1164
1165

The options we list above allow to fine-tune BERT-large rather easily on GPU(s) instead of the TPU used by the original implementation.

Thomas Wolf's avatar
Thomas Wolf committed
1166
For example, fine-tuning BERT-large on SQuAD can be done on a server with 4 k-80 (these are pretty old now) in 18 hours. Our results are similar to the TensorFlow implementation results (actually slightly higher):
1167
1168
1169
```bash
{"exact_match": 84.56953642384106, "f1": 91.04028647786927}
```
Thomas Wolf's avatar
Thomas Wolf committed
1170
To get these results we used a combination of:
1171
1172
1173
1174
- multi-GPU training (automatically activated on a multi-GPU server),
- 2 steps of gradient accumulation and
- perform the optimization step on CPU to store Adam's averages in RAM.

thomwolf's avatar
thomwolf committed
1175
Here is the full list of hyper-parameters for this run:
1176
```bash
Thomas Wolf's avatar
Thomas Wolf committed
1177
python ./run_squad.py \
thomwolf's avatar
thomwolf committed
1178
  --bert_model bert-large-uncased \
Thomas Wolf's avatar
Thomas Wolf committed
1179
1180
  --do_train \
  --do_predict \
1181
  --do_lower_case \
Thomas Wolf's avatar
Thomas Wolf committed
1182
1183
1184
1185
1186
1187
1188
1189
  --train_file $SQUAD_TRAIN \
  --predict_file $SQUAD_EVAL \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir $OUTPUT_DIR \
  --train_batch_size 24 \
Joel Grus's avatar
Joel Grus committed
1190
  --gradient_accumulation_steps 2
1191
```
1192
1193
1194
1195
1196
1197

If you have a recent GPU (starting from NVIDIA Volta series), you should try **16-bit fine-tuning** (FP16).

Here is an example of hyper-parameters for a FP16 run we tried:
```bash
python ./run_squad.py \
thomwolf's avatar
thomwolf committed
1198
  --bert_model bert-large-uncased \
1199
1200
  --do_train \
  --do_predict \
1201
  --do_lower_case \
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
  --train_file $SQUAD_TRAIN \
  --predict_file $SQUAD_EVAL \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir $OUTPUT_DIR \
  --train_batch_size 24 \
  --fp16 \
  --loss_scale 128
```

The results were similar to the above FP32 results (actually slightly higher):
```bash
{"exact_match": 84.65468306527909, "f1": 91.238669287002}
```
thomwolf's avatar
thomwolf committed
1218

thomwolf's avatar
thomwolf committed
1219
## Notebooks
thomwolf's avatar
thomwolf committed
1220

Thomas Wolf's avatar
Thomas Wolf committed
1221
We include [three Jupyter Notebooks](https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks) that can be used to check that the predictions of the PyTorch model are identical to the predictions of the original TensorFlow model.
thomwolf's avatar
thomwolf committed
1222

thomwolf's avatar
thomwolf committed
1223
1224
1225
- The first NoteBook ([Comparing-TF-and-PT-models.ipynb](./notebooks/Comparing-TF-and-PT-models.ipynb)) extracts the hidden states of a full sequence on each layers of the TensorFlow and the PyTorch models and computes the standard deviation between them. In the given example, we get a standard deviation of 1.5e-7 to 9e-7 on the various hidden state of the models.

- The second NoteBook ([Comparing-TF-and-PT-models-SQuAD.ipynb](./notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb)) compares the loss computed by the TensorFlow and the PyTorch models for identical initialization of the fine-tuning layer of the `BertForQuestionAnswering` and computes the standard deviation between them. In the given example, we get a standard deviation of 2.5e-7 between the models.
thomwolf's avatar
thomwolf committed
1226

Thomas Wolf's avatar
Thomas Wolf committed
1227
- The third NoteBook ([Comparing-TF-and-PT-models-MLM-NSP.ipynb](./notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb)) compares the predictions computed by the TensorFlow and the PyTorch models for masked token language modeling using the pre-trained masked language modeling model.
thomwolf's avatar
thomwolf committed
1228

thomwolf's avatar
thomwolf committed
1229
Please follow the instructions given in the notebooks to run and modify them.
thomwolf's avatar
thomwolf committed
1230

thomwolf's avatar
thomwolf committed
1231
## Command-line interface
thomwolf's avatar
thomwolf committed
1232

thomwolf's avatar
thomwolf committed
1233
1234
1235
A command-line interface is provided to convert a TensorFlow checkpoint in a PyTorch dump of the `BertForPreTraining` class  (for BERT) or NumPy checkpoint in a PyTorch dump of the `OpenAIGPTModel` class  (for OpenAI GPT).

### BERT
thomwolf's avatar
thomwolf committed
1236

Weixin Wang's avatar
Weixin Wang committed
1237
You can convert any TensorFlow checkpoint for BERT (in particular [the pre-trained models released by Google](https://github.com/google-research/bert#pre-trained-models)) in a PyTorch save file by using the [`convert_tf_checkpoint_to_pytorch.py`](./pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py ) script.
thomwolf's avatar
thomwolf committed
1238

Weixin Wang's avatar
Weixin Wang committed
1239
This CLI takes as input a TensorFlow checkpoint (three files starting with `bert_model.ckpt`) and the associated configuration file (`bert_config.json`), and creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using `torch.load()` (see examples in [`extract_features.py`](./examples/extract_features.py), [`run_classifier.py`](./examples/run_classifier.py) and [`run_squad.py`](./examples/run_squad.py)).
thomwolf's avatar
thomwolf committed
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249

You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow checkpoint (the three files starting with `bert_model.ckpt`) but be sure to keep the configuration file (`bert_config.json`) and the vocabulary file (`vocab.txt`) as these are needed for the PyTorch model too.

To run this specific conversion script you will need to have TensorFlow and PyTorch installed (`pip install tensorflow`). The rest of the repository only requires PyTorch.

Here is an example of the conversion process for a pre-trained `BERT-Base Uncased` model:

```shell
export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12

thomwolf's avatar
thomwolf committed
1250
pytorch_pretrained_bert convert_tf_checkpoint_to_pytorch \
thomwolf's avatar
thomwolf committed
1251
1252
1253
  $BERT_BASE_DIR/bert_model.ckpt \
  $BERT_BASE_DIR/bert_config.json \
  $BERT_BASE_DIR/pytorch_model.bin
thomwolf's avatar
thomwolf committed
1254
1255
1256
1257
```

You can download Google's pre-trained models for the conversion [here](https://github.com/google-research/bert#pre-trained-models).

thomwolf's avatar
thomwolf committed
1258
1259
### OpenAI GPT

thomwolf's avatar
thomwolf committed
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint save as the same format than OpenAI pretrained model (see [here](https://github.com/openai/finetune-transformer-lm))

```shell
export OPENAI_GPT_CHECKPOINT_FOLDER_PATH=/path/to/openai/pretrained/numpy/weights

pytorch_pretrained_bert convert_openai_checkpoint \
  $OPENAI_GPT_CHECKPOINT_FOLDER_PATH \
  $PYTORCH_DUMP_OUTPUT \
  [OPENAI_GPT_CONFIG]
```

### Transformer-XL

Here is an example of the conversion process for a pre-trained Transformer-XL model (see [here](https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models))
thomwolf's avatar
thomwolf committed
1274
1275

```shell
thomwolf's avatar
thomwolf committed
1276
export TRANSFO_XL_CHECKPOINT_FOLDER_PATH=/path/to/transfo/xl/checkpoint
thomwolf's avatar
thomwolf committed
1277

thomwolf's avatar
thomwolf committed
1278
1279
pytorch_pretrained_bert convert_transfo_xl_checkpoint \
  $TRANSFO_XL_CHECKPOINT_FOLDER_PATH \
thomwolf's avatar
thomwolf committed
1280
  $PYTORCH_DUMP_OUTPUT \
thomwolf's avatar
thomwolf committed
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
  [TRANSFO_XL_CONFIG]
```

### GPT-2

Here is an example of the conversion process for a pre-trained OpenAI's GPT-2 model.

```shell
export GPT2_DIR=/path/to/gpt2/checkpoint

pytorch_pretrained_bert convert_gpt2_checkpoint \
  $GPT2_DIR/model.ckpt \
  $PYTORCH_DUMP_OUTPUT \
  [GPT2_CONFIG]
thomwolf's avatar
thomwolf committed
1295
1296
```

thomwolf's avatar
thomwolf committed
1297
## TPU
thomwolf's avatar
thomwolf committed
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307

TPU support and pretraining scripts

TPU are not supported by the current stable release of PyTorch (0.4.1). However, the next version of PyTorch (v1.0) should support training on TPU and is expected to be released soon (see the recent [official announcement](https://cloud.google.com/blog/products/ai-machine-learning/introducing-pytorch-across-google-cloud)).

We will add TPU support when this next release is published.

The original TensorFlow code further comprises two scripts for pre-training BERT: [create_pretraining_data.py](https://github.com/google-research/bert/blob/master/create_pretraining_data.py) and [run_pretraining.py](https://github.com/google-research/bert/blob/master/run_pretraining.py).

Since, pre-training BERT is a particularly expensive operation that basically requires one or several TPUs to be completed in a reasonable amout of time (see details [here](https://github.com/google-research/bert#pre-training-with-bert)) we have decided to wait for the inclusion of TPU support in PyTorch to convert these pre-training scripts.