"vscode:/vscode.git/clone" did not exist on "cd8fad339826b15c109f8a9487ac0a7577f98b3b"
README.md 99.3 KB
Newer Older
Thomas Wolf's avatar
Thomas Wolf committed
1
# PyTorch Pretrained BERT: The Big & Extending Repository of pretrained Transformers

[![CircleCI](https://circleci.com/gh/huggingface/pytorch-pretrained-bert.svg?style=svg)](https://circleci.com/gh/huggingface/pytorch-pretrained-bert)

This repository contains op-for-op PyTorch implementations, pre-trained models and fine-tuning examples for:

- [Google's BERT model](https://github.com/google-research/bert),
- [OpenAI's GPT model](https://github.com/openai/finetune-transformer-lm),
- [OpenAI's GPT-2 model](https://blog.openai.com/better-language-models/),
- [Google/CMU's Transformer-XL model](https://github.com/kimiyoung/transformer-xl),
- [Google/CMU's XLNet model](https://github.com/zihangdai/xlnet/), and
- [Facebook's XLM model](https://github.com/facebookresearch/XLM/).

These implementations have been tested on several datasets (see the examples) and should match the performance of the associated TensorFlow implementations (e.g. ~91 F1 on SQuAD for BERT, ~88 F1 on RocStories for OpenAI GPT and ~18.3 perplexity on WikiText 103 for Transformer-XL). You can find more details in the [Examples](#examples) section below.

Here is some information on these models:

**BERT** was released together with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. This PyTorch implementation of BERT is provided with [Google's pre-trained models](https://github.com/google-research/bert), examples, notebooks and a command-line interface to load any pre-trained TensorFlow checkpoint for BERT.

**OpenAI GPT** was released together with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. This PyTorch implementation of OpenAI GPT is an adaptation of the [PyTorch implementation by HuggingFace](https://github.com/huggingface/pytorch-openai-transformer-lm) and is provided with [OpenAI's pre-trained model](https://github.com/openai/finetune-transformer-lm) and a command-line interface that was used to convert the pre-trained NumPy checkpoint into PyTorch.

**OpenAI GPT-2** was released together with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. This PyTorch implementation of OpenAI GPT-2 is an adaptation of [OpenAI's implementation](https://github.com/openai/gpt-2) and is provided with [OpenAI's pre-trained model](https://github.com/openai/gpt-2) and a command-line interface that was used to convert the TensorFlow checkpoint into PyTorch.

**Google/CMU's Transformer-XL** was released together with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](http://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le and Ruslan Salakhutdinov.
This PyTorch implementation of Transformer-XL is an adaptation of the original [PyTorch implementation](https://github.com/kimiyoung/transformer-xl), slightly modified to match the performance of the TensorFlow implementation and to allow re-use of the pretrained weights. A command-line interface is provided to convert TensorFlow checkpoints into PyTorch models.

**Google/CMU's XLNet** was released together with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le.
This PyTorch implementation of XLNet is provided with [Google/CMU's pre-trained models](https://github.com/zihangdai/xlnet) and examples. A command-line interface is provided to convert TensorFlow checkpoints into PyTorch models.

**Facebook's XLM** was released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
This PyTorch implementation of XLM is an adaptation of the original [PyTorch implementation](https://github.com/facebookresearch/XLM). A command-line interface is provided to convert the original PyTorch checkpoints into PyTorch models compatible with this repository.

## Content

| Section | Description |
| - | - |
| [Installation](#installation) | How to install the package |
| [Overview](#overview) | Overview of the package |
| [Usage](#usage) | Quickstart examples |
| [Doc](#doc) | Detailed documentation |
| [Examples](#examples) | Detailed examples on how to fine-tune the models |
| [Notebooks](#notebooks) | Introduction to the provided Jupyter Notebooks |
| [TPU](#tpu) | Notes on TPU support and pretraining scripts |
| [Command-line interface](#Command-line-interface) | Convert a TensorFlow checkpoint into a PyTorch dump |

## Installation

This repo was tested on Python 2.7 and 3.5+ (examples are tested only on Python 3.5+) and PyTorch 0.4.1/1.0.0.

### With pip

The package can be installed with pip as follows:

```bash
pip install pytorch-transformers
```

If you want to reproduce the original tokenization process of the `OpenAI GPT` paper, you will need to install `ftfy` (limit to version 4.4.3 if you are using Python 2) and `SpaCy`:

```bash
pip install spacy ftfy==4.4.3
python -m spacy download en
```

If you don't install `ftfy` and `SpaCy`, the `OpenAI GPT` tokenizer will default to tokenizing using BERT's `BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).

### From source

Clone the repository and run:

```bash
git clone https://github.com/huggingface/pytorch-transformers.git
cd pytorch-transformers
pip install [--editable] .
```

Here also, if you want to reproduce the original tokenization process of the `OpenAI GPT` model, you will need to install `ftfy` (limit to version 4.4.3 if you are using Python 2) and `SpaCy`:

```bash
pip install spacy ftfy==4.4.3
python -m spacy download en
```

Again, if you don't install `ftfy` and `SpaCy`, the `OpenAI GPT` tokenizer will default to tokenizing using BERT's `BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage).

A series of tests is included in the [tests folder](https://github.com/huggingface/pytorch-transformers/tree/master/tests) and can be run using `pytest` (install pytest if needed: `pip install pytest`).

You can run the tests with the command:

```bash
python -m pytest -sv tests/
```

## Overview

This package comprises the following classes that can be imported in Python and are detailed in the [Doc](#doc) section of this readme:

- Eight **Bert** PyTorch models (`torch.nn.Module`) with pre-trained weights (in the [`modeling.py`](./pytorch_transformers/modeling.py) file):
  - [`BertModel`](./pytorch_transformers/modeling.py#L639) - raw BERT Transformer model (**fully pre-trained**),
  - [`BertForMaskedLM`](./pytorch_transformers/modeling.py#L793) - BERT Transformer with the pre-trained masked language modeling head on top (**fully pre-trained**),
  - [`BertForNextSentencePrediction`](./pytorch_transformers/modeling.py#L854) - BERT Transformer with the pre-trained next sentence prediction classifier on top  (**fully pre-trained**),
  - [`BertForPreTraining`](./pytorch_transformers/modeling.py#L722) - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (**fully pre-trained**),
  - [`BertForSequenceClassification`](./pytorch_transformers/modeling.py#L916) - BERT Transformer with a sequence classification head on top (BERT Transformer is **pre-trained**, the sequence classification head **is only initialized and has to be trained**),
  - [`BertForMultipleChoice`](./pytorch_transformers/modeling.py#L982) - BERT Transformer with a multiple choice head on top (used for task like Swag) (BERT Transformer is **pre-trained**, the multiple choice classification head **is only initialized and has to be trained**),
  - [`BertForTokenClassification`](./pytorch_transformers/modeling.py#L1051) - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**, the token classification head **is only initialized and has to be trained**),
  - [`BertForQuestionAnswering`](./pytorch_transformers/modeling.py#L1124) - BERT Transformer with a span classification head on top for extractive question answering, computing start and end logits of the answer span (BERT Transformer is **pre-trained**, the span classification head **is only initialized and has to be trained**).

- Three **OpenAI GPT** PyTorch models (`torch.nn.Module`) with pre-trained weights (in the [`modeling_openai.py`](./pytorch_transformers/modeling_openai.py) file):
  - [`OpenAIGPTModel`](./pytorch_transformers/modeling_openai.py#L536) - raw OpenAI GPT Transformer model (**fully pre-trained**),
  - [`OpenAIGPTLMHeadModel`](./pytorch_transformers/modeling_openai.py#L643) - OpenAI GPT Transformer with the tied language modeling head on top (**fully pre-trained**),
  - [`OpenAIGPTDoubleHeadsModel`](./pytorch_transformers/modeling_openai.py#L722) - OpenAI GPT Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT Transformer is **pre-trained**, the multiple choice classification head **is only initialized and has to be trained**),

- Two **Transformer-XL** PyTorch models (`torch.nn.Module`) with pre-trained weights (in the [`modeling_transfo_xl.py`](./pytorch_transformers/modeling_transfo_xl.py) file):
  - [`TransfoXLModel`](./pytorch_transformers/modeling_transfo_xl.py#L983) - Transformer-XL model which outputs the last hidden state and memory cells (**fully pre-trained**),
  - [`TransfoXLLMHeadModel`](./pytorch_transformers/modeling_transfo_xl.py#L1260) - Transformer-XL with the tied adaptive softmax head on top for language modeling which outputs the logits/loss and memory cells (**fully pre-trained**),

- Three **OpenAI GPT-2** PyTorch models (`torch.nn.Module`) with pre-trained weights (in the [`modeling_gpt2.py`](./pytorch_transformers/modeling_gpt2.py) file):
  - [`GPT2Model`](./pytorch_transformers/modeling_gpt2.py#L479) - raw OpenAI GPT-2 Transformer model (**fully pre-trained**),
  - [`GPT2LMHeadModel`](./pytorch_transformers/modeling_gpt2.py#L559) - OpenAI GPT-2 Transformer with the tied language modeling head on top (**fully pre-trained**),
  - [`GPT2DoubleHeadsModel`](./pytorch_transformers/modeling_gpt2.py#L624) - OpenAI GPT-2 Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT-2 Transformer is **pre-trained**, the multiple choice classification head **is only initialized and has to be trained**),

- Tokenizers for **BERT** (using word-piece) (in the [`tokenization.py`](./pytorch_transformers/tokenization.py) file):
  - `BasicTokenizer` - basic tokenization (punctuation splitting, lower casing, etc.),
  - `WordpieceTokenizer` - WordPiece tokenization,
  - `BertTokenizer` - perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.

- Tokenizer for **OpenAI GPT** (using Byte-Pair-Encoding) (in the [`tokenization_openai.py`](./pytorch_transformers/tokenization_openai.py) file):
  - `OpenAIGPTTokenizer` - perform Byte-Pair-Encoding (BPE) tokenization.

- Tokenizer for **Transformer-XL** (word tokens ordered by frequency for adaptive softmax) (in the [`tokenization_transfo_xl.py`](./pytorch_transformers/tokenization_transfo_xl.py) file):
  - `TransfoXLTokenizer` - perform word tokenization and can order words by frequency in a corpus for use in an adaptive softmax.

- Tokenizer for **OpenAI GPT-2** (using byte-level Byte-Pair-Encoding) (in the [`tokenization_gpt2.py`](./pytorch_transformers/tokenization_gpt2.py) file):
  - `GPT2Tokenizer` - perform byte-level Byte-Pair-Encoding (BPE) tokenization.

- Optimizer for **BERT** (in the [`optimization.py`](./pytorch_transformers/optimization.py) file):
  - `BertAdam` - Bert version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.

- Optimizer for **OpenAI GPT** (in the [`optimization_openai.py`](./pytorch_transformers/optimization_openai.py) file):
  - `OpenAIAdam` - OpenAI GPT version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
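
  For instance, here is a minimal sketch of setting up `BertAdam` for fine-tuning, assuming it is importable from the package as described above (the hyper-parameter values are illustrative, not prescriptive):

  ```python
  from pytorch_transformers import BertAdam, BertModel

  model = BertModel.from_pretrained('bert-base-uncased')

  # Linear learning-rate warmup over the first 10% of 1000 total optimization
  # steps, followed by linear decay of the learning rate (values illustrative)
  optimizer = BertAdam(model.parameters(), lr=5e-5, warmup=0.1, t_total=1000)
  ```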

- Configuration classes for BERT, OpenAI GPT, GPT-2 and Transformer-XL (in the respective [`modeling.py`](./pytorch_transformers/modeling.py), [`modeling_openai.py`](./pytorch_transformers/modeling_openai.py), [`modeling_gpt2.py`](./pytorch_transformers/modeling_gpt2.py) and [`modeling_transfo_xl.py`](./pytorch_transformers/modeling_transfo_xl.py) files):
  - `BertConfig` - Configuration class to store the configuration of a `BertModel` with utilities to read and write from JSON configuration files.
  - `OpenAIGPTConfig` - Configuration class to store the configuration of an `OpenAIGPTModel` with utilities to read and write from JSON configuration files.
  - `GPT2Config` - Configuration class to store the configuration of a `GPT2Model` with utilities to read and write from JSON configuration files.
  - `TransfoXLConfig` - Configuration class to store the configuration of a `TransfoXLModel` with utilities to read and write from JSON configuration files.

The repository further comprises:

- Five examples on how to use **BERT** (in the [`examples` folder](./examples)):
  - [`run_bert_extract_features.py`](./examples/run_bert_extract_features.py) - Show how to extract hidden states from an instance of `BertModel`,
  - [`run_bert_classifier.py`](./examples/run_bert_classifier.py) - Show how to fine-tune an instance of `BertForSequenceClassification` on GLUE's MRPC task,
  - [`run_bert_squad.py`](./examples/run_bert_squad.py) - Show how to fine-tune an instance of `BertForQuestionAnswering` on the SQuAD v1.0 and SQuAD v2.0 tasks,
  - [`run_swag.py`](./examples/run_swag.py) - Show how to fine-tune an instance of `BertForMultipleChoice` on the SWAG task,
  - [`simple_lm_finetuning.py`](./examples/lm_finetuning/simple_lm_finetuning.py) - Show how to fine-tune an instance of `BertForPreTraining` on a target text corpus.

- One example on how to use **OpenAI GPT** (in the [`examples` folder](./examples)):
  - [`run_openai_gpt.py`](./examples/run_openai_gpt.py) - Show how to fine-tune an instance of `OpenAIGPTDoubleHeadsModel` on the RocStories task.

- One example on how to use **Transformer-XL** (in the [`examples` folder](./examples)):
  - [`run_transfo_xl.py`](./examples/run_transfo_xl.py) - Show how to load and evaluate a pre-trained `TransfoXLLMHeadModel` on WikiText 103.

- One example on how to use **OpenAI GPT-2** in the unconditional and interactive mode (in the [`examples` folder](./examples)):
  - [`run_gpt2.py`](./examples/run_gpt2.py) - Show how to use an instance of `GPT2LMHeadModel` to generate text (same as the original OpenAI GPT-2 examples).

  These examples are detailed in the [Examples](#examples) section of this readme.

- Three notebooks that were used to check that the TensorFlow and PyTorch models behave identically (in the [`notebooks` folder](./notebooks)):
  - [`Comparing-TF-and-PT-models.ipynb`](./notebooks/Comparing-TF-and-PT-models.ipynb) - Compare the hidden states predicted by `BertModel`,
  - [`Comparing-TF-and-PT-models-SQuAD.ipynb`](./notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb) - Compare the spans predicted by `BertForQuestionAnswering` instances,
  - [`Comparing-TF-and-PT-models-MLM-NSP.ipynb`](./notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb) - Compare the predictions of `BertForPreTraining` instances.

  These notebooks are detailed in the [Notebooks](#notebooks) section of this readme.

- A command-line interface to convert TensorFlow checkpoints (BERT, Transformer-XL) or NumPy checkpoints (OpenAI GPT) into a PyTorch save of the associated PyTorch model:

  This CLI is detailed in the [Command-line interface](#Command-line-interface) section of this readme.

## Usage

### BERT

Here is a quick-start example using the `BertTokenizer`, `BertModel` and `BertForMaskedLM` classes with Google AI's pre-trained `bert-base-uncased` model. See the [doc section](#doc) below for all the details on these classes.

First let's prepare a tokenized input with `BertTokenizer`

```python
import torch
from pytorch_transformers import BertTokenizer, BertModel, BertForMaskedLM

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenized input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
```

Let's see how to use `BertModel` to get hidden states

```python
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')

# Predict hidden states features for each layer
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, segments_tensors)
# We have one hidden state tensor for each of the 12 layers in model bert-base-uncased
assert len(encoded_layers) == 12
```

And how to use `BertForMaskedLM`

```python
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')

# Predict all tokens
with torch.no_grad():
    predictions = model(tokens_tensor, segments_tensors)

# confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'henson'
```

### OpenAI GPT

Here is a quick-start example using the `OpenAIGPTTokenizer`, `OpenAIGPTModel` and `OpenAIGPTLMHeadModel` classes with OpenAI's pre-trained model. See the [doc section](#doc) below for all the details on these classes.

First let's prepare a tokenized input with `OpenAIGPTTokenizer`

```python
import torch
from pytorch_transformers import OpenAIGPTTokenizer, OpenAIGPTModel, OpenAIGPTLMHeadModel

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')

# Tokenized input
text = "Who was Jim Henson ? Jim Henson was a puppeteer"
tokenized_text = tokenizer.tokenize(text)

# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
```

Let's see how to use `OpenAIGPTModel` to get hidden states

```python
# Load pre-trained model (weights)
model = OpenAIGPTModel.from_pretrained('openai-gpt')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')

# Predict hidden states features for each layer
with torch.no_grad():
    hidden_states = model(tokens_tensor)
```

And how to use `OpenAIGPTLMHeadModel`

```python
# Load pre-trained model (weights)
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')

# Predict all tokens
with torch.no_grad():
    predictions = model(tokens_tensor)

# get the predicted last token
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == '.</w>'
```

And how to use `OpenAIGPTDoubleHeadsModel`

```python
# Load pre-trained model (weights)
model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')
model.eval()

#  Prepare tokenized input
text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
tokenized_text1 = tokenizer.tokenize(text1)
tokenized_text2 = tokenizer.tokenize(text2)
indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
# Pad the two sequences to the same length so they can be batched together
# (padding comes after the `mc_token_ids` positions and, with a causal model,
# does not affect the hidden states used by the heads)
max_length = max(len(indexed_tokens1), len(indexed_tokens2))
indexed_tokens1 = indexed_tokens1 + [0] * (max_length - len(indexed_tokens1))
indexed_tokens2 = indexed_tokens2 + [0] * (max_length - len(indexed_tokens2))
tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])

# Predict the language modeling logits and the multiple choice logits
with torch.no_grad():
    lm_logits, multiple_choice_logits = model(tokens_tensor, mc_token_ids)
```

### Transformer-XL

Here is a quick-start example using the `TransfoXLTokenizer`, `TransfoXLModel` and `TransfoXLLMHeadModel` classes with the Transformer-XL model pre-trained on WikiText-103. See the [doc section](#doc) below for all the details on these classes.

First let's prepare a tokenized input with `TransfoXLTokenizer`

```python
import torch
from pytorch_transformers import TransfoXLTokenizer, TransfoXLModel, TransfoXLLMHeadModel

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary from wikitext 103)
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')

# Tokenized input
text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"
tokenized_text_1 = tokenizer.tokenize(text_1)
tokenized_text_2 = tokenizer.tokenize(text_2)

# Convert token to vocabulary indices
indexed_tokens_1 = tokenizer.convert_tokens_to_ids(tokenized_text_1)
indexed_tokens_2 = tokenizer.convert_tokens_to_ids(tokenized_text_2)

# Convert inputs to PyTorch tensors
tokens_tensor_1 = torch.tensor([indexed_tokens_1])
tokens_tensor_2 = torch.tensor([indexed_tokens_2])
```

Let's see how to use `TransfoXLModel` to get hidden states

```python
# Load pre-trained model (weights)
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')

with torch.no_grad():
    # Predict hidden states features for each layer
    hidden_states_1, mems_1 = model(tokens_tensor_1)
    # We can re-use the memory cells in a subsequent call to attend a longer context
    hidden_states_2, mems_2 = model(tokens_tensor_2, mems=mems_1)
```

And how to use `TransfoXLLMHeadModel`

```python
# Load pre-trained model (weights)
model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')

with torch.no_grad():
    # Predict all tokens
    predictions_1, mems_1 = model(tokens_tensor_1)
    # We can re-use the memory cells in a subsequent call to attend a longer context
    predictions_2, mems_2 = model(tokens_tensor_2, mems=mems_1)

# get the predicted last token
predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'who'
```

### OpenAI GPT-2

Here is a quick-start example using the `GPT2Tokenizer`, `GPT2Model` and `GPT2LMHeadModel` classes with OpenAI's pre-trained model. See the [doc section](#doc) below for all the details on these classes.

First let's prepare a tokenized input with `GPT2Tokenizer`

```python
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode some inputs
text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"
indexed_tokens_1 = tokenizer.encode(text_1)
indexed_tokens_2 = tokenizer.encode(text_2)

# Convert inputs to PyTorch tensors
tokens_tensor_1 = torch.tensor([indexed_tokens_1])
tokens_tensor_2 = torch.tensor([indexed_tokens_2])
```

Let's see how to use `GPT2Model` to get hidden states

```python
# Load pre-trained model (weights)
model = GPT2Model.from_pretrained('gpt2')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')

# Predict hidden states features for each layer
with torch.no_grad():
    hidden_states_1, past = model(tokens_tensor_1)
    # past can be used to reuse precomputed hidden states in subsequent predictions
    # (see beam-search examples in the run_gpt2.py example).
    hidden_states_2, past = model(tokens_tensor_2, past=past)
```

And how to use `GPT2LMHeadModel`

```python
# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')

# Predict all tokens
with torch.no_grad():
    predictions_1, past = model(tokens_tensor_1)
    # past can be used to reuse precomputed hidden states in subsequent predictions
    # (see beam-search examples in the run_gpt2.py example).
    predictions_2, past = model(tokens_tensor_2, past=past)

# get the predicted last token
predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
predicted_token = tokenizer.decode([predicted_index])
```

And how to use `GPT2DoubleHeadsModel`

```python
# Load pre-trained model (weights)
model = GPT2DoubleHeadsModel.from_pretrained('gpt2')
model.eval()

#  Prepare tokenized input
text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
tokenized_text1 = tokenizer.tokenize(text1)
tokenized_text2 = tokenizer.tokenize(text2)
indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
# Pad the two sequences to the same length so they can be batched together
# (padding comes after the `mc_token_ids` positions and, with a causal model,
# does not affect the hidden states used by the heads)
max_length = max(len(indexed_tokens1), len(indexed_tokens2))
indexed_tokens1 = indexed_tokens1 + [0] * (max_length - len(indexed_tokens1))
indexed_tokens2 = indexed_tokens2 + [0] * (max_length - len(indexed_tokens2))
tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])

# Predict the language modeling logits and the multiple choice logits
with torch.no_grad():
    lm_logits, multiple_choice_logits, past = model(tokens_tensor, mc_token_ids)
```

## Doc

Here is a detailed documentation of the classes in the package and how to use them:

| Sub-section | Description |
|-|-|
| [Loading pre-trained weights](#loading-google-ai-or-openai-pre-trained-weights-or-pytorch-dump) | How to load Google AI/OpenAI's pre-trained weight or a PyTorch saved instance |
| [Serialization best-practices](#serialization-best-practices) | How to save and reload a fine-tuned model |
| [Configurations](#configurations) | API of the configuration classes for BERT, GPT, GPT-2 and Transformer-XL |
| [Models](#models) | API of the PyTorch model classes for BERT, GPT, GPT-2 and Transformer-XL |
| [Tokenizers](#tokenizers) | API of the tokenizers class for BERT, GPT, GPT-2 and Transformer-XL|
| [Optimizers](#optimizers) | API of the optimizers |

### Loading Google AI or OpenAI pre-trained weights or PyTorch dump

#### `from_pretrained()` method

To load one of Google AI's or OpenAI's pre-trained models, or a PyTorch model saved with `torch.save()` (e.g. an instance of `BertForPreTraining`), the PyTorch model classes and the tokenizers can be instantiated using the `from_pretrained()` method:

```python
model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH, cache_dir=None, from_tf=False, state_dict=None, *inputs, **kwargs)
```

where

- `BERT_CLASS` is either a tokenizer to load the vocabulary (`BertTokenizer` or `OpenAIGPTTokenizer` classes) or one of the eight BERT or three OpenAI GPT PyTorch model classes (to load the pre-trained weights): `BertModel`, `BertForMaskedLM`, `BertForNextSentencePrediction`, `BertForPreTraining`, `BertForSequenceClassification`, `BertForTokenClassification`, `BertForMultipleChoice`, `BertForQuestionAnswering`, `OpenAIGPTModel`, `OpenAIGPTLMHeadModel` or `OpenAIGPTDoubleHeadsModel`, and
- `PRE_TRAINED_MODEL_NAME_OR_PATH` is either:

  - the shortcut name of a Google AI's or OpenAI's pre-trained model selected in the list:

    - `bert-base-uncased`: 12-layer, 768-hidden, 12-heads, 110M parameters
    - `bert-large-uncased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
    - `bert-base-cased`: 12-layer, 768-hidden, 12-heads, 110M parameters
    - `bert-large-cased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
    - `bert-base-multilingual-uncased`: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
    - `bert-base-multilingual-cased`: **(New, recommended)** 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
    - `bert-base-chinese`: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
    - `bert-base-german-cased`: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters [Performance Evaluation](https://deepset.ai/german-bert)
    - `bert-large-uncased-whole-word-masking`: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the tokens corresponding to a word at once)
    - `bert-large-cased-whole-word-masking`: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the tokens corresponding to a word at once)
    - `bert-large-uncased-whole-word-masking-finetuned-squad`: The `bert-large-uncased-whole-word-masking` model finetuned on SQuAD (using the `run_bert_squad.py` examples). Results: *exact_match: 86.91579943235573, f1: 93.1532499015869*
    - `openai-gpt`: OpenAI GPT English model, 12-layer, 768-hidden, 12-heads, 110M parameters
    - `gpt2`: OpenAI GPT-2 English model, 12-layer, 768-hidden, 12-heads, 117M parameters
    - `gpt2-medium`: OpenAI GPT-2 English model, 24-layer, 1024-hidden, 16-heads, 345M parameters
    - `transfo-xl-wt103`: Transformer-XL English model trained on wikitext-103, 18-layer, 1024-hidden, 16-heads, 257M parameters

  - a path or url to a pretrained model archive containing:

    - `bert_config.json` or `openai_gpt_config.json`: a configuration file for the model, and
    - `pytorch_model.bin`: a PyTorch dump of a pre-trained instance of `BertForPreTraining`, `OpenAIGPTModel`, `TransfoXLModel` or `GPT2LMHeadModel` (saved with the usual `torch.save()`)

  If `PRE_TRAINED_MODEL_NAME_OR_PATH` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links [here](pytorch_transformers/modeling.py)) and stored in a cache folder to avoid future downloads (see the [Cache directory](#cache-directory) section below).

- `cache_dir` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example `cache_dir='./pretrained_model_{}'.format(args.local_rank)` (see the section on distributed training for more information).
- `from_tf`: whether to load the weights from a locally saved TensorFlow checkpoint
- `state_dict`: an optional state dictionary (a `collections.OrderedDict` object) to use instead of Google's pre-trained weights
- `*inputs`, `**kwargs`: additional inputs for the specific model class (e.g. `num_labels` for `BertForSequenceClassification`)
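
Putting these arguments together, here is a small sketch (the cache path and number of labels are illustrative values, not fixed names):

```python
from pytorch_transformers import BertForSequenceClassification

# Download (or read from cache) the pre-trained weights into a custom cache
# directory and forward a model-specific kwarg to the classification head
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    cache_dir='./my_pretrained_cache',  # illustrative path
    num_labels=2)
```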

`Uncased` means that the text has been lowercased before WordPiece tokenization, e.g., `John Smith` becomes `john smith`. The Uncased model also strips out any accent markers. `Cased` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the [Multilingual README](https://github.com/google-research/bert/blob/master/multilingual.md) or the original TensorFlow repository.

**When using an `uncased` model, make sure to pass `--do_lower_case` to the example training scripts (or pass `do_lower_case=True` to `FullTokenizer` if you're using your own script and loading the tokenizer yourself).**

Examples:

```python
# BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, do_basic_tokenize=True)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# OpenAI GPT
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTModel.from_pretrained('openai-gpt')

# Transformer-XL
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')

# OpenAI GPT-2
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

```

#### Cache directory

`pytorch_transformers` saves the pretrained weights in a cache directory which is located at (in this order of priority):

- the optional `cache_dir` argument to the `from_pretrained()` method (see above),
- shell environment variable `PYTORCH_PRETRAINED_BERT_CACHE`,
- PyTorch cache home + `/pytorch_transformers/`
  where the PyTorch cache home is defined by (in this order):
  - shell environment variable `TORCH_HOME`,
  - shell environment variable `XDG_CACHE_HOME` + `/torch/`,
  - default: `~/.cache/torch/`
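
For example, here is a minimal sketch of redirecting the cache to a scratch disk before running a script (the path and script name are illustrative):

```bash
# Any script that calls from_pretrained() will now cache weights here
export PYTORCH_PRETRAINED_BERT_CACHE=/mnt/scratch/pytorch_transformers
python my_training_script.py
```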

Usually, if you don't set any specific environment variable, the `pytorch_transformers` cache will be at `~/.cache/torch/pytorch_transformers/`.

You can always safely delete the `pytorch_transformers` cache, but the pretrained model weights and vocabulary files will then have to be re-downloaded from our S3.

### Serialization best-practices

This section explains how you can save and re-load a fine-tuned model (BERT, GPT, GPT-2 and Transformer-XL).
There are three types of files you need to save to be able to reload a fine-tuned model:

- the model itself, which should be saved following PyTorch serialization [best practices](https://pytorch.org/docs/stable/notes/serialization.html#best-practices),
- the configuration file of the model which is saved as a JSON file, and
- the vocabulary (and the merges for the BPE-based models GPT and GPT-2).

The *default filenames* of these files are as follows:

- the model weights file: `pytorch_model.bin`,
- the configuration file: `config.json`,
- the vocabulary file: `vocab.txt` for BERT and Transformer-XL, `vocab.json` for GPT/GPT-2 (BPE vocabulary),
- for GPT/GPT-2 (BPE vocabulary) the additional merges file: `merges.txt`.

**If you save a model using these *default filenames*, you can then re-load the model and tokenizer using the `from_pretrained()` method.**

Here is the recommended way of saving the model, configuration and vocabulary to an `output_dir` directory and reloading the model and tokenizer afterwards:

```python
import os
import torch

from pytorch_transformers import WEIGHTS_NAME, CONFIG_NAME

output_dir = "./models/"

# Step 1: Save a model, configuration and vocabulary that you have fine-tuned

# If we have a distributed model, save only the encapsulated model
# (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
model_to_save = model.module if hasattr(model, 'module') else model

# If we save using the predefined names, we can load using `from_pretrained`
output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(output_dir, CONFIG_NAME)

torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(output_dir)

# Step 2: Re-load the saved model and vocabulary

# Example for a Bert model
model = BertForQuestionAnswering.from_pretrained(output_dir)
tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=args.do_lower_case)  # Add specific options if needed
# Example for a GPT model
model = OpenAIGPTDoubleHeadsModel.from_pretrained(output_dir)
tokenizer = OpenAIGPTTokenizer.from_pretrained(output_dir)
```

Here is another way you can save and reload the model if you want to use specific paths for each type of files:

```python
output_model_file = "./models/my_own_model_file.bin"
output_config_file = "./models/my_own_config_file.bin"
output_vocab_file = "./models/my_own_vocab_file.bin"

# Step 1: Save a model, configuration and vocabulary that you have fine-tuned

# If we have a distributed model, save only the encapsulated model
# (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
model_to_save = model.module if hasattr(model, 'module') else model

torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(output_vocab_file)

# Step 2: Re-load the saved model and vocabulary

# We didn't save using the predefined WEIGHTS_NAME, CONFIG_NAME names, we cannot load using `from_pretrained`.
# Here is how to do it in this situation:

# Example for a Bert model
config = BertConfig.from_json_file(output_config_file)
model = BertForQuestionAnswering(config)
state_dict = torch.load(output_model_file)
model.load_state_dict(state_dict)
tokenizer = BertTokenizer(output_vocab_file, do_lower_case=args.do_lower_case)

# Example for a GPT model
config = OpenAIGPTConfig.from_json_file(output_config_file)
model = OpenAIGPTDoubleHeadsModel(config)
state_dict = torch.load(output_model_file)
model.load_state_dict(state_dict)
tokenizer = OpenAIGPTTokenizer(output_vocab_file)
```

### Configurations

Models (BERT, GPT, GPT-2 and Transformer-XL) are defined and built from configuration classes which contain the parameters of the models (number of layers, dimensionalities, ...) and a few utilities to read and write from JSON configuration files. The respective configuration classes are:

- `BertConfig` for `BertModel` and BERT classes instances.
- `OpenAIGPTConfig` for `OpenAIGPTModel` and OpenAI GPT classes instances.
- `GPT2Config` for `GPT2Model` and OpenAI GPT-2 classes instances.
- `TransfoXLConfig` for `TransfoXLModel` and Transformer-XL classes instances.

These configuration classes contain a few utilities to load and save configurations:

- `from_dict(cls, json_object)`: A class method to construct a configuration from a Python dictionary of parameters. Returns an instance of the configuration class.
- `from_json_file(cls, json_file)`: A class method to construct a configuration from a JSON file of parameters. Returns an instance of the configuration class.
- `to_dict()`: Serializes an instance to a Python dictionary. Returns a dictionary.
- `to_json_string()`: Serializes an instance to a JSON string. Returns a string.
- `to_json_file(json_file_path)`: Saves an instance to a JSON file.
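
As a quick illustration, here is a minimal sketch of round-tripping a configuration through JSON (the file path is illustrative, and the hyper-parameters shown are BERT-base values):

```python
from pytorch_transformers import BertConfig

# Build a configuration from scratch with BERT-base hyper-parameters
config = BertConfig(vocab_size_or_config_json_file=30522, hidden_size=768,
                    num_hidden_layers=12, num_attention_heads=12,
                    intermediate_size=3072)

# Serialize it to disk and reload it
config.to_json_file('./bert_config.json')  # illustrative path
config = BertConfig.from_json_file('./bert_config.json')
print(config.to_json_string())
```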

### Models

#### 1. `BertModel`

`BertModel` is the basic BERT Transformer model with a layer of summed token, position and sequence embeddings followed by a series of identical self-attention blocks (12 for BERT-base, 24 for BERT-large).

*Instantiation:* the model can be instantiated with the following arguments:

- `config`: a `BertConfig` class instance with the configuration to build a new model.
- `output_attentions`: If True, also output attention weights computed by the model at each layer. Default: False
- `keep_multihead_output`: If True, saves the output of the multi-head attention module with its gradient. This can be used to compute head importance metrics. Default: False

The inputs and outputs are **identical to the TensorFlow model inputs and outputs**.

We detail them here (see also [`modeling.py`](./pytorch_transformers/modeling.py)). This model takes as *inputs*:

- `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary (see the tokens preprocessing logic in the scripts [`run_bert_extract_features.py`](./examples/run_bert_extract_features.py), [`run_bert_classifier.py`](./examples/run_bert_classifier.py) and [`run_bert_squad.py`](./examples/run_bert_squad.py)), and
- `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to a `sentence B` token (see BERT paper for more details).
- `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It's a mask to be used if some input sequence lengths are smaller than the max input sequence length of the current batch. It's the mask that we typically use for attention when a batch has varying length sentences.
- `output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`.
- `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. It's a mask to be used to nullify some heads of the transformer. 0.0 => head is fully masked, 1.0 => head is not masked.

This model *outputs* a tuple composed of:

- `encoded_layers`: controlled by the value of the `output_all_encoded_layers` argument:

  - `output_all_encoded_layers=True`: outputs a list of the encoded-hidden-states at the end of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
  - `output_all_encoded_layers=False`: outputs only the encoded-hidden-states corresponding to the last attention block, i.e. a single torch.FloatTensor of size [batch_size, sequence_length, hidden_size],

- `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a classifier pretrained on top of the hidden state associated with the first token of the input (`[CLS]`) to train on the Next-Sentence task (see BERT's paper).
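
As a short, self-contained sketch of these outputs (shapes shown for `bert-base-uncased`; the input sentence is illustrative):

```python
import torch
from pytorch_transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("[CLS] a short example [SEP]")
tokens_tensor = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
with torch.no_grad():
    # With output_all_encoded_layers=False, `encoded_layers` is a single
    # tensor of shape [batch_size, sequence_length, hidden_size]
    encoded_layers, pooled_output = model(tokens_tensor,
                                          output_all_encoded_layers=False)
assert encoded_layers.shape[-1] == 768  # hidden_size of bert-base-uncased
```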

An example on how to use this class is given in the [`run_bert_extract_features.py`](./examples/run_bert_extract_features.py) script which can be used to extract the hidden states of the model for a given input.

#### 2. `BertForPreTraining`

`BertForPreTraining` includes the `BertModel` Transformer followed by the two pre-training heads:

- the masked language modeling head, and
- the next sentence classification head.

*Inputs* comprise the inputs of the [`BertModel`](#-1.-`BertModel`) class plus two optional labels:

- `masked_lm_labels`: masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size]
- `next_sentence_label`: next sentence classification loss: torch.LongTensor of shape [batch_size] with indices selected in [0, 1]. 0 => next sentence is the continuation, 1 => next sentence is a random sentence.

*Outputs*:

- if `masked_lm_labels` and `next_sentence_label` are not `None`: Outputs the total_loss which is the sum of the masked language modeling loss and the next sentence classification loss.
- if `masked_lm_labels` or `next_sentence_label` is `None`: Outputs a tuple comprising
Thomas Wolf's avatar
Thomas Wolf committed
765

thomwolf's avatar
thomwolf committed
766
767
  - the masked language modeling logits, and
  - the next sentence classification logits.
Joel Grus's avatar
Joel Grus committed
768

tholor's avatar
tholor committed
769
770
An example on how to use this class is given in the [`run_lm_finetuning.py`](./examples/run_lm_finetuning.py) script which can be used to fine-tune the BERT language model on your specific different text corpus. This should improve model performance, if the language style is different from the original BERT training corpus (Wiki + BookCorpus).

#### 3. `BertForMaskedLM`

`BertForMaskedLM` includes the `BertModel` Transformer followed by the (possibly) pre-trained masked language modeling head.

*Inputs* comprise the inputs of the [`BertModel`](#-1.-`BertModel`) class plus an optional label:

- `masked_lm_labels`: masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size]

*Outputs*:

- if `masked_lm_labels` is not `None`: Outputs the masked language modeling loss.
- if `masked_lm_labels` is `None`: Outputs the masked language modeling logits.

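A sketch of the two modes (the token ids below are arbitrary placeholders, not real vocabulary entries):

```python
import torch
from pytorch_transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

input_ids = torch.tensor([[101, 2023, 103, 2003, 102]])    # 103 stands in for [MASK]
masked_lm_labels = torch.tensor([[-1, -1, 2742, -1, -1]])  # loss is computed on the masked position only

loss = model(input_ids, masked_lm_labels=masked_lm_labels)  # with labels: scalar loss
predictions = model(input_ids)                              # without labels: [1, 5, vocab_size] logits
```
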
#### 4. `BertForNextSentencePrediction`

`BertForNextSentencePrediction` includes the `BertModel` Transformer followed by the next sentence classification head.

*Inputs* comprise the inputs of the [`BertModel`](#-1.-`BertModel`) class plus an optional label:

- `next_sentence_label`: next sentence classification loss: torch.LongTensor of shape [batch_size] with indices selected in [0, 1]. 0 => next sentence is the continuation, 1 => next sentence is a random sentence.

*Outputs*:

- if `next_sentence_label` is not `None`: Outputs the next sentence classification loss.
- if `next_sentence_label` is `None`: Outputs the next sentence classification logits.

#### 5. `BertForSequenceClassification`

`BertForSequenceClassification` is a fine-tuning model that includes `BertModel` and a sequence-level (sequence or pair of sequences) classifier on top of the `BertModel`.

The sequence-level classifier is a linear layer that takes as input the last hidden state of the first token (`[CLS]`) in the input sequence (see Figures 3a and 3b in the BERT paper).

An example on how to use this class is given in the [`run_bert_classifier.py`](./examples/run_bert_classifier.py) script which can be used to fine-tune a single sequence (or pair of sequence) classifier using BERT, for example for the MRPC task.
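
A minimal sketch (toy token ids; `num_labels` is an assumption passed through `from_pretrained`):

```python
import torch
from pytorch_transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.eval()

input_ids = torch.tensor([[101, 7592, 2088, 102]])  # toy token ids
labels = torch.tensor([1])

loss = model(input_ids, labels=labels)  # with labels: returns the classification loss
logits = model(input_ids)               # without labels: returns [batch_size, num_labels] logits
```
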
#### 6. `BertForMultipleChoice`

`BertForMultipleChoice` is a fine-tuning model that includes `BertModel` and a linear layer on top of the `BertModel`.

The linear layer outputs a single value for each choice of a multiple choice problem, then all the outputs corresponding to an instance are passed through a softmax to get the model's choice.

This implementation is largely inspired by the work of OpenAI in [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) and the answer of Jacob Devlin in the following [issue](https://github.com/google-research/bert/issues/38).

An example on how to use this class is given in the [`run_swag.py`](./examples/run_swag.py) script which can be used to fine-tune a multiple choice classifier using BERT, for example for the Swag task.

#### 7. `BertForTokenClassification`

`BertForTokenClassification` is a fine-tuning model that includes `BertModel` and a token-level classifier on top of the `BertModel`.

The token-level classifier is a linear layer that takes as input the full sequence of last hidden states.

#### 8. `BertForQuestionAnswering`

`BertForQuestionAnswering` is a fine-tuning model that includes `BertModel` with a token-level classifier on top of the full sequence of last hidden states.

The token-level classifier takes as input the full sequence of the last hidden states and computes several (e.g. two) scores for each token that can, for example, respectively be the score that a given token is a `start_span` and an `end_span` token (see Figures 3c and 3d in the BERT paper).

An example on how to use this class is given in the [`run_bert_squad.py`](./examples/run_bert_squad.py) script which can be used to fine-tune a token classifier using BERT, for example for the SQuAD task.
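
A quick sketch (toy token ids):

```python
import torch
from pytorch_transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
model.eval()

input_ids = torch.tensor([[101, 2040, 2001, 3958, 102]])  # toy token ids

start_logits, end_logits = model(input_ids)  # each of shape [batch_size, sequence_length]
start, end = start_logits.argmax(-1), end_logits.argmax(-1)  # most likely answer span boundaries
```
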
#### 9. `OpenAIGPTModel`

`OpenAIGPTModel` is the basic OpenAI GPT Transformer model with a layer of summed token and position embeddings followed by a series of 12 identical self-attention blocks.

OpenAI GPT uses a single embedding matrix to store the word and special embeddings.
Special token embeddings are additional embeddings for tokens that are not pre-trained: `[SEP]`, `[CLS]`...
Special tokens need to be trained during fine-tuning if you use them.
The number of special embeddings can be controlled using the `set_num_special_tokens(num_special_tokens)` function.

The embeddings are ordered as follows in the token embedding matrix:

```python
    [0,                                                         ----------------------
      ...                                                        -> word embeddings
      config.vocab_size - 1,                                     ______________________
      config.vocab_size,
      ...                                                        -> special embeddings
      config.vocab_size + config.n_special - 1]                  ______________________
```

where `total_tokens_embeddings` can be obtained as `config.total_tokens_embeddings` and is:
    `total_tokens_embeddings = config.vocab_size + config.n_special`

You should use the associated indices to index the embeddings.

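A sketch of adding such special tokens for fine-tuning (the token strings are hypothetical; the pattern follows the [`run_openai_gpt.py`](./examples/run_openai_gpt.py) example):

```python
from pytorch_transformers import OpenAIGPTTokenizer, OpenAIGPTModel

# hypothetical delimiters, appended after the pre-trained vocabulary
special_tokens = ['_start_', '_delimiter_', '_classify_']
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt', special_tokens=special_tokens)
model = OpenAIGPTModel.from_pretrained('openai-gpt')
model.set_num_special_tokens(len(special_tokens))  # resize the embedding matrix to vocab_size + n_special rows

# the new ids fall in [config.vocab_size, config.vocab_size + config.n_special - 1]
print(tokenizer.convert_tokens_to_ids(special_tokens))
```
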
The model can be instantiated with the following arguments:

- `config`: an `OpenAIGPTConfig` class instance with the configuration to build a new model.
- `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
- `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient. This can be used to compute head importance metrics. Default: False

The inputs and outputs are **identical to the TensorFlow model inputs and outputs**.

We detail them here (see [`modeling_openai.py`](./pytorch_transformers/modeling_openai.py)). This model takes as *inputs*:

- `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] (or more generally [d_1, ..., d_n, sequence_length] where d_1 ... d_n are arbitrary dimensions) with the word BPE token indices selected in the range [0, total_tokens_embeddings[
- `position_ids`: an optional torch.LongTensor with the same shape as input_ids with the position indices (selected in the range [0, config.n_positions - 1[).
- `token_type_ids`: an optional torch.LongTensor with the same shape as input_ids. You can use it to add a third type of embedding to each input token in the sequence (the previous two being the word and position embeddings). The input, position and token_type embeddings are summed inside the Transformer before the first self-attention block.
- `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with values between 0 and 1. It's a mask to be used to nullify some heads of the transformer. 0.0 => head is fully masked, 1.0 => head is not masked.

This model *outputs*:

- `hidden_states`: a list of all the encoded-hidden-states in the model (length of the list: number of layers + 1 for the output of the embeddings) as torch.FloatTensor of size [batch_size, sequence_length, hidden_size] (or more generally [d_1, ..., d_n, hidden_size] where d_1 ... d_n are the dimensions of input_ids)

#### 10. `OpenAIGPTLMHeadModel`

`OpenAIGPTLMHeadModel` includes the `OpenAIGPTModel` Transformer followed by a language modeling head with weights tied to the input embeddings (no additional parameters).

*Inputs* are the same as the inputs of the [`OpenAIGPTModel`](#-9.-`OpenAIGPTModel`) class plus an optional label:

- `lm_labels`: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].

*Outputs*:

- if `lm_labels` is not `None`:
  Outputs the language modeling loss.
- else:
  Outputs `lm_logits`: the language modeling logits as a torch.FloatTensor of size [batch_size, sequence_length, total_tokens_embeddings] (or more generally [d_1, ..., d_n, total_tokens_embeddings] where d_1 ... d_n are the dimensions of input_ids)

#### 11. `OpenAIGPTDoubleHeadsModel`

`OpenAIGPTDoubleHeadsModel` includes the `OpenAIGPTModel` Transformer followed by two heads:

- a language modeling head with weights tied to the input embeddings (no additional parameters), and
- a multiple choice classifier (linear layer that takes as input a hidden state in a sequence to compute a score, see details in paper).

*Inputs* are the same as the inputs of the [`OpenAIGPTModel`](#-9.-`OpenAIGPTModel`) class plus a classification mask and two optional labels:

- `multiple_choice_token_ids`: a torch.LongTensor of shape [batch_size, num_choices] with the index of the token whose hidden state should be used as input for the multiple choice classifier (usually the [CLS] token for each choice).
- `lm_labels`: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].
- `multiple_choice_labels`: optional multiple choice labels: torch.LongTensor of shape [batch_size] with indices selected in [0, ..., num_choices].

*Outputs*:

- if `lm_labels` and `multiple_choice_labels` are not `None`:
  Outputs a tuple of losses with the language modeling loss and the multiple choice loss.
- else: outputs a tuple with:
  - `lm_logits`: the language modeling logits as a torch.FloatTensor of size [batch_size, num_choices, sequence_length, total_tokens_embeddings]
  - `multiple_choice_logits`: the multiple choice logits as a torch.FloatTensor of size [batch_size, num_choices]

#### 12. `TransfoXLModel`

The Transformer-XL model is described in "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context".

Transformer-XL uses relative positioning with sinusoidal patterns and adaptive softmax inputs, which means that:

- you don't need to specify position embedding indices, and
- the tokens in the vocabulary have to be sorted by decreasing frequency.

This model takes as *inputs* (see [`modeling_transfo_xl.py`](./pytorch_transformers/modeling_transfo_xl.py)):

- `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the token indices selected in the range [0, self.config.n_token[
- `mems`: an optional memory of hidden states from previous forward passes as a list (num layers) of hidden states at the entry of each layer. Each hidden state has shape [self.config.mem_len, bsz, self.config.d_model]. Note that the first two dimensions are transposed in `mems` with regard to `input_ids`.

This model *outputs* a tuple of (last_hidden_state, new_mems):

- `last_hidden_state`: the encoded-hidden-states at the top of the model as a torch.FloatTensor of size [batch_size, sequence_length, self.config.d_model]
- `new_mems`: list (num layers) of updated mem states at the entry of each layer; each mem state is a torch.FloatTensor of size [self.config.mem_len, batch_size, self.config.d_model]. Note that the first two dimensions are transposed in `mems` with regard to `input_ids`.

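To carry the memory over from one segment to the next, feed `new_mems` back in as `mems` (a minimal sketch; the token ids are placeholders):

```python
import torch
from pytorch_transformers import TransfoXLModel

model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
model.eval()

segment_1 = torch.tensor([[0, 1, 2, 3]])  # toy token ids
segment_2 = torch.tensor([[4, 5, 6, 7]])

with torch.no_grad():
    hidden_1, mems = model(segment_1)             # no memory on the first segment
    hidden_2, mems = model(segment_2, mems=mems)  # second segment attends to the cached states
```
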
##### Extracting a list of the hidden states at each layer of the Transformer-XL from `last_hidden_state` and `new_mems`

The `new_mems` contain all the hidden states PLUS the output of the embeddings (`new_mems[0]`). `new_mems[-1]` is the output of the hidden state of the layer below the last layer and `last_hidden_state` is the output of the last layer (i.e. the input of the softmax when we have a language modeling head on top).

There are two differences between the shapes of `new_mems` and `last_hidden_state`: `new_mems` have transposed first dimensions and are longer (of size `self.config.mem_len`). Here is how to extract the full list of hidden states from the model output:

```python
hidden_states, mems = model(tokens_tensor)
seq_length = hidden_states.size(1)
lower_hidden_states = list(t[-seq_length:, ...].transpose(0, 1) for t in mems)
all_hidden_states = lower_hidden_states + [hidden_states]
```

#### 13. `TransfoXLLMHeadModel`

`TransfoXLLMHeadModel` includes the `TransfoXLModel` Transformer followed by an (adaptive) softmax head with weights tied to the input embeddings.

*Inputs* are the same as the inputs of the [`TransfoXLModel`](#-12.-`TransfoXLModel`) class plus an optional label:

- `labels`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the label token indices selected in the range [0, self.config.n_token[

*Outputs* a tuple of (softmax_output, new_mems):

- `softmax_output`: output of the (adaptive) softmax:
  - if `labels` is `None`: log probabilities of tokens, shape [batch_size, sequence_length, n_tokens]
  - else: negative log likelihood of the label tokens, shape [batch_size, sequence_length]
- `new_mems`: list (num layers) of updated mem states at the entry of each layer; each mem state is a torch.FloatTensor of size [self.config.mem_len, batch_size, self.config.d_model]. Note that the first two dimensions are transposed in `mems` with regard to `input_ids`.

#### 14. `GPT2Model`

`GPT2Model` is the OpenAI GPT-2 Transformer model with a layer of summed token and position embeddings followed by a series of 12 identical self-attention blocks.

The model can be instantiated with the following arguments:

- `config`: a `GPT2Config` class instance with the configuration to build a new model.
- `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
- `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient. This can be used to compute head importance metrics. Default: False

The inputs and outputs are **identical to the TensorFlow model inputs and outputs**.

We detail them here (see [`modeling_gpt2.py`](./pytorch_transformers/modeling_gpt2.py)). This model takes as *inputs*:

- `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] (or more generally [d_1, ..., d_n, sequence_length] where d_1 ... d_n are arbitrary dimensions) with the word BPE token indices selected in the range [0, vocab_size[
- `position_ids`: an optional torch.LongTensor with the same shape as input_ids with the position indices (selected in the range [0, config.n_positions - 1[).
- `token_type_ids`: an optional torch.LongTensor with the same shape as input_ids. You can use it to add a third type of embedding to each input token in the sequence (the previous two being the word and position embeddings). The input, position and token_type embeddings are summed inside the Transformer before the first self-attention block.
- `past`: an optional list of torch.LongTensor that contains pre-computed hidden-states (key and values in the attention blocks) to speed up sequential decoding (this is the `presents` output of the model, cf. below).
- `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with values between 0 and 1. It's a mask to be used to nullify some heads of the transformer. 0.0 => head is fully masked, 1.0 => head is not masked.

This model *outputs*:

- `hidden_states`: a list of all the encoded-hidden-states in the model (length of the list: number of layers + 1 for the output of the embeddings) as torch.FloatTensor of size [batch_size, sequence_length, hidden_size] (or more generally [d_1, ..., d_n, hidden_size] where d_1 ... d_n are the dimensions of input_ids)
- `presents`: a list of pre-computed hidden-states (key and values in each attention block) as torch.FloatTensors. They can be reused to speed up sequential decoding (see the `run_gpt2.py` example).

#### 15. `GPT2LMHeadModel`

`GPT2LMHeadModel` includes the `GPT2Model` Transformer followed by a language modeling head with weights tied to the input embeddings (no additional parameters).

*Inputs* are the same as the inputs of the [`GPT2Model`](#-14.-`GPT2Model`) class plus an optional label:

- `lm_labels`: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].

*Outputs*:

- if `lm_labels` is not `None`:
  Outputs the language modeling loss.
- else: a tuple of
  - `lm_logits`: the language modeling logits as a torch.FloatTensor of size [batch_size, sequence_length, total_tokens_embeddings] (or more generally [d_1, ..., d_n, total_tokens_embeddings] where d_1 ... d_n are the dimensions of input_ids)
  - `presents`: a list of pre-computed hidden-states (key and values in each attention block) as torch.FloatTensors. They can be reused to speed up sequential decoding (see the `run_gpt2.py` example).

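The `past`/`presents` caching is what makes sequential decoding fast: at each step only the new token is fed through the model. A minimal greedy-decoding sketch (the prompt is arbitrary):

```python
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

generated = tokenizer.encode("The Manhattan bridge")
context = torch.tensor([generated])
past = None
with torch.no_grad():
    for _ in range(20):
        logits, past = model(context, past=past)  # reuse the cached keys/values
        token = logits[0, -1].argmax()
        generated.append(token.item())
        context = token.reshape(1, 1)             # feed only the newly generated token
print(tokenizer.decode(generated))
```
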
#### 16. `GPT2DoubleHeadsModel`

`GPT2DoubleHeadsModel` includes the `GPT2Model` Transformer followed by two heads:

- a language modeling head with weights tied to the input embeddings (no additional parameters), and
- a multiple choice classifier (linear layer that takes as input a hidden state in a sequence to compute a score, see details in paper).

*Inputs* are the same as the inputs of the [`GPT2Model`](#-14.-`GPT2Model`) class plus a classification mask and two optional labels:

- `multiple_choice_token_ids`: a torch.LongTensor of shape [batch_size, num_choices] with the index of the token whose hidden state should be used as input for the multiple choice classifier (usually the [CLS] token for each choice).
- `lm_labels`: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].
- `multiple_choice_labels`: optional multiple choice labels: torch.LongTensor of shape [batch_size] with indices selected in [0, ..., num_choices].

*Outputs*:

- if `lm_labels` and `multiple_choice_labels` are not `None`:
  Outputs a tuple of losses with the language modeling loss and the multiple choice loss.
- else: outputs a tuple with:
  - `lm_logits`: the language modeling logits as a torch.FloatTensor of size [batch_size, num_choices, sequence_length, total_tokens_embeddings]
  - `multiple_choice_logits`: the multiple choice logits as a torch.FloatTensor of size [batch_size, num_choices]
  - `presents`: a list of pre-computed hidden-states (key and values in each attention block) as torch.FloatTensors. They can be reused to speed up sequential decoding (see the `run_gpt2.py` example).

### Tokenizers

#### `BertTokenizer`

`BertTokenizer` performs end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.

This class has five arguments:

- `vocab_file`: path to a vocabulary file.
- `do_lower_case`: convert text to lower-case while tokenizing. **Default = True**.
- `max_len`: max length to filter the input of the Transformer. Defaults to the pre-trained value for the model if `None`. **Default = None**
- `do_basic_tokenize`: Do basic tokenization before WordPiece tokenization. Set to `False` if the text is pre-tokenized. **Default = True**.
- `never_split`: a list of tokens that should not be split during tokenization. **Default = `["[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"]`**

and four methods:

- `tokenize(text)`: convert a `str` into a list of `str` tokens by (1) performing basic tokenization and (2) WordPiece tokenization.
- `convert_tokens_to_ids(tokens)`: convert a list of `str` tokens into a list of `int` indices in the vocabulary.
- `convert_ids_to_tokens(tokens)`: convert a list of `int` indices into a list of `str` tokens in the vocabulary.
- `save_vocabulary(directory_path)`: save the vocabulary file to `directory_path`. Return the path to the saved vocabulary file: `vocab_file_path`. The vocabulary can be reloaded with `BertTokenizer.from_pretrained('vocab_file_path')` or `BertTokenizer.from_pretrained('directory_path')`.

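A quick round-trip sketch:

```python
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokens = tokenizer.tokenize("Hello, how are you?")  # WordPiece tokens
ids = tokenizer.convert_tokens_to_ids(tokens)       # vocabulary indices
assert tokenizer.convert_ids_to_tokens(ids) == tokens
```
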
Please refer to the doc strings and code in [`tokenization.py`](./pytorch_transformers/tokenization.py) for the details of the `BasicTokenizer` and `WordpieceTokenizer` classes. In general it is recommended to use `BertTokenizer` unless you know what you are doing.

#### `OpenAIGPTTokenizer`

`OpenAIGPTTokenizer` performs Byte-Pair-Encoding (BPE) tokenization.

This class has four arguments:

- `vocab_file`: path to a vocabulary file.
- `merges_file`: path to a file containing the BPE merges.
- `max_len`: max length to filter the input of the Transformer. Defaults to the pre-trained value for the model if `None`. **Default = None**
- `special_tokens`: a list of tokens to add to the vocabulary for fine-tuning. If SpaCy is not installed and BERT's `BasicTokenizer` is used as the pre-BPE tokenizer, these tokens are not split. **Default = None**

and seven methods:

- `tokenize(text)`: convert a `str` into a list of `str` tokens by performing BPE tokenization.
- `convert_tokens_to_ids(tokens)`: convert a list of `str` tokens into a list of `int` indices in the vocabulary.
- `convert_ids_to_tokens(tokens)`: convert a list of `int` indices into a list of `str` tokens in the vocabulary.
- `set_special_tokens(self, special_tokens)`: update the list of special tokens (see above arguments)
- `encode(text)`: convert a `str` into a list of `int` tokens by performing BPE encoding.
- `decode(ids, skip_special_tokens=False, clean_up_tokenization_spaces=False)`: decode a list of `int` indices into a string and do some post-processing if needed: (i) remove special tokens from the output and (ii) clean up tokenization spaces.
- `save_vocabulary(directory_path)`: save the vocabulary, merge and special tokens files to `directory_path`. Return the path to the three files: `vocab_file_path`, `merge_file_path`, `special_tokens_file_path`. The vocabulary can be reloaded with `OpenAIGPTTokenizer.from_pretrained('directory_path')`.

Please refer to the doc strings and code in [`tokenization_openai.py`](./pytorch_transformers/tokenization_openai.py) for the details of the `OpenAIGPTTokenizer`.

#### `TransfoXLTokenizer`

`TransfoXLTokenizer` performs word tokenization. This tokenizer can be used with the adaptive softmax and has utilities for counting tokens in a corpus to create a vocabulary ordered by token frequency. See the adaptive softmax paper ([Efficient softmax approximation for GPUs](http://arxiv.org/abs/1609.04309)) for more details.

The API is similar to the API of `BertTokenizer` (see above).

Please refer to the doc strings and code in [`tokenization_transfo_xl.py`](./pytorch_transformers/tokenization_transfo_xl.py) for the details of these additional methods in `TransfoXLTokenizer`.

#### `GPT2Tokenizer`

`GPT2Tokenizer` performs byte-level Byte-Pair-Encoding (BPE) tokenization.

This class has three arguments:

- `vocab_file`: path to a vocabulary file.
- `merges_file`: path to a file containing the BPE merges.
- `errors`: How to handle unicode decoding errors. **Default = `replace`**

and seven methods:

- `tokenize(text)`: convert a `str` into a list of `str` tokens by performing byte-level BPE.
- `convert_tokens_to_ids(tokens)`: convert a list of `str` tokens into a list of `int` indices in the vocabulary.
- `convert_ids_to_tokens(tokens)`: convert a list of `int` indices into a list of `str` tokens in the vocabulary.
- `set_special_tokens(self, special_tokens)`: update the list of special tokens (see above arguments)
- `encode(text)`: convert a `str` into a list of `int` tokens by performing byte-level BPE.
- `decode(tokens)`: convert a list of `int` tokens back into a `str`.
- `save_vocabulary(directory_path)`: save the vocabulary, merge and special tokens files to `directory_path`. Return the path to the three files: `vocab_file_path`, `merge_file_path`, `special_tokens_file_path`. The vocabulary can be reloaded with `GPT2Tokenizer.from_pretrained('directory_path')`.

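Because the BPE operates on bytes, `encode`/`decode` is a lossless round-trip on arbitrary text, e.g.:

```python
from pytorch_transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

ids = tokenizer.encode("Hello world!")  # byte-level BPE token ids
assert tokenizer.decode(ids) == "Hello world!"
```
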
Please refer to [`tokenization_gpt2.py`](./pytorch_transformers/tokenization_gpt2.py) for more details on the `GPT2Tokenizer`.
### Optimizers
#### `BertAdam`

`BertAdam` is a `torch.optimizer` adapted to be closer to the optimizer used in the TensorFlow implementation of BERT. The differences with the PyTorch Adam optimizer are the following:

- BertAdam implements the weight decay fix,
- BertAdam doesn't compensate for bias as in the regular Adam optimizer.

The optimizer accepts the following arguments:

- `lr` : learning rate
- `warmup` : portion of `t_total` for the warmup, `-1` means no warmup. Default : `-1`
- `t_total` : total number of training steps for the learning rate schedule, `-1` means constant learning rate. Default : `-1`
- `schedule` : schedule to use for the warmup (see above).
    Can be `'warmup_linear'`, `'warmup_constant'`, `'warmup_cosine'`, `'none'`, `None` or a `_LRSchedule` object (see below).
    If `None` or `'none'`, learning rate is always kept constant.
    Default : `'warmup_linear'`
- `b1` : Adam's b1. Default : `0.9`
- `b2` : Adam's b2. Default : `0.999`
- `e` : Adam's epsilon. Default : `1e-6`
- `weight_decay` : Weight decay. Default : `0.01`
- `max_grad_norm` : Maximum norm for the gradients (`-1` means no clipping). Default : `1.0`

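Here is a sketch of how the example scripts typically construct it (the parameter grouping mirrors [`run_bert_classifier.py`](./examples/run_bert_classifier.py); `model` and `num_train_steps` are assumed to be defined):

```python
from pytorch_transformers import BertAdam

# no weight decay for biases and LayerNorm weights
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0},
]
optimizer = BertAdam(grouped_parameters,
                     lr=2e-5,
                     warmup=0.1,  # linear warmup over the first 10% of steps
                     t_total=num_train_steps)
```
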
#### `OpenAIAdam`
`OpenAIAdam` is similar to `BertAdam`.
The difference with `BertAdam` is that `OpenAIAdam` compensates for bias as in the regular Adam optimizer.
`OpenAIAdam` accepts the same arguments as `BertAdam`.
#### Learning Rate Schedules
The `.optimization` module also provides additional schedules in the form of schedule objects that inherit from `_LRSchedule`.
All `_LRSchedule` subclasses accept `warmup` and `t_total` arguments at construction.
When an `_LRSchedule` object is passed into `BertAdam` or `OpenAIAdam`,
the `warmup` and `t_total` arguments on the optimizer are ignored and the ones in the `_LRSchedule` object are used.
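
For example (a sketch, assuming the schedule classes are importable from the `.optimization` module and `model` is defined):

```python
from pytorch_transformers.optimization import BertAdam, WarmupCosineSchedule

schedule = WarmupCosineSchedule(warmup=0.1, t_total=1000)
# warmup/t_total passed directly to BertAdam would be ignored in favor of the schedule's values
optimizer = BertAdam(model.parameters(), lr=2e-5, schedule=schedule)
```
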
An overview of the implemented schedules:
- `ConstantLR`: always returns learning rate 1.
- `WarmupConstantSchedule`: Linearly increases learning rate from 0 to 1 over `warmup` fraction of training steps.
    Keeps learning rate equal to 1. after warmup.
    ![](docs/source/imgs/warmup_constant_schedule.png)
- `WarmupLinearSchedule`: Linearly increases learning rate from 0 to 1 over `warmup` fraction of training steps.
    Linearly decreases learning rate from 1. to 0. over remaining `1 - warmup` steps.
    ![](docs/source/imgs/warmup_linear_schedule.png)
- `WarmupCosineSchedule`: Linearly increases learning rate from 0 to 1 over `warmup` fraction of training steps.
    Decreases learning rate from 1. to 0. over remaining `1 - warmup` steps following a cosine curve.
    If `cycles` (default=0.5) is different from default, learning rate follows cosine function after warmup.
    ![](docs/source/imgs/warmup_cosine_schedule.png)
- `WarmupCosineWithHardRestartsSchedule`: Linearly increases learning rate from 0 to 1 over `warmup` fraction of training steps.
    If `cycles` (default=1.) is different from default, learning rate follows `cycles` times a cosine decaying learning rate (with hard restarts).
    ![](docs/source/imgs/warmup_cosine_hard_restarts_schedule.png)
- `WarmupCosineWithWarmupRestartsSchedule`: All training progress is divided in `cycles` (default=1.) parts of equal length.
    Every part follows a schedule with the first `warmup` fraction of the training steps linearly increasing from 0. to 1.,
    followed by a learning rate decreasing from 1. to 0. following a cosine curve.
    Note that the total number of all warmup steps over all cycles together is equal to `warmup` * `cycles`.
    ![warmup cosine warm restarts schedule](docs/source/imgs/warmup_cosine_warm_restarts_schedule.png)
## Examples
| Sub-section | Description |
|-|-|
| [Training large models: introduction, tools and examples](#Training-large-models-introduction,-tools-and-examples) | How to use gradient-accumulation, multi-gpu training, distributed training, optimize on CPU and 16-bits training to train Bert models |
| [Fine-tuning with BERT: running the examples](#Fine-tuning-with-BERT-running-the-examples) | Running the examples in [`./examples`](./examples/): `extract_classif.py`, `run_bert_classifier.py`, `run_bert_squad.py` and `run_lm_finetuning.py` |
| [Fine-tuning with OpenAI GPT, Transformer-XL and GPT-2](#openai-gpt-transformer-xl-and-gpt-2-running-the-examples) | Running the examples in [`./examples`](./examples/): `run_openai_gpt.py`, `run_transfo_xl.py` and `run_gpt2.py` |
| [Fine-tuning BERT-large on GPUs](#Fine-tuning-BERT-large-on-GPUs) | How to fine tune `BERT large`|

### Training large models: introduction, tools and examples

BERT-base and BERT-large are respectively 110M and 340M parameter models and it can be difficult to fine-tune them on a single GPU with the recommended batch size for good performance (in most cases a batch size of 32).

To help with fine-tuning these models, we have included several techniques that you can activate in the fine-tuning scripts [`run_bert_classifier.py`](./examples/run_bert_classifier.py) and [`run_bert_squad.py`](./examples/run_bert_squad.py): gradient-accumulation, multi-gpu training, distributed training and 16-bits training. For more details on how to use these techniques you can read [the tips on training large batches in PyTorch](https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255) that I published earlier this month.

Here is how to use these techniques in our scripts:

- **Gradient Accumulation**: Gradient accumulation can be used by supplying an integer greater than 1 to the `--gradient_accumulation_steps` argument. The batch at each step will be divided by this integer and gradients will be accumulated over `gradient_accumulation_steps` steps.
- **Multi-GPU**: Multi-GPU is automatically activated when several GPUs are detected and the batches are split across the GPUs.
- **Distributed training**: Distributed training can be activated by supplying an integer greater or equal to 0 to the `--local_rank` argument (see below).
- **16-bits training**: 16-bits training, also called mixed-precision training, can reduce the memory requirement of your model on the GPU by using half-precision training, basically allowing you to double the batch size. If you have a recent GPU (starting from the NVIDIA Volta architecture) you should see no decrease in speed. A good introduction to mixed precision training can be found [here](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) and a full documentation is [here](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html). In our scripts, this option can be activated by setting the `--fp16` flag and you can play with loss scaling using the `--loss_scale` flag (see the previously linked documentation for details on loss scaling). The loss scale can be zero, in which case the scale is dynamically adjusted, or a positive power of two, in which case the scaling is static.

To use 16-bits training and distributed training, you need to install NVIDIA's apex extension [as detailed here](https://github.com/nvidia/apex). You will find more information regarding the internals of `apex` and how to use `apex` in [the doc and the associated repository](https://github.com/nvidia/apex). The results of the tests performed on pytorch-BERT by the NVIDIA team (and my trials at reproducing them) can be consulted in [the relevant PR of the present repository](https://github.com/huggingface/pytorch-transformers/pull/116).

Note: To use *Distributed Training*, you will need to run one training script on each of your machines. This can be done for example by running the following command on each server (see [the above-mentioned blog post](https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255) for more details):

```bash
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=$THIS_MACHINE_INDEX --master_addr="192.168.1.1" --master_port=1234 run_bert_classifier.py (--arg1 --arg2 --arg3 and all other arguments of the run_classifier script)
```

Where `$THIS_MACHINE_INDEX` is a sequential index assigned to each of your machines (0, 1, 2...) and the machine with rank 0 has an IP address `192.168.1.1` and an open port `1234`.

### Fine-tuning with BERT: running the examples

We showcase several fine-tuning examples based on (and extended from) [the original implementation](https://github.com/google-research/bert/):

- a *sequence-level classifier* on nine different GLUE tasks,
- a *token-level classifier* on the question answering dataset SQuAD,
- a *sequence-level multiple-choice classifier* on the SWAG classification corpus, and
- a *BERT language model* on another target corpus.

#### GLUE results on dev set

We get the following results on the dev set of the GLUE benchmark with an uncased BERT base model. All experiments were run on a P100 GPU with a batch size of 32.

| Task | Metric | Result |
|-|-|-|
| CoLA | Matthew's corr. | 57.29 |
| SST-2 | accuracy | 93.00 |
| MRPC | F1/accuracy | 88.85/83.82 |
| STS-B | Pearson/Spearman corr. | 89.70/89.37 |
| QQP | accuracy/F1 | 90.72/87.41 |
| MNLI | matched acc./mismatched acc.| 83.95/84.39 |
| QNLI | accuracy | 89.04 |
| RTE | accuracy | 61.01 |
| WNLI | accuracy | 53.52 |

Some of these results are significantly different from the ones reported on the test set
of the GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.

Before running any of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.

```shell
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC

python run_bert_classifier.py \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --bert_model bert-base-uncased \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/$TASK_NAME/
```

where the task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.

The dev set results will be present within the text file `eval_results.txt` in the specified `output_dir`. In case of MNLI, since there are two separate dev sets, matched and mismatched, there will be a separate output folder called `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.

The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI, CoLA and SST-2. The following section provides details on how to run half-precision training with MRPC. That being said, there shouldn't be any issues in running half-precision training with the remaining GLUE tasks either, since the data processor for each task inherits from the base class `DataProcessor`.

#### MRPC

This example code fine-tunes BERT on the Microsoft Research Paraphrase
Corpus (MRPC) and runs in less than 10 minutes on a single K-80 and in 27 seconds (!) on a single Tesla V100 16GB with apex installed.

Before running this example you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.

```shell
export GLUE_DIR=/path/to/glue

python run_bert_classifier.py \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/MRPC/ \
  --bert_model bert-base-uncased \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/
```

Our test, run on a few seeds with [the original implementation hyper-parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks), gave evaluation results between 84% and 88%.

**Fast run with apex and 16 bit precision: fine-tuning on MRPC in 27 seconds!**
First install apex as indicated [here](https://github.com/NVIDIA/apex).
Then run:

```shell
export GLUE_DIR=/path/to/glue

python run_bert_classifier.py \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/MRPC/ \
  --bert_model bert-base-uncased \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/ \
  --fp16
```

**Distributed training**
Here is an example using distributed training on 8 V100 GPUs and the BERT Whole Word Masking model to reach an F1 > 92 on MRPC:

```bash
python -m torch.distributed.launch --nproc_per_node 8 run_bert_classifier.py \
  --bert_model bert-large-uncased-whole-word-masking \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/MRPC/ \
  --max_seq_length 128 \
  --train_batch_size 8 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/
```

Training with these hyper-parameters gave us the following results:

```bash
  acc = 0.8823529411764706
  acc_and_f1 = 0.901702786377709
  eval_loss = 0.3418912578906332
  f1 = 0.9210526315789473
  global_step = 174
  loss = 0.07231863956341798
```

Here is an example on MNLI:

```bash
python -m torch.distributed.launch --nproc_per_node 8 run_bert_classifier.py \
  --bert_model bert-large-uncased-whole-word-masking \
  --task_name mnli \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir /datadrive/bert_data/glue_data//MNLI/ \
  --max_seq_length 128 \
  --train_batch_size 8 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir ../models/wwm-uncased-finetuned-mnli/ \
  --overwrite_output_dir
```

```bash
***** Eval results *****
  acc = 0.8679706601466992
  eval_loss = 0.4911287787382479
  global_step = 18408
  loss = 0.04755385363816904

***** Eval results *****
  acc = 0.8747965825874695
  eval_loss = 0.45516540421714036
  global_step = 18408
  loss = 0.04755385363816904
```

This is the model provided as `bert-large-uncased-whole-word-masking-finetuned-mnli`.

#### SQuAD

This example code fine-tunes BERT on the SQuAD dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large) on a single Tesla V100 16GB.

The data for SQuAD can be downloaded with the following links and should be saved in a `$SQUAD_DIR` directory.

- [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
- [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
- [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)

```shell
export SQUAD_DIR=/path/to/SQUAD

python run_bert_squad.py \
  --bert_model bert-base-uncased \
  --do_train \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/
```

Training with the previous hyper-parameters gave us the following results:

```bash
python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json /tmp/debug_squad/predictions.json
{"f1": 88.52381567990474, "exact_match": 81.22043519394512}
```

##### Distributed training

Here is an example using distributed training on 8 V100 GPUs and the BERT Whole Word Masking uncased model to reach an F1 > 93 on SQuAD:

```bash
python -m torch.distributed.launch --nproc_per_node=8 \
 run_bert_squad.py \
 --bert_model bert-large-uncased-whole-word-masking \
 --do_train \
 --do_predict \
 --do_lower_case \
 --train_file $SQUAD_DIR/train-v1.1.json \
 --predict_file $SQUAD_DIR/dev-v1.1.json \
 --learning_rate 3e-5 \
 --num_train_epochs 2 \
 --max_seq_length 384 \
 --doc_stride 128 \
 --output_dir ../models/wwm_uncased_finetuned_squad/ \
 --train_batch_size 24 \
 --gradient_accumulation_steps 12
```

Training with these hyper-parameters gave us the following results:

```bash
python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_uncased_finetuned_squad/predictions.json
{"exact_match": 86.91579943235573, "f1": 93.1532499015869}
```

This is the model provided as `bert-large-uncased-whole-word-masking-finetuned-squad`.

And here is the model provided as `bert-large-cased-whole-word-masking-finetuned-squad`:

```bash
python -m torch.distributed.launch --nproc_per_node=8 \
 run_bert_squad.py \
 --bert_model bert-large-cased-whole-word-masking \
 --do_train \
 --do_predict \
 --do_lower_case \
 --train_file $SQUAD_DIR/train-v1.1.json \
 --predict_file $SQUAD_DIR/dev-v1.1.json \
 --learning_rate 3e-5 \
 --num_train_epochs 2 \
 --max_seq_length 384 \
 --doc_stride 128 \
 --output_dir ../models/wwm_cased_finetuned_squad/ \
 --train_batch_size 24 \
 --gradient_accumulation_steps 12
```

Training with these hyper-parameters gave us the following results:

```bash
python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_cased_finetuned_squad/predictions.json
{"exact_match": 84.18164616840113, "f1": 91.58645594850135}
```

#### SWAG

The data for SWAG can be downloaded by cloning the following [repository](https://github.com/rowanz/swagaf).

```shell
export SWAG_DIR=/path/to/SWAG

python run_bert_swag.py \
  --bert_model bert-base-uncased \
  --do_train \
  --do_lower_case \
  --do_eval \
  --data_dir $SWAG_DIR/data \
  --train_batch_size 16 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --max_seq_length 80 \
  --output_dir /tmp/swag_output/ \
  --gradient_accumulation_steps 4
```

Training with the previous hyper-parameters on a single GPU gave us the following results:

```bash
eval_accuracy = 0.8062081375587323
eval_loss = 0.5966546792367169
global_step = 13788
loss = 0.06423990014260186
```

#### LM Fine-tuning

The data should be a text file in the same format as [sample_text.txt](./samples/sample_text.txt) (one sentence per line, documents separated by an empty line).
You can download an [example training corpus](https://ext-bert-sample.obs.eu-de.otc.t-systems.com/small_wiki_sentence_corpus.txt) generated from Wikipedia articles and split into ~500k sentences with spaCy.
Training one epoch on this corpus takes about 1:20h on 4 x NVIDIA Tesla P100 with `train_batch_size=200` and `max_seq_length=128`:

Thanks to the work of @Rocketknight1 and @tholor there are now **several scripts** that can be used to fine-tune BERT using the pretraining objective (combination of masked language modeling and next sentence prediction loss). These scripts are detailed in the [`README`](./examples/lm_finetuning/README.md) of the [`examples/lm_finetuning/`](./examples/lm_finetuning/) folder.

### OpenAI GPT, Transformer-XL and GPT-2: running the examples

We provide three examples of scripts for OpenAI GPT, Transformer-XL and OpenAI GPT-2 based on (and extended from) the respective original implementations:

- fine-tuning OpenAI GPT on the RocStories dataset
- evaluating Transformer-XL on WikiText 103
- unconditional and conditional generation from a pre-trained OpenAI GPT-2 model

#### Fine-tuning OpenAI GPT on the RocStories dataset

This example code fine-tunes OpenAI GPT on the RocStories dataset.

Before running this example you should download the
[RocStories dataset](https://github.com/snigdhac/StoryComprehension_EMNLP/tree/master/Dataset/RoCStories) and unpack it to some directory `$ROC_STORIES_DIR`.

```shell
export ROC_STORIES_DIR=/path/to/RocStories

python run_openai_gpt.py \
  --model_name openai-gpt \
  --do_train \
  --do_eval \
  --train_dataset $ROC_STORIES_DIR/cloze_test_val__spring2016\ -\ cloze_test_ALL_val.csv \
  --eval_dataset $ROC_STORIES_DIR/cloze_test_test__spring2016\ -\ cloze_test_ALL_test.csv \
  --output_dir ../log \
  --train_batch_size 16
```

This command runs in about 10 min on a single K-80 and gives an evaluation accuracy of about 87.7% (the authors report a median accuracy with the TensorFlow code of 85.8% and the OpenAI GPT paper reports a best single-run accuracy of 86.5%).

#### Evaluating the pre-trained Transformer-XL on the WikiText 103 dataset

This example code evaluates the pre-trained Transformer-XL on the WikiText 103 dataset.
The command will download a pre-processed version of the WikiText 103 dataset in which the vocabulary has already been computed.

```shell
python run_transfo_xl.py --work_dir ../log
```

This command runs in about 1 min on a V100 and gives an evaluation perplexity of 18.22 on WikiText-103 (the authors report a perplexity of about 18.3 on this dataset with the TensorFlow code).
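
As a quick sanity check, perplexity is the exponential of the average per-token negative log-likelihood, so this score corresponds to an evaluation loss of roughly 2.9 nats per token:

```python
import math

# perplexity = exp(mean negative log-likelihood per token)
print(math.exp(2.903))  # ~18.23, i.e. the reported WikiText-103 perplexity
print(math.log(18.22))  # ~2.90 nats per token
```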

#### Unconditional and conditional generation from OpenAI's GPT-2 model

This example code is identical to the original unconditional and conditional generation code.

Conditional generation:

```shell
python run_gpt2.py
```

Unconditional generation:

```shell
python run_gpt2.py --unconditional
```

The same options as in the original scripts are provided; please refer to the code of the example and the original repository of OpenAI.
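
Under the hood, conditional generation feeds the conditioning text through the model and repeatedly appends a new token. A minimal greedy-decoding sketch (illustrative only; `run_gpt2.py` itself uses top-k sampling, and we assume here that the `GPT2Tokenizer` and `GPT2LMHeadModel` classes are importable from this package):

```python
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

input_ids = torch.tensor([tokenizer.encode("The Manhattan Bridge is")])
with torch.no_grad():
    for _ in range(20):                    # generate 20 tokens greedily
        logits = model(input_ids)[0]       # (1, sequence_length, vocab_size)
        next_token = logits[0, -1, :].argmax().view(1, 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)

print(tokenizer.decode(input_ids[0].tolist()))
```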

## Fine-tuning BERT-large on GPUs

The options we list above allow you to fine-tune BERT-large rather easily on GPU(s) instead of the TPU used by the original implementation.

For example, fine-tuning BERT-large on SQuAD can be done on a server with 4 K-80s (these are pretty old now) in 18 hours. Our results are similar to the TensorFlow implementation results (actually slightly higher):

```bash
{"exact_match": 84.56953642384106, "f1": 91.04028647786927}
```

To get these results we used a combination of:

- multi-GPU training (automatically activated on a multi-GPU server),
- 2 steps of gradient accumulation (sketched below), and
- performing the optimization step on CPU to store Adam's averages in RAM.
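
Gradient accumulation delays the optimizer step so that several mini-batches contribute to a single weight update, trading compute time for memory. A self-contained sketch of the pattern with a dummy model (the actual loop lives in `run_bert_squad.py`):

```python
import torch

model = torch.nn.Linear(10, 1)                        # stand-in for BERT
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accumulation_steps = 2                                # --gradient_accumulation_steps 2

optimizer.zero_grad()
for step in range(8):
    x, y = torch.randn(4, 10), torch.randn(4, 1)      # stand-in for a real batch
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accumulation_steps).backward()            # scale so gradients average
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                              # one update per 2 mini-batches
        optimizer.zero_grad()
```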

Here is the full list of hyper-parameters for this run:

```bash
export SQUAD_DIR=/path/to/SQUAD

python ./run_bert_squad.py \
  --bert_model bert-large-uncased \
  --do_train \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/ \
  --train_batch_size 24 \
  --gradient_accumulation_steps 2
```

If you have a recent GPU (starting from NVIDIA Volta series), you should try **16-bit fine-tuning** (FP16).

Here is an example of hyper-parameters for a FP16 run we tried:

```bash
export SQUAD_DIR=/path/to/SQUAD

python ./run_bert_squad.py \
  --bert_model bert-large-uncased \
  --do_train \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/ \
  --train_batch_size 24 \
  --fp16 \
  --loss_scale 128
```

The results were similar to the above FP32 results (actually slightly higher):

```bash
{"exact_match": 84.65468306527909, "f1": 91.238669287002}
```
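
For context, `--loss_scale 128` applies static loss scaling: the loss is multiplied by a constant before the backward pass so that small FP16 gradients do not underflow, and the gradients are divided by the same constant before the update. A minimal illustration (hypothetical; the actual scripts delegate this to their FP16 optimizer and keep FP32 master weights):

```python
import torch

loss_scale = 128
model = torch.nn.Linear(10, 1).half().cuda()          # FP16 weights (needs a GPU)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 10).half().cuda()
y = torch.randn(4, 1).half().cuda()

loss = torch.nn.functional.mse_loss(model(x).float(), y.float())
(loss * loss_scale).backward()                        # scale up before backward
for param in model.parameters():
    if param.grad is not None:
        param.grad.data.div_(loss_scale)              # unscale before the update
optimizer.step()
optimizer.zero_grad()
```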

Here is an example with the recent `bert-large-uncased-whole-word-masking`:

```bash
python -m torch.distributed.launch --nproc_per_node=8 \
  run_bert_squad.py \
  --bert_model bert-large-uncased-whole-word-masking \
  --do_train \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/ \
  --train_batch_size 24 \
  --gradient_accumulation_steps 2
```

## Fine-tuning XLNet

### STS-B

This example code fine-tunes XLNet on the STS-B corpus.

Before running this example you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.

```shell
export GLUE_DIR=/path/to/glue

python run_xlnet_classifier.py \
 --task_name STS-B \
 --do_train \
 --do_eval \
 --data_dir $GLUE_DIR/STS-B/ \
 --max_seq_length 128 \
 --train_batch_size 8 \
 --gradient_accumulation_steps 1 \
 --learning_rate 5e-5 \
 --num_train_epochs 3.0 \
 --output_dir /tmp/mrpc_output/
```

Our tests, run on a few seeds with [the original implementation hyper-parameters](https://github.com/zihangdai/xlnet#1-sts-b-sentence-pair-relevance-regression-with-gpus), gave evaluation results between 84% and 88%.

### Distributed training

Here is an example using distributed training on 8 V100 GPUs to reach XXXX:

```bash
python -m torch.distributed.launch --nproc_per_node 8 \
 run_xlnet_classifier.py \
 --task_name STS-B \
 --do_train \
 --do_eval \
 --data_dir $GLUE_DIR/STS-B/ \
 --max_seq_length 128 \
 --train_batch_size 8 \
 --gradient_accumulation_steps 1 \
 --learning_rate 5e-5 \
 --num_train_epochs 3.0 \
 --output_dir /tmp/mrpc_output/
```

Training with these hyper-parameters gave us the following results:

```bash
  acc = 0.8823529411764706
  acc_and_f1 = 0.901702786377709
  eval_loss = 0.3418912578906332
  f1 = 0.9210526315789473
  global_step = 174
  loss = 0.07231863956341798
```

Here is an example on MNLI:

```bash
python -m torch.distributed.launch --nproc_per_node 8 run_bert_classifier.py \
  --bert_model bert-large-uncased-whole-word-masking \
  --task_name mnli \
  --do_train \
  --do_eval \
  --data_dir /datadrive/bert_data/glue_data/MNLI/ \
  --max_seq_length 128 \
  --train_batch_size 8 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir ../models/wwm-uncased-finetuned-mnli/ \
  --overwrite_output_dir
```

```bash
***** Eval results *****
  acc = 0.8679706601466992
  eval_loss = 0.4911287787382479
  global_step = 18408
  loss = 0.04755385363816904

***** Eval results *****
  acc = 0.8747965825874695
  eval_loss = 0.45516540421714036
  global_step = 18408
  loss = 0.04755385363816904
```

This is how the `bert-large-uncased-whole-word-masking-finetuned-mnli` model was fine-tuned.
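
All of the distributed commands above rely on `torch.distributed.launch`, which spawns one process per GPU and passes each one a `--local_rank` argument; inside the script, the model is wrapped in `DistributedDataParallel`. A condensed sketch of that pattern with a dummy model (hypothetical; the example scripts implement it with more options):

```python
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=-1)  # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl')       # rendezvous via env vars set by the launcher

model = torch.nn.Linear(10, 1).cuda()                      # stand-in for BERT/XLNet
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
```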

## BERTology

There is a growing field of study concerned with investigating the inner workings of large-scale transformers like BERT (which some call "BERTology"). Some good examples of this field are:

- [BERT Rediscovers the Classical NLP Pipeline](https://arxiv.org/abs/1905.05950) by Ian Tenney, Dipanjan Das, Ellie Pavlick
- [Are Sixteen Heads Really Better than One?](https://arxiv.org/abs/1905.10650) by Paul Michel, Omer Levy, Graham Neubig
- [What Does BERT Look At? An Analysis of BERT's Attention](https://arxiv.org/abs/1906.04341) by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning

In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to help people access the inner representations, mainly adapted from the great work of [Michel et al.](https://arxiv.org/abs/1905.10650):

- accessing all the hidden-states of BERT/GPT/GPT-2,
- accessing all the attention weights for each head of BERT/GPT/GPT-2,
- retrieving the heads' output values and gradients, to be able to compute a head importance score and prune heads as explained in [Michel et al.](https://arxiv.org/abs/1905.10650).

To help you understand and use these features, we have added a specific example script, [`bertology.py`](./examples/bertology.py), which extracts information from, and prunes, a model pre-trained on MRPC.
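
For example, hidden states and attention weights can be requested when loading a model (a sketch assuming the `output_hidden_states` and `output_attentions` configuration options are forwarded by `from_pretrained()`, as in `bertology.py`):

```python
import torch
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states=True,
                                  output_attentions=True)
model.eval()

tokens = tokenizer.tokenize("Let's inspect the inner representations")
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    sequence_output, pooled_output, hidden_states, attentions = model(input_ids)

print(len(hidden_states))  # embedding output + one tensor per layer
print(len(attentions))     # one (batch, num_heads, seq_len, seq_len) tensor per layer
```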

## Notebooks

We include [three Jupyter Notebooks](https://github.com/huggingface/pytorch-transformers/tree/master/notebooks) that can be used to check that the predictions of the PyTorch model are identical to the predictions of the original TensorFlow model.

- The first notebook ([Comparing-TF-and-PT-models.ipynb](./notebooks/Comparing-TF-and-PT-models.ipynb)) extracts the hidden states of a full sequence at each layer of the TensorFlow and the PyTorch models and computes the standard deviation between them. In the given example, we get a standard deviation of 1.5e-7 to 9e-7 on the various hidden states of the models.

- The second notebook ([Comparing-TF-and-PT-models-SQuAD.ipynb](./notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb)) compares the loss computed by the TensorFlow and the PyTorch models for identical initialization of the fine-tuning layer of `BertForQuestionAnswering` and computes the standard deviation between them. In the given example, we get a standard deviation of 2.5e-7 between the models.

- The third notebook ([Comparing-TF-and-PT-models-MLM-NSP.ipynb](./notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb)) compares the predictions computed by the TensorFlow and the PyTorch models for masked-token language modeling using the pre-trained masked language modeling model.

Please follow the instructions given in the notebooks to run and modify them.

## Command-line interface

A command-line interface is provided to convert a TensorFlow checkpoint into a PyTorch dump of the `BertForPreTraining` class (for BERT) or a NumPy checkpoint into a PyTorch dump of the `OpenAIGPTModel` class (for OpenAI GPT).

### BERT CLI

You can convert any TensorFlow checkpoint for BERT (in particular [the pre-trained models released by Google](https://github.com/google-research/bert#pre-trained-models)) into a PyTorch save file by using the [`convert_tf_checkpoint_to_pytorch.py`](./pytorch_transformers/convert_tf_checkpoint_to_pytorch.py) script.

This CLI takes as input a TensorFlow checkpoint (three files starting with `bert_model.ckpt`) and the associated configuration file (`bert_config.json`), creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint into the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using `torch.load()` (see examples in [`run_bert_extract_features.py`](./examples/run_bert_extract_features.py), [`run_bert_classifier.py`](./examples/run_bert_classifier.py) and [`run_bert_squad.py`](./examples/run_bert_squad.py)).
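
Loading the resulting dump then looks roughly like this (a sketch; the file names are the conversion script's inputs and outputs described above):

```python
import torch
from pytorch_transformers import BertConfig, BertForPreTraining

# Rebuild the model from the converted configuration and weights.
config = BertConfig.from_json_file('bert_config.json')
model = BertForPreTraining(config)
state_dict = torch.load('pytorch_model.bin', map_location='cpu')
model.load_state_dict(state_dict)
model.eval()
```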

You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow checkpoint (the three files starting with `bert_model.ckpt`) but be sure to keep the configuration file (`bert_config.json`) and the vocabulary file (`vocab.txt`) as these are needed for the PyTorch model too.

To run this specific conversion script you will need to have TensorFlow and PyTorch installed (`pip install tensorflow`). The rest of the repository only requires PyTorch.

Here is an example of the conversion process for a pre-trained `BERT-Base Uncased` model:

```shell
export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12

pytorch_transformers bert \
  $BERT_BASE_DIR/bert_model.ckpt \
  $BERT_BASE_DIR/bert_config.json \
  $BERT_BASE_DIR/pytorch_model.bin
```

You can download Google's pre-trained models for the conversion [here](https://github.com/google-research/bert#pre-trained-models).

### OpenAI GPT CLI

Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint is saved in the same format as the OpenAI pre-trained model (see [here](https://github.com/openai/finetune-transformer-lm)):

```shell
export OPENAI_GPT_CHECKPOINT_FOLDER_PATH=/path/to/openai/pretrained/numpy/weights

pytorch_transformers gpt \
  $OPENAI_GPT_CHECKPOINT_FOLDER_PATH \
  $PYTORCH_DUMP_OUTPUT \
  [OPENAI_GPT_CONFIG]
```

### Transformer-XL CLI

Here is an example of the conversion process for a pre-trained Transformer-XL model (see [here](https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models)):

```shell
export TRANSFO_XL_CHECKPOINT_FOLDER_PATH=/path/to/transfo/xl/checkpoint

pytorch_transformers transfo_xl \
  $TRANSFO_XL_CHECKPOINT_FOLDER_PATH \
  $PYTORCH_DUMP_OUTPUT \
  [TRANSFO_XL_CONFIG]
```

### GPT-2

Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model:

```shell
export GPT2_DIR=/path/to/gpt2/checkpoint

pytorch_transformers gpt2 \
  $GPT2_DIR/model.ckpt \
  $PYTORCH_DUMP_OUTPUT \
  [GPT2_CONFIG]
```

### XLNet

Here is an example of the conversion process for a pre-trained XLNet model, fine-tuned on STS-B using the TensorFlow script:

```shell
export XLNET_CHECKPOINT_PATH=/path/to/xlnet/checkpoint
export XLNET_CONFIG_PATH=/path/to/xlnet/config

pytorch_transformers xlnet \
  $XLNET_CHECKPOINT_PATH \
  $XLNET_CONFIG_PATH \
  $PYTORCH_DUMP_OUTPUT \
  STS-B
```

## TPU

This section covers TPU support and the pre-training scripts.

TPUs are not supported by the current stable release of PyTorch (0.4.1). However, the next version of PyTorch (v1.0) should support training on TPU and is expected to be released soon (see the recent [official announcement](https://cloud.google.com/blog/products/ai-machine-learning/introducing-pytorch-across-google-cloud)).

We will add TPU support when this next release is published.

The original TensorFlow code further comprises two scripts for pre-training BERT: [create_pretraining_data.py](https://github.com/google-research/bert/blob/master/create_pretraining_data.py) and [run_pretraining.py](https://github.com/google-research/bert/blob/master/run_pretraining.py).

Since pre-training BERT is a particularly expensive operation that basically requires one or several TPUs to be completed in a reasonable amount of time (see details [here](https://github.com/google-research/bert#pre-training-with-bert)), we have decided to wait for the inclusion of TPU support in PyTorch to convert these pre-training scripts.