# 👾 PyTorch-Transformers

[![CircleCI](https://circleci.com/gh/huggingface/pytorch-transformers.svg?style=svg)](https://circleci.com/gh/huggingface/pytorch-transformers)

PyTorch-Transformers is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).

The library currently contains PyTorch implementations, pre-trained model weights, usage scripts and conversion utilities for the following models:

1. **[BERT](https://github.com/google-research/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
2. **[GPT](https://github.com/openai/finetune-transformer-lm)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
3. **[GPT-2](https://blog.openai.com/better-language-models/)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
4. **[Transformer-XL](https://github.com/kimiyoung/transformer-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.

These implementations have been tested on several datasets (see the example scripts) and should match the performance of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on performance in the Examples section of the [documentation](https://huggingface.co/pytorch-transformers/examples.html).

| Section | Description |
|-|-|
| [Installation](#installation) | How to install the package |
| [Quick tour: Usage](#quick-tour-usage) | Tokenizers & models usage: Bert and GPT-2 |
| [Quick tour: Fine-tuning/usage scripts](#quick-tour-fine-tuningusage-scripts) | Using provided scripts: GLUE, SQuAD and Text generation |
| [Migrating from pytorch-pretrained-bert to pytorch-transformers](#Migrating-from-pytorch-pretrained-bert-to-pytorch-transformers) | Migrating your code from pytorch-pretrained-bert to pytorch-transformers |
| [Documentation](https://huggingface.co/pytorch-transformers/) | Full API documentation and more |

## Installation

This repo is tested on Python 2.7 and 3.5+ (examples are tested only on Python 3.5+) and PyTorch 0.4.1 to 1.1.0.

### With pip

PyTorch-Transformers can be installed by pip as follows:

```bash
pip install pytorch-transformers
```

### From source

Clone the repository and run:

```bash
pip install [--editable] .
```
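For example, a development (editable) install from a fresh clone might look like this (a minimal sketch; the repository URL is the one referenced by the badges and links above):

```bash
git clone https://github.com/huggingface/pytorch-transformers.git
cd pytorch-transformers
pip install --editable .
```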

### Tests

A series of tests is included for the library and the example scripts. Library tests can be found in the [tests folder](https://github.com/huggingface/pytorch-transformers/tree/master/pytorch_transformers/tests) and example tests in the [examples folder](https://github.com/huggingface/pytorch-transformers/tree/master/examples).

These tests can be run using `pytest` (install pytest if needed with `pip install pytest`).

You can run the tests from the root of the cloned repository with the commands:

```bash
python -m pytest -sv ./pytorch_transformers/tests/
python -m pytest -sv ./examples/
```
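To run only a subset of the library tests you can use `pytest`'s standard `-k` keyword filter, for example (a sketch; the matched test names depend on the version you have checked out):

```bash
python -m pytest -sv ./pytorch_transformers/tests/ -k "bert"
```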

## Quick tour: Usage

Here are two quick-start examples using `Bert` and `GPT2` with pre-trained models.

See the [documentation](#documentation) for the details of all the models and classes.

### BERT example

First let's prepare a tokenized input from a text string using `BertTokenizer`:

```python
import torch
from pytorch_transformers import BertTokenizer, BertModel, BertForMaskedLM

# OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Convert tokens to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated with the 1st and 2nd sentences (see the paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
```

Let's see how we can use `BertModel` to encode our inputs into hidden states:

```python
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')

# Set the model in evaluation mode to deactivate the dropout modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')

# Predict hidden states for each layer
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    # PyTorch-Transformers models always output tuples.
    # See the models docstrings for the detail of all the outputs
    # In our case, the first element is the hidden state of the last layer of the Bert model
    encoded_layers = outputs[0]

# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension)
assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size)
```
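The model can also be configured to return all hidden states and attention weights in the output tuple (the same output-tuple convention described in the migration notes below). A minimal sketch reusing the tensors prepared above; the exact position of each element is given in the docstrings:

```python
# Reload the model with the extra outputs enabled
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states=True,
                                  output_attentions=True)
model.eval()
model.to('cuda')  # only needed if the tensors above were moved to the GPU

with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)

# The additional elements are appended at the end of the output tuple
all_hidden_states, all_attentions = outputs[-2:]
```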

And how to use `BertForMaskedLM` to predict a masked token:

```python
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]

# confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'henson'
```
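If you want more than the single best candidate, the same `predictions` tensor can be inspected for the top-k tokens at the masked position. A small sketch (variable names are illustrative):

```python
# Look at the 5 most likely tokens for the masked position
top_values, top_indices = torch.topk(predictions[0, masked_index], k=5)
top_tokens = tokenizer.convert_ids_to_tokens(top_indices.tolist())
print(top_tokens)  # 'henson' should be among them
```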

### OpenAI GPT-2

Here is a quick-start example using the `GPT2Tokenizer` and `GPT2LMHeadModel` classes with OpenAI's pre-trained model to predict the next token from a text prompt.

First let's prepare a tokenized input from our text string using `GPT2Tokenizer`:

```python
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode a text input
text = "Who was Jim Henson ? Jim Henson was a"
indexed_tokens = tokenizer.encode(text)

# Convert indexed tokens to a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])
```

Let's see how to use `GPT2LMHeadModel` to generate the next token following our text:

```python
# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the model in evaluation mode to deactivate the dropout modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

# get the predicted next sub-word (in our case, the word 'man')
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
assert predicted_text == 'Who was Jim Henson? Jim Henson was a man'
```
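The same objects can be used to generate several tokens in a row by feeding each prediction back into the model. A minimal greedy-decoding sketch (for better quality text, see the sampling strategies in `run_generation.py` below):

```python
generated = list(indexed_tokens)
for _ in range(10):
    input_tensor = torch.tensor([generated]).to(tokens_tensor.device)
    with torch.no_grad():
        outputs = model(input_tensor)
    # Append the most likely next token and continue
    next_token = torch.argmax(outputs[0][0, -1, :]).item()
    generated.append(next_token)

print(tokenizer.decode(generated))
```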

Examples for each model class of each model architecture (Bert, GPT, GPT-2, Transformer-XL, XLNet and XLM) can be found in the [documentation](#documentation).
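All of these architectures expose the same `from_pretrained()` API, so the BERT and GPT-2 snippets above transfer almost verbatim to the other models. A hedged sketch of the common pattern (the shortcut names below are examples taken from the pre-trained model list in the documentation):

```python
import torch
from pytorch_transformers import (BertModel, BertTokenizer,
                                  OpenAIGPTModel, OpenAIGPTTokenizer,
                                  GPT2Model, GPT2Tokenizer,
                                  TransfoXLModel, TransfoXLTokenizer,
                                  XLNetModel, XLNetTokenizer,
                                  XLMModel, XLMTokenizer)

MODELS = [(BertModel, BertTokenizer, 'bert-base-uncased'),
          (OpenAIGPTModel, OpenAIGPTTokenizer, 'openai-gpt'),
          (GPT2Model, GPT2Tokenizer, 'gpt2'),
          (TransfoXLModel, TransfoXLTokenizer, 'transfo-xl-wt103'),
          (XLNetModel, XLNetTokenizer, 'xlnet-base-cased'),
          (XLMModel, XLMTokenizer, 'xlm-mlm-en-2048')]

for model_class, tokenizer_class, pretrained_weights in MODELS:
    # Load the pre-trained tokenizer and model weights
    tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
    model = model_class.from_pretrained(pretrained_weights)

    # Encode some text and keep the last-layer hidden states (first element of the output tuple)
    input_ids = torch.tensor([tokenizer.encode("Here is some text to encode")])
    with torch.no_grad():
        last_hidden_states = model(input_ids)[0]
```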

## Quick tour: Fine-tuning/usage scripts

The library comprises several example scripts with state-of-the-art performance on NLU and NLG tasks:

- `run_glue.py`: an example fine-tuning Bert, XLNet and XLM on nine different GLUE tasks (*sequence-level classification*)
- `run_squad.py`: an example fine-tuning Bert, XLNet and XLM on the question answering dataset SQuAD 2.0 (*token-level classification*)
- `run_generation.py`: an example using GPT, GPT-2, Transformer-XL and XLNet for conditional language generation
- other model-specific examples (see the documentation).

Here are three quick usage examples for these scripts:

### `run_glue.py`: Fine-tuning on GLUE tasks for sequence classification

The [General Language Understanding Evaluation (GLUE) benchmark](https://gluebenchmark.com/) is a collection of nine sentence- or sentence-pair language understanding tasks for evaluating and analyzing natural language understanding systems.

Before running any one of these GLUE tasks you should download the
[GLUE data](https://gluebenchmark.com/tasks) by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
and unpack it to some directory `$GLUE_DIR`.
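For example, the download step might look like this (a sketch; the script and its flags come from the gist linked above, so check there for the exact interface):

```shell
python download_glue_data.py --data_dir $GLUE_DIR --tasks all
```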

You should also install the additional packages required by the examples:

```shell
pip install -r ./examples/requirements.txt
```

```shell
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC

python ./examples/run_glue.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir $GLUE_DIR/$TASK_NAME \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=8   \
    --per_gpu_train_batch_size=8   \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/$TASK_NAME/
```

where the task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.

The dev set results will be written to the text file `eval_results.txt` in the specified `output_dir`. In the case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate output folder `/tmp/MNLI-MM/` in addition to `/tmp/MNLI/`.

#### Fine-tuning XLNet model on the STS-B regression task

This example code fine-tunes XLNet on the STS-B corpus using parallel training on a server with 4 V100 GPUs.
Parallel training is a simple way to use several GPUs (but is slower and less flexible than distributed training, see below).

```shell
export GLUE_DIR=/path/to/glue

python ./examples/run_glue.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train  \
    --do_eval   \
    --task_name=sts-b     \
    --data_dir=${GLUE_DIR}/STS-B  \
    --output_dir=./proc_data/sts-b-110   \
    --max_seq_length=128   \
    --per_gpu_eval_batch_size=8   \
    --per_gpu_train_batch_size=8   \
    --gradient_accumulation_steps=1 \
    --max_steps=1200  \
    --overwrite_output_dir   \
    --overwrite_cache \
    --warmup_steps=120
```

On this machine we thus have a batch size of 32; if you have a smaller machine, increase `gradient_accumulation_steps` to reach the same effective batch size. These hyper-parameters should result in a Pearson correlation coefficient of `+0.917` on the development set.
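For instance, a single-GPU run with the same effective batch size of 32 (8 examples per step x 4 accumulation steps) could look like the following sketch, keeping the other flags from the command above:

```shell
python ./examples/run_glue.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train \
    --do_eval \
    --task_name=sts-b \
    --data_dir=${GLUE_DIR}/STS-B \
    --output_dir=./proc_data/sts-b-110 \
    --max_seq_length=128 \
    --per_gpu_eval_batch_size=8 \
    --per_gpu_train_batch_size=8 \
    --gradient_accumulation_steps=4 \
    --max_steps=1200 \
    --warmup_steps=120
```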

#### Fine-tuning Bert model on the MRPC classification task

This example code fine-tunes the Bert Whole Word Masking model on the Microsoft Research Paraphrase Corpus (MRPC) using distributed training on 8 V100 GPUs to reach an F1 > 92.

```bash
python -m torch.distributed.launch --nproc_per_node 8 ./examples/run_glue.py   \
    --model_type bert \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --task_name MRPC \
    --do_train   \
    --do_eval   \
    --do_lower_case   \
    --data_dir $GLUE_DIR/MRPC/   \
    --max_seq_length 128   \
    --per_gpu_eval_batch_size=8   \
    --per_gpu_train_batch_size=8   \
    --learning_rate 2e-5   \
    --num_train_epochs 3.0  \
    --output_dir /tmp/mrpc_output/ \
    --overwrite_output_dir   \
    --overwrite_cache
```

Training with these hyper-parameters gave us the following results:

```bash
  acc = 0.8823529411764706
  acc_and_f1 = 0.901702786377709
  eval_loss = 0.3418912578906332
  f1 = 0.9210526315789473
  global_step = 174
  loss = 0.07231863956341798
```

### `run_squad.py`: Fine-tuning on SQuAD for question-answering

This example code fine-tunes the Bert Whole Word Masking uncased model on the SQuAD dataset using distributed training on 8 V100 GPUs to reach an F1 > 93 on SQuAD:

```bash
python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
    --model_type bert \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_train \
    --do_predict \
    --do_lower_case \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ../models/wwm_uncased_finetuned_squad/ \
    --per_gpu_eval_batch_size=3   \
    --per_gpu_train_batch_size=3
```

Training with these hyper-parameters gave us the following results:

```bash
python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_uncased_finetuned_squad/predictions.json
{"exact_match": 86.91579943235573, "f1": 93.1532499015869}
```

This is the model provided as `bert-large-uncased-whole-word-masking-finetuned-squad`.
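If you just want to use this checkpoint rather than re-train it, it can be loaded by its shortcut name like any other pre-trained model, e.g. with `BertForQuestionAnswering` (a small sketch):

```python
from pytorch_transformers import BertTokenizer, BertForQuestionAnswering

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model.eval()
```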

### `run_generation.py`: Text generation with GPT, GPT-2, Transformer-XL and XLNet

A conditional generation script is also included to generate text from a prompt.
The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by Aman Rusia to get high-quality generation with memory models like Transformer-XL and XLNet (a predefined text is prepended to make short inputs longer).

Here is how to run the script with the small version of the OpenAI GPT-2 model:

```shell
python ./examples/run_generation.py \
    --model_type=gpt2 \
    --length=20 \
    --model_name_or_path=gpt2
```
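The same script works with the memory models mentioned above, for example XLNet (a sketch reusing the flags shown for GPT-2; the padding-text trick is handled inside the script):

```shell
python ./examples/run_generation.py \
    --model_type=xlnet \
    --length=20 \
    --model_name_or_path=xlnet-large-cased
```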

## Migrating from pytorch-pretrained-bert to pytorch-transformers

Here is a quick summary of what you should take care of when migrating from `pytorch-pretrained-bert` to `pytorch-transformers`.

### Models always output `tuples`

The main breaking change when migrating from `pytorch-pretrained-bert` to `pytorch-transformers` is that the models' forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.

The exact content of the tuples for each model is detailed in the models' docstrings and the [documentation](https://huggingface.co/pytorch-transformers/).

In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in `pytorch-pretrained-bert`.

Here is a `pytorch-pretrained-bert` to `pytorch-transformers` conversion example for a `BertForSequenceClassification` classification model:

```python
# Let's load our model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# If you used to have this line in pytorch-pretrained-bert:
loss = model(input_ids, labels=labels)

# Now just use this line in pytorch-transformers to extract the loss from the output tuple:
outputs = model(input_ids, labels=labels)
loss = outputs[0]

# In pytorch-transformers you can also have access to the logits:
loss, logits = outputs[:2]

# And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=True)
outputs = model(input_ids, labels=labels)
loss, logits, attentions = outputs
```

### Serialization

While not a breaking change, the serialization methods have been standardized and you should probably switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.

Here is an example:

```python
### Let's load a model and tokenizer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

### Do some stuff to our model and tokenizer
# Ex: add new tokens to the vocabulary and embeddings of our model
tokenizer.add_tokens(['[SPECIAL_TOKEN_1]', '[SPECIAL_TOKEN_2]'])
model.resize_token_embeddings(len(tokenizer))
# Train our model
train(model)

### Now let's save our model and tokenizer to a directory
model.save_pretrained('./my_saved_model_directory/')
tokenizer.save_pretrained('./my_saved_model_directory/')

### Reload the model and the tokenizer
model = BertForSequenceClassification.from_pretrained('./my_saved_model_directory/')
tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')
```

### Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules

The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer.
The new `AdamW` optimizer matches the API of the PyTorch `Adam` optimizer.

The schedules are now standard [PyTorch learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) and are no longer part of the optimizer.

Here is a conversion example from `BertAdam` with a linear warmup and decay schedule to `AdamW` with the same schedule:

```python
# Parameters:
lr = 1e-3
num_total_steps = 1000
num_warmup_steps = 100
warmup_proportion = float(num_warmup_steps) / float(num_total_steps)  # 0.1

### Previously BertAdam optimizer was instantiated like this:
optimizer = BertAdam(model.parameters(), lr=lr, schedule='warmup_linear', warmup=warmup_proportion, t_total=num_total_steps)
### and used like this:
for batch in train_data:
    loss = model(batch)
    loss.backward()
    optimizer.step()

### In PyTorch-Transformers, the optimizer and the schedule are split and instantiated like this
### (both can be imported directly: from pytorch_transformers import AdamW, WarmupLinearSchedule):
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)  # To reproduce BertAdam specific behavior set correct_bias=False
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps, t_total=num_total_steps)  # PyTorch scheduler
### and used like this:
for batch in train_data:
    loss = model(batch)
    loss.backward()
    optimizer.step()
    scheduler.step()  # Update the learning rate schedule (call after optimizer.step() on recent PyTorch versions)
```

## Citation

At the moment, there is no paper associated with PyTorch-Transformers, but we are working on preparing one. In the meantime, please include a mention of the library and a link to the present repository if you use this work in a published or open-source project.