Merge branch 'xlnet'

f31154cb · thomwolf · 78462aad · 1b35d05d · f31154cb · f31154cb
Commit f31154cb authored Jul 16, 2019 by thomwolf
20 changed files
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
 version: 2
 jobs:
    build_py3:
-        working_directory: ~/pytorch-pretrained-BERT
+        working_directory: ~/pytorch-transformers
        docker:
            - image: circleci/python:3.5
+        resource_class: large
+        parallelism: 4
        steps:
            - checkout
            - run: sudo pip install --progress-bar off .
            - run: sudo pip install pytest codecov pytest-cov
-            - run: sudo pip install spacy ftfy==4.4.3
-            - run: sudo python -m spacy download en
-            - run: python -m pytest -sv tests/ --cov
+            - run: sudo pip install tensorboardX scikit-learn
+            - run: python -m pytest -sv ./pytorch_transformers/tests/ --cov
+            - run: python -m pytest -sv ./examples/
            - run: codecov
    build_py2:
-        working_directory: ~/pytorch-pretrained-BERT
+        working_directory: ~/pytorch-transformers
+        resource_class: large
+        parallelism: 4
        docker:
            - image: circleci/python:2.7
        steps:
            - checkout
            - run: sudo pip install --progress-bar off .
            - run: sudo pip install pytest codecov pytest-cov
-            - run: sudo pip install spacy ftfy==4.4.3
-            - run: sudo python -m spacy download en
-            - run: python -m pytest -sv tests/ --cov
+            - run: python -m pytest -sv ./pytorch_transformers/tests/ --cov
            - run: codecov
 workflows:
  version: 2

--- a/.coveragerc
+++ b/.coveragerc
 [run]
-source=pytorch_pretrained_bert
+source=pytorch_transformers
+omit =
+    # skip conversion scripts from testing for now
+    */convert_*
+    */__main__.py
 [report]
 exclude_lines =
    pragma: no cover
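
The new `omit` entries are glob-style patterns. As a rough illustration of which files they exclude, here is a minimal sketch that approximates the matching with Python's `fnmatch`. This is an assumption made for illustration: coverage.py's own matcher may differ in details, and the file paths below are made-up examples.

```python
from fnmatch import fnmatchcase

# Patterns copied from the `omit` section of the diff above
OMIT_PATTERNS = ["*/convert_*", "*/__main__.py"]


def is_omitted(path):
    """Return True if `path` matches any omit pattern (fnmatch approximation)."""
    return any(fnmatchcase(path, pattern) for pattern in OMIT_PATTERNS)


# Illustrative paths only; they are not taken from the commit
for path in [
    "pytorch_transformers/convert_example_checkpoint.py",
    "pytorch_transformers/__main__.py",
    "pytorch_transformers/modeling_bert.py",
]:
    print(path, "->", "omitted" if is_omitted(path) else "measured")
```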

--- a/.gitignore
+++ b/.gitignore
@@ -122,4 +122,9 @@ dmypy.json
 tensorflow_code

 # Models
-models
\ No newline at end of file
+models
+proc_data
+
+# examples
+runs
+examples/runs
\ No newline at end of file
--- a/README.md
+++ b/README.md
-# PyTorch Pretrained BERT: The Big & Extending Repository of pretrained Transformers
+# 👾 PyTorch-Transformers

 [![CircleCI](https://circleci.com/gh/huggingface/pytorch-pretrained-BERT.svg?style=svg)](https://circleci.com/gh/huggingface/pytorch-pretrained-BERT)

-This repository contains op-for-op PyTorch reimplementations, pre-trained models and fine-tuning examples for:
+PyTorch-Transformers is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).

- [Google's BERT model](https://github.com/google-research/bert),
- [OpenAI's GPT model](https://github.com/openai/finetune-transformer-lm),
- [Google/CMU's Transformer-XL model](https://github.com/kimiyoung/transformer-xl), and
- [OpenAI's GPT-2 model](https://blog.openai.com/better-language-models/).
+The library currently contains PyTorch implementations, pre-trained model weights, usage scripts and conversion utilities for the following models:

-These implementations have been tested on several datasets (see the examples) and should match the performances of the associated TensorFlow implementations (e.g. ~91 F1 on SQuAD for BERT, ~88 F1 on RocStories for OpenAI GPT and ~18.3 perplexity on WikiText 103 for the Transformer-XL). You can find more details in the [Examples](#examples) section below.
+- **[Google's BERT model](https://github.com/google-research/bert)** released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
+- **[OpenAI's GPT model](https://github.com/openai/finetune-transformer-lm)** released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
+- **[OpenAI's GPT-2 model](https://blog.openai.com/better-language-models/)** released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
+- **[Google/CMU's Transformer-XL model](https://github.com/kimiyoung/transformer-xl)** released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
+- **[Google/CMU's XLNet model](https://github.com/zihangdai/xlnet/)** released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
+- **[Facebook's XLM model](https://github.com/facebookresearch/XLM/)** released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.

-Here are some information on these models:
-
-**BERT** was released together with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
-This PyTorch implementation of BERT is provided with [Google's pre-trained models](https://github.com/google-research/bert), examples, notebooks and a command-line interface to load any pre-trained TensorFlow checkpoint for BERT is also provided.
-
-**OpenAI GPT** was released together with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
-This PyTorch implementation of OpenAI GPT is an adaptation of the [PyTorch implementation by HuggingFace](https://github.com/huggingface/pytorch-openai-transformer-lm) and is provided with [OpenAI's pre-trained model](https://github.com/openai/finetune-transformer-lm) and a command-line interface that was used to convert the pre-trained NumPy checkpoint in PyTorch.
-
-**Google/CMU's Transformer-XL** was released together with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](http://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
-This PyTorch implementation of Transformer-XL is an adaptation of the original [PyTorch implementation](https://github.com/kimiyoung/transformer-xl) which has been slightly modified to match the performances of the TensorFlow implementation and allow to re-use the pretrained weights. A command-line interface is provided to convert TensorFlow checkpoints in PyTorch models.
-
-**OpenAI GPT-2** was released together with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
-This PyTorch implementation of OpenAI GPT-2 is an adaptation of the [OpenAI's implementation](https://github.com/openai/gpt-2) and is provided with [OpenAI's pre-trained model](https://github.com/openai/gpt-2) and a command-line interface that was used to convert the TensorFlow checkpoint in PyTorch.
-
-
-## Content
+These implementations have been tested on several datasets (see the example scripts) and should match the performance of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on performance in the Examples section of the [documentation](#documentation).

 | Section | Description |
 |-|-|
 | [Installation](#installation) | How to install the package |
-| [Overview](#overview) | Overview of the package |
-| [Usage](#usage) | Quickstart examples |
-| [Doc](#doc) |  Detailed documentation |
-| [Examples](#examples) | Detailed examples on how to fine-tune Bert |
-| [Notebooks](#notebooks) | Introduction on the provided Jupyter Notebooks |
-| [TPU](#tpu) | Notes on TPU support and pretraining scripts |
-| [Command-line interface](#Command-line-interface) | Convert a TensorFlow checkpoint in a PyTorch dump |
+| [Quick tour: Usage](#quick-tour-usage) | Tokenizers & models usage: Bert and GPT-2 |
+| [Quick tour: Fine-tuning/usage scripts](#quick-tour-fine-tuningusage-scripts) | Using provided scripts: GLUE, SQuAD and Text generation |
+| [Migrating from pytorch-pretrained-bert to pytorch-transformers](#migrating-from-pytorch-pretrained-bert-to-pytorch-transformers) | Migrating your code from pytorch-pretrained-bert to pytorch-transformers |
+| [Documentation](#documentation) | Full API documentation and more |

 ## Installation

-This repo was tested on Python 2.7 and 3.5+ (examples are tested only on python 3.5+) and PyTorch 0.4.1/1.0.0
+This repo is tested on Python 2.7 and 3.5+ (examples are tested only on Python 3.5+) and PyTorch 0.4.1 to 1.1.0.

 ### With pip

-PyTorch pretrained bert can be installed by pip as follows:
-```bash
-pip install pytorch-pretrained-bert
-```
+PyTorch-Transformers can be installed with pip as follows:

-If you want to reproduce the original tokenization process of the `OpenAI GPT` paper, you will need to install `ftfy` (limit to version 4.4.3 if you are using Python 2) and `SpaCy` :
 ```bash
-pip install spacy ftfy==4.4.3
-python -m spacy download en
+pip install pytorch-transformers
 ```

-If you don't install `ftfy` and `SpaCy`, the `OpenAI GPT` tokenizer will default to tokenize using BERT's `BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
-
 ### From source

 Clone the repository and run:
+
 ```bash
 pip install [--editable] .
 ```

-Here also, if you want to reproduce the original tokenization process of the `OpenAI GPT` model, you will need to install `ftfy` (limit to version 4.4.3 if you are using Python 2) and `SpaCy` :
-```bash
-pip install spacy ftfy==4.4.3
-python -m spacy download en
-```
+### Tests

-Again, if you don't install `ftfy` and `SpaCy`, the `OpenAI GPT` tokenizer will default to tokenize using BERT's `BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage).
+A series of tests is included for the library and the example scripts. Library tests can be found in the [tests folder](https://github.com/huggingface/pytorch-transformers/tree/master/pytorch_transformers/tests) and example tests in the [examples folder](https://github.com/huggingface/pytorch-transformers/tree/master/examples).

-A series of tests is included in the [tests folder](https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/tests) and can be run using `pytest` (install pytest if needed: `pip install pytest`).
+These tests can be run using `pytest` (install pytest if needed with `pip install pytest`).
+
+You can run the tests from the root of the cloned repository with the commands:

-You can run the tests with the command:
 ```bash
-python -m pytest -sv tests/
+python -m pytest -sv ./pytorch_transformers/tests/
+python -m pytest -sv ./examples/
 ```

-## Overview
-
-This package comprises the following classes that can be imported in Python and are detailed in the [Doc](#doc) section of this readme:
-
- Eight **Bert** PyTorch models (`torch.nn.Module`) with pre-trained weights (in the [`modeling.py`](./pytorch_pretrained_bert/modeling.py) file):
-  - [`BertModel`](./pytorch_pretrained_bert/modeling.py#L639) - raw BERT Transformer model (**fully pre-trained**),
-  - [`BertForMaskedLM`](./pytorch_pretrained_bert/modeling.py#L793) - BERT Transformer with the pre-trained masked language modeling head on top (**fully pre-trained**),
-  - [`BertForNextSentencePrediction`](./pytorch_pretrained_bert/modeling.py#L854) - BERT Transformer with the pre-trained next sentence prediction classifier on top  (**fully pre-trained**),
-  - [`BertForPreTraining`](./pytorch_pretrained_bert/modeling.py#L722) - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (**fully pre-trained**),
-  - [`BertForSequenceClassification`](./pytorch_pretrained_bert/modeling.py#L916) - BERT Transformer with a sequence classification head on top (BERT Transformer is **pre-trained**, the sequence classification head **is only initialized and has to be trained**),
-  - [`BertForMultipleChoice`](./pytorch_pretrained_bert/modeling.py#L982) - BERT Transformer with a multiple choice head on top (used for task like Swag) (BERT Transformer is **pre-trained**, the multiple choice classification head **is only initialized and has to be trained**),
-  - [`BertForTokenClassification`](./pytorch_pretrained_bert/modeling.py#L1051) - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**, the token classification head **is only initialized and has to be trained**),
-  - [`BertForQuestionAnswering`](./pytorch_pretrained_bert/modeling.py#L1124) - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**, the token classification head **is only initialized and has to be trained**).
-
- Three **OpenAI GPT** PyTorch models (`torch.nn.Module`) with pre-trained weights (in the [`modeling_openai.py`](./pytorch_pretrained_bert/modeling_openai.py) file):
-  - [`OpenAIGPTModel`](./pytorch_pretrained_bert/modeling_openai.py#L536) - raw OpenAI GPT Transformer model (**fully pre-trained**),
-  - [`OpenAIGPTLMHeadModel`](./pytorch_pretrained_bert/modeling_openai.py#L643) - OpenAI GPT Transformer with the tied language modeling head on top (**fully pre-trained**),
-  - [`OpenAIGPTDoubleHeadsModel`](./pytorch_pretrained_bert/modeling_openai.py#L722) - OpenAI GPT Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT Transformer is **pre-trained**, the multiple choice classification head **is only initialized and has to be trained**),
-
- Two **Transformer-XL** PyTorch models (`torch.nn.Module`) with pre-trained weights (in the [`modeling_transfo_xl.py`](./pytorch_pretrained_bert/modeling_transfo_xl.py) file):
-  - [`TransfoXLModel`](./pytorch_pretrained_bert/modeling_transfo_xl.py#L983) - Transformer-XL model which outputs the last hidden state and memory cells (**fully pre-trained**),
-  - [`TransfoXLLMHeadModel`](./pytorch_pretrained_bert/modeling_transfo_xl.py#L1260) - Transformer-XL with the tied adaptive softmax head on top for language modeling which outputs the logits/loss and memory cells (**fully pre-trained**),
-
- Three **OpenAI GPT-2** PyTorch models (`torch.nn.Module`) with pre-trained weights (in the [`modeling_gpt2.py`](./pytorch_pretrained_bert/modeling_gpt2.py) file):
-  - [`GPT2Model`](./pytorch_pretrained_bert/modeling_gpt2.py#L479) - raw OpenAI GPT-2 Transformer model (**fully pre-trained**),
-  - [`GPT2LMHeadModel`](./pytorch_pretrained_bert/modeling_gpt2.py#L559) - OpenAI GPT-2 Transformer with the tied language modeling head on top (**fully pre-trained**),
-  - [`GPT2DoubleHeadsModel`](./pytorch_pretrained_bert/modeling_gpt2.py#L624) - OpenAI GPT-2 Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT-2 Transformer is **pre-trained**, the multiple choice classification head **is only initialized and has to be trained**),
-
- Tokenizers for **BERT** (using word-piece) (in the [`tokenization.py`](./pytorch_pretrained_bert/tokenization.py) file):
-  - `BasicTokenizer` - basic tokenization (punctuation splitting, lower casing, etc.),
-  - `WordpieceTokenizer` - WordPiece tokenization,
-  - `BertTokenizer` - perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.
-
- Tokenizer for **OpenAI GPT** (using Byte-Pair-Encoding) (in the [`tokenization_openai.py`](./pytorch_pretrained_bert/tokenization_openai.py) file):
-  - `OpenAIGPTTokenizer` - perform Byte-Pair-Encoding (BPE) tokenization.
-
- Tokenizer for **Transformer-XL** (word tokens ordered by frequency for adaptive softmax) (in the [`tokenization_transfo_xl.py`](./pytorch_pretrained_bert/tokenization_transfo_xl.py) file):
-  - `OpenAIGPTTokenizer` - perform word tokenization and can order words by frequency in a corpus for use in an adaptive softmax.
-
- Tokenizer for **OpenAI GPT-2** (using byte-level Byte-Pair-Encoding) (in the [`tokenization_gpt2.py`](./pytorch_pretrained_bert/tokenization_gpt2.py) file):
-  - `GPT2Tokenizer` - perform byte-level Byte-Pair-Encoding (BPE) tokenization.
-
- Optimizer for **BERT** (in the [`optimization.py`](./pytorch_pretrained_bert/optimization.py) file):
-  - `BertAdam` - Bert version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
-
- Optimizer for **OpenAI GPT** (in the [`optimization_openai.py`](./pytorch_pretrained_bert/optimization_openai.py) file):
-  - `OpenAIAdam` - OpenAI GPT version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
-
- Configuration classes for BERT, OpenAI GPT and Transformer-XL (in the respective [`modeling.py`](./pytorch_pretrained_bert/modeling.py), [`modeling_openai.py`](./pytorch_pretrained_bert/modeling_openai.py), [`modeling_transfo_xl.py`](./pytorch_pretrained_bert/modeling_transfo_xl.py) files):
-  - `BertConfig` - Configuration class to store the configuration of a `BertModel` with utilities to read and write from JSON configuration files.
-  - `OpenAIGPTConfig` - Configuration class to store the configuration of a `OpenAIGPTModel` with utilities to read and write from JSON configuration files.
-  - `GPT2Config` - Configuration class to store the configuration of a `GPT2Model` with utilities to read and write from JSON configuration files.
-  - `TransfoXLConfig` - Configuration class to store the configuration of a `TransfoXLModel` with utilities to read and write from JSON configuration files.
-
-The repository further comprises:
-
- Five examples on how to use **BERT** (in the [`examples` folder](./examples)):
-  - [`extract_features.py`](./examples/extract_features.py) - Show how to extract hidden states from an instance of `BertModel`,
-  - [`run_classifier.py`](./examples/run_classifier.py) - Show how to fine-tune an instance of `BertForSequenceClassification` on GLUE's MRPC task,
-  - [`run_squad.py`](./examples/run_squad.py) - Show how to fine-tune an instance of `BertForQuestionAnswering` on SQuAD v1.0 and SQuAD v2.0 tasks.
-  - [`run_swag.py`](./examples/run_swag.py) - Show how to fine-tune an instance of `BertForMultipleChoice` on Swag task.
-  - [`simple_lm_finetuning.py`](./examples/lm_finetuning/simple_lm_finetuning.py) - Show how to fine-tune an instance of `BertForPretraining` on a target text corpus.
-
- One example on how to use **OpenAI GPT** (in the [`examples` folder](./examples)):
-  - [`run_openai_gpt.py`](./examples/run_openai_gpt.py) - Show how to fine-tune an instance of `OpenGPTDoubleHeadsModel` on the RocStories task.
-
- One example on how to use **Transformer-XL** (in the [`examples` folder](./examples)):
-  - [`run_transfo_xl.py`](./examples/run_transfo_xl.py) - Show how to load and evaluate a pre-trained model of `TransfoXLLMHeadModel` on WikiText 103.
-
- One example on how to use **OpenAI GPT-2** in the unconditional and interactive mode (in the [`examples` folder](./examples)):
-  - [`run_gpt2.py`](./examples/run_gpt2.py) - Show how to use OpenAI GPT-2 an instance of `GPT2LMHeadModel` to generate text (same as the original OpenAI GPT-2 examples).
-
-  These examples are detailed in the [Examples](#examples) section of this readme.
-
- Three notebooks that were used to check that the TensorFlow and PyTorch models behave identically (in the [`notebooks` folder](./notebooks)):
-  - [`Comparing-TF-and-PT-models.ipynb`](./notebooks/Comparing-TF-and-PT-models.ipynb) - Compare the hidden states predicted by `BertModel`,
-  - [`Comparing-TF-and-PT-models-SQuAD.ipynb`](./notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb) - Compare the spans predicted by  `BertForQuestionAnswering` instances,
-  - [`Comparing-TF-and-PT-models-MLM-NSP.ipynb`](./notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb) - Compare the predictions of the `BertForPretraining` instances.
+## Quick tour: Usage

-  These notebooks are detailed in the [Notebooks](#notebooks) section of this readme.
+Here are two quick-start examples using `Bert` and `GPT2` with pre-trained models.

- A command-line interface to convert TensorFlow checkpoints (BERT, Transformer-XL) or NumPy checkpoint (OpenAI) in a PyTorch save of the associated PyTorch model:
+See the [documentation](#documentation) for the details of all the models and classes.

-  This CLI is detailed in the [Command-line interface](#Command-line-interface) section of this readme.
+### BERT example

-## Usage
-
-### BERT
-
-Here is a quick-start example using `BertTokenizer`, `BertModel` and `BertForMaskedLM` class with Google AI's pre-trained `Bert base uncased` model. See the [doc section](#doc) below for all the details on these classes.
-
-First let's prepare a tokenized input with `BertTokenizer`
+First, let's prepare a tokenized input from a text string using `BertTokenizer`:

 ```python
 import torch
-from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
+from pytorch_transformers import BertTokenizer, BertModel, BertForMaskedLM

-# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
+# OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows
 import logging
 logging.basicConfig(level=logging.INFO)

 # Load pre-trained model tokenizer (vocabulary)
 tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

-# Tokenized input
+# Tokenize input
 text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
 tokenized_text = tokenizer.tokenize(text)

@@ -203,11 +96,14 @@ tokens_tensor = torch.tensor([indexed_tokens])
 segments_tensors = torch.tensor([segments_ids])
 ```

-Let's see how to use `BertModel` to get hidden states
+Let's see how we can use `BertModel` to encode our inputs into hidden states:

 ```python
 # Load pre-trained model (weights)
 model = BertModel.from_pretrained('bert-base-uncased')
+
+# Set the model in evaluation mode to deactivate the Dropout modules
+# This is IMPORTANT to have reproducible results during evaluation!
 model.eval()

 # If you have a GPU, put everything on cuda
@@ -217,12 +113,17 @@ model.to('cuda')

 # Predict hidden states features for each layer
 with torch.no_grad():
-    encoded_layers, _ = model(tokens_tensor, segments_tensors)
-# We have a hidden states for each of the 12 layers in model bert-base-uncased
-assert len(encoded_layers) == 12
+    # See the model docstrings for details of the inputs
+    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
+    # PyTorch-Transformers models always output tuples.
+    # See the model docstrings for details of all the outputs
+    # In our case, the first element is the hidden state of the last layer of the Bert model
+    encoded_layers = outputs[0]
+# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension)
+assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size)
 ```

-And how to use `BertForMaskedLM`
+And how to use `BertForMaskedLM` to predict a masked token:

 ```python
 # Load pre-trained model (weights)
@@ -236,7 +137,8 @@ model.to('cuda')

 # Predict all tokens
 with torch.no_grad():
-    predictions = model(tokens_tensor, segments_tensors)
+    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
+    predictions = outputs[0]

 # confirm we were able to predict 'henson'
 predicted_index = torch.argmax(predictions[0, masked_index]).item()
@@ -244,55 +146,39 @@ predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
 assert predicted_token == 'henson'
 ```
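
The BERT hunks above change the calling convention: the added comment says models always output tuples, while the removed `predictions = model(...)` line expected a bare value. A minimal, library-free sketch of how calling code could tolerate both conventions during a migration is shown here. The helper name `first_output` is hypothetical and not part of the library, and plain strings stand in for tensors.

```python
def first_output(outputs):
    """Return the first element if `outputs` is a tuple, else `outputs` itself.

    Hypothetical migration helper, not part of pytorch-transformers.
    """
    if isinstance(outputs, tuple):
        return outputs[0]
    return outputs


# Stand-ins for model outputs (strings instead of tensors)
tuple_style = ("prediction_scores", "extra_output")  # new convention: a tuple
bare_style = "prediction_scores"                     # old convention: a bare value

print(first_output(tuple_style))
print(first_output(bare_style))
```

Once all call sites index `outputs[0]` explicitly, as the updated README does, a helper like this is no longer needed.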

-### OpenAI GPT
+### OpenAI GPT-2

-Here is a quick-start example using `OpenAIGPTTokenizer`, `OpenAIGPTModel` and `OpenAIGPTLMHeadModel` class with OpenAI's pre-trained  model. See the [doc section](#doc) below for all the details on these classes.
+Here is a quick-start example using the `GPT2Tokenizer` and `GPT2LMHeadModel` classes with OpenAI's pre-trained model to predict the next token from a text prompt.

-First let's prepare a tokenized input with `OpenAIGPTTokenizer`
+First, let's prepare a tokenized input from our text string using `GPT2Tokenizer`:

 ```python
 import torch
-from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel, OpenAIGPTLMHeadModel
+from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

 # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
 import logging
 logging.basicConfig(level=logging.INFO)

 # Load pre-trained model tokenizer (vocabulary)
-tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
-
-# Tokenized input
-text = "Who was Jim Henson ? Jim Henson was a puppeteer"
-tokenized_text = tokenizer.tokenize(text)
+tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

-# Convert token to vocabulary indices
-indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
+# Encode a text input
+text = "Who was Jim Henson ? Jim Henson was a"
+indexed_tokens = tokenizer.encode(text)

-# Convert inputs to PyTorch tensors
+# Convert indexed tokens into a PyTorch tensor
 tokens_tensor = torch.tensor([indexed_tokens])
 ```

-Let's see how to use `OpenAIGPTModel` to get hidden states
+Let's see how to use `GPT2LMHeadModel` to generate the next token following our text:

 ```python
 # Load pre-trained model (weights)
-model = OpenAIGPTModel.from_pretrained('openai-gpt')
-model.eval()
-
-# If you have a GPU, put everything on cuda
-tokens_tensor = tokens_tensor.to('cuda')
-model.to('cuda')
-
-# Predict hidden states features for each layer
-with torch.no_grad():
-    hidden_states = model(tokens_tensor)
-```
-
-And how to use `OpenAIGPTLMHeadModel`
+model = GPT2LMHeadModel.from_pretrained('gpt2')

-```python
-# Load pre-trained model (weights)
-model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
+# Set the model in evaluation mode to deactivate the Dropout modules
+# This is IMPORTANT to have reproducible results during evaluation!
 model.eval()

 # If you have a GPU, put everything on cuda
@@ -301,1369 +187,266 @@ model.to('cuda')

 # Predict all tokens
 with torch.no_grad():
-    predictions = model(tokens_tensor)
+    outputs = model(tokens_tensor)
+    predictions = outputs[0]

-# get the predicted last token
+# get the predicted next sub-word (in our case, the word 'man')
 predicted_index = torch.argmax(predictions[0, -1, :]).item()
-predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
-assert predicted_token == '.</w>'
+predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
+assert predicted_text == 'Who was Jim Henson? Jim Henson was a man'
 ```
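
The GPT-2 snippet above picks the next token by taking the argmax over the scores at the last sequence position (`predictions[0, -1, :]`). The same selection step can be sketched without any model or tensor library. The four-word vocabulary and the scores below are made up for illustration and do not come from GPT-2.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical vocabulary and made-up scores of shape (sequence length, vocab size)
vocab = ["a", "man", "puppeteer", "was"]
logits = [
    [0.1, 0.2, 0.3, 2.0],  # position 0
    [1.5, 0.4, 0.2, 0.1],  # position 1
    [0.0, 3.2, 2.9, 0.5],  # last position: scores for the next token
]

# Greedy choice: only the last position is used to predict what comes next
predicted_index = argmax(logits[-1])
print(vocab[predicted_index])  # prints "man"
```

Greedy selection is the simplest decoding choice; sampling from the scores instead would produce more varied continuations.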

-And how to use `OpenAIGPTDoubleHeadsModel`
-
-```python
-# Load pre-trained model (weights)
-model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')
-model.eval()
-
-#  Prepare tokenized input
-text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
-text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
-tokenized_text1 = tokenizer.tokenize(text1)
-tokenized_text2 = tokenizer.tokenize(text2)
-indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
-indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
-tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
-mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])
-
-# Predict hidden states features for each layer
-with torch.no_grad():
-    lm_logits, multiple_choice_logits = model(tokens_tensor, mc_token_ids)
-```
+Examples for each model class of each model architecture (Bert, GPT, GPT-2, Transformer-XL, XLNet and XLM) can be found in the [documentation](#documentation).

-### Transformer-XL
+## Quick tour: Fine-tuning/usage scripts

-Here is a quick-start example using `TransfoXLTokenizer`, `TransfoXLModel` and `TransfoXLModelLMHeadModel` class with the Transformer-XL model pre-trained on WikiText-103. See the [doc section](#doc) below for all the details on these classes.
+The library comprises several example scripts with state-of-the-art performance for NLU and NLG tasks:

-First let's prepare a tokenized input with `TransfoXLTokenizer`
+- fine-tuning Bert/XLNet/XLM with a *sequence-level classifier* on nine different GLUE tasks,
+- fine-tuning Bert/XLNet/XLM with a *token-level classifier* on the question answering dataset SQuAD 2.0, and
+- using GPT/GPT-2/Transformer-XL and XLNet for conditional language generation.

-```python
-import torch
-from pytorch_pretrained_bert import TransfoXLTokenizer, TransfoXLModel, TransfoXLLMHeadModel
+Here are three quick usage examples for these scripts:

-# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
-import logging
-logging.basicConfig(level=logging.INFO)
+### Fine-tuning for sequence classification: GLUE tasks examples

-# Load pre-trained model tokenizer (vocabulary from wikitext 103)
-tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
+The [General Language Understanding Evaluation (GLUE) benchmark](https://gluebenchmark.com/) is a collection of nine sentence- or sentence-pair language understanding tasks for evaluating and analyzing natural language understanding systems.

-# Tokenized input
-text_1 = "Who was Jim Henson ?"
-text_2 = "Jim Henson was a puppeteer"
-tokenized_text_1 = tokenizer.tokenize(text_1)
-tokenized_text_2 = tokenizer.tokenize(text_2)
+Before running any one of these GLUE tasks you should download the
+[GLUE data](https://gluebenchmark.com/tasks) by running
+[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
+and unpack it to some directory `$GLUE_DIR`.

-# Convert token to vocabulary indices
-indexed_tokens_1 = tokenizer.convert_tokens_to_ids(tokenized_text_1)
-indexed_tokens_2 = tokenizer.convert_tokens_to_ids(tokenized_text_2)
+```shell
+export GLUE_DIR=/path/to/glue
+export TASK_NAME=MRPC

-# Convert inputs to PyTorch tensors
-tokens_tensor_1 = torch.tensor([indexed_tokens_1])
-tokens_tensor_2 = torch.tensor([indexed_tokens_2])
+python run_bert_classifier.py \
+  --task_name $TASK_NAME \
+  --do_train \
+  --do_eval \
+  --do_lower_case \
+  --data_dir $GLUE_DIR/$TASK_NAME \
+  --bert_model bert-base-uncased \
+  --max_seq_length 128 \
+  --train_batch_size 32 \
+  --learning_rate 2e-5 \
+  --num_train_epochs 3.0 \
+  --output_dir /tmp/$TASK_NAME/
 ```

-Let's see how to use `TransfoXLModel` to get hidden states
-
-```python
-# Load pre-trained model (weights)
-model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
-model.eval()
-
-# If you have a GPU, put everything on cuda
-tokens_tensor_1 = tokens_tensor_1.to('cuda')
-tokens_tensor_2 = tokens_tensor_2.to('cuda')
-model.to('cuda')
-
-with torch.no_grad():
-    # Predict hidden states features for each layer
-    hidden_states_1, mems_1 = model(tokens_tensor_1)
-    # We can re-use the memory cells in a subsequent call to attend a longer context
-    hidden_states_2, mems_2 = model(tokens_tensor_2, mems=mems_1)
-```
+where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.

-And how to use `TransfoXLLMHeadModel`
+The dev set results will be written to the text file 'eval_results.txt' in the specified output_dir. In the case of MNLI, since there are two separate dev sets, matched and mismatched, there will be a separate output folder called '/tmp/MNLI-MM/' in addition to '/tmp/MNLI/'.
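
To make the output layout described above concrete, here is a small sketch that lists where the result files would be expected for a given task, including the extra `MNLI-MM` folder. The function name `eval_result_paths` and the `base_dir` default are assumptions for illustration; the example scripts do not provide such a helper.

```python
import posixpath


def eval_result_paths(task_name, base_dir="/tmp"):
    """List the expected eval_results.txt locations for a GLUE task.

    Hypothetical helper: MNLI gets a second folder for the mismatched dev set.
    """
    task_dirs = [task_name]
    if task_name == "MNLI":
        task_dirs.append("MNLI-MM")
    return [posixpath.join(base_dir, d, "eval_results.txt") for d in task_dirs]


print(eval_result_paths("MRPC"))
print(eval_result_paths("MNLI"))
```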

-```python
-# Load pre-trained model (weights)
-model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')
-model.eval()
+#### Fine-tuning XLNet model on the STS-B regression task

-# If you have a GPU, put everything on cuda
-tokens_tensor_1 = tokens_tensor_1.to('cuda')
-tokens_tensor_2 = tokens_tensor_2.to('cuda')
-model.to('cuda')
+This example code fine-tunes XLNet on the STS-B corpus using parallel training on a server with 4 V100 GPUs.
+Parallel training is a simple way to use several GPUs (but it is slower and less flexible than distributed training, see below).

-with torch.no_grad():
-    # Predict all tokens
-    predictions_1, mems_1 = model(tokens_tensor_1)
-    # We can re-use the memory cells in a subsequent call to attend a longer context
-    predictions_2, mems_2 = model(tokens_tensor_2, mems=mems_1)
+```shell
+export GLUE_DIR=/path/to/glue

-# get the predicted last token
-predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
-predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
-assert predicted_token == 'who'
+python ./examples/run_glue.py \
+    --model_type xlnet \
+    --model_name_or_path xlnet-large-cased \
+    --do_train  \
+    --task_name=sts-b     \
+    --data_dir=${GLUE_DIR}/STS-B  \
+    --output_dir=./proc_data/sts-b-110   \
+    --max_seq_length=128   \
+    --per_gpu_eval_batch_size=8   \
+    --per_gpu_train_batch_size=8   \
+    --gradient_accumulation_steps=1 \
+    --max_steps=1200  \
+    --model_name=xlnet-large-cased   \
+    --overwrite_output_dir   \
+    --overwrite_cache \
+    --warmup_steps=120
 ```

-### OpenAI GPT-2
-
-Here is a quick-start example using `GPT2Tokenizer`, `GPT2Model` and `GPT2LMHeadModel` class with OpenAI's pre-trained  model. See the [doc section](#doc) below for all the details on these classes.
-
-First let's prepare a tokenized input with `GPT2Tokenizer`
-
-```python
-import torch
-from pytorch_pretrained_bert import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel
-
-# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
-import logging
-logging.basicConfig(level=logging.INFO)
+On this machine we thus have a total batch size of 32 (4 GPUs with 8 examples each); if you have a smaller machine, increase `gradient_accumulation_steps` to reach the same batch size.
+These hyper-parameters give an evaluation Pearson correlation (`pearsonr`) of `0.918`.
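
The batch-size arithmetic above can be sketched as follows. This is a minimal illustration, assuming the effective batch size is the product of the GPU count, the per-GPU batch size and the number of accumulation steps; the helper names are hypothetical and not part of the example scripts.

```python
import math


def effective_batch_size(n_gpu, per_gpu_batch_size, accumulation_steps=1):
    """Number of examples contributing to one optimizer update."""
    return n_gpu * per_gpu_batch_size * accumulation_steps


def accumulation_steps_for(target_batch_size, n_gpu, per_gpu_batch_size):
    """Smallest accumulation step count that reaches the target batch size."""
    return math.ceil(target_batch_size / (n_gpu * per_gpu_batch_size))


# The 4-GPU server described above: 4 * 8 * 1
print(effective_batch_size(4, 8))  # 32
# A single-GPU machine needs 4 accumulation steps to match it
print(accumulation_steps_for(32, n_gpu=1, per_gpu_batch_size=8))  # 4
```

On a single GPU this would correspond to passing `--gradient_accumulation_steps=4` with the same `--per_gpu_train_batch_size=8`.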

-# Load pre-trained model tokenizer (vocabulary)
-tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
+#### Fine-tuning Bert model on the MRPC classification task

-# Encode some inputs
-text_1 = "Who was Jim Henson ?"
-text_2 = "Jim Henson was a puppeteer"
-indexed_tokens_1 = tokenizer.encode(text_1)
-indexed_tokens_2 = tokenizer.encode(text_2)
+This example code fine-tunes the Bert Whole Word Masking model on the Microsoft Research Paraphrase Corpus (MRPC) using distributed training on 8 V100 GPUs to reach an F1 > 92.

-# Convert inputs to PyTorch tensors
-tokens_tensor_1 = torch.tensor([indexed_tokens_1])
-tokens_tensor_2 = torch.tensor([indexed_tokens_2])
+```bash
+python -m torch.distributed.launch --nproc_per_node 8 ./examples/run_glue.py   \
+    --model_type bert \
+    --model_name_or_path bert-large-uncased-whole-word-masking \
+    --task_name MRPC \
+    --do_train   \
+    --do_eval   \
+    --do_lower_case   \
+    --data_dir $GLUE_DIR/MRPC/   \
+    --max_seq_length 128   \
+    --per_gpu_eval_batch_size=8   \
+    --per_gpu_train_batch_size=8   \
+    --learning_rate 2e-5   \
+    --num_train_epochs 3.0  \
+    --output_dir /tmp/mrpc_output/ \
+    --overwrite_output_dir   \
+    --overwrite_cache
 ```

-Let's see how to use `GPT2Model` to get hidden states
-
-```python
-# Load pre-trained model (weights)
-model = GPT2Model.from_pretrained('gpt2')
-model.eval()
-
-# If you have a GPU, put everything on cuda
-tokens_tensor_1 = tokens_tensor_1.to('cuda')
-tokens_tensor_2 = tokens_tensor_2.to('cuda')
-model.to('cuda')
+Training with these hyper-parameters gave us the following results:

-# Predict hidden states features for each layer
-with torch.no_grad():
-    hidden_states_1, past = model(tokens_tensor_1)
-    # past can be used to reuse precomputed hidden state in a subsequent predictions
-    # (see beam-search examples in the run_gpt2.py example).
-    hidden_states_2, past = model(tokens_tensor_2, past=past)
+```bash
+  acc = 0.8823529411764706
+  acc_and_f1 = 0.901702786377709
+  eval_loss = 0.3418912578906332
+  f1 = 0.9210526315789473
+  global_step = 174
+  loss = 0.07231863956341798
 ```
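
In the numbers above, `acc_and_f1` works out to the arithmetic mean of `acc` and `f1`. The sketch below checks this, assuming the results are stored as one `key = value` pair per line, as printed above; `parse_eval_results` is a hypothetical helper, not part of the library.

```python
def parse_eval_results(text):
    """Parse lines of the form `key = value` into a dict of floats."""
    results = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        results[key.strip()] = float(value)
    return results


# Three of the values reported above, copied verbatim
sample = """
  acc = 0.8823529411764706
  acc_and_f1 = 0.901702786377709
  f1 = 0.9210526315789473
"""

results = parse_eval_results(sample)
mean = (results["acc"] + results["f1"]) / 2
print(abs(mean - results["acc_and_f1"]) < 1e-9)  # True
```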

-And how to use `GPT2LMHeadModel`
+### Fine-tuning for question-answering: SQuAD example

-```python
-# Load pre-trained model (weights)
-model = GPT2LMHeadModel.from_pretrained('gpt2')
-model.eval()
+This example code fine-tunes the Bert Whole Word Masking uncased model on the SQuAD dataset using distributed training on 8 V100 GPUs to reach an F1 > 93:

-# If you have a GPU, put everything on cuda
-tokens_tensor_1 = tokens_tensor_1.to('cuda')
-tokens_tensor_2 = tokens_tensor_2.to('cuda')
-model.to('cuda')
-
-# Predict all tokens
-with torch.no_grad():
-    predictions_1, past = model(tokens_tensor_1)
-    # past can be used to reuse precomputed hidden state in a subsequent predictions
-    # (see beam-search examples in the run_gpt2.py example).
-    predictions_2, past = model(tokens_tensor_2, past=past)
-
-# get the predicted last token
-predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
-predicted_token = tokenizer.decode([predicted_index])
+```bash
+python -m torch.distributed.launch --nproc_per_node=8 run_squad.py \
+    --model_type bert \
+    --model_name_or_path bert-large-uncased-whole-word-masking \
+    --do_train \
+    --do_predict \
+    --do_lower_case \
+    --train_file $SQUAD_DIR/train-v1.1.json \
+    --predict_file $SQUAD_DIR/dev-v1.1.json \
+    --learning_rate 3e-5 \
+    --num_train_epochs 2 \
+    --max_seq_length 384 \
+    --doc_stride 128 \
+    --output_dir ../models/wwm_uncased_finetuned_squad/ \
+    --per_gpu_eval_batch_size=3   \
+    --per_gpu_train_batch_size=3
 ```

-And how to use `GPT2DoubleHeadsModel`
-
-```python
-# Load pre-trained model (weights)
-model = GPT2DoubleHeadsModel.from_pretrained('gpt2')
-model.eval()
-
-#  Prepare tokenized input
-text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
-text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
-tokenized_text1 = tokenizer.tokenize(text1)
-tokenized_text2 = tokenizer.tokenize(text2)
-indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
-indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
-tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
-mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])
+Training with these hyper-parameters gave us the following results:

-# Predict hidden states features for each layer
-with torch.no_grad():
-    lm_logits, multiple_choice_logits, past = model(tokens_tensor, mc_token_ids)
+```bash
+python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_uncased_finetuned_squad/predictions.json
+{"exact_match": 86.91579943235573, "f1": 93.1532499015869}
 ```

+This is the model provided as `bert-large-uncased-whole-word-masking-finetuned-squad`.

-## Doc
-
-Here is a detailed documentation of the classes in the package and how to use them:
-
-| Sub-section | Description |
-|-|-|
-| [Loading pre-trained weights](#loading-google-ai-or-openai-pre-trained-weights-or-pytorch-dump) | How to load Google AI/OpenAI's pre-trained weight or a PyTorch saved instance |
-| [Serialization best-practices](#serialization-best-practices) | How to save and reload a fine-tuned model |
-| [Configurations](#configurations) | API of the configuration classes for BERT, GPT, GPT-2 and Transformer-XL |
-| [Models](#models) | API of the PyTorch model classes for BERT, GPT, GPT-2 and Transformer-XL |
-| [Tokenizers](#tokenizers) | API of the tokenizers class for BERT, GPT, GPT-2 and Transformer-XL|
-| [Optimizers](#optimizers) |  API of the optimizers |
-
-### Loading Google AI or OpenAI pre-trained weights or PyTorch dump
+### Conditional generation: Text generation with GPT, GPT-2, Transformer-XL and XLNet

-### `from_pretrained()` method
+A conditional generation script is also included to generate text from a prompt.
+The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by Aman Rusia to get high-quality generation with memory models like Transformer-XL and XLNet (it includes a predefined text to make short inputs longer).

-To load one of Google AI's, OpenAI's pre-trained models or a PyTorch saved model (an instance of `BertForPreTraining` saved with `torch.save()`), the PyTorch model classes and the tokenizer can be instantiated using the `from_pretrained()` method:
+Here is how to run the script with the small version of the OpenAI GPT-2 model:

-```python
-model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH, cache_dir=None, from_tf=False, state_dict=None, *input, **kwargs)
+```shell
+python ./examples/run_generation.py \
+    --model_type=gpt2 \
+    --length=20 \
+    --model_name_or_path=gpt2
 ```

-where
+## Documentation

- `BERT_CLASS` is either a tokenizer to load the vocabulary (`BertTokenizer` or `OpenAIGPTTokenizer` classes) or one of the eight BERT or three OpenAI GPT PyTorch model classes (to load the pre-trained weights): `BertModel`, `BertForMaskedLM`, `BertForNextSentencePrediction`, `BertForPreTraining`, `BertForSequenceClassification`, `BertForTokenClassification`, `BertForMultipleChoice`, `BertForQuestionAnswering`, `OpenAIGPTModel`, `OpenAIGPTLMHeadModel` or `OpenAIGPTDoubleHeadsModel`, and
- `PRE_TRAINED_MODEL_NAME_OR_PATH` is either:
+The full documentation is available at https://huggingface.co/pytorch-transformers/.

-  - the shortcut name of a Google AI's or OpenAI's pre-trained model selected in the list:
+## Migrating from pytorch-pretrained-bert to pytorch-transformers

-    - `bert-base-uncased`: 12-layer, 768-hidden, 12-heads, 110M parameters
-    - `bert-large-uncased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
-    - `bert-base-cased`: 12-layer, 768-hidden, 12-heads , 110M parameters
-    - `bert-large-cased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
-    - `bert-base-multilingual-uncased`: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
-    - `bert-base-multilingual-cased`: **(New, recommended)** 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
-    - `bert-base-chinese`: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
-    - `bert-base-german-cased`: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters [Performance Evaluation](https://deepset.ai/german-bert)
-    - `bert-large-uncased-whole-word-masking`: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
-    - `bert-large-cased-whole-word-masking`: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the the tokens corresponding to a word at once)
-    - `bert-large-uncased-whole-word-masking-finetuned-squad`: The `bert-large-uncased-whole-word-masking` model finetuned on SQuAD (using the `run_squad.py` examples). Results: *exact_match: 86.91579943235573, f1: 93.1532499015869*
-    - `openai-gpt`: OpenAI GPT English model, 12-layer, 768-hidden, 12-heads, 110M parameters
-    - `gpt2`: OpenAI GPT-2 English model, 12-layer, 768-hidden, 12-heads, 117M parameters
-    - `gpt2-medium`: OpenAI GPT-2 English model, 24-layer, 1024-hidden, 16-heads, 345M parameters
-    - `transfo-xl-wt103`: Transformer-XL English model trained on wikitext-103, 18-layer, 1024-hidden, 16-heads, 257M parameters
+Here is a quick summary of what you should take care of when migrating from `pytorch-pretrained-bert` to `pytorch-transformers`:

-  - a path or url to a pretrained model archive containing:
+### Models always output `tuples`

-    - `bert_config.json` or `openai_gpt_config.json` a configuration file for the model, and
-    - `pytorch_model.bin` a PyTorch dump of a pre-trained instance of `BertForPreTraining`, `OpenAIGPTModel`, `TransfoXLModel`, `GPT2LMHeadModel` (saved with the usual `torch.save()`)
+The main breaking change when migrating from `pytorch-pretrained-bert` to `pytorch-transformers` is that the models' forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.

-  If `PRE_TRAINED_MODEL_NAME_OR_PATH` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links [here](pytorch_pretrained_bert/modeling.py)) and stored in a cache folder to avoid future download (the cache folder can be found at `~/.pytorch_pretrained_bert/`).
+The exact content of the tuples for each model is detailed in the models' docstrings and the [documentation](https://huggingface.co/pytorch-transformers/).

- `cache_dir` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example `cache_dir='./pretrained_model_{}'.format(args.local_rank)` (see the section on distributed training for more information).
- `from_tf`: should we load the weights from a locally saved TensorFlow checkpoint
- `state_dict`: an optional state dictionnary (collections.OrderedDict object) to use instead of Google pre-trained models
- `*inputs`, `**kwargs`: additional input for the specific Bert class (ex: num_labels for BertForSequenceClassification)
+In almost every case, you can take the first element of the output tuple in place of the output you previously used in `pytorch-pretrained-bert`.

+Here is a `pytorch-pretrained-bert` to `pytorch-transformers` conversion example for a `BertForSequenceClassification` classification model:

-`Uncased` means that the text has been lowercased before WordPiece tokenization, e.g., `John Smith` becomes `john smith`. The Uncased model also strips out any accent markers. `Cased` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the [Multilingual README](https://github.com/google-research/bert/blob/master/multilingual.md) or the original TensorFlow repository.
-
-**When using an `uncased model`, make sure to pass `--do_lower_case` to the example training scripts (or pass `do_lower_case=True` to FullTokenizer if you're using your own script and loading the tokenizer your-self.).**
-
-Examples:
 ```python
-# BERT
-tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, do_basic_tokenize=True)
+# Let's load our model
 model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

-# OpenAI GPT
-tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
-model = OpenAIGPTModel.from_pretrained('openai-gpt')
+# If you used to have this line in pytorch-pretrained-bert:
+loss = model(input_ids, labels=labels)

-# Transformer-XL
-tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
-model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
+# Now just use this line in pytorch-transformers to extract the loss from the output tuple:
+outputs = model(input_ids, labels=labels)
+loss = outputs[0]

-# OpenAI GPT-2
-tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-model = GPT2Model.from_pretrained('gpt2')
+# In pytorch-transformers you can also have access to the logits:
+loss, logits = outputs[:2]

+# And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation)
+model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=True)
+outputs = model(input_ids, labels=labels)
+loss, logits, attentions = outputs
 ```
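
For code that has to run against both packages during a migration, a small compatibility helper may be convenient. This is only a sketch: `first_output` is a hypothetical name, and plain Python values stand in for the tensors a real model would return.

```python
def first_output(model_output):
    """Return the first element of a tuple output, or the value itself otherwise."""
    if isinstance(model_output, tuple):
        return model_output[0]
    return model_output


# Stand-ins for real model outputs (floats and lists instead of tensors)
old_style_output = 0.42                # a bare loss, as a pre-migration model would return
new_style_output = (0.42, [0.1, 0.9])  # a (loss, logits) tuple, as after the migration

print(first_output(old_style_output) == first_output(new_style_output))  # True
```

Once the migration is complete, indexing the tuple directly (`outputs[0]`) is simpler and makes the new contract explicit.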

-#### Cache directory
-
-`pytorch_pretrained_bert` save the pretrained weights in a cache directory which is located at (in this order of priority):
+### Serialization

- `cache_dir` optional arguments to the `from_pretrained()` method (see above),
- shell environment variable `PYTORCH_PRETRAINED_BERT_CACHE`,
- PyTorch cache home + `/pytorch_pretrained_bert/`
-  where PyTorch cache home is defined by (in this order):
-  - shell environment variable `ENV_TORCH_HOME`
-  - shell environment variable `ENV_XDG_CACHE_HOME` + `/torch/`)
-  - default: `~/.cache/torch/`
+While not a breaking change, the serialization methods have been standardized and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.

-Usually, if you don't set any specific environment variable, `pytorch_pretrained_bert` cache will be at `~/.cache/torch/pytorch_pretrained_bert/`.
-
-You can alsways safely delete `pytorch_pretrained_bert` cache but the pretrained model weights and vocabulary files wil have to be re-downloaded from our S3.
-
-### Serialization best-practices
-
-This section explain how you can save and re-load a fine-tuned model (BERT, GPT, GPT-2 and Transformer-XL).
-There are three types of files you need to save to be able to reload a fine-tuned model:
-
- the model it-self which should be saved following PyTorch serialization [best practices](https://pytorch.org/docs/stable/notes/serialization.html#best-practices),
- the configuration file of the model which is saved as a JSON file, and
- the vocabulary (and the merges for the BPE-based models GPT and GPT-2).
-
-The *default filenames* of these files are as follow:
-
- the model weights file: `pytorch_model.bin`,
- the configuration file: `config.json`,
- the vocabulary file: `vocab.txt` for BERT and Transformer-XL, `vocab.json` for GPT/GPT-2 (BPE vocabulary),
- for GPT/GPT-2 (BPE vocabulary) the additional merges file: `merges.txt`.
-
-**If you save a model using these *default filenames*, you can then re-load the model and tokenizer using the `from_pretrained()` method.**
-
-Here is the recommended way of saving the model, configuration and vocabulary to an `output_dir` directory and reloading the model and tokenizer afterwards:
+Here is an example:

 ```python
-from pytorch_pretrained_bert import WEIGHTS_NAME, CONFIG_NAME
-
-output_dir = "./models/"
-
-# Step 1: Save a model, configuration and vocabulary that you have fine-tuned
-
-# If we have a distributed model, save only the encapsulated model
-# (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
-model_to_save = model.module if hasattr(model, 'module') else model
-
-# If we save using the predefined names, we can load using `from_pretrained`
-output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
-output_config_file = os.path.join(output_dir, CONFIG_NAME)
-
-torch.save(model_to_save.state_dict(), output_model_file)
-model_to_save.config.to_json_file(output_config_file)
-tokenizer.save_vocabulary(output_dir)
-
-# Step 2: Re-load the saved model and vocabulary
+### Let's load a model and tokenizer
+model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
+tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

-# Example for a Bert model
-model = BertForQuestionAnswering.from_pretrained(output_dir)
-tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=args.do_lower_case)  # Add specific options if needed
-# Example for a GPT model
-model = OpenAIGPTDoubleHeadsModel.from_pretrained(output_dir)
-tokenizer = OpenAIGPTTokenizer.from_pretrained(output_dir)
-```
+### Do some stuff to our model and tokenizer
+# Ex: add new tokens to the vocabulary and embeddings of our model
+tokenizer.add_tokens(['[SPECIAL_TOKEN_1]', '[SPECIAL_TOKEN_2]'])
+model.resize_token_embeddings(len(tokenizer))
+# Train our model
+train(model)

-Here is another way you can save and reload the model if you want to use specific paths for each type of files:
+### Now let's save our model and tokenizer to a directory
+model.save_pretrained('./my_saved_model_directory/')
+tokenizer.save_pretrained('./my_saved_model_directory/')

-```python
-output_model_file = "./models/my_own_model_file.bin"
-output_config_file = "./models/my_own_config_file.bin"
-output_vocab_file = "./models/my_own_vocab_file.bin"
-
-# Step 1: Save a model, configuration and vocabulary that you have fine-tuned
-
-# If we have a distributed model, save only the encapsulated model
-# (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
-model_to_save = model.module if hasattr(model, 'module') else model
-
-torch.save(model_to_save.state_dict(), output_model_file)
-model_to_save.config.to_json_file(output_config_file)
-tokenizer.save_vocabulary(output_vocab_file)
-
-# Step 2: Re-load the saved model and vocabulary
-
-# We didn't save using the predefined WEIGHTS_NAME, CONFIG_NAME names, we cannot load using `from_pretrained`.
-# Here is how to do it in this situation:
-
-# Example for a Bert model
-config = BertConfig.from_json_file(output_config_file)
-model = BertForQuestionAnswering(config)
-state_dict = torch.load(output_model_file)
-model.load_state_dict(state_dict)
-tokenizer = BertTokenizer(output_vocab_file, do_lower_case=args.do_lower_case)
-
-# Example for a GPT model
-config = OpenAIGPTConfig.from_json_file(output_config_file)
-model = OpenAIGPTDoubleHeadsModel(config)
-state_dict = torch.load(output_model_file)
-model.load_state_dict(state_dict)
-tokenizer = OpenAIGPTTokenizer(output_vocab_file)
+### Reload the model and the tokenizer
+model = BertForSequenceClassification.from_pretrained('./my_saved_model_directory/')
+tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')
 ```
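
If the model was wrapped in `torch.nn.DataParallel` or `DistributedDataParallel` for training, the wrapper itself does not expose `save_pretrained()`; the underlying model is reachable through its `.module` attribute. The sketch below shows the unwrapping step, with dummy classes standing in for a real model and wrapper so that it runs without PyTorch; `unwrap_model`, `DummyModel` and `DummyWrapper` are hypothetical names.

```python
def unwrap_model(model):
    """Return the wrapped model if `model` is a parallel wrapper, else `model` itself."""
    return model.module if hasattr(model, "module") else model


class DummyModel:
    """Stand-in for a pretrained model; records where it was asked to save."""

    def save_pretrained(self, save_directory):
        self.saved_to = save_directory


class DummyWrapper:
    """Stand-in for DataParallel: holds the real model in `.module`."""

    def __init__(self, module):
        self.module = module


model = DummyModel()
wrapped = DummyWrapper(model)

# Works the same whether or not the model is wrapped
unwrap_model(wrapped).save_pretrained('./my_saved_model_directory/')
print(model.saved_to)  # ./my_saved_model_directory/
```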

-### Configurations
-
-Models (BERT, GPT, GPT-2 and Transformer-XL) are defined and build from configuration classes which containes the parameters of the models (number of layers, dimensionalities...) and a few utilities to read and write from JSON configuration files. The respective configuration classes are:
-
- `BertConfig` for `BertModel` and BERT classes instances.
- `OpenAIGPTConfig` for `OpenAIGPTModel` and OpenAI GPT classes instances.
- `GPT2Config` for `GPT2Model` and OpenAI GPT-2 classes instances.
- `TransfoXLConfig` for `TransfoXLModel` and Transformer-XL classes instances.
-
-These configuration classes contains a few utilities to load and save configurations:
-
- `from_dict(cls, json_object)`: A class method to construct a configuration from a Python dictionary of parameters. Returns an instance of the configuration class.
- `from_json_file(cls, json_file)`: A class method to construct a configuration from a json file of parameters. Returns an instance of the configuration class.
- `to_dict()`: Serializes an instance to a Python dictionary. Returns a dictionary.
- `to_json_string()`: Serializes an instance to a JSON string. Returns a string.
- `to_json_file(json_file_path)`: Save an instance to a json file.
-
-### Models
-
-#### 1. `BertModel`
-
-`BertModel` is the basic BERT Transformer model with a layer of summed token, position and sequence embeddings followed by a series of identical self-attention blocks (12 for BERT-base, 24 for BERT-large).
-
-Instantiation:
-The model can be instantiated with the following arguments:
-
- `config`: a `BertConfig` class instance with the configuration to build a new model.
- `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
- `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient. This can be used to compute head importance metrics. Default: False
-
-The inputs and output are **identical to the TensorFlow model inputs and outputs**.
-
-We detail them here. This model takes as *inputs*:
-[`modeling.py`](./pytorch_pretrained_bert/modeling.py)
- `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary (see the tokens preprocessing logic in the scripts [`extract_features.py`](./examples/extract_features.py), [`run_classifier.py`](./examples/run_classifier.py) and [`run_squad.py`](./examples/run_squad.py)), and
- `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to a `sentence B` token (see BERT paper for more details).
- `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. It's a mask to be used if some input sequence lengths are smaller than the max input sequence length of the current batch. It's the mask that we typically use for attention when a batch has varying length sentences.
- `output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`.
- `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. It's a mask to be used to nullify some heads of the transformer. 0.0 => head is fully masked, 1.0 => head is not masked.
-
-This model *outputs* a tuple composed of:
-
- `encoded_layers`: controled by the value of the `output_encoded_layers` argument:
-
-  - `output_all_encoded_layers=True`: outputs a list of the encoded-hidden-states at the end of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
-  - `output_all_encoded_layers=False`: outputs only the encoded-hidden-states corresponding to the last attention block, i.e. a single torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
-
- `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a classifier pretrained on top of the hidden state associated to the first character of the input (`CLF`) to train on the Next-Sentence task (see BERT's paper).
-
-An example on how to use this class is given in the [`extract_features.py`](./examples/extract_features.py) script which can be used to extract the hidden states of the model for a given input.
-
-#### 2. `BertForPreTraining`
-
-`BertForPreTraining` includes the `BertModel` Transformer followed by the two pre-training heads:
-
- the masked language modeling head, and
- the next sentence classification head.
-
-*Inputs* comprises the inputs of the [`BertModel`](#-1.-`BertModel`) class plus two optional labels:
-
- `masked_lm_labels`: masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size]
- `next_sentence_label`: next sentence classification loss: torch.LongTensor of shape [batch_size] with indices selected in [0, 1]. 0 => next sentence is the continuation, 1 => next sentence is a random sentence.
-
-*Outputs*:
-
- if `masked_lm_labels` and `next_sentence_label` are not `None`: Outputs the total_loss which is the sum of the masked language modeling loss and the next sentence classification loss.
- if `masked_lm_labels` or `next_sentence_label` is `None`: Outputs a tuple comprising
-
-  - the masked language modeling logits, and
-  - the next sentence classification logits.
-
-There are two examples on how to use this class is given in the [`lm_finetuning/`](./examples/lm_finetuning/) directory. The scripts in this directory can be used to fine-tune the BERT language model. This should improve model performance, if the language style is different from the original BERT training corpus (Wiki + BookCorpus).
-
-
-#### 3. `BertForMaskedLM`
-
-`BertForMaskedLM` includes the `BertModel` Transformer followed by the (possibly) pre-trained  masked language modeling head.
-
-*Inputs* comprises the inputs of the [`BertModel`](#-1.-`BertModel`) class plus optional label:
-
- `masked_lm_labels`: masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size]
+### Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules

-*Outputs*:
+The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer.
+The new optimizer `AdamW` matches the PyTorch `Adam` optimizer API.

- if `masked_lm_labels` is not `None`: Outputs the masked language modeling loss.
- if `masked_lm_labels` is `None`: Outputs the masked language modeling logits.
+The schedules are now standard [PyTorch learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) and not part of the optimizer anymore.
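
To make the shape of a linear warmup and decay schedule concrete, here is a minimal sketch of the learning-rate multiplier as a plain Python function. The function name is hypothetical, and the step counts are borrowed from the STS-B example above (`--warmup_steps=120`, `--max_steps=1200`); it illustrates the schedule, not the library's implementation.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step.

    Rises linearly from 0 to 1 during warmup, then falls linearly to 0.
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


print(linear_warmup_decay(0, 120, 1200))     # 0.0
print(linear_warmup_decay(60, 120, 1200))    # 0.5
print(linear_warmup_decay(120, 120, 1200))   # 1.0
print(linear_warmup_decay(1200, 120, 1200))  # 0.0
```

A function of this shape, with the step counts bound (for example through `functools.partial`), can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`.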

-#### 4. `BertForNextSentencePrediction`
-
-`BertForNextSentencePrediction` includes the `BertModel` Transformer followed by the next sentence classification head.
-
-*Inputs* comprises the inputs of the [`BertModel`](#-1.-`BertModel`) class plus an optional label:
-
- `next_sentence_label`: next sentence classification loss: torch.LongTensor of shape [batch_size] with indices selected in [0, 1]. 0 => next sentence is the continuation, 1 => next sentence is a random sentence.
-
-*Outputs*:
-
- if `next_sentence_label` is not `None`: Outputs the next sentence classification loss.
- if `next_sentence_label` is `None`: Outputs the next sentence classification logits.
-
-#### 5. `BertForSequenceClassification`
-
-`BertForSequenceClassification` is a fine-tuning model that includes `BertModel` and a sequence-level (sequence or pair of sequences) classifier on top of the `BertModel`.
-
-The sequence-level classifier is a linear layer that takes as input the last hidden state of the first character in the input sequence (see Figures 3a and 3b in the BERT paper).
-
-An example on how to use this class is given in the [`run_classifier.py`](./examples/run_classifier.py) script which can be used to fine-tune a single sequence (or pair of sequence) classifier using BERT, for example for the MRPC task.
-
-#### 6. `BertForMultipleChoice`
-
-`BertForMultipleChoice` is a fine-tuning model that includes `BertModel` and a linear layer on top of the `BertModel`.
-
-The linear layer outputs a single value for each choice of a multiple choice problem, then all the outputs corresponding to an instance are passed through a softmax to get the model choice.
-
-This implementation is largely inspired by the work of OpenAI in [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) and the answer of Jacob Devlin in the following [issue](https://github.com/google-research/bert/issues/38).
-
-An example on how to use this class is given in the [`run_swag.py`](./examples/run_swag.py) script which can be used to fine-tune a multiple choice classifier using BERT, for example for the Swag task.
-
-#### 7. `BertForTokenClassification`
-
-`BertForTokenClassification` is a fine-tuning model that includes `BertModel` and a token-level classifier on top of the `BertModel`.
-
-The token-level classifier is a linear layer that takes as input the last hidden state of the sequence.
-
-#### 8. `BertForQuestionAnswering`
-
-`BertForQuestionAnswering` is a fine-tuning model that includes `BertModel` with a token-level classifiers on top of the full sequence of last hidden states.
-
-The token-level classifier takes as input the full sequence of the last hidden state and compute several (e.g. two) scores for each tokens that can for example respectively be the score that a given token is a `start_span` and a `end_span` token (see Figures 3c and 3d in the BERT paper).
-
-An example on how to use this class is given in the [`run_squad.py`](./examples/run_squad.py) script which can be used to fine-tune a token classifier using BERT, for example for the SQuAD task.
-
-#### 9. `OpenAIGPTModel`
-
-`OpenAIGPTModel` is the basic OpenAI GPT Transformer model with a layer of summed token and position embeddings followed by a series of 12 identical self-attention blocks.
-
-OpenAI GPT uses a single embedding matrix to store the word and special embeddings.
-Special token embeddings are embeddings for additional tokens that are not pre-trained: `[SEP]`, `[CLS]`...
-Special tokens need to be trained during fine-tuning if you use them.
-The number of special embeddings can be controlled using the `set_num_special_tokens(num_special_tokens)` function.
-
-The embeddings are ordered as follows in the token embedding matrix:
+Here is a conversion example from `BertAdam` with a linear warmup and decay schedule to `AdamW` and the same schedule:

 ```python
-    [0,                                                         ----------------------
-      ...                                                        -> word embeddings
-      config.vocab_size - 1,                                     ______________________
-      config.vocab_size,
-      ...                                                        -> special embeddings
-      config.vocab_size + config.n_special - 1]                  ______________________
-```
-
-where total_tokens_embeddings can be obtained as config.total_tokens_embeddings and is:
-    `total_tokens_embeddings = config.vocab_size + config.n_special`
-You should use the associated indices to index the embeddings.
-
-Instantiation:
-The model can be instantiated with the following arguments:
-
- `config`: a `OpenAIConfig` class instance with the configuration to build a new model.
- `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
- `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient. This can be used to compute head importance metrics. Default: False
-
-The inputs and output are **identical to the TensorFlow model inputs and outputs**.
-
-We detail them here. This model takes as *inputs*:
-[`modeling_openai.py`](./pytorch_pretrained_bert/modeling_openai.py)
- `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] (or more generally [d_1, ..., d_n, sequence_length] where d_1 ... d_n are arbitrary dimensions) with the word BPE token indices selected in the range [0, total_tokens_embeddings[
- `position_ids`: an optional torch.LongTensor with the same shape as input_ids
-    with the position indices (selected in the range [0, config.n_positions - 1]).
- `token_type_ids`: an optional torch.LongTensor with the same shape as input_ids
-    You can use it to add a third type of embedding to each input token in the sequence
-    (the previous two being the word and position embeddings). The input, position and token_type embeddings are summed inside the Transformer before the first self-attention block.
- `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. It's a mask to be used to nullify some heads of the transformer. 0.0 => head is fully masked, 1.0 => head is not masked.
-
-This model *outputs*:
- `hidden_states`: a list of all the encoded-hidden-states in the model (length of the list: number of layers + 1 for the output of the embeddings) as torch.FloatTensor of size [batch_size, sequence_length, hidden_size] (or more generally [d_1, ..., d_n, hidden_size] where d_1 ... d_n are the dimensions of input_ids)
-
-#### 10. `OpenAIGPTLMHeadModel`
-
-`OpenAIGPTLMHeadModel` includes the `OpenAIGPTModel` Transformer followed by a language modeling head with weights tied to the input embeddings (no additional parameters).
-
-*Inputs* are the same as the inputs of the [`OpenAIGPTModel`](#-9.-`OpenAIGPTModel`) class plus optional labels:
- `lm_labels`: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].
-
-*Outputs*:
- if `lm_labels` is not `None`:
-  Outputs the language modeling loss.
- else:
-  Outputs `lm_logits`: the language modeling logits as a torch.FloatTensor of size [batch_size, sequence_length, total_tokens_embeddings] (or more generally [d_1, ..., d_n, total_tokens_embeddings] where d_1 ... d_n are the dimensions of input_ids)
-
-#### 11. `OpenAIGPTDoubleHeadsModel`
-
-`OpenAIGPTDoubleHeadsModel` includes the `OpenAIGPTModel` Transformer followed by two heads:
- a language modeling head with weights tied to the input embeddings (no additional parameters) and:
- a multiple choice classifier (linear layer that takes as input a hidden state in a sequence to compute a score, see details in paper).
-
-*Inputs* are the same as the inputs of the [`OpenAIGPTModel`](#-9.-`OpenAIGPTModel`) class plus a classification mask and two optional labels:
- `multiple_choice_token_ids`: a torch.LongTensor of shape [batch_size, num_choices] with the index of the token whose hidden state should be used as input for the multiple choice classifier (usually the [CLS] token for each choice).
- `lm_labels`: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].
- `multiple_choice_labels`: optional multiple choice labels: torch.LongTensor of shape [batch_size] with indices selected in [0, ..., num_choices].
-
-*Outputs*:
- if `lm_labels` and `multiple_choice_labels` are not `None`:
-  Outputs a tuple of losses with the language modeling loss and the multiple choice loss.
- else Outputs a tuple with:
-  - `lm_logits`: the language modeling logits as a torch.FloatTensor of size [batch_size, num_choices, sequence_length, total_tokens_embeddings]
-  - `multiple_choice_logits`: the multiple choice logits as a torch.FloatTensor of size [batch_size, num_choices]
-
-#### 12. `TransfoXLModel`
-
-The Transformer-XL model is described in "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context".
-
-Transformer-XL uses relative positioning with sinusoidal patterns and adaptive softmax inputs, which means that:
-
- you don't need to specify positional embedding indices
- the tokens in the vocabulary have to be sorted by decreasing frequency.
-
-This model takes as *inputs*:
-[`modeling_transfo_xl.py`](./pytorch_pretrained_bert/modeling_transfo_xl.py)
- `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the token indices selected in the range [0, self.config.n_token[
- `mems`: an optional memory of hidden states from previous forward passes as a list (num layers) of hidden states at the entry of each layer. Each hidden state has shape [self.config.mem_len, bsz, self.config.d_model]. Note that the first two dimensions are transposed in `mems` with regard to `input_ids`.
-
-This model *outputs* a tuple of (last_hidden_state, new_mems)
- `last_hidden_state`: the encoded-hidden-states at the top of the model as a torch.FloatTensor of size [batch_size, sequence_length, self.config.d_model]
- `new_mems`: list (num layers) of updated mem states at the entry of each layer; each mem state is a torch.FloatTensor of size [self.config.mem_len, batch_size, self.config.d_model]. Note that the first two dimensions are transposed in `mems` with regard to `input_ids`.
-
-##### Extracting a list of the hidden states at each layer of the Transformer-XL from `last_hidden_state` and `new_mems`:
-The `new_mems` contain all the hidden states PLUS the output of the embeddings (`new_mems[0]`). `new_mems[-1]` is the output of the layer below the last layer and `last_hidden_state` is the output of the last layer (i.e. the input of the softmax when we have a language modeling head on top).
-
-There are two differences between the shapes of `new_mems` and `last_hidden_state`: `new_mems` have transposed first dimensions and are longer (of size `self.config.mem_len`). Here is how to extract the full list of hidden states from the model output:
-
-```python
-hidden_states, mems = model(tokens_tensor)
-seq_length = hidden_states.size(1)
-lower_hidden_states = list(t[-seq_length:, ...].transpose(0, 1) for t in mems)
-all_hidden_states = lower_hidden_states + [hidden_states]
-```
-
-#### 13. `TransfoXLLMHeadModel`
-
-`TransfoXLLMHeadModel` includes the `TransfoXLModel` Transformer followed by an (adaptive) softmax head with weights tied to the input embeddings.
-
-*Inputs* are the same as the inputs of the [`TransfoXLModel`](#-12.-`TransfoXLModel`) class plus optional labels:
- `target`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the target token indices selected in the range [0, self.config.n_token[
-
-*Outputs* a tuple of (softmax_output, new_mems)
- `softmax_output`: output of the (adaptive) softmax:
-  - if target is None: log probabilities of tokens, shape [batch_size, sequence_length, n_tokens] 
-  - else: Negative log likelihood of target tokens with shape [batch_size, sequence_length]
- `new_mems`: list (num layers) of updated mem states at the entry of each layer; each mem state is a torch.FloatTensor of size [self.config.mem_len, batch_size, self.config.d_model]. Note that the first two dimensions are transposed in `mems` with regard to `input_ids`.
-
-#### 14. `GPT2Model`
-
-`GPT2Model` is the OpenAI GPT-2 Transformer model with a layer of summed token and position embeddings followed by a series of 12 identical self-attention blocks.
-
-Instantiation:
-The model can be instantiated with the following arguments:
-
- `config`: a `GPT2Config` class instance with the configuration to build a new model.
- `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
- `keep_multihead_output`: If True, saves output of the multi-head attention module with its gradient. This can be used to compute head importance metrics. Default: False
-
-The inputs and output are **identical to the TensorFlow model inputs and outputs**.
-
-We detail them here. This model takes as *inputs*:
-[`modeling_gpt2.py`](./pytorch_pretrained_bert/modeling_gpt2.py)
- `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] (or more generally [d_1, ..., d_n, sequence_length] where d_1 ... d_n are arbitrary dimensions) with the word BPE token indices selected in the range [0, vocab_size[
- `position_ids`: an optional torch.LongTensor with the same shape as input_ids
-    with the position indices (selected in the range [0, config.n_positions - 1]).
- `token_type_ids`: an optional torch.LongTensor with the same shape as input_ids
-    You can use it to add a third type of embedding to each input token in the sequence
-    (the previous two being the word and position embeddings). The input, position and token_type embeddings are summed inside the Transformer before the first self-attention block.
- `past`: an optional list of torch.LongTensor that contains pre-computed hidden-states (key and values in the attention blocks) to speed up sequential decoding (this is the `presents` output of the model, cf. below).
- `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1. It's a mask to be used to nullify some heads of the transformer. 0.0 => head is fully masked, 1.0 => head is not masked.
-
-This model *outputs*:
- `hidden_states`: a list of all the encoded-hidden-states in the model (length of the list: number of layers + 1 for the output of the embeddings) as torch.FloatTensor of size [batch_size, sequence_length, hidden_size] (or more generally [d_1, ..., d_n, hidden_size] where d_1 ... d_n are the dimensions of input_ids)
- `presents`: a list of pre-computed hidden-states (key and values in each attention block) as torch.FloatTensors. They can be reused to speed up sequential decoding (see the `run_gpt2.py` example).
-
-#### 15. `GPT2LMHeadModel`
-
-`GPT2LMHeadModel` includes the `GPT2Model` Transformer followed by a language modeling head with weights tied to the input embeddings (no additional parameters).
-
-*Inputs* are the same as the inputs of the [`GPT2Model`](#-14.-`GPT2Model`) class plus optional labels:
- `lm_labels`: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].
-
-*Outputs*:
- if `lm_labels` is not `None`:
-  Outputs the language modeling loss.
- else: a tuple of
-  - `lm_logits`: the language modeling logits as a torch.FloatTensor of size [batch_size, sequence_length, total_tokens_embeddings] (or more generally [d_1, ..., d_n, total_tokens_embeddings] where d_1 ... d_n are the dimensions of input_ids)
-  - `presents`: a list of pre-computed hidden-states (key and values in each attention block) as torch.FloatTensors. They can be reused to speed up sequential decoding (see the `run_gpt2.py` example).
-
-#### 16. `GPT2DoubleHeadsModel`
-
-`GPT2DoubleHeadsModel` includes the `GPT2Model` Transformer followed by two heads:
- a language modeling head with weights tied to the input embeddings (no additional parameters) and:
- a multiple choice classifier (linear layer that takes as input a hidden state in a sequence to compute a score, see details in paper).
-
-*Inputs* are the same as the inputs of the [`GPT2Model`](#-14.-`GPT2Model`) class plus a classification mask and two optional labels:
- `multiple_choice_token_ids`: a torch.LongTensor of shape [batch_size, num_choices] with the index of the token whose hidden state should be used as input for the multiple choice classifier (usually the [CLS] token for each choice).
- `lm_labels`: optional language modeling labels: torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss is only computed for the labels set in [0, ..., vocab_size].
- `multiple_choice_labels`: optional multiple choice labels: torch.LongTensor of shape [batch_size] with indices selected in [0, ..., num_choices].
-
-*Outputs*:
- if `lm_labels` and `multiple_choice_labels` are not `None`:
-  Outputs a tuple of losses with the language modeling loss and the multiple choice loss.
- else Outputs a tuple with:
-  - `lm_logits`: the language modeling logits as a torch.FloatTensor of size [batch_size, num_choices, sequence_length, total_tokens_embeddings]
-  - `multiple_choice_logits`: the multiple choice logits as a torch.FloatTensor of size [batch_size, num_choices]
-  - `presents`: a list of pre-computed hidden-states (key and values in each attention block) as torch.FloatTensors. They can be reused to speed up sequential decoding (see the `run_gpt2.py` example).
-
-### Tokenizers
-
-#### `BertTokenizer`
-
-`BertTokenizer` performs end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.
-
-This class has five arguments:
-
- `vocab_file`: path to a vocabulary file.
- `do_lower_case`: convert text to lower-case while tokenizing. **Default = True**.
- `max_len`: max length to filter the input of the Transformer. Default to pre-trained value for the model if `None`. **Default = None**
- `do_basic_tokenize`: Do basic tokenization before WordPiece tokenization. Set to `False` if text is pre-tokenized. **Default = True**.
- `never_split`: a list of tokens that should not be split during tokenization. **Default = `["[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"]`**
-
-and four methods:
-
- `tokenize(text)`: convert a `str` into a list of `str` tokens by (1) performing basic tokenization and (2) WordPiece tokenization.
- `convert_tokens_to_ids(tokens)`: convert a list of `str` tokens into a list of `int` indices in the vocabulary.
- `convert_ids_to_tokens(tokens)`: convert a list of `int` indices into a list of `str` tokens in the vocabulary.
- `save_vocabulary(directory_path)`: save the vocabulary file to `directory_path`. Return the path to the saved vocabulary file: `vocab_file_path`. The vocabulary can be reloaded with `BertTokenizer.from_pretrained('vocab_file_path')` or `BertTokenizer.from_pretrained('directory_path')`.
-
-Please refer to the doc strings and code in [`tokenization.py`](./pytorch_pretrained_bert/tokenization.py) for the details of the `BasicTokenizer` and `WordpieceTokenizer` classes. In general it is recommended to use `BertTokenizer` unless you know what you are doing.
-
-#### `OpenAIGPTTokenizer`
-
-`OpenAIGPTTokenizer` performs Byte-Pair-Encoding (BPE) tokenization.
-
-This class has four arguments:
-
- `vocab_file`: path to a vocabulary file.
- `merges_file`: path to a file containing the BPE merges.
- `max_len`: max length to filter the input of the Transformer. Default to pre-trained value for the model if `None`. **Default = None**
- `special_tokens`: a list of tokens to add to the vocabulary for fine-tuning. If SpaCy is not installed and BERT's `BasicTokenizer` is used as the pre-BPE tokenizer, these tokens are not split. **Default= None**
-
-and seven methods:
-
- `tokenize(text)`: convert a `str` into a list of `str` tokens by performing BPE tokenization.
- `convert_tokens_to_ids(tokens)`: convert a list of `str` tokens into a list of `int` indices in the vocabulary.
- `convert_ids_to_tokens(tokens)`: convert a list of `int` indices into a list of `str` tokens in the vocabulary.
- `set_special_tokens(self, special_tokens)`: update the list of special tokens (see above arguments)
- `encode(text)`: convert a `str` into a list of `int` tokens by performing BPE encoding.
- `decode(ids, skip_special_tokens=False, clean_up_tokenization_spaces=False)`: decode a list of `int` indices into a string and do some post-processing if needed: (i) remove special tokens from the output and (ii) clean up tokenization spaces.
- `save_vocabulary(directory_path)`: save the vocabulary, merge and special tokens files to `directory_path`. Return the path to the three files: `vocab_file_path`, `merge_file_path`, `special_tokens_file_path`. The vocabulary can be reloaded with `OpenAIGPTTokenizer.from_pretrained('directory_path')`.
-
-Please refer to the doc strings and code in [`tokenization_openai.py`](./pytorch_pretrained_bert/tokenization_openai.py) for the details of the `OpenAIGPTTokenizer`.
-
-#### `TransfoXLTokenizer`
-
-`TransfoXLTokenizer` performs word tokenization. This tokenizer can be used for adaptive softmax and has utilities for counting tokens in a corpus to create a vocabulary ordered by token frequency (for adaptive softmax). See the adaptive softmax paper ([Efficient softmax approximation for GPUs](http://arxiv.org/abs/1609.04309)) for more details.
-
-The API is similar to the API of `BertTokenizer` (see above).
-
-Please refer to the doc strings and code in [`tokenization_transfo_xl.py`](./pytorch_pretrained_bert/tokenization_transfo_xl.py) for the details of these additional methods in `TransfoXLTokenizer`.
-
-#### `GPT2Tokenizer`
-
-`GPT2Tokenizer` performs byte-level Byte-Pair-Encoding (BPE) tokenization.
-
-This class has three arguments:
-
- `vocab_file`: path to a vocabulary file.
- `merges_file`: path to a file containing the BPE merges.
- `errors`: How to handle unicode decoding errors. **Default = `replace`**
-
-and seven methods:
-
- `tokenize(text)`: convert a `str` into a list of `str` tokens by performing byte-level BPE.
- `convert_tokens_to_ids(tokens)`: convert a list of `str` tokens into a list of `int` indices in the vocabulary.
- `convert_ids_to_tokens(tokens)`: convert a list of `int` indices into a list of `str` tokens in the vocabulary.
- `set_special_tokens(self, special_tokens)`: update the list of special tokens (see above arguments)
- `encode(text)`: convert a `str` into a list of `int` tokens by performing byte-level BPE.
- `decode(tokens)`: convert a list of `int` tokens back into a `str`.
- `save_vocabulary(directory_path)`: save the vocabulary, merge and special tokens files to `directory_path`. Return the path to the three files: `vocab_file_path`, `merge_file_path`, `special_tokens_file_path`. The vocabulary can be reloaded with `GPT2Tokenizer.from_pretrained('directory_path')`.
-
-Please refer to [`tokenization_gpt2.py`](./pytorch_pretrained_bert/tokenization_gpt2.py) for more details on the `GPT2Tokenizer`.
-
-### Optimizers
-
-#### `BertAdam`
-
-`BertAdam` is a `torch.optim.Optimizer` adapted to be closer to the optimizer used in the TensorFlow implementation of BERT. The differences from the PyTorch Adam optimizer are the following:
-
- BertAdam implements weight decay fix,
- BertAdam doesn't compensate for bias as in the regular Adam optimizer.
-
-The optimizer accepts the following arguments:
-
- `lr` : learning rate
- `warmup` : portion of `t_total` for the warmup, `-1`  means no warmup. Default : `-1`
- `t_total` : total number of training steps for the learning
-    rate schedule, `-1`  means constant learning rate. Default : `-1`
- `schedule` : schedule to use for the warmup (see above).
-    Can be `'warmup_linear'`, `'warmup_constant'`, `'warmup_cosine'`, `'none'`, `None` or a `_LRSchedule` object (see below).
-    If `None` or `'none'`, learning rate is always kept constant.
-    Default : `'warmup_linear'`
- `betas` : Adam's betas. Default : `0.9, 0.999`
- `e` : Adam's epsilon. Default : `1e-6`
- `weight_decay` : Weight decay. Default : `0.01`
- `max_grad_norm` : Maximum norm for the gradients (`-1` means no clipping). Default : `1.0`
-
-#### `OpenAIAdam`
-
-`OpenAIAdam` is similar to `BertAdam`.
-The difference from `BertAdam` is that `OpenAIAdam` compensates for bias as in the regular Adam optimizer.
-
-`OpenAIAdam` accepts the same arguments as `BertAdam`.
-
-#### Learning Rate Schedules
-The `.optimization` module also provides additional schedules in the form of schedule objects that inherit from `_LRSchedule`.
-All `_LRSchedule` subclasses accept `warmup` and `t_total` arguments at construction.
-When an `_LRSchedule` object is passed into `BertAdam` or `OpenAIAdam`, 
-the `warmup` and `t_total` arguments on the optimizer are ignored and the ones in the `_LRSchedule` object are used. 
-An overview of the implemented schedules:
- `ConstantLR`: always returns learning rate 1.
- `WarmupConstantSchedule`: Linearly increases learning rate from 0 to 1 over `warmup` fraction of training steps.
-    Keeps learning rate equal to 1. after warmup.
-    ![](docs/imgs/warmup_constant_schedule.png)
- `WarmupLinearSchedule`: Linearly increases learning rate from 0 to 1 over `warmup` fraction of training steps.
-    Linearly decreases learning rate from 1. to 0. over remaining `1 - warmup` steps.
-    ![](docs/imgs/warmup_linear_schedule.png)
-  `WarmupCosineSchedule`: Linearly increases learning rate from 0 to 1 over `warmup` fraction of training steps.
-    Decreases learning rate from 1. to 0. over remaining `1 - warmup` steps following a cosine curve.
-    If `cycles` (default=0.5) is different from default, learning rate follows cosine function after warmup.
-    ![](docs/imgs/warmup_cosine_schedule.png)
- `WarmupCosineWithHardRestartsSchedule`: Linearly increases learning rate from 0 to 1 over `warmup` fraction of training steps.
-    If `cycles` (default=1.) is different from default, learning rate follows `cycles` times a cosine decaying learning rate (with hard restarts).
-    ![](docs/imgs/warmup_cosine_hard_restarts_schedule.png)
- `WarmupCosineWithWarmupRestartsSchedule`: All training progress is divided in `cycles` (default=1.) parts of equal length.
-    Every part follows a schedule with the first `warmup` fraction of the training steps linearly increasing from 0. to 1.,
-    followed by a learning rate decreasing from 1. to 0. following a cosine curve.
-    Note that the total number of all warmup steps over all cycles together is equal to `warmup` * `cycles`
-    ![](docs/imgs/warmup_cosine_warm_restarts_schedule.png)
-
-## Examples
-
-| Sub-section | Description |
-|-|-|
-| [Training large models: introduction, tools and examples](#Training-large-models-introduction,-tools-and-examples) | How to use gradient-accumulation, multi-gpu training, distributed training, optimize on CPU and 16-bits training to train Bert models |
-| [Fine-tuning with BERT: running the examples](#Fine-tuning-with-BERT-running-the-examples) | Running the examples in [`./examples`](./examples/): `extract_classif.py`, `run_classifier.py`, `run_squad.py` and `lm_finetuning/simple_lm_finetuning.py` |
-| [Fine-tuning with OpenAI GPT, Transformer-XL and GPT-2](#openai-gpt-transformer-xl-and-gpt-2-running-the-examples) | Running the examples in [`./examples`](./examples/): `run_openai_gpt.py`, `run_transfo_xl.py` and `run_gpt2.py` |
-| [Fine-tuning BERT-large on GPUs](#Fine-tuning-BERT-large-on-GPUs) | How to fine tune `BERT large`|
-
-### Training large models: introduction, tools and examples
-
-BERT-base and BERT-large are 110M- and 340M-parameter models, respectively, and it can be difficult to fine-tune them on a single GPU with the recommended batch size for good performance (in most cases a batch size of 32).
-
-To help with fine-tuning these models, we have included several techniques that you can activate in the fine-tuning scripts [`run_classifier.py`](./examples/run_classifier.py) and [`run_squad.py`](./examples/run_squad.py): gradient-accumulation, multi-gpu training, distributed training and 16-bits training. For more details on how to use these techniques you can read [the tips on training large batches in PyTorch](https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255) that I published earlier this month.
-
-Here is how to use these techniques in our scripts:
-
- **Gradient Accumulation**: Gradient accumulation can be used by supplying an integer greater than 1 to the `--gradient_accumulation_steps` argument. The batch at each step will be divided by this integer and the gradient will be accumulated over `gradient_accumulation_steps` steps.
- **Multi-GPU**: Multi-GPU is automatically activated when several GPUs are detected and the batches are split over the GPUs.
- **Distributed training**: Distributed training can be activated by supplying an integer greater than or equal to 0 to the `--local_rank` argument (see below).
- **16-bits training**: 16-bits training, also called mixed-precision training, can reduce the memory requirement of your model on the GPU by using half-precision training, basically allowing to double the batch size. If you have a recent GPU (starting from NVIDIA Volta architecture) you should see no decrease in speed. A good introduction to Mixed precision training can be found [here](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) and a full documentation is [here](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html). In our scripts, this option can be activated by setting the `--fp16` flag and you can play with loss scaling using the `--loss_scale` flag (see the previously linked documentation for details on loss scaling). The loss scale can be zero in which case the scale is dynamically adjusted or a positive power of two in which case the scaling is static.
-
-To use 16-bits training and distributed training, you need to install NVIDIA's apex extension [as detailed here](https://github.com/nvidia/apex). You will find more information regarding the internals of `apex` and how to use `apex` in [the doc and the associated repository](https://github.com/nvidia/apex). The results of the tests performed on pytorch-BERT by the NVIDIA team (and my trials at reproducing them) can be consulted in [the relevant PR of the present repository](https://github.com/huggingface/pytorch-pretrained-BERT/pull/116).
-
-Note: To use *Distributed Training*, you will need to run one training script on each of your machines. This can be done for example by running the following command on each server (see [the above mentioned blog post](https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255) for more details):
-```bash
-python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=$THIS_MACHINE_INDEX --master_addr="192.168.1.1" --master_port=1234 run_classifier.py (--arg1 --arg2 --arg3 and all other arguments of the run_classifier script)
-```
-Where `$THIS_MACHINE_INDEX` is a sequential index assigned to each of your machines (0, 1, 2...) and the machine with rank 0 has an IP address `192.168.1.1` and an open port `1234`.
-
-### Fine-tuning with BERT: running the examples
-
-We showcase several fine-tuning examples based on (and extended from) [the original implementation](https://github.com/google-research/bert/):
-
- a *sequence-level classifier* on nine different GLUE tasks,
- a *token-level classifier* on the question answering dataset SQuAD, and
- a *sequence-level multiple-choice classifier* on the SWAG classification corpus.
- a *BERT language model* on another target corpus
-
-#### GLUE results on dev set
-
-We get the following results on the dev set of GLUE benchmark with an uncased BERT base 
-model. All experiments were run on a P100 GPU with a batch size of 32.
-
-| Task | Metric | Result |
-|-|-|-|
-| CoLA | Matthew's corr. | 57.29 |
-| SST-2 | accuracy | 93.00 |
-| MRPC | F1/accuracy | 88.85/83.82 |
-| STS-B | Pearson/Spearman corr. | 89.70/89.37 |
-| QQP | accuracy/F1 | 90.72/87.41 |
-| MNLI | matched acc./mismatched acc.| 83.95/84.39 |
-| QNLI | accuracy | 89.04 |
-| RTE | accuracy | 61.01 |
-| WNLI | accuracy | 53.52 |
-
-Some of these results are significantly different from the ones reported on the test set
-of GLUE benchmark on the website. For QQP and WNLI, please refer to [FAQ #12](https://gluebenchmark.com/faq) on the website.
-
-Before running any one of these GLUE tasks you should download the
-[GLUE data](https://gluebenchmark.com/tasks) by running
-[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
-and unpack it to some directory `$GLUE_DIR`.
-
-```shell
-export GLUE_DIR=/path/to/glue
-export TASK_NAME=MRPC
-
-python run_classifier.py \
-  --task_name $TASK_NAME \
-  --do_train \
-  --do_eval \
-  --do_lower_case \
-  --data_dir $GLUE_DIR/$TASK_NAME \
-  --bert_model bert-base-uncased \
-  --max_seq_length 128 \
-  --train_batch_size 32 \
-  --learning_rate 2e-5 \
-  --num_train_epochs 3.0 \
-  --output_dir /tmp/$TASK_NAME/
-```
-
-where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
-
-The dev set results will be present within the text file 'eval_results.txt' in the specified output_dir. In case of MNLI, since there are two separate dev sets, matched and mismatched, there will be a separate output folder called '/tmp/MNLI-MM/' in addition to '/tmp/MNLI/'.
-
-The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI, CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being said, there shouldn't be any issues in running half-precision training with the remaining GLUE tasks as well, since the data processor for each task inherits from the base class DataProcessor.
-
-#### MRPC
-
-This example code fine-tunes BERT on the Microsoft Research Paraphrase
-Corpus (MRPC) and runs in less than 10 minutes on a single K-80 and in 27 seconds (!) on a single Tesla V100 16GB with apex installed.
-
-Before running this example you should download the
-[GLUE data](https://gluebenchmark.com/tasks) by running
-[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
-and unpack it to some directory `$GLUE_DIR`.
-
-```shell
-export GLUE_DIR=/path/to/glue
-
-python run_classifier.py \
-  --task_name MRPC \
-  --do_train \
-  --do_eval \
-  --do_lower_case \
-  --data_dir $GLUE_DIR/MRPC/ \
-  --bert_model bert-base-uncased \
-  --max_seq_length 128 \
-  --train_batch_size 32 \
-  --learning_rate 2e-5 \
-  --num_train_epochs 3.0 \
-  --output_dir /tmp/mrpc_output/
-```
-
-Our tests, run on a few seeds with [the original implementation hyper-parameters](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks), gave evaluation results between 84% and 88%.
-
-**Fast run with apex and 16 bit precision: fine-tuning on MRPC in 27 seconds!**
-First install apex as indicated [here](https://github.com/NVIDIA/apex).
-Then run
-```shell
-export GLUE_DIR=/path/to/glue
-
-python run_classifier.py \
-  --task_name MRPC \
-  --do_train \
-  --do_eval \
-  --do_lower_case \
-  --data_dir $GLUE_DIR/MRPC/ \
-  --bert_model bert-base-uncased \
-  --max_seq_length 128 \
-  --train_batch_size 32 \
-  --learning_rate 2e-5 \
-  --num_train_epochs 3.0 \
-  --output_dir /tmp/mrpc_output/ \
-  --fp16
-```
-
-**Distributed training**
-Here is an example using distributed training on 8 V100 GPUs and the BERT Whole Word Masking model to reach an F1 > 92 on MRPC:
-
-```bash
-python -m torch.distributed.launch --nproc_per_node 8 run_classifier.py   --bert_model bert-large-uncased-whole-word-masking    --task_name MRPC --do_train   --do_eval   --do_lower_case   --data_dir $GLUE_DIR/MRPC/   --max_seq_length 128   --train_batch_size 8   --learning_rate 2e-5   --num_train_epochs 3.0  --output_dir /tmp/mrpc_output/
-```
-
-Training with these hyper-parameters gave us the following results:
-```bash
-  acc = 0.8823529411764706
-  acc_and_f1 = 0.901702786377709
-  eval_loss = 0.3418912578906332
-  f1 = 0.9210526315789473
-  global_step = 174
-  loss = 0.07231863956341798
-```
-
-Here is an example on MNLI:
-
-```bash
-python -m torch.distributed.launch --nproc_per_node 8 run_classifier.py   --bert_model bert-large-uncased-whole-word-masking    --task_name mnli --do_train   --do_eval   --do_lower_case   --data_dir /datadrive/bert_data/glue_data//MNLI/   --max_seq_length 128   --train_batch_size 8   --learning_rate 2e-5   --num_train_epochs 3.0   --output_dir ../models/wwm-uncased-finetuned-mnli/ --overwrite_output_dir
-```
-
-```bash
-***** Eval results *****
-  acc = 0.8679706601466992
-  eval_loss = 0.4911287787382479
-  global_step = 18408
-  loss = 0.04755385363816904
-
-***** Eval results *****
-  acc = 0.8747965825874695
-  eval_loss = 0.45516540421714036
-  global_step = 18408
-  loss = 0.04755385363816904
-```
-
-This example corresponds to the `bert-large-uncased-whole-word-masking-finetuned-mnli` model.
-
-
-#### SQuAD
-
-This example code fine-tunes BERT on the SQuAD dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large) on a single tesla V100 16GB.
-
-The data for SQuAD can be downloaded with the following links and should be saved in a `$SQUAD_DIR` directory.
-
-*   [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
-*   [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
-*   [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)
-
-```shell
-export SQUAD_DIR=/path/to/SQUAD
-
-python run_squad.py \
-  --bert_model bert-base-uncased \
-  --do_train \
-  --do_predict \
-  --do_lower_case \
-  --train_file $SQUAD_DIR/train-v1.1.json \
-  --predict_file $SQUAD_DIR/dev-v1.1.json \
-  --train_batch_size 12 \
-  --learning_rate 3e-5 \
-  --num_train_epochs 2.0 \
-  --max_seq_length 384 \
-  --doc_stride 128 \
-  --output_dir /tmp/debug_squad/
-```
-
-Training with the previous hyper-parameters gave us the following results:
-```bash
-python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json /tmp/debug_squad/predictions.json
-{"f1": 88.52381567990474, "exact_match": 81.22043519394512}
-```
-
-**distributed training**
-
-Here is an example using distributed training on 8 V100 GPUs and the BERT Whole Word Masking uncased model to reach an F1 > 93 on SQuAD:
-
-```bash
-python -m torch.distributed.launch --nproc_per_node=8 \
- run_squad.py \
- --bert_model bert-large-uncased-whole-word-masking  \
- --do_train \
- --do_predict \
- --do_lower_case \
- --train_file $SQUAD_DIR/train-v1.1.json \
- --predict_file $SQUAD_DIR/dev-v1.1.json \
- --learning_rate 3e-5 \
- --num_train_epochs 2 \
- --max_seq_length 384 \
- --doc_stride 128 \
- --output_dir ../models/wwm_uncased_finetuned_squad/ \
- --train_batch_size 24 \
- --gradient_accumulation_steps 12
-```
-
-Training with these hyper-parameters gave us the following results:
-```bash
-python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_uncased_finetuned_squad/predictions.json
-{"exact_match": 86.91579943235573, "f1": 93.1532499015869}
-```
-
-This is the model provided as `bert-large-uncased-whole-word-masking-finetuned-squad`.
-
-And here is the model provided as `bert-large-cased-whole-word-masking-finetuned-squad`:
-
-```bash
-python -m torch.distributed.launch --nproc_per_node=8  run_squad.py  --bert_model bert-large-cased-whole-word-masking   --do_train  --do_predict  --do_lower_case  --train_file $SQUAD_DIR/train-v1.1.json  --predict_file $SQUAD_DIR/dev-v1.1.json  --learning_rate 3e-5  --num_train_epochs 2  --max_seq_length 384  --doc_stride 128  --output_dir ../models/wwm_cased_finetuned_squad/  --train_batch_size 24  --gradient_accumulation_steps 12
-```
-
-Training with these hyper-parameters gave us the following results:
-```bash
-python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_cased_finetuned_squad/predictions.json
-{"exact_match": 84.18164616840113, "f1": 91.58645594850135}
-```
-
-#### SWAG
-
-The data for SWAG can be downloaded by cloning the following [repository](https://github.com/rowanz/swagaf)
-
-```shell
-export SWAG_DIR=/path/to/SWAG
-
-python run_swag.py \
-  --bert_model bert-base-uncased \
-  --do_train \
-  --do_lower_case \
-  --do_eval \
-  --data_dir $SWAG_DIR/data \
-  --train_batch_size 16 \
-  --learning_rate 2e-5 \
-  --num_train_epochs 3.0 \
-  --max_seq_length 80 \
-  --output_dir /tmp/swag_output/ \
-  --gradient_accumulation_steps 4
-```
-
-Training with the previous hyper-parameters on a single GPU gave us the following results:
-```
-eval_accuracy = 0.8062081375587323
-eval_loss = 0.5966546792367169
-global_step = 13788
-loss = 0.06423990014260186
-```
-
-#### LM Fine-tuning
-
-The data should be a text file in the same format as [sample_text.txt](./samples/sample_text.txt)  (one sentence per line, docs separated by empty line).
-You can download an [exemplary training corpus](https://ext-bert-sample.obs.eu-de.otc.t-systems.com/small_wiki_sentence_corpus.txt) generated from Wikipedia articles and split into ~500k sentences with spaCy.
-Training one epoch on this corpus takes about 1:20h on 4 x NVIDIA Tesla P100 with `train_batch_size=200` and `max_seq_length=128`:
-
-
-Thanks to the work of @Rocketknight1 and @tholor there are now **several scripts** that can be used to fine-tune BERT using the pretraining objective (combination of masked-language modeling and next sentence prediction loss). These scripts are detailed in the [`README`](./examples/lm_finetuning/README.md) of the [`examples/lm_finetuning/`](./examples/lm_finetuning/) folder.
-
-### OpenAI GPT, Transformer-XL and GPT-2: running the examples
-
-We provide three examples of scripts for OpenAI GPT, Transformer-XL and OpenAI GPT-2 based on (and extended from) the respective original implementations:
-
-- fine-tuning OpenAI GPT on the ROCStories dataset
-- evaluating Transformer-XL on Wikitext 103
-- unconditional and conditional generation from a pre-trained OpenAI GPT-2 model
-
-#### Fine-tuning OpenAI GPT on the RocStories dataset
-
-This example code fine-tunes OpenAI GPT on the RocStories dataset.
-
-Before running this example you should download the
-[RocStories dataset](https://github.com/snigdhac/StoryComprehension_EMNLP/tree/master/Dataset/RoCStories) and unpack it to some directory `$ROC_STORIES_DIR`.
-
-```shell
-export ROC_STORIES_DIR=/path/to/RocStories
-
-python run_openai_gpt.py \
-  --model_name openai-gpt \
-  --do_train \
-  --do_eval \
-  --train_dataset $ROC_STORIES_DIR/cloze_test_val__spring2016\ -\ cloze_test_ALL_val.csv \
-  --eval_dataset $ROC_STORIES_DIR/cloze_test_test__spring2016\ -\ cloze_test_ALL_test.csv \
-  --output_dir ../log \
-  --train_batch_size 16
-```
-
-This command runs in about 10 min on a single K-80 and gives an evaluation accuracy of about 87.7% (the authors report a median accuracy with the TensorFlow code of 85.8% and the OpenAI GPT paper reports a best single run accuracy of 86.5%).
-
-#### Evaluating the pre-trained Transformer-XL on the WikiText 103 dataset
-
-This example code evaluates the pre-trained Transformer-XL on the WikiText 103 dataset.
-This command will download a pre-processed version of the WikiText 103 dataset in which the vocabulary has been computed.
-
-```shell
-python run_transfo_xl.py --work_dir ../log
-```
-
-This command runs in about 1 min on a V100 and gives an evaluation perplexity of 18.22 on WikiText-103 (the authors report a perplexity of about 18.3 on this dataset with the TensorFlow code).
-
-#### Unconditional and conditional generation from OpenAI's GPT-2 model
-
-This example code is identical to the original unconditional and conditional generation codes.
-
-Conditional generation:
-```shell
-python run_gpt2.py
-```
-
-Unconditional generation:
-```shell
-python run_gpt2.py --unconditional
-```
-
-The same options as in the original scripts are provided; please refer to the code of the example and to OpenAI's original repository.
-
-## Fine-tuning BERT-large on GPUs
-
-The options we list above make it fairly easy to fine-tune BERT-large on GPU(s) instead of the TPU used by the original implementation.
-
-For example, fine-tuning BERT-large on SQuAD can be done on a server with 4 K-80 GPUs (these are pretty old now) in 18 hours. Our results are similar to the TensorFlow implementation results (actually slightly higher):
-```bash
-{"exact_match": 84.56953642384106, "f1": 91.04028647786927}
-```
-To get these results we used a combination of:
-- multi-GPU training (automatically activated on a multi-GPU server),
-- 2 steps of gradient accumulation, and
-- performing the optimization step on CPU to store Adam's averages in RAM.
-
-Here is the full list of hyper-parameters for this run:
-```bash
-export SQUAD_DIR=/path/to/SQUAD
-
-python ./run_squad.py \
-  --bert_model bert-large-uncased \
-  --do_train \
-  --do_predict \
-  --do_lower_case \
-  --train_file $SQUAD_DIR/train-v1.1.json \
-  --predict_file $SQUAD_DIR/dev-v1.1.json \
-  --learning_rate 3e-5 \
-  --num_train_epochs 2 \
-  --max_seq_length 384 \
-  --doc_stride 128 \
-  --output_dir /tmp/debug_squad/ \
-  --train_batch_size 24 \
-  --gradient_accumulation_steps 2
-```
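
The "2 steps of gradient accumulation" mentioned above can be sketched without any deep-learning library. This toy example uses made-up data and hypothetical function names; it is not the code in `run_squad.py`. It assumes a mean-reduced loss and equal-sized micro-batches, in which case summing the micro-batch gradients scaled by `1 / accumulation_steps` reproduces the full-batch gradient.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


# Toy data (made up for illustration): pairs (x, y).
full_batch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
accumulation_steps = 2
micro_size = len(full_batch) // accumulation_steps
micro_batches = [
    full_batch[i:i + micro_size] for i in range(0, len(full_batch), micro_size)
]

w = 0.5
accumulated = 0.0
for micro in micro_batches:
    # Scale each micro-batch gradient so the sum matches the full-batch mean.
    accumulated += grad_mse(w, micro) / accumulation_steps

full = grad_mse(w, full_batch)
print(accumulated, full)
```

In a real training loop the optimizer step would only be taken once every `accumulation_steps` micro-batches, which trades extra forward/backward passes for a smaller peak memory footprint.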
-
-If you have a recent GPU (starting from NVIDIA Volta series), you should try **16-bit fine-tuning** (FP16).
-
-Here is an example of hyper-parameters for a FP16 run we tried:
-```bash
-export SQUAD_DIR=/path/to/SQUAD
-
-python ./run_squad.py \
-  --bert_model bert-large-uncased \
-  --do_train \
-  --do_predict \
-  --do_lower_case \
-  --train_file $SQUAD_DIR/train-v1.1.json \
-  --predict_file $SQUAD_DIR/dev-v1.1.json \
-  --learning_rate 3e-5 \
-  --num_train_epochs 2 \
-  --max_seq_length 384 \
-  --doc_stride 128 \
-  --output_dir /tmp/debug_squad/ \
-  --train_batch_size 24 \
-  --fp16 \
-  --loss_scale 128
-```
-
-The results were similar to the above FP32 results (actually slightly higher):
-```bash
-{"exact_match": 84.65468306527909, "f1": 91.238669287002}
-```
-
-Here is an example with the recent `bert-large-uncased-whole-word-masking`:
-
-```bash
-python -m torch.distributed.launch --nproc_per_node=8 \
-  run_squad.py \
-  --bert_model bert-large-uncased-whole-word-masking \
-  --do_train \
-  --do_predict \
-  --do_lower_case \
-  --train_file $SQUAD_DIR/train-v1.1.json \
-  --predict_file $SQUAD_DIR/dev-v1.1.json \
-  --learning_rate 3e-5 \
-  --num_train_epochs 2 \
-  --max_seq_length 384 \
-  --doc_stride 128 \
-  --output_dir /tmp/debug_squad/ \
-  --train_batch_size 24 \
-  --gradient_accumulation_steps 2
-```
-
-## BERTology
-
-There is a growing field of study concerned with investigating the inner workings of large-scale transformers like BERT (that some call "BERTology"). Some good examples of this field are:
-
-- BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick: https://arxiv.org/abs/1905.05950
-- Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
-- What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: https://arxiv.org/abs/1906.04341
-
-In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to help people access the inner representations, mainly adapted  from the great work of Paul Michel (https://arxiv.org/abs/1905.10650):
-
-- accessing all the hidden-states of BERT/GPT/GPT-2,
-- accessing all the attention weights for each head of BERT/GPT/GPT-2,
-- retrieving the heads' output values and gradients in order to compute head importance scores and prune heads, as explained in https://arxiv.org/abs/1905.10650.
-
-To help you understand and use these features, we have added a specific example script, [`bertology.py`](./examples/bertology.py), which extracts information from and prunes a model pre-trained on MRPC.
-
-## Notebooks
-
-We include [three Jupyter Notebooks](https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks) that can be used to check that the predictions of the PyTorch model are identical to the predictions of the original TensorFlow model.
-
-- The first notebook ([Comparing-TF-and-PT-models.ipynb](./notebooks/Comparing-TF-and-PT-models.ipynb)) extracts the hidden states of a full sequence on each layer of the TensorFlow and the PyTorch models and computes the standard deviation between them. In the given example, we get a standard deviation of 1.5e-7 to 9e-7 on the various hidden states of the models.
-
-- The second notebook ([Comparing-TF-and-PT-models-SQuAD.ipynb](./notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb)) compares the loss computed by the TensorFlow and the PyTorch models for identical initialization of the fine-tuning layer of the `BertForQuestionAnswering` and computes the standard deviation between them. In the given example, we get a standard deviation of 2.5e-7 between the models.
-
-- The third notebook ([Comparing-TF-and-PT-models-MLM-NSP.ipynb](./notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb)) compares the predictions computed by the TensorFlow and the PyTorch models for masked token language modeling using the pre-trained masked language modeling model.
-
-Please follow the instructions given in the notebooks to run and modify them.
-
-## Command-line interface
-
-A command-line interface is provided to convert a TensorFlow checkpoint into a PyTorch dump of the `BertForPreTraining` class (for BERT) or a NumPy checkpoint into a PyTorch dump of the `OpenAIGPTModel` class (for OpenAI GPT).
-
-### BERT
-
-You can convert any TensorFlow checkpoint for BERT (in particular [the pre-trained models released by Google](https://github.com/google-research/bert#pre-trained-models)) into a PyTorch save file by using the [`convert_tf_checkpoint_to_pytorch.py`](./pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py) script.
-
-This CLI takes as input a TensorFlow checkpoint (three files starting with `bert_model.ckpt`) and the associated configuration file (`bert_config.json`), and creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using `torch.load()` (see examples in [`extract_features.py`](./examples/extract_features.py), [`run_classifier.py`](./examples/run_classifier.py) and [`run_squad.py`](./examples/run_squad.py)).
-
-You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow checkpoint (the three files starting with `bert_model.ckpt`) but be sure to keep the configuration file (`bert_config.json`) and the vocabulary file (`vocab.txt`) as these are needed for the PyTorch model too.
-
-To run this specific conversion script you will need to have TensorFlow and PyTorch installed (`pip install tensorflow`). The rest of the repository only requires PyTorch.
-
-Here is an example of the conversion process for a pre-trained `BERT-Base Uncased` model:
-
-```shell
-export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
-
-pytorch_pretrained_bert convert_tf_checkpoint_to_pytorch \
-  $BERT_BASE_DIR/bert_model.ckpt \
-  $BERT_BASE_DIR/bert_config.json \
-  $BERT_BASE_DIR/pytorch_model.bin
-```
-
-You can download Google's pre-trained models for the conversion [here](https://github.com/google-research/bert#pre-trained-models).
-
-### OpenAI GPT
-
-Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint is saved in the same format as the OpenAI pre-trained model (see [here](https://github.com/openai/finetune-transformer-lm)):
-
-```shell
-export OPENAI_GPT_CHECKPOINT_FOLDER_PATH=/path/to/openai/pretrained/numpy/weights
-
-pytorch_pretrained_bert convert_openai_checkpoint \
-  $OPENAI_GPT_CHECKPOINT_FOLDER_PATH \
-  $PYTORCH_DUMP_OUTPUT \
-  [OPENAI_GPT_CONFIG]
-```
-
-### Transformer-XL
-
-Here is an example of the conversion process for a pre-trained Transformer-XL model (see [here](https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models))
-
-```shell
-export TRANSFO_XL_CHECKPOINT_FOLDER_PATH=/path/to/transfo/xl/checkpoint
-
-pytorch_pretrained_bert convert_transfo_xl_checkpoint \
-  $TRANSFO_XL_CHECKPOINT_FOLDER_PATH \
-  $PYTORCH_DUMP_OUTPUT \
-  [TRANSFO_XL_CONFIG]
-```
-
-### GPT-2
-
-Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model.
-
-```shell
-export GPT2_DIR=/path/to/gpt2/checkpoint
-
-pytorch_pretrained_bert convert_gpt2_checkpoint \
-  $GPT2_DIR/model.ckpt \
-  $PYTORCH_DUMP_OUTPUT \
-  [GPT2_CONFIG]
-```
-
-## TPU
-
-TPU support and pretraining scripts
-
-TPUs are not supported by the current stable release of PyTorch (0.4.1). However, the next version of PyTorch (v1.0) should support training on TPUs and is expected to be released soon (see the recent [official announcement](https://cloud.google.com/blog/products/ai-machine-learning/introducing-pytorch-across-google-cloud)).
-
-We will add TPU support when this next release is published.
-
-The original TensorFlow code further comprises two scripts for pre-training BERT: [create_pretraining_data.py](https://github.com/google-research/bert/blob/master/create_pretraining_data.py) and [run_pretraining.py](https://github.com/google-research/bert/blob/master/run_pretraining.py).
-
-Since pre-training BERT is a particularly expensive operation that basically requires one or several TPUs to be completed in a reasonable amount of time (see details [here](https://github.com/google-research/bert#pre-training-with-bert)), we have decided to wait for the inclusion of TPU support in PyTorch to convert these pre-training scripts.
+# Parameters:
+lr = 1e-3
+num_total_steps = 1000
+num_warmup_steps = 100
+warmup_proportion = float(num_warmup_steps) / float(num_total_steps)  # 0.1
+
+### Previously BertAdam optimizer was instantiated like this:
+optimizer = BertAdam(model.parameters(), lr=lr, schedule='warmup_linear', warmup=warmup_proportion, t_total=num_total_steps)
+### and used like this:
+for batch in train_data:
+    loss = model(batch)
+    loss.backward()
+    optimizer.step()
+
+### In PyTorch-Transformers, optimizer and schedules are split and instantiated like this:
+optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)  # To reproduce BertAdam specific behavior set correct_bias=False
+scheduler = WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps, t_total=num_total_steps)  # PyTorch scheduler
+### and used like this:
+for batch in train_data:
+    loss = model(batch)
+    loss.backward()
+    scheduler.step()
+    optimizer.step()
+```
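
The relationship between `warmup_proportion` and `num_warmup_steps` in the snippet above can be illustrated with a small, dependency-free sketch. It assumes a linear-warmup/linear-decay multiplier (ramping from 0 to 1 over the warmup steps, then decaying to 0 at `num_total_steps`). The function name `warmup_linear_multiplier` is hypothetical, and this is not the library's own implementation.

```python
def warmup_linear_multiplier(step, warmup_steps, t_total):
    """Hypothetical helper: learning-rate multiplier in [0, 1] at a given step."""
    if step < warmup_steps:
        # Linear ramp from 0 up to 1 during warmup.
        return float(step) / float(max(1, warmup_steps))
    # Linear decay from 1 down to 0 between warmup_steps and t_total.
    return max(0.0, float(t_total - step) / float(max(1, t_total - warmup_steps)))


lr = 1e-3
num_total_steps = 1000
num_warmup_steps = 100

# The same quantity expressed the old way, as a fraction of the total steps.
warmup_proportion = float(num_warmup_steps) / float(num_total_steps)

for step in (0, 50, 100, 550, 1000):
    multiplier = warmup_linear_multiplier(step, num_warmup_steps, num_total_steps)
    print(step, lr * multiplier)
```

Expressing the warmup as an absolute number of steps rather than a proportion keeps the schedule independent of how the total step count is later changed.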
+
+## Citation
+
+At the moment, there is no paper associated with PyTorch-Transformers but we are working on preparing one. In the meantime, please include a mention of the library and a link to the present repository if you use this work in a published or open-source project.
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -2,6 +2,6 @@ FROM pytorch/pytorch:latest

 RUN git clone https://github.com/NVIDIA/apex.git && cd apex && python setup.py install --cuda_ext --cpp_ext

-RUN pip install pytorch-pretrained-bert
+RUN pip install pytorch_transformers

 WORKDIR /workspace
\ No newline at end of file
--- a/docs/Makefile
+++ b/docs/Makefile
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line.
+SPHINXOPTS    =
+SPHINXBUILD   = sphinx-build
+SOURCEDIR     = source
+BUILDDIR      = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
\ No newline at end of file
--- a/docs/README.md
+++ b/docs/README.md
+# Generating the documentation
+
+To generate the documentation, you first have to build it. Several packages are necessary to build the docs;
+you can install them using:
+
+```bash
+pip install -r requirements.txt
+```
+ 
+## Packages installed
+
+Here's an overview of all the packages installed. If you ran the previous command installing all packages from 
+`requirements.txt`, you do not need to run the following commands.
+
+Building it requires the package `sphinx` that you can 
+install using:
+
+```bash
+pip install -U sphinx
+```
+
+You would also need the custom installed [theme](https://github.com/readthedocs/sphinx_rtd_theme) by 
+[Read The Docs](https://readthedocs.org/). You can install it using the following command:
+
+```bash
+pip install sphinx_rtd_theme
+```
+
+The third necessary package is the `recommonmark` package to accept Markdown as well as reStructuredText:
+
+```bash
+pip install recommonmark
+```
+
+## Building the documentation
+
+Once you have set up `sphinx`, you can build the documentation by running the following command in the `/docs` folder:
+
+```bash
+make html
+```
+
+---
+**NOTE**
+
+If you are adding/removing elements from the toc-tree or from any structural item, it is recommended to clean the build
+directory before rebuilding. Run the following command to clean and build:
+
+```bash
+make clean && make html
+```
+
+---
+
+It should build the static app that will be available under `/docs/_build/html`.
+
+## Adding a new element to the tree (toc-tree)
+
+Accepted files are reStructuredText (.rst) and Markdown (.md). Create a file with its extension and put it
+in the source directory. You can then link it to the toc-tree by putting the filename without the extension.
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
+alabaster==0.7.12
+Babel==2.7.0
+certifi==2019.6.16
+chardet==3.0.4
+commonmark==0.9.0
+docutils==0.14
+future==0.17.1
+idna==2.8
+imagesize==1.1.0
+Jinja2==2.10.1
+MarkupSafe==1.1.1
+packaging==19.0
+Pygments==2.4.2
+pyparsing==2.4.0
+pytz==2019.1
+recommonmark==0.5.0
+requests==2.22.0
+six==1.12.0
+snowballstemmer==1.9.0
+Sphinx==2.1.2
+sphinx-rtd-theme==0.4.3
+sphinxcontrib-applehelp==1.0.1
+sphinxcontrib-devhelp==1.0.1
+sphinxcontrib-htmlhelp==1.0.2
+sphinxcontrib-jsmath==1.0.1
+sphinxcontrib-qthelp==1.0.2
+sphinxcontrib-serializinghtml==1.1.3
+urllib3==1.25.3
--- a/docs/source/_static/css/Calibre-Light.ttf
+++ b/docs/source/_static/css/Calibre-Light.ttf
--- a/docs/source/_static/css/Calibre-Medium.otf
+++ b/docs/source/_static/css/Calibre-Medium.otf
--- a/docs/source/_static/css/Calibre-Regular.otf
+++ b/docs/source/_static/css/Calibre-Regular.otf
--- a/docs/source/_static/css/Calibre-Thin.otf
+++ b/docs/source/_static/css/Calibre-Thin.otf
--- a/docs/source/_static/css/code-snippets.css
+++ b/docs/source/_static/css/code-snippets.css
+
+.highlight .c1, .highlight .sd{
+    color: #999
+}
+
+.highlight .nn, .highlight .k, .highlight .s1, .highlight .nb, .highlight .bp, .highlight .kc {
+    color: #FB8D68;
+}
+
+.highlight .kn, .highlight .nv, .highlight .s2, .highlight .ow {
+    color: #6670FF;
+}
\ No newline at end of file
--- a/docs/source/_static/css/huggingface.css
+++ b/docs/source/_static/css/huggingface.css
+/* huggingface.css */
+
+/* The literal code blocks */
+.rst-content tt.literal, .rst-content tt.literal, .rst-content code.literal {
+    color: #6670FF;
+}
+
+/* To keep the logo centered */
+.wy-side-scroll {
+    width: auto;
+    font-size: 20px;
+}
+
+/* The div that holds the Hugging Face logo */
+.HuggingFaceDiv {
+    width: 100%
+}
+
+/* The research field on top of the toc tree */
+.wy-side-nav-search{
+    background-color: #6670FF;
+}
+
+/* The toc tree */
+.wy-nav-side{
+    background-color: #6670FF;
+}
+
+/* The selected items in the toc tree */
+.wy-menu-vertical li.current{
+    background-color: #A6B0FF;
+}
+
+/* When a list item that does belong to the selected block from the toc tree is hovered */
+.wy-menu-vertical li.current a:hover{
+    background-color: #B6C0FF;
+}
+
+/* When a list item that does NOT belong to the selected block from the toc tree is hovered. */
+.wy-menu-vertical li a:hover{
+    background-color: #A7AFFB;
+}
+
+/* The text items on the toc tree */
+.wy-menu-vertical a {
+    color: #FFFFDD;
+    font-family: Calibre-Light;
+}
+.wy-menu-vertical header, .wy-menu-vertical p.caption{
+    color: white;
+    font-family: Calibre-Light;
+}
+
+/* The color inside the selected toc tree block */
+.wy-menu-vertical li.toctree-l2 a, .wy-menu-vertical li.toctree-l3 a, .wy-menu-vertical li.toctree-l4 a {
+    color: black;
+}
+
+/* Inside the depth-2 selected toc tree block */
+.wy-menu-vertical li.toctree-l2.current>a {
+    background-color: #B6C0FF
+}
+.wy-menu-vertical li.toctree-l2.current li.toctree-l3>a {
+    background-color: #C6D0FF
+}
+
+/* Inside the depth-3 selected toc tree block */
+.wy-menu-vertical li.toctree-l3.current li.toctree-l4>a{
+    background-color: #D6E0FF
+}
+
+/* Inside code snippets */
+.rst-content dl:not(.docutils) dt{
+    font-size: 15px;
+}
+
+/* Links */
+a {
+    color: #6670FF;
+}
+
+/* Content bars */
+.rst-content dl:not(.docutils) dt {
+    background-color: rgba(251, 141, 104, 0.1);
+    border-right: solid 2px #FB8D68;
+    border-left: solid 2px #FB8D68;
+    color: #FB8D68;
+    font-family: Calibre-Light;
+    border-top: none;
+    font-style: normal !important;
+}
+
+/* Expand button */
+.wy-menu-vertical li.toctree-l2 span.toctree-expand,
+.wy-menu-vertical li.on a span.toctree-expand, .wy-menu-vertical li.current>a span.toctree-expand,
+.wy-menu-vertical li.toctree-l3 span.toctree-expand{
+    color: black;
+}
+
+/* Max window size */
+.wy-nav-content{
+    max-width: 1200px;
+}
+
+/* Mobile header */
+.wy-nav-top{
+    background-color: #6670FF;
+}
+
+
+/* Source spans */
+.rst-content .viewcode-link, .rst-content .viewcode-back{
+    color: #6670FF;
+    font-size: 110%;
+    letter-spacing: 2px;
+    text-transform: uppercase;
+}
+
+/* It would be better for table to be visible without horizontal scrolling */
+.wy-table-responsive table td, .wy-table-responsive table th{
+    white-space: normal;
+}
+
+.footer {
+    margin-top: 20px;
+}
+
+.footer__Social {
+    display: flex;
+    flex-direction: row;
+}
+
+.footer__CustomImage {
+    margin: 2px 5px 0 0;
+}
+
+/* class and method names in doc */
+.rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) tt.descclassname, .rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) code.descname, .rst-content dl:not(.docutils) tt.descclassname, .rst-content dl:not(.docutils) code.descclassname{
+    font-family: Calibre;
+    font-size: 20px !important;
+}
+
+/* class name in doc*/
+.rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) tt.descname, .rst-content dl:not(.docutils) code.descname{
+    margin-right: 10px;
+    font-family: Calibre-Medium;
+}
+
+/* Method and class parameters */
+.sig-param{
+    line-height: 23px;
+}
+
+/* Class introduction "class" string at beginning */
+.rst-content dl:not(.docutils) .property{
+    font-size: 18px;
+    color: black;
+}
+
+
+/* FONTS */
+body{
+    font-family: Calibre;
+    font-size: 16px;
+}
+
+h1 {
+    font-family: Calibre-Thin;
+    font-size: 70px;
+}
+
+h2, .rst-content .toctree-wrapper p.caption, h3, h4, h5, h6, legend{
+    font-family: Calibre-Medium;
+}
+
+@font-face {
+    font-family: Calibre-Medium;
+    src: url(./Calibre-Medium.otf);
+    font-weight:400;
+}
+
+@font-face {
+    font-family: Calibre;
+    src: url(./Calibre-Regular.otf);
+    font-weight:400;
+}
+
+@font-face {
+    font-family: Calibre-Light;
+    src: url(./Calibre-Light.ttf);
+    font-weight:400;
+}
+
+@font-face {
+    font-family: Calibre-Thin;
+    src: url(./Calibre-Thin.otf);
+    font-weight:400;
+}
+
--- a/docs/source/_static/js/custom.js
+++ b/docs/source/_static/js/custom.js
+function addIcon() {
+    const huggingFaceLogo = "http://lysand.re/huggingface_logo.svg";
+    const image = document.createElement("img");
+    image.setAttribute("src", huggingFaceLogo);
+
+    const div = document.createElement("div");
+    div.appendChild(image);
+    div.style.textAlign = 'center';
+    div.style.paddingTop = '30px';
+    div.style.backgroundColor = '#6670FF';
+
+    const scrollDiv = document.getElementsByClassName("wy-side-scroll")[0];
+    scrollDiv.prepend(div);
+}
+
+function addCustomFooter() {
+    const customFooter = document.createElement("div");
+    const questionOrIssue = document.createElement("div");
+    questionOrIssue.innerHTML = "Stuck? Read our <a href='https://medium.com/huggingface'>Blog posts</a> or <a href='https://github.com/huggingface/pytorch_transformers'>Create an issue</a>";
+    customFooter.appendChild(questionOrIssue);
+    customFooter.classList.add("footer");
+
+    const social = document.createElement("div");
+    social.classList.add("footer__Social");
+
+    const imageDetails = [
+        { link: "https://huggingface.co", imageLink: "http://lysand.re/icons/website.svg" },
+        { link: "https://twitter.com/huggingface", imageLink: "http://lysand.re/icons/twitter.svg" },
+        { link: "https://github.com/huggingface", imageLink: "http://lysand.re/icons/github.svg" },
+        { link: "https://www.linkedin.com/company/huggingface/", imageLink: "http://lysand.re/icons/linkedin.svg" }
+    ];
+
+    imageDetails.forEach(imageLinks => {
+        const link = document.createElement("a");
+        const image = document.createElement("img");
+        image.src = imageLinks.imageLink;
+        link.href = imageLinks.link;
+        image.style.width = "30px";
+        image.classList.add("footer__CustomImage");
+        link.appendChild(image);
+        social.appendChild(link);
+    });
+
+    customFooter.appendChild(social);
+    document.getElementsByTagName("footer")[0].appendChild(customFooter);
+}
+
+function onLoad() {
+    addIcon();
+    addCustomFooter();
+}
+
+window.addEventListener("load", onLoad);
+
--- a/docs/source/_static/js/huggingface_logo.svg
+++ b/docs/source/_static/js/huggingface_logo.svg
+<svg width="95px" height="88px" viewBox="0 0 95 88" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
+    <!-- Generator: Sketch 43.2 (39069) - http://www.bohemiancoding.com/sketch -->
+    <title>icon</title>
+    <desc>Created with Sketch.</desc>
+    <defs>
+        <path d="M13,14.7890193 C22.8284801,14.7890193 26,6.02605902 26,1.5261751 C26,-0.812484109 24.4279133,-0.0763570998 21.9099482,1.17020987 C19.5830216,2.32219957 16.4482998,3.91011313 13,3.91011313 C5.82029825,3.91011313 0,-2.97370882 0,1.5261751 C0,6.02605902 3.17151989,14.7890193 13,14.7890193 Z" id="path-1"></path>
+    </defs>
+    <g id="Page-1" stroke="none" stroke-width="1" fill="none" fill-rule="evenodd">
+        <g id="icon_desktop">
+            <g id="icon">
+                <g id="icon_desktop">
+                    <g id="Group-2">
+                        <g id="Group">
+                            <path d="M93.7930402,70.08 C94.5430402,72.24 94.3630402,74.54 93.3630402,76.54 C92.6430402,78 91.6130402,79.13 90.3530402,80.14 C88.8330402,81.34 86.9430402,82.36 84.6630402,83.34 C81.9430402,84.5 78.6230402,85.59 77.1030402,85.99 C73.2130402,87 69.4730402,87.64 65.6830402,87.67 C60.2630402,87.72 55.5930402,86.44 52.2730402,83.17 C50.5530402,83.38 48.8130402,83.5 47.0630402,83.5 C45.4030402,83.5 43.7630402,83.4 42.1330402,83.2 C38.8030402,86.45 34.1530402,87.72 28.7530402,87.67 C24.9630402,87.64 21.2230402,87 17.3230402,85.99 C15.8130402,85.59 12.4930402,84.5 9.77304019,83.34 C7.49304019,82.36 5.60304019,81.34 4.09304019,80.14 C2.82304019,79.13 1.79304019,78 1.07304019,76.54 C0.0830401858,74.54 -0.106959814,72.24 0.653040186,70.08 C-0.0469598142,68.43 -0.226959814,66.54 0.323040186,64.45 C0.573040186,63.5 0.983040186,62.62 1.50304019,61.84 C1.39304019,61.43 1.30304019,61.01 1.24304019,60.55 C0.863040186,57.81 1.81304019,55.31 3.60304019,53.37 C4.48304019,52.4 5.43304019,51.73 6.42304019,51.3 C5.69304019,48.2 5.31304019,45.01 5.31304019,41.75 C5.31304019,18.69 24.0030402,0 47.0630402,0 C54.9830402,0 62.3930402,2.2 68.7130402,6.04 C69.8530402,6.74 70.9730402,7.49 72.0430402,8.29 C72.5730402,8.69 73.1030402,9.1 73.6130402,9.53 C74.1330402,9.95 74.6430402,10.39 75.1330402,10.84 C76.6130402,12.19 78.0030402,13.64 79.2730402,15.19 C79.7030402,15.7 80.1130402,16.23 80.5130402,16.77 C81.3230402,17.84 82.0730402,18.95 82.7630402,20.1 C83.8130402,21.82 84.7330402,23.62 85.5330402,25.49 C86.0630402,26.74 86.5230402,28.02 86.9330402,29.33 C87.5430402,31.29 88.0130402,33.31 88.3330402,35.39 C88.4330402,36.08 88.5230402,36.78 88.5930402,37.48 C88.7330402,38.88 88.8130402,40.3 88.8130402,41.75 C88.8130402,44.97 88.4330402,48.13 87.7230402,51.18 C88.8230402,51.61 89.8630402,52.31 90.8330402,53.37 C92.6230402,55.31 93.5730402,57.82 93.1930402,60.56 C93.1330402,61.01 93.0430402,61.43 92.9330402,61.84 C93.4530402,62.62 93.8630402,63.5 94.1130402,64.45 
C94.6630402,66.54 94.4830402,68.43 93.7930402,70.08" id="Fill-1" fill="#FFFFFF" fill-rule="nonzero"></path>
+                            <circle id="Oval" fill="#FFD21E" fill-rule="nonzero" cx="46.75" cy="41.75" r="34.75"></circle>
+                            <path d="M81.5,41.75 C81.5,22.5581049 65.9418951,7 46.75,7 C27.5581049,7 12,22.5581049 12,41.75 C12,60.9418951 27.5581049,76.5 46.75,76.5 C65.9418951,76.5 81.5,60.9418951 81.5,41.75 Z M8,41.75 C8,20.3489659 25.3489659,3 46.75,3 C68.1510341,3 85.5,20.3489659 85.5,41.75 C85.5,63.1510341 68.1510341,80.5 46.75,80.5 C25.3489659,80.5 8,63.1510341 8,41.75 Z" id="Oval" fill="#FFAC03" fill-rule="nonzero"></path>
+                            <path d="M57.1723547,31.7151181 C58.0863134,32.7107502 57.3040427,35.2620959 58.7620957,35.2620959 C61.5235194,35.2620959 63.7620957,33.0235196 63.7620957,30.2620959 C63.7620957,27.5006721 61.5235194,25.2620959 58.7620957,25.2620959 C56.0006719,25.2620959 53.7620957,27.5006721 53.7620957,30.2620959 C53.7620957,31.5654666 56.3553563,30.8251108 57.1723547,31.7151181 Z" id="Oval-2" fill="#3A3B45" fill-rule="nonzero" transform="translate(58.762096, 30.262096) rotate(-28.000000) translate(-58.762096, -30.262096) "></path>
+                            <path d="M32.1723553,31.7151181 C33.086314,32.7107502 32.3040433,35.2620959 33.7620963,35.2620959 C36.52352,35.2620959 38.7620963,33.0235196 38.7620963,30.2620959 C38.7620963,27.5006721 36.52352,25.2620959 33.7620963,25.2620959 C31.0006725,25.2620959 28.7620963,27.5006721 28.7620963,30.2620959 C28.7620963,31.5654666 31.3553569,30.8251108 32.1723553,31.7151181 Z" id="Oval-2" fill="#3A3B45" fill-rule="nonzero" transform="translate(33.762096, 30.262096) scale(-1, 1) rotate(-28.000000) translate(-33.762096, -30.262096) "></path>
+                            <g id="Oval-4" transform="translate(33.500000, 41.500000)">
+                                <g id="Mask" fill-rule="nonzero" fill="#3A3B45">
+                                    <path d="M13,14.7890193 C22.8284801,14.7890193 26,6.02605902 26,1.5261751 C26,-0.812484109 24.4279133,-0.0763570998 21.9099482,1.17020987 C19.5830216,2.32219957 16.4482998,3.91011313 13,3.91011313 C5.82029825,3.91011313 0,-2.97370882 0,1.5261751 C0,6.02605902 3.17151989,14.7890193 13,14.7890193 Z" id="path-1"></path>
+                                </g>
+                                <g id="Clipped">
+                                    <mask id="mask-2" fill="white">
+                                        <use xlink:href="#path-1"></use>
+                                    </mask>
+                                    <g id="path-1"></g>
+                                    <path d="M13.25,25 C18.0399291,25 21.9229338,21.1169953 21.9229338,16.3270662 C21.9229338,12.5962324 19.5672252,9.41560375 16.2620987,8.19147116 C16.1404592,8.14641904 16.0175337,8.10401696 15.8933923,8.06433503 C15.0599892,7.79793679 14.1717882,10.6623144 13.25,10.6623144 C12.3886883,10.6623144 11.5567012,7.77968641 10.7713426,8.01349068 C7.18916268,9.07991937 4.57706621,12.3984489 4.57706621,16.3270662 C4.57706621,21.1169953 8.46007093,25 13.25,25 Z" id="Shape" fill="#EF4E4E" fill-rule="nonzero" mask="url(#mask-2)"></path>
+                                </g>
+                            </g>
+                            <circle id="Oval-3" fill="#FFD21E" fill-rule="nonzero" style="mix-blend-mode: multiply;" cx="70.25" cy="33.75" r="3.25"></circle>
+                            <circle id="Oval-3" fill="#FFD21E" fill-rule="nonzero" style="mix-blend-mode: multiply;" cx="23.75" cy="33.75" r="3.25"></circle>
+                        </g>
+                    </g>
+                </g>
+                <g id="Group-4" transform="translate(3.000000, 48.000000)" fill-rule="nonzero">
+                    <path d="M14.0619453,0 L14.0619453,0 C12.4429453,0 10.9959453,0.665 9.98694534,1.871 C9.36294534,2.618 8.71094534,3.822 8.65794534,5.625 C7.97894534,5.43 7.32594534,5.321 6.71594534,5.321 C5.16594534,5.321 3.76594534,5.915 2.77594534,6.994 C1.50394534,8.379 0.938945345,10.081 1.18494534,11.784 C1.30194534,12.595 1.57294534,13.322 1.97794534,13.995 C1.12394534,14.686 0.494945345,15.648 0.190945345,16.805 C-0.0470546551,17.712 -0.291054655,19.601 0.982945345,21.547 C0.901945345,21.674 0.825945345,21.806 0.754945345,21.941 C-0.0110546551,23.395 -0.0600546551,25.038 0.615945345,26.568 C1.64094534,28.887 4.18794534,30.714 9.13394534,32.675 C12.2109453,33.895 15.0259453,34.675 15.0509453,34.682 C19.1189453,35.737 22.7979453,36.273 25.9829453,36.273 C31.8369453,36.273 36.0279453,34.48 38.4399453,30.944 C42.3219453,25.25 41.7669453,20.042 36.7439453,15.022 C33.9639453,12.244 32.1159453,8.148 31.7309453,7.249 C30.9549453,4.587 28.9029453,1.628 25.4919453,1.628 L25.4909453,1.628 C25.2039453,1.628 24.9139453,1.651 24.6279453,1.696 C23.1339453,1.931 21.8279453,2.791 20.8949453,4.085 C19.8879453,2.833 18.9099453,1.837 18.0249453,1.275 C16.6909453,0.429 15.3579453,0 14.0619453,0 M14.0619453,4 C14.5719453,4 15.1949453,4.217 15.8819453,4.653 C18.0149453,6.006 22.1309453,13.081 23.6379453,15.833 C24.1429453,16.755 25.0059453,17.145 25.7829453,17.145 C27.3249453,17.145 28.5289453,15.612 25.9239453,13.664 C22.0069453,10.733 23.3809453,5.942 25.2509453,5.647 C25.3329453,5.634 25.4139453,5.628 25.4919453,5.628 C27.1919453,5.628 27.9419453,8.558 27.9419453,8.558 C27.9419453,8.558 30.1399453,14.078 33.9159453,17.851 C37.6919453,21.625 37.8869453,24.654 35.1349453,28.69 C33.2579453,31.442 29.6649453,32.273 25.9829453,32.273 C22.1639453,32.273 18.2489453,31.379 16.0549453,30.81 C15.9469453,30.782 2.60394534,27.013 4.29394534,23.805 C4.57794534,23.266 5.04594534,23.05 5.63494534,23.05 C8.01494534,23.05 12.3439453,26.592 14.2049453,26.592 C14.6209453,26.592 
14.9139453,26.415 15.0339453,25.983 C15.8269453,23.138 2.97694534,21.942 4.05994534,17.821 C4.25094534,17.092 4.76894534,16.796 5.49694534,16.797 C8.64194534,16.797 15.6979453,22.328 17.1769453,22.328 C17.2899453,22.328 17.3709453,22.295 17.4149453,22.225 C18.1559453,21.029 17.7499453,20.194 12.5269453,17.033 C7.30394534,13.871 3.63794534,11.969 5.72294534,9.699 C5.96294534,9.437 6.30294534,9.321 6.71594534,9.321 C9.88694534,9.322 17.3789453,16.14 17.3789453,16.14 C17.3789453,16.14 19.4009453,18.243 20.6239453,18.243 C20.9049453,18.243 21.1439453,18.132 21.3059453,17.858 C22.1729453,16.396 13.2529453,9.636 12.7499453,6.847 C12.4089453,4.957 12.9889453,4 14.0619453,4" id="Fill-1" fill="#FFAC03"></path>
+                    <path d="M35.1348,28.6899 C37.8868,24.6539 37.6918,21.6249 33.9158,17.8509 C30.1398,14.0779 27.9418,8.5579 27.9418,8.5579 C27.9418,8.5579 27.1208,5.3519 25.2508,5.6469 C23.3808,5.9419 22.0078,10.7329 25.9248,13.6639 C29.8418,16.5939 25.1448,18.5849 23.6378,15.8329 C22.1308,13.0809 18.0158,6.0059 15.8818,4.6529 C13.7488,3.2999 12.2468,4.0579 12.7498,6.8469 C13.2528,9.6359 22.1738,16.3959 21.3058,17.8589 C20.4378,19.3209 17.3788,16.1399 17.3788,16.1399 C17.3788,16.1399 7.8068,7.4289 5.7228,9.6989 C3.6388,11.9689 7.3038,13.8709 12.5268,17.0329 C17.7508,20.1939 18.1558,21.0289 17.4148,22.2249 C16.6728,23.4209 5.1428,13.6999 4.0598,17.8209 C2.9778,21.9419 15.8268,23.1379 15.0338,25.9829 C14.2408,28.8289 5.9828,20.5979 4.2938,23.8049 C2.6038,27.0129 15.9468,30.7819 16.0548,30.8099 C20.3648,31.9279 31.3108,34.2969 35.1348,28.6899" id="Fill-4" fill="#FFD21E"></path>
+                </g>
+                <g id="Group-4" transform="translate(70.500000, 66.500000) scale(-1, 1) translate(-70.500000, -66.500000) translate(50.000000, 48.000000)" fill-rule="nonzero">
+                    <path d="M14.0619453,0 L14.0619453,0 C12.4429453,0 10.9959453,0.665 9.98694534,1.871 C9.36294534,2.618 8.71094534,3.822 8.65794534,5.625 C7.97894534,5.43 7.32594534,5.321 6.71594534,5.321 C5.16594534,5.321 3.76594534,5.915 2.77594534,6.994 C1.50394534,8.379 0.938945345,10.081 1.18494534,11.784 C1.30194534,12.595 1.57294534,13.322 1.97794534,13.995 C1.12394534,14.686 0.494945345,15.648 0.190945345,16.805 C-0.0470546551,17.712 -0.291054655,19.601 0.982945345,21.547 C0.901945345,21.674 0.825945345,21.806 0.754945345,21.941 C-0.0110546551,23.395 -0.0600546551,25.038 0.615945345,26.568 C1.64094534,28.887 4.18794534,30.714 9.13394534,32.675 C12.2109453,33.895 15.0259453,34.675 15.0509453,34.682 C19.1189453,35.737 22.7979453,36.273 25.9829453,36.273 C31.8369453,36.273 36.0279453,34.48 38.4399453,30.944 C42.3219453,25.25 41.7669453,20.042 36.7439453,15.022 C33.9639453,12.244 32.1159453,8.148 31.7309453,7.249 C30.9549453,4.587 28.9029453,1.628 25.4919453,1.628 L25.4909453,1.628 C25.2039453,1.628 24.9139453,1.651 24.6279453,1.696 C23.1339453,1.931 21.8279453,2.791 20.8949453,4.085 C19.8879453,2.833 18.9099453,1.837 18.0249453,1.275 C16.6909453,0.429 15.3579453,0 14.0619453,0 M14.0619453,4 C14.5719453,4 15.1949453,4.217 15.8819453,4.653 C18.0149453,6.006 22.1309453,13.081 23.6379453,15.833 C24.1429453,16.755 25.0059453,17.145 25.7829453,17.145 C27.3249453,17.145 28.5289453,15.612 25.9239453,13.664 C22.0069453,10.733 23.3809453,5.942 25.2509453,5.647 C25.3329453,5.634 25.4139453,5.628 25.4919453,5.628 C27.1919453,5.628 27.9419453,8.558 27.9419453,8.558 C27.9419453,8.558 30.1399453,14.078 33.9159453,17.851 C37.6919453,21.625 37.8869453,24.654 35.1349453,28.69 C33.2579453,31.442 29.6649453,32.273 25.9829453,32.273 C22.1639453,32.273 18.2489453,31.379 16.0549453,30.81 C15.9469453,30.782 2.60394534,27.013 4.29394534,23.805 C4.57794534,23.266 5.04594534,23.05 5.63494534,23.05 C8.01494534,23.05 12.3439453,26.592 14.2049453,26.592 C14.6209453,26.592 
14.9139453,26.415 15.0339453,25.983 C15.8269453,23.138 2.97694534,21.942 4.05994534,17.821 C4.25094534,17.092 4.76894534,16.796 5.49694534,16.797 C8.64194534,16.797 15.6979453,22.328 17.1769453,22.328 C17.2899453,22.328 17.3709453,22.295 17.4149453,22.225 C18.1559453,21.029 17.7499453,20.194 12.5269453,17.033 C7.30394534,13.871 3.63794534,11.969 5.72294534,9.699 C5.96294534,9.437 6.30294534,9.321 6.71594534,9.321 C9.88694534,9.322 17.3789453,16.14 17.3789453,16.14 C17.3789453,16.14 19.4009453,18.243 20.6239453,18.243 C20.9049453,18.243 21.1439453,18.132 21.3059453,17.858 C22.1729453,16.396 13.2529453,9.636 12.7499453,6.847 C12.4089453,4.957 12.9889453,4 14.0619453,4" id="Fill-1" fill="#FFAC03"></path>
+                    <path d="M35.1348,28.6899 C37.8868,24.6539 37.6918,21.6249 33.9158,17.8509 C30.1398,14.0779 27.9418,8.5579 27.9418,8.5579 C27.9418,8.5579 27.1208,5.3519 25.2508,5.6469 C23.3808,5.9419 22.0078,10.7329 25.9248,13.6639 C29.8418,16.5939 25.1448,18.5849 23.6378,15.8329 C22.1308,13.0809 18.0158,6.0059 15.8818,4.6529 C13.7488,3.2999 12.2468,4.0579 12.7498,6.8469 C13.2528,9.6359 22.1738,16.3959 21.3058,17.8589 C20.4378,19.3209 17.3788,16.1399 17.3788,16.1399 C17.3788,16.1399 7.8068,7.4289 5.7228,9.6989 C3.6388,11.9689 7.3038,13.8709 12.5268,17.0329 C17.7508,20.1939 18.1558,21.0289 17.4148,22.2249 C16.6728,23.4209 5.1428,13.6999 4.0598,17.8209 C2.9778,21.9419 15.8268,23.1379 15.0338,25.9829 C14.2408,28.8289 5.9828,20.5979 4.2938,23.8049 C2.6038,27.0129 15.9468,30.7819 16.0548,30.8099 C20.3648,31.9279 31.3108,34.2969 35.1348,28.6899" id="Fill-4" fill="#FFD21E"></path>
+                </g>
+            </g>
+        </g>
+    </g>
+</svg>
\ No newline at end of file
--- a/docs/source/bertology.rst
+++ b/docs/source/bertology.rst
+BERTology
+---------
+
+There is a growing field of study concerned with investigating the inner workings of large-scale transformers like BERT (which some call "BERTology"). Some good examples of work in this field are:
+
+
+* BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick: https://arxiv.org/abs/1905.05950
+* Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
+* What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: https://arxiv.org/abs/1906.04341
+
+To help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models that give access to the inner representations, mainly adapted from the great work of Paul Michel (https://arxiv.org/abs/1905.10650):
+
+
+* accessing all the hidden-states of BERT/GPT/GPT-2,
+* accessing all the attention weights for each head of BERT/GPT/GPT-2,
+* retrieving the output values and gradients of the heads, in order to compute head importance scores and prune heads as explained in https://arxiv.org/abs/1905.10650.
+
+To help you understand and use these features, we have added a specific example script, `bertology.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/bertology.py>`_, which extracts information from and prunes a model pre-trained on MRPC.
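+
+As a rough illustration of the pruning step, the sketch below ranks heads by an importance score and selects the least important ones. It is not the logic of ``bertology.py``: the function name ``heads_to_prune``, the fixed pruning fraction and the score values are all assumptions made for this example.
+
```python
def heads_to_prune(importance, fraction):
    """Return {layer: [head, ...]} for the least important heads.

    importance: one list of scores per layer (hypothetical values here).
    fraction: share of all heads to prune, between 0 and 1.
    """
    flat = [
        (score, layer, head)
        for layer, row in enumerate(importance)
        for head, score in enumerate(row)
    ]
    flat.sort()  # lowest importance first
    n_prune = int(len(flat) * fraction)

    pruned = {}
    for _, layer, head in flat[:n_prune]:
        pruned.setdefault(layer, []).append(head)
    return {layer: sorted(heads) for layer, heads in pruned.items()}


# Made-up scores for a toy model with 2 layers and 4 heads per layer.
scores = [
    [0.90, 0.05, 0.40, 0.70],
    [0.10, 0.80, 0.60, 0.02],
]
print(heads_to_prune(scores, 0.5))
```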
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
+# -*- coding: utf-8 -*-
+#
+# Configuration file for the Sphinx documentation builder.
+#
+# This file does only contain a selection of the most common options. For a
+# full list see the documentation:
+# http://www.sphinx-doc.org/en/master/config
+
+# -- Path setup --------------------------------------------------------------
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+import os
+import sys
+sys.path.insert(0, os.path.abspath('../..'))
+
+
+# -- Project information -----------------------------------------------------
+
+project = u'pytorch-transformers'
+copyright = u'2019, huggingface'
+author = u'huggingface'
+
+# The short X.Y version
+version = u''
+# The full version, including alpha/beta/rc tags
+release = u'1.0.0'
+
+
+# -- General configuration ---------------------------------------------------
+
+# If your documentation needs a minimal Sphinx version, state it here.
+#
+# needs_sphinx = '1.0'
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = [
+    'sphinx.ext.autodoc',
+    'sphinx.ext.coverage',
+    'sphinx.ext.napoleon',
+    'recommonmark',
+    'sphinx.ext.viewcode'
+]
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# The suffix(es) of source filenames.
+# You can specify multiple suffix as a list of string:
+#
+source_suffix = ['.rst', '.md']
+# source_suffix = '.rst'
+
+# The master toctree document.
+master_doc = 'index'
+
+# The language for content autogenerated by Sphinx. Refer to documentation
+# for a list of supported languages.
+#
+# This is also used if you do content translation via gettext catalogs.
+# Usually you set "language" from the command line for these cases.
+language = None
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+exclude_patterns = [u'_build', 'Thumbs.db', '.DS_Store']
+
+# The name of the Pygments (syntax highlighting) style to use.
+pygments_style = None
+
+
+# -- Options for HTML output -------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages.  See the documentation for
+# a list of builtin themes.
+#
+html_theme = 'sphinx_rtd_theme'
+
+# Theme options are theme-specific and customize the look and feel of a theme
+# further.  For a list of options available for each theme, see the
+# documentation.
+#
+html_theme_options = {
+    'analytics_id': 'UA-83738774-2'
+}
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+
+# Custom sidebar templates, must be a dictionary that maps document names
+# to template names.
+#
+# The default sidebars (for documents that don't match any pattern) are
+# defined by theme itself.  Builtin themes are using these templates by
+# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
+# 'searchbox.html']``.
+#
+# html_sidebars = {}
+
+
+# -- Options for HTMLHelp output ---------------------------------------------
+
+# Output file base name for HTML help builder.
+htmlhelp_basename = 'pytorch-transformersdoc'
+
+
+# -- Options for LaTeX output ------------------------------------------------
+
+latex_elements = {
+    # The paper size ('letterpaper' or 'a4paper').
+    #
+    # 'papersize': 'letterpaper',
+
+    # The font size ('10pt', '11pt' or '12pt').
+    #
+    # 'pointsize': '10pt',
+
+    # Additional stuff for the LaTeX preamble.
+    #
+    # 'preamble': '',
+
+    # Latex figure (float) alignment
+    #
+    # 'figure_align': 'htbp',
+}
+
+# Grouping the document tree into LaTeX files. List of tuples
+# (source start file, target name, title,
+#  author, documentclass [howto, manual, or own class]).
+latex_documents = [
+    (master_doc, 'pytorch-transformers.tex', u'pytorch-transformers Documentation',
+     u'huggingface', 'manual'),
+]
+
+
+# -- Options for manual page output ------------------------------------------
+
+# One entry per manual page. List of tuples
+# (source start file, name, description, authors, manual section).
+man_pages = [
+    (master_doc, 'pytorch-transformers', u'pytorch-transformers Documentation',
+     [author], 1)
+]
+
+
+# -- Options for Texinfo output ----------------------------------------------
+
+# Grouping the document tree into Texinfo files. List of tuples
+# (source start file, target name, title, author,
+#  dir menu entry, description, category)
+texinfo_documents = [
+    (master_doc, 'pytorch-transformers', u'pytorch-transformers Documentation',
+     author, 'pytorch-transformers', 'One line description of project.',
+     'Miscellaneous'),
+]
+
+
+# -- Options for Epub output -------------------------------------------------
+
+# Bibliographic Dublin Core info.
+epub_title = project
+
+# The unique identifier of the text. This can be a ISBN number
+# or the project homepage.
+#
+# epub_identifier = ''
+
+# A unique identification for the text.
+#
+# epub_uid = ''
+
+# A list of files that should not be packed into the epub file.
+epub_exclude_files = ['search.html']
+
+def setup(app):
+    # add_css_file is the Sphinx >= 1.8 name for add_stylesheet (add_js_file below already requires 1.8)
+    app.add_css_file('css/huggingface.css')
+    app.add_css_file('css/code-snippets.css')
+    app.add_js_file('js/custom.js')
+
+# -- Extension configuration -------------------------------------------------
--- a/docs/source/converting_tensorflow_models.rst
+++ b/docs/source/converting_tensorflow_models.rst
+Converting Tensorflow Models
+================================================
+
+A command-line interface is provided to convert a TensorFlow checkpoint into a PyTorch dump of the ``BertForPreTraining`` class (for BERT), or a NumPy checkpoint into a PyTorch dump of the ``OpenAIGPTModel`` class (for OpenAI GPT).
+
+BERT
+^^^^
+
+You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) into a PyTorch save file by using the `convert_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py>`_ script.
+
+This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated configuration file (\ ``bert_config.json``\ ). It creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint into the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ , `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\ ).
+
+You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\ ``bert_config.json``\ ) and the vocabulary file (\ ``vocab.txt``\ ) as these are needed for the PyTorch model too.
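+
+Before deleting the TensorFlow checkpoint, it can be worth checking that the files the PyTorch model needs are all in place. The sketch below is only an illustration: the helper ``missing_model_files`` is hypothetical, and it assumes the converted weights were saved as ``pytorch_model.bin`` in the same directory as the configuration and vocabulary files.
+
```python
import os

# bert_config.json and vocab.txt come from the text above; pytorch_model.bin
# is an assumed output name for the converted weights.
REQUIRED_FILES = ("pytorch_model.bin", "bert_config.json", "vocab.txt")


def missing_model_files(model_dir):
    """Return the required files that are absent from model_dir."""
    return [
        name
        for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


if __name__ == "__main__":
    # Hypothetical path; replace with your own model directory.
    print(missing_model_files("/path/to/bert/uncased_L-12_H-768_A-12"))
```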
+
+To run this specific conversion script you will need to have TensorFlow and PyTorch installed (\ ``pip install tensorflow``\ ). The rest of the repository only requires PyTorch.
+
+Here is an example of the conversion process for a pre-trained ``BERT-Base Uncased`` model:
+
+.. code-block:: shell
+
+   export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
+
+   pytorch_transformers bert \
+     $BERT_BASE_DIR/bert_model.ckpt \
+     $BERT_BASE_DIR/bert_config.json \
+     $BERT_BASE_DIR/pytorch_model.bin
+
+You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/bert#pre-trained-models>`__.
+
+OpenAI GPT
+^^^^^^^^^^
+
+Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint is saved in the same format as the OpenAI pre-trained model (see `here <https://github.com/openai/finetune-transformer-lm>`__\ ):
+
+.. code-block:: shell
+
+   export OPENAI_GPT_CHECKPOINT_FOLDER_PATH=/path/to/openai/pretrained/numpy/weights
+
+   pytorch_transformers gpt \
+     $OPENAI_GPT_CHECKPOINT_FOLDER_PATH \
+     $PYTORCH_DUMP_OUTPUT \
+     [OPENAI_GPT_CONFIG]
+
+Transformer-XL
+^^^^^^^^^^^^^^
+
+Here is an example of the conversion process for a pre-trained Transformer-XL model (see `here <https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__\ ):
+
+.. code-block:: shell
+
+   export TRANSFO_XL_CHECKPOINT_FOLDER_PATH=/path/to/transfo/xl/checkpoint
+
+   pytorch_transformers transfo_xl \
+     $TRANSFO_XL_CHECKPOINT_FOLDER_PATH \
+     $PYTORCH_DUMP_OUTPUT \
+     [TRANSFO_XL_CONFIG]
+
+GPT-2
+^^^^^
+
+Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model:
+
+.. code-block:: shell
+
+   export GPT2_DIR=/path/to/gpt2/checkpoint
+
+   pytorch_transformers gpt2 \
+     $GPT2_DIR/model.ckpt \
+     $PYTORCH_DUMP_OUTPUT \
+     [GPT2_CONFIG]
+
+XLNet
+^^^^^
+
+Here is an example of the conversion process for a pre-trained XLNet model, fine-tuned on STS-B using the TensorFlow script:
+
+.. code-block:: shell
+
+   export TRANSFO_XL_CHECKPOINT_PATH=/path/to/xlnet/checkpoint
+   export TRANSFO_XL_CONFIG_PATH=/path/to/xlnet/config
+
+   pytorch_transformers xlnet \
+     $TRANSFO_XL_CHECKPOINT_PATH \
+     $TRANSFO_XL_CONFIG_PATH \
+     $PYTORCH_DUMP_OUTPUT \
+     STS-B
--- a/docs/source/examples.rst
+++ b/docs/source/examples.rst
+
+Examples
+================================================
+
+.. list-table::
+   :header-rows: 1
+
+   * - Sub-section
+     - Description
+   * - `Training large models: introduction, tools and examples <#introduction>`_
+     - How to use gradient accumulation, multi-GPU training, distributed training, optimization on CPU and 16-bit training to train BERT models
+   * - `Fine-tuning with BERT: running the examples <#fine-tuning-bert-examples>`_
+     - Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``extract_classif.py``\ , ``run_bert_classifier.py``\ , ``run_bert_squad.py`` and ``run_lm_finetuning.py``
+   * - `Fine-tuning with OpenAI GPT, Transformer-XL and GPT-2 <#fine-tuning>`_
+     - Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``run_openai_gpt.py``\ , ``run_transfo_xl.py`` and ``run_gpt2.py``
+   * - `Fine-tuning BERT-large on GPUs <#fine-tuning-bert-large>`_
+     - How to fine tune ``BERT large``
+
+
+.. _introduction:
+
+Training large models: introduction, tools and examples
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+BERT-base and BERT-large have 110M and 340M parameters respectively, and it can be difficult to fine-tune them on a single GPU with the batch size recommended for good performance (in most cases a batch size of 32).
+
+To help with fine-tuning these models, we have included several techniques that you can activate in the fine-tuning scripts `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\ : gradient accumulation, multi-GPU training, distributed training and 16-bit training. For more details on how to use these techniques you can read `the tips on training large batches in PyTorch <https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255>`_ that I published earlier this year.
+
+Here is how to use these techniques in our scripts:
+
+
+* **Gradient Accumulation**\ : Gradient accumulation can be used by supplying an integer greater than 1 to the ``--gradient_accumulation_steps`` argument. The batch at each step will be divided by this integer and gradients will be accumulated over ``gradient_accumulation_steps`` steps.
+* **Multi-GPU**\ : Multi-GPU is automatically activated when several GPUs are detected, and the batches are split over the GPUs.
+* **Distributed training**\ : Distributed training can be activated by supplying an integer greater than or equal to 0 to the ``--local_rank`` argument (see below).
+* **16-bit training**\ : 16-bit training, also called mixed-precision training, can reduce the memory requirement of your model on the GPU by using half-precision training, which roughly allows you to double the batch size. If you have a recent GPU (starting from the NVIDIA Volta architecture) you should see no decrease in speed. A good introduction to mixed-precision training can be found `here <https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/>`__ and the full documentation is `here <https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html>`__. In our scripts, this option can be activated by setting the ``--fp16`` flag, and you can play with loss scaling using the ``--loss_scale`` flag (see the previously linked documentation for details on loss scaling). The loss scale can be zero, in which case the scale is dynamically adjusted, or a positive power of two, in which case the scaling is static.
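+
+To make the gradient accumulation arithmetic concrete, here is a minimal pure-Python sketch (no PyTorch involved). It assumes a toy one-parameter model and a mean-squared-error loss, both invented for the illustration, and shows that accumulating the scaled gradients of the micro-batches reproduces the gradient of the full batch.
+
```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


train_batch_size = 8
gradient_accumulation_steps = 4
# Each forward/backward pass only sees a smaller micro-batch.
micro_batch_size = train_batch_size // gradient_accumulation_steps

# Toy data and starting weight, made up for this example.
data = [(float(i), 3.0 * i + 1.0) for i in range(train_batch_size)]
w = 0.5

full_grad = grad_mse(w, data)

accumulated = 0.0
for start in range(0, train_batch_size, micro_batch_size):
    micro_batch = data[start:start + micro_batch_size]
    # Dividing by the number of steps makes the sum equal the full-batch mean.
    accumulated += grad_mse(w, micro_batch) / gradient_accumulation_steps

print(micro_batch_size, abs(full_grad - accumulated) < 1e-9)
```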
+
+To use 16-bit training and distributed training, you need to install NVIDIA's apex extension `as detailed here <https://github.com/nvidia/apex>`__. You will find more information regarding the internals of ``apex`` and how to use ``apex`` in `the doc and the associated repository <https://github.com/nvidia/apex>`_. The results of the tests performed on pytorch-BERT by the NVIDIA team (and my trials at reproducing them) can be consulted in `the relevant PR of the present repository <https://github.com/huggingface/pytorch-pretrained-BERT/pull/116>`_.
+
+Note: To use *Distributed Training*\ , you will need to run one training script on each of your machines. This can be done for example by running the following command on each server (see `the above mentioned blog post <https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255>`_ for more details):
+
+.. code-block:: bash
+
+    python -m torch.distributed.launch \
+        --nproc_per_node=4 \
+        --nnodes=2 \
+        --node_rank=$THIS_MACHINE_INDEX \
+        --master_addr="192.168.1.1" \
+        --master_port=1234 run_bert_classifier.py \
+        (--arg1 --arg2 --arg3 and all other arguments of the run_classifier script)
+
+Where ``$THIS_MACHINE_INDEX`` is a sequential index assigned to each of your machines (0, 1, 2...) and the machine with rank 0 has the IP address ``192.168.1.1`` and an open port ``1234``.
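+
+The relationship between the launcher arguments above and the rank of each process can be sketched as follows. This is an illustration only: the helper ``global_rank`` is hypothetical, and it assumes that processes are numbered contiguously machine by machine.
+
```python
def global_rank(node_rank, nproc_per_node, local_rank):
    """Rank of a process across all machines, assuming contiguous numbering."""
    return node_rank * nproc_per_node + local_rank


nnodes = 2          # value of --nnodes in the command above
nproc_per_node = 4  # value of --nproc_per_node in the command above
world_size = nnodes * nproc_per_node

for node_rank in range(nnodes):
    ranks = [
        global_rank(node_rank, nproc_per_node, local_rank)
        for local_rank in range(nproc_per_node)
    ]
    print("machine", node_rank, "-> ranks", ranks, "of", world_size)
```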
+
+.. _fine-tuning-bert-examples:
+
+Fine-tuning with BERT: running the examples
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We showcase several fine-tuning examples based on (and extended from) `the original implementation <https://github.com/google-research/bert/>`_\ :
+
+
+* a *sequence-level classifier* on nine different GLUE tasks,
+* a *token-level classifier* on the question answering dataset SQuAD,
+* a *sequence-level multiple-choice classifier* on the SWAG classification corpus, and
+* a *BERT language model* on another target corpus.
+
+GLUE results on dev set
+~~~~~~~~~~~~~~~~~~~~~~~
+
+We get the following results on the dev set of the GLUE benchmark with an uncased BERT base
+model. All experiments were run on a P100 GPU with a batch size of 32.
+
+.. list-table::
+   :header-rows: 1
+
+   * - Task
+     - Metric
+     - Result
+   * - CoLA
+     - Matthews corr.
+     - 57.29
+   * - SST-2
+     - accuracy
+     - 93.00
+   * - MRPC
+     - F1/accuracy
+     - 88.85/83.82
+   * - STS-B
+     - Pearson/Spearman corr.
+     - 89.70/89.37
+   * - QQP
+     - accuracy/F1
+     - 90.72/87.41
+   * - MNLI
+     - matched acc./mismatched acc.
+     - 83.95/84.39
+   * - QNLI
+     - accuracy
+     - 89.04
+   * - RTE
+     - accuracy
+     - 61.01
+   * - WNLI
+     - accuracy
+     - 53.52
+
+
+Some of these results are significantly different from the ones reported on the test set
+of the GLUE benchmark on the website. For QQP and WNLI, please refer to `FAQ #12 <https://gluebenchmark.com/faq>`_ on the website.
+
+Before running any one of these GLUE tasks you should download the
+`GLUE data <https://gluebenchmark.com/tasks>`_ by running
+`this script <https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e>`_
+and unpack it to some directory ``$GLUE_DIR``.
+
+.. code-block:: shell
+
+   export GLUE_DIR=/path/to/glue
+   export TASK_NAME=MRPC
+
+   python run_bert_classifier.py \
+     --task_name $TASK_NAME \
+     --do_train \
+     --do_eval \
+     --do_lower_case \
+     --data_dir $GLUE_DIR/$TASK_NAME \
+     --bert_model bert-base-uncased \
+     --max_seq_length 128 \
+     --train_batch_size 32 \
+     --learning_rate 2e-5 \
+     --num_train_epochs 3.0 \
+     --output_dir /tmp/$TASK_NAME/
+
+where ``$TASK_NAME`` can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
+
+The dev set results will be written to the text file ``eval_results.txt`` in the specified ``output_dir``. In the case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate output folder called ``/tmp/MNLI-MM/`` in addition to ``/tmp/MNLI/``.
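
To compare runs programmatically, the results file can be read back into a dictionary. The sketch below assumes the file contains one ``key = value`` pair per line, as in the outputs shown in this document; ``parse_eval_results`` is a hypothetical helper, not part of the example scripts.

```python
def parse_eval_results(text):
    """Parse 'key = value' lines into a dict of floats.

    Lines without '=' (for instance a banner line) are skipped.
    """
    results = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        results[key.strip()] = float(value.strip())
    return results


if __name__ == "__main__":
    sample = "***** Eval results *****\n  acc = 0.85\n  global_step = 174\n"
    print(parse_eval_results(sample))
```

In practice, the text would come from reading ``eval_results.txt`` in the output directory.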
+
+Half-precision training with apex has only been tested on the MRPC, MNLI, CoLA and SST-2 tasks. The following section provides details on how to run half-precision training with MRPC. That said, there should not be any issues in running half-precision training with the remaining GLUE tasks, since the data processor for each task inherits from the base class ``DataProcessor``.
+
+MRPC
+~~~~
+
+This example code fine-tunes BERT on the Microsoft Research Paraphrase
+Corpus (MRPC) and runs in less than 10 minutes on a single K-80 and in 27 seconds (!) on a single Tesla V100 16GB with apex installed.
+
+Before running this example you should download the
+`GLUE data <https://gluebenchmark.com/tasks>`_ by running
+`this script <https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e>`_
+and unpack it to some directory ``$GLUE_DIR``.
+
+.. code-block:: shell
+
+   export GLUE_DIR=/path/to/glue
+
+   python run_bert_classifier.py \
+     --task_name MRPC \
+     --do_train \
+     --do_eval \
+     --do_lower_case \
+     --data_dir $GLUE_DIR/MRPC/ \
+     --bert_model bert-base-uncased \
+     --max_seq_length 128 \
+     --train_batch_size 32 \
+     --learning_rate 2e-5 \
+     --num_train_epochs 3.0 \
+     --output_dir /tmp/mrpc_output/
+
+Our tests, run on a few seeds with `the original implementation hyper-parameters <https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks>`__, gave evaluation results between 84% and 88%.
+
+**Fast run with apex and 16-bit precision: fine-tuning on MRPC in 27 seconds!**
+First install apex as indicated `here <https://github.com/NVIDIA/apex>`__.
+Then run
+
+.. code-block:: shell
+
+   export GLUE_DIR=/path/to/glue
+
+   python run_bert_classifier.py \
+     --task_name MRPC \
+     --do_train \
+     --do_eval \
+     --do_lower_case \
+     --data_dir $GLUE_DIR/MRPC/ \
+     --bert_model bert-base-uncased \
+     --max_seq_length 128 \
+     --train_batch_size 32 \
+     --learning_rate 2e-5 \
+     --num_train_epochs 3.0 \
+     --output_dir /tmp/mrpc_output/ \
+     --fp16
+
+**Distributed training**
+Here is an example using distributed training on 8 V100 GPUs and the BERT Whole Word Masking model to reach an F1 > 92 on MRPC:
+
+.. code-block:: bash
+
+    python -m torch.distributed.launch \
+        --nproc_per_node 8 run_bert_classifier.py \
+        --bert_model bert-large-uncased-whole-word-masking \
+        --task_name MRPC \
+        --do_train \
+        --do_eval \
+        --do_lower_case \
+        --data_dir $GLUE_DIR/MRPC/ \
+        --max_seq_length 128 \
+        --train_batch_size 8 \
+        --learning_rate 2e-5 \
+        --num_train_epochs 3.0 \
+        --output_dir /tmp/mrpc_output/
+
+Training with these hyper-parameters gave us the following results:
+
+.. code-block:: bash
+
+     acc = 0.8823529411764706
+     acc_and_f1 = 0.901702786377709
+     eval_loss = 0.3418912578906332
+     f1 = 0.9210526315789473
+     global_step = 174
+     loss = 0.07231863956341798
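
The ``acc_and_f1`` value reported above is consistent with the plain average of ``acc`` and ``f1``. The sketch below reproduces that arithmetic; the function is written here for illustration and is an assumption about how the combined metric is derived, not a copy of the script's code.

```python
def acc_and_f1(acc, f1):
    """Average of accuracy and F1, used as a single summary number."""
    return (acc + f1) / 2


if __name__ == "__main__":
    # Values copied from the MRPC results above.
    print(acc_and_f1(0.8823529411764706, 0.9210526315789473))
```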
+
+Here is an example on MNLI:
+
+.. code-block:: bash
+
+    python -m torch.distributed.launch \
+        --nproc_per_node 8 run_bert_classifier.py \
+        --bert_model bert-large-uncased-whole-word-masking \
+        --task_name mnli \
+        --do_train \
+        --do_eval \
+        --do_lower_case \
+        --data_dir /datadrive/bert_data/glue_data/MNLI/ \
+        --max_seq_length 128 \
+        --train_batch_size 8 \
+        --learning_rate 2e-5 \
+        --num_train_epochs 3.0 \
+        --output_dir ../models/wwm-uncased-finetuned-mnli/ \
+        --overwrite_output_dir
+
+.. code-block:: bash
+
+   ***** Eval results *****
+     acc = 0.8679706601466992
+     eval_loss = 0.4911287787382479
+     global_step = 18408
+     loss = 0.04755385363816904
+
+   ***** Eval results *****
+     acc = 0.8747965825874695
+     eval_loss = 0.45516540421714036
+     global_step = 18408
+     loss = 0.04755385363816904
+
+This is the model provided as ``bert-large-uncased-whole-word-masking-finetuned-mnli``.
+
+SQuAD
+~~~~~
+
+This example code fine-tunes BERT on the SQuAD dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large) on a single Tesla V100 16GB.
+
+The data for SQuAD can be downloaded with the following links and should be saved in a ``$SQUAD_DIR`` directory.
+
+
+* `train-v1.1.json <https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json>`_
+* `dev-v1.1.json <https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json>`_
+* `evaluate-v1.1.py <https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py>`_
+
+.. code-block:: shell
+
+   export SQUAD_DIR=/path/to/SQUAD
+
+   python run_bert_squad.py \
+     --bert_model bert-base-uncased \
+     --do_train \
+     --do_predict \
+     --do_lower_case \
+     --train_file $SQUAD_DIR/train-v1.1.json \
+     --predict_file $SQUAD_DIR/dev-v1.1.json \
+     --train_batch_size 12 \
+     --learning_rate 3e-5 \
+     --num_train_epochs 2.0 \
+     --max_seq_length 384 \
+     --doc_stride 128 \
+     --output_dir /tmp/debug_squad/
+
+Training with the previous hyper-parameters gave us the following results:
+
+.. code-block:: bash
+
+   python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json /tmp/debug_squad/predictions.json
+   {"f1": 88.52381567990474, "exact_match": 81.22043519394512}
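
For readers unfamiliar with the two SQuAD metrics, here is a minimal sketch of how exact match and token-level F1 can be computed for a single prediction. It follows the general approach of the evaluation script (lowercasing, stripping punctuation and articles, then comparing tokens) but is a simplified re-implementation written for illustration. The full evaluation additionally takes the best score over several reference answers and averages over all questions.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower", "eiffel tower"))
    print(f1_score("Eiffel Tower in Paris", "the Eiffel Tower"))
```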
+
+**Distributed training**
+
+Here is an example using distributed training on 8 V100 GPUs and the BERT Whole Word Masking uncased model to reach an F1 > 93 on SQuAD:
+
+.. code-block:: bash
+
+   python -m torch.distributed.launch --nproc_per_node=8 \
+    run_bert_squad.py \
+    --bert_model bert-large-uncased-whole-word-masking  \
+    --do_train \
+    --do_predict \
+    --do_lower_case \
+    --train_file $SQUAD_DIR/train-v1.1.json \
+    --predict_file $SQUAD_DIR/dev-v1.1.json \
+    --learning_rate 3e-5 \
+    --num_train_epochs 2 \
+    --max_seq_length 384 \
+    --doc_stride 128 \
+    --output_dir ../models/wwm_uncased_finetuned_squad/ \
+    --train_batch_size 24 \
+    --gradient_accumulation_steps 12
+
+Training with these hyper-parameters gave us the following results:
+
+.. code-block:: bash
+
+   python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_uncased_finetuned_squad/predictions.json
+   {"exact_match": 86.91579943235573, "f1": 93.1532499015869}
+
+This is the model provided as ``bert-large-uncased-whole-word-masking-finetuned-squad``.
+
+And here is the model provided as ``bert-large-cased-whole-word-masking-finetuned-squad``\ :
+
+.. code-block:: bash
+
+    python -m torch.distributed.launch --nproc_per_node=8  run_bert_squad.py \
+        --bert_model bert-large-cased-whole-word-masking \
+        --do_train \
+        --do_predict \
+        --do_lower_case \
+        --train_file $SQUAD_DIR/train-v1.1.json \
+        --predict_file $SQUAD_DIR/dev-v1.1.json \
+        --learning_rate 3e-5 \
+        --num_train_epochs 2 \
+        --max_seq_length 384 \
+        --doc_stride 128 \
+        --output_dir ../models/wwm_cased_finetuned_squad/ \
+        --train_batch_size 24 \
+        --gradient_accumulation_steps 12
+
+Training with these hyper-parameters gave us the following results:
+
+.. code-block:: bash
+
+   python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ../models/wwm_cased_finetuned_squad/predictions.json
+   {"exact_match": 84.18164616840113, "f1": 91.58645594850135}
+
+SWAG
+~~~~
+
+The data for SWAG can be downloaded by cloning the following `repository <https://github.com/rowanz/swagaf>`_.
+
+.. code-block:: shell
+
+   export SWAG_DIR=/path/to/SWAG
+
+   python run_bert_swag.py \
+     --bert_model bert-base-uncased \
+     --do_train \
+     --do_lower_case \
+     --do_eval \
+     --data_dir $SWAG_DIR/data \
+     --train_batch_size 16 \
+     --learning_rate 2e-5 \
+     --num_train_epochs 3.0 \
+     --max_seq_length 80 \
+     --output_dir /tmp/swag_output/ \
+     --gradient_accumulation_steps 4
+
+Training with the previous hyper-parameters on a single GPU gave us the following results:
+
+.. code-block::
+
+   eval_accuracy = 0.8062081375587323
+   eval_loss = 0.5966546792367169
+   global_step = 13788
+   loss = 0.06423990014260186
+
+LM Fine-tuning
+~~~~~~~~~~~~~~
+
+The data should be a text file in the same format as `sample_text.txt <./samples/sample_text.txt>`_ (one sentence per line, documents separated by an empty line).
+You can download an `exemplary training corpus <https://ext-bert-sample.obs.eu-de.otc.t-systems.com/small_wiki_sentence_corpus.txt>`_ generated from Wikipedia articles and split into ~500k sentences with spaCy.
+Training one epoch on this corpus takes about 1 h 20 min on 4 x NVIDIA Tesla P100 with ``train_batch_size=200`` and ``max_seq_length=128``.
+
+Thanks to the work of @Rocketknight1 and @tholor, there are now **several scripts** that can be used to fine-tune BERT using the pretraining objective (a combination of masked-language modeling and next sentence prediction loss). These scripts are detailed in the `README <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/lm_finetuning/README.md>`_ of the `examples/lm_finetuning/ <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/lm_finetuning/>`_ folder.
+
+.. _fine-tuning:
+
+OpenAI GPT, Transformer-XL and GPT-2: running the examples
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We provide three examples of scripts for OpenAI GPT, Transformer-XL and OpenAI GPT-2 based on (and extended from) the respective original implementations:
+
+
+* fine-tuning OpenAI GPT on the ROCStories dataset
+* evaluating Transformer-XL on Wikitext 103
+* unconditional and conditional generation from a pre-trained OpenAI GPT-2 model
+
+Fine-tuning OpenAI GPT on the ROCStories dataset
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This example code fine-tunes OpenAI GPT on the ROCStories dataset.
+
+Before running this example you should download the
+`ROCStories dataset <https://github.com/snigdhac/StoryComprehension_EMNLP/tree/master/Dataset/RoCStories>`_ and unpack it to some directory ``$ROC_STORIES_DIR``.
+
+.. code-block:: shell
+
+   export ROC_STORIES_DIR=/path/to/RocStories
+
+   python run_openai_gpt.py \
+     --model_name openai-gpt \
+     --do_train \
+     --do_eval \
+     --train_dataset $ROC_STORIES_DIR/cloze_test_val__spring2016\ -\ cloze_test_ALL_val.csv \
+     --eval_dataset $ROC_STORIES_DIR/cloze_test_test__spring2016\ -\ cloze_test_ALL_test.csv \
+     --output_dir ../log \
+     --train_batch_size 16
+
+This command runs in about 10 min on a single K-80 and gives an evaluation accuracy of about 87.7% (the authors report a median accuracy with the TensorFlow code of 85.8% and the OpenAI GPT paper reports a best single run accuracy of 86.5%).
+
+Evaluating the pre-trained Transformer-XL on the WikiText 103 dataset
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This example code evaluates the pre-trained Transformer-XL on the WikiText 103 dataset.
+This command will download a pre-processed version of the WikiText 103 dataset in which the vocabulary has been computed.
+
+.. code-block:: shell
+
+   python run_transfo_xl.py --work_dir ../log
+
+This command runs in about 1 min on a V100 and gives an evaluation perplexity of 18.22 on WikiText-103 (the authors report a perplexity of about 18.3 on this dataset with the TensorFlow code).
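
As a reminder of what the reported number measures, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that relationship on toy values; it is a generic illustration, not the evaluation code of ``run_transfo_xl.py``.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities assigned to each token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that gives every token probability 1/4 has perplexity 4.
    print(perplexity([math.log(0.25)] * 10))
```

Lower values are better: a perplexity of about 18 means the model is, on average, as uncertain as if it were choosing uniformly among roughly 18 tokens at each position.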
+
+Unconditional and conditional generation from OpenAI's GPT-2 model
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This example code is identical to the original unconditional and conditional generation code.
+
+Conditional generation:
+
+.. code-block:: shell
+
+   python run_gpt2.py
+
+Unconditional generation:
+
+.. code-block:: shell
+
+   python run_gpt2.py --unconditional
+
+The same options as in the original scripts are provided; please refer to the code of the example and the original repository of OpenAI.
+
+.. _fine-tuning-BERT-large:
+
+Fine-tuning BERT-large on GPUs
+------------------------------
+
+The options we list above make it possible to fine-tune BERT-large rather easily on GPU(s) instead of the TPU used by the original implementation.
+
+For example, fine-tuning BERT-large on SQuAD can be done on a server with 4 K-80 GPUs (these are pretty old now) in 18 hours. Our results are similar to the TensorFlow implementation results (actually slightly higher):
+
+.. code-block:: bash
+
+   {"exact_match": 84.56953642384106, "f1": 91.04028647786927}
+
+To get these results we used a combination of:
+
+
+* multi-GPU training (automatically activated on a multi-GPU server),
+* 2 steps of gradient accumulation, and
+* performing the optimization step on CPU to store Adam's averages in RAM.
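
To make the gradient accumulation item concrete, the toy sketch below checks that accumulating gradients over equally sized micro-batches, each scaled by ``1 / steps``, gives the same gradient as a single pass over the full batch. It uses a one-parameter least-squares model in plain Python; the function names are hypothetical and the example scripts implement this with PyTorch instead.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, steps):
    """Accumulate gradients over `steps` equally sized micro-batches."""
    micro = len(xs) // steps
    total = 0.0
    for i in range(steps):
        part = slice(i * micro, (i + 1) * micro)
        # Scaling each micro-batch by 1 / steps keeps the sum equal to
        # the gradient of the mean loss over the full batch.
        total += grad_mse(w, xs[part], ys[part]) / steps
    return total


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
    print(grad_mse(0.5, xs, ys), accumulated_grad(0.5, xs, ys, steps=2))
```

Only one micro-batch has to fit in GPU memory at a time, which is what makes the larger effective batch size affordable.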
+
+Here is the full list of hyper-parameters for this run:
+
+.. code-block:: bash
+
+   export SQUAD_DIR=/path/to/SQUAD
+
+   python ./run_bert_squad.py \
+     --bert_model bert-large-uncased \
+     --do_train \
+     --do_predict \
+     --do_lower_case \
+     --train_file $SQUAD_DIR/train-v1.1.json \
+     --predict_file $SQUAD_DIR/dev-v1.1.json \
+     --learning_rate 3e-5 \
+     --num_train_epochs 2 \
+     --max_seq_length 384 \
+     --doc_stride 128 \
+     --output_dir /tmp/debug_squad/ \
+     --train_batch_size 24 \
+     --gradient_accumulation_steps 2
+
+If you have a recent GPU (starting from NVIDIA Volta series), you should try **16-bit fine-tuning** (FP16).
+
+Here is an example of hyper-parameters for an FP16 run we tried:
+
+.. code-block:: bash
+
+   export SQUAD_DIR=/path/to/SQUAD
+
+   python ./run_bert_squad.py \
+     --bert_model bert-large-uncased \
+     --do_train \
+     --do_predict \
+     --do_lower_case \
+     --train_file $SQUAD_DIR/train-v1.1.json \
+     --predict_file $SQUAD_DIR/dev-v1.1.json \
+     --learning_rate 3e-5 \
+     --num_train_epochs 2 \
+     --max_seq_length 384 \
+     --doc_stride 128 \
+     --output_dir /tmp/debug_squad/ \
+     --train_batch_size 24 \
+     --fp16 \
+     --loss_scale 128
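
The ``--loss_scale`` option exists because very small gradients underflow to zero in half precision. The sketch below illustrates the idea of static loss scaling with the standard library only: a value is rounded to FP16 with and without scaling, then unscaled in full precision. It is a simplified illustration under these assumptions, not how apex implements it.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def scaled_grad(grad, loss_scale):
    """Store a scaled gradient in FP16, then unscale it in full precision."""
    return to_fp16(grad * loss_scale) / loss_scale


if __name__ == "__main__":
    tiny = 1e-8
    print("without scaling:", to_fp16(tiny))
    print("with loss scale 128:", scaled_grad(tiny, 128))
```

Without scaling, the tiny gradient is lost entirely; with a scale of 128 it survives with a small rounding error.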
+
+The results were similar to the above FP32 results (actually slightly higher):
+
+.. code-block:: bash
+
+   {"exact_match": 84.65468306527909, "f1": 91.238669287002}
+
+Here is an example with the recent ``bert-large-uncased-whole-word-masking``\ :
+
+.. code-block:: bash
+
+   python -m torch.distributed.launch --nproc_per_node=8 \
+     run_bert_squad.py \
+     --bert_model bert-large-uncased-whole-word-masking \
+     --do_train \
+     --do_predict \
+     --do_lower_case \
+     --train_file $SQUAD_DIR/train-v1.1.json \
+     --predict_file $SQUAD_DIR/dev-v1.1.json \
+     --learning_rate 3e-5 \
+     --num_train_epochs 2 \
+     --max_seq_length 384 \
+     --doc_stride 128 \
+     --output_dir /tmp/debug_squad/ \
+     --train_batch_size 24 \
+     --gradient_accumulation_steps 2
+
+Fine-tuning XLNet
+-----------------
+
+STS-B
+~~~~~
+
+This example code fine-tunes XLNet on the STS-B corpus.
+
+Before running this example you should download the
+`GLUE data <https://gluebenchmark.com/tasks>`_ by running
+`this script <https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e>`_
+and unpack it to some directory ``$GLUE_DIR``.
+
+.. code-block:: shell
+
+   export GLUE_DIR=/path/to/glue
+
+   python run_xlnet_classifier.py \
+    --task_name STS-B \
+    --do_train \
+    --do_eval \
+    --data_dir $GLUE_DIR/STS-B/ \
+    --max_seq_length 128 \
+    --train_batch_size 8 \
+    --gradient_accumulation_steps 1 \
+    --learning_rate 5e-5 \
+    --num_train_epochs 3.0 \
+    --output_dir /tmp/stsb_output/
+
+Our tests, run on a few seeds with `the original implementation hyper-parameters <https://github.com/zihangdai/xlnet#1-sts-b-sentence-pair-relevance-regression-with-gpus>`__, gave evaluation results between 84% and 88%.
+
+**Distributed training**
+Here is an example using distributed training on 8 V100 GPUs to reach XXXX:
+
+.. code-block:: bash
+
+   python -m torch.distributed.launch --nproc_per_node 8 \
+    run_xlnet_classifier.py \
+    --task_name STS-B \
+    --do_train \
+    --do_eval \
+    --data_dir $GLUE_DIR/STS-B/ \
+    --max_seq_length 128 \
+    --train_batch_size 8 \
+    --gradient_accumulation_steps 1 \
+    --learning_rate 5e-5 \
+    --num_train_epochs 3.0 \
+    --output_dir /tmp/stsb_output/
+
+Training with these hyper-parameters gave us the following results:
+
+.. code-block:: bash
+
+     acc = 0.8823529411764706
+     acc_and_f1 = 0.901702786377709
+     eval_loss = 0.3418912578906332
+     f1 = 0.9210526315789473
+     global_step = 174
+     loss = 0.07231863956341798
+