PyTorch-Transformers is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).
The library currently contains PyTorch implementations, pre-trained model weights, usage scripts and conversion utilities for the following models:
1. **[BERT](https://github.com/google-research/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
2. **[GPT](https://github.com/openai/finetune-transformer-lm)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
3. **[GPT-2](https://blog.openai.com/better-language-models/)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
4. **[Transformer-XL](https://github.com/kimiyoung/transformer-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
These implementations have been tested on several datasets (see the example scripts) and should match the performance of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on performance in the Examples section of the [documentation](https://huggingface.co/pytorch-transformers/examples.html).
| Section | Description |
|-|-|
...
...
| [Quick tour: Fine-tuning/usage scripts](#quick-tour-fine-tuningusage-scripts) | Using provided scripts: GLUE, SQuAD and Text generation |
| [Migrating from pytorch-pretrained-bert to pytorch-transformers](#migrating-from-pytorch-pretrained-bert-to-pytorch-transformers) | Migrating your code from pytorch-pretrained-bert to pytorch-transformers |
| [Documentation](https://huggingface.co/pytorch-transformers/) | Full API documentation and more |
## Installation
...
...
The library comprises several example scripts with SOTA performances for NLU and NLG tasks:
- `run_glue.py`: an example fine-tuning Bert, XLNet and XLM on nine different GLUE tasks (*sequence-level classification*)
- `run_squad.py`: an example fine-tuning Bert, XLNet and XLM on the question answering dataset SQuAD 2.0 (*token-level classification*)
- `run_generation.py`: an example using GPT, GPT-2, Transformer-XL and XLNet for conditional language generation
- other model-specific examples (see the documentation).
Here are three quick usage examples for these scripts:
### `run_glue.py`: Fine-tuning on GLUE tasks for sequence classification
The [General Language Understanding Evaluation (GLUE) benchmark](https://gluebenchmark.com/) is a collection of nine sentence- or sentence-pair language understanding tasks for evaluating and analyzing natural language understanding systems.
...
...
loss = 0.07231863956341798
```
### `run_squad.py`: Fine-tuning on SQuAD for question-answering
This example code fine-tunes the BERT Whole Word Masking uncased model on the SQuAD dataset using distributed training on 8 V100 GPUs, reaching an F1 > 93 on SQuAD:
This is the model provided as `bert-large-uncased-whole-word-masking-finetuned-squad`.
### `run_generation.py`: Text generation with GPT, GPT-2, Transformer-XL and XLNet
A conditional generation script is also included to generate text from a prompt.
The generation script includes the [tricks](https://github.com/rusiaaman/XLNet-gen#methodology) proposed by Aman Rusia to get high-quality generation with memory models like Transformer-XL and XLNet (adding a predefined text to make short inputs longer).
A command-line interface is provided to convert a TensorFlow checkpoint into a PyTorch dump of the ``BertForPreTraining`` class (for BERT) or a NumPy checkpoint into a PyTorch dump of the ``OpenAIGPTModel`` class (for OpenAI GPT).
With pip
^^^^^^^^
PyTorch pretrained bert can be installed with pip as follows:
.. code-block:: bash

   pip install pytorch-pretrained-bert
Tests
^^^^^

An extensive test suite is included for the library and the example scripts. Library tests can be found in the `tests folder <https://github.com/huggingface/pytorch-transformers/tree/master/pytorch_transformers/tests>`_ and examples tests in the `examples folder <https://github.com/huggingface/pytorch-transformers/tree/master/examples>`_.

These tests can be run using ``pytest`` (install pytest if needed with ``pip install pytest``).

You can run the tests from the root of the cloned repository with the command:

.. code-block:: bash

   python -m pytest -sv tests/

OpenAI GPT original tokenization workflow
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you want to reproduce the original tokenization process of the ``OpenAI GPT`` paper, you will need to install ``ftfy`` (limit to version 4.4.3 if you are using Python 2) and ``SpaCy``:

.. code-block:: bash

   pip install spacy ftfy==4.4.3
   python -m spacy download en

If you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer will default to tokenize using BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage).
Here is a quick summary of what you should take care of when migrating from `pytorch-pretrained-bert` to `pytorch-transformers`:
### Models always output `tuples`
The main breaking change when migrating from `pytorch-pretrained-bert` to `pytorch-transformers` is that the models' forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters.
The exact content of the tuples for each model is detailed in the models' docstrings and the [documentation](https://huggingface.co/pytorch-transformers/).
In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in `pytorch-pretrained-bert`.
Here is a `pytorch-pretrained-bert` to `pytorch-transformers` conversion example for a `BertForSequenceClassification` classification model:
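The conversion snippet itself is not reproduced in this excerpt; a minimal sketch of the change, with an illustrative input sentence and label, could look like this:

```python
import torch
from pytorch_transformers import BertTokenizer, BertForSequenceClassification

# Illustrative input: a single dummy sentence and label
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
input_ids = torch.tensor([tokenizer.encode("Hello, my dog is cute")])
labels = torch.tensor([1])

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
model.eval()

# pytorch-pretrained-bert returned the loss directly when labels were given:
#     loss = model(input_ids, labels=labels)
# pytorch-transformers always returns a tuple; the loss is its first element:
outputs = model(input_ids, labels=labels)
loss = outputs[0]
```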
While not a breaking change, the serialization methods have been standardized, and you should probably switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.
### Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules
The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer.
The new `AdamW` optimizer matches the PyTorch `Adam` optimizer API.
The schedules are now standard [PyTorch learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) and not part of the optimizer anymore.
Here is a conversion example from `BertAdam` with a linear warmup and decay schedule to `AdamW` with the same schedule:
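The original snippet is not included in this excerpt; here is a minimal sketch of what such a conversion can look like (the stand-in model, learning rate and step counts are illustrative placeholders):

```python
import torch
from pytorch_transformers import AdamW, WarmupLinearSchedule

# Illustrative hyper-parameters and a stand-in model
lr, num_warmup_steps, num_total_steps = 1e-3, 100, 1000
model = torch.nn.Linear(10, 2)

# Previously (pytorch-pretrained-bert), the optimizer handled the schedule itself:
#     optimizer = BertAdam(model.parameters(), lr=lr,
#                          warmup=num_warmup_steps / num_total_steps,
#                          t_total=num_total_steps)

# Now (pytorch-transformers): a plain AdamW optimizer plus a separate scheduler.
# correct_bias=False keeps BertAdam's behavior of skipping Adam's bias correction.
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps,
                                 t_total=num_total_steps)

for step in range(num_total_steps):
    loss = model(torch.randn(4, 10)).sum()  # dummy forward pass and loss
    loss.backward()
    optimizer.step()        # update the weights
    scheduler.step()        # advance the linear warmup/decay schedule
    optimizer.zero_grad()
```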
Here is a quick-start example using the `GPT2Tokenizer` and `GPT2LMHeadModel` classes with OpenAI's pre-trained model to predict the next token from a text prompt.
First let's prepare a tokenized input from our text string using `GPT2Tokenizer`
assert predicted_text == 'Who was Jim Henson? Jim Henson was a man'
```
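Only the final assertion of this example survives in the excerpt above; a minimal sketch of the elided steps, assuming the standard GPT-2 quick-start flow, might look like the following (the prompt string and intermediate variable names are illustrative, and the completion should be checked against the full quick tour):

```python
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

# Encode the prompt
text = "Who was Jim Henson? Jim Henson was a"
indexed_tokens = tokenizer.encode(text)
tokens_tensor = torch.tensor([indexed_tokens])

# Predict the next token (the model returns a tuple; the logits come first)
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

# Take the most likely next token and decode the full sequence
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
# expected to match the assertion above: 'Who was Jim Henson? Jim Henson was a man'
```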
Examples for each model class of each model architecture (Bert, GPT, GPT-2, Transformer-XL, XLNet and XLM) can be found in the [documentation](#documentation).
Here is a quick-start example using ``BertTokenizer``\ , ``BertModel`` and ``BertForMaskedLM`` classes with Google AI's pre-trained ``Bert base uncased`` model. See the `doc section <./model_doc/overview.html>`_ below for all the details on these classes.
First let's prepare a tokenized input with ``BertTokenizer``
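The corresponding code is not included in this excerpt; a minimal sketch of that tokenization step, assuming the standard quick-start inputs (the sentences, the masked position and the segment ids below are illustrative), could be:

.. code-block:: python

   import torch
   from pytorch_transformers import BertTokenizer

   # Load the pre-trained tokenizer (vocabulary)
   tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

   # Tokenize a pair of sentences with the special BERT delimiters
   text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
   tokenized_text = tokenizer.tokenize(text)

   # Mask one token (here 'henson' in the second sentence) to predict back with ``BertForMaskedLM``
   masked_index = 8
   tokenized_text[masked_index] = '[MASK]'

   # Convert tokens to vocabulary indices and build the segment ids (0 for the
   # first sentence, 1 for the second), then convert everything to PyTorch tensors
   indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
   segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
   tokens_tensor = torch.tensor([indexed_tokens])
   segments_tensors = torch.tensor([segments_ids])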
Here is a quick-start example using ``OpenAIGPTTokenizer``\ , ``OpenAIGPTModel`` and ``OpenAIGPTLMHeadModel`` classes with OpenAI's pre-trained model. See the `doc section <./model_doc/overview.html>`_ for all the details on these classes.
First let's prepare a tokenized input with ``OpenAIGPTTokenizer``
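Again, the code itself is not reproduced here; a short sketch of the same step for OpenAI GPT (the input sentence is illustrative) might be:

.. code-block:: python

   import torch
   from pytorch_transformers import OpenAIGPTTokenizer

   # Load the pre-trained tokenizer (vocabulary)
   tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')

   # Tokenize the input, convert it to vocabulary indices, then to a PyTorch tensor
   text = "Who was Jim Henson ? Jim Henson was a puppeteer"
   tokenized_text = tokenizer.tokenize(text)
   indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
   tokens_tensor = torch.tensor([indexed_tokens])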
Here is a quick-start example using ``GPT2Tokenizer``\ , ``GPT2Model`` and ``GPT2LMHeadModel`` classes with OpenAI's pre-trained model. See the `doc section <./model_doc/overview.html>`_ for all the details on these classes.
First let's prepare a tokenized input with ``GPT2Tokenizer``
.. code-block:: python

   import torch
   from pytorch_transformers import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel

   # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
   import logging
   logging.basicConfig(level=logging.INFO)

   # Load pre-trained model tokenizer (vocabulary)
   tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

   # Encode some inputs
   text_1 = "Who was Jim Henson ?"
   text_2 = "Jim Henson was a puppeteer"
   indexed_tokens_1 = tokenizer.encode(text_1)
   indexed_tokens_2 = tokenizer.encode(text_2)

   # Convert inputs to PyTorch tensors
   tokens_tensor_1 = torch.tensor([indexed_tokens_1])
   tokens_tensor_2 = torch.tensor([indexed_tokens_2])
Let's see how to use ``GPT2Model`` to get hidden states
.. code-block:: python

   # Load pre-trained model (weights)
   model = GPT2Model.from_pretrained('gpt2')
   model.eval()

   # If you have a GPU, put everything on cuda
   tokens_tensor_1 = tokens_tensor_1.to('cuda')
   tokens_tensor_2 = tokens_tensor_2.to('cuda')
   model.to('cuda')

   # Predict hidden states features for each layer
   with torch.no_grad():
       hidden_states_1, past = model(tokens_tensor_1)
       # past can be used to reuse precomputed hidden states in subsequent predictions
       # (see beam-search examples in the run_gpt2.py example).
       hidden_states_2, past = model(tokens_tensor_2, past=past)
And how to use ``GPT2LMHeadModel``
.. code-block:: python

   # Load pre-trained model (weights)
   model = GPT2LMHeadModel.from_pretrained('gpt2')
   model.eval()

   # If you have a GPU, put everything on cuda
   tokens_tensor_1 = tokens_tensor_1.to('cuda')
   tokens_tensor_2 = tokens_tensor_2.to('cuda')
   model.to('cuda')

   # Predict all tokens
   with torch.no_grad():
       predictions_1, past = model(tokens_tensor_1)
       # past can be used to reuse precomputed hidden states in subsequent predictions
       # (see beam-search examples in the run_gpt2.py example).
       predictions_2, past = model(tokens_tensor_2, past=past)
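From here the example can be continued by decoding the most likely next token; a short sketch of that final step, reusing ``predictions_2`` and ``tokenizer`` from the blocks above, might be:

.. code-block:: python

   # Get the most likely token following the second sentence and decode it
   predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
   predicted_token = tokenizer.decode([predicted_index])
   print(predicted_token)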