Unverified Commit e4f9dca0 authored by Thomas Wolf's avatar Thomas Wolf Committed by GitHub

Merge pull request #773 from huggingface/doc-sphinx

Sphinx doc, XLM Checkpoints
parents d216e798 b87eb82b
@@ -4,6 +4,7 @@ jobs:
    working_directory: ~/pytorch-transformers
    docker:
      - image: circleci/python:3.5
    resource_class: large
    steps:
      - checkout
      - run: sudo pip install --progress-bar off .
@@ -14,6 +15,7 @@ jobs:
      - run: codecov
  build_py2:
    working_directory: ~/pytorch-transformers
    resource_class: large
    docker:
      - image: circleci/python:2.7
    steps:
...
.highlight .c1, .highlight .sd {
    color: #999
}
.highlight .nn, .highlight .k, .highlight .s1, .highlight .nb, .highlight .bp, .highlight .kc {
    color: #FB8D68;
}
.highlight .kn, .highlight .nv, .highlight .s2, .highlight .ow {
    color: #6670FF;
}
@@ -6,6 +6,7 @@
/* To keep the logo centered */
.wy-side-scroll {
    width: auto;
    font-size: 20px;
}

/* The div that holds the Hugging Face logo */
@@ -104,10 +105,34 @@ a {
    background-color: #6670FF;
}

/* Source spans */
.rst-content .viewcode-link, .rst-content .viewcode-back {
    color: #6670FF;
    font-size: 110%;
    letter-spacing: 2px;
    text-transform: uppercase;
}

.footer {
    margin-top: 20px;
}

.footer__Social {
    display: flex;
    flex-direction: row;
}

.footer__CustomImage {
    margin: 2px 5px 0 0;
}

/* FONTS */
body {
    font-family: Calibre;
    font-size: 16px;
}

h1 {
...
function addIcon() {
    const huggingFaceLogo = "http://lysand.re/huggingface_logo.svg";
    const image = document.createElement("img");
    image.setAttribute("src", huggingFaceLogo);

    const div = document.createElement("div");
    div.appendChild(image);
    div.style.textAlign = 'center';
    div.style.paddingTop = '30px';
    div.style.backgroundColor = '#6670FF';

    const scrollDiv = document.getElementsByClassName("wy-side-scroll")[0];
    scrollDiv.prepend(div);
}

function addCustomFooter() {
    const customFooter = document.createElement("div");
    const questionOrIssue = document.createElement("div");
    questionOrIssue.innerHTML = "Stuck? Read our <a href='https://medium.com/huggingface'>Blog posts</a> or <a href='https://github.com/huggingface/pytorch_transformers'>Create an issue</a>";
    customFooter.appendChild(questionOrIssue);
    customFooter.classList.add("footer");

    const social = document.createElement("div");
    social.classList.add("footer__Social");

    const imageDetails = [
        { link: "https://huggingface.co", imageLink: "http://lysand.re/icons/website.svg" },
        { link: "https://twitter.com/huggingface", imageLink: "http://lysand.re/icons/twitter.svg" },
        { link: "https://github.com/huggingface", imageLink: "http://lysand.re/icons/github.svg" },
        { link: "https://www.linkedin.com/company/huggingface/", imageLink: "http://lysand.re/icons/linkedin.svg" }
    ];

    imageDetails.forEach(imageLinks => {
        const link = document.createElement("a");
        const image = document.createElement("img");
        image.src = imageLinks.imageLink;
        link.href = imageLinks.link;
        image.style.width = "30px";
        image.classList.add("footer__CustomImage");
        link.appendChild(image);
        social.appendChild(link);
    });

    customFooter.appendChild(social);
    document.getElementsByTagName("footer")[0].appendChild(customFooter);
}

function onLoad() {
    addIcon();
    addCustomFooter();
}

window.addEventListener("load", onLoad);
BERTology
---------

There is a growing field of study concerned with investigating the inner workings of large-scale transformers like BERT (that some call "BERTology"). Some good examples of this field are:
* BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick: https://arxiv.org/abs/1905.05950
* Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
* What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: https://arxiv.org/abs/1906.04341
In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to help people access the inner representations, mainly adapted from the great work of Paul Michel (https://arxiv.org/abs/1905.10650):
* accessing all the hidden-states of BERT/GPT/GPT-2,
* accessing all the attention weights for each head of BERT/GPT/GPT-2,
* retrieving head output values and gradients to compute head importance scores and prune heads as explained in https://arxiv.org/abs/1905.10650.
To help you understand and use these features, we have added a specific example script: `bertology.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/bertology.py>`_ which extracts information from and prunes a model pre-trained on MRPC.
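Here is a minimal sketch of how these features can be used (it assumes the ``bert-base-uncased`` weights can be downloaded, and that the ``output_hidden_states``/``output_attentions`` flags and the ``prune_heads`` method behave as described above; importance scores are left to you):

.. code-block:: python

    import torch
    from pytorch_transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    # Ask the model to also return all hidden states and all attention weights
    model = BertModel.from_pretrained('bert-base-uncased',
                                      output_hidden_states=True,
                                      output_attentions=True)
    model.eval()

    input_ids = torch.tensor([tokenizer.encode("Hello, my dog is cute")])
    with torch.no_grad():
        last_hidden_state, pooled_output, hidden_states, attentions = model(input_ids)

    # hidden_states: embeddings plus one tensor per layer; attentions: one tensor per layer
    print(len(hidden_states), len(attentions))

    # Prune heads 0 and 2 of the first layer
    model.prune_heads({0: [0, 2]})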
@@ -42,8 +42,8 @@ extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.coverage',
    'sphinx.ext.napoleon',
    'recommonmark',
    'sphinx.ext.viewcode'
]

# Add any paths that contain templates here, relative to this directory.
...
Converting Tensorflow Models
================================================

A command-line interface is provided to convert a TensorFlow checkpoint into a PyTorch dump of the ``BertForPreTraining`` class (for BERT) or a NumPy checkpoint into a PyTorch dump of the ``OpenAIGPTModel`` class (for OpenAI GPT).

@@ -6,9 +6,9 @@ A command-line interface is provided to convert a TensorFlow checkpoint in a PyT

BERT
^^^^

You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) into a PyTorch save file by using the `convert_tf_checkpoint_to_pytorch.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py>`_ script.

This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated configuration file (\ ``bert_config.json``\ ), creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint into the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_extract_features.py>`_\ , `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\ ).

You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\ ``bert_config.json``\ ) and the vocabulary file (\ ``vocab.txt``\ ) as these are needed for the PyTorch model too.
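For illustration, here is a minimal sketch of loading such a dump back into the corresponding PyTorch class (the file names below are placeholders for the converted checkpoint and the configuration file you kept):

.. code-block:: python

    import torch
    from pytorch_transformers import BertConfig, BertForPreTraining

    # Placeholder paths for the files produced by / kept after the conversion
    config = BertConfig.from_json_file('bert_config.json')
    model = BertForPreTraining(config)
    model.load_state_dict(torch.load('pytorch_model.bin', map_location='cpu'))
    model.eval()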
@@ -20,7 +20,7 @@ Here is an example of the conversion process for a pre-trained ``BERT-Base Uncas

    export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12

    pytorch_transformers bert \
      $BERT_BASE_DIR/bert_model.ckpt \
      $BERT_BASE_DIR/bert_config.json \
      $BERT_BASE_DIR/pytorch_model.bin
@@ -36,7 +36,7 @@ Here is an example of the conversion process for a pre-trained OpenAI GPT model,

    export OPENAI_GPT_CHECKPOINT_FOLDER_PATH=/path/to/openai/pretrained/numpy/weights

    pytorch_transformers gpt \
      $OPENAI_GPT_CHECKPOINT_FOLDER_PATH \
      $PYTORCH_DUMP_OUTPUT \
      [OPENAI_GPT_CONFIG]

@@ -50,7 +50,7 @@ Here is an example of the conversion process for a pre-trained Transformer-XL mo

    export TRANSFO_XL_CHECKPOINT_FOLDER_PATH=/path/to/transfo/xl/checkpoint

    pytorch_transformers transfo_xl \
      $TRANSFO_XL_CHECKPOINT_FOLDER_PATH \
      $PYTORCH_DUMP_OUTPUT \
      [TRANSFO_XL_CONFIG]

@@ -64,7 +64,7 @@ Here is an example of the conversion process for a pre-trained OpenAI's GPT-2 mo

    export GPT2_DIR=/path/to/gpt2/checkpoint

    pytorch_transformers gpt2 \
      $GPT2_DIR/model.ckpt \
      $PYTORCH_DUMP_OUTPUT \
      [GPT2_CONFIG]

@@ -79,7 +79,7 @@ Here is an example of the conversion process for a pre-trained XLNet model, fine

    export TRANSFO_XL_CHECKPOINT_PATH=/path/to/xlnet/checkpoint
    export TRANSFO_XL_CONFIG_PATH=/path/to/xlnet/config

    pytorch_transformers xlnet \
      $TRANSFO_XL_CHECKPOINT_PATH \
      $TRANSFO_XL_CONFIG_PATH \
      $PYTORCH_DUMP_OUTPUT \
...
@@ -6,22 +6,24 @@ Examples

   * - Sub-section
     - Description
   * - `Training large models: introduction, tools and examples <#introduction>`_
     - How to use gradient-accumulation, multi-gpu training, distributed training, optimize on CPU and 16-bits training to train Bert models
   * - `Fine-tuning with BERT: running the examples <#fine-tuning-bert-examples>`_
     - Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``extract_classif.py``\ , ``run_bert_classifier.py``\ , ``run_bert_squad.py`` and ``run_lm_finetuning.py``
   * - `Fine-tuning with OpenAI GPT, Transformer-XL and GPT-2 <#fine-tuning>`_
     - Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``run_openai_gpt.py``\ , ``run_transfo_xl.py`` and ``run_gpt2.py``
   * - `Fine-tuning BERT-large on GPUs <#fine-tuning-bert-large>`_
     - How to fine-tune ``BERT large``

.. _introduction:

Training large models: introduction, tools and examples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

BERT-base and BERT-large have respectively 110M and 340M parameters, and it can be difficult to fine-tune them on a single GPU with the batch size recommended for good performance (in most cases a batch size of 32).

To help with fine-tuning these models, we have included several techniques that you can activate in the fine-tuning scripts `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_classifier.py>`_ and `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/run_bert_squad.py>`_\ : gradient-accumulation, multi-gpu training, distributed training and 16-bits training. For more details on how to use these techniques you can read `the tips on training large batches in PyTorch <https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255>`_ that I published earlier this year.

Here is how to use these techniques in our scripts:

@@ -29,11 +31,11 @@ Here is how to use these techniques in our scripts:

* **Gradient Accumulation**\ : Gradient accumulation can be used by supplying an integer greater than 1 to the ``--gradient_accumulation_steps`` argument. The batch at each step will be divided by this integer and gradients will be accumulated over ``gradient_accumulation_steps`` steps (a minimal sketch of this pattern is shown right after this list).
* **Multi-GPU**\ : Multi-GPU is automatically activated when several GPUs are detected, and the batches are split over the GPUs.
* **Distributed training**\ : Distributed training can be activated by supplying an integer greater than or equal to 0 to the ``--local_rank`` argument (see below).
* **16-bits training**\ : 16-bits training, also called mixed-precision training, can reduce the memory requirement of your model on the GPU by using half-precision training, basically allowing you to double the batch size. If you have a recent GPU (starting from the NVIDIA Volta architecture) you should see no decrease in speed. A good introduction to mixed-precision training can be found `here <https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/>`__ and full documentation is `here <https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html>`__. In our scripts, this option can be activated by setting the ``--fp16`` flag and you can play with loss scaling using the ``--loss_scale`` flag (see the previously linked documentation for details on loss scaling). The loss scale can be zero, in which case the scale is dynamically adjusted, or a positive power of two, in which case the scaling is static.
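Here is a minimal, self-contained sketch of what the gradient-accumulation option amounts to (the model, data and optimizer below are illustrative stand-ins, not the internals of our scripts):

.. code-block:: python

    import torch
    from torch import nn

    # Illustrative stand-ins for the real model, data and optimizer
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]

    accumulation_steps = 4  # what --gradient_accumulation_steps controls
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(data):
        loss = nn.functional.mse_loss(model(inputs), targets)
        # Scale the loss so the accumulated gradients match one full-size batch
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()       # update once every `accumulation_steps` mini-batches
            optimizer.zero_grad()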
To use 16-bits training and distributed training, you need to install NVIDIA's apex extension `as detailed here <https://github.com/nvidia/apex>`__. You will find more information regarding the internals of ``apex`` and how to use ``apex`` in `the doc and the associated repository <https://github.com/nvidia/apex>`_. The results of the tests performed on pytorch-BERT by the NVIDIA team (and my trials at reproducing them) can be consulted in `the relevant PR of the present repository <https://github.com/huggingface/pytorch-pretrained-BERT/pull/116>`_.

Note: To use *Distributed Training*\ , you will need to run one training script on each of your machines. This can be done for example by running the following command on each server (see `the above mentioned blog post <https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255>`_ for more details):

.. code-block:: bash

@@ -41,6 +43,8 @@ Note: To use *Distributed Training*\ , you will need to run one training script

Where ``$THIS_MACHINE_INDEX`` is a sequential index assigned to each of your machines (0, 1, 2...) and the machine with rank 0 has an IP address ``192.168.1.1`` and an open port ``1234``.
.. _fine-tuning-bert-examples:

Fine-tuning with BERT: running the examples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -153,10 +157,10 @@ and unpack it to some directory ``$GLUE_DIR``.

      --num_train_epochs 3.0 \
      --output_dir /tmp/mrpc_output/

Our test, run on a few seeds with `the original implementation hyper-parameters <https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks>`__, gave evaluation results between 84% and 88%.

**Fast run with apex and 16 bit precision: fine-tuning on MRPC in 27 seconds!**
First install apex as indicated `here <https://github.com/NVIDIA/apex>`__.
Then run

.. code-block:: shell
@@ -333,10 +337,12 @@ LM Fine-tuning
~~~~~~~~~~~~~~

The data should be a text file in the same format as `sample_text.txt <./samples/sample_text.txt>`_ (one sentence per line, documents separated by an empty line).
You can download an `exemplary training corpus <https://ext-bert-sample.obs.eu-de.otc.t-systems.com/small_wiki_sentence_corpus.txt>`_ generated from wikipedia articles and split into ~500k sentences with spaCy.
Training one epoch on this corpus takes about 1:20h on 4 x NVIDIA Tesla P100 with ``train_batch_size=200`` and ``max_seq_length=128``\ :

Thanks to the work of @Rocketknight1 and @tholor there are now **several scripts** that can be used to fine-tune BERT using the pretraining objective (a combination of masked language modeling and next sentence prediction loss). These scripts are detailed in the `README <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/lm_finetuning/README.md>`_ of the `examples/lm_finetuning/ <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/lm_finetuning/>`_ folder.
.. _fine-tuning:

OpenAI GPT, Transformer-XL and GPT-2: running the examples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -402,6 +408,8 @@ Unconditional generation:

The same options as in the original scripts are provided; please refer to the code of the example and the original repository of OpenAI.

.. _fine-tuning-BERT-large:

Fine-tuning BERT-large on GPUs
------------------------------
@@ -520,7 +528,7 @@ and unpack it to some directory ``$GLUE_DIR``.

      --num_train_epochs 3.0 \
      --output_dir /tmp/mrpc_output/

Our test, run on a few seeds with `the original implementation hyper-parameters <https://github.com/zihangdai/xlnet#1-sts-b-sentence-pair-relevance-regression-with-gpus>`__, gave evaluation results between 84% and 88%.

**Distributed training**

Here is an example using distributed training on 8 V100 GPUs to reach XXXX:

@@ -571,23 +579,4 @@ Here is an example on MNLI:

      global_step = 18408
      loss = 0.04755385363816904

This is the example of the ``bert-large-uncased-whole-word-masking-finetuned-mnli`` model.
@@ -11,8 +11,7 @@ Pytorch-Transformers

    usage
    examples
    notebooks
    converting_tensorflow_models
    migration
    bertology
    torchscript
@@ -39,26 +38,33 @@ Pytorch-Transformers

This repository contains op-for-op PyTorch reimplementations, pre-trained models and fine-tuning examples for:

* `Google's BERT model <https://github.com/google-research/bert>`__\ ,
* `OpenAI's GPT model <https://github.com/openai/finetune-transformer-lm>`__\ ,
* `Google/CMU's Transformer-XL model <https://github.com/kimiyoung/transformer-xl>`__\ , and
* `OpenAI's GPT-2 model <https://blog.openai.com/better-language-models/>`__.

These implementations have been tested on several datasets (see the examples) and should match the performance of the associated TensorFlow implementations (e.g. ~91 F1 on SQuAD for BERT, ~88 F1 on RocStories for OpenAI GPT and ~18.3 perplexity on WikiText 103 for the Transformer-XL). You can find more details in the `Examples <./examples.html>`__ section.

Here is some information about these models:

**BERT** was released together with the paper `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`__ by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
This PyTorch implementation of BERT is provided with `Google's pre-trained models <https://github.com/google-research/bert>`__\ , examples, notebooks and a command-line interface to load any pre-trained TensorFlow checkpoint for BERT.

**OpenAI GPT** was released together with the paper `Improving Language Understanding by Generative Pre-Training <https://blog.openai.com/language-unsupervised/>`__ by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
This PyTorch implementation of OpenAI GPT is an adaptation of the `PyTorch implementation by HuggingFace <https://github.com/huggingface/pytorch-openai-transformer-lm>`__ and is provided with `OpenAI's pre-trained model <https://github.com/openai/finetune-transformer-lm>`__ and a command-line interface that was used to convert the pre-trained NumPy checkpoint to PyTorch.

**Google/CMU's Transformer-XL** was released together with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <http://arxiv.org/abs/1901.02860>`__ by Zihang Dai\*, Zhilin Yang\*, Yiming Yang, Jaime Carbonell, Quoc V. Le and Ruslan Salakhutdinov.
This PyTorch implementation of Transformer-XL is an adaptation of the original `PyTorch implementation <https://github.com/kimiyoung/transformer-xl>`__ which has been slightly modified to match the performance of the TensorFlow implementation and allow re-using the pretrained weights. A command-line interface is provided to convert TensorFlow checkpoints to PyTorch models.

**OpenAI GPT-2** was released together with the paper `Language Models are Unsupervised Multitask Learners <https://blog.openai.com/better-language-models/>`__ by Alec Radford\*, Jeffrey Wu\*, Rewon Child, David Luan, Dario Amodei\*\* and Ilya Sutskever\*\*.
This PyTorch implementation of OpenAI GPT-2 is an adaptation of `OpenAI's implementation <https://github.com/openai/gpt-2>`__ and is provided with `OpenAI's pre-trained model <https://github.com/openai/gpt-2>`__ and a command-line interface that was used to convert the TensorFlow checkpoint to PyTorch.

**Facebook Research's XLM** was released together with the paper `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
This PyTorch implementation of XLM is an adaptation of the original `PyTorch implementation <https://github.com/facebookresearch/XLM>`__. TODO Lysandre filled

**Google's XLNet** was released together with the paper `XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang\*, Zihang Dai\*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le.
This PyTorch implementation of XLNet is an adaptation of the `TensorFlow implementation <https://github.com/zihangdai/xlnet>`__. TODO Lysandre filled
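Here is a minimal sketch of the common usage pattern, using the BERT classes as an example (the other models follow the same ``from_pretrained`` interface; the checkpoint is downloaded and cached on first use):

.. code-block:: python

    import torch
    from pytorch_transformers import BertTokenizer, BertModel

    # Download and cache the pre-trained weights and vocabulary
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')
    model.eval()

    input_ids = torch.tensor([tokenizer.encode("Hello, my dog is cute")])
    with torch.no_grad():
        last_hidden_state, pooled_output = model(input_ids)

    print(last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)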
Content
-------

@@ -68,157 +74,226 @@ Content

   * - Section
     - Description
   * - `Installation <./installation.html>`__
     - How to install the package
   * - `Philosophy <./philosophy.html>`__
     - The philosophy behind this package
   * - `Usage <./usage.html>`__
     - Quickstart examples
   * - `Examples <./examples.html>`__
     - Detailed examples on how to fine-tune Bert
   * - `Notebooks <./notebooks.html>`__
     - Introduction to the provided Jupyter Notebooks
   * - `TPU <./tpu.html>`__
     - Notes on TPU support and pretraining scripts
   * - `Command-line interface <./cli.html>`__
     - Convert a TensorFlow checkpoint into a PyTorch dump
   * - `Migration <./migration.html>`__
     - Migrating from ``pytorch_pretrained_BERT`` (v0.6) to ``pytorch_transformers`` (v1.0)
   * - `Bertology <./bertology.html>`__
     - TODO Lysandre didn't know how to fill
   * - `TorchScript <./torchscript.html>`__
     - Convert a model to TorchScript for use in other programming languages

.. list-table::
   :header-rows: 1

   * - Section
     - Description
   * - `Overview <./model_doc/overview.html>`__
     - Overview of the package
   * - `BERT <./model_doc/bert.html>`__
     - BERT Models, Tokenizers and optimizers
   * - `OpenAI GPT <./model_doc/gpt.html>`__
     - GPT Models, Tokenizers and optimizers
   * - `TransformerXL <./model_doc/transformerxl.html>`__
     - TransformerXL Models, Tokenizers and optimizers
   * - `OpenAI GPT2 <./model_doc/gpt2.html>`__
     - GPT2 Models, Tokenizers and optimizers
   * - `XLM <./model_doc/xlm.html>`__
     - XLM Models, Tokenizers and optimizers
   * - `XLNet <./model_doc/xlnet.html>`__
     - XLNet Models, Tokenizers and optimizers

TODO Lysandre filled: might need an introduction for both parts. Is it even necessary, since there is a summary? Up to you Thom.
Overview
--------

This package comprises the following classes that can be imported in Python and are detailed in the `documentation <./model_doc/overview.html>`__ section of this package:

*
  Eight **Bert** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_bert.py <./_modules/pytorch_transformers/modeling_bert.html>`__ file):

  * `BertModel <./model_doc/bert.html#pytorch_transformers.BertModel>`__ - raw BERT Transformer model (\ **fully pre-trained**\ ),
  * `BertForMaskedLM <./model_doc/bert.html#pytorch_transformers.BertForMaskedLM>`__ - BERT Transformer with the pre-trained masked language modeling head on top (\ **fully pre-trained**\ ),
  * `BertForNextSentencePrediction <./model_doc/bert.html#pytorch_transformers.BertForNextSentencePrediction>`__ - BERT Transformer with the pre-trained next sentence prediction classifier on top (\ **fully pre-trained**\ ),
  * `BertForPreTraining <./model_doc/bert.html#pytorch_transformers.BertForPreTraining>`__ - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (\ **fully pre-trained**\ ),
  * `BertForSequenceClassification <./model_doc/bert.html#pytorch_transformers.BertForSequenceClassification>`__ - BERT Transformer with a sequence classification head on top (BERT Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ),
  * `BertForMultipleChoice <./model_doc/bert.html#pytorch_transformers.BertForMultipleChoice>`__ - BERT Transformer with a multiple choice head on top (used for tasks like Swag) (BERT Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
  * `BertForTokenClassification <./model_doc/bert.html#pytorch_transformers.BertForTokenClassification>`__ - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ ),
  * `BertForQuestionAnswering <./model_doc/bert.html#pytorch_transformers.BertForQuestionAnswering>`__ - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ ).
*
  Three **OpenAI GPT** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_openai.py <./_modules/pytorch_transformers/modeling_openai.html>`__ file):

  * `OpenAIGPTModel <./model_doc/gpt.html#pytorch_transformers.OpenAIGPTModel>`__ - raw OpenAI GPT Transformer model (\ **fully pre-trained**\ ),
  * `OpenAIGPTLMHeadModel <./model_doc/gpt.html#pytorch_transformers.OpenAIGPTLMHeadModel>`__ - OpenAI GPT Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
  * `OpenAIGPTDoubleHeadsModel <./model_doc/gpt.html#pytorch_transformers.OpenAIGPTDoubleHeadsModel>`__ - OpenAI GPT Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),

*
  Two **Transformer-XL** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_transfo_xl.py <./_modules/pytorch_transformers/modeling_transfo_xl.html>`__ file):

  * `TransfoXLModel <./model_doc/transformerxl.html#pytorch_transformers.TransfoXLModel>`__ - Transformer-XL model which outputs the last hidden state and memory cells (\ **fully pre-trained**\ ),
  * `TransfoXLLMHeadModel <./model_doc/transformerxl.html#pytorch_transformers.TransfoXLLMHeadModel>`__ - Transformer-XL with the tied adaptive softmax head on top for language modeling which outputs the logits/loss and memory cells (\ **fully pre-trained**\ ),

*
  Three **OpenAI GPT-2** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_gpt2.py <./_modules/pytorch_transformers/modeling_gpt2.html>`__ file):

  * `GPT2Model <./model_doc/gpt2.html#pytorch_transformers.GPT2Model>`__ - raw OpenAI GPT-2 Transformer model (\ **fully pre-trained**\ ),
  * `GPT2LMHeadModel <./model_doc/gpt2.html#pytorch_transformers.GPT2LMHeadModel>`__ - OpenAI GPT-2 Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
  * `GPT2DoubleHeadsModel <./model_doc/gpt2.html#pytorch_transformers.GPT2DoubleHeadsModel>`__ - OpenAI GPT-2 Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT-2 Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
*
  Four **XLM** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_xlm.py <./_modules/pytorch_transformers/modeling_xlm.html>`__ file):

  * `XLMModel <./model_doc/xlm.html#pytorch_transformers.XLMModel>`__ - raw XLM Transformer model (\ **fully pre-trained**\ ),
  * `XLMWithLMHeadModel <./model_doc/xlm.html#pytorch_transformers.XLMWithLMHeadModel>`__ - XLM Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
  * `XLMForSequenceClassification <./model_doc/xlm.html#pytorch_transformers.XLMForSequenceClassification>`__ - XLM Transformer with a sequence classification head on top (XLM Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ),
  * `XLMForQuestionAnswering <./model_doc/xlm.html#pytorch_transformers.XLMForQuestionAnswering>`__ - XLM Transformer with a token classification head on top (XLM Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ )

*
  Four **XLNet** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_xlnet.py <./_modules/pytorch_transformers/modeling_xlnet.html>`__ file):

  * `XLNetModel <./model_doc/xlnet.html#pytorch_transformers.XLNetModel>`__ - raw XLNet Transformer model (\ **fully pre-trained**\ ),
  * `XLNetLMHeadModel <./model_doc/xlnet.html#pytorch_transformers.XLNetLMHeadModel>`__ - XLNet Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
  * `XLNetForSequenceClassification <./model_doc/xlnet.html#pytorch_transformers.XLNetForSequenceClassification>`__ - XLNet Transformer with a sequence classification head on top (XLNet Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ),
  * `XLNetForQuestionAnswering <./model_doc/xlnet.html#pytorch_transformers.XLNetForQuestionAnswering>`__ - XLNet Transformer with a token classification head on top (XLNet Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ )

TODO Lysandre filled: I filled in XLM and XLNet. I didn't do the Tokenizers because I don't know the current philosophy behind them.
*
  Tokenizers for **BERT** (using word-piece) (in the `tokenization_bert.py <./_modules/pytorch_transformers/tokenization_bert.html>`__ file):

  * ``BasicTokenizer`` - basic tokenization (punctuation splitting, lower casing, etc.),
  * ``WordpieceTokenizer`` - WordPiece tokenization,
  * ``BertTokenizer`` - perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.

*
  Tokenizer for **OpenAI GPT** (using Byte-Pair-Encoding) (in the `tokenization_openai.py <./_modules/pytorch_transformers/tokenization_openai.html>`__ file):

  * ``OpenAIGPTTokenizer`` - perform Byte-Pair-Encoding (BPE) tokenization.

*
  Tokenizer for **OpenAI GPT-2** (using byte-level Byte-Pair-Encoding) (in the `tokenization_gpt2.py <./_modules/pytorch_transformers/tokenization_gpt2.html>`__ file):

  * ``GPT2Tokenizer`` - perform byte-level Byte-Pair-Encoding (BPE) tokenization.

*
  Tokenizer for **Transformer-XL** (word tokens ordered by frequency for adaptive softmax) (in the `tokenization_transfo_xl.py <./_modules/pytorch_transformers/tokenization_transfo_xl.html>`__ file):

  * ``TransfoXLTokenizer`` - perform word tokenization and can order words by frequency in a corpus for use in an adaptive softmax.
*
  Tokenizer for **XLNet** (SentencePiece based tokenizer) (in the `tokenization_xlnet.py <./_modules/pytorch_transformers/tokenization_xlnet.html>`__ file):

  * ``XLNetTokenizer`` - perform SentencePiece tokenization.

*
  Tokenizer for **XLM** (using Byte-Pair-Encoding) (in the `tokenization_xlm.py <./_modules/pytorch_transformers/tokenization_xlm.html>`__ file):

  * ``XLMTokenizer`` - perform Byte-Pair-Encoding (BPE) tokenization.

*
  Optimizer for **BERT** (in the `optimization.py <./_modules/pytorch_transformers/optimization.html>`__ file):

  * ``BertAdam`` - Bert version of the Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.

*
  Optimizer for **OpenAI GPT** (in the `optimization_openai.py <./_modules/pytorch_transformers/optimization_openai.html>`__ file):

  * ``OpenAIAdam`` - OpenAI GPT version of the Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
* *
Configuration classes for BERT, OpenAI GPT and Transformer-XL (in the respective `modeling.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py>`_\ , `modeling_openai.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_openai.py>`_\ , `modeling_transfo_xl.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_transfo_xl.py>`_ files): Configuration classes for BERT, OpenAI GPT, Transformer-XL, XLM and XLNet (in the respective \
`modeling_bert.py <./_modules/pytorch_transformers/modeling_bert.html>`__\ , \
`modeling_openai.py <./_modules/pytorch_transformers/modeling_openai.html>`__\ , \
`modeling_transfo_xl.py <./_modules/pytorch_transformers/modeling_transfo_xl.html>`__, \
`modeling_xlm.py <./_modules/pytorch_transformers/modeling_xlm.html>`__, \
`modeling_xlnet.py <./_modules/pytorch_transformers/modeling_xlnet.html>`__ \
files):
* ``BertConfig`` - Configuration class to store the configuration of a ``BertModel`` with utilities to read and write from JSON configuration files. * ``BertConfig`` - Configuration class to store the configuration of a ``BertModel`` with utilities to read and write from JSON configuration files.
* ``OpenAIGPTConfig`` - Configuration class to store the configuration of a ``OpenAIGPTModel`` with utilities to read and write from JSON configuration files. * ``OpenAIGPTConfig`` - Configuration class to store the configuration of a ``OpenAIGPTModel`` with utilities to read and write from JSON configuration files.
* ``GPT2Config`` - Configuration class to store the configuration of a ``GPT2Model`` with utilities to read and write from JSON configuration files. * ``GPT2Config`` - Configuration class to store the configuration of a ``GPT2Model`` with utilities to read and write from JSON configuration files.
* ``TransfoXLConfig`` - Configuration class to store the configuration of a ``TransfoXLModel`` with utilities to read and write from JSON configuration files. * ``TransfoXLConfig`` - Configuration class to store the configuration of a ``TransfoXLModel`` with utilities to read and write from JSON configuration files.
* ``XLMConfig`` - Configuration class to store the configuration of an ``XLMModel`` with utilities to read and write from JSON configuration files.
* ``XLNetConfig`` - Configuration class to store the configuration of an ``XLNetModel`` with utilities to read and write from JSON configuration files.
The repository further comprises: The repository further comprises:
* *
Five examples on how to use **BERT** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`_\ ): Five examples on how to use **BERT** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`__\ ):
* `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_extract_features.py>`_ - Show how to extract hidden states from an instance of ``BertModel``\ , * `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_extract_features.py>`__ - Show how to extract hidden states from an instance of ``BertModel``\ ,
* `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_classifier.py>`_ - Show how to fine-tune an instance of ``BertForSequenceClassification`` on GLUE's MRPC task, * `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_classifier.py>`__ - Show how to fine-tune an instance of ``BertForSequenceClassification`` on GLUE's MRPC task,
* `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_squad.py>`_ - Show how to fine-tune an instance of ``BertForQuestionAnswering`` on SQuAD v1.0 and SQuAD v2.0 tasks. * `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_squad.py>`__ - Show how to fine-tune an instance of ``BertForQuestionAnswering`` on SQuAD v1.0 and SQuAD v2.0 tasks.
* `run_swag.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_swag.py>`_ - Show how to fine-tune an instance of ``BertForMultipleChoice`` on the SWAG task. * `run_swag.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_swag.py>`__ - Show how to fine-tune an instance of ``BertForMultipleChoice`` on the SWAG task.
* `simple_lm_finetuning.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/lm_finetuning/simple_lm_finetuning.py>`_ - Show how to fine-tune an instance of ``BertForPreTraining`` on a target text corpus. * `simple_lm_finetuning.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/lm_finetuning/simple_lm_finetuning.py>`__ - Show how to fine-tune an instance of ``BertForPreTraining`` on a target text corpus.
* *
One example on how to use **OpenAI GPT** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`_\ ): One example on how to use **OpenAI GPT** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`__\ ):
* `run_openai_gpt.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_openai_gpt.py>`_ - Show how to fine-tune an instance of ``OpenAIGPTDoubleHeadsModel`` on the RocStories task. * `run_openai_gpt.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_openai_gpt.py>`__ - Show how to fine-tune an instance of ``OpenAIGPTDoubleHeadsModel`` on the RocStories task.
* *
One example on how to use **Transformer-XL** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`_\ ): One example on how to use **Transformer-XL** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`__\ ):
* `run_transfo_xl.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_transfo_xl.py>`_ - Show how to load and evaluate a pre-trained model of ``TransfoXLLMHeadModel`` on WikiText 103. * `run_transfo_xl.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_transfo_xl.py>`__ - Show how to load and evaluate a pre-trained model of ``TransfoXLLMHeadModel`` on WikiText 103.
* *
One example on how to use **OpenAI GPT-2** in the unconditional and interactive mode (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`_\ ): One example on how to use **OpenAI GPT-2** in the unconditional and interactive mode (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`__\ ):
* `run_gpt2.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_gpt2.py>`_ - Show how to use OpenAI GPT-2 (an instance of ``GPT2LMHeadModel``) to generate text (same as the original OpenAI GPT-2 examples). * `run_gpt2.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_gpt2.py>`__ - Show how to use OpenAI GPT-2 (an instance of ``GPT2LMHeadModel``) to generate text (same as the original OpenAI GPT-2 examples).
These examples are detailed in the `Examples <#examples>`_ section of this readme. These examples are detailed in the `Examples <#examples>`__ section of this readme.
* *
Three notebooks that were used to check that the TensorFlow and PyTorch models behave identically (in the `notebooks folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks>`_\ ): Three notebooks that were used to check that the TensorFlow and PyTorch models behave identically (in the `notebooks folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks>`__\ ):
* `Comparing-TF-and-PT-models.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models.ipynb>`_ - Compare the hidden states predicted by ``BertModel``\ , * `Comparing-TF-and-PT-models.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models.ipynb>`__ - Compare the hidden states predicted by ``BertModel``\ ,
* `Comparing-TF-and-PT-models-SQuAD.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb>`_ - Compare the spans predicted by ``BertForQuestionAnswering`` instances, * `Comparing-TF-and-PT-models-SQuAD.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb>`__ - Compare the spans predicted by ``BertForQuestionAnswering`` instances,
* `Comparing-TF-and-PT-models-MLM-NSP.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb>`_ - Compare the predictions of the ``BertForPretraining`` instances. * `Comparing-TF-and-PT-models-MLM-NSP.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb>`__ - Compare the predictions of the ``BertForPretraining`` instances.
These notebooks are detailed in the `Notebooks <#notebooks>`_ section of this readme. These notebooks are detailed in the `Notebooks <#notebooks>`__ section of this readme.
* *
A command-line interface to convert TensorFlow checkpoints (BERT, Transformer-XL) or a NumPy checkpoint (OpenAI) into a PyTorch save of the associated PyTorch model: A command-line interface to convert TensorFlow checkpoints (BERT, Transformer-XL) or a NumPy checkpoint (OpenAI) into a PyTorch save of the associated PyTorch model:
This CLI is detailed in the `Command-line interface <#Command-line-interface>`_ section of this readme. This CLI is detailed in the `Command-line interface <#Command-line-interface>`__ section of this readme.
...@@ -4,75 +4,75 @@ BERT ...@@ -4,75 +4,75 @@ BERT
``BertConfig`` ``BertConfig``
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertConfig .. autoclass:: pytorch_transformers.BertConfig
:members: :members:
``BertTokenizer`` ``BertTokenizer``
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertTokenizer .. autoclass:: pytorch_transformers.BertTokenizer
:members: :members:
``BertAdam`` ``BertAdam``
~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertAdam .. autoclass:: pytorch_transformers.BertAdam
:members: :members:
1. ``BertModel`` ``BertModel``
~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertModel .. autoclass:: pytorch_transformers.BertModel
:members: :members:
2. ``BertForPreTraining`` ``BertForPreTraining``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertForPreTraining .. autoclass:: pytorch_transformers.BertForPreTraining
:members: :members:
3. ``BertForMaskedLM`` ``BertForMaskedLM``
~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertForMaskedLM .. autoclass:: pytorch_transformers.BertForMaskedLM
:members: :members:
4. ``BertForNextSentencePrediction`` ``BertForNextSentencePrediction``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertForNextSentencePrediction .. autoclass:: pytorch_transformers.BertForNextSentencePrediction
:members: :members:
5. ``BertForSequenceClassification`` ``BertForSequenceClassification``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertForSequenceClassification .. autoclass:: pytorch_transformers.BertForSequenceClassification
:members: :members:
6. ``BertForMultipleChoice`` ``BertForMultipleChoice``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertForMultipleChoice .. autoclass:: pytorch_transformers.BertForMultipleChoice
:members: :members:
7. ``BertForTokenClassification`` ``BertForTokenClassification``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertForTokenClassification .. autoclass:: pytorch_transformers.BertForTokenClassification
:members: :members:
8. ``BertForQuestionAnswering`` ``BertForQuestionAnswering``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertForQuestionAnswering .. autoclass:: pytorch_transformers.BertForQuestionAnswering
:members: :members:
...@@ -4,40 +4,40 @@ OpenAI GPT ...@@ -4,40 +4,40 @@ OpenAI GPT
``OpenAIGPTConfig`` ``OpenAIGPTConfig``
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.OpenAIGPTConfig .. autoclass:: pytorch_transformers.OpenAIGPTConfig
:members: :members:
``OpenAIGPTTokenizer`` ``OpenAIGPTTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.OpenAIGPTTokenizer .. autoclass:: pytorch_transformers.OpenAIGPTTokenizer
:members: :members:
``OpenAIAdam`` ``OpenAIAdam``
~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.OpenAIAdam .. autoclass:: pytorch_transformers.OpenAIAdam
:members: :members:
9. ``OpenAIGPTModel`` ``OpenAIGPTModel``
~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.OpenAIGPTModel .. autoclass:: pytorch_transformers.OpenAIGPTModel
:members: :members:
10. ``OpenAIGPTLMHeadModel`` ``OpenAIGPTLMHeadModel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.OpenAIGPTLMHeadModel .. autoclass:: pytorch_transformers.OpenAIGPTLMHeadModel
:members: :members:
11. ``OpenAIGPTDoubleHeadsModel`` ``OpenAIGPTDoubleHeadsModel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.OpenAIGPTDoubleHeadsModel .. autoclass:: pytorch_transformers.OpenAIGPTDoubleHeadsModel
:members: :members:
...@@ -4,33 +4,33 @@ OpenAI GPT2 ...@@ -4,33 +4,33 @@ OpenAI GPT2
``GPT2Config`` ``GPT2Config``
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.GPT2Config .. autoclass:: pytorch_transformers.GPT2Config
:members: :members:
``GPT2Tokenizer`` ``GPT2Tokenizer``
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.GPT2Tokenizer .. autoclass:: pytorch_transformers.GPT2Tokenizer
:members: :members:
14. ``GPT2Model`` ``GPT2Model``
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.GPT2Model .. autoclass:: pytorch_transformers.GPT2Model
:members: :members:
15. ``GPT2LMHeadModel`` ``GPT2LMHeadModel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.GPT2LMHeadModel .. autoclass:: pytorch_transformers.GPT2LMHeadModel
:members: :members:
16. ``GPT2DoubleHeadsModel`` ``GPT2DoubleHeadsModel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.GPT2DoubleHeadsModel .. autoclass:: pytorch_transformers.GPT2DoubleHeadsModel
:members: :members:
...@@ -9,18 +9,15 @@ Here is a detailed documentation of the classes in the package and how to use th ...@@ -9,18 +9,15 @@ Here is a detailed documentation of the classes in the package and how to use th
* - Sub-section * - Sub-section
- Description - Description
* - `Loading pre-trained weights <#loading-google-ai-or-openai-pre-trained-weights-or-pytorch-dump>`_ * - `Loading pre-trained weights <#loading-google-ai-or-openai-pre-trained-weights-or-pytorch-dump>`__
- How to load Google AI/OpenAI's pre-trained weight or a PyTorch saved instance - How to load Google AI/OpenAI's pre-trained weight or a PyTorch saved instance
* - `Serialization best-practices <#serialization-best-practices>`_ * - `Serialization best-practices <#serialization-best-practices>`__
- How to save and reload a fine-tuned model - How to save and reload a fine-tuned model
* - `Configurations <#configurations>`_ * - `Configurations <#configurations>`__
- API of the configuration classes for BERT, GPT, GPT-2 and Transformer-XL - API of the configuration classes for BERT, GPT, GPT-2 and Transformer-XL
* - `Models <#models>`_
- API of the PyTorch model classes for BERT, GPT, GPT-2 and Transformer-XL
* - `Tokenizers <#tokenizers>`_ TODO Lysandre filled: Removed Models/Tokenizers/Optimizers as no single link can be made.
- API of the tokenizers class for BERT, GPT, GPT-2 and Transformer-XL
* - `Optimizers <#optimizers>`_
- API of the optimizers
Configurations Configurations
...@@ -77,7 +74,7 @@ where ...@@ -77,7 +74,7 @@ where
* ``bert-base-multilingual-uncased``: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters * ``bert-base-multilingual-uncased``: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* ``bert-base-multilingual-cased``: **(New, recommended)** 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters * ``bert-base-multilingual-cased``: **(New, recommended)** 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* ``bert-base-chinese``: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters * ``bert-base-chinese``: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
* ``bert-base-german-cased``: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters `Performance Evaluation <https://deepset.ai/german-bert>`_ * ``bert-base-german-cased``: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters `Performance Evaluation <https://deepset.ai/german-bert>`__
* ``bert-large-uncased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the tokens corresponding to a word at once) * ``bert-large-uncased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the tokens corresponding to a word at once)
* ``bert-large-cased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the tokens corresponding to a word at once) * ``bert-large-cased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the tokens corresponding to a word at once)
* ``bert-large-uncased-whole-word-masking-finetuned-squad``: The ``bert-large-uncased-whole-word-masking`` model finetuned on SQuAD (using the ``run_bert_squad.py`` examples). Results: *exact_match: 86.91579943235573, f1: 93.1532499015869* * ``bert-large-uncased-whole-word-masking-finetuned-squad``: The ``bert-large-uncased-whole-word-masking`` model finetuned on SQuAD (using the ``run_bert_squad.py`` examples). Results: *exact_match: 86.91579943235573, f1: 93.1532499015869*
...@@ -93,7 +90,7 @@ where ...@@ -93,7 +90,7 @@ where
* ``bert_config.json`` or ``openai_gpt_config.json`` a configuration file for the model, and * ``bert_config.json`` or ``openai_gpt_config.json`` a configuration file for the model, and
* ``pytorch_model.bin`` a PyTorch dump of a pre-trained instance of ``BertForPreTraining``\ , ``OpenAIGPTModel``\ , ``TransfoXLModel``\ , ``GPT2LMHeadModel`` (saved with the usual ``torch.save()``\ ) * ``pytorch_model.bin`` a PyTorch dump of a pre-trained instance of ``BertForPreTraining``\ , ``OpenAIGPTModel``\ , ``TransfoXLModel``\ , ``GPT2LMHeadModel`` (saved with the usual ``torch.save()``\ )
If ``PRE_TRAINED_MODEL_NAME_OR_PATH`` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links `here <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/pytorch_pretrained_bert/modeling.py>`_\ ) and stored in a cache folder to avoid future download (the cache folder can be found at ``~/.pytorch_pretrained_bert/``\ ). If ``PRE_TRAINED_MODEL_NAME_OR_PATH`` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links `here <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/pytorch_pretrained_bert/modeling.py>`__\ ) and stored in a cache folder to avoid future download (the cache folder can be found at ``~/.pytorch_pretrained_bert/``\ ).
* *
``cache_dir`` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example ``cache_dir='./pretrained_model_{}'.format(args.local_rank)`` (see the section on distributed training for more information). ``cache_dir`` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example ``cache_dir='./pretrained_model_{}'.format(args.local_rank)`` (see the section on distributed training for more information).
...@@ -102,7 +99,7 @@ where ...@@ -102,7 +99,7 @@ where
* ``state_dict``\ : an optional state dictionary (collections.OrderedDict object) to use instead of Google pre-trained models * ``state_dict``\ : an optional state dictionary (collections.OrderedDict object) to use instead of Google pre-trained models
* ``*inputs``\ , `**kwargs`: additional input for the specific Bert class (ex: num_labels for BertForSequenceClassification) * ``*inputs``\ , `**kwargs`: additional input for the specific Bert class (ex: num_labels for BertForSequenceClassification)
``Uncased`` means that the text has been lowercased before WordPiece tokenization, e.g., ``John Smith`` becomes ``john smith``. The Uncased model also strips out any accent markers. ``Cased`` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the `Multilingual README <https://github.com/google-research/bert/blob/master/multilingual.md>`_ or the original TensorFlow repository. ``Uncased`` means that the text has been lowercased before WordPiece tokenization, e.g., ``John Smith`` becomes ``john smith``. The Uncased model also strips out any accent markers. ``Cased`` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the `Multilingual README <https://github.com/google-research/bert/blob/master/multilingual.md>`__ or the original TensorFlow repository.
When using an uncased model, make sure to pass ``--do_lower_case`` to the example training scripts (or pass ``do_lower_case=True`` to ``BertTokenizer`` if you're using your own script and loading the tokenizer yourself). When using an uncased model, make sure to pass ``--do_lower_case`` to the example training scripts (or pass ``do_lower_case=True`` to ``BertTokenizer`` if you're using your own script and loading the tokenizer yourself).
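A minimal sketch of the loading options described above (the shortcut name and the ``cache_dir`` value are only illustrative, not prescribed by the library):

.. code-block:: python

    from pytorch_transformers import BertModel, BertTokenizer

    # Weights for a shortcut name are downloaded from S3 on first use and cached;
    # cache_dir is optional (useful e.g. to give each distributed worker its own cache).
    model = BertModel.from_pretrained('bert-base-uncased',
                                      cache_dir='./pretrained_model_0')

    # For an uncased checkpoint, lowercase the inputs at tokenization time.
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)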
...@@ -152,7 +149,7 @@ This section explain how you can save and re-load a fine-tuned model (BERT, GPT, ...@@ -152,7 +149,7 @@ This section explain how you can save and re-load a fine-tuned model (BERT, GPT,
There are three types of files you need to save to be able to reload a fine-tuned model: There are three types of files you need to save to be able to reload a fine-tuned model:
* the model itself, which should be saved following PyTorch serialization `best practices <https://pytorch.org/docs/stable/notes/serialization.html#best-practices>`_\ , * the model itself, which should be saved following PyTorch serialization `best practices <https://pytorch.org/docs/stable/notes/serialization.html#best-practices>`__\ ,
* the configuration file of the model which is saved as a JSON file, and * the configuration file of the model which is saved as a JSON file, and
* the vocabulary (and the merges for the BPE-based models GPT and GPT-2). * the vocabulary (and the merges for the BPE-based models GPT and GPT-2).
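As a rough sketch of these three save steps (this assumes ``model`` and ``tokenizer`` objects coming from a fine-tuning script, and that the ``WEIGHTS_NAME`` / ``CONFIG_NAME`` file-name constants are exposed at the package root, as they were in ``pytorch_pretrained_bert``):

.. code-block:: python

    import os
    import torch
    from pytorch_transformers import WEIGHTS_NAME, CONFIG_NAME

    output_dir = './my_finetuned_model'  # hypothetical output directory
    os.makedirs(output_dir, exist_ok=True)

    # 1. the model weights, saved as a state_dict (PyTorch serialization best practice)
    torch.save(model.state_dict(), os.path.join(output_dir, WEIGHTS_NAME))

    # 2. the model configuration, saved as a JSON file
    model.config.to_json_file(os.path.join(output_dir, CONFIG_NAME))

    # 3. the vocabulary (and the BPE merges for GPT/GPT-2), saved by the tokenizer
    tokenizer.save_vocabulary(output_dir)

The resulting directory can then be passed back to ``from_pretrained()`` in place of a shortcut name.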
...@@ -245,7 +242,7 @@ An overview of the implemented schedules: ...@@ -245,7 +242,7 @@ An overview of the implemented schedules:
* ``ConstantLR``\ : always returns learning rate 1. * ``ConstantLR``\ : always returns learning rate 1.
* ``WarmupConstantSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. * ``WarmupConstantSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. \
Keeps learning rate equal to 1. after warmup. Keeps learning rate equal to 1. after warmup.
.. image:: /imgs/warmup_constant_schedule.png .. image:: /imgs/warmup_constant_schedule.png
...@@ -253,7 +250,7 @@ An overview of the implemented schedules: ...@@ -253,7 +250,7 @@ An overview of the implemented schedules:
:alt: :alt:
* ``WarmupLinearSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. * ``WarmupLinearSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. \
Linearly decreases learning rate from 1. to 0. over remaining ``1 - warmup`` steps. Linearly decreases learning rate from 1. to 0. over remaining ``1 - warmup`` steps.
.. image:: /imgs/warmup_linear_schedule.png .. image:: /imgs/warmup_linear_schedule.png
...@@ -261,8 +258,8 @@ An overview of the implemented schedules: ...@@ -261,8 +258,8 @@ An overview of the implemented schedules:
:alt: :alt:
* ``WarmupCosineSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. * ``WarmupCosineSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps. \
Decreases learning rate from 1. to 0. over remaining ``1 - warmup`` steps following a cosine curve. Decreases learning rate from 1. to 0. over remaining ``1 - warmup`` steps following a cosine curve. \
If ``cycles`` (default=0.5) is different from default, learning rate follows cosine function after warmup. If ``cycles`` (default=0.5) is different from default, learning rate follows cosine function after warmup.
.. image:: /imgs/warmup_cosine_schedule.png .. image:: /imgs/warmup_cosine_schedule.png
......
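The schedule shapes listed above can be summarised with a small standalone sketch (this mirrors the descriptions, not the library's internal implementation): each function maps training progress in ``[0, 1]`` to a learning-rate multiplier in ``[0, 1]``.

.. code-block:: python

    import math

    def warmup_constant(progress, warmup=0.1):
        # linear warmup from 0 to 1, then constant at 1
        return progress / warmup if progress < warmup else 1.0

    def warmup_linear(progress, warmup=0.1):
        # linear warmup, then linear decay from 1 back to 0
        if progress < warmup:
            return progress / warmup
        return max(0.0, (1.0 - progress) / (1.0 - warmup))

    def warmup_cosine(progress, warmup=0.1, cycles=0.5):
        # linear warmup, then cosine decay (with the default cycles=0.5 it goes from 1 to 0)
        if progress < warmup:
            return progress / warmup
        progress = (progress - warmup) / (1.0 - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * cycles * 2.0 * progress))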
...@@ -5,26 +5,26 @@ Transformer XL ...@@ -5,26 +5,26 @@ Transformer XL
``TransfoXLConfig`` ``TransfoXLConfig``
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.TransfoXLConfig .. autoclass:: pytorch_transformers.TransfoXLConfig
:members: :members:
``TransfoXLTokenizer`` ``TransfoXLTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.TransfoXLTokenizer .. autoclass:: pytorch_transformers.TransfoXLTokenizer
:members: :members:
12. ``TransfoXLModel`` ``TransfoXLModel``
~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.TransfoXLModel .. autoclass:: pytorch_transformers.TransfoXLModel
:members: :members:
13. ``TransfoXLLMHeadModel`` ``TransfoXLLMHeadModel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.TransfoXLLMHeadModel .. autoclass:: pytorch_transformers.TransfoXLLMHeadModel
:members: :members:
XLM XLM
---------------------------------------------------- ----------------------------------------------------
``XLMConfig``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I don't really know what to put here, I'll leave it up to you to decide @Thom .. autoclass:: pytorch_transformers.XLMConfig
\ No newline at end of file :members:
``XLMTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.XLMTokenizer
:members:
``XLMModel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.XLMModel
:members:
``XLMWithLMHeadModel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.XLMWithLMHeadModel
:members:
``XLMForSequenceClassification``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.XLMForSequenceClassification
:members:
``XLMForQuestionAnswering``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.XLMForQuestionAnswering
:members:
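Since this page only lists the class references, here is a rough usage sketch for the XLM classes above (the ``xlm-mlm-en-2048`` shortcut name and the example sentence are assumptions; the XLNet classes below follow the same ``from_pretrained`` pattern):

.. code-block:: python

    import torch
    from pytorch_transformers import XLMTokenizer, XLMModel

    # Load a pre-trained XLM checkpoint (shortcut name assumed here)
    tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
    model = XLMModel.from_pretrained('xlm-mlm-en-2048')
    model.eval()

    # Tokenize, map tokens to vocabulary ids and run a forward pass
    tokens = tokenizer.tokenize("Hello, my dog is cute")
    input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
    with torch.no_grad():
        outputs = model(input_ids)
    last_hidden_states = outputs[0]  # (batch_size, sequence_length, hidden_size)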
XLNet XLNet
---------------------------------------------------- ----------------------------------------------------
I don't really know what to put here, I'll leave it up to you to decide @Thom ``XLNetConfig``
\ No newline at end of file ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.XLNetConfig
:members:
``XLNetTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.XLNetTokenizer
:members:
``XLNetModel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.XLNetModel
:members:
``XLNetLMHeadModel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.XLNetLMHeadModel
:members:
``XLNetForSequenceClassification``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.XLNetForSequenceClassification
:members:
``XLNetForQuestionAnswering``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_transformers.XLNetForQuestionAnswering
:members:
...@@ -5,12 +5,12 @@ We include `three Jupyter Notebooks <https://github.com/huggingface/pytorch-pret ...@@ -5,12 +5,12 @@ We include `three Jupyter Notebooks <https://github.com/huggingface/pytorch-pret
* *
The first Notebook (\ `Comparing-TF-and-PT-models.ipynb <./notebooks/Comparing-TF-and-PT-models.ipynb>`_\ ) extracts the hidden states of a full sequence on each layer of the TensorFlow and the PyTorch models and computes the standard deviation between them. In the given example, we get a standard deviation of 1.5e-7 to 9e-7 on the various hidden states of the models. The first Notebook (\ `Comparing-TF-and-PT-models.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks/Comparing-TF-and-PT-models.ipynb>`_\ ) extracts the hidden states of a full sequence on each layer of the TensorFlow and the PyTorch models and computes the standard deviation between them. In the given example, we get a standard deviation of 1.5e-7 to 9e-7 on the various hidden states of the models.
* *
The second Notebook (\ `Comparing-TF-and-PT-models-SQuAD.ipynb <./notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb>`_\ ) compares the loss computed by the TensorFlow and the PyTorch models for identical initialization of the fine-tuning layer of the ``BertForQuestionAnswering`` and computes the standard deviation between them. In the given example, we get a standard deviation of 2.5e-7 between the models. The second Notebook (\ `Comparing-TF-and-PT-models-SQuAD.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb>`_\ ) compares the loss computed by the TensorFlow and the PyTorch models for identical initialization of the fine-tuning layer of the ``BertForQuestionAnswering`` and computes the standard deviation between them. In the given example, we get a standard deviation of 2.5e-7 between the models.
* *
The third Notebook (\ `Comparing-TF-and-PT-models-MLM-NSP.ipynb <./notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb>`_\ ) compares the predictions computed by the TensorFlow and the PyTorch models for masked token language modeling using the pre-trained masked language modeling model. The third Notebook (\ `Comparing-TF-and-PT-models-MLM-NSP.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb>`_\ ) compares the predictions computed by the TensorFlow and the PyTorch models for masked token language modeling using the pre-trained masked language modeling model.
Please follow the instructions given in the notebooks to run and modify them. Please follow the instructions given in the notebooks to run and modify them.
TPU
================================================
TPU support and pretraining scripts
------------------------------------------------
TPUs are not supported by the current stable release of PyTorch (0.4.1). However, the next version of PyTorch (v1.0) should support training on TPUs and is expected to be released soon (see the recent `official announcement <https://cloud.google.com/blog/products/ai-machine-learning/introducing-pytorch-across-google-cloud>`_\ ).
We will add TPU support when this next release is published.
The original TensorFlow code further comprises two scripts for pre-training BERT: `create_pretraining_data.py <https://github.com/google-research/bert/blob/master/create_pretraining_data.py>`_ and `run_pretraining.py <https://github.com/google-research/bert/blob/master/run_pretraining.py>`_.
Since pre-training BERT is a particularly expensive operation that basically requires one or several TPUs to be completed in a reasonable amount of time (see details `here <https://github.com/google-research/bert#pre-training-with-bert>`_\ ), we have decided to wait for the inclusion of TPU support in PyTorch to convert these pre-training scripts.
...@@ -4,14 +4,14 @@ Usage ...@@ -4,14 +4,14 @@ Usage
BERT BERT
^^^^ ^^^^
Here is a quick-start example using the ``BertTokenizer``\ , ``BertModel`` and ``BertForMaskedLM`` classes with Google AI's pre-trained ``Bert base uncased`` model. See the `doc section <#doc>`_ below for all the details on these classes. Here is a quick-start example using the ``BertTokenizer``\ , ``BertModel`` and ``BertForMaskedLM`` classes with Google AI's pre-trained ``Bert base uncased`` model. See the `doc section <./model_doc/overview.html>`_ for all the details on these classes.
First let's prepare a tokenized input with ``BertTokenizer`` First let's prepare a tokenized input with ``BertTokenizer``
.. code-block:: python .. code-block:: python
import torch import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM from pytorch_transformers import BertTokenizer, BertModel, BertForMaskedLM
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging import logging
...@@ -82,14 +82,14 @@ And how to use ``BertForMaskedLM`` ...@@ -82,14 +82,14 @@ And how to use ``BertForMaskedLM``
OpenAI GPT OpenAI GPT
^^^^^^^^^^ ^^^^^^^^^^
Here is a quick-start example using the ``OpenAIGPTTokenizer``\ , ``OpenAIGPTModel`` and ``OpenAIGPTLMHeadModel`` classes with OpenAI's pre-trained model. See the `doc section <#doc>`_ below for all the details on these classes. Here is a quick-start example using the ``OpenAIGPTTokenizer``\ , ``OpenAIGPTModel`` and ``OpenAIGPTLMHeadModel`` classes with OpenAI's pre-trained model. See the `doc section <./model_doc/overview.html>`_ for all the details on these classes.
First let's prepare a tokenized input with ``OpenAIGPTTokenizer`` First let's prepare a tokenized input with ``OpenAIGPTTokenizer``
.. code-block:: python .. code-block:: python
import torch import torch
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel, OpenAIGPTLMHeadModel from pytorch_transformers import OpenAIGPTTokenizer, OpenAIGPTModel, OpenAIGPTLMHeadModel
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging import logging
...@@ -170,14 +170,14 @@ And how to use ``OpenAIGPTDoubleHeadsModel`` ...@@ -170,14 +170,14 @@ And how to use ``OpenAIGPTDoubleHeadsModel``
Transformer-XL Transformer-XL
^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
Here is a quick-start example using the ``TransfoXLTokenizer``\ , ``TransfoXLModel`` and ``TransfoXLLMHeadModel`` classes with the Transformer-XL model pre-trained on WikiText-103. See the `doc section <#doc>`_ below for all the details on these classes. Here is a quick-start example using the ``TransfoXLTokenizer``\ , ``TransfoXLModel`` and ``TransfoXLLMHeadModel`` classes with the Transformer-XL model pre-trained on WikiText-103. See the `doc section <./model_doc/overview.html>`_ for all the details on these classes.
First let's prepare a tokenized input with ``TransfoXLTokenizer`` First let's prepare a tokenized input with ``TransfoXLTokenizer``
.. code-block:: python .. code-block:: python
import torch import torch
from pytorch_pretrained_bert import TransfoXLTokenizer, TransfoXLModel, TransfoXLLMHeadModel from pytorch_transformers import TransfoXLTokenizer, TransfoXLModel, TransfoXLLMHeadModel
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging import logging
...@@ -246,14 +246,14 @@ And how to use ``TransfoXLLMHeadModel`` ...@@ -246,14 +246,14 @@ And how to use ``TransfoXLLMHeadModel``
OpenAI GPT-2 OpenAI GPT-2
^^^^^^^^^^^^ ^^^^^^^^^^^^
Here is a quick-start example using the ``GPT2Tokenizer``\ , ``GPT2Model`` and ``GPT2LMHeadModel`` classes with OpenAI's pre-trained model. See the `doc section <#doc>`_ below for all the details on these classes. Here is a quick-start example using the ``GPT2Tokenizer``\ , ``GPT2Model`` and ``GPT2LMHeadModel`` classes with OpenAI's pre-trained model. See the `doc section <./model_doc/overview.html>`_ for all the details on these classes.
First let's prepare a tokenized input with ``GPT2Tokenizer`` First let's prepare a tokenized input with ``GPT2Tokenizer``
.. code-block:: python .. code-block:: python
import torch import torch
from pytorch_pretrained_bert import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel from pytorch_transformers import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging import logging
......