Commit 9dd2c860 (unverified), authored by Thomas Wolf, committed by GitHub
Merge pull request #767 from huggingface/doc: Documentation
Parents: 9113b50c, e0e5c7fa
Pytorch-Transformers
================================================================================================================================================
.. toctree::
:maxdepth: 2
:caption: Notes
installation
philosophy
usage
examples
notebooks
tpu
cli
migration
bertology
torchscript
.. toctree::
:maxdepth: 2
:caption: Package Reference
model_doc/overview
model_doc/bert
model_doc/gpt
model_doc/transformerxl
model_doc/gpt2
model_doc/xlm
model_doc/xlnet
.. image:: https://circleci.com/gh/huggingface/pytorch-pretrained-BERT.svg?style=svg
:target: https://circleci.com/gh/huggingface/pytorch-pretrained-BERT
:alt: CircleCI
This repository contains op-for-op PyTorch reimplementations, pre-trained models and fine-tuning examples for:
* `Google's BERT model <https://github.com/google-research/bert>`_\ ,
* `OpenAI's GPT model <https://github.com/openai/finetune-transformer-lm>`_\ ,
* `Google/CMU's Transformer-XL model <https://github.com/kimiyoung/transformer-xl>`_\ , and
* `OpenAI's GPT-2 model <https://blog.openai.com/better-language-models/>`_.
These implementations have been tested on several datasets (see the examples) and should match the performances of the associated TensorFlow implementations (e.g. ~91 F1 on SQuAD for BERT, ~88 F1 on RocStories for OpenAI GPT and ~18.3 perplexity on WikiText 103 for the Transformer-XL). You can find more details in the `Examples <#examples>`_ section below.
Here is some information about these models:
**BERT** was released together with the paper `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`_ by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
This PyTorch implementation of BERT is provided with `Google's pre-trained models <https://github.com/google-research/bert>`_\ , examples, notebooks and a command-line interface to load any pre-trained TensorFlow checkpoint for BERT.
**OpenAI GPT** was released together with the paper `Improving Language Understanding by Generative Pre-Training <https://blog.openai.com/language-unsupervised/>`_ by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
This PyTorch implementation of OpenAI GPT is an adaptation of the `PyTorch implementation by HuggingFace <https://github.com/huggingface/pytorch-openai-transformer-lm>`_ and is provided with `OpenAI's pre-trained model <https://github.com/openai/finetune-transformer-lm>`_ and a command-line interface that was used to convert the pre-trained NumPy checkpoint to PyTorch.
**Google/CMU's Transformer-XL** was released together with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <http://arxiv.org/abs/1901.02860>`_ by Zihang Dai\*, Zhilin Yang\*, Yiming Yang, Jaime Carbonell, Quoc V. Le and Ruslan Salakhutdinov.
This PyTorch implementation of Transformer-XL is an adaptation of the original `PyTorch implementation <https://github.com/kimiyoung/transformer-xl>`_ which has been slightly modified to match the performance of the TensorFlow implementation and allow re-using the pretrained weights. A command-line interface is provided to convert TensorFlow checkpoints into PyTorch models.
**OpenAI GPT-2** was released together with the paper `Language Models are Unsupervised Multitask Learners <https://blog.openai.com/better-language-models/>`_ by Alec Radford\*, Jeffrey Wu\*, Rewon Child, David Luan, Dario Amodei\*\* and Ilya Sutskever\*\*.
This PyTorch implementation of OpenAI GPT-2 is an adaptation of `OpenAI's implementation <https://github.com/openai/gpt-2>`_ and is provided with `OpenAI's pre-trained model <https://github.com/openai/gpt-2>`_ and a command-line interface that was used to convert the TensorFlow checkpoint to PyTorch.
Content
-------
.. list-table::
:header-rows: 1
* - Section
- Description
* - `Installation <#installation>`_
- How to install the package
* - `Overview <#overview>`_
- Overview of the package
* - `Usage <#usage>`_
- Quickstart examples
* - `Doc <#doc>`_
- Detailed documentation
* - `Examples <#examples>`_
- Detailed examples on how to fine-tune Bert
* - `Notebooks <#notebooks>`_
- Introduction on the provided Jupyter Notebooks
* - `TPU <#tpu>`_
- Notes on TPU support and pretraining scripts
* - `Command-line interface <#Command-line-interface>`_
- Convert a TensorFlow checkpoint into a PyTorch dump
Overview
--------
This package comprises the following classes that can be imported in Python and are detailed in the `Doc <#doc>`_ section of this readme:
*
Eight **Bert** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py>`_ file):
* `BertModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L639>`_ - raw BERT Transformer model (\ **fully pre-trained**\ ),
* `BertForMaskedLM <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L793>`_ - BERT Transformer with the pre-trained masked language modeling head on top (\ **fully pre-trained**\ ),
* `BertForNextSentencePrediction <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L854>`_ - BERT Transformer with the pre-trained next sentence prediction classifier on top (\ **fully pre-trained**\ ),
* `BertForPreTraining <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L722>`_ - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (\ **fully pre-trained**\ ),
* `BertForSequenceClassification <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L916>`_ - BERT Transformer with a sequence classification head on top (BERT Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ),
* `BertForMultipleChoice <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L982>`_ - BERT Transformer with a multiple choice head on top (used for task like Swag) (BERT Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
* `BertForTokenClassification <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L1051>`_ - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ ),
* `BertForQuestionAnswering <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L1124>`_ - BERT Transformer with a span classification head on top for extractive question answering (BERT Transformer is **pre-trained**\ , the span classification head **is only initialized and has to be trained**\ ).
*
Three **OpenAI GPT** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_openai.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_openai.py>`_ file):
* `OpenAIGPTModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_openai.py#L536>`_ - raw OpenAI GPT Transformer model (\ **fully pre-trained**\ ),
* `OpenAIGPTLMHeadModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_openai.py#L643>`_ - OpenAI GPT Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
* `OpenAIGPTDoubleHeadsModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_openai.py#L722>`_ - OpenAI GPT Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
*
Two **Transformer-XL** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_transfo_xl.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_transfo_xl.py>`_ file):
* `TransfoXLModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_transfo_xl.py#L983>`_ - Transformer-XL model which outputs the last hidden state and memory cells (\ **fully pre-trained**\ ),
* `TransfoXLLMHeadModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_transfo_xl.py#L1260>`_ - Transformer-XL with the tied adaptive softmax head on top for language modeling which outputs the logits/loss and memory cells (\ **fully pre-trained**\ ),
*
Three **OpenAI GPT-2** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_gpt2.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_gpt2.py>`_ file):
* `GPT2Model <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_gpt2.py#L479>`_ - raw OpenAI GPT-2 Transformer model (\ **fully pre-trained**\ ),
* `GPT2LMHeadModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_gpt2.py#L559>`_ - OpenAI GPT-2 Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
* `GPT2DoubleHeadsModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_gpt2.py#L624>`_ - OpenAI GPT-2 Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT-2 Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
*
Tokenizers for **BERT** (using word-piece) (in the `tokenization.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/tokenization.py>`_ file):
* ``BasicTokenizer`` - basic tokenization (punctuation splitting, lower casing, etc.),
* ``WordpieceTokenizer`` - WordPiece tokenization,
* ``BertTokenizer`` - perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.
*
Tokenizer for **OpenAI GPT** (using Byte-Pair-Encoding) (in the `tokenization_openai.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/tokenization_openai.py>`_ file):
* ``OpenAIGPTTokenizer`` - perform Byte-Pair-Encoding (BPE) tokenization.
*
Tokenizer for **Transformer-XL** (word tokens ordered by frequency for adaptive softmax) (in the `tokenization_transfo_xl.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/tokenization_transfo_xl.py>`_ file):
* ``TransfoXLTokenizer`` - perform word tokenization and can order words by frequency in a corpus for use in an adaptive softmax.
*
Tokenizer for **OpenAI GPT-2** (using byte-level Byte-Pair-Encoding) (in the `tokenization_gpt2.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/tokenization_gpt2.py>`_ file):
* ``GPT2Tokenizer`` - perform byte-level Byte-Pair-Encoding (BPE) tokenization.
*
Optimizer for **BERT** (in the `optimization.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/optimization.py>`_ file):
* ``BertAdam`` - Bert version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
*
Optimizer for **OpenAI GPT** (in the `optimization_openai.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/optimization_openai.py>`_ file):
* ``OpenAIAdam`` - OpenAI GPT version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
*
Configuration classes for BERT, OpenAI GPT and Transformer-XL (in the respective `modeling.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py>`_\ , `modeling_openai.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_openai.py>`_\ , `modeling_transfo_xl.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_transfo_xl.py>`_ files):
* ``BertConfig`` - Configuration class to store the configuration of a ``BertModel`` with utilities to read and write from JSON configuration files.
* ``OpenAIGPTConfig`` - Configuration class to store the configuration of a ``OpenAIGPTModel`` with utilities to read and write from JSON configuration files.
* ``GPT2Config`` - Configuration class to store the configuration of a ``GPT2Model`` with utilities to read and write from JSON configuration files.
* ``TransfoXLConfig`` - Configuration class to store the configuration of a ``TransfoXLModel`` with utilities to read and write from JSON configuration files.
The repository further comprises:
*
Five examples on how to use **BERT** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`_\ ):
* `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_extract_features.py>`_ - Show how to extract hidden states from an instance of ``BertModel``\ ,
* `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_classifier.py>`_ - Show how to fine-tune an instance of ``BertForSequenceClassification`` on GLUE's MRPC task,
* `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_squad.py>`_ - Show how to fine-tune an instance of ``BertForQuestionAnswering`` on SQuAD v1.0 and SQuAD v2.0 tasks.
* `run_swag.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_swag.py>`_ - Show how to fine-tune an instance of ``BertForMultipleChoice`` on Swag task.
* `simple_lm_finetuning.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/lm_finetuning/simple_lm_finetuning.py>`_ - Show how to fine-tune an instance of ``BertForPreTraining`` on a target text corpus.
*
One example on how to use **OpenAI GPT** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`_\ ):
* `run_openai_gpt.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_openai_gpt.py>`_ - Show how to fine-tune an instance of ``OpenAIGPTDoubleHeadsModel`` on the RocStories task.
*
One example on how to use **Transformer-XL** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`_\ ):
* `run_transfo_xl.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_transfo_xl.py>`_ - Show how to load and evaluate a pre-trained model of ``TransfoXLLMHeadModel`` on WikiText 103.
*
One example on how to use **OpenAI GPT-2** in the unconditional and interactive mode (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`_\ ):
* `run_gpt2.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_gpt2.py>`_ - Show how to use an instance of ``GPT2LMHeadModel`` (OpenAI GPT-2) to generate text (same as the original OpenAI GPT-2 examples).
These examples are detailed in the `Examples <#examples>`_ section of this readme.
*
Three notebooks that were used to check that the TensorFlow and PyTorch models behave identically (in the `notebooks folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks>`_\ ):
* `Comparing-TF-and-PT-models.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models.ipynb>`_ - Compare the hidden states predicted by ``BertModel``\ ,
* `Comparing-TF-and-PT-models-SQuAD.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb>`_ - Compare the spans predicted by ``BertForQuestionAnswering`` instances,
* `Comparing-TF-and-PT-models-MLM-NSP.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb>`_ - Compare the predictions of the ``BertForPreTraining`` instances.
These notebooks are detailed in the `Notebooks <#notebooks>`_ section of this readme.
*
A command-line interface to convert TensorFlow checkpoints (BERT, Transformer-XL) or NumPy checkpoints (OpenAI GPT) into a PyTorch save of the associated PyTorch model:
This CLI is detailed in the `Command-line interface <#Command-line-interface>`_ section of this readme.
Installation
================================================
This repo was tested on Python 2.7 and 3.5+ (examples are tested only on Python 3.5+) and PyTorch 0.4.1/1.0.0.
With pip
^^^^^^^^
The package can be installed with pip as follows:
.. code-block:: bash
pip install pytorch-pretrained-bert
If you want to reproduce the original tokenization process of the ``OpenAI GPT`` paper, you will need to install ``ftfy`` (limit to version 4.4.3 if you are using Python 2) and ``SpaCy`` :
.. code-block:: bash
pip install spacy ftfy==4.4.3
python -m spacy download en
If you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer will default to tokenizing with BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
From source
^^^^^^^^^^^
Clone the repository and run:
.. code-block:: bash
pip install [--editable] .
Here also, if you want to reproduce the original tokenization process of the ``OpenAI GPT`` model, you will need to install ``ftfy`` (limit to version 4.4.3 if you are using Python 2) and ``SpaCy`` :
.. code-block:: bash
pip install spacy ftfy==4.4.3
python -m spacy download en
Again, if you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer will default to tokenizing with BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage).
A series of tests is included in the `tests folder <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/tests>`_ and can be run using ``pytest`` (install pytest if needed: ``pip install pytest``\ ).
You can run the tests with the command:
.. code-block:: bash
python -m pytest -sv tests/
# Migration
BERT
----------------------------------------------------
``BertConfig``
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertConfig
:members:
``BertTokenizer``
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertTokenizer
:members:
``BertAdam``
~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertAdam
:members:
1. ``BertModel``
~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertModel
:members:
2. ``BertForPreTraining``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertForPreTraining
:members:
3. ``BertForMaskedLM``
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertForMaskedLM
:members:
4. ``BertForNextSentencePrediction``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertForNextSentencePrediction
:members:
5. ``BertForSequenceClassification``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertForSequenceClassification
:members:
6. ``BertForMultipleChoice``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertForMultipleChoice
:members:
7. ``BertForTokenClassification``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertForTokenClassification
:members:
8. ``BertForQuestionAnswering``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.BertForQuestionAnswering
:members:
OpenAI GPT
----------------------------------------------------
``OpenAIGPTConfig``
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.OpenAIGPTConfig
:members:
``OpenAIGPTTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.OpenAIGPTTokenizer
:members:
``OpenAIAdam``
~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.OpenAIAdam
:members:
9. ``OpenAIGPTModel``
~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.OpenAIGPTModel
:members:
10. ``OpenAIGPTLMHeadModel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.OpenAIGPTLMHeadModel
:members:
11. ``OpenAIGPTDoubleHeadsModel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.OpenAIGPTDoubleHeadsModel
:members:
OpenAI GPT2
----------------------------------------------------
``GPT2Config``
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.GPT2Config
:members:
``GPT2Tokenizer``
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.GPT2Tokenizer
:members:
14. ``GPT2Model``
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.GPT2Model
:members:
15. ``GPT2LMHeadModel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.GPT2LMHeadModel
:members:
16. ``GPT2DoubleHeadsModel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.GPT2DoubleHeadsModel
:members:
Overview
================================================
Here is detailed documentation of the classes in the package and how to use them:
.. list-table::
:header-rows: 1
* - Sub-section
- Description
* - `Loading pre-trained weights <#loading-google-ai-or-openai-pre-trained-weights-or-pytorch-dump>`_
- How to load Google AI/OpenAI's pre-trained weights or a PyTorch saved instance
* - `Serialization best-practices <#serialization-best-practices>`_
- How to save and reload a fine-tuned model
* - `Configurations <#configurations>`_
- API of the configuration classes for BERT, GPT, GPT-2 and Transformer-XL
* - `Models <#models>`_
- API of the PyTorch model classes for BERT, GPT, GPT-2 and Transformer-XL
* - `Tokenizers <#tokenizers>`_
- API of the tokenizers class for BERT, GPT, GPT-2 and Transformer-XL
* - `Optimizers <#optimizers>`_
- API of the optimizers
Configurations
^^^^^^^^^^^^^^
Models (BERT, GPT, GPT-2 and Transformer-XL) are defined and built from configuration classes which contain the
parameters of the models (number of layers, hidden dimensions, ...) and a few utilities to read and write from JSON
configuration files. The respective configuration classes are:
* ``BertConfig`` for ``BertModel`` and BERT classes instances.
* ``OpenAIGPTConfig`` for ``OpenAIGPTModel`` and OpenAI GPT classes instances.
* ``GPT2Config`` for ``GPT2Model`` and OpenAI GPT-2 classes instances.
* ``TransfoXLConfig`` for ``TransfoXLModel`` and Transformer-XL classes instances.
These configuration classes contain a few utilities to load and save configurations (a short usage sketch follows this list):
* ``from_dict(cls, json_object)``\ : A class method to construct a configuration from a Python dictionary of parameters. Returns an instance of the configuration class.
* ``from_json_file(cls, json_file)``\ : A class method to construct a configuration from a JSON file of parameters. Returns an instance of the configuration class.
* ``to_dict()``\ : Serializes an instance to a Python dictionary. Returns a dictionary.
* ``to_json_string()``\ : Serializes an instance to a JSON string. Returns a string.
* ``to_json_file(json_file_path)``\ : Saves an instance to a JSON file.
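For illustration, here is a minimal sketch of these utilities (the JSON file name is arbitrary and used only for the example):
.. code-block:: python
from pytorch_pretrained_bert import BertConfig
# Build a configuration from scratch, write it to a JSON file and read it back
config = BertConfig(vocab_size_or_config_json_file=30522, hidden_size=768,
                    num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
config.to_json_file('example_bert_config.json')  # illustrative file name
reloaded_config = BertConfig.from_json_file('example_bert_config.json')
assert config.to_dict() == reloaded_config.to_dict()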
Loading Google AI or OpenAI pre-trained weights or PyTorch dump
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``from_pretrained()`` method
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To load one of Google AI's, OpenAI's pre-trained models or a PyTorch saved model (an instance of ``BertForPreTraining`` saved with ``torch.save()``\ ), the PyTorch model classes and the tokenizer can be instantiated using the ``from_pretrained()`` method:
.. code-block:: python
model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH, cache_dir=None, from_tf=False, state_dict=None, *input, **kwargs)
where
* ``BERT_CLASS`` is either a tokenizer to load the vocabulary (\ ``BertTokenizer`` or ``OpenAIGPTTokenizer`` classes) or one of the eight BERT or three OpenAI GPT PyTorch model classes (to load the pre-trained weights): ``BertModel``\ , ``BertForMaskedLM``\ , ``BertForNextSentencePrediction``\ , ``BertForPreTraining``\ , ``BertForSequenceClassification``\ , ``BertForTokenClassification``\ , ``BertForMultipleChoice``\ , ``BertForQuestionAnswering``\ , ``OpenAIGPTModel``\ , ``OpenAIGPTLMHeadModel`` or ``OpenAIGPTDoubleHeadsModel``\ , and
*
``PRE_TRAINED_MODEL_NAME_OR_PATH`` is either:
*
the shortcut name of a Google AI's or OpenAI's pre-trained model selected in the list:
* ``bert-base-uncased``: 12-layer, 768-hidden, 12-heads, 110M parameters
* ``bert-large-uncased``: 24-layer, 1024-hidden, 16-heads, 340M parameters
* ``bert-base-cased``: 12-layer, 768-hidden, 12-heads , 110M parameters
* ``bert-large-cased``: 24-layer, 1024-hidden, 16-heads, 340M parameters
* ``bert-base-multilingual-uncased``: (Orig, not recommended) 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* ``bert-base-multilingual-cased``: **(New, recommended)** 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
* ``bert-base-chinese``: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
* ``bert-base-german-cased``: Trained on German data only, 12-layer, 768-hidden, 12-heads, 110M parameters `Performance Evaluation <https://deepset.ai/german-bert>`_
* ``bert-large-uncased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the tokens corresponding to a word at once)
* ``bert-large-cased-whole-word-masking``: 24-layer, 1024-hidden, 16-heads, 340M parameters - Trained with Whole Word Masking (mask all of the tokens corresponding to a word at once)
* ``bert-large-uncased-whole-word-masking-finetuned-squad``: The ``bert-large-uncased-whole-word-masking`` model finetuned on SQuAD (using the ``run_bert_squad.py`` examples). Results: *exact_match: 86.91579943235573, f1: 93.1532499015869*
* ``openai-gpt``: OpenAI GPT English model, 12-layer, 768-hidden, 12-heads, 110M parameters
* ``gpt2``: OpenAI GPT-2 English model, 12-layer, 768-hidden, 12-heads, 117M parameters
* ``gpt2-medium``: OpenAI GPT-2 English model, 24-layer, 1024-hidden, 16-heads, 345M parameters
* ``transfo-xl-wt103``: Transformer-XL English model trained on wikitext-103, 18-layer, 1024-hidden, 16-heads, 257M parameters
*
a path or url to a pretrained model archive containing:
* ``bert_config.json`` or ``openai_gpt_config.json`` a configuration file for the model, and
* ``pytorch_model.bin`` a PyTorch dump of a pre-trained instance of ``BertForPreTraining``\ , ``OpenAIGPTModel``\ , ``TransfoXLModel``\ , ``GPT2LMHeadModel`` (saved with the usual ``torch.save()``\ )
If ``PRE_TRAINED_MODEL_NAME_OR_PATH`` is a shortcut name, the pre-trained weights will be downloaded from AWS S3 (see the links `here <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/pytorch_pretrained_bert/modeling.py>`_\ ) and stored in a cache folder to avoid future download (the cache folder can be found at ``~/.pytorch_pretrained_bert/``\ ).
*
``cache_dir`` can be an optional path to a specific directory to download and cache the pre-trained model weights. This option is useful in particular when you are using distributed training: to avoid concurrent access to the same weights you can set for example ``cache_dir='./pretrained_model_{}'.format(args.local_rank)`` (see the section on distributed training for more information).
* ``from_tf``\ : whether to load the weights from a locally saved TensorFlow checkpoint
* ``state_dict``\ : an optional state dictionary (a ``collections.OrderedDict`` object) to use instead of Google's pre-trained weights
* ``*inputs``\ , ``**kwargs``\ : additional inputs for the specific BERT class (e.g. ``num_labels`` for ``BertForSequenceClassification``)
``Uncased`` means that the text has been lowercased before WordPiece tokenization, e.g., ``John Smith`` becomes ``john smith``. The Uncased model also strips out any accent markers. ``Cased`` means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging). For information about the Multilingual and Chinese model, see the `Multilingual README <https://github.com/google-research/bert/blob/master/multilingual.md>`_ or the original TensorFlow repository.
When using an ``uncased model``\ , make sure to pass ``--do_lower_case`` to the example training scripts (or pass ``do_lower_case=True`` to FullTokenizer if you're using your own script and loading the tokenizer yourself).
Examples:
.. code-block:: python
# BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True, do_basic_tokenize=True)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
# OpenAI GPT
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTModel.from_pretrained('openai-gpt')
# Transformer-XL
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
# OpenAI GPT-2
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
Cache directory
~~~~~~~~~~~~~~~
``pytorch_pretrained_bert`` saves the pretrained weights in a cache directory which is located at (in this order of priority):
* the optional ``cache_dir`` argument passed to the ``from_pretrained()`` method (see above),
* shell environment variable ``PYTORCH_PRETRAINED_BERT_CACHE``\ ,
* PyTorch cache home + ``/pytorch_pretrained_bert/``
where PyTorch cache home is defined by (in this order):
* shell environment variable ``TORCH_HOME``
* shell environment variable ``XDG_CACHE_HOME`` + ``/torch/``
* default: ``~/.cache/torch/``
Usually, if you don't set any specific environment variable, the ``pytorch_pretrained_bert`` cache will be at ``~/.cache/torch/pytorch_pretrained_bert/``.
You can always safely delete the ``pytorch_pretrained_bert`` cache, but the pretrained model weights and vocabulary files will have to be re-downloaded from our S3.
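For example, a rough sketch (the ``/data/bert_cache`` path is hypothetical): the cache location can be overridden globally through the environment variable, which should be set before the package is imported, or per call through ``cache_dir``:
.. code-block:: python
import os
# Hypothetical cache path, for illustration only; set the variable before importing the package
os.environ['PYTORCH_PRETRAINED_BERT_CACHE'] = '/data/bert_cache'
from pytorch_pretrained_bert import BertModel
# Alternatively, override the cache location for a single call with `cache_dir`
model = BertModel.from_pretrained('bert-base-uncased', cache_dir='/data/bert_cache')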
Serialization best-practices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This section explains how you can save and re-load a fine-tuned model (BERT, GPT, GPT-2 and Transformer-XL).
There are three types of files you need to save to be able to reload a fine-tuned model:
* the model itself, which should be saved following PyTorch serialization `best practices <https://pytorch.org/docs/stable/notes/serialization.html#best-practices>`_\ ,
* the configuration file of the model which is saved as a JSON file, and
* the vocabulary (and the merges for the BPE-based models GPT and GPT-2).
The *default filenames* of these files are as follows:
* the model weights file: ``pytorch_model.bin``\ ,
* the configuration file: ``config.json``\ ,
* the vocabulary file: ``vocab.txt`` for BERT and Transformer-XL, ``vocab.json`` for GPT/GPT-2 (BPE vocabulary),
* for GPT/GPT-2 (BPE vocabulary) the additional merges file: ``merges.txt``.
**If you save a model using these default filenames, you can then re-load the model and tokenizer using the** ``from_pretrained()`` **method.**
Here is the recommended way of saving the model, configuration and vocabulary to an ``output_dir`` directory and reloading the model and tokenizer afterwards:
.. code-block:: python
import os
import torch
from pytorch_pretrained_bert import WEIGHTS_NAME, CONFIG_NAME
output_dir = "./models/"
# Step 1: Save a model, configuration and vocabulary that you have fine-tuned
# If we have a distributed model, save only the encapsulated model
# (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
model_to_save = model.module if hasattr(model, 'module') else model
# If we save using the predefined names, we can load using `from_pretrained`
output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(output_dir, CONFIG_NAME)
torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(output_dir)
# Step 2: Re-load the saved model and vocabulary
# Example for a Bert model
model = BertForQuestionAnswering.from_pretrained(output_dir)
tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=args.do_lower_case) # Add specific options if needed
# Example for a GPT model
model = OpenAIGPTDoubleHeadsModel.from_pretrained(output_dir)
tokenizer = OpenAIGPTTokenizer.from_pretrained(output_dir)
Here is another way you can save and reload the model if you want to use specific paths for each type of file:
.. code-block:: python
output_model_file = "./models/my_own_model_file.bin"
output_config_file = "./models/my_own_config_file.bin"
output_vocab_file = "./models/my_own_vocab_file.bin"
# Step 1: Save a model, configuration and vocabulary that you have fine-tuned
# If we have a distributed model, save only the encapsulated model
# (it was wrapped in PyTorch DistributedDataParallel or DataParallel)
model_to_save = model.module if hasattr(model, 'module') else model
torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(output_vocab_file)
# Step 2: Re-load the saved model and vocabulary
# We didn't save using the predefined WEIGHTS_NAME, CONFIG_NAME names, we cannot load using `from_pretrained`.
# Here is how to do it in this situation:
# Example for a Bert model
config = BertConfig.from_json_file(output_config_file)
model = BertForQuestionAnswering(config)
state_dict = torch.load(output_model_file)
model.load_state_dict(state_dict)
tokenizer = BertTokenizer(output_vocab_file, do_lower_case=args.do_lower_case)
# Example for a GPT model
config = OpenAIGPTConfig.from_json_file(output_config_file)
model = OpenAIGPTDoubleHeadsModel(config)
state_dict = torch.load(output_model_file)
model.load_state_dict(state_dict)
tokenizer = OpenAIGPTTokenizer(output_vocab_file)
Learning Rate Schedules
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``.optimization`` module also provides additional schedules in the form of schedule objects that inherit from ``_LRSchedule``.
All ``_LRSchedule`` subclasses accept ``warmup`` and ``t_total`` arguments at construction.
When an ``_LRSchedule`` object is passed into ``BertAdam`` or ``OpenAIAdam``\ ,
the ``warmup`` and ``t_total`` arguments on the optimizer are ignored and the ones in the ``_LRSchedule`` object are used.
An overview of the implemented schedules (a short usage sketch follows this list):
* ``ConstantLR``\ : always returns learning rate 1.
* ``WarmupConstantSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps.
Keeps learning rate equal to 1. after warmup.
.. image:: /imgs/warmup_constant_schedule.png
:target: /imgs/warmup_constant_schedule.png
:alt:
* ``WarmupLinearSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps.
Linearly decreases learning rate from 1. to 0. over remaining ``1 - warmup`` steps.
.. image:: /imgs/warmup_linear_schedule.png
:target: /imgs/warmup_linear_schedule.png
:alt:
* ``WarmupCosineSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps.
Decreases learning rate from 1. to 0. over remaining ``1 - warmup`` steps following a cosine curve.
If ``cycles`` (default=0.5) is different from default, learning rate follows cosine function after warmup.
.. image:: /imgs/warmup_cosine_schedule.png
:target: /imgs/warmup_cosine_schedule.png
:alt:
* ``WarmupCosineWithHardRestartsSchedule`` : Linearly increases learning rate from 0 to 1 over ``warmup`` fraction of training steps.
If ``cycles`` (default=1.) is different from default, learning rate follows ``cycles`` times a cosine decaying learning rate (with hard restarts).
.. image:: /imgs/warmup_cosine_hard_restarts_schedule.png
:target: /imgs/warmup_cosine_hard_restarts_schedule.png
:alt:
* ``WarmupCosineWithWarmupRestartsSchedule`` : All training progress is divided in ``cycles`` (default=1.) parts of equal length.
Every part follows a schedule with the first ``warmup`` fraction of the training steps linearly increasing from 0. to 1.,
followed by a learning rate decreasing from 1. to 0. following a cosine curve.
Note that the total number of all warmup steps over all cycles together is equal to ``warmup`` * ``cycles``
.. image:: /imgs/warmup_cosine_warm_restarts_schedule.png
:target: /imgs/warmup_cosine_warm_restarts_schedule.png
:alt:
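Here is a short usage sketch of these schedule objects, assuming the schedule is passed to the optimizer through its ``schedule`` argument; the model and number of training steps below are placeholders:
.. code-block:: python
import torch
from pytorch_pretrained_bert import BertAdam
from pytorch_pretrained_bert.optimization import WarmupLinearSchedule
# Placeholder model and training length, for illustration only
model = torch.nn.Linear(768, 2)
num_train_steps = 1000
# 10% of the steps are linear warmup, then the learning rate decays linearly to 0
schedule = WarmupLinearSchedule(warmup=0.1, t_total=num_train_steps)
# When a schedule object is provided, BertAdam's own warmup/t_total arguments are ignored
optimizer = BertAdam(model.parameters(), lr=3e-5, schedule=schedule)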
Transformer XL
----------------------------------------------------
``TransfoXLConfig``
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.TransfoXLConfig
:members:
``TransfoXLTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.TransfoXLTokenizer
:members:
12. ``TransfoXLModel``
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.TransfoXLModel
:members:
13. ``TransfoXLLMHeadModel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: pytorch_pretrained_bert.TransfoXLLMHeadModel
:members:
XLM
----------------------------------------------------
TODO (@Thom): documentation for the XLM model classes still needs to be written.
XLNet
----------------------------------------------------
TODO (@Thom): documentation for the XLNet model classes still needs to be written.
Notebooks
================================================
We include `three Jupyter Notebooks <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks>`_ that can be used to check that the predictions of the PyTorch model are identical to the predictions of the original TensorFlow model.
*
The first NoteBook (\ `Comparing-TF-and-PT-models.ipynb <./notebooks/Comparing-TF-and-PT-models.ipynb>`_\ ) extracts the hidden states of a full sequence on each layer of the TensorFlow and the PyTorch models and computes the standard deviation between them. In the given example, we get a standard deviation of 1.5e-7 to 9e-7 on the various hidden states of the models.
*
The second NoteBook (\ `Comparing-TF-and-PT-models-SQuAD.ipynb <./notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb>`_\ ) compares the loss computed by the TensorFlow and the PyTorch models for identical initialization of the fine-tuning layer of the ``BertForQuestionAnswering`` and computes the standard deviation between them. In the given example, we get a standard deviation of 2.5e-7 between the models.
*
The third NoteBook (\ `Comparing-TF-and-PT-models-MLM-NSP.ipynb <./notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb>`_\ ) compares the predictions computed by the TensorFlow and the PyTorch models for masked token language modeling using the pre-trained masked language modeling model.
Please follow the instructions given in the notebooks to run and modify them.
# Philosophy
TorchScript
================================================
According to PyTorch's documentation: "TorchScript is a way to create serializable and optimizable models from PyTorch code".
PyTorch's two modules `JIT and TRACE <https://pytorch.org/docs/stable/jit.html>`_ allow the developer to export
their model to be re-used in other programs, such as efficiency-oriented C++ programs.
We have provided an interface that allows the export of `pytorch-transformers` models to TorchScript so that they can
be reused in a different environment than a PyTorch-based Python program. Here we explain how to use our models so that
they can be exported, and what to be mindful of when using these models with TorchScript.
Exporting a model requires two things:
* dummy inputs to execute a model forward pass, and
* the model must be instantiated with the ``torchscript`` flag.
These necessities imply several things developers should be careful about, detailed below.
Implications
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TorchScript flag and tied weights
------------------------------------------------
This flag is necessary because most of the language models in this repository have tied weights between their
``Embedding`` layer and their ``Decoding`` layer. TorchScript does not allow the export of models that have tied weights,
so the weights must be untied beforehand.
This implies that models instantiated with the ``torchscript`` flag have their ``Embedding`` layer and ``Decoding`` layer
separate, which means that they should not be trained down the line. Training would de-synchronize the two layers,
leading to unexpected results.
This is not the case for models that do not have a Language Model head, as those do not have tied weights. These models
can be safely exported without the ``torchscript`` flag.
Dummy inputs and standard lengths
------------------------------------------------
The dummy inputs are used to do a model forward pass. While the inputs' values are propagating through the layers,
Pytorch keeps track of the different operations executed on each tensor. These recorded operations are then used
to create the "trace" of the model.
The trace is created relative to the inputs' dimensions. It is therefore constrained by the dimensions of the dummy
input and will not work for any other sequence length or batch size. When trying a different size, an error such
as:
``The expanded size of the tensor (3) must match the existing size (7) at non-singleton dimension 2``
will be raised. It is therefore recommended to trace the model with a dummy input size at least as large as the largest
input that will be fed to the model during inference. Padding can be performed to fill the missing values (see the short
sketch below). As the model will have been traced with a large input size, however, the dimensions of the different
matrices will be large as well, resulting in more computation.
It is recommended to be careful of the total number of operations done on each input and to follow performance closely
when exporting varying sequence-length models.
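As an illustration, here is a sketch (the token ids and traced length are made up) of right-padding a shorter input up to the sequence length used at trace time, so that the fixed shapes recorded in the trace still match:
.. code-block:: python
import torch
traced_seq_len = 14  # assumed sequence length of the dummy input used at trace time
short_tokens = torch.tensor([[101, 2040, 2001, 3958, 102]])  # made-up token ids, shape [1, 5]
# Right-pad with zeros up to the traced length so the shapes match the trace
padding = torch.zeros((1, traced_seq_len - short_tokens.size(1)), dtype=torch.long)
padded_tokens = torch.cat([short_tokens, padding], dim=1)  # shape [1, traced_seq_len]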
Using TorchScript in Python
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Below are examples showing how to save and load models in Python, as well as how to use a traced model for inference.
Saving a model
------------------------------------------------
This snippet shows how to use TorchScript to export a ``BertModel``. Here the ``BertModel`` is instantiated
according to a ``BertConfig`` class and then saved to disk under the filename ``traced_bert.pt``
.. code-block:: python
from pytorch_pretrained_bert import BertModel, BertTokenizer, BertConfig
import torch
enc = BertTokenizer.from_pretrained("bert-base-uncased")
# Tokenizing input text
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = enc.tokenize(text)
# Masking one of the input tokens
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
# Creating a dummy input
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
dummy_input = [tokens_tensor, segments_tensors]
# Initializing the model with the torchscript flag
# Flag set to True even though it is not necessary as this model does not have an LM Head.
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, torchscript=True)
# Instantiating the model
model = BertModel(config)
# The model needs to be in evaluation mode
model.eval()
# Creating the trace
traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
torch.jit.save(traced_model, "traced_bert.pt")
Loading a model
------------------------------------------------
This snippet shows how to load the ``BertModel`` that was previously saved to disk under the name ``traced_bert.pt``.
We are re-using the previously initialised ``dummy_input``.
.. code-block:: python
loaded_model = torch.jit.load("traced_model.pt")
loaded_model.eval()
all_encoder_layers, pooled_output = loaded_model(*dummy_input)
Using a traced model for inference
------------------------------------------------
Using the traced model for inference is as simple as using its ``__call__`` dunder method:
.. code-block:: python
traced_model(tokens_tensor, segments_tensors)
TPU
================================================
TPU support and pretraining scripts
------------------------------------------------
TPUs are not supported by the current stable release of PyTorch (0.4.1). However, the next version of PyTorch (v1.0) should support training on TPU and is expected to be released soon (see the recent `official announcement <https://cloud.google.com/blog/products/ai-machine-learning/introducing-pytorch-across-google-cloud>`_\ ).
We will add TPU support when this next release is published.
The original TensorFlow code further comprises two scripts for pre-training BERT: `create_pretraining_data.py <https://github.com/google-research/bert/blob/master/create_pretraining_data.py>`_ and `run_pretraining.py <https://github.com/google-research/bert/blob/master/run_pretraining.py>`_.
Since pre-training BERT is a particularly expensive operation that basically requires one or several TPUs to be completed in a reasonable amount of time (see details `here <https://github.com/google-research/bert#pre-training-with-bert>`_\ ), we have decided to wait for the inclusion of TPU support in PyTorch to convert these pre-training scripts.
Usage
================================================
BERT
^^^^
Here is a quick-start example using ``BertTokenizer``\ , ``BertModel`` and ``BertForMaskedLM`` class with Google AI's pre-trained ``Bert base uncased`` model. See the `doc section <#doc>`_ below for all the details on these classes.
First let's prepare a tokenized input with ``BertTokenizer``
.. code-block:: python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenized input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)
# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
Let's see how to use ``BertModel`` to get hidden states
.. code-block:: python
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')
# Predict hidden states features for each layer
with torch.no_grad():
encoded_layers, _ = model(tokens_tensor, segments_tensors)
# We have a hidden state for each of the 12 layers in model bert-base-uncased
assert len(encoded_layers) == 12
And how to use ``BertForMaskedLM``
.. code-block:: python
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')
# Predict all tokens
with torch.no_grad():
predictions = model(tokens_tensor, segments_tensors)
# confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'henson'
OpenAI GPT
^^^^^^^^^^
Here is a quick-start example using ``OpenAIGPTTokenizer``\ , ``OpenAIGPTModel`` and ``OpenAIGPTLMHeadModel`` class with OpenAI's pre-trained model. See the `doc section <#doc>`_ below for all the details on these classes.
First let's prepare a tokenized input with ``OpenAIGPTTokenizer``
.. code-block:: python
import torch
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel, OpenAIGPTLMHeadModel
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)
# Load pre-trained model tokenizer (vocabulary)
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
# Tokenized input
text = "Who was Jim Henson ? Jim Henson was a puppeteer"
tokenized_text = tokenizer.tokenize(text)
# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
Let's see how to use ``OpenAIGPTModel`` to get hidden states
.. code-block:: python
# Load pre-trained model (weights)
model = OpenAIGPTModel.from_pretrained('openai-gpt')
model.eval()
# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')
# Predict hidden states features for each layer
with torch.no_grad():
hidden_states = model(tokens_tensor)
And how to use ``OpenAIGPTLMHeadModel``
.. code-block:: python
# Load pre-trained model (weights)
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
model.eval()
# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')
# Predict all tokens
with torch.no_grad():
predictions = model(tokens_tensor)
# get the predicted last token
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == '.</w>'
And how to use ``OpenAIGPTDoubleHeadsModel``
.. code-block:: python
# Load pre-trained model (weights)
model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')
model.eval()
# Prepare tokenized input
text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
tokenized_text1 = tokenizer.tokenize(text1)
tokenized_text2 = tokenizer.tokenize(text2)
indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
# Note: this simple batching assumes both sequences encode to the same number of tokens
tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])
# Predict hidden states features for each layer
with torch.no_grad():
lm_logits, multiple_choice_logits = model(tokens_tensor, mc_token_ids)
Transformer-XL
^^^^^^^^^^^^^^
Here is a quick-start example using the ``TransfoXLTokenizer``\ , ``TransfoXLModel`` and ``TransfoXLLMHeadModel`` classes with the Transformer-XL model pre-trained on WikiText-103. See the `doc section <#doc>`_ below for all the details on these classes.
First let's prepare a tokenized input with ``TransfoXLTokenizer``
.. code-block:: python
import torch
from pytorch_pretrained_bert import TransfoXLTokenizer, TransfoXLModel, TransfoXLLMHeadModel
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)
# Load pre-trained model tokenizer (vocabulary from wikitext 103)
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
# Tokenized input
text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"
tokenized_text_1 = tokenizer.tokenize(text_1)
tokenized_text_2 = tokenizer.tokenize(text_2)
# Convert token to vocabulary indices
indexed_tokens_1 = tokenizer.convert_tokens_to_ids(tokenized_text_1)
indexed_tokens_2 = tokenizer.convert_tokens_to_ids(tokenized_text_2)
# Convert inputs to PyTorch tensors
tokens_tensor_1 = torch.tensor([indexed_tokens_1])
tokens_tensor_2 = torch.tensor([indexed_tokens_2])
Let's see how to use ``TransfoXLModel`` to get hidden states
.. code-block:: python
# Load pre-trained model (weights)
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
model.eval()
# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')
with torch.no_grad():
# Predict hidden states features for each layer
hidden_states_1, mems_1 = model(tokens_tensor_1)
# We can re-use the memory cells in a subsequent call to attend a longer context
hidden_states_2, mems_2 = model(tokens_tensor_2, mems=mems_1)
And how to use ``TransfoXLLMHeadModel``
.. code-block:: python
# Load pre-trained model (weights)
model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')
model.eval()
# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')
with torch.no_grad():
# Predict all tokens
predictions_1, mems_1 = model(tokens_tensor_1)
# We can re-use the memory cells in a subsequent call to attend a longer context
predictions_2, mems_2 = model(tokens_tensor_2, mems=mems_1)
# get the predicted last token
predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'who'
OpenAI GPT-2
^^^^^^^^^^^^
Here is a quick-start example using ``GPT2Tokenizer``\ , ``GPT2Model`` and ``GPT2LMHeadModel`` class with OpenAI's pre-trained model. See the `doc section <#doc>`_ below for all the details on these classes.
First let's prepare a tokenized input with ``GPT2Tokenizer``
.. code-block:: python
import torch
from pytorch_pretrained_bert import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)
# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# Encode some inputs
text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"
indexed_tokens_1 = tokenizer.encode(text_1)
indexed_tokens_2 = tokenizer.encode(text_2)
# Convert inputs to PyTorch tensors
tokens_tensor_1 = torch.tensor([indexed_tokens_1])
tokens_tensor_2 = torch.tensor([indexed_tokens_2])
Let's see how to use ``GPT2Model`` to get hidden states
.. code-block:: python
# Load pre-trained model (weights)
model = GPT2Model.from_pretrained('gpt2')
model.eval()
# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')
# Predict hidden states features for each layer
with torch.no_grad():
hidden_states_1, past = model(tokens_tensor_1)
# past can be used to reuse precomputed hidden states in subsequent predictions
# (see beam-search examples in the run_gpt2.py example).
hidden_states_2, past = model(tokens_tensor_2, past=past)
And how to use ``GPT2LMHeadModel``
.. code-block:: python
# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')
# Predict all tokens
with torch.no_grad():
predictions_1, past = model(tokens_tensor_1)
# past can be used to reuse precomputed hidden states in subsequent predictions
# (see beam-search examples in the run_gpt2.py example).
predictions_2, past = model(tokens_tensor_2, past=past)
# get the predicted last token
predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
predicted_token = tokenizer.decode([predicted_index])
And how to use ``GPT2DoubleHeadsModel``
.. code-block:: python
# Load pre-trained model (weights)
model = GPT2DoubleHeadsModel.from_pretrained('gpt2')
model.eval()
# Prepare tokenized input
text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
tokenized_text1 = tokenizer.tokenize(text1)
tokenized_text2 = tokenizer.tokenize(text2)
indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
# Note: both choices must contain the same number of tokens (pad the shorter one in practice)
# so that they can be stacked into a tensor of shape [batch_size, num_choices, sequence_length]
tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])
# Predict the language modeling and multiple choice logits
with torch.no_grad():
lm_logits, multiple_choice_logits, past = model(tokens_tensor, mc_token_ids)
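The multiple-choice head returns one score per candidate, so the preferred continuation can be read off with an argmax. This short follow-up is illustrative only and assumes the tensors prepared above:
.. code-block:: python

    # multiple_choice_logits has shape [batch_size, num_choices]
    best_choice = torch.argmax(multiple_choice_logits, dim=-1).item()
    print("Preferred continuation:", [text1, text2][best_choice])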
......@@ -150,27 +150,11 @@ ACT2FN = {"gelu": gelu, "relu": torch.nn.functional.relu, "swish": swish}
class BertConfig(PretrainedConfig):
"""Configuration class to store the configuration of a `BertModel`.
"""
pretrained_config_archive_map = PRETRAINED_CONFIG_ARCHIVE_MAP
r"""
:class:`~pytorch_pretrained_bert.BertConfig` is the configuration class to store the configuration of a
`BertModel`.
def __init__(self,
vocab_size_or_config_json_file=30522,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
type_vocab_size=2,
initializer_range=0.02,
layer_norm_eps=1e-12,
**kwargs):
"""Constructs BertConfig.
Args:
Arguments:
vocab_size_or_config_json_file: Vocabulary size of `input_ids` in `BertModel`.
hidden_size: Size of the encoder layers and the pooler layer.
num_hidden_layers: Number of hidden layers in the Transformer encoder.
......@@ -192,6 +176,24 @@ class BertConfig(PretrainedConfig):
initializer_range: The stddev of the truncated_normal_initializer for
initializing all weight matrices.
layer_norm_eps: The epsilon used by LayerNorm.
"""
pretrained_config_archive_map = PRETRAINED_CONFIG_ARCHIVE_MAP
def __init__(self,
vocab_size_or_config_json_file=30522,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
type_vocab_size=2,
initializer_range=0.02,
layer_norm_eps=1e-12,
**kwargs):
"""Constructs BertConfig.
"""
super(BertConfig, self).__init__(**kwargs)
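# `vocab_size_or_config_json_file` is either an integer vocabulary size or the path to a
# JSON configuration file; the string check below handles the second case.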
if isinstance(vocab_size_or_config_json_file, str) or (sys.version_info[0] == 2
......@@ -565,53 +567,25 @@ class BertPreTrainedModel(PreTrainedModel):
class BertModel(BertPreTrainedModel):
"""BERT model ("Bidirectional Embedding Representations from a Transformer").
r"""BERT model ("Bidirectional Embedding Representations from a Transformer").
Params:
`config`: a BertConfig class instance with the configuration to build a new model
`output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
`output_hidden_states`: If True, also output hidden states computed by the model at each layer. Default: False
:class:`~pytorch_pretrained_bert.BertModel` is the basic BERT Transformer model with a layer of summed token, \
position and segment embeddings followed by a series of identical self-attention blocks (12 for BERT-base, 24 \
for BERT-large). The model is instantiated with the following parameters.
Inputs:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`.
`head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Outputs: Tuple of (encoded_layers, pooled_output)
`encoded_layers`: controled by `output_all_encoded_layers` argument:
- `output_all_encoded_layers=True`: outputs a list of the full sequences of encoded-hidden-states at the end
of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each
encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
- `output_all_encoded_layers=False`: outputs only the full sequence of hidden-states corresponding
to the last attention block of shape [batch_size, sequence_length, hidden_size],
`pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a
classifier pretrained on top of the hidden state associated to the first character of the
input (`CLS`) to train on the Next-Sentence task (see BERT's paper).
Arguments:
config: a BertConfig class instance with the configuration to build a new model
output_attentions: If True, also output attentions weights computed by the model at each layer. Default: False
output_hidden_states: If True, also output hidden states computed by the model at each layer. Default: False
Example::
Example usage:
```python
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
config = modeling.BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
config = modeling.BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
model = modeling.BertModel(config=config)
model = modeling.BertModel(config=config)
all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
```
"""
def __init__(self, config):
super(BertModel, self).__init__(config)
......@@ -631,6 +605,58 @@ class BertModel(BertPreTrainedModel):
self.encoder.layer[layer].attention.prune_heads(heads)
def forward(self, input_ids, token_type_ids=None, attention_mask=None, head_mask=None):
"""
Performs a model forward pass. Once the model has been instantiated, it is invoked by calling the instance directly rather than this method.
Arguments:
input_ids: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the \
vocabulary (see the tokens pre-processing logic in the scripts `run_bert_extract_features.py`, \
`run_bert_classifier.py` and `run_bert_squad.py`)
token_type_ids: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token \
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to \
a `sentence B` token (see BERT paper for more details).
attention_mask: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices \
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max \
input sequence length in the current batch. It's the mask that we typically use for attention when \
a batch has varying length sentences.
output_all_encoded_layers: boolean which controls the content of the `encoded_layers` output as described \
below. Default: `True`.
head_mask: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 \
and 1. It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 \
=> head is not masked.
Returns:
A tuple composed of (encoded_layers, pooled_output). Encoded layers are controlled by the \
``output_all_encoded_layers`` argument.
If ``output_all_encoded_layers`` is set to True, outputs a list of the full sequences of \
encoded-hidden-states at the end of each attention \
block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each encoded-hidden-state is a\
torch.FloatTensor of size [batch_size, sequence_length, hidden_size].
If set to False, outputs only the full sequence of hidden-states corresponding \
to the last attention block of shape [batch_size, sequence_length, hidden_size].
``pooled_output`` is a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a \
classifier pretrained on top of the hidden state associated with the first token of the \
input (`CLS`) to train on the Next-Sentence task (see BERT's paper).
Example::
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
# or
all_encoder_layers, pooled_output = model.forward(input_ids, token_type_ids, input_mask)
"""
if attention_mask is None:
attention_mask = torch.ones_like(input_ids)
if token_type_ids is None:
......@@ -683,53 +709,17 @@ class BertForPreTraining(BertPreTrainedModel):
- the masked language modeling head, and
- the next sentence classification head.
Params:
Args:
`config`: a BertConfig class instance with the configuration to build a new model
`output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
`output_hidden_states`: If True, also output hidden states computed by the model at each layer. Default: False
Inputs:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`masked_lm_labels`: optional masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length]
with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss
is only computed for the labels set in [0, ..., vocab_size]
`next_sentence_label`: optional next sentence classification loss: torch.LongTensor of shape [batch_size]
with indices selected in [0, 1].
0 => next sentence is the continuation, 1 => next sentence is a random sentence.
`head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Outputs:
if `masked_lm_labels` and `next_sentence_label` are not `None`:
Outputs the total_loss which is the sum of the masked language modeling loss and the next
sentence classification loss.
if `masked_lm_labels` or `next_sentence_label` is `None`:
Outputs a tuple comprising
- the masked language modeling logits of shape [batch_size, sequence_length, vocab_size], and
- the next sentence classification logits of shape [batch_size, 2].
Example usage:
```python
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
model = BertForPreTraining(config)
masked_lm_logits_scores, seq_relationship_logits = model(input_ids, token_type_ids, input_mask)
```
Example ::
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
model = BertForPreTraining(config)
"""
def __init__(self, config):
super(BertForPreTraining, self).__init__(config)
......@@ -741,6 +731,56 @@ class BertForPreTraining(BertPreTrainedModel):
def forward(self, input_ids, token_type_ids=None, attention_mask=None, masked_lm_labels=None,
next_sentence_label=None, head_mask=None):
"""
Performs a model forward pass. Once the model has been instantiated, it is invoked by calling the instance directly rather than this method.
Args:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary (see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`masked_lm_labels`: optional masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length]
with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss
is only computed for the labels set in [0, ..., vocab_size]
`next_sentence_label`: optional next sentence classification loss: torch.LongTensor of shape [batch_size]
with indices selected in [0, 1].
0 => next sentence is the continuation, 1 => next sentence is a random sentence.
`head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Returns:
Either a torch.Tensor or tuple(torch.Tensor, torch.Tensor).
if ``masked_lm_labels`` and ``next_sentence_label`` are not ``None``, outputs the total_loss which is the \
sum of the masked language modeling loss and the next \
sentence classification loss.
if ``masked_lm_labels`` or ``next_sentence_label`` is ``None``, outputs a tuple comprising:
- the masked language modeling logits of shape [batch_size, sequence_length, vocab_size], and
- the next sentence classification logits of shape [batch_size, 2].
Example ::
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
model = BertForPreTraining(config)
masked_lm_logits_scores, seq_relationship_logits = model(input_ids, token_type_ids, input_mask)
# or
masked_lm_logits_scores, seq_relationship_logits = model.forward(input_ids, token_type_ids, input_mask)
"""
outputs = self.bert(input_ids, token_type_ids, attention_mask, head_mask=head_mask)
sequence_output, pooled_output = outputs[:2]
......@@ -762,51 +802,17 @@ class BertForMaskedLM(BertPreTrainedModel):
"""BERT model with the masked language modeling head.
This module comprises the BERT model followed by the masked language modeling head.
Params:
Args:
`config`: a BertConfig class instance with the configuration to build a new model
`output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
`output_hidden_states`: If True, also output hidden states computed by the model at each layer. Default: False
Inputs:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`masked_lm_labels`: masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length]
with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss
is only computed for the labels set in [0, ..., vocab_size]
`head_mask`: an optional torch.LongTensor of shape [num_heads] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Outputs:
if `masked_lm_labels` is not `None`:
Outputs the masked language modeling loss.
if `masked_lm_labels` is `None`:
Outputs the masked language modeling logits of shape [batch_size, sequence_length, vocab_size].
Example usage:
```python
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
model = BertForMaskedLM(config)
masked_lm_logits_scores = model(input_ids, token_type_ids, input_mask)
```
Example::
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
model = BertForMaskedLM(config)
"""
def __init__(self, config):
super(BertForMaskedLM, self).__init__(config)
......@@ -817,6 +823,45 @@ class BertForMaskedLM(BertPreTrainedModel):
self.apply(self.init_weights)
def forward(self, input_ids, token_type_ids=None, attention_mask=None, masked_lm_labels=None, head_mask=None):
"""
Performs a model forward pass. Once the model has been instantiated, it is invoked by calling the instance directly rather than this method.
Args:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary (see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`masked_lm_labels`: masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length]
with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss
is only computed for the labels set in [0, ..., vocab_size]
`head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Returns:
Masked language modeling loss if `masked_lm_labels` is specified, masked language modeling
logits of shape [batch_size, sequence_length, vocab_size] otherwise.
Example::
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
masked_lm_logits_scores = model(input_ids, token_type_ids, input_mask)
# or
masked_lm_logits_scores = model.forward(input_ids, token_type_ids, input_mask)
"""
outputs = self.bert(input_ids, token_type_ids, attention_mask, head_mask=head_mask)
sequence_output = outputs[0]
......@@ -835,48 +880,17 @@ class BertForNextSentencePrediction(BertPreTrainedModel):
"""BERT model with next sentence prediction head.
This module comprises the BERT model followed by the next sentence classification head.
Params:
Args:
`config`: a BertConfig class instance with the configuration to build a new model
`output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
`output_hidden_states`: If True, also output hidden states computed by the model at each layer. Default: False
Inputs:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`next_sentence_label`: next sentence classification loss: torch.LongTensor of shape [batch_size]
with indices selected in [0, 1].
0 => next sentence is the continuation, 1 => next sentence is a random sentence.
`head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Outputs:
if `next_sentence_label` is not `None`:
Outputs the total_loss which is the sum of the masked language modeling loss and the next
sentence classification loss.
if `next_sentence_label` is `None`:
Outputs the next sentence classification logits of shape [batch_size, 2].
Example usage:
```python
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
model = BertForNextSentencePrediction(config)
seq_relationship_logits = model(input_ids, token_type_ids, input_mask)
```
Example::
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
model = BertForNextSentencePrediction(config)
"""
def __init__(self, config):
super(BertForNextSentencePrediction, self).__init__(config)
......@@ -887,6 +901,44 @@ class BertForNextSentencePrediction(BertPreTrainedModel):
self.apply(self.init_weights)
def forward(self, input_ids, token_type_ids=None, attention_mask=None, next_sentence_label=None, head_mask=None):
"""
Performs a model forward pass. Once the model has been instantiated, it is invoked by calling the instance directly rather than this method.
Args:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary (see the tokens pre-processing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`next_sentence_label`: next sentence classification loss: torch.LongTensor of shape [batch_size]
with indices selected in [0, 1].
0 => next sentence is the continuation, 1 => next sentence is a random sentence.
`head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between
0 and 1. It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked,
0.0 => head is not masked.
Returns:
If `next_sentence_label` is specified, outputs the total_loss which is the sum of the masked language \
modeling loss and the next sentence classification loss.
if `next_sentence_label` is `None`, outputs the next sentence classification logits of shape [batch_size, 2].
Example::
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
seq_relationship_logits = model(input_ids, token_type_ids, input_mask)
# or
seq_relationship_logits = model.forward(input_ids, token_type_ids, input_mask)
"""
outputs = self.bert(input_ids, token_type_ids, attention_mask, head_mask=head_mask)
pooled_output = outputs[1]
......@@ -912,43 +964,14 @@ class BertForSequenceClassification(BertPreTrainedModel):
`output_hidden_states`: If True, also output hidden states computed by the model at each layer. Default: False
`num_labels`: the number of classes for the classifier. Default = 2.
Inputs:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary. Items in the batch should begin with the special "CLS" token. (see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`labels`: labels for the classification output: torch.LongTensor of shape [batch_size]
with indices selected in [0, ..., num_labels].
`head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Outputs:
if `labels` is not `None`:
Outputs the CrossEntropy classification loss of the output with the labels.
if `labels` is `None`:
Outputs the classification logits of shape [batch_size, num_labels].
Example usage:
```python
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
num_labels = 2
model = BertForSequenceClassification(config, num_labels)
logits = model(input_ids, token_type_ids, input_mask)
```
Example::
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
num_labels = 2
model = BertForSequenceClassification(config, num_labels)
"""
def __init__(self, config):
super(BertForSequenceClassification, self).__init__(config)
......@@ -961,6 +984,40 @@ class BertForSequenceClassification(BertPreTrainedModel):
self.apply(self.init_weights)
def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None, head_mask=None):
"""
Performs a model forward pass. Once the model has been instantiated, it is invoked by calling the instance directly rather than this method.
Parameters:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary. Items in the batch should begin with the special "[CLS]" token (see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`labels`: labels for the classification output: torch.LongTensor of shape [batch_size]
with indices selected in [0, ..., num_labels].
`head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Returns:
if `labels` is not `None`, outputs the CrossEntropy classification loss of the output with the labels.
if `labels` is `None`, outputs the classification logits of shape `[batch_size, num_labels]`.
Example::
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
logits = model(input_ids, token_type_ids, input_mask)
# or
logits = model.forward(input_ids, token_type_ids, input_mask)
"""
outputs = self.bert(input_ids, token_type_ids, attention_mask, head_mask=head_mask)
pooled_output = outputs[1]
......@@ -984,48 +1041,24 @@ class BertForSequenceClassification(BertPreTrainedModel):
class BertForMultipleChoice(BertPreTrainedModel):
"""BERT model for multiple choice tasks.
This module is composed of the BERT model with a linear layer on top of
the pooled output.
This module is composed of the BERT model with a linear layer on top of the pooled output.
Params:
Parameters:
`config`: a BertConfig class instance with the configuration to build a new model
`output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
`output_hidden_states`: If True, also output hidden states computed by the model at each layer. Default: False
Inputs:
`input_ids`: a torch.LongTensor of shape [batch_size, num_choices, sequence_length]
with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, num_choices, sequence_length]
with the token types indices selected in [0, 1]. Type 0 corresponds to a `sentence A`
and type 1 corresponds to a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, num_choices, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`labels`: labels for the classification output: torch.LongTensor of shape [batch_size]
with indices selected in [0, ..., num_choices].
`head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Outputs:
if `labels` is not `None`:
Outputs the CrossEntropy classification loss of the output with the labels.
if `labels` is `None`:
Outputs the classification logits of shape [batch_size, num_labels].
Example usage:
```python
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[[31, 51, 99], [15, 5, 0]], [[12, 16, 42], [14, 28, 57]]])
input_mask = torch.LongTensor([[[1, 1, 1], [1, 1, 0]],[[1,1,0], [1, 0, 0]]])
token_type_ids = torch.LongTensor([[[0, 0, 1], [0, 1, 0]],[[0, 1, 1], [0, 0, 1]]])
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
model = BertForMultipleChoice(config)
logits = model(input_ids, token_type_ids, input_mask)
```
Example::
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[[31, 51, 99], [15, 5, 0]], [[12, 16, 42], [14, 28, 57]]])
input_mask = torch.LongTensor([[[1, 1, 1], [1, 1, 0]],[[1,1,0], [1, 0, 0]]])
token_type_ids = torch.LongTensor([[[0, 0, 1], [0, 1, 0]],[[0, 1, 1], [0, 0, 1]]])
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
model = BertForMultipleChoice(config)
logits = model(input_ids, token_type_ids, input_mask)
"""
def __init__(self, config):
super(BertForMultipleChoice, self).__init__(config)
......@@ -1037,6 +1070,41 @@ class BertForMultipleChoice(BertPreTrainedModel):
self.apply(self.init_weights)
def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None, head_mask=None):
"""
Performs a model forward pass. Once the model has been instantiated, it is invoked by calling the instance directly rather than this method.
Parameters:
`input_ids`: a torch.LongTensor of shape [batch_size, num_choices, sequence_length]
with the word token indices in the vocabulary (see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, num_choices, sequence_length]
with the token types indices selected in [0, 1]. Type 0 corresponds to a `sentence A`
and type 1 corresponds to a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, num_choices, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`labels`: labels for the classification output: torch.LongTensor of shape [batch_size]
with indices selected in [0, ..., num_choices].
`head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Returns:
if `labels` is not `None`, outputs the CrossEntropy classification loss of the output with the labels.
if `labels` is `None`, outputs the classification logits of shape [batch_size, num_choices].
Example::
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[[31, 51, 99], [15, 5, 0]], [[12, 16, 42], [14, 28, 57]]])
input_mask = torch.LongTensor([[[1, 1, 1], [1, 1, 0]],[[1,1,0], [1, 0, 0]]])
token_type_ids = torch.LongTensor([[[0, 0, 1], [0, 1, 0]],[[0, 1, 1], [0, 0, 1]]])
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
model = BertForMultipleChoice(config)
logits = model(input_ids, token_type_ids, input_mask)
"""
""" Input shapes should be [bsz, num choices, seq length] """
num_choices = input_ids.shape[1]
......@@ -1065,49 +1133,20 @@ class BertForTokenClassification(BertPreTrainedModel):
This module is composed of the BERT model with a linear layer on top of
the full hidden state of the last layer.
Params:
Parameters:
`config`: a BertConfig class instance with the configuration to build a new model
`output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
`output_hidden_states`: If True, also output hidden states computed by the model at each layer. Default: False
`num_labels`: the number of classes for the classifier. Default = 2.
Inputs:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`labels`: labels for the classification output: torch.LongTensor of shape [batch_size, sequence_length]
with indices selected in [0, ..., num_labels].
`head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Outputs:
if `labels` is not `None`:
Outputs the CrossEntropy classification loss of the output with the labels.
if `labels` is `None`:
Outputs the classification logits of shape [batch_size, sequence_length, num_labels].
Example usage:
```python
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
num_labels = 2
model = BertForTokenClassification(config, num_labels)
logits = model(input_ids, token_type_ids, input_mask)
```
Example::
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
num_labels = 2
model = BertForTokenClassification(config, num_labels)
"""
def __init__(self, config):
super(BertForTokenClassification, self).__init__(config)
......@@ -1120,6 +1159,40 @@ class BertForTokenClassification(BertPreTrainedModel):
self.apply(self.init_weights)
def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None, head_mask=None):
"""
Performs a model forward pass. Once the model has been instantiated, it is invoked by calling the instance directly rather than this method.
Parameters:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary (see the tokens pre-processing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`labels`: labels for the classification output: torch.LongTensor of shape [batch_size, sequence_length]
with indices selected in [0, ..., num_labels].
`head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Returns:
if `labels` is not `None`, outputs the CrossEntropy classification loss of the output with the labels.
if `labels` is `None`, outputs the classification logits of shape [batch_size, sequence_length, num_labels].
Example::
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
logits = model(input_ids, token_type_ids, input_mask)
# or
logits = model.forward(input_ids, token_type_ids, input_mask)
"""
outputs = self.bert(input_ids, token_type_ids, attention_mask, head_mask=head_mask)
sequence_output = outputs[0]
......@@ -1147,51 +1220,17 @@ class BertForQuestionAnswering(BertPreTrainedModel):
This module is composed of the BERT model with a linear layer on top of
the sequence output that computes start_logits and end_logits
Params:
Parameters:
`config`: a BertConfig class instance with the configuration to build a new model
`output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
`output_hidden_states`: If True, also output hidden states computed by the model at each layer. Default: False
Inputs:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`start_positions`: position of the first token for the labeled span: torch.LongTensor of shape [batch_size].
Positions are clamped to the length of the sequence and position outside of the sequence are not taken
into account for computing the loss.
`end_positions`: position of the last token for the labeled span: torch.LongTensor of shape [batch_size].
Positions are clamped to the length of the sequence and position outside of the sequence are not taken
into account for computing the loss.
`head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Outputs:
if `start_positions` and `end_positions` are not `None`:
Outputs the total_loss which is the sum of the CrossEntropy loss for the start and end token positions.
if `start_positions` or `end_positions` is `None`:
Outputs a tuple of start_logits, end_logits which are the logits respectively for the start and end
position tokens of shape [batch_size, sequence_length].
Example usage:
```python
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
model = BertForQuestionAnswering(config)
start_logits, end_logits = model(input_ids, token_type_ids, input_mask)
```
Example::
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
model = BertForQuestionAnswering(config)
"""
def __init__(self, config):
super(BertForQuestionAnswering, self).__init__(config)
......@@ -1204,6 +1243,42 @@ class BertForQuestionAnswering(BertPreTrainedModel):
def forward(self, input_ids, token_type_ids=None, attention_mask=None, start_positions=None,
end_positions=None, head_mask=None):
"""
Parameters:
`input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
with the word token indices in the vocabulary (see the tokens preprocessing logic in the scripts
`run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
`token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
a `sentence B` token (see BERT paper for more details).
`attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
input sequence length in the current batch. It's the mask that we typically use for attention when
a batch has varying length sentences.
`start_positions`: position of the first token for the labeled span: torch.LongTensor of shape [batch_size].
Positions are clamped to the length of the sequence and position outside of the sequence are not taken
into account for computing the loss.
`end_positions`: position of the last token for the labeled span: torch.LongTensor of shape [batch_size].
Positions are clamped to the length of the sequence and position outside of the sequence are not taken
into account for computing the loss.
`head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
Returns:
if `start_positions` and `end_positions` are not `None`, outputs the total_loss which is the sum of the
CrossEntropy loss for the start and end token positions.
if `start_positions` or `end_positions` is `None`, outputs a tuple of start_logits, end_logits which are the
logits respectively for the start and end position tokens of shape [batch_size, sequence_length].
Example::
# Already been converted into WordPiece token ids
input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
start_logits, end_logits = model(input_ids, token_type_ids, input_mask)
"""
outputs = self.bert(input_ids, token_type_ids, attention_mask, head_mask=head_mask)
sequence_output = outputs[0]
......
......@@ -182,7 +182,8 @@ SCHEDULES = {
class BertAdam(Optimizer):
"""Implements BERT version of Adam algorithm with weight decay fix.
Params:
Parameters:
lr: learning rate
warmup: portion of t_total for the warmup, -1 means no warmup. Default: -1
t_total: total number of training steps for the learning
......
......@@ -84,24 +84,22 @@ def whitespace_tokenize(text):
class BertTokenizer(object):
"""Runs end-to-end tokenization: punctuation splitting + wordpiece"""
r"""
Constructs a BertTokenizer.
:class:`~pytorch_pretrained_bert.BertTokenizer` runs end-to-end tokenization: punctuation splitting + wordpiece
Args:
vocab_file: Path to a one-wordpiece-per-line vocabulary file
do_lower_case: Whether to lower case the input. Only has an effect when do_wordpiece_only=False
do_basic_tokenize: Whether to do basic tokenization before wordpiece.
max_len: An artificial maximum length to truncate tokenized sequences to; the effective maximum length is always the
minimum of this value (if specified) and the underlying BERT model's sequence length.
never_split: List of tokens which will never be split during tokenization. Only has an effect when
do_wordpiece_only=False
"""
def __init__(self, vocab_file, do_lower_case=True, max_len=None, do_basic_tokenize=True,
never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")):
"""Constructs a BertTokenizer.
Args:
vocab_file: Path to a one-wordpiece-per-line vocabulary file
do_lower_case: Whether to lower case the input
Only has an effect when do_wordpiece_only=False
do_basic_tokenize: Whether to do basic tokenization before wordpiece.
max_len: An artificial maximum length to truncate tokenized sequences to;
Effective maximum length is always the minimum of this
value (if specified) and the underlying BERT model's
sequence length.
never_split: List of tokens which will never be split during tokenization.
Only has an effect when do_wordpiece_only=False
"""
if not os.path.isfile(vocab_file):
raise ValueError(
"Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained "
......