Commit 03de9686 authored by LysandreJik
Initial folder structure for the documentation. A draft of documentation change has been made in the BertModel class.
parent e75c3f70
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SOURCEDIR = source
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
# Generating the documentation
To generate the documentation, you first have to build it. Building it requires the package `sphinx` that you can
install using:
```bash
pip install -U sphinx
```
You will also need the custom [theme](https://github.com/readthedocs/sphinx_rtd_theme) from
[Read The Docs](https://readthedocs.org/). You can install it using the following command:
```bash
pip install sphinx_rtd_theme
```
Once you have set up `sphinx`, you can build the documentation by running the following command in the `/docs` folder:
```bash
make html
```
This builds the static site, which will be available under `/docs/_build/html`.
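To preview the generated pages locally, one convenient option (assuming Python 3 is available; any static file server works) is:
```bash
cd _build/html
python -m http.server 8000
# then open http://localhost:8000 in a browser
```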
CLI
================================================
A command-line interface is provided to convert a TensorFlow checkpoint into a PyTorch dump of the ``BertForPreTraining`` class (for BERT) or a NumPy checkpoint into a PyTorch dump of the ``OpenAIGPTModel`` class (for OpenAI GPT).
BERT
^^^^
You can convert any TensorFlow checkpoint for BERT (in particular `the pre-trained models released by Google <https://github.com/google-research/bert#pre-trained-models>`_\ ) into a PyTorch save file by using the `convert_tf_checkpoint_to_pytorch.py <./pytorch_pretrained_bert/convert_tf_checkpoint_to_pytorch.py>`_ script.
This CLI takes as input a TensorFlow checkpoint (three files starting with ``bert_model.ckpt``\ ) and the associated configuration file (\ ``bert_config.json``\ ), creates a PyTorch model for this configuration, loads the weights from the TensorFlow checkpoint into the PyTorch model and saves the resulting model in a standard PyTorch save file that can be imported using ``torch.load()`` (see examples in `run_bert_extract_features.py <./examples/run_bert_extract_features.py>`_\ , `run_bert_classifier.py <./examples/run_bert_classifier.py>`_ and `run_bert_squad.py <./examples/run_bert_squad.py>`_\ ).
You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow checkpoint (the three files starting with ``bert_model.ckpt``\ ) but be sure to keep the configuration file (\ ``bert_config.json``\ ) and the vocabulary file (\ ``vocab.txt``\ ) as these are needed for the PyTorch model too.
To run this specific conversion script, you will need to have TensorFlow and PyTorch installed (\ ``pip install tensorflow``\ ). The rest of the repository only requires PyTorch.
Here is an example of the conversion process for a pre-trained ``BERT-Base Uncased`` model:
.. code-block:: shell
export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
pytorch_pretrained_bert bert \
$BERT_BASE_DIR/bert_model.ckpt \
$BERT_BASE_DIR/bert_config.json \
$BERT_BASE_DIR/pytorch_model.bin
You can download Google's pre-trained models for the conversion `here <https://github.com/google-research/bert#pre-trained-models>`__.
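As a sanity check, the resulting dump can be loaded back in Python. The snippet below is only a sketch, assuming the file names produced by the conversion above:
.. code-block:: python
import torch
from pytorch_pretrained_bert import BertConfig, BertForPreTraining
# Rebuild the model from the converted configuration, then load the weights
config = BertConfig.from_json_file("bert_config.json")
model = BertForPreTraining(config)
model.load_state_dict(torch.load("pytorch_model.bin"))
model.eval()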
OpenAI GPT
^^^^^^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint is saved in the same format as the OpenAI pre-trained model (see `here <https://github.com/openai/finetune-transformer-lm>`__\ ):
.. code-block:: shell
export OPENAI_GPT_CHECKPOINT_FOLDER_PATH=/path/to/openai/pretrained/numpy/weights
pytorch_pretrained_bert gpt \
$OPENAI_GPT_CHECKPOINT_FOLDER_PATH \
$PYTORCH_DUMP_OUTPUT \
[OPENAI_GPT_CONFIG]
Transformer-XL
^^^^^^^^^^^^^^
Here is an example of the conversion process for a pre-trained Transformer-XL model (see `here <https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models>`__\ ):
.. code-block:: shell
export TRANSFO_XL_CHECKPOINT_FOLDER_PATH=/path/to/transfo/xl/checkpoint
pytorch_pretrained_bert transfo_xl \
$TRANSFO_XL_CHECKPOINT_FOLDER_PATH \
$PYTORCH_DUMP_OUTPUT \
[TRANSFO_XL_CONFIG]
GPT-2
^^^^^
Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model.
.. code-block:: shell
export GPT2_DIR=/path/to/gpt2/checkpoint
pytorch_pretrained_bert gpt2 \
$GPT2_DIR/model.ckpt \
$PYTORCH_DUMP_OUTPUT \
[GPT2_CONFIG]
XLNet
^^^^^
Here is an example of the conversion process for a pre-trained XLNet model, fine-tuned on STS-B using the TensorFlow script:
.. code-block:: shell
export XLNET_CHECKPOINT_PATH=/path/to/xlnet/checkpoint
export XLNET_CONFIG_PATH=/path/to/xlnet/config
pytorch_pretrained_bert xlnet \
$XLNET_CHECKPOINT_PATH \
$XLNET_CONFIG_PATH \
$PYTORCH_DUMP_OUTPUT \
STS-B
# -*- coding: utf-8 -*-
#
# Configuration file for the Sphinx documentation builder.
#
# This file does only contain a selection of the most common options. For a
# full list see the documentation:
# http://www.sphinx-doc.org/en/master/config
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath('../..'))
# -- Project information -----------------------------------------------------
project = u'pytorch-transformers'
copyright = u'2019, huggingface'
author = u'huggingface'
# The short X.Y version
version = u''
# The full version, including alpha/beta/rc tags
release = u'1.0.0'
# -- General configuration ---------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.coverage',
'sphinx.ext.napoleon'
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
source_suffix = ['.rst', '.md']
# source_suffix = '.rst'
# The master toctree document.
master_doc = 'index'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = [u'_build', 'Thumbs.db', '.DS_Store']
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = None
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# The default sidebars (for documents that don't match any pattern) are
# defined by theme itself. Builtin themes are using these templates by
# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
# 'searchbox.html']``.
#
# html_sidebars = {}
# -- Options for HTMLHelp output ---------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = 'pytorch-transformersdoc'
# -- Options for LaTeX output ------------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',
# Latex figure (float) alignment
#
# 'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'pytorch-transformers.tex', u'pytorch-transformers Documentation',
u'huggingface', 'manual'),
]
# -- Options for manual page output ------------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
(master_doc, 'pytorch-transformers', u'pytorch-transformers Documentation',
[author], 1)
]
# -- Options for Texinfo output ----------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'pytorch-transformers', u'pytorch-transformers Documentation',
author, 'pytorch-transformers', 'One line description of project.',
'Miscellaneous'),
]
# -- Options for Epub output -------------------------------------------------
# Bibliographic Dublin Core info.
epub_title = project
# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''
# A unique identification for the text.
#
# epub_uid = ''
# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
# -- Extension configuration -------------------------------------------------
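# The napoleon extension enabled above parses Google- and NumPy-style
# docstrings. The commented values below are illustrative and show
# napoleon's defaults; uncomment and adjust here if needed.
#
# napoleon_google_docstring = True
# napoleon_numpy_docstring = True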
Pytorch-Transformers: The Big & Extending Repository of pretrained Transformers
================================================================================================================================================
.. toctree::
:maxdepth: 2
installation
usage
doc
examples
notebooks
tpu
cli
.. image:: https://circleci.com/gh/huggingface/pytorch-pretrained-BERT.svg?style=svg
:target: https://circleci.com/gh/huggingface/pytorch-pretrained-BERT
:alt: CircleCI
This repository contains op-for-op PyTorch reimplementations, pre-trained models and fine-tuning examples for:
* `Google's BERT model <https://github.com/google-research/bert>`_\ ,
* `OpenAI's GPT model <https://github.com/openai/finetune-transformer-lm>`_\ ,
* `Google/CMU's Transformer-XL model <https://github.com/kimiyoung/transformer-xl>`_\ , and
* `OpenAI's GPT-2 model <https://blog.openai.com/better-language-models/>`_.
These implementations have been tested on several datasets (see the examples) and should match the performance of the associated TensorFlow implementations (e.g. ~91 F1 on SQuAD for BERT, ~88 F1 on RocStories for OpenAI GPT and ~18.3 perplexity on WikiText 103 for the Transformer-XL). You can find more details in the `Examples <#examples>`_ section below.
Here is some information on these models:
**BERT** was released together with the paper `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`_ by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
This PyTorch implementation of BERT is provided with `Google's pre-trained models <https://github.com/google-research/bert>`_\ , examples, notebooks and a command-line interface to load any pre-trained TensorFlow checkpoint for BERT.
**OpenAI GPT** was released together with the paper `Improving Language Understanding by Generative Pre-Training <https://blog.openai.com/language-unsupervised/>`_ by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
This PyTorch implementation of OpenAI GPT is an adaptation of the `PyTorch implementation by HuggingFace <https://github.com/huggingface/pytorch-openai-transformer-lm>`_ and is provided with `OpenAI's pre-trained model <https://github.com/openai/finetune-transformer-lm>`_ and a command-line interface that was used to convert the pre-trained NumPy checkpoint to PyTorch.
**Google/CMU's Transformer-XL** was released together with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <http://arxiv.org/abs/1901.02860>`_ by Zihang Dai\*, Zhilin Yang\*, Yiming Yang, Jaime Carbonell, Quoc V. Le and Ruslan Salakhutdinov.
This PyTorch implementation of Transformer-XL is an adaptation of the original `PyTorch implementation <https://github.com/kimiyoung/transformer-xl>`_ which has been slightly modified to match the performance of the TensorFlow implementation and to allow re-using the pre-trained weights. A command-line interface is provided to convert TensorFlow checkpoints into PyTorch models.
**OpenAI GPT-2** was released together with the paper `Language Models are Unsupervised Multitask Learners <https://blog.openai.com/better-language-models/>`_ by Alec Radford\*, Jeffrey Wu\*, Rewon Child, David Luan, Dario Amodei\*\* and Ilya Sutskever\*\*.
This PyTorch implementation of OpenAI GPT-2 is an adaptation of `OpenAI's implementation <https://github.com/openai/gpt-2>`_ and is provided with `OpenAI's pre-trained model <https://github.com/openai/gpt-2>`_ and a command-line interface that was used to convert the TensorFlow checkpoint to PyTorch.
Content
-------
.. list-table::
:header-rows: 1
* - Section
- Description
* - `Installation <#installation>`_
- How to install the package
* - `Overview <#overview>`_
- Overview of the package
* - `Usage <#usage>`_
- Quickstart examples
* - `Doc <#doc>`_
- Detailed documentation
* - `Examples <#examples>`_
- Detailed examples on how to fine-tune Bert
* - `Notebooks <#notebooks>`_
- Introduction on the provided Jupyter Notebooks
* - `TPU <#tpu>`_
- Notes on TPU support and pretraining scripts
* - `Command-line interface <#Command-line-interface>`_
- Convert a TensorFlow checkpoint into a PyTorch dump
Overview
--------
This package comprises the following classes that can be imported in Python and are detailed in the `Doc <#doc>`_ section of this readme:
*
Eight **Bert** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py>`_ file):
* `BertModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L639>`_ - raw BERT Transformer model (\ **fully pre-trained**\ ),
* `BertForMaskedLM <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L793>`_ - BERT Transformer with the pre-trained masked language modeling head on top (\ **fully pre-trained**\ ),
* `BertForNextSentencePrediction <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L854>`_ - BERT Transformer with the pre-trained next sentence prediction classifier on top (\ **fully pre-trained**\ ),
* `BertForPreTraining <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L722>`_ - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (\ **fully pre-trained**\ ),
* `BertForSequenceClassification <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L916>`_ - BERT Transformer with a sequence classification head on top (BERT Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ),
* `BertForMultipleChoice <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L982>`_ - BERT Transformer with a multiple choice head on top (used for tasks like SWAG) (BERT Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
* `BertForTokenClassification <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L1051>`_ - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ ),
* `BertForQuestionAnswering <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L1124>`_ - BERT Transformer with a span classification head on top, computing start and end logits for SQuAD-style question answering (BERT Transformer is **pre-trained**\ , the span classification head **is only initialized and has to be trained**\ ).
*
Three **OpenAI GPT** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_openai.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_openai.py>`_ file):
* `OpenAIGPTModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_openai.py#L536>`_ - raw OpenAI GPT Transformer model (\ **fully pre-trained**\ ),
* `OpenAIGPTLMHeadModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_openai.py#L643>`_ - OpenAI GPT Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
* `OpenAIGPTDoubleHeadsModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_openai.py#L722>`_ - OpenAI GPT Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
*
Two **Transformer-XL** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_transfo_xl.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_transfo_xl.py>`_ file):
* `TransfoXLModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_transfo_xl.py#L983>`_ - Transformer-XL model which outputs the last hidden state and memory cells (\ **fully pre-trained**\ ),
* `TransfoXLLMHeadModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_transfo_xl.py#L1260>`_ - Transformer-XL with the tied adaptive softmax head on top for language modeling which outputs the logits/loss and memory cells (\ **fully pre-trained**\ ),
*
Three **OpenAI GPT-2** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_gpt2.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_gpt2.py>`_ file):
* `GPT2Model <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_gpt2.py#L479>`_ - raw OpenAI GPT-2 Transformer model (\ **fully pre-trained**\ ),
* `GPT2LMHeadModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_gpt2.py#L559>`_ - OpenAI GPT-2 Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
* `GPT2DoubleHeadsModel <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_gpt2.py#L624>`_ - OpenAI GPT-2 Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT-2 Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
*
Tokenizers for **BERT** (using word-piece) (in the `tokenization.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/tokenization.py>`_ file):
* ``BasicTokenizer`` - basic tokenization (punctuation splitting, lower casing, etc.),
* ``WordpieceTokenizer`` - WordPiece tokenization,
* ``BertTokenizer`` - perform end-to-end tokenization, i.e. basic tokenization followed by WordPiece tokenization.
*
Tokenizer for **OpenAI GPT** (using Byte-Pair-Encoding) (in the `tokenization_openai.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/tokenization_openai.py>`_ file):
* ``OpenAIGPTTokenizer`` - perform Byte-Pair-Encoding (BPE) tokenization.
*
Tokenizer for **Transformer-XL** (word tokens ordered by frequency for adaptive softmax) (in the `tokenization_transfo_xl.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/tokenization_transfo_xl.py>`_ file):
* ``TransfoXLTokenizer`` - perform word tokenization and can order words by frequency in a corpus for use in an adaptive softmax.
*
Tokenizer for **OpenAI GPT-2** (using byte-level Byte-Pair-Encoding) (in the `tokenization_gpt2.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/tokenization_gpt2.py>`_ file):
* ``GPT2Tokenizer`` - perform byte-level Byte-Pair-Encoding (BPE) tokenization.
*
Optimizer for **BERT** (in the `optimization.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/optimization.py>`_ file):
* ``BertAdam`` - BERT version of the Adam algorithm with weight decay fix, warmup and linear decay of the learning rate (see the usage sketch after this list).
*
Optimizer for **OpenAI GPT** (in the `optimization_openai.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/optimization_openai.py>`_ file):
* ``OpenAIAdam`` - OpenAI GPT version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
*
Configuration classes for BERT, OpenAI GPT and Transformer-XL (in the respective `modeling.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py>`_\ , `modeling_openai.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_openai.py>`_\ , `modeling_transfo_xl.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling_transfo_xl.py>`_ files):
* ``BertConfig`` - Configuration class to store the configuration of a ``BertModel`` with utilities to read and write from JSON configuration files.
* ``OpenAIGPTConfig`` - Configuration class to store the configuration of a ``OpenAIGPTModel`` with utilities to read and write from JSON configuration files.
* ``GPT2Config`` - Configuration class to store the configuration of a ``GPT2Model`` with utilities to read and write from JSON configuration files.
* ``TransfoXLConfig`` - Configuration class to store the configuration of a ``TransfoXLModel`` with utilities to read and write from JSON configuration files.
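As a quick illustration of how the configuration and optimizer classes above fit together, here is a minimal sketch (the hyper-parameter values are made up for the example):
.. code-block:: python
from pytorch_pretrained_bert import BertConfig, BertModel, BertAdam
# Build a configuration directly; BertConfig.from_json_file() can read the
# same settings back from a JSON configuration file
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
                    num_hidden_layers=12, num_attention_heads=12,
                    intermediate_size=3072)
print(config.to_json_string())
model = BertModel(config)
# BertAdam bundles the weight decay fix, warmup and linear decay of the
# learning rate, so no separate scheduler is needed
optimizer = BertAdam(model.parameters(), lr=5e-5, warmup=0.1, t_total=1000)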
The repository further comprises:
*
Five examples on how to use **BERT** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`_\ ):
* `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_extract_features.py>`_ - Show how to extract hidden states from an instance of ``BertModel``\ ,
* `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_classifier.py>`_ - Show how to fine-tune an instance of ``BertForSequenceClassification`` on GLUE's MRPC task,
* `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_squad.py>`_ - Show how to fine-tune an instance of ``BertForQuestionAnswering`` on SQuAD v1.0 and SQuAD v2.0 tasks.
* `run_swag.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_swag.py>`_ - Show how to fine-tune an instance of ``BertForMultipleChoice`` on the SWAG task.
* `simple_lm_finetuning.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/lm_finetuning/simple_lm_finetuning.py>`_ - Show how to fine-tune an instance of ``BertForPreTraining`` on a target text corpus.
*
One example on how to use **OpenAI GPT** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`_\ ):
* `run_openai_gpt.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_openai_gpt.py>`_ - Show how to fine-tune an instance of ``OpenAIGPTDoubleHeadsModel`` on the RocStories task.
*
One example on how to use **Transformer-XL** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`_\ ):
* `run_transfo_xl.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_transfo_xl.py>`_ - Show how to load and evaluate a pre-trained ``TransfoXLLMHeadModel`` on WikiText 103.
*
One example on how to use **OpenAI GPT-2** in the unconditional and interactive mode (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`_\ ):
* `run_gpt2.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_gpt2.py>`_ - Show how to use an instance of ``GPT2LMHeadModel`` to generate text (same as the original OpenAI GPT-2 examples).
These examples are detailed in the `Examples <#examples>`_ section of this readme.
*
Three notebooks that were used to check that the TensorFlow and PyTorch models behave identically (in the `notebooks folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks>`_\ ):
* `Comparing-TF-and-PT-models.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models.ipynb>`_ - Compare the hidden states predicted by ``BertModel``\ ,
* `Comparing-TF-and-PT-models-SQuAD.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb>`_ - Compare the spans predicted by ``BertForQuestionAnswering`` instances,
* `Comparing-TF-and-PT-models-MLM-NSP.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb>`_ - Compare the predictions of ``BertForPreTraining`` instances.
These notebooks are detailed in the `Notebooks <#notebooks>`_ section of this readme.
*
A command-line interface to convert TensorFlow checkpoints (BERT, Transformer-XL) or NumPy checkpoints (OpenAI GPT) into a PyTorch save of the associated PyTorch model:
This CLI is detailed in the `Command-line interface <#Command-line-interface>`_ section of this readme.
Installation
================================================
This repo was tested on Python 2.7 and 3.5+ (examples are tested only on Python 3.5+) and PyTorch 0.4.1/1.0.0.
With pip
^^^^^^^^
pytorch-pretrained-bert can be installed with pip as follows:
.. code-block:: bash
pip install pytorch-pretrained-bert
If you want to reproduce the original tokenization process of the ``OpenAI GPT`` paper, you will need to install ``ftfy`` (limit to version 4.4.3 if you are using Python 2) and ``SpaCy``:
.. code-block:: bash
pip install spacy ftfy==4.4.3
python -m spacy download en
If you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer will fall back to tokenizing with BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
From source
^^^^^^^^^^^
Clone the repository and run:
.. code-block:: bash
pip install [--editable] .
Here too, if you want to reproduce the original tokenization process of the ``OpenAI GPT`` model, you will need to install ``ftfy`` (limit to version 4.4.3 if you are using Python 2) and ``SpaCy``:
.. code-block:: bash
pip install spacy ftfy==4.4.3
python -m spacy download en
Again, if you don't install ``ftfy`` and ``SpaCy``\ , the ``OpenAI GPT`` tokenizer will fall back to tokenizing with BERT's ``BasicTokenizer`` followed by Byte-Pair Encoding (which should be fine for most usage).
A series of tests is included in the `tests folder <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/tests>`_ and can be run using ``pytest`` (install pytest if needed: ``pip install pytest``\ ).
You can run the tests with the command:
.. code-block:: bash
python -m pytest -sv tests/
Notebooks
================================================
We include `three Jupyter Notebooks <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks>`_ that can be used to check that the predictions of the PyTorch model are identical to the predictions of the original TensorFlow model.
*
The first notebook (\ `Comparing-TF-and-PT-models.ipynb <./notebooks/Comparing-TF-and-PT-models.ipynb>`_\ ) extracts the hidden states of a full sequence at each layer of the TensorFlow and PyTorch models and computes the standard deviation between them. In the given example, we get a standard deviation of 1.5e-7 to 9e-7 on the various hidden states of the models.
*
The second notebook (\ `Comparing-TF-and-PT-models-SQuAD.ipynb <./notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb>`_\ ) compares the loss computed by the TensorFlow and the PyTorch models for identical initialization of the fine-tuning layer of ``BertForQuestionAnswering`` and computes the standard deviation between them. In the given example, we get a standard deviation of 2.5e-7 between the models.
*
The third notebook (\ `Comparing-TF-and-PT-models-MLM-NSP.ipynb <./notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb>`_\ ) compares the predictions computed by the TensorFlow and the PyTorch models for masked token language modeling using the pre-trained masked language modeling model.
Please follow the instructions given in the notebooks to run and modify them.
TPU
================================================
TPU support and pretraining scripts
------------------------------------------------
TPUs are not supported by the current stable release of PyTorch (0.4.1). However, the next version of PyTorch (v1.0) should support training on TPU and is expected to be released soon (see the recent `official announcement <https://cloud.google.com/blog/products/ai-machine-learning/introducing-pytorch-across-google-cloud>`_\ ).
We will add TPU support when this next release is published.
The original TensorFlow code further comprises two scripts for pre-training BERT: `create_pretraining_data.py <https://github.com/google-research/bert/blob/master/create_pretraining_data.py>`_ and `run_pretraining.py <https://github.com/google-research/bert/blob/master/run_pretraining.py>`_.
Since pre-training BERT is a particularly expensive operation that basically requires one or several TPUs to be completed in a reasonable amount of time (see details `here <https://github.com/google-research/bert#pre-training-with-bert>`_\ ), we have decided to wait for the inclusion of TPU support in PyTorch to convert these pre-training scripts.
Usage
================================================
BERT
^^^^
Here is a quick-start example using the ``BertTokenizer``\ , ``BertModel`` and ``BertForMaskedLM`` classes with Google AI's pre-trained ``Bert base uncased`` model. See the `doc section <#doc>`_ below for all the details on these classes.
First let's prepare a tokenized input with ``BertTokenizer``
.. code-block:: python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenized input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)
# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
Let's see how to use ``BertModel`` to get hidden states
.. code-block:: python
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')
# Predict hidden states features for each layer
with torch.no_grad():
encoded_layers, _ = model(tokens_tensor, segments_tensors)
# We have a hidden state for each of the 12 layers in model bert-base-uncased
assert len(encoded_layers) == 12
And how to use ``BertForMaskedLM``
.. code-block:: python
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')
# Predict all tokens
with torch.no_grad():
predictions = model(tokens_tensor, segments_tensors)
# confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'henson'
OpenAI GPT
^^^^^^^^^^
Here is a quick-start example using the ``OpenAIGPTTokenizer``\ , ``OpenAIGPTModel`` and ``OpenAIGPTLMHeadModel`` classes with OpenAI's pre-trained model. See the `doc section <#doc>`_ below for all the details on these classes.
First let's prepare a tokenized input with ``OpenAIGPTTokenizer``
.. code-block:: python
import torch
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel, OpenAIGPTLMHeadModel
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)
# Load pre-trained model tokenizer (vocabulary)
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
# Tokenized input
text = "Who was Jim Henson ? Jim Henson was a puppeteer"
tokenized_text = tokenizer.tokenize(text)
# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
Let's see how to use ``OpenAIGPTModel`` to get hidden states
.. code-block:: python
# Load pre-trained model (weights)
model = OpenAIGPTModel.from_pretrained('openai-gpt')
model.eval()
# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')
# Predict hidden states features for each layer
with torch.no_grad():
hidden_states = model(tokens_tensor)
And how to use ``OpenAIGPTLMHeadModel``
.. code-block:: python
# Load pre-trained model (weights)
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
model.eval()
# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')
# Predict all tokens
with torch.no_grad():
predictions = model(tokens_tensor)
# get the predicted last token
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == '.</w>'
And how to use ``OpenAIGPTDoubleHeadsModel``
.. code-block:: python
# Load pre-trained model (weights)
model = OpenAIGPTDoubleHeadsModel.from_pretrained('openai-gpt')
model.eval()
# Prepare tokenized input
text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
tokenized_text1 = tokenizer.tokenize(text1)
tokenized_text2 = tokenizer.tokenize(text2)
indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
# The two choices may tokenize to different lengths, so pad the shorter one:
# torch.tensor needs a rectangular input, and mc_token_ids below still points
# at the last real (unpadded) token of each choice
max_len = max(len(indexed_tokens1), len(indexed_tokens2))
indexed_tokens1 = indexed_tokens1 + [0] * (max_len - len(indexed_tokens1))
indexed_tokens2 = indexed_tokens2 + [0] * (max_len - len(indexed_tokens2))
tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])
# Predict hidden states features for each layer
with torch.no_grad():
lm_logits, multiple_choice_logits = model(tokens_tensor, mc_token_ids)
Transformer-XL
^^^^^^^^^^^^^^
Here is a quick-start example using the ``TransfoXLTokenizer``\ , ``TransfoXLModel`` and ``TransfoXLLMHeadModel`` classes with the Transformer-XL model pre-trained on WikiText-103. See the `doc section <#doc>`_ below for all the details on these classes.
First let's prepare a tokenized input with ``TransfoXLTokenizer``
.. code-block:: python
import torch
from pytorch_pretrained_bert import TransfoXLTokenizer, TransfoXLModel, TransfoXLLMHeadModel
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)
# Load pre-trained model tokenizer (vocabulary from wikitext 103)
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
# Tokenized input
text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"
tokenized_text_1 = tokenizer.tokenize(text_1)
tokenized_text_2 = tokenizer.tokenize(text_2)
# Convert token to vocabulary indices
indexed_tokens_1 = tokenizer.convert_tokens_to_ids(tokenized_text_1)
indexed_tokens_2 = tokenizer.convert_tokens_to_ids(tokenized_text_2)
# Convert inputs to PyTorch tensors
tokens_tensor_1 = torch.tensor([indexed_tokens_1])
tokens_tensor_2 = torch.tensor([indexed_tokens_2])
Let's see how to use ``TransfoXLModel`` to get hidden states
.. code-block:: python
# Load pre-trained model (weights)
model = TransfoXLModel.from_pretrained('transfo-xl-wt103')
model.eval()
# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')
with torch.no_grad():
# Predict hidden states features for each layer
hidden_states_1, mems_1 = model(tokens_tensor_1)
# We can re-use the memory cells in a subsequent call to attend a longer context
hidden_states_2, mems_2 = model(tokens_tensor_2, mems=mems_1)
And how to use ``TransfoXLLMHeadModel``
.. code-block:: python
# Load pre-trained model (weights)
model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')
model.eval()
# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')
with torch.no_grad():
# Predict all tokens
predictions_1, mems_1 = model(tokens_tensor_1)
# We can re-use the memory cells in a subsequent call to attend a longer context
predictions_2, mems_2 = model(tokens_tensor_2, mems=mems_1)
# get the predicted last token
predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'who'
OpenAI GPT-2
^^^^^^^^^^^^
Here is a quick-start example using the ``GPT2Tokenizer``\ , ``GPT2Model`` and ``GPT2LMHeadModel`` classes with OpenAI's pre-trained model. See the `doc section <#doc>`_ below for all the details on these classes.
First let's prepare a tokenized input with ``GPT2Tokenizer``
.. code-block:: python
import torch
from pytorch_pretrained_bert import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)
# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# Encode some inputs
text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"
indexed_tokens_1 = tokenizer.encode(text_1)
indexed_tokens_2 = tokenizer.encode(text_2)
# Convert inputs to PyTorch tensors
tokens_tensor_1 = torch.tensor([indexed_tokens_1])
tokens_tensor_2 = torch.tensor([indexed_tokens_2])
Let's see how to use ``GPT2Model`` to get hidden states
.. code-block:: python
# Load pre-trained model (weights)
model = GPT2Model.from_pretrained('gpt2')
model.eval()
# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')
# Predict hidden states features for each layer
with torch.no_grad():
hidden_states_1, past = model(tokens_tensor_1)
# past can be used to reuse precomputed hidden states in subsequent predictions
# (see beam-search examples in the run_gpt2.py example).
hidden_states_2, past = model(tokens_tensor_2, past=past)
And how to use ``GPT2LMHeadModel``
.. code-block:: python
# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
# If you have a GPU, put everything on cuda
tokens_tensor_1 = tokens_tensor_1.to('cuda')
tokens_tensor_2 = tokens_tensor_2.to('cuda')
model.to('cuda')
# Predict all tokens
with torch.no_grad():
predictions_1, past = model(tokens_tensor_1)
# past can be used to reuse precomputed hidden states in subsequent predictions
# (see beam-search examples in the run_gpt2.py example).
predictions_2, past = model(tokens_tensor_2, past=past)
# get the predicted last token
predicted_index = torch.argmax(predictions_2[0, -1, :]).item()
predicted_token = tokenizer.decode([predicted_index])
And how to use ``GPT2DoubleHeadsModel``
.. code-block:: python
# Load pre-trained model (weights)
model = GPT2DoubleHeadsModel.from_pretrained('gpt2')
model.eval()
# Prepare tokenized input
text1 = "Who was Jim Henson ? Jim Henson was a puppeteer"
text2 = "Who was Jim Henson ? Jim Henson was a mysterious young man"
tokenized_text1 = tokenizer.tokenize(text1)
tokenized_text2 = tokenizer.tokenize(text2)
indexed_tokens1 = tokenizer.convert_tokens_to_ids(tokenized_text1)
indexed_tokens2 = tokenizer.convert_tokens_to_ids(tokenized_text2)
# As in the OpenAI GPT example above, pad the shorter choice so torch.tensor
# receives a rectangular input; mc_token_ids still points at the last real token
max_len = max(len(indexed_tokens1), len(indexed_tokens2))
indexed_tokens1 = indexed_tokens1 + [0] * (max_len - len(indexed_tokens1))
indexed_tokens2 = indexed_tokens2 + [0] * (max_len - len(indexed_tokens2))
tokens_tensor = torch.tensor([[indexed_tokens1, indexed_tokens2]])
mc_token_ids = torch.LongTensor([[len(tokenized_text1)-1, len(tokenized_text2)-1]])
# Predict hidden states features for each layer
with torch.no_grad():
lm_logits, multiple_choice_logits, past = model(tokens_tensor, mc_token_ids)
pytorch_pretrained_bert/modeling.py

@@ -565,53 +565,25 @@ class BertPreTrainedModel(PreTrainedModel):
 
 class BertModel(BertPreTrainedModel):
-    """BERT model ("Bidirectional Embedding Representations from a Transformer").
+    r"""BERT model ("Bidirectional Embedding Representations from a Transformer").
 
-    Params:
-        `config`: a BertConfig class instance with the configuration to build a new model
-        `output_attentions`: If True, also output attentions weights computed by the model at each layer. Default: False
-        `output_hidden_states`: If True, also output hidden states computed by the model at each layer. Default: False
+    :class:`~pytorch_pretrained_bert.BertModel` is the basic BERT Transformer model with a layer of summed token, \
+    position and sequence embeddings followed by a series of identical self-attention blocks (12 for BERT-base, 24 \
+    for BERT-large). The model is instantiated with the following parameters.
 
-    Inputs:
-        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
-            with the word token indices in the vocabulary (see the tokens preprocessing logic in the scripts
-            `run_bert_extract_features.py`, `run_bert_classifier.py` and `run_bert_squad.py`)
-        `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
-            types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
-            a `sentence B` token (see BERT paper for more details).
-        `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
-            selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
-            input sequence length in the current batch. It's the mask that we typically use for attention when
-            a batch has varying length sentences.
-        `output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`.
-        `head_mask`: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 and 1.
-            It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, 0.0 => head is not masked.
+    Arguments:
+        config: a BertConfig class instance with the configuration to build a new model
+        output_attentions: If True, also output attentions weights computed by the model at each layer. Default: False
+        output_hidden_states: If True, also output hidden states computed by the model at each layer. Default: False
 
-    Outputs: Tuple of (encoded_layers, pooled_output)
-        `encoded_layers`: controlled by `output_all_encoded_layers` argument:
-            - `output_all_encoded_layers=True`: outputs a list of the full sequences of encoded-hidden-states at the end
-                of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each
-                encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
-            - `output_all_encoded_layers=False`: outputs only the full sequence of hidden-states corresponding
-                to the last attention block of shape [batch_size, sequence_length, hidden_size],
-        `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a
-            classifier pretrained on top of the hidden state associated to the first character of the
-            input (`CLS`) to train on the Next-Sentence task (see BERT's paper).
+    Example::
 
-    Example usage:
-    ```python
-    # Already been converted into WordPiece token ids
-    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
-    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
-    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
-
-    config = modeling.BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
-        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
-
-    model = modeling.BertModel(config=config)
-    all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
-    ```
+        config = modeling.BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
+            num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
+
+        model = modeling.BertModel(config=config)
     """
 
     def __init__(self, config):
         super(BertModel, self).__init__(config)
@@ -631,6 +603,58 @@ class BertModel(BertPreTrainedModel):
         self.encoder.layer[layer].attention.prune_heads(heads)
 
     def forward(self, input_ids, token_type_ids=None, attention_mask=None, head_mask=None):
+        """
+        Performs a model forward pass. Can be called by calling the class directly, once it has been instantiated.
+
+        Arguments:
+            input_ids: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the \
+                vocabulary (see the tokens pre-processing logic in the scripts `run_bert_extract_features.py`, \
+                `run_bert_classifier.py` and `run_bert_squad.py`)
+            token_type_ids: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token \
+                types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to \
+                a `sentence B` token (see BERT paper for more details).
+            attention_mask: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices \
+                selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max \
+                input sequence length in the current batch. It's the mask that we typically use for attention when \
+                a batch has varying length sentences.
+            output_all_encoded_layers: boolean which controls the content of the `encoded_layers` output as described \
+                below. Default: `True`.
+            head_mask: an optional torch.Tensor of shape [num_heads] or [num_layers, num_heads] with indices between 0 \
+                and 1. It's a mask to be used to nullify some heads of the transformer. 1.0 => head is fully masked, \
+                0.0 => head is not masked.
+
+        Returns:
+            A tuple composed of (encoded_layers, pooled_output). Encoded layers are controlled by the \
+            ``output_all_encoded_layers`` argument.
+
+            If ``output_all_encoded_layers`` is set to True, outputs a list of the full sequences of \
+            encoded-hidden-states at the end of each attention block (i.e. 12 full sequences for BERT-base, \
+            24 for BERT-large), where each encoded-hidden-state is a \
+            torch.FloatTensor of size [batch_size, sequence_length, hidden_size].
+
+            If set to False, outputs only the full sequence of hidden-states corresponding \
+            to the last attention block, of shape [batch_size, sequence_length, hidden_size].
+
+            ``pooled_output`` is a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a \
+            classifier pretrained on top of the hidden state associated to the first character of the \
+            input (`CLS`) to train on the Next-Sentence task (see BERT's paper).
+
+        Example::
+
+            # Already been converted into WordPiece token ids
+            input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
+            input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
+            token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
+
+            all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
+            # or
+            all_encoder_layers, pooled_output = model.forward(input_ids, token_type_ids, input_mask)
+        """
         if attention_mask is None:
             attention_mask = torch.ones_like(input_ids)
         if token_type_ids is None: