Unverified Commit 3323146e authored by Sylvain Gugger, committed by GitHub

Models doc (#7345)



* Clean up model documentation

* Formatting

* Preparation work

* Long lines

* Main work on rst files

* Cleanup all config files

* Syntax fix

* Clean all tokenizers

* Work on first models

* Models beginning

* FlauBERT

* All PyTorch models

* All models

* Long lines again

* Fixes

* More fixes

* Update docs/source/model_doc/bert.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update docs/source/model_doc/electra.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Last fixes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
parent 58405a52
Tokenizer summary
-----------------------------------------------------------------------------------------------------------------------

On this page, we will have a closer look at tokenization. As we saw in
:doc:`the preprocessing tutorial <preprocessing>`, tokenizing a text is splitting it into words or subwords, which are
then converted to ids. The second part is pretty straightforward; here we will focus on the first part. More
specifically, we will look at the three main kinds of tokenizers used in 🤗 Transformers:
:ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>` and
:ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of them.

Note that on each model page, you can look at the documentation of the associated tokenizer to know which of those
algorithms the pretrained model used. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
it's using :ref:`WordPiece <wordpiece>`.

Introduction to tokenization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing it.
For instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of
tokenizing this text is to split it on spaces, which would give:

.. code-block::

    ["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]

This is a nice first step, but if we look at the tokens "Transformers?" or "do.", we can see we can do better. Those
will be different from the tokens "Transformers" and "do" for our model, so we should probably take the punctuation
into account. This would give:

.. code-block::

    ["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]

which is better already. One thing that is annoying though is how it dealt with "Don't". "Don't" stands for "do not",
so it would probably be better tokenized as ``["Do", "n't"]``. This is where things start getting more complicated,
and part of the reason why each kind of model has its own tokenizer class. Depending on the rules we apply to split
our texts into tokens, we'll get different tokenized versions of the same text. And of course, a given pretrained
model won't perform properly if you don't use the exact same rules as the people who pretrained it.

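To make the two naive splitting strategies above concrete, here is a plain-Python sketch (purely illustrative; it is
not what the tokenizers in the library do internally):

.. code-block::

    >>> import re

    >>> text = "Don't you love 🤗 Transformers? We sure do."
    >>> # Split on whitespace only.
    >>> text.split()
    ["Don't", 'you', 'love', '🤗', 'Transformers?', 'We', 'sure', 'do.']
    >>> # Split on whitespace and treat every punctuation sign as its own token.
    >>> re.findall(r"\w+|[^\w\s]", text)
    ['Don', "'", 't', 'you', 'love', '🤗', 'Transformers', '?', 'We', 'sure', 'do', '.']
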
`spaCy <https://spacy.io/>`__ and `Moses <http://www.statmt.org/moses/?n=Development.GetStarted>`__ are two popular
rule-based tokenizers. On the text above, they'd output something like:

.. code-block::

    ["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]

Space/punctuation-tokenization and rule-based tokenization are both examples of word tokenization, which is splitting
a sentence into words. While it's the most intuitive way to separate texts into smaller chunks, it can become a
problem when you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used).
:doc:`Transformer-XL <model_doc/transformerxl>`, for instance, uses space/punctuation-tokenization and has a
vocabulary size of 267,735!

A huge vocabulary size means a huge embedding matrix at the start of the model, which will cause memory problems.
Transformer-XL deals with this by using a special kind of embeddings called adaptive embeddings, but in general,
transformers models rarely have a vocabulary size greater than 50,000, especially if they are trained on a single
language.

So if tokenizing on words is unsatisfactory, we could go in the opposite direction and simply tokenize on characters.
While it's very simple and would save a lot of memory, this doesn't allow the model to learn representations of texts
as meaningful as when using a word tokenization, leading to a loss of performance. So to get the best of both worlds,
all transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.

Subword tokenization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Subword tokenization algorithms rely on the principle that the most common words should be left as is, but rare words
should be decomposed into meaningful subword units. For instance "annoyingly" might be considered a rare word and
decomposed into "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you
can form (almost) arbitrarily long complex words by stringing together subwords.

This allows the model to keep a reasonable vocabulary size while still learning useful representations for common
words or subwords. It also enables the model to process words it has never seen before, by decomposing them into
subwords it knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like
this:

.. code-block::

    >>> from transformers import BertTokenizer
    >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    >>> tokenizer.tokenize("I have a new GPU!")
    ['i', 'have', 'a', 'new', 'gp', '##u', '!']

Since we are considering the uncased model, the sentence was lowercased first. Then all the words were present in the
vocabulary of the tokenizer except "gpu", so the tokenizer split it into subwords it knows: "gp" and "##u". The "##"
means that the rest of the token should be attached to the previous one, without space (for when we need to decode
predictions and reverse the tokenization).

Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text:

.. code-block::

    >>> from transformers import XLNetTokenizer
    >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
    >>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
    ['▁Don', "'", 't', '▁you', '▁love', '▁', '🤗', '▁', 'Transform', 'ers', '?', '▁We', '▁sure', '▁do', '.']

We'll get back to the meaning of those '▁' characters when we look at :ref:`SentencePiece <sentencepiece>`, but you
can see that "Transformers" has been split into "Transform" and "ers".

Let's now look at how the different subword tokenization algorithms work. Note that they all rely on some form of
training, which is usually done on the corpus the corresponding model will be trained on.

.. _byte-pair-encoding:

Byte-Pair Encoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Byte-Pair Encoding was introduced in `this paper <https://arxiv.org/abs/1508.07909>`__. It relies on a pretokenizer
that splits the training data into words; this can be a simple space tokenization (:doc:`GPT-2 <model_doc/gpt2>` and
:doc:`RoBERTa <model_doc/roberta>` use this, for instance) or a rule-based tokenizer (:doc:`XLM <model_doc/xlm>` uses
Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`, while :doc:`GPT <model_doc/gpt>` uses spaCy
and ftfy). After the pre-tokenization, the frequency of each word in the training corpus is counted.

The algorithm then starts from the list of all characters present in those words and learns merge rules that combine
two symbols of the vocabulary into a new token, until it has built a vocabulary of the desired size (the vocabulary
size is a hyperparameter to pick).

Let's say that after the pre-tokenization we have the following words (the number indicating the frequency of each
word):

.. code-block::

    ('hug', 10), ('pug', 5), ('pun', 12), ('bun', 4), ('hugs', 5)

Then the base vocabulary is ``['b', 'g', 'h', 'n', 'p', 's', 'u']`` and all our words are first split by character:

.. code-block::

    ('h' 'u' 'g', 10), ('p' 'u' 'g', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'u' 'g' 's', 5)

We then count the occurrences of each pair of consecutive symbols and look for the most frequent one. For instance,
'hu' is present `10 + 5 = 15` times (10 times in the 10 occurrences of 'hug', 5 times in the 5 occurrences of
'hugs'). The most frequent pair here is 'ug', present `10 + 5 + 5 = 20` times in total. So the first merge rule the
tokenizer learns is to group 'u' and 'g' together, and it adds 'ug' to the vocabulary. Our corpus then becomes

.. code-block::

    ('h' 'ug', 10), ('p' 'ug', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'ug' 's', 5)

and we continue by looking at the next most common pair of symbols. It's 'un', present 16 times, so we merge those
two and add 'un' to the vocabulary. Then it's 'hug' (as 'h' + 'ug'), present 15 times, so we merge those two and add
'hug' to the vocabulary.

At this stage, the vocabulary is ``['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']`` and our corpus is
represented as

.. code-block::

    ('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)

If we stop there, the tokenizer can apply the rules it has learned to new words, as long as they don't contain
characters that were not in the base vocabulary. For instance, 'bug' would be tokenized as ``['b', 'ug']``, but 'mug'
would be tokenized as ``['<unk>', 'ug']`` since 'm' is not in the base vocabulary. In general this doesn't happen to
individual letters (the training corpus usually contains all of them), but it does happen to special characters like
emojis.

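To make the procedure concrete, here is a minimal sketch of the merge-learning loop on our toy corpus. It is written
for this summary only and is not the library's actual (much more efficient and complete) implementation:

.. code-block::

    >>> from collections import Counter

    >>> # Word frequencies after pre-tokenization, each word split into characters.
    >>> corpus = {('h', 'u', 'g'): 10, ('p', 'u', 'g'): 5, ('p', 'u', 'n'): 12,
    ...           ('b', 'u', 'n'): 4, ('h', 'u', 'g', 's'): 5}
    >>> vocab = {c for word in corpus for c in word}

    >>> def learn_merges(corpus, vocab, num_merges):
    ...     merges = []
    ...     for _ in range(num_merges):
    ...         # Count every pair of consecutive symbols, weighted by word frequency.
    ...         pairs = Counter()
    ...         for word, freq in corpus.items():
    ...             for pair in zip(word, word[1:]):
    ...                 pairs[pair] += freq
    ...         best = max(pairs, key=pairs.get)
    ...         merges.append(best)
    ...         vocab.add(best[0] + best[1])
    ...         # Apply the new merge rule to every word of the corpus.
    ...         new_corpus = {}
    ...         for word, freq in corpus.items():
    ...             symbols, i = [], 0
    ...             while i < len(word):
    ...                 if i < len(word) - 1 and (word[i], word[i + 1]) == best:
    ...                     symbols.append(word[i] + word[i + 1])
    ...                     i += 2
    ...                 else:
    ...                     symbols.append(word[i])
    ...                     i += 1
    ...             new_corpus[tuple(symbols)] = freq
    ...         corpus = new_corpus
    ...     return merges, vocab

    >>> merges, vocab = learn_merges(corpus, vocab, num_merges=3)
    >>> merges
    [('u', 'g'), ('u', 'n'), ('h', 'ug')]

The three learned merges are exactly the ones described above: 'ug', then 'un', then 'hug'.
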
As we said before, the vocabulary size (which is the base vocabulary size + the number of merges) is a hyperparameter
to choose. For instance :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478 since they have 478 base
characters and chose to stop the training of the tokenizer at 40,000 merges.

Byte-level BPE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To deal with the fact that the base vocabulary needs to contain all base characters, which can get quite big if one
allows for all Unicode characters, the
`GPT-2 paper <https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__
introduces a clever trick, which is to use bytes as the base vocabulary (giving a base size of 256). With some
additional rules to deal with punctuation, this makes it possible to tokenize every text without ever needing an
unknown token. For instance, the :doc:`GPT-2 model <model_doc/gpt2>` has a vocabulary size of 50,257, which
corresponds to the 256 byte base tokens, a special end-of-text token and the symbols learned with 50,000 merges.

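As a quick sanity check (a small illustration using the byte-level BPE tokenizer shipped with the library), the
pretrained GPT-2 tokenizer reports exactly that vocabulary size:

.. code-block::

    >>> from transformers import GPT2Tokenizer
    >>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    >>> len(tokenizer)  # 256 byte tokens + 50,000 merges + 1 end-of-text token
    50257
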
.. _wordpiece:

WordPiece
=======================================================================================================================

WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as
:doc:`DistilBERT <model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in
`this paper <https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies
on the same base as BPE, which is to initialize the vocabulary to every character present in the corpus and
progressively learn a given number of merge rules. The difference is that it doesn't choose the most frequent pair,
but the one that will maximize the likelihood of the corpus once merged.

What does this mean? Well, in the previous example, it means we would only merge 'u' and 'g' if the probability of
having 'ug' divided by the probability of having 'u' then 'g' is greater than for any other pair of symbols. It's
subtly different from what BPE does in the sense that it evaluates what it "loses" by merging two symbols and makes
sure it's `worth it`.

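Concretely, on the symbol and pair counts of the toy corpus from the BPE section, this criterion can be written as
the score ``freq(pair) / (freq(first) * freq(second))``. The sketch below only illustrates that criterion (it is not
the actual training code) and shows that WordPiece would pick a different first merge than BPE:

.. code-block::

    >>> # Symbol and pair frequencies of the toy corpus from the BPE section.
    >>> symbol_freqs = {'b': 4, 'g': 20, 'h': 15, 'n': 16, 'p': 17, 's': 5, 'u': 36}
    >>> pair_freqs = {('h', 'u'): 15, ('u', 'g'): 20, ('p', 'u'): 17, ('u', 'n'): 16, ('b', 'u'): 4, ('g', 's'): 5}

    >>> def wordpiece_score(pair):
    ...     # High when the pair occurs together more often than its parts alone would suggest.
    ...     return pair_freqs[pair] / (symbol_freqs[pair[0]] * symbol_freqs[pair[1]])

    >>> max(pair_freqs, key=wordpiece_score)  # WordPiece-style choice
    ('g', 's')
    >>> max(pair_freqs, key=pair_freqs.get)  # BPE-style choice
    ('u', 'g')
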
.. _unigram:

Unigram
=======================================================================================================================

Unigram is a subword tokenization algorithm introduced in `this paper <https://arxiv.org/pdf/1804.10959.pdf>`__.
Instead of starting with a group of base symbols and learning merges with some rule, like BPE or WordPiece, it starts
from a large vocabulary (for instance, all pretokenized words and the most common substrings) that it will trim down
progressively. It's not used directly for any of the pretrained models in the library, but it's used in conjunction
with :ref:`SentencePiece <sentencepiece>`.

More specifically, at a given step, Unigram computes a loss from the corpus it has and the current vocabulary, then,
for each subword, evaluates how much the loss would increase if that subword were removed from the vocabulary. It
then sorts the subwords by this quantity (which represents how much worse the loss becomes when the token is removed)
and removes the worst p tokens (for instance, p could be 10% or 20% of the vocabulary). It then repeats the process
until the vocabulary has reached the desired size, always keeping the base characters (so that any word written with
them can still be tokenized, like in BPE or WordPiece).

Contrary to BPE and WordPiece, which work out rules in a certain order that you can then apply in the same order when
tokenizing new text, Unigram will have several ways of tokenizing a new text. For instance, if it ends up with the
vocabulary

.. code-block::

    ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']

we had before, it could tokenize "hugs" as ``['hug', 's']``, ``['h', 'ug', 's']`` or ``['h', 'u', 'g', 's']``. So
which one should it choose? On top of saving the vocabulary, the trained tokenizer saves the probability of each
token in the training corpus. You can then give a probability to each tokenization (the product of the probabilities
of the tokens forming it) and pick the most likely one (or, if you want to apply some data augmentation, you could
sample one of the tokenizations according to their probabilities).

Those probabilities define the loss that trains the tokenizer: if our corpus consists of the words
:math:`x_{1}, \dots, x_{N}` and if, for the word :math:`x_{i}`, we denote by :math:`S(x_{i})` the set of all possible
tokenizations of :math:`x_{i}` (with the current vocabulary), then the loss is defined as

.. math::

    \mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )

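To make the selection of the most likely tokenization concrete, here is a small sketch that enumerates every
tokenization of "hugs" over the toy vocabulary and scores it with some made-up token probabilities. The probabilities
are invented for the example; a real Unigram tokenizer learns them from the corpus and uses dynamic programming
instead of brute-force enumeration:

.. code-block::

    >>> import math

    >>> # Invented probabilities for the toy vocabulary; a real tokenizer learns these.
    >>> probs = {'b': 0.02, 'g': 0.03, 'h': 0.03, 'n': 0.04, 'p': 0.04, 's': 0.05,
    ...          'u': 0.06, 'ug': 0.15, 'un': 0.12, 'hug': 0.20}

    >>> def tokenizations(word):
    ...     # Recursively enumerate every way to split `word` into tokens of the vocabulary.
    ...     if not word:
    ...         return [[]]
    ...     results = []
    ...     for i in range(1, len(word) + 1):
    ...         if word[:i] in probs:
    ...             results += [[word[:i]] + rest for rest in tokenizations(word[i:])]
    ...     return results

    >>> candidates = tokenizations("hugs")
    >>> sorted(candidates, key=lambda toks: -math.prod(probs[t] for t in toks))[0]
    ['hug', 's']
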
.. _sentencepiece:

SentencePiece
=======================================================================================================================

All the tokenization methods we have looked at so far require some form of pretokenization, which has a central
problem: not all languages use spaces to separate words. :doc:`XLM <model_doc/xlm>` deals with this by using specific
pretokenizers for each of those languages (in this case, Chinese, Japanese and Thai). To solve the problem more
generally, SentencePiece (introduced in `this paper <https://arxiv.org/pdf/1808.06226.pdf>`__) treats the input as a
raw stream, includes the space in the set of characters to use, then uses BPE or unigram to construct the appropriate
vocabulary.

That's why in the example we saw before using :class:`~transformers.XLNetTokenizer` (which uses SentencePiece), we
had the '▁' character, which represents a space. Decoding a tokenized text is then very easy: we just have to
concatenate all the tokens together and replace '▁' with a space.

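For instance, on the XLNet tokens shown earlier, this decoding rule boils down to a join and a replace (a minimal
sketch of the rule just described; in practice the tokenizer's own ``convert_tokens_to_string`` and ``decode``
methods handle this for you):

.. code-block::

    >>> tokens = ['▁Don', "'", 't', '▁you', '▁love', '▁', '🤗', '▁', 'Transform', 'ers', '?', '▁We', '▁sure', '▁do', '.']
    >>> "".join(tokens).replace('▁', ' ').strip()
    "Don't you love 🤗 Transformers? We sure do."
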
All transformers models in the library that use SentencePiece use it in combination with unigram. Examples of models
using SentencePiece are :doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>` or the
:doc:`Marian framework <model_doc/marian>`.

Training and fine-tuning
=======================================================================================================================

Model classes in 🤗 Transformers are designed to be compatible with native
PyTorch and TensorFlow 2 and can be used seamlessly with either. In this
@@ -24,7 +24,7 @@ Sections:
.. _pytorch:

Fine-tuning in native PyTorch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Model classes in 🤗 Transformers that don't begin with ``TF`` are
`PyTorch Modules <https://pytorch.org/docs/master/generated/torch.nn.Module.html>`_,
@@ -141,7 +141,7 @@ with features like mixed precision and easy tensorboard logging.
Freezing the encoder
-----------------------------------------------------------------------------------------------------------------------

In some cases, you might be interested in keeping the weights of the
pre-trained encoder frozen and optimizing only the weights of the head
@@ -158,7 +158,7 @@ submodule on any task-specific model in the library:
.. _tensorflow:

Fine-tuning in native TensorFlow 2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Models can also be trained natively in TensorFlow 2. Just as with PyTorch,
TensorFlow models can be instantiated with
@@ -210,7 +210,7 @@ can even save the model and then reload it as a PyTorch model (or vice-versa):
.. _trainer:

Trainer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We also provide a simple but feature-complete training and evaluation
interface through :func:`~transformers.Trainer` and
@@ -303,7 +303,7 @@ launching tensorboard in your specified ``logging_dir`` directory.
.. _additional-resources:

Additional resources
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- `A lightweight colab demo <https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM?usp=sharing>`_
  which uses ``Trainer`` for IMDb sentiment classification.
@@ -32,54 +32,55 @@ ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class AlbertConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a :class:`~transformers.AlbertModel` or a
    :class:`~transformers.TFAlbertModel`. It is used to instantiate an ALBERT model according to the specified
    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
    configuration to that of the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
    outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.

    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 30000):
            Vocabulary size of the ALBERT model. Defines the number of different tokens that can be represented by
            the :obj:`inputs_ids` passed when calling :class:`~transformers.AlbertModel` or
            :class:`~transformers.TFAlbertModel`.
        embedding_size (:obj:`int`, `optional`, defaults to 128):
            Dimensionality of vocabulary embeddings.
        hidden_size (:obj:`int`, `optional`, defaults to 4096):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_hidden_groups (:obj:`int`, `optional`, defaults to 1):
            Number of groups for the hidden layers, parameters in the same group are shared.
        num_attention_heads (:obj:`int`, `optional`, defaults to 64):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (:obj:`int`, `optional`, defaults to 16384):
            The dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        inner_group_num (:obj:`int`, `optional`, defaults to 1):
            The number of inner repetition of attention and ffn.
        hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu_new"`):
            The non-linear activation function (function or string) in the encoder and pooler.
            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to something
            large (e.g., 512 or 1024 or 2048).
        type_vocab_size (:obj:`int`, `optional`, defaults to 2):
            The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.AlbertModel`
            or :class:`~transformers.TFAlbertModel`.
        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        classifier_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
            The dropout ratio for attached classifiers.

    Examples::

        >>> from transformers import AlbertConfig, AlbertModel
        >>> # Initializing an ALBERT-xxlarge style configuration
@@ -50,10 +50,10 @@ BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class BertConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a :class:`~transformers.BertModel` or a
    :class:`~transformers.TFBertModel`. It is used to instantiate a BERT model according to the specified
    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
    configuration to that of the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
    to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
@@ -61,37 +61,39 @@ class BertConfig(PretrainedConfig):
    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 30522):
            Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by
            the :obj:`inputs_ids` passed when calling :class:`~transformers.BertModel` or
            :class:`~transformers.TFBertModel`.
        hidden_size (:obj:`int`, `optional`, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (:obj:`int`, `optional`, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (:obj:`int`, `optional`, defaults to 3072):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler.
            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
            The maximum sequence length that this model might ever be used with.
            Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (:obj:`int`, `optional`, defaults to 2):
            The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.BertModel` or
            :class:`~transformers.TFBertModel`.
        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
            If :obj:`True`, use gradient checkpointing to save memory at the expense of slower backward pass.

    Examples::

        >>> from transformers import BertModel, BertConfig
@@ -19,18 +19,18 @@ from .configuration_utils import PretrainedConfig
class BertGenerationConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a
    :class:`~transformers.BertGenerationPreTrainedModel`. It is used to instantiate a BertGeneration model according
    to the specified arguments, defining the model architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
    to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
    for more information.

    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 50358):
            Vocabulary size of the BertGeneration model. Defines the number of different tokens that can be
            represented by the :obj:`inputs_ids` passed when calling :class:`~transformers.BertGeneration`.
        hidden_size (:obj:`int`, `optional`, defaults to 1024):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (:obj:`int`, `optional`, defaults to 24):
@@ -38,7 +38,7 @@ class BertGenerationConfig(PretrainedConfig):
        num_attention_heads (:obj:`int`, `optional`, defaults to 16):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (:obj:`int`, `optional`, defaults to 3072):
            Dimensionality of the "intermediate" (often called feed-forward) layer in the Transformer encoder.
        hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler.
            If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
@@ -56,7 +56,7 @@ class BertGenerationConfig(PretrainedConfig):
        gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
            If :obj:`True`, use gradient checkpointing to save memory at the expense of slower backward pass.

    Examples::

        >>> from transformers import BertGenerationConfig, BertGenerationEncoder
@@ -25,44 +25,45 @@ CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP = {"ctrl": "https://s3.amazonaws.com/models.h
class CTRLConfig(PretrainedConfig):
    """
    This is the configuration class to store the configuration of a :class:`~transformers.CTRLModel` or a
    :class:`~transformers.TFCTRLModel`. It is used to instantiate a CTRL model according to the specified
    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
    configuration to that of the `ctrl <https://huggingface.co/ctrl>`__ architecture from SalesForce.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
    outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.

    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 246534):
            Vocabulary size of the CTRL model. Defines the number of different tokens that can be represented by
            the :obj:`inputs_ids` passed when calling :class:`~transformers.CTRLModel` or
            :class:`~transformers.TFCTRLModel`.
        n_positions (:obj:`int`, `optional`, defaults to 256):
            The maximum sequence length that this model might ever be used with.
            Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
        n_ctx (:obj:`int`, `optional`, defaults to 256):
            Dimensionality of the causal mask (usually same as n_positions).
        n_embd (:obj:`int`, `optional`, defaults to 1280):
            Dimensionality of the embeddings and hidden states.
        dff (:obj:`int`, `optional`, defaults to 8192):
            Dimensionality of the inner dimension of the feed forward networks (FFN).
        n_layer (:obj:`int`, `optional`, defaults to 48):
            Number of hidden layers in the Transformer encoder.
        n_head (:obj:`int`, `optional`, defaults to 16):
            Number of attention heads for each attention layer in the Transformer encoder.
        resid_pdrop (:obj:`float`, `optional`, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        embd_pdrop (:obj:`int`, `optional`, defaults to 0.1):
            The dropout ratio for the embeddings.
        attn_pdrop (:obj:`float`, `optional`, defaults to 0.1):
            The dropout ratio for the attention.
        layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-6):
            The epsilon to use in the layer normalization layers.
        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

    Examples::

        >>> from transformers import CTRLModel, CTRLConfig
@@ -33,50 +33,51 @@ DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class DistilBertConfig(PretrainedConfig): class DistilBertConfig(PretrainedConfig):
r""" r"""
This is the configuration class to store the configuration of a :class:`~transformers.DistilBertModel`. This is the configuration class to store the configuration of a :class:`~transformers.DistilBertModel` or a
It is used to instantiate a DistilBERT model according to the specified arguments, defining the model :class:`~transformers.TFDistilBertModel`. It is used to instantiate a DistilBERT model according to the specified
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
the DistilBERT `distilbert-base-uncased <https://huggingface.co/distilbert-base-uncased>`__ architecture. configuration to that of the DistilBERT
`distilbert-base-uncased <https://huggingface.co/distilbert-base-uncased>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information. for more information.
Args: Args:
vocab_size (:obj:`int`, optional, defaults to 30522): vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the DistilBERT model. Defines the different tokens that Vocabulary size of the DistilBERT model. Defines the number of different tokens that can be represented by the
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertModel`. :obj:`inputs_ids` passed when calling :class:`~transformers.DistilBertModel` or
max_position_embeddings (:obj:`int`, optional, defaults to 512): :class:`~transformers.TFDistilBertModel`.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
sinusoidal_pos_embds (:obj:`boolean`, optional, defaults to :obj:`False`): sinusoidal_pos_embds (:obj:`boolean`, `optional`, defaults to :obj:`False`):
Whether to use sinusoidal positional embeddings. Whether to use sinusoidal positional embeddings.
n_layers (:obj:`int`, optional, defaults to 6): n_layers (:obj:`int`, `optional`, defaults to 6):
Number of hidden layers in the Transformer encoder. Number of hidden layers in the Transformer encoder.
n_heads (:obj:`int`, optional, defaults to 12): n_heads (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
dim (:obj:`int`, optional, defaults to 768): dim (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer. Dimensionality of the encoder layers and the pooler layer.
hidden_dim (:obj:`int`, optional, defaults to 3072): hidden_dim (:obj:`int`, `optional`, defaults to 3072):
The size of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. The size of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
dropout (:obj:`float`, optional, defaults to 0.1): dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, optional, defaults to 0.1): attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities. The dropout ratio for the attention probabilities.
activation (:obj:`str` or :obj:`function`, optional, defaults to "gelu"): activation (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported. If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
initializer_range (:obj:`float`, optional, defaults to 0.02): initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
qa_dropout (:obj:`float`, optional, defaults to 0.1): qa_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability used in the question answering model The dropout probability used in the question answering model
:class:`~transformers.DistilBertForQuestionAnswering`. :class:`~transformers.DistilBertForQuestionAnswering`.
seq_classif_dropout (:obj:`float`, optional, defaults to 0.2): seq_classif_dropout (:obj:`float`, `optional`, defaults to 0.2):
The dropout probability used in the sequence classification and the multiple choice model The dropout probability used in the sequence classification and the multiple choice model
:class:`~transformers.DistilBertForSequenceClassification`. :class:`~transformers.DistilBertForSequenceClassification`.
Example:: Examples::
>>> from transformers import DistilBertModel, DistilBertConfig >>> from transformers import DistilBertModel, DistilBertConfig
......
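A minimal sketch of the same pattern for the DistilBERT configuration above (the defaults approximate the documented distilbert-base-uncased architecture)::

    >>> from transformers import DistilBertConfig, DistilBertModel
    >>> # Configuration with the documented defaults (6 layers, 12 heads, dim=768, ...)
    >>> configuration = DistilBertConfig()
    >>> # Model with randomly initialized weights built from that configuration
    >>> model = DistilBertModel(configuration)
    >>> configuration = model.config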
...@@ -32,8 +32,12 @@ class DPRConfig(PretrainedConfig): ...@@ -32,8 +32,12 @@ class DPRConfig(PretrainedConfig):
:class:`~transformers.DPRConfig` is the configuration class to store the configuration of a :class:`~transformers.DPRConfig` is the configuration class to store the configuration of a
`DPRModel`. `DPRModel`.
This is the configuration class to store the configuration of a `DPRContextEncoder`, `DPRQuestionEncoder`, or a `DPRReader`. This is the configuration class to store the configuration of a :class:`~transformers.DPRContextEncoder`,
It is used to instantiate the components of the DPR model. :class:`~transformers.DPRQuestionEncoder`, or a :class:`~transformers.DPRReader`. It is used to instantiate the
components of the DPR model.
This class is a subclass of :class:`~transformers.BertConfig`. Please check the
superclass for the documentation of all kwargs.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522): vocab_size (:obj:`int`, `optional`, defaults to 30522):
......
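Since the DPR configuration is shared by three components, a minimal sketch of feeding it to one of them (``DPRContextEncoder`` is picked here purely for illustration)::

    >>> from transformers import DPRConfig, DPRContextEncoder
    >>> # DPRConfig subclasses BertConfig, so the usual BERT hyper-parameters are accepted
    >>> configuration = DPRConfig()
    >>> # The same configuration object can initialize any of the three DPR components
    >>> context_encoder = DPRContextEncoder(configuration)
    >>> configuration = context_encoder.config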
...@@ -33,11 +33,11 @@ ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP = { ...@@ -33,11 +33,11 @@ ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class ElectraConfig(PretrainedConfig): class ElectraConfig(PretrainedConfig):
r""" r"""
This is the configuration class to store the configuration of a :class:`~transformers.ElectraModel`. This is the configuration class to store the configuration of a :class:`~transformers.ElectraModel` or a
It is used to instantiate an ELECTRA model according to the specified :class:`~transformers.TFElectraModel`. It is used to instantiate an ELECTRA model according to the specified
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
the ELECTRA `google/electra-small-discriminator <https://huggingface.co/google/electra-small-discriminator>`__ configuration to that of the ELECTRA
architecture. `google/electra-small-discriminator <https://huggingface.co/google/electra-small-discriminator>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
...@@ -45,59 +45,61 @@ class ElectraConfig(PretrainedConfig): ...@@ -45,59 +45,61 @@ class ElectraConfig(PretrainedConfig):
Args: Args:
vocab_size (:obj:`int`, optional, defaults to 30522): vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the ELECTRA model. Defines the different tokens that Vocabulary size of the ELECTRA model. Defines the number of different tokens that can be represented by the
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.ElectraModel`. :obj:`inputs_ids` passed when calling :class:`~transformers.ElectraModel` or
embedding_size (:obj:`int`, optional, defaults to 128): :class:`~transformers.TFElectraModel`.
embedding_size (:obj:`int`, `optional`, defaults to 128):
Dimensionality of the encoder layers and the pooler layer. Dimensionality of the encoder layers and the pooler layer.
hidden_size (:obj:`int`, optional, defaults to 256): hidden_size (:obj:`int`, `optional`, defaults to 256):
Dimensionality of the encoder layers and the pooler layer. Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, optional, defaults to 12): num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder. Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, optional, defaults to 4): num_attention_heads (:obj:`int`, `optional`, defaults to 4):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, optional, defaults to 1024): intermediate_size (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"): hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported. If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1): hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1): attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities. The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, optional, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, optional, defaults to 2): type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.ElectraModel`. The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.ElectraModel` or
initializer_range (:obj:`float`, optional, defaults to 0.02): :class:`~transformers.TFElectraModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12): layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers. The epsilon used by the layer normalization layers.
summary_type (:obj:`string`, optional, defaults to "first"): summary_type (:obj:`str`, `optional`, defaults to :obj:`"first"`):
Argument used when doing sequence summary. Used in for the multiple choice head in Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
:class:`~transformers.ElectraForMultipleChoice`.
Is one of the following options: Has to be one of the following options:
- 'last' => take the last token hidden state (like XLNet) - :obj:`"last"`: Take the last token hidden state (like XLNet).
- 'first' => take the first token hidden state (like Bert) - :obj:`"first"`: Take the first token hidden state (like BERT).
- 'mean' => take the mean of all tokens hidden states - :obj:`"mean"`: Take the mean of all tokens hidden states.
- 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2) - :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- 'attn' => Not implemented now, use multi-head attention - :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`): summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
:class:`~transformers.ElectraForMultipleChoice`.
Add a projection after the vector extraction Whether or not to add a projection after the vector extraction.
summary_activation (:obj:`string` or :obj:`None`, optional): summary_activation (:obj:`str`, `optional`):
Argument used when doing sequence summary. Used in for the multiple choice head in Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
:class:`~transformers.ElectraForMultipleChoice`.
'gelu' => add a gelu activation to the output, Other => no activation. Pass :obj:`"gelu"` for a gelu activation to the output, any other value will result in no activation.
summary_last_dropout (:obj:`float`, optional, defaults to 0.0): summary_last_dropout (:obj:`float`, `optional`, defaults to 0.0):
Argument used when doing sequence summary. Used in for the multiple choice head in Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
:class:`~transformers.ElectraForMultipleChoice`.
Add a dropout after the projection and activation The dropout ratio to be used after the projection and activation.
Example:: Examples::
>>> from transformers import ElectraModel, ElectraConfig >>> from transformers import ElectraModel, ElectraConfig
......
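A minimal sketch for the ELECTRA configuration above; per the docstring, the defaults are close to the google/electra-small-discriminator architecture::

    >>> from transformers import ElectraConfig, ElectraModel
    >>> # Defaults: hidden_size=256, 12 layers, 4 attention heads, embedding_size=128
    >>> configuration = ElectraConfig()
    >>> model = ElectraModel(configuration)
    >>> configuration = model.config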
...@@ -25,22 +25,24 @@ logger = logging.get_logger(__name__) ...@@ -25,22 +25,24 @@ logger = logging.get_logger(__name__)
class EncoderDecoderConfig(PretrainedConfig): class EncoderDecoderConfig(PretrainedConfig):
r""" r"""
:class:`~transformers.EncoderDecoderConfig` is the configuration class to store the configuration of a `EncoderDecoderModel`. :class:`~transformers.EncoderDecoderConfig` is the configuration class to store the configuration of a
:class:`~transformers.EncoderDecoderModel`. It is used to instantiate an Encoder Decoder model according to the
specified arguments, defining the encoder and decoder configs.
It is used to instantiate an Encoder Decoder model according to the specified arguments, defining the encoder and decoder configs. Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
Configuration objects inherit from :class:`~transformers.PretrainedConfig` to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
and can be used to control the model outputs. for more information.
See the documentation for :class:`~transformers.PretrainedConfig` for more information.
Args: Args:
kwargs (`optional`): kwargs (`optional`):
Remaining dictionary of keyword arguments. Notably: Dictionary of keyword arguments. Notably:
encoder (:class:`PretrainedConfig`, optional, defaults to `None`):
An instance of a configuration object that defines the encoder config.
decoder (:class:`PretrainedConfig`, optional, defaults to `None`):
An instance of a configuration object that defines the decoder config.
Example:: - **encoder** (:class:`~transformers.PretrainedConfig`, `optional`) -- An instance of a configuration
object that defines the encoder config.
- **decoder** (:class:`~transformers.PretrainedConfig`, `optional`) -- An instance of a configuration
object that defines the decoder config.
Examples::
>>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel >>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel
......
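Because the encoder and decoder sub-configurations travel through ``kwargs``, a minimal sketch of assembling them explicitly (BERT configurations are used on both sides only for illustration)::

    >>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel
    >>> # One configuration per side of the model
    >>> config_encoder = BertConfig()
    >>> config_decoder = BertConfig()
    >>> # Combine them into a single EncoderDecoderConfig
    >>> config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
    >>> # Instantiate an (untrained) encoder-decoder model from the combined configuration
    >>> model = EncoderDecoderModel(config=config)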
...@@ -30,11 +30,9 @@ FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = { ...@@ -30,11 +30,9 @@ FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class FlaubertConfig(XLMConfig): class FlaubertConfig(XLMConfig):
""" """
Configuration class to store the configuration of a `FlaubertModel`. This is the configuration class to store the configuration of a :class:`~transformers.FlaubertModel` or a
This is the configuration class to store the configuration of a :class:`~transformers.XLMModel`. :class:`~transformers.TFFlaubertModel`. It is used to instantiate a FlauBERT model according to the specified
It is used to instantiate an XLM model according to the specified arguments, defining the model arguments, defining the model architecture.
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the `xlm-mlm-en-2048 <https://huggingface.co/xlm-mlm-en-2048>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
...@@ -47,95 +45,95 @@ class FlaubertConfig(XLMConfig): ...@@ -47,95 +45,95 @@ class FlaubertConfig(XLMConfig):
layerdrop (:obj:`float`, `optional`, defaults to 0.0): layerdrop (:obj:`float`, `optional`, defaults to 0.0):
Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand
with Structured Dropout. ICLR 2020) with Structured Dropout. ICLR 2020)
vocab_size (:obj:`int`, optional, defaults to 30145): vocab_size (:obj:`int`, `optional`, defaults to 30145):
Vocabulary size of the Flaubert model. Defines the different tokens that Vocabulary size of the FlauBERT model. Defines the number of different tokens that can be represented by the
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.FlaubertModel`. :obj:`inputs_ids` passed when calling :class:`~transformers.FlaubertModel` or
emb_dim (:obj:`int`, optional, defaults to 2048): :class:`~transformers.TFFlaubertModel`.
emb_dim (:obj:`int`, `optional`, defaults to 2048):
Dimensionality of the encoder layers and the pooler layer. Dimensionality of the encoder layers and the pooler layer.
n_layer (:obj:`int`, optional, defaults to 12): n_layer (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder. Number of hidden layers in the Transformer encoder.
n_head (:obj:`int`, optional, defaults to 16): n_head (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
dropout (:obj:`float`, optional, defaults to 0.1): dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected The dropout probability for all fully connected
layers in the embeddings, encoder, and pooler. layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, optional, defaults to 0.1): attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for the attention mechanism The dropout probability for the attention mechanism
gelu_activation (:obj:`boolean`, optional, defaults to :obj:`True`): gelu_activation (:obj:`bool`, `optional`, defaults to :obj:`True`):
The non-linear activation function (function or string) in the Whether or not to use a `gelu` activation instead of `relu`.
encoder and pooler. If set to `True`, "gelu" will be used instead of "relu". sinusoidal_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
sinusoidal_embeddings (:obj:`boolean`, optional, defaults to :obj:`False`): Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings.
Whether to use sinusoidal positional embeddings instead of absolute positional embeddings. causal (:obj:`bool`, `optional`, defaults to :obj:`False`):
causal (:obj:`boolean`, optional, defaults to :obj:`False`): Whether or not the model should behave in a causal manner.
Set this to `True` for the model to behave in a causal manner.
Causal models use a triangular attention mask in order to only attend to the left-side context instead Causal models use a triangular attention mask in order to only attend to the left-side context instead
of a bidirectional context. of a bidirectional context.
asm (:obj:`boolean`, optional, defaults to :obj:`False`): asm (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to use an adaptive log softmax projection layer instead of a linear layer for the prediction Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the prediction
layer. layer.
n_langs (:obj:`int`, optional, defaults to 1): n_langs (:obj:`int`, `optional`, defaults to 1):
The number of languages the model handles. Set to 1 for monolingual models. The number of languages the model handles. Set to 1 for monolingual models.
use_lang_emb (:obj:`boolean`, optional, defaults to :obj:`True`) use_lang_emb (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to use language embeddings. Some models use additional language embeddings, see Whether to use language embeddings. Some models use additional language embeddings, see
`the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__ `the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__
for information on how to use them. for information on how to use them.
max_position_embeddings (:obj:`int`, optional, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might The maximum sequence length that this model might
ever be used with. Typically set this to something large just in case ever be used with. Typically set this to something large just in case
(e.g., 512 or 1024 or 2048). (e.g., 512 or 1024 or 2048).
embed_init_std (:obj:`float`, optional, defaults to 2048^-0.5): embed_init_std (:obj:`float`, `optional`, defaults to 2048^-0.5):
The standard deviation of the truncated_normal_initializer for The standard deviation of the truncated_normal_initializer for
initializing the embedding matrices. initializing the embedding matrices.
init_std (:obj:`int`, optional, defaults to 50257): init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for The standard deviation of the truncated_normal_initializer for
initializing all weight matrices except the embedding matrices. initializing all weight matrices except the embedding matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12): layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers. The epsilon used by the layer normalization layers.
bos_index (:obj:`int`, optional, defaults to 0): bos_index (:obj:`int`, `optional`, defaults to 0):
The index of the beginning of sentence token in the vocabulary. The index of the beginning of sentence token in the vocabulary.
eos_index (:obj:`int`, optional, defaults to 1): eos_index (:obj:`int`, `optional`, defaults to 1):
The index of the end of sentence token in the vocabulary. The index of the end of sentence token in the vocabulary.
pad_index (:obj:`int`, optional, defaults to 2): pad_index (:obj:`int`, `optional`, defaults to 2):
The index of the padding token in the vocabulary. The index of the padding token in the vocabulary.
unk_index (:obj:`int`, optional, defaults to 3): unk_index (:obj:`int`, `optional`, defaults to 3):
The index of the unknown token in the vocabulary. The index of the unknown token in the vocabulary.
mask_index (:obj:`int`, optional, defaults to 5): mask_index (:obj:`int`, `optional`, defaults to 5):
The index of the masking token in the vocabulary. The index of the masking token in the vocabulary.
is_encoder(:obj:`boolean`, optional, defaults to :obj:`True`): is_encoder (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether the initialized model should be a transformer encoder or decoder as seen in Vaswani et al. Whether or not the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.
summary_type (:obj:`string`, optional, defaults to "first"): summary_type (:obj:`string`, `optional`, defaults to "first"):
Argument used when doing sequence summary. Used in for the multiple choice head in Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
:class:`~transformers.XLMForSequenceClassification`.
Is one of the following options: Has to be one of the following options:
- 'last' => take the last token hidden state (like XLNet) - :obj:`"last"`: Take the last token hidden state (like XLNet).
- 'first' => take the first token hidden state (like Bert) - :obj:`"first"`: Take the first token hidden state (like BERT).
- 'mean' => take the mean of all tokens hidden states - :obj:`"mean"`: Take the mean of all tokens hidden states.
- 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2) - :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- 'attn' => Not implemented now, use multi-head attention - :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`): summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
:class:`~transformers.XLMForSequenceClassification`.
Add a projection after the vector extraction Whether or not to add a projection after the vector extraction.
summary_activation (:obj:`string` or :obj:`None`, optional): summary_activation (:obj:`str`, `optional`):
Argument used when doing sequence summary. Used in for the multiple choice head in Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
:class:`~transformers.XLMForSequenceClassification`.
'tanh' => add a tanh activation to the output, Other => no activation. Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`): summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in Used in the sequence classification and multiple choice models.
:class:`~transformers.XLMForSequenceClassification`.
If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False. Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
summary_first_dropout (:obj:`float`, optional, defaults to 0.1): summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
Argument used when doing sequence summary. Used in for the multiple choice head in Used in the sequence classification and multiple choice models.
:class:`~transformers.XLMForSequenceClassification`.
Add a dropout before the projection and activation The dropout ratio to be used before the projection and activation.
start_n_top (:obj:`int`, optional, defaults to 5): start_n_top (:obj:`int`, `optional`, defaults to 5):
Used in the SQuAD evaluation script for XLM and XLNet. Used in the SQuAD evaluation script.
end_n_top (:obj:`int`, optional, defaults to 5): end_n_top (:obj:`int`, `optional`, defaults to 5):
Used in the SQuAD evaluation script for XLM and XLNet. Used in the SQuAD evaluation script.
mask_token_id (:obj:`int`, optional, defaults to 0): mask_token_id (:obj:`int`, `optional`, defaults to 0):
Model agnostic parameter to identify masked tokens when generating text in an MLM context. Model agnostic parameter to identify masked tokens when generating text in an MLM context.
lang_id (:obj:`int`, optional, defaults to 1): lang_id (:obj:`int`, `optional`, defaults to 1):
The ID of the language used by the model. This parameter is used when generating The ID of the language used by the model. This parameter is used when generating
text in a given language. text in a given language.
""" """
......
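A minimal sketch for the FlauBERT configuration above, relying only on the documented defaults (vocab_size=30145, emb_dim=2048, 12 layers, 16 heads)::

    >>> from transformers import FlaubertConfig, FlaubertModel
    >>> configuration = FlaubertConfig()
    >>> # Model with randomly initialized weights built from that configuration
    >>> model = FlaubertModel(configuration)
    >>> configuration = model.config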
...@@ -18,7 +18,6 @@ ...@@ -18,7 +18,6 @@
import copy import copy
from .configuration_utils import PretrainedConfig from .configuration_utils import PretrainedConfig
from .file_utils import add_start_docstrings_to_callable
from .utils import logging from .utils import logging
...@@ -27,33 +26,54 @@ logger = logging.get_logger(__name__) ...@@ -27,33 +26,54 @@ logger = logging.get_logger(__name__)
FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP = {} FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
FSMT_CONFIG_ARGS_DOC = r""" class DecoderConfig(PretrainedConfig):
r"""
Configuration class for FSMT's decoder specific things.
note: this is a private helper class
"""
model_type = "fsmt_decoder"
def __init__(self, vocab_size=0, bos_token_id=0):
super().__init__()
self.vocab_size = vocab_size
self.bos_token_id = bos_token_id
class FSMTConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.FSMTModel`. It is used to
instantiate an FSMT model according to the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args: Args:
langs (:obj:`List[str]`): langs (:obj:`List[str]`):
source language, target_language (e.g. ['en', 'ru']) A list with source language and target language (e.g., ['en', 'ru']).
src_vocab_size (:obj:`int`): src_vocab_size (:obj:`int`):
defines the different tokens that can be represented by `inputs_ids` passed to the forward Vocabulary size of the encoder. Defines the number of different tokens that can be represented by the
method in the encoder. :obj:`inputs_ids` passed to the forward method in the encoder.
tgt_vocab_size (:obj:`int`): tgt_vocab_size (:obj:`int`):
defines the different tokens that can be represented by `inputs_ids` passed to the forward Vocabulary size of the decoder. Defines the number of different tokens that can be represented by the
method in the decoder. :obj:`inputs_ids` passed to the forward method in the decoder.
d_model (:obj:`int`, `optional`, defaults to 1024): d_model (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the layers and the pooler layer. Dimensionality of the layers and the pooler layer.
encoder_layers (:obj:`int`, `optional`, defaults to 12): encoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of encoder layers, 16 for pegasus, 6 for bart-base and marian Number of encoder layers.
decoder_layers (:obj:`int`, `optional`, defaults to 12): decoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of decoder layers, 16 for pegasus, 6 for bart-base and marian Number of decoder layers.
encoder_attention_heads (:obj:`int`, `optional`, defaults to 16): encoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (:obj:`int`, `optional`, defaults to 16): decoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer decoder. Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096): decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder. Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096): encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder. Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to "relu"): activation_function (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"relu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported. If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
dropout (:obj:`float`, `optional`, defaults to 0.1): dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.0): attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
...@@ -74,7 +94,7 @@ FSMT_CONFIG_ARGS_DOC = r""" ...@@ -74,7 +94,7 @@ FSMT_CONFIG_ARGS_DOC = r"""
eos_token_id (:obj:`int`, `optional`, defaults to 2) eos_token_id (:obj:`int`, `optional`, defaults to 2)
End of stream token id. End of stream token id.
decoder_start_token_id (:obj:`int`, `optional`): decoder_start_token_id (:obj:`int`, `optional`):
This model starts decoding with `eos_token_id` This model starts decoding with :obj:`eos_token_id`
encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
Google "layerdrop arxiv", as its not explainable in one line. Google "layerdrop arxiv", as its not explainable in one line.
decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
...@@ -92,26 +112,14 @@ FSMT_CONFIG_ARGS_DOC = r""" ...@@ -92,26 +112,14 @@ FSMT_CONFIG_ARGS_DOC = r"""
early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`) early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`)
Flag that will be used by default in the :obj:`generate` method of the model. Whether to stop Flag that will be used by default in the :obj:`generate` method of the model. Whether to stop
the beam search when at least ``num_beams`` sentences are finished per batch or not. the beam search when at least ``num_beams`` sentences are finished per batch or not.
"""
Examples::
class DecoderConfig(PretrainedConfig): >>> from transformers import FSMTConfig, FSMTModel
r"""
Configuration class for FSMT's decoder specific things.
note: this is a private helper class
"""
model_type = "fsmt_decoder"
def __init__(self, vocab_size=0, bos_token_id=0):
super().__init__()
self.vocab_size = vocab_size
self.bos_token_id = bos_token_id
>>> config = FSMTConfig.from_pretrained('facebook/wmt19-en-ru')
>>> model = FSMTModel(config)
@add_start_docstrings_to_callable(FSMT_CONFIG_ARGS_DOC)
class FSMTConfig(PretrainedConfig):
r"""
Configuration class for FSMT.
""" """
model_type = "fsmt" model_type = "fsmt"
...@@ -149,17 +157,6 @@ class FSMTConfig(PretrainedConfig): ...@@ -149,17 +157,6 @@ class FSMTConfig(PretrainedConfig):
early_stopping=False, early_stopping=False,
**common_kwargs **common_kwargs
): ):
r"""
:class:`~transformers.FSMTConfig` is the configuration class for `FSMTModel`.
Examples::
>>> from transformers import FSMTConfig, FSMTModel
>>> config = FSMTConfig.from_pretrained('facebook/wmt19-en-ru')
>>> model = FSMTModel(config)
"""
if "hidden_size" in common_kwargs: if "hidden_size" in common_kwargs:
raise ValueError("hidden size is called d_model") raise ValueError("hidden size is called d_model")
super().__init__( super().__init__(
......
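The constructor guard at the end of the hunk above means the usual ``hidden_size`` name is rejected in favor of ``d_model``. A minimal sketch of both sides of that check, assuming the remaining constructor defaults (not shown in this hunk) are sufficient::

    >>> from transformers import FSMTConfig
    >>> # d_model is the accepted name for the hidden dimension
    >>> config = FSMTConfig(d_model=512)
    >>> # Passing hidden_size instead raises the documented ValueError
    >>> try:
    ...     FSMTConfig(hidden_size=512)
    ... except ValueError as error:
    ...     print(error)
    hidden size is called d_model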
...@@ -36,20 +36,21 @@ FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP = { ...@@ -36,20 +36,21 @@ FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class FunnelConfig(PretrainedConfig): class FunnelConfig(PretrainedConfig):
r""" r"""
This is the configuration class to store the configuration of a :class:`~transformers.FunnelModel`. This is the configuration class to store the configuration of a :class:`~transformers.FunnelModel` or a
It is used to instantiate an Funnel Transformer model according to the specified arguments, defining the model :class:`~transformers.TFFunnelModel`. It is used to instantiate a Funnel Transformer model according to the specified
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
the Funnel Transformer `funnel-transformer/small <https://huggingface.co/funnel-transformer/small>`__ architecture. configuration to that of the Funnel Transformer `funnel-transformer/small
<https://huggingface.co/funnel-transformer/small>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information. for more information.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522): vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the Funnel transformer. Defines the different tokens that Vocabulary size of the Funnel transformer. Defines the number of different tokens that can be represented
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.FunnelModel`. by the :obj:`inputs_ids` passed when calling :class:`~transformers.FunnelModel` or
:class:`~transformers.TFFunnelModel`.
block_sizes (:obj:`List[int]`, `optional`, defaults to :obj:`[4, 4, 4]`): block_sizes (:obj:`List[int]`, `optional`, defaults to :obj:`[4, 4, 4]`):
The sizes of the blocks used in the model. The sizes of the blocks used in the model.
block_repeats (:obj:`List[int]`, `optional`): block_repeats (:obj:`List[int]`, `optional`):
...@@ -77,7 +78,8 @@ class FunnelConfig(PretrainedConfig): ...@@ -77,7 +78,8 @@ class FunnelConfig(PretrainedConfig):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 3): type_vocab_size (:obj:`int`, `optional`, defaults to 3):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.FunnelModel`. The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.FunnelModel` or
:class:`~transformers.TFFunnelModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.1): initializer_range (:obj:`float`, `optional`, defaults to 0.1):
The standard deviation of the `uniform initializer` for initializing all weight matrices in attention The standard deviation of the `uniform initializer` for initializing all weight matrices in attention
layers. layers.
......
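A minimal sketch for the Funnel Transformer configuration above, showing ``block_sizes`` (the values used are the documented defaults)::

    >>> from transformers import FunnelConfig, FunnelModel
    >>> # Three blocks of four layers each, as in the documented default
    >>> configuration = FunnelConfig(block_sizes=[4, 4, 4])
    >>> model = FunnelModel(configuration)
    >>> configuration = model.config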
...@@ -32,10 +32,10 @@ GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = { ...@@ -32,10 +32,10 @@ GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class GPT2Config(PretrainedConfig): class GPT2Config(PretrainedConfig):
""" """
This is the configuration class to store the configuration of a :class:`~transformers.GPT2Model`. This is the configuration class to store the configuration of a :class:`~transformers.GPT2Model` or a
It is used to instantiate an GPT-2 model according to the specified arguments, defining the model :class:`~transformers.TFGPT2Model`. It is used to instantiate a GPT-2 model according to the specified
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
the GPT-2 `small <https://huggingface.co/gpt2>`__ architecture. configuration to that of the GPT-2 `small <https://huggingface.co/gpt2>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
...@@ -43,60 +43,66 @@ class GPT2Config(PretrainedConfig): ...@@ -43,60 +43,66 @@ class GPT2Config(PretrainedConfig):
Args: Args:
vocab_size (:obj:`int`, optional, defaults to 50257): vocab_size (:obj:`int`, `optional`, defaults to 50257):
Vocabulary size of the GPT-2 model. Defines the different tokens that Vocabulary size of the GPT-2 model. Defines the number of different tokens that can be represented by the
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.GPT2Model`. :obj:`inputs_ids` passed when calling :class:`~transformers.GPT2Model` or
n_positions (:obj:`int`, optional, defaults to 1024): :class:`~transformers.TFGPT2Model`.
n_positions (:obj:`int`, `optional`, defaults to 1024):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
n_ctx (:obj:`int`, optional, defaults to 1024): n_ctx (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the causal mask (usually same as n_positions). Dimensionality of the causal mask (usually same as n_positions).
n_embd (:obj:`int`, optional, defaults to 768): n_embd (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the embeddings and hidden states. Dimensionality of the embeddings and hidden states.
n_layer (:obj:`int`, optional, defaults to 12): n_layer (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder. Number of hidden layers in the Transformer encoder.
n_head (:obj:`int`, optional, defaults to 12): n_head (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
n_inner (:obj:`int`, optional, defaults to None): n_inner (:obj:`int`, `optional`, defaults to None):
Dimensionality of the inner feed-forward layers. :obj:`None` will set it to 4 times n_embd Dimensionality of the inner feed-forward layers. :obj:`None` will set it to 4 times n_embd
activation_function (:obj:`str`, optional, defaults to 'gelu'): activation_function (:obj:`str`, `optional`, defaults to :obj:`"gelu"`):
Activation function selected in the list ["relu", "swish", "gelu", "tanh", "gelu_new"]. Activation function, to be selected in the list :obj:`["relu", "swish", "gelu", "tanh", "gelu_new"]`.
resid_pdrop (:obj:`float`, optional, defaults to 0.1): resid_pdrop (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
embd_pdrop (:obj:`int`, optional, defaults to 0.1): embd_pdrop (:obj:`int`, `optional`, defaults to 0.1):
The dropout ratio for the embeddings. The dropout ratio for the embeddings.
attn_pdrop (:obj:`float`, optional, defaults to 0.1): attn_pdrop (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention. The dropout ratio for the attention.
layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5): layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-5):
The epsilon to use in the layer normalization layers The epsilon to use in the layer normalization layers
initializer_range (:obj:`float`, optional, defaults to 0.02): initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
summary_type (:obj:`string`, optional, defaults to "cls_index"): summary_type (:obj:`string`, `optional`, defaults to :obj:`"cls_index"`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
Has to be one of the following options:
- :obj:`"last"`: Take the last token hidden state (like XLNet).
- :obj:`"first"`: Take the first token hidden state (like BERT).
- :obj:`"mean"`: Take the mean of all tokens hidden states.
- :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
Whether or not to add a projection after the vector extraction.
summary_activation (:obj:`str`, `optional`):
Argument used when doing sequence summary. Used for the multiple choice head in Argument used when doing sequence summary. Used for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`. :class:`~transformers.GPT2DoubleHeadsModel`.
Is one of the following options:
Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
- 'last' => take the last token hidden state (like XLNet) summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
- 'first' => take the first token hidden state (like Bert) Argument used when doing sequence summary, used in the models
- 'mean' => take the mean of all tokens hidden states :class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
- 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
- 'attn' => Not implemented now, use multi-head attention Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`): summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
Argument used when doing sequence summary. Used in for the multiple choice head in Argument used when doing sequence summary, used in the models
:class:`~transformers.GPT2DoubleHeadsModel`. :class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
Add a projection after the vector extraction
summary_activation (:obj:`string` or :obj:`None`, optional): The dropout ratio to be used before the projection and activation.
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`.
'tanh' => add a tanh activation to the output, Other => no activation.
summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`.
If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.
summary_first_dropout (:obj:`float`, optional, defaults to 0.1):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`.
Add a dropout before the projection and activation
Example:: Example::
......
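The ``summary_*`` arguments above only come into play for the double-heads model. A minimal sketch of routing them through the configuration (the values are the documented defaults)::

    >>> from transformers import GPT2Config, GPT2DoubleHeadsModel
    >>> # Documented defaults for the sequence-summary head
    >>> configuration = GPT2Config(summary_type="cls_index", summary_use_proj=True, summary_first_dropout=0.1)
    >>> # The double-heads model is the one that consumes these settings
    >>> model = GPT2DoubleHeadsModel(configuration)
    >>> configuration = model.config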
...@@ -33,6 +33,10 @@ LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = { ...@@ -33,6 +33,10 @@ LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class LongformerConfig(RobertaConfig): class LongformerConfig(RobertaConfig):
r""" r"""
This is the configuration class to store the configuration of a :class:`~transformers.LongformerModel` or a
:class:`~transformers.TFLongformerModel`. It is used to instantiate a Longformer model according to the specified
arguments, defining the model architecture.
This is the configuration class to store the configuration of a :class:`~transformers.LongformerModel`. This is the configuration class to store the configuration of a :class:`~transformers.LongformerModel`.
It is used to instantiate a Longformer model according to the specified arguments, defining the model It is used to instantiate a Longformer model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
...@@ -42,8 +46,8 @@ class LongformerConfig(RobertaConfig): ...@@ -42,8 +46,8 @@ class LongformerConfig(RobertaConfig):
It reuses the same defaults. Please check the parent class for more information. It reuses the same defaults. Please check the parent class for more information.
Args: Args:
attention_window (:obj:`int` or :obj:`List[int]`, optional, defaults to 512): attention_window (:obj:`int` or :obj:`List[int]`, `optional`, defaults to 512):
Size of an attention window around each token. If :obj:`int`, use the same size for all layers. Size of an attention window around each token. If an :obj:`int`, use the same size for all layers.
To specify a different window size for each layer, use a :obj:`List[int]` where To specify a different window size for each layer, use a :obj:`List[int]` where
``len(attention_window) == num_hidden_layers``. ``len(attention_window) == num_hidden_layers``.
......
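Since ``attention_window`` accepts either one size for every layer or one size per layer, a minimal sketch of both forms (the per-layer list must contain ``num_hidden_layers`` entries)::

    >>> from transformers import LongformerConfig, LongformerModel
    >>> # A single window size shared by every layer
    >>> configuration = LongformerConfig(attention_window=512)
    >>> # Or one (even) window size per hidden layer
    >>> configuration = LongformerConfig(attention_window=[256] * configuration.num_hidden_layers)
    >>> model = LongformerModel(configuration)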
...@@ -29,83 +29,91 @@ LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP = { ...@@ -29,83 +29,91 @@ LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class LxmertConfig(PretrainedConfig): class LxmertConfig(PretrainedConfig):
r""" r"""
This is the configuration class to store the configuration of a :class:`~transformers.BertModel`. This is the configuration class to store the configuration of a :class:`~transformers.LxmertModel` or a
It is used to instantiate an Lxmert model according to the specified arguments, defining the model :class:`~transformers.TFLxmertModel`. It is used to instantiate a LXMERT model according to the specified
architecture. arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args: Args:
vocab_size (:obj:`int`, optional, defaults to 30522): vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the BERT model. Defines the different tokens that Vocabulary size of the LXMERT model. Defines the number of different tokens that can be represented by the
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertModel`. :obj:`inputs_ids` passed when calling :class:`~transformers.LxmertModel` or
hidden_size (:obj:`int`, optional, defaults to 768): :class:`~transformers.TFLxmertModel`.
hidden_size (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer. Dimensionality of the encoder layers and the pooler layer.
r_layers (:obj:`int`, optional, defaults to 5): r_layers (:obj:`int`, `optional`, defaults to 5):
Number of hidden layers in the Transformer visual encoder. Number of hidden layers in the Transformer visual encoder.
l_layers (:obj:`int`, optional, defaults to 9): l_layers (:obj:`int`, `optional`, defaults to 9):
Number of hidden layers in the Transformer language encoder. Number of hidden layers in the Transformer language encoder.
x_layers (:obj:`int`, optional, defaults to 5): x_layers (:obj:`int`, `optional`, defaults to 5):
Number of hidden layers in the Transformer cross modality encoder. Number of hidden layers in the Transformer cross modality encoder.
num_attention_heads (:obj:`int`, optional, defaults to 5): num_attention_heads (:obj:`int`, `optional`, defaults to 5):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, optional, defaults to 3072): intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"): hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported. If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1): hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1): attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities. The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, optional, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048). Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, optional, defaults to 2): type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.BertModel`. The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.LxmertModel` or :class:`~transformers.TFLxmertModel`.
initializer_range (:obj:`float`, optional, defaults to 0.02): initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12): layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers. The epsilon used by the layer normalization layers.
visual_feat_dim (:obj:`int`, optional, defaults to 2048): visual_feat_dim (:obj:`int`, `optional`, defaults to 2048):
This represents the last dimension of the pooled-object features used as input for the model, This represents the last dimension of the pooled-object features used as input for the model,
representing the size of each object feature itself. representing the size of each object feature itself.
visual_pos_dim (:obj:`int`, optional, defaults to 4): visual_pos_dim (:obj:`int`, `optional`, defaults to 4):
This represents the number of spatial features that are mixed into the visual features. This represents the number of spatial features that are mixed into the visual features.
The default is set to 4 because most commonly this will represent the location of a bounding box. The default is set to 4 because most commonly this will represent the location of a bounding box.
i.e. (x, y, width, height) i.e., (x, y, width, height)
visual_loss_normalizer (:obj:`float`, optional, defaults to 1/15): visual_loss_normalizer (:obj:`float`, `optional`, defaults to 1/15):
This represents the scaling factor by which each visual loss is multiplied if, during pretraining, This represents the scaling factor by which each visual loss is multiplied if, during pretraining,
one decided to train with multiple vision-based loss objectives. one decided to train with multiple vision-based loss objectives.
num_qa_labels (:obj:`int`, optional, defaults to 9500): num_qa_labels (:obj:`int`, `optional`, defaults to 9500):
This represents the total number of different question answering (QA) labels there are. If using more than one dataset with QA, This represents the total number of different question answering (QA) labels there are. If using more than
the user will need to account for the total number of labels that all of the datasets have in total. one dataset with QA, the user will need to account for the total number of labels that all of the datasets
num_object_labels (:obj:`int`, optional, defaults to 1600): have in total.
This represents the total number of semantically unique objects that lxmert will be able to classify a pooled-object feature num_object_labels (:obj:`int`, `optional`, defaults to 1600):
as belonging too. This represents the total number of semantically unique objects that lxmert will be able to classify a
num_attr_labels (:obj:`int`, optional, defaults to 400): pooled-object feature as belonging too.
This represents the total number of semantically unique attributes that lxmert will be able to classify a pooled-object feature num_attr_labels (:obj:`int`, `optional`, defaults to 400):
as possessing. This represents the total number of semantically unique attributes that lxmert will be able to classify a
task_matched (:obj:`bool`, optional, defaults to :obj:`True`): pooled-object feature as possessing.
This task is used for sentence-image matching. If the sentence correctly describes the image the label will be 1. task_matched (:obj:`bool`, `optional`, defaults to :obj:`True`):
If the sentence does not correctly describe the image, the label will be 0. This task is used for sentence-image matching. If the sentence correctly describes the image the label
task_mask_lm (:obj:`bool`, optional, defaults to :obj:`True`): will be 1. If the sentence does not correctly describe the image, the label will be 0.
This task is the defacto masked langauge modeling used in pretraining models such as BERT. task_mask_lm (:obj:`bool`, `optional`, defaults to :obj:`True`):
task_obj_predict (:obj:`bool`, optional, defaults to :obj:`True`): Whether or not to add masked language modeling (as used in pretraining models such as BERT) to the loss
This task is set to true if the user would like to perform one of the following loss objectives: objective.
object predicition, atrribute predicition, feature regression task_obj_predict (:obj:`bool`, `optional`, defaults to :obj:`True`):
task_qa (:obj:`bool`, optional, defaults to :obj:`True`): Whether or not to add object predicition, attribute predicition and feature regression to the loss
This task specifies whether or not Lxmert will calculate the question-asnwering loss objective objective.
visual_obj_loss (:obj:`bool`, optional, defaults to :obj:`True`): task_qa (:obj:`bool`, `optional`, defaults to :obj:`True`):
This task specifies whether or not Lxmert will calculate the object-prediction loss objective Whether or not to add the question-asnwering loss to the objective
visual_attr_loss (:obj:`bool`, optional, defaults to :obj:`True`): visual_obj_loss (:obj:`bool`, `optional`, defaults to :obj:`True`):
This task specifies whether or not Lxmert will calculate the attribute-prediction loss objective Whether or not to calculate the object-prediction loss objective
visual_feat_loss (:obj:`bool`, optional, defaults to :obj:`True`): visual_attr_loss (:obj:`bool`, `optional`, defaults to :obj:`True`):
This task specifies whether or not Lxmert will calculate the feature-regression loss objective Whether or not to calculate the attribute-prediction loss objective
output_attentions (:obj:`bool`, optional, defaults to :obj:`False`): visual_feat_loss (:obj:`bool`, `optional`, defaults to :obj:`True`):
if True, the vision, langauge, and cross-modality layers will be returned Whether or not to calculate the feature-regression loss objective
output_hidden_states (:obj:`bool`, optional, defaults to :obj:`False`): output_attentions (:obj:`bool`, `optional`, defaults to :obj:`False`):
if True, final cross-modality hidden states for language and vision features will be returned Whether or not the model should return the attentions from the vision, langauge, and cross-modality
layers should be returned.
output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the model should return the hidden states from the vision, langauge, and cross-modality
layers should be returned.
""" """
model_type = "lxmert" model_type = "lxmert"
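To make the vision-related sizes above concrete, here is a minimal sketch of building a configuration for detector-style inputs; the values simply restate the documented defaults, and the import names assume the LXMERT classes this docstring belongs to::

    from transformers import LxmertConfig, LxmertModel

    config = LxmertConfig(
        visual_feat_dim=2048,   # size of each pooled-object feature vector
        visual_pos_dim=4,       # bounding box per object: (x, y, width, height)
        num_qa_labels=9500,     # total number of QA labels across the QA datasets used
        task_obj_predict=True,  # object/attribute prediction and feature regression losses
        task_qa=True,           # question-answering loss
    )
    model = LxmertModel(config)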
...
...@@ -22,15 +22,16 @@ logger = logging.get_logger(__name__)
class MMBTConfig(object):
    """
    This is the configuration class to store the configuration of a :class:`~transformers.MMBTModel`. It is used to
    instantiate an MMBT model according to the specified arguments, defining the model architecture.

    Args:
        config (:class:`~transformers.PretrainedConfig`):
            Config of the underlying Transformer models. Its values are copied over to use a single config.
        num_labels (:obj:`int`, `optional`):
            Size of final Linear layer for classification.
        modal_hidden_size (:obj:`int`, `optional`, defaults to 2048):
            Embedding dimension of the non-text modality encoder.
    """
...
...@@ -25,9 +25,9 @@ MOBILEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class MobileBertConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a :class:`~transformers.MobileBertModel` or a
    :class:`~transformers.TFMobileBertModel`. It is used to instantiate a MobileBERT model according to the
    specified arguments, defining the model architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the
    model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
...@@ -35,54 +35,56 @@ class MobileBertConfig(PretrainedConfig):
    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 30522):
            Vocabulary size of the MobileBERT model. Defines the number of different tokens that can be
            represented by the :obj:`inputs_ids` passed when calling :class:`~transformers.MobileBertModel` or
            :class:`~transformers.TFMobileBertModel`.
        hidden_size (:obj:`int`, `optional`, defaults to 512):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (:obj:`int`, `optional`, defaults to 24):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (:obj:`int`, `optional`, defaults to 4):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (:obj:`int`, `optional`, defaults to 512):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"relu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string,
            :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.0):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to
            something large just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (:obj:`int`, `optional`, defaults to 2):
            The vocabulary size of the :obj:`token_type_ids` passed when calling
            :class:`~transformers.MobileBertModel` or :class:`~transformers.TFMobileBertModel`.
        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        pad_token_id (:obj:`int`, `optional`, defaults to 0):
            The ID of the token in the word embedding to use as padding.
        embedding_size (:obj:`int`, `optional`, defaults to 128):
            The dimension of the word embedding vectors.
        trigram_input (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Use a convolution of trigram as input.
        use_bottleneck (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether to use bottleneck in BERT.
        intra_bottleneck_size (:obj:`int`, `optional`, defaults to 128):
            Size of bottleneck layer output.
        use_bottleneck_attention (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether to use attention inputs from the bottleneck transformation.
        key_query_shared_bottleneck (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether to use the same linear transformation for query and key in the bottleneck.
        num_feedforward_networks (:obj:`int`, `optional`, defaults to 4):
            Number of FFNs in a block.
        normalization_type (:obj:`str`, `optional`, defaults to :obj:`"no_norm"`):
            The normalization type in MobileBERT.

    Examples:

        >>> from transformers import MobileBertModel, MobileBertConfig
...
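The doctest above is cut off by the collapsed part of the diff; presumably it continues with the usual configuration/model round trip, sketched here as standard usage rather than text copied from this commit::

    from transformers import MobileBertConfig, MobileBertModel

    # Build a configuration with the documented defaults (24 layers, hidden size 512, ...)
    configuration = MobileBertConfig()

    # Instantiate a randomly initialized model from that configuration
    model = MobileBertModel(configuration)

    # The configuration can always be recovered from the model
    configuration = model.config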
...@@ -28,73 +28,79 @@ OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class OpenAIGPTConfig(PretrainedConfig):
    """
    This is the configuration class to store the configuration of a :class:`~transformers.OpenAIGPTModel` or a
    :class:`~transformers.TFOpenAIGPTModel`. It is used to instantiate a GPT model according to the specified
    arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a
    similar configuration to that of the `GPT <https://huggingface.co/openai-gpt>`__ architecture from OpenAI.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the
    model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.

    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 40478):
            Vocabulary size of the GPT model. Defines the number of different tokens that can be represented
            by the :obj:`inputs_ids` passed when calling :class:`~transformers.OpenAIGPTModel` or
            :class:`~transformers.TFOpenAIGPTModel`.
        n_positions (:obj:`int`, `optional`, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to
            something large just in case (e.g., 512 or 1024 or 2048).
        n_ctx (:obj:`int`, `optional`, defaults to 512):
            Dimensionality of the causal mask (usually same as n_positions).
        n_embd (:obj:`int`, `optional`, defaults to 768):
            Dimensionality of the embeddings and hidden states.
        n_layer (:obj:`int`, `optional`, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        n_head (:obj:`int`, `optional`, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        afn (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string,
            :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
        resid_pdrop (:obj:`float`, `optional`, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        embd_pdrop (:obj:`int`, `optional`, defaults to 0.1):
            The dropout ratio for the embeddings.
        attn_pdrop (:obj:`float`, `optional`, defaults to 0.1):
            The dropout ratio for the attention.
        layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-5):
            The epsilon to use in the layer normalization layers.
        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        predict_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether or not special tokens should be predicted when the model has a language modeling head.
        summary_type (:obj:`str`, `optional`, defaults to :obj:`"cls_index"`):
            Argument used when doing sequence summary, used in the models
            :class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.

            Has to be one of the following options:

            - :obj:`"last"`: Take the last token hidden state (like XLNet).
            - :obj:`"first"`: Take the first token hidden state (like BERT).
            - :obj:`"mean"`: Take the mean of all tokens hidden states.
            - :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
            - :obj:`"attn"`: Not implemented now, use multi-head attention.
        summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Argument used when doing sequence summary, used in the models
            :class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.

            Whether or not to add a projection after the vector extraction.
        summary_activation (:obj:`str`, `optional`):
            Argument used when doing sequence summary, used in the models
            :class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.

            Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
        summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Argument used when doing sequence summary, used in the models
            :class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.

            Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size`
            classes.
        summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
            Argument used when doing sequence summary, used in the models
            :class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.

            The dropout ratio to be used after the projection and activation.

    Examples::

        >>> from transformers import OpenAIGPTConfig, OpenAIGPTModel
...
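The :obj:`summary_*` arguments above only come into play for the multiple-choice head; a brief sketch of overriding them when building the double-heads model (the particular values are illustrative, not pulled from this diff)::

    from transformers import OpenAIGPTConfig, OpenAIGPTDoubleHeadsModel

    config = OpenAIGPTConfig(
        summary_type="cls_index",     # summarize at a supplied classification token position
        summary_use_proj=True,        # add a projection after extracting the summary vector
        summary_activation="tanh",    # tanh activation on the projected summary
        summary_proj_to_labels=True,  # project to config.num_labels classes instead of hidden_size
        summary_first_dropout=0.1,    # dropout used in the sequence summary
    )
    model = OpenAIGPTDoubleHeadsModel(config)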
...@@ -29,96 +29,120 @@ REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class ReformerConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a :class:`~transformers.ReformerModel`. It is
    used to instantiate a Reformer model according to the specified arguments, defining the model architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the
    model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.

    Args:
        attention_head_size (:obj:`int`, `optional`, defaults to 64):
            Dimensionality of the projected key, query and value vectors.
        attn_layers (:obj:`List[str]`, `optional`, defaults to :obj:`["local", "lsh", "local", "lsh", "local", "lsh"]`):
            List of attention layer types in ascending order. It can be chosen between an LSHSelfAttention layer
            (:obj:`"lsh"`) and a LocalSelfAttention layer (:obj:`"local"`).

            For more information on LSHSelfAttention layer, see `LSH Self Attention
            <reformer.html#lsh-self-attention>`__. For more information on LocalSelfAttention layer, see `Local
            Self Attention <reformer.html#local-sensitive-hashing-self-attention>`__.
        axial_pos_embds (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether or not to use axial position embeddings. For more information on how axial position
            embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.
        axial_norm_std (:obj:`float`, `optional`, defaults to 1.0):
            The standard deviation of the normal_initializer for initializing the weight matrices of the axial
            positional encodings.
        axial_pos_shape (:obj:`List[int]`, `optional`, defaults to :obj:`[64, 64]`):
            The position dims of the axial position encodings. During training, the product of the position dims
            has to be equal to the sequence length.

            For more information on how axial position embeddings work, see `Axial Position Encodings
            <reformer.html#axial-positional-encodings>`__.
        axial_pos_embds_dim (:obj:`List[int]`, `optional`, defaults to :obj:`[64, 192]`):
            The embedding dims of the axial position encodings. The sum of the embedding dims has to be equal to
            the hidden size.

            For more information on how axial position embeddings work, see `Axial Position Encodings
            <reformer.html#axial-positional-encodings>`__.
        chunk_size_lm_head (:obj:`int`, `optional`, defaults to 0):
            The chunk size of the final language model feed forward head layer. A chunk size of 0 means that the
            feed forward layer is not chunked. A chunk size of n means that the feed forward layer processes
            n < sequence_length embeddings at a time.

            For more information on feed forward chunking, see `How does Feed Forward Chunking work?
            <../glossary.html#feed-forward-chunking>`__.
        eos_token_id (:obj:`int`, `optional`, defaults to 2):
            The token id for the end-of-sentence token.
        feed_forward_size (:obj:`int`, `optional`, defaults to 512):
            Dimensionality of the feed_forward layer in the residual attention block.
        hash_seed (:obj:`int`, `optional`):
            Seed that can be used to make local sensitive hashing in :obj:`LSHSelfAttention` deterministic. This
            should only be set for testing purposes. For evaluation and training purposes, :obj:`hash_seed` should
            be left as :obj:`None` to ensure fully random rotations in the local sensitive hashing scheme.
        hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"relu"`):
            The non-linear activation function (function or string) in the feed forward layer in the residual
            attention block. If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are
            supported.
        hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.05):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        hidden_size (:obj:`int`, `optional`, defaults to 256):
            Dimensionality of the output hidden states of the residual attention blocks.
        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        is_decoder (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether or not to use a causal mask in addition to the :obj:`attention_mask` passed to
            :class:`~transformers.ReformerModel`. When using the Reformer for causal language modeling, this
            argument should be set to :obj:`True`.
        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        local_chunk_length (:obj:`int`, `optional`, defaults to 64):
            Length of chunk which attends to itself in :obj:`LocalSelfAttention`. Chunking reduces memory
            complexity from sequence length x sequence length (self attention) to
            chunk length x chunk length x sequence length / chunk length (chunked self attention).
        local_num_chunks_before (:obj:`int`, `optional`, defaults to 1):
            Number of previous neighbouring chunks to attend to in the :obj:`LocalSelfAttention` layer in addition
            to the chunk itself.
        local_num_chunks_after (:obj:`int`, `optional`, defaults to 0):
            Number of following neighbouring chunks to attend to in the :obj:`LocalSelfAttention` layer in
            addition to the chunk itself.
        local_attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
            The dropout ratio for the attention probabilities in :obj:`LocalSelfAttention`.
        lsh_attn_chunk_length (:obj:`int`, `optional`, defaults to 64):
            Length of chunk which attends to itself in :obj:`LSHSelfAttention`. Chunking reduces memory complexity
            from sequence length x sequence length (self attention) to
            chunk length x chunk length x sequence length / chunk length (chunked self attention).
        lsh_num_chunks_before (:obj:`int`, `optional`, defaults to 1):
            Number of previous neighbouring chunks to attend to in the :obj:`LSHSelfAttention` layer in addition
            to the chunk itself.
        lsh_num_chunks_after (:obj:`int`, `optional`, defaults to 0):
            Number of following neighbouring chunks to attend to in the :obj:`LSHSelfAttention` layer in addition
            to the chunk itself.
        lsh_attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
            The dropout ratio for the attention probabilities in :obj:`LSHSelfAttention`.
        max_position_embeddings (:obj:`int`, `optional`, defaults to 4096):
            The maximum sequence length that this model might ever be used with. Typically set this to something
            large just in case (e.g., 512 or 1024 or 2048).
        num_attention_heads (:obj:`int`, `optional`, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        num_buckets (:obj:`int` or :obj:`List[int]`, `optional`):
            Number of buckets the key query vectors can be "hashed into" using the locality sensitive hashing
            scheme. Each query key vector is hashed into a hash in :obj:`1, ..., num_buckets`. The number of
            buckets can also be factorized into a list for improved memory complexity. In this case, each query
            key vector is hashed into a hash in
            :obj:`1-1, 1-2, ..., num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1]` if :obj:`num_buckets` is
            factorized into two factors. The number of buckets (or the product of the factors) should
            approximately equal sequence length / lsh_chunk_length. If :obj:`num_buckets` is not set, a good value
            is calculated on the fly.
        num_hashes (:obj:`int`, `optional`, defaults to 1):
            Number of hashing rounds (e.g., number of random rotations) in Local Sensitive Hashing scheme. The
            higher :obj:`num_hashes`, the more accurate the :obj:`LSHSelfAttention` becomes, but also the more
            memory and time intensive the hashing becomes.
        pad_token_id (:obj:`int`, `optional`, defaults to 0):
            The token id for the padding token.
        vocab_size (:obj:`int`, `optional`, defaults to 320):
            Vocabulary size of the Reformer model. Defines the number of different tokens that can be represented
            by the :obj:`inputs_ids` passed when calling :class:`~transformers.ReformerModel`.
        tie_word_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether to tie input and output embeddings.

    Examples::

        >>> from transformers import ReformerModel, ReformerConfig
...
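The two axial-position constraints spelled out above (the product of :obj:`axial_pos_shape` must equal the training sequence length, and the sum of :obj:`axial_pos_embds_dim` must equal the hidden size) are easy to check numerically; a sketch, where the training sequence length of 4096 is an assumption chosen to match the defaults::

    from transformers import ReformerConfig, ReformerModel

    sequence_length = 4096  # assumed training sequence length for this sketch

    config = ReformerConfig(
        hidden_size=256,
        attn_layers=["local", "lsh", "local", "lsh", "local", "lsh"],
        axial_pos_embds=True,
        axial_pos_shape=[64, 64],       # product of the position dims: 64 * 64 == 4096
        axial_pos_embds_dim=[64, 192],  # sum of the embedding dims: 64 + 192 == 256 == hidden_size
        is_decoder=True,                # causal mask for language modeling
    )

    # The constraints described in the docstring above:
    assert config.axial_pos_shape[0] * config.axial_pos_shape[1] == sequence_length
    assert sum(config.axial_pos_embds_dim) == config.hidden_size

    model = ReformerModel(config)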