Unverified commit 3323146e authored by Sylvain Gugger, committed by GitHub

Models doc (#7345)



* Clean up model documentation

* Formatting

* Preparation work

* Long lines

* Main work on rst files

* Cleanup all config files

* Syntax fix

* Clean all tokenizers

* Work on first models

* Models beginning

* FlauBERT

* All PyTorch models

* All models

* Long lines again

* Fixes

* More fixes

* Update docs/source/model_doc/bert.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update docs/source/model_doc/electra.rst
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Last fixes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
parent 58405a52
Tokenizer summary
-----------------
On this page, we will take a closer look at tokenization. As we saw in
:doc:`the preprocessing tutorial <preprocessing>`, tokenizing a text means splitting it into words or subwords, which
are then converted to ids. The second part is pretty straightforward; here we will focus on the first part. More
specifically, we will look at the three main kinds of tokenizers used in 🤗 Transformers:
:ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>` and
:ref:`SentencePiece <sentencepiece>`, and provide examples of models using each of them.

Note that on each model page, you can look at the documentation of the associated tokenizer to know which of these
algorithms the pretrained model used. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
that it uses :ref:`WordPiece <wordpiece>`.
Introduction to tokenization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing it.
For instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do." A first simple way of
tokenizing this text is just to split it on spaces, which would give:

::

    ["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]
This is a nice first step, but if we look at the tokens "Transformers?" or "do.", we can see we can do better. Those
will be different from the tokens "Transformers" and "do" for our model, so we should probably take the punctuation
into account. This would give:

::

    ["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]

which is already better. One thing that is annoying, though, is how it dealt with "Don't". "Don't" stands for "do
not", so it would be better tokenized as ``["Do", "n't"]``. This is where things start getting more complicated, and
part of the reason each kind of model has its own tokenizer class. Depending on the rules we apply to split our texts
into tokens, we'll get different tokenized versions of the same text. And of course, a given pretrained model won't
perform properly if you don't use the exact same rules as the people who pretrained it.
`spaCy <https://spacy.io/>`__ and `Moses <http://www.statmt.org/moses/?n=Development.GetStarted>`__ are two popular
rule-based tokenizers. On the text above, they'd output something like:

::

    ["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
Space/punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is splitting
a sentence into words. While it's the most intuitive way to separate texts into smaller chunks, it can become a
problem when you have a huge corpus: it usually yields a very big vocabulary (the set of all unique tokens used).
:doc:`Transformer XL <model_doc/transformerxl>`, for instance, uses space/punctuation tokenization and has a
vocabulary size of 267,735!

A huge vocabulary size means a huge embedding matrix at the start of the model, which will cause memory problems.
Transformer XL deals with this by using a special kind of embeddings called adaptive embeddings, but in general,
transformers models rarely have a vocabulary size greater than 50,000, especially if they are trained on a single
language.
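To give an order of magnitude: with a hidden size of 1,024, for instance, an embedding matrix for 267,735 tokens
alone holds 267,735 × 1,024 ≈ 274 million parameters, or roughly 1GB of memory in float32.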
So if tokenizing on words is unsatisfactory, we could go in the opposite direction and simply tokenize on characters.
While it's very simple and would save a lot of memory, this doesn't allow the model to learn representations of texts
that are as meaningful as with a word tokenization, leading to a loss of performance. So, to get the best of both
worlds, all transformers models use a hybrid between word-level and character-level tokenization called subword
tokenization.
Subword tokenization
^^^^^^^^^^^^^^^^^^^^
Subword tokenization algorithms rely on the principle that the most common words should be left as is, but rare words
should be decomposed into meaningful subword units. For instance, "annoyingly" might be considered a rare word and
decomposed as "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you
can form (almost) arbitrarily long complex words by stringing subwords together.
This allows the model to keep a reasonable vocabulary while still learning useful representations for common words or
subwords. This also enables the model to process words it has never seen before, by decomposing them into
subwords it knows. For instance, the base :class:`~transformers.BertTokenizer` will tokenize "I have a new GPU!" like
this:
.. code-block::

    >>> from transformers import BertTokenizer
    >>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    >>> tokenizer.tokenize("I have a new GPU!")
    ['i', 'have', 'a', 'new', 'gp', '##u', '!']
Since we are considering the uncased model, the sentence was lowercased first. Then all the words were present in the
vocabulary of the tokenizer except "gpu", so the tokenizer split it into subwords it knows: "gp" and "##u". The "##"
means that the rest of the token should be attached to the previous one, without a space (which matters when we need
to decode predictions and reverse the tokenization).
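Continuing the snippet above, we can glue the pieces back together with the tokenizer's ``convert_tokens_to_string``
method (the output shown below is what we expect from the rule just described):

.. code-block::

    >>> tokenizer.convert_tokens_to_string(['i', 'have', 'a', 'new', 'gp', '##u', '!'])
    'i have a new gpu !'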
Another example is when we use the base :class:`~transformers.XLNetTokenizer` to tokenize our previous text:
.. code-block::

    >>> from transformers import XLNetTokenizer
    >>> tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
    >>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
    ['▁Don', "'", 't', '▁you', '▁love', '▁', '🤗', '▁', 'Transform', 'ers', '?', '▁We', '▁sure', '▁do', '.']
We'll get back to the meaning of those '▁' characters when we look at :ref:`SentencePiece <sentencepiece>`, but you
can already see that "Transformers" has been split into "Transform" and "ers".
Let's now look at how the different subword tokenization algorithms work. Note that they all rely on some form of
training which is usually done on the corpus the corresponding model will be trained on.
.. _byte-pair-encoding:
Byte-Pair Encoding
~~~~~~~~~~~~~~~~~~
Byte-Pair Encoding was introduced in `this paper <https://arxiv.org/abs/1508.07909>`__. It relies on a pretokenizer
that splits the training data into words and counts the frequency of each word in the training corpus. The
pretokenization can be simple space tokenization (:doc:`GPT-2 <model_doc/gpt2>` and
:doc:`RoBERTa <model_doc/roberta>` use this, for instance) or rule-based tokenization (:doc:`XLM <model_doc/xlm>`
uses Moses for most languages, as does :doc:`FlauBERT <model_doc/flaubert>`, while :doc:`GPT <model_doc/gpt>` uses
spaCy and ftfy).

The tokenizer then starts from the list of all characters present in the corpus and learns merge rules that form a
new token from two symbols of the current vocabulary, repeating this until the vocabulary has reached the desired
size (which is a hyperparameter to pick).
Let's say that after the pretokenization we have the following words (the number indicating the frequency of each
word):

::

    ('hug', 10), ('pug', 5), ('pun', 12), ('bun', 4), ('hugs', 5)

Then the base vocabulary is ``['b', 'g', 'h', 'n', 'p', 's', 'u']`` and all our words are first split into
characters:

::

    ('h' 'u' 'g', 10), ('p' 'u' 'g', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'u' 'g' 's', 5)
We then look at how often each pair of consecutive symbols appears and pick the most frequent one. For instance, 'hu'
is present `10 + 5 = 15` times (10 times in the 10 occurrences of 'hug', 5 times in the 5 occurrences of 'hugs'). The
most frequent pair here is 'ug', present `10 + 5 + 5 = 20` times in total. So the first merge rule the tokenizer
learns is to group every 'u' followed by a 'g' together, and it adds 'ug' to the vocabulary. Our corpus then becomes

::

    ('h' 'ug', 10), ('p' 'ug', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'ug' 's', 5)

and we continue by looking at the next most frequent pair of symbols. It's 'un', present 16 times, so we merge those
two and add 'un' to the vocabulary. Then it's 'hug' (as 'h' + 'ug'), present 15 times, so we merge those two and add
'hug' to the vocabulary.
At this stage, the vocabulary is ``['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']`` and our corpus is
represented as
::

    ('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)
If we stop there, the tokenizer can apply the rules it has learned to new words, as long as they don't contain
characters that were not in the base vocabulary. For instance, 'bug' would be tokenized as ``['b', 'ug']`` but 'mug'
would be tokenized as ``['<unk>', 'ug']`` since 'm' is not in the base vocabulary. In practice this rarely happens to
letters (since the base corpus usually contains all of them), but it does happen to special characters like emojis.
As we said before, the vocabulary size (which is the base vocabulary size + the number of merges) is a hyperparameter
to choose. For instance, :doc:`GPT <model_doc/gpt>` has a vocabulary size of 40,478: it has 478 base characters and
the training of the tokenizer was stopped after 40,000 merges.
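To make the procedure concrete, here is a minimal (and deliberately naive) sketch of this training loop in plain
Python, run on the toy corpus above; it only mirrors the walkthrough and is not the optimized implementation used by
real tokenizers:

.. code-block::

    import collections

    def get_pair_counts(corpus):
        """Count how often each adjacent pair of symbols appears, weighted by word frequency."""
        pairs = collections.Counter()
        for symbols, freq in corpus:
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        return pairs

    def merge_pair(corpus, pair):
        """Apply a merge rule: replace every occurrence of ``pair`` by the fused symbol."""
        merged_corpus = []
        for symbols, freq in corpus:
            new_symbols, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                    new_symbols.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1
            merged_corpus.append((new_symbols, freq))
        return merged_corpus

    # The toy corpus, with each word split into characters.
    corpus = [(list(w), f) for w, f in [("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)]]
    vocab = sorted({symbol for symbols, _ in corpus for symbol in symbols})

    for _ in range(3):  # three merges, matching the walkthrough above
        pair_counts = get_pair_counts(corpus)
        best_pair = max(pair_counts, key=pair_counts.get)
        vocab.append(best_pair[0] + best_pair[1])
        corpus = merge_pair(corpus, best_pair)

    print(vocab)  # ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']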
Byte-level BPE
^^^^^^^^^^^^^^
To deal with the fact that the base vocabulary needs to contain all base characters, which can get quite big if one
allows for all unicode characters, the
`GPT-2 paper <https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`__
introduces a clever trick: use bytes as the base vocabulary (which forces a base size of 256). With some additional
rules to deal with punctuation, this makes it possible to tokenize every text without needing an unknown token. For
instance, the :doc:`GPT-2 model <model_doc/gpt2>` has a vocabulary size of 50,257, which corresponds to the 256 base
byte tokens, a special end-of-text token and the symbols learned with 50,000 merges.
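As a quick sanity check (the exact subtokens depend on the pretrained vocabulary, so treat the result as indicative),
the GPT-2 tokenizer splits an emoji into several byte-level tokens instead of mapping it to an unknown token:

.. code-block::

    >>> from transformers import GPT2Tokenizer
    >>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    >>> tokenizer.tokenize("I love 🤗")  # no <unk>: the emoji becomes a few byte-level tokens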
.. _wordpiece:
WordPiece
~~~~~~~~~
WordPiece is the subword tokenization algorithm used for :doc:`BERT <model_doc/bert>` (as well as
:doc:`DistilBERT <model_doc/distilbert>` and :doc:`Electra <model_doc/electra>`) and was outlined in
`this paper <https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf>`__. It relies
on the same base as BPE: initialize the vocabulary with every character present in the corpus and progressively learn
a given number of merge rules. The difference is that it doesn't choose the most frequent pair, but the one that will
maximize the likelihood of the corpus once merged.

What does this mean? Well, in the previous example, it means we only merge 'u' and 'g' if the probability of having
'ug' divided by the product of the probabilities of having 'u' and 'g' is greater than for any other pair of symbols.
It's subtly different from what BPE does in the sense that it evaluates what it "loses" by merging two symbols and
makes sure it's `worth it`.
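To make this concrete, here is a small sketch of that scoring criterion on the toy corpus from the BPE section (a
simplification of the actual WordPiece training procedure, shown only to illustrate the idea):

.. code-block::

    import collections

    # Symbol and pair frequencies for the toy corpus used in the BPE section.
    words = [("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)]

    symbol_counts, pair_counts = collections.Counter(), collections.Counter()
    for word, freq in words:
        for symbol in word:
            symbol_counts[symbol] += freq
        for pair in zip(word, word[1:]):
            pair_counts[pair] += freq

    # WordPiece-style score: frequency of the pair divided by the product of the
    # frequencies of its parts (proportional to the likelihood gained by merging it).
    scores = {pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
              for pair, count in pair_counts.items()}

    best = max(scores, key=scores.get)
    print(best, scores[best])
    # ('g', 's') 0.05 -- not ('u', 'g'): 'ug' is the most frequent pair, but 'u' and
    # 'g' are individually so common that merging them gains comparatively little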
.. _unigram:
Unigram
~~~~~~~
Unigram is a subword tokenization algorithm introduced in `this paper <https://arxiv.org/pdf/1804.10959.pdf>`__.
Instead of starting with a group of base symbols and learning merges with some rule, like BPE or WordPiece, it starts
from a large vocabulary (for instance, all pretokenized words and the most common substrings) that it will trim down
progressively. It's not used directly for any of the pretrained models in the library, but it's used in conjunction
with :ref:`SentencePiece <sentencepiece>`.
More specifically, at a given step, Unigram computes a loss from the corpus we have and the current vocabulary and
then, for each subword, evaluates how much the loss would increase if that subword were removed from the vocabulary.
It then sorts the subwords by this quantity and removes the p percent of tokens whose removal increases the loss the
least, i.e. the ones that contribute the least to the overall loss (p could be 10 or 20, for instance). It then
repeats the process until the vocabulary has reached the desired size, always keeping the base characters (to be able
to tokenize any word written with them, like BPE or WordPiece).
Contrary to BPE and WordPiece, which produce rules in a certain order that you can then apply in the same order when
tokenizing new text, Unigram allows several ways of tokenizing a new text. For instance, if it ends up with the
vocabulary

::

    ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']

we had before, it could tokenize "hugs" as ``['hug', 's']``, ``['h', 'ug', 's']`` or ``['h', 'u', 'g', 's']``. So
which one should we choose? On top of saving the vocabulary, the trained tokenizer saves the probability of each
token in the training corpus. You can then give a probability to each tokenization (the product of the probabilities
of the tokens forming it) and pick the most likely one (or, if you want to apply some data augmentation, sample one
of the tokenizations according to those probabilities).
Those probabilities define the loss that trains the tokenizer: if our corpus consists of the words
:math:`x_{1}, \dots, x_{N}` and if, for the word :math:`x_{i}`, we denote by :math:`S(x_{i})` the set of all possible
tokenizations of :math:`x_{i}` (with the current vocabulary), then the loss is defined as

.. math::

    \mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )
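As a toy illustration (with completely made-up token probabilities, only to show the mechanics), here is a
brute-force sketch that enumerates the tokenizations of a word, picks the most likely one, and computes the loss
above on the small corpus from the BPE section:

.. code-block::

    import math

    # Hypothetical unigram probabilities for the vocabulary of the example above.
    token_probs = {'b': 0.05, 'g': 0.05, 'h': 0.05, 'n': 0.05, 'p': 0.05,
                   's': 0.10, 'u': 0.05, 'ug': 0.20, 'un': 0.20, 'hug': 0.25}

    def tokenizations(word):
        """Enumerate every way of splitting ``word`` into tokens of the vocabulary."""
        if not word:
            return [[]]
        return [[word[:i]] + rest
                for i in range(1, len(word) + 1) if word[:i] in token_probs
                for rest in tokenizations(word[i:])]

    def proba(tokens):
        """Probability of one tokenization: the product of its token probabilities."""
        return math.prod(token_probs[token] for token in tokens)

    print(tokenizations("hugs"))
    # [['h', 'u', 'g', 's'], ['h', 'ug', 's'], ['hug', 's']]
    print(max(tokenizations("hugs"), key=proba))
    # ['hug', 's'] -- the most likely tokenization with these (made-up) probabilities

    # The loss on the toy corpus, grouping identical words by their frequency.
    corpus = [("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)]
    loss = -sum(freq * math.log(sum(proba(tok) for tok in tokenizations(word)))
                for word, freq in corpus)
    print(loss)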
.. _sentencepiece:
SentencePiece
~~~~~~~~~~~~~
All the tokenization methods we have looked at so far require some form of pretokenization, which has a central
problem: not all languages use spaces to separate words. :doc:`XLM <model_doc/xlm>` solves this by using specific
pretokenizers for those languages (in its case, Chinese, Japanese and Thai). To solve the problem more generally,
SentencePiece (introduced in `this paper <https://arxiv.org/pdf/1808.06226.pdf>`__) treats the input as a raw stream,
includes the space in the set of characters to use, then uses BPE or unigram to construct the appropriate vocabulary.

That's why, in the example we saw before using :class:`~transformers.XLNetTokenizer` (which uses SentencePiece), we
had the '▁' character, which represents a space. Decoding a tokenized text is then very easy: we just have to
concatenate all the tokens together and replace '▁' with a space.
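Following that rule by hand on the tokens of the XLNet example above (the library's tokenizers do the equivalent for
you through their decoding methods):

.. code-block::

    >>> tokens = ['▁Don', "'", 't', '▁you', '▁love', '▁', '🤗', '▁', 'Transform', 'ers', '?', '▁We', '▁sure', '▁do', '.']
    >>> "".join(tokens).replace('▁', ' ').lstrip()
    "Don't you love 🤗 Transformers? We sure do."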
All transformers models in the library that use SentencePiece use it in combination with unigram. Examples of models
using SentencePiece are :doc:`ALBERT <model_doc/albert>`, :doc:`XLNet <model_doc/xlnet>` and the
:doc:`Marian framework <model_doc/marian>`.
Training and fine-tuning
========================
Model classes in 🤗 Transformers are designed to be compatible with native
PyTorch and TensorFlow 2 and can be used seamlessly with either. In this
@@ -24,7 +24,7 @@ Sections:
.. _pytorch:
Fine-tuning in native PyTorch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Model classes in 🤗 Transformers that don't begin with ``TF`` are
`PyTorch Modules <https://pytorch.org/docs/master/generated/torch.nn.Module.html>`_,
@@ -141,7 +141,7 @@ with features like mixed precision and easy tensorboard logging.
Freezing the encoder
--------------------
In some cases, you might be interested in keeping the weights of the
pre-trained encoder frozen and optimizing only the weights of the head
@@ -158,7 +158,7 @@ submodule on any task-specific model in the library:
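A minimal sketch of what that looks like in PyTorch, assuming (as for the task-specific models in the library) that
the encoder is reachable through the ``base_model`` attribute:

.. code-block::

    >>> from transformers import BertForSequenceClassification
    >>> model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
    >>> for param in model.base_model.parameters():  # the bare encoder, without the classification head
    ...     param.requires_grad = False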
.. _tensorflow:
Fine-tuning in native TensorFlow 2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Models can also be trained natively in TensorFlow 2. Just as with PyTorch,
TensorFlow models can be instantiated with
@@ -210,7 +210,7 @@ can even save the model and then reload it as a PyTorch model (or vice-versa):
.. _trainer:
Trainer
^^^^^^^
We also provide a simple but feature-complete training and evaluation
interface through :func:`~transformers.Trainer` and
@@ -303,7 +303,7 @@ launching tensorboard in your specified ``logging_dir`` directory.
.. _additional-resources:
Additional resources
^^^^^^^^^^^^^^^^^^^^
- `A lightweight colab demo <https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM?usp=sharing>`_
which uses ``Trainer`` for IMDb sentiment classification.
@@ -32,54 +32,55 @@ ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class AlbertConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.AlbertModel`.
It is used to instantiate an ALBERT model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.
This is the configuration class to store the configuration of a :class:`~transformers.AlbertModel` or a
:class:`~transformers.TFAlbertModel`. It is used to instantiate an ALBERT model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the ALBERT `xxlarge <https://huggingface.co/albert-xxlarge-v2>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, optional, defaults to 30000):
Vocabulary size of the ALBERT model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.AlbertModel`.
embedding_size (:obj:`int`, optional, defaults to 128):
vocab_size (:obj:`int`, `optional`, defaults to 30000):
Vocabulary size of the ALBERT model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.AlbertModel` or
:class:`~transformers.TFAlbertModel`.
embedding_size (:obj:`int`, `optional`, defaults to 128):
Dimensionality of vocabulary embeddings.
hidden_size (:obj:`int`, optional, defaults to 4096):
hidden_size (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, optional, defaults to 12):
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_hidden_groups (:obj:`int`, optional, defaults to 1):
num_hidden_groups (:obj:`int`, `optional`, defaults to 1):
Number of groups for the hidden layers, parameters in the same group are shared.
num_attention_heads (:obj:`int`, optional, defaults to 64):
num_attention_heads (:obj:`int`, `optional`, defaults to 64):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, optional, defaults to 16384):
The dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
inner_group_num (:obj:`int`, optional, defaults to 1):
intermediate_size (:obj:`int`, `optional`, defaults to 16384):
The dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
inner_group_num (:obj:`int`, `optional`, defaults to 1):
The number of inner repetitions of attention and FFN.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu_new"):
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu_new"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0):
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0):
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, optional, defaults to 512):
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something
large (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, optional, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.AlbertModel`.
initializer_range (:obj:`float`, optional, defaults to 0.02):
type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.AlbertModel` or
:class:`~transformers.TFAlbertModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
classifier_dropout_prob (:obj:`float`, optional, defaults to 0.1):
classifier_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for attached classifiers.
Example::
Examples::
>>> from transformers import AlbertConfig, AlbertModel
>>> # Initializing an ALBERT-xxlarge style configuration
@@ -50,10 +50,10 @@ BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class BertConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.BertModel`.
It is used to instantiate an BERT model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.
This is the configuration class to store the configuration of a :class:`~transformers.BertModel` or a
:class:`~transformers.TFBertModel`. It is used to instantiate a BERT model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the BERT `bert-base-uncased <https://huggingface.co/bert-base-uncased>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
@@ -61,37 +61,39 @@ class BertConfig(PretrainedConfig):
Args:
vocab_size (:obj:`int`, optional, defaults to 30522):
Vocabulary size of the BERT model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertModel`.
hidden_size (:obj:`int`, optional, defaults to 768):
vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.BertModel` or
:class:`~transformers.TFBertModel`.
hidden_size (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, optional, defaults to 12):
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, optional, defaults to 12):
num_attention_heads (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, optional, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, optional, defaults to 512):
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, optional, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.BertModel`.
initializer_range (:obj:`float`, optional, defaults to 0.02):
type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.BertModel` or
:class:`~transformers.TFBertModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
gradient_checkpointing (:obj:`bool`, optional, defaults to :obj:`False`):
gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
If True, use gradient checkpointing to save memory at the expense of slower backward pass.
Example::
Examples::
>>> from transformers import BertModel, BertConfig
@@ -19,18 +19,18 @@ from .configuration_utils import PretrainedConfig
class BertGenerationConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.BertGenerationPreTrainedModel`.
It is used to instantiate a BertGenerationConfig model according to the specified arguments, defining the model architecture.
This is the configuration class to store the configuration of a
:class:`~transformers.BertGenerationPreTrainedModel`. It is used to instantiate a BertGeneration model according to
the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 50358):
Vocabulary size of the BertGeneration model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertGeneration`.
Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.BertGeneration`.
hidden_size (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, `optional`, defaults to 24):
@@ -38,7 +38,7 @@ class BertGenerationConfig(PretrainedConfig):
num_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
Dimensionality of the "intermediate" (often called feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
@@ -56,7 +56,7 @@ class BertGenerationConfig(PretrainedConfig):
gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
If :obj:`True`, use gradient checkpointing to save memory at the expense of slower backward pass.
Example::
Examples::
>>> from transformers import BertGenerationConfig, BertGenerationEncoder
@@ -25,44 +25,45 @@ CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP = {"ctrl": "https://s3.amazonaws.com/models.h
class CTRLConfig(PretrainedConfig):
"""
This is the configuration class to store the configuration of a :class:`~transformers.CTRLModel`.
It is used to instantiate an CTRL model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the `ctrl <https://huggingface.co/ctrl>`__ architecture from SalesForce.
This is the configuration class to store the configuration of a :class:`~transformers.CTRLModel` or a
:class:`~transformers.TFCTRLModel`. It is used to instantiate a CTRL model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the `ctrl <https://huggingface.co/ctrl>`__ architecture from SalesForce.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, optional, defaults to 246534):
Vocabulary size of the CTRL model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.CTRLModel`.
n_positions (:obj:`int`, optional, defaults to 256):
vocab_size (:obj:`int`, `optional`, defaults to 246534):
Vocabulary size of the CTRL model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.CTRLModel` or
:class:`~transformers.TFCTRLModel`.
n_positions (:obj:`int`, `optional`, defaults to 256):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
n_ctx (:obj:`int`, optional, defaults to 256):
n_ctx (:obj:`int`, `optional`, defaults to 256):
Dimensionality of the causal mask (usually same as n_positions).
n_embd (:obj:`int`, optional, defaults to 1280):
n_embd (:obj:`int`, `optional`, defaults to 1280):
Dimensionality of the embeddings and hidden states.
dff (:obj:`int`, optional, defaults to 8192):
Dimensionality of the inner dimension of the FFN.
n_layer (:obj:`int`, optional, defaults to 48):
dff (:obj:`int`, `optional`, defaults to 8192):
The inner dimension of the feed forward networks (FFN).
n_layer (:obj:`int`, `optional`, defaults to 48):
Number of hidden layers in the Transformer encoder.
n_head (:obj:`int`, optional, defaults to 16):
n_head (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
resid_pdrop (:obj:`float`, optional, defaults to 0.1):
resid_pdrop (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
embd_pdrop (:obj:`int`, optional, defaults to 0.1):
embd_pdrop (:obj:`int`, `optional`, defaults to 0.1):
The dropout ratio for the embeddings.
attn_pdrop (:obj:`float`, optional, defaults to 0.1):
attn_pdrop (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention.
layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-6):
layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-6):
The epsilon to use in the layer normalization layers
initializer_range (:obj:`float`, optional, defaults to 0.02):
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
Example::
Examples::
>>> from transformers import CTRLModel, CTRLConfig
@@ -33,50 +33,51 @@ DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class DistilBertConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.DistilBertModel`.
It is used to instantiate a DistilBERT model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the DistilBERT `distilbert-base-uncased <https://huggingface.co/distilbert-base-uncased>`__ architecture.
This is the configuration class to store the configuration of a :class:`~transformers.DistilBertModel` or a
:class:`~transformers.TFDistilBertModel`. It is used to instantiate a DistilBERT model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the DistilBERT
`distilbert-base-uncased <https://huggingface.co/distilbert-base-uncased>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, optional, defaults to 30522):
Vocabulary size of the DistilBERT model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertModel`.
max_position_embeddings (:obj:`int`, optional, defaults to 512):
vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the DistilBERT model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.DistilBertModel` or
:class:`~transformers.TFDistilBertModel`.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
sinusoidal_pos_embds (:obj:`boolean`, optional, defaults to :obj:`False`):
sinusoidal_pos_embds (:obj:`boolean`, `optional`, defaults to :obj:`False`):
Whether to use sinusoidal positional embeddings.
n_layers (:obj:`int`, optional, defaults to 6):
n_layers (:obj:`int`, `optional`, defaults to 6):
Number of hidden layers in the Transformer encoder.
n_heads (:obj:`int`, optional, defaults to 12):
n_heads (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
dim (:obj:`int`, optional, defaults to 768):
dim (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
hidden_dim (:obj:`int`, optional, defaults to 3072):
The size of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
dropout (:obj:`float`, optional, defaults to 0.1):
hidden_dim (:obj:`int`, `optional`, defaults to 3072):
The size of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, optional, defaults to 0.1):
attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
activation (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
activation (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported.
initializer_range (:obj:`float`, optional, defaults to 0.02):
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
qa_dropout (:obj:`float`, optional, defaults to 0.1):
qa_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probabilities used in the question answering model
:class:`~transformers.DistilBertForQuestionAnswering`.
seq_classif_dropout (:obj:`float`, optional, defaults to 0.2):
seq_classif_dropout (:obj:`float`, `optional`, defaults to 0.2):
The dropout probabilities used in the sequence classification and the multiple choice model
:class:`~transformers.DistilBertForSequenceClassification`.
Example::
Examples::
>>> from transformers import DistilBertModel, DistilBertConfig
@@ -32,8 +32,12 @@ class DPRConfig(PretrainedConfig):
:class:`~transformers.DPRConfig` is the configuration class to store the configuration of a
`DPRModel`.
This is the configuration class to store the configuration of a `DPRContextEncoder`, `DPRQuestionEncoder`, or a `DPRReader`.
It is used to instantiate the components of the DPR model.
This is the configuration class to store the configuration of a :class:`~transformers.DPRContextEncoder`,
:class:`~transformers.DPRQuestionEncoder`, or a :class:`~transformers.DPRReader`. It is used to instantiate the
components of the DPR model.
This class is a subclass of :class:`~transformers.BertConfig`. Please check the
superclass for the documentation of all kwargs.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522):
@@ -33,11 +33,11 @@ ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class ElectraConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.ElectraModel`.
It is used to instantiate an ELECTRA model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the ELECTRA `google/electra-small-discriminator <https://huggingface.co/google/electra-small-discriminator>`__
architecture.
This is the configuration class to store the configuration of a :class:`~transformers.ElectraModel` or a
:class:`~transformers.TFElectraModel`. It is used to instantiate an ELECTRA model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the ELECTRA
`google/electra-small-discriminator <https://huggingface.co/google/electra-small-discriminator>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
@@ -45,59 +45,61 @@ class ElectraConfig(PretrainedConfig):
Args:
vocab_size (:obj:`int`, optional, defaults to 30522):
Vocabulary size of the ELECTRA model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.ElectraModel`.
embedding_size (:obj:`int`, optional, defaults to 128):
vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the ELECTRA model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.ElectraModel` or
:class:`~transformers.TFElectraModel`.
embedding_size (:obj:`int`, `optional`, defaults to 128):
Dimensionality of the token embeddings.
hidden_size (:obj:`int`, optional, defaults to 256):
hidden_size (:obj:`int`, `optional`, defaults to 256):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, optional, defaults to 12):
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, optional, defaults to 4):
num_attention_heads (:obj:`int`, `optional`, defaults to 4):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, optional, defaults to 1024):
intermediate_size (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, optional, defaults to 512):
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, optional, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.ElectraModel`.
initializer_range (:obj:`float`, optional, defaults to 0.02):
type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.ElectraModel` or
:class:`~transformers.TFElectraModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
summary_type (:obj:`string`, optional, defaults to "first"):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.ElectraForMultipleChoice`.
Is one of the following options:
- 'last' => take the last token hidden state (like XLNet)
- 'first' => take the first token hidden state (like Bert)
- 'mean' => take the mean of all tokens hidden states
- 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
- 'attn' => Not implemented now, use multi-head attention
summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.ElectraForMultipleChoice`.
Add a projection after the vector extraction
summary_activation (:obj:`string` or :obj:`None`, optional):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.ElectraForMultipleChoice`.
'gelu' => add a gelu activation to the output, Other => no activation.
summary_last_dropout (:obj:`float`, optional, defaults to 0.0):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.ElectraForMultipleChoice`.
Add a dropout after the projection and activation
Example::
summary_type (:obj:`str`, `optional`, defaults to :obj:`"first"`):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Has to be one of the following options:
- :obj:`"last"`: Take the last token hidden state (like XLNet).
- :obj:`"first"`: Take the first token hidden state (like BERT).
- :obj:`"mean"`: Take the mean of all tokens hidden states.
- :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Whether or not to add a projection after the vector extraction.
summary_activation (:obj:`str`, `optional`):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Pass :obj:`"gelu"` for a gelu activation to the output, any other value will result in no activation.
summary_last_dropout (:obj:`float`, `optional`, defaults to 0.0):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
The dropout ratio to be used after the projection and activation.
Examples::
>>> from transformers import ElectraModel, ElectraConfig
......
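To complement the truncated example above, a minimal usage sketch for the ELECTRA configuration (assuming the library as of this commit; the model is randomly initialized)::

    >>> from transformers import ElectraConfig, ElectraModel
    >>> # Defaults roughly correspond to the google/electra-small-discriminator architecture
    >>> configuration = ElectraConfig()
    >>> # Instantiate a randomly initialized model from that configuration
    >>> model = ElectraModel(configuration)
    >>> # The configuration can be read back from the model
    >>> configuration = model.config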
......@@ -25,22 +25,24 @@ logger = logging.get_logger(__name__)
class EncoderDecoderConfig(PretrainedConfig):
r"""
:class:`~transformers.EncoderDecoderConfig` is the configuration class to store the configuration of a `EncoderDecoderModel`.
:class:`~transformers.EncoderDecoderConfig` is the configuration class to store the configuration of a
:class:`~transformers.EncoderDecoderModel`. It is used to instantiate an Encoder Decoder model according to the
specified arguments, defining the encoder and decoder configs.
It is used to instantiate an Encoder Decoder model according to the specified arguments, defining the encoder and decoder configs.
Configuration objects inherit from :class:`~transformers.PretrainedConfig`
and can be used to control the model outputs.
See the documentation for :class:`~transformers.PretrainedConfig` for more information.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
kwargs (`optional`):
Remaining dictionary of keyword arguments. Notably:
encoder (:class:`PretrainedConfig`, optional, defaults to `None`):
An instance of a configuration object that defines the encoder config.
decoder (:class:`PretrainedConfig`, optional, defaults to `None`):
An instance of a configuration object that defines the decoder config.
Dictionary of keyword arguments. Notably:
Example::
- **encoder** (:class:`~transformers.PretrainedConfig`, `optional`) -- An instance of a configuration
object that defines the encoder config.
- **decoder** (:class:`~transformers.PretrainedConfig`, `optional`) -- An instance of a configuration
object that defines the decoder config.
Examples::
>>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel
......
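A minimal sketch of how the ``encoder``/``decoder`` keyword arguments documented above are typically assembled (assuming the :meth:`~transformers.EncoderDecoderConfig.from_encoder_decoder_configs` helper, which is not shown in this diff; everything uses default values and random weights)::

    >>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel
    >>> config_encoder = BertConfig()
    >>> config_decoder = BertConfig()
    >>> # Combine the two configurations into a single EncoderDecoderConfig
    >>> config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
    >>> # Instantiate a randomly initialized encoder-decoder model from it
    >>> model = EncoderDecoderModel(config=config)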
......@@ -30,11 +30,9 @@ FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class FlaubertConfig(XLMConfig):
"""
Configuration class to store the configuration of a `FlaubertModel`.
This is the configuration class to store the configuration of a :class:`~transformers.XLMModel`.
It is used to instantiate an XLM model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the `xlm-mlm-en-2048 <https://huggingface.co/xlm-mlm-en-2048>`__ architecture.
This is the configuration class to store the configuration of a :class:`~transformers.FlaubertModel` or a
:class:`~transformers.TFFlaubertModel`. It is used to instantiate a FlauBERT model according to the specified
arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
......@@ -47,95 +45,95 @@ class FlaubertConfig(XLMConfig):
layerdrop (:obj:`float`, `optional`, defaults to 0.0):
Probability to drop layers during training (Fan et al., Reducing Transformer Depth on Demand
with Structured Dropout. ICLR 2020)
vocab_size (:obj:`int`, optional, defaults to 30145):
Vocabulary size of the Flaubert model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.FlaubertModel`.
emb_dim (:obj:`int`, optional, defaults to 2048):
vocab_size (:obj:`int`, `optional`, defaults to 30145):
Vocabulary size of the FlauBERT model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.FlaubertModel` or
:class:`~transformers.TFFlaubertModel`.
emb_dim (:obj:`int`, `optional`, defaults to 2048):
Dimensionality of the encoder layers and the pooler layer.
n_layer (:obj:`int`, optional, defaults to 12):
n_layer (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder.
n_head (:obj:`int`, optional, defaults to 16):
n_head (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
dropout (:obj:`float`, optional, defaults to 0.1):
dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected
layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, optional, defaults to 0.1):
attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for the attention mechanism
gelu_activation (:obj:`boolean`, optional, defaults to :obj:`True`):
The non-linear activation function (function or string) in the
encoder and pooler. If set to `True`, "gelu" will be used instead of "relu".
sinusoidal_embeddings (:obj:`boolean`, optional, defaults to :obj:`False`):
Whether to use sinusoidal positional embeddings instead of absolute positional embeddings.
causal (:obj:`boolean`, optional, defaults to :obj:`False`):
Set this to `True` for the model to behave in a causal manner.
gelu_activation (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to use a `gelu` activation instead of `relu`.
sinusoidal_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to use sinusoidal positional embeddings instead of absolute positional embeddings.
causal (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the model should behave in a causal manner.
Causal models use a triangular attention mask in order to only attend to the left-side context instead
of a bidirectional context.
asm (:obj:`boolean`, optional, defaults to :obj:`False`):
Whether to use an adaptive log softmax projection layer instead of a linear layer for the prediction
asm (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to use an adaptive log softmax projection layer instead of a linear layer for the prediction
layer.
n_langs (:obj:`int`, optional, defaults to 1):
n_langs (:obj:`int`, `optional`, defaults to 1):
The number of languages the model handles. Set to 1 for monolingual models.
use_lang_emb (:obj:`boolean`, optional, defaults to :obj:`True`)
use_lang_emb (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to use language embeddings. Some models use additional language embeddings, see
`the multilingual models page <http://huggingface.co/transformers/multilingual.html#xlm-language-embeddings>`__
for information on how to use them.
max_position_embeddings (:obj:`int`, optional, defaults to 512):
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might
ever be used with. Typically set this to something large just in case
(e.g., 512 or 1024 or 2048).
embed_init_std (:obj:`float`, optional, defaults to 2048^-0.5):
embed_init_std (:obj:`float`, `optional`, defaults to 2048^-0.5):
The standard deviation of the truncated_normal_initializer for
initializing the embedding matrices.
init_std (:obj:`int`, optional, defaults to 50257):
init_std (:obj:`int`, `optional`, defaults to 50257):
The standard deviation of the truncated_normal_initializer for
initializing all weight matrices except the embedding matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
bos_index (:obj:`int`, optional, defaults to 0):
bos_index (:obj:`int`, `optional`, defaults to 0):
The index of the beginning of sentence token in the vocabulary.
eos_index (:obj:`int`, optional, defaults to 1):
eos_index (:obj:`int`, `optional`, defaults to 1):
The index of the end of sentence token in the vocabulary.
pad_index (:obj:`int`, optional, defaults to 2):
pad_index (:obj:`int`, `optional`, defaults to 2):
The index of the padding token in the vocabulary.
unk_index (:obj:`int`, optional, defaults to 3):
unk_index (:obj:`int`, `optional`, defaults to 3):
The index of the unknown token in the vocabulary.
mask_index (:obj:`int`, optional, defaults to 5):
mask_index (:obj:`int`, `optional`, defaults to 5):
The index of the masking token in the vocabulary.
is_encoder(:obj:`boolean`, optional, defaults to :obj:`True`):
Whether the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.
summary_type (:obj:`string`, optional, defaults to "first"):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.XLMForSequenceClassification`.
Is one of the following options:
- 'last' => take the last token hidden state (like XLNet)
- 'first' => take the first token hidden state (like Bert)
- 'mean' => take the mean of all tokens hidden states
- 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
- 'attn' => Not implemented now, use multi-head attention
summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.XLMForSequenceClassification`.
Add a projection after the vector extraction
summary_activation (:obj:`string` or :obj:`None`, optional):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.XLMForSequenceClassification`.
'tanh' => add a tanh activation to the output, Other => no activation.
summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.XLMForSequenceClassification`.
If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.
summary_first_dropout (:obj:`float`, optional, defaults to 0.1):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.XLMForSequenceClassification`.
Add a dropout before the projection and activation
start_n_top (:obj:`int`, optional, defaults to 5):
Used in the SQuAD evaluation script for XLM and XLNet.
end_n_top (:obj:`int`, optional, defaults to 5):
Used in the SQuAD evaluation script for XLM and XLNet.
mask_token_id (:obj:`int`, optional, defaults to 0):
is_encoder(:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the initialized model should be a transformer encoder or decoder as seen in Vaswani et al.
summary_type (:obj:`str`, `optional`, defaults to :obj:`"first"`):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Has to be one of the following options:
- :obj:`"last"`: Take the last token hidden state (like XLNet).
- :obj:`"first"`: Take the first token hidden state (like BERT).
- :obj:`"mean"`: Take the mean of all tokens hidden states.
- :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Whether or not to add a projection after the vector extraction.
summary_activation (:obj:`str`, `optional`):
Argument used when doing sequence summary. Used in the sequence classification and multiple choice models.
Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
Used in the sequence classification and multiple choice models.
Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
Used in the sequence classification and multiple choice models.
The dropout ratio to be used after the projection and activation.
start_n_top (:obj:`int`, `optional`, defaults to 5):
Used in the SQuAD evaluation script.
end_n_top (:obj:`int`, `optional`, defaults to 5):
Used in the SQuAD evaluation script.
mask_token_id (:obj:`int`, `optional`, defaults to 0):
Model agnostic parameter to identify masked tokens when generating text in an MLM context.
lang_id (:obj:`int`, optional, defaults to 1):
lang_id (:obj:`int`, `optional`, defaults to 1):
The ID of the language used by the model. This parameter is used when generating
text in a given language.
"""
......
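A minimal sketch showing how the FlauBERT configuration arguments above are used (assuming the library as of this commit; the resulting model has random weights)::

    >>> from transformers import FlaubertConfig, FlaubertModel
    >>> # vocab_size is one of the documented arguments; 30145 is simply the documented default
    >>> configuration = FlaubertConfig(vocab_size=30145)
    >>> model = FlaubertModel(configuration)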
......@@ -18,7 +18,6 @@
import copy
from .configuration_utils import PretrainedConfig
from .file_utils import add_start_docstrings_to_callable
from .utils import logging
......@@ -27,33 +26,54 @@ logger = logging.get_logger(__name__)
FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
FSMT_CONFIG_ARGS_DOC = r"""
class DecoderConfig(PretrainedConfig):
r"""
Configuration class for FSMT's decoder specific things.
note: this is a private helper class
"""
model_type = "fsmt_decoder"
def __init__(self, vocab_size=0, bos_token_id=0):
super().__init__()
self.vocab_size = vocab_size
self.bos_token_id = bos_token_id
class FSMTConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.FSMTModel`. It is used to
instantiate a FSMT model according to the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
langs (:obj:`List[str]`):
source language, target_language (e.g. ['en', 'ru'])
A list with source language and target_language (e.g., ['en', 'ru']).
src_vocab_size (:obj:`int`):
defines the different tokens that can be represented by `inputs_ids` passed to the forward
method in the encoder.
Vocabulary size of the encoder. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed to the forward method in the encoder.
tgt_vocab_size (:obj:`int`):
defines the different tokens that can be represented by `inputs_ids` passed to the forward
method in the decoder.
Vocabulary size of the decoder. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed to the forward method in the decoder.
d_model (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the layers and the pooler layer.
encoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of encoder layers, 16 for pegasus, 6 for bart-base and marian
Number of encoder layers.
decoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of decoder layers, 16 for pegasus, 6 for bart-base and marian
Number of decoder layers.
encoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder.
Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to "relu"):
Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
activation_function (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"relu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported.
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
......@@ -74,7 +94,7 @@ FSMT_CONFIG_ARGS_DOC = r"""
eos_token_id (:obj:`int`, `optional`, defaults to 2):
End of stream token id.
decoder_start_token_id (:obj:`int`, `optional`):
This model starts decoding with :obj:`eos_token_id`.
encoder_layerdrop (:obj:`float`, `optional`, defaults to 0.0):
Google "layerdrop arxiv", as it's not explainable in one line.
decoder_layerdrop (:obj:`float`, `optional`, defaults to 0.0):
......@@ -92,26 +112,14 @@ FSMT_CONFIG_ARGS_DOC = r"""
early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`):
Flag that will be used by default in the :obj:`generate` method of the model. Whether to stop
the beam search when at least ``num_beams`` sentences are finished per batch or not.
"""
Examples::
class DecoderConfig(PretrainedConfig):
r"""
Configuration class for FSMT's decoder specific things.
note: this is a private helper class
"""
model_type = "fsmt_decoder"
def __init__(self, vocab_size=0, bos_token_id=0):
super().__init__()
self.vocab_size = vocab_size
self.bos_token_id = bos_token_id
>>> from transformers import FSMTConfig, FSMTModel
>>> config = FSMTConfig.from_pretrained('facebook/wmt19-en-ru')
>>> model = FSMTModel(config)
@add_start_docstrings_to_callable(FSMT_CONFIG_ARGS_DOC)
class FSMTConfig(PretrainedConfig):
r"""
Configuration class for FSMT.
"""
model_type = "fsmt"
......@@ -149,17 +157,6 @@ class FSMTConfig(PretrainedConfig):
early_stopping=False,
**common_kwargs
):
r"""
:class:`~transformers.FSMTConfig` is the configuration class for `FSMTModel`.
Examples::
>>> from transformers import FSMTConfig, FSMTModel
>>> config = FSMTConfig.from_pretrained('facebook/wmt19-en-ru')
>>> model = FSMTModel(config)
"""
if "hidden_size" in common_kwargs:
raise ValueError("hidden size is called d_model")
super().__init__(
......
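In addition to the ``from_pretrained`` example shown in the docstring, a configuration can be built by hand from the arguments documented above; the values below are purely illustrative (tiny vocabularies, not a usable translation model), assuming the library as of this commit::

    >>> from transformers import FSMTConfig, FSMTModel
    >>> # langs, src_vocab_size and tgt_vocab_size are the arguments described in FSMT_CONFIG_ARGS_DOC
    >>> config = FSMTConfig(langs=["en", "ru"], src_vocab_size=1000, tgt_vocab_size=1000)
    >>> # Randomly initialized model built from that configuration
    >>> model = FSMTModel(config)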
......@@ -36,20 +36,21 @@ FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class FunnelConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.FunnelModel`.
It is used to instantiate an Funnel Transformer model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the Funnel Transformer `funnel-transformer/small <https://huggingface.co/funnel-transformer/small>`__ architecture.
This is the configuration class to store the configuration of a :class:`~transformers.FunnelModel` or a
:class:`~transformers.TFFunnelModel`. It is used to instantiate a Funnel Transformer model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the Funnel Transformer `funnel-transformer/small
<https://huggingface.co/funnel-transformer/small>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the Funnel transformer. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.FunnelModel`.
Vocabulary size of the Funnel transformer. Defines the number of different tokens that can be represented
by the :obj:`inputs_ids` passed when calling :class:`~transformers.FunnelModel` or
:class:`~transformers.TFFunnelModel`.
block_sizes (:obj:`List[int]`, `optional`, defaults to :obj:`[4, 4, 4]`):
The sizes of the blocks used in the model.
block_repeats (:obj:`List[int]`, `optional`):
......@@ -77,7 +78,8 @@ class FunnelConfig(PretrainedConfig):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, `optional`, defaults to 3):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.FunnelModel`.
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.FunnelModel` or
:class:`~transformers.TFFunnelModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.1):
The standard deviation of the `uniform initializer` for initializing all weight matrices in attention
layers.
......
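A minimal sketch for the Funnel Transformer configuration (assuming the library as of this commit; ``block_sizes`` below is an illustrative override of the documented default ``[4, 4, 4]``)::

    >>> from transformers import FunnelConfig, FunnelModel
    >>> # A smaller model than the defaults, purely for illustration
    >>> config = FunnelConfig(block_sizes=[2, 2, 2])
    >>> model = FunnelModel(config)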
......@@ -32,10 +32,10 @@ GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class GPT2Config(PretrainedConfig):
"""
This is the configuration class to store the configuration of a :class:`~transformers.GPT2Model`.
It is used to instantiate an GPT-2 model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the GPT-2 `small <https://huggingface.co/gpt2>`__ architecture.
This is the configuration class to store the configuration of a :class:`~transformers.GPT2Model` or a
:class:`~transformers.TFGPT2Model`. It is used to instantiate a GPT-2 model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the GPT-2 `small <https://huggingface.co/gpt2>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
......@@ -43,60 +43,66 @@ class GPT2Config(PretrainedConfig):
Args:
vocab_size (:obj:`int`, optional, defaults to 50257):
Vocabulary size of the GPT-2 model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.GPT2Model`.
n_positions (:obj:`int`, optional, defaults to 1024):
vocab_size (:obj:`int`, `optional`, defaults to 50257):
Vocabulary size of the GPT-2 model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.GPT2Model` or
:class:`~transformers.TFGPT2Model`.
n_positions (:obj:`int`, `optional`, defaults to 1024):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
n_ctx (:obj:`int`, optional, defaults to 1024):
n_ctx (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the causal mask (usually same as n_positions).
n_embd (:obj:`int`, optional, defaults to 768):
n_embd (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the embeddings and hidden states.
n_layer (:obj:`int`, optional, defaults to 12):
n_layer (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder.
n_head (:obj:`int`, optional, defaults to 12):
n_head (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
n_inner (:obj:`int`, optional, defaults to None):
n_inner (:obj:`int`, `optional`, defaults to None):
Dimensionality of the inner feed-forward layers. :obj:`None` will set it to 4 times :obj:`n_embd`.
activation_function (:obj:`str`, optional, defaults to 'gelu'):
Activation function selected in the list ["relu", "swish", "gelu", "tanh", "gelu_new"].
resid_pdrop (:obj:`float`, optional, defaults to 0.1):
activation_function (:obj:`str`, `optional`, defaults to :obj:`"gelu"`):
Activation function, to be selected in the list :obj:`["relu", "swish", "gelu", "tanh", "gelu_new"]`.
resid_pdrop (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
embd_pdrop (:obj:`int`, optional, defaults to 0.1):
embd_pdrop (:obj:`int`, `optional`, defaults to 0.1):
The dropout ratio for the embeddings.
attn_pdrop (:obj:`float`, optional, defaults to 0.1):
attn_pdrop (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention.
layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):
layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-5):
The epsilon to use in the layer normalization layers
initializer_range (:obj:`float`, optional, defaults to 0.02):
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
summary_type (:obj:`string`, optional, defaults to "cls_index"):
summary_type (:obj:`str`, `optional`, defaults to :obj:`"cls_index"`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
Has to be one of the following options:
- :obj:`"last"`: Take the last token hidden state (like XLNet).
- :obj:`"first"`: Take the first token hidden state (like BERT).
- :obj:`"mean"`: Take the mean of all tokens hidden states.
- :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
Whether or not to add a projection after the vector extraction.
summary_activation (:obj:`str`, `optional`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`.
Is one of the following options:
- 'last' => take the last token hidden state (like XLNet)
- 'first' => take the first token hidden state (like Bert)
- 'mean' => take the mean of all tokens hidden states
- 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
- 'attn' => Not implemented now, use multi-head attention
summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`.
Add a projection after the vector extraction
summary_activation (:obj:`string` or :obj:`None`, optional):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`.
'tanh' => add a tanh activation to the output, Other => no activation.
summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`.
If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.
summary_first_dropout (:obj:`float`, optional, defaults to 0.1):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.GPT2DoubleHeadsModel`.
Add a dropout before the projection and activation
Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
Argument used when doing sequence summary, used in the models
:class:`~transformers.GPT2DoubleHeadsModel` and :class:`~transformers.TFGPT2DoubleHeadsModel`.
The dropout ratio to be used after the projection and activation.
Example::
......
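A minimal sketch of the GPT-2 configuration in use (assuming the library as of this commit; the small ``n_layer``/``n_head``/``n_embd`` values are illustrative and not the documented defaults, which reproduce the gpt2 "small" architecture)::

    >>> from transformers import GPT2Config, GPT2Model
    >>> # A deliberately tiny, randomly initialized model for illustration
    >>> config = GPT2Config(n_layer=2, n_head=2, n_embd=128)
    >>> model = GPT2Model(config)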
......@@ -33,6 +33,10 @@ LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class LongformerConfig(RobertaConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.LongformerModel` or a
:class:`~transformers.TFLongformerModel`. It is used to instantiate a Longformer model according to the specified
arguments, defining the model architecture.
This is the configuration class to store the configuration of a :class:`~transformers.LongformerModel`.
It is used to instantiate an Longformer model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
......@@ -42,8 +46,8 @@ class LongformerConfig(RobertaConfig):
It reuses the same defaults. Please check the parent class for more information.
Args:
attention_window (:obj:`int` or :obj:`List[int]`, optional, defaults to 512):
Size of an attention window around each token. If :obj:`int`, use the same size for all layers.
attention_window (:obj:`int` or :obj:`List[int]`, `optional`, defaults to 512):
Size of an attention window around each token. If an :obj:`int`, use the same size for all layers.
To specify a different window size for each layer, use a :obj:`List[int]` where
``len(attention_window) == num_hidden_layers``.
......
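A minimal sketch for the ``attention_window`` argument described above (assuming the library as of this commit; 256 is an illustrative value, the documented default is 512)::

    >>> from transformers import LongformerConfig, LongformerModel
    >>> # A single int applies the same window size to every layer
    >>> config = LongformerConfig(attention_window=256)
    >>> model = LongformerModel(config)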
......@@ -29,83 +29,91 @@ LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class LxmertConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.BertModel`.
It is used to instantiate an Lxmert model according to the specified arguments, defining the model
architecture.
This is the configuration class to store the configuration of a :class:`~transformers.LxmertModel` or a
:class:`~transformers.TFLxmertModel`. It is used to instantiate a LXMERT model according to the specified
arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, optional, defaults to 30522):
Vocabulary size of the BERT model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertModel`.
hidden_size (:obj:`int`, optional, defaults to 768):
vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the LXMERT model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.LxmertModel` or
:class:`~transformers.TFLxmertModel`.
hidden_size (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
r_layers (:obj:`int`, optional, defaults to 5):
r_layers (:obj:`int`, `optional`, defaults to 5):
Number of hidden layers in the Transformer visual encoder.
l_layers (:obj:`int`, optional, defaults to 9):
l_layers (:obj:`int`, `optional`, defaults to 9):
Number of hidden layers in the Transformer language encoder.
x_layers (:obj:`int`, optional, defaults to 5):
x_layers (:obj:`int`, `optional`, defaults to 5):
Number of hidden layers in the Transformer cross modality encoder.
num_attention_heads (:obj:`int`, optional, defaults to 5):
num_attention_heads (:obj:`int`, `optional`, defaults to 5):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, optional, defaults to 3072):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
intermediate_size (:obj:`int`, `optional`, defaults to 3072):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, optional, defaults to 512):
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, optional, defaults to 2):
type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.BertModel`.
initializer_range (:obj:`float`, optional, defaults to 0.02):
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
visual_feat_dim (:obj:`int`, optional, defaults to 2048):
visual_feat_dim (:obj:`int`, `optional`, defaults to 2048):
This represents the last dimension of the pooled-object features used as input for the model,
representing the size of each object feature itself.
visual_pos_dim (:obj:`int`, optional, defaults to 4):
visual_pos_dim (:obj:`int`, `optional`, defaults to 4):
This represents the number of spatial features that are mixed into the visual features.
The default is set to 4 because most commonly this will represent the location of a bounding box.
i.e. (x, y, width, height)
visual_loss_normalizer (:obj:`float`, optional, defaults to 1/15):
i.e., (x, y, width, height)
visual_loss_normalizer (:obj:`float`, `optional`, defaults to 1/15):
This represents the scaling factor by which each visual loss is multiplied if, during pretraining,
one decided to train with multiple vision-based loss objectives.
num_qa_labels (:obj:`int`, optional, defaults to 9500):
This represents the total number of different question answering (QA) labels there are. If using more than one dataset with QA,
the user will need to account for the total number of labels that all of the datasets have in total.
num_object_labels (:obj:`int`, optional, defaults to 1600):
This represents the total number of semantically unique objects that lxmert will be able to classify a pooled-object feature
as belonging too.
num_attr_labels (:obj:`int`, optional, defaults to 400):
This represents the total number of semantically unique attributes that lxmert will be able to classify a pooled-object feature
as possessing.
task_matched (:obj:`bool`, optional, defaults to :obj:`True`):
This task is used for sentence-image matching. If the sentence correctly describes the image the label will be 1.
If the sentence does not correctly describe the image, the label will be 0.
task_mask_lm (:obj:`bool`, optional, defaults to :obj:`True`):
This task is the defacto masked langauge modeling used in pretraining models such as BERT.
task_obj_predict (:obj:`bool`, optional, defaults to :obj:`True`):
This task is set to true if the user would like to perform one of the following loss objectives:
object predicition, atrribute predicition, feature regression
task_qa (:obj:`bool`, optional, defaults to :obj:`True`):
This task specifies whether or not Lxmert will calculate the question-asnwering loss objective
visual_obj_loss (:obj:`bool`, optional, defaults to :obj:`True`):
This task specifies whether or not Lxmert will calculate the object-prediction loss objective
visual_attr_loss (:obj:`bool`, optional, defaults to :obj:`True`):
This task specifies whether or not Lxmert will calculate the attribute-prediction loss objective
visual_feat_loss (:obj:`bool`, optional, defaults to :obj:`True`):
This task specifies whether or not Lxmert will calculate the feature-regression loss objective
output_attentions (:obj:`bool`, optional, defaults to :obj:`False`):
if True, the vision, langauge, and cross-modality layers will be returned
output_hidden_states (:obj:`bool`, optional, defaults to :obj:`False`):
if True, final cross-modality hidden states for language and vision features will be returned
num_qa_labels (:obj:`int`, `optional`, defaults to 9500):
This represents the total number of different question answering (QA) labels there are. If using more than
one dataset with QA, the user will need to account for the total number of labels that all of the datasets
have in total.
num_object_labels (:obj:`int`, `optional`, defaults to 1600):
This represents the total number of semantically unique objects that LXMERT will be able to classify a
pooled-object feature as belonging to.
num_attr_labels (:obj:`int`, `optional`, defaults to 400):
This represents the total number of semantically unique attributes that LXMERT will be able to classify a
pooled-object feature as possessing.
task_matched (:obj:`bool`, `optional`, defaults to :obj:`True`):
This task is used for sentence-image matching. If the sentence correctly describes the image the label
will be 1. If the sentence does not correctly describe the image, the label will be 0.
task_mask_lm (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to add masked language modeling (as used in pretraining models such as BERT) to the loss
objective.
task_obj_predict (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to add object prediction, attribute prediction and feature regression to the loss
objective.
task_qa (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to add the question-answering loss to the objective.
visual_obj_loss (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to calculate the object-prediction loss objective.
visual_attr_loss (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to calculate the attribute-prediction loss objective.
visual_feat_loss (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to calculate the feature-regression loss objective.
output_attentions (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the model should return the attentions from the vision, language, and cross-modality
layers.
output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the model should return the hidden states from the vision, language, and cross-modality
layers.
"""
model_type = "lxmert"
......
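A minimal sketch for the LXMERT configuration (assuming the library as of this commit; defaults only, random weights)::

    >>> from transformers import LxmertConfig, LxmertModel
    >>> configuration = LxmertConfig()
    >>> model = LxmertModel(configuration)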
......@@ -22,15 +22,16 @@ logger = logging.get_logger(__name__)
class MMBTConfig(object):
"""Configuration class to store the configuration of a `MMBT Model`.
"""
This is the configuration class to store the configuration of a :class:`~transformers.MMBTModel`. It is used to
instantiate a MMBT model according to the specified arguments, defining the model architecture.
Args:
config (:obj:`~transformers.PreTrainedConfig`):
Config of the underlying Transformer models. Its values are
copied over to use a single config.
num_labels (:obj:`int` or :obj:`None`, optional, defaults to `None`):
config (:class:`~transformers.PreTrainedConfig`):
Config of the underlying Transformer models. Its values are copied over to use a single config.
num_labels (:obj:`int`, `optional`):
Size of final Linear layer for classification.
modal_hidden_size (:obj:`int`, optional, defautls to 2048):
modal_hidden_size (:obj:`int`, `optional`, defaults to 2048):
Embedding dimension of the non-text modality encoder.
"""
......
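Because :class:`~transformers.MMBTConfig` wraps the configuration of an underlying text Transformer rather than defining one itself, a minimal sketch looks a little different (assuming the library as of this commit; building a full MMBT model additionally requires the text transformer and a modal encoder module, which are omitted here)::

    >>> from transformers import BertConfig, MMBTConfig
    >>> # The wrapped config supplies most hyper-parameters; num_labels sizes the final classification layer
    >>> transformer_config = BertConfig()
    >>> config = MMBTConfig(transformer_config, num_labels=2)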
......@@ -25,9 +25,9 @@ MOBILEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class MobileBertConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.MobileBertModel`.
It is used to instantiate a MobileBERT model according to the specified arguments, defining the model
architecture.
This is the configuration class to store the configuration of a :class:`~transformers.MobileBertModel` or a
:class:`~transformers.TFMobileBertModel`. It is used to instantiate a MobileBERT model according to the specified
arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
......@@ -35,54 +35,56 @@ class MobileBertConfig(PretrainedConfig):
Args:
vocab_size (:obj:`int`, optional, defaults to 30522):
Vocabulary size of the MobileBERT model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.MobileBertModel`.
hidden_size (:obj:`int`, optional, defaults to 512):
vocab_size (:obj:`int`, `optional`, defaults to 30522):
Vocabulary size of the MobileBERT model. Defines the number of different tokens that can be represented by
the :obj:`inputs_ids` passed when calling :class:`~transformers.MobileBertModel` or
:class:`~transformers.TFMobileBertModel`.
hidden_size (:obj:`int`, `optional`, defaults to 512):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (:obj:`int`, optional, defaults to 24):
num_hidden_layers (:obj:`int`, `optional`, defaults to 24):
Number of hidden layers in the Transformer encoder.
num_attention_heads (:obj:`int`, optional, defaults to 4):
num_attention_heads (:obj:`int`, `optional`, defaults to 4):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (:obj:`int`, optional, defaults to 512):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "relu"):
intermediate_size (:obj:`int`, `optional`, defaults to 512):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"relu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0.0):
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.0):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (:obj:`int`, optional, defaults to 512):
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (:obj:`int`, optional, defaults to 2):
The vocabulary size of the `token_type_ids` passed into :class:`~transformers.MobileBertModel`.
initializer_range (:obj:`float`, optional, defaults to 0.02):
type_vocab_size (:obj:`int`, `optional`, defaults to 2):
The vocabulary size of the :obj:`token_type_ids` passed when calling :class:`~transformers.MobileBertModel`
or :class:`~transformers.TFMobileBertModel`.
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
pad_token_id (:obj:`int`, optional, defaults to 0):
pad_token_id (:obj:`int`, `optional`, defaults to 0):
The ID of the token in the word embedding to use as padding.
embedding_size (:obj:`int`, optional, defaults to 128):
embedding_size (:obj:`int`, `optional`, defaults to 128):
The dimension of the word embedding vectors.
trigram_input (:obj:`bool`, optional, defaults to :obj:`True`):
trigram_input (:obj:`bool`, `optional`, defaults to :obj:`True`):
Use a convolution of trigram as input.
use_bottleneck (:obj:`bool`, optional, defaults to :obj:`True`):
use_bottleneck (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to use bottleneck in BERT.
intra_bottleneck_size (:obj:`int`, optional, defaults to 128):
intra_bottleneck_size (:obj:`int`, `optional`, defaults to 128):
Size of bottleneck layer output.
use_bottleneck_attention (:obj:`bool`, optional, defaults to :obj:`False`):
use_bottleneck_attention (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to use attention inputs from the bottleneck transformation.
key_query_shared_bottleneck (:obj:`bool`, optional, defaults to :obj:`True`):
key_query_shared_bottleneck (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to use the same linear transformation for query&key in the bottleneck.
num_feedforward_networks (:obj:`int`, optional, defaults to 4):
num_feedforward_networks (:obj:`int`, `optional`, defaults to 4):
Number of FFNs in a block.
normalization_type (:obj:`str`, optional, defaults to "no_norm"):
The normalization type in BERT.
normalization_type (:obj:`str`, `optional`, defaults to :obj:`"no_norm"`):
The normalization type in MobileBERT.
Example:
Examples:
>>> from transformers import MobileBertModel, MobileBertConfig
......
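To complement the truncated example above, a minimal sketch for the MobileBERT configuration (assuming the library as of this commit; defaults only, random weights)::

    >>> from transformers import MobileBertConfig, MobileBertModel
    >>> configuration = MobileBertConfig()
    >>> model = MobileBertModel(configuration)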
......@@ -28,73 +28,79 @@ OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class OpenAIGPTConfig(PretrainedConfig):
"""
This is the configuration class to store the configuration of a :class:`~transformers.OpenAIGPTModel`.
It is used to instantiate an GPT model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
the `GPT <https://huggingface.co/openai-gpt>`__ architecture from OpenAI.
This is the configuration class to store the configuration of a :class:`~transformers.OpenAIGPTModel` or a
:class:`~transformers.TFOpenAIGPTModel`. It is used to instantiate a GPT model according to the specified
arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar
configuration to that of the `GPT <https://huggingface.co/openai-gpt>`__ architecture from OpenAI.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
vocab_size (:obj:`int`, optional, defaults to 40478):
Vocabulary size of the GPT model. Defines the different tokens that
can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.CTRLModel`.
n_positions (:obj:`int`, optional, defaults to 512):
vocab_size (:obj:`int`, `optional`, defaults to 40478):
Vocabulary size of the GPT model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.OpenAIGPTModel` or
:class:`~transformers.TFOpenAIGPTModel`.
n_positions (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
n_ctx (:obj:`int`, optional, defaults to 512):
n_ctx (:obj:`int`, `optional`, defaults to 512):
Dimensionality of the causal mask (usually same as n_positions).
n_embd (:obj:`int`, optional, defaults to 768):
n_embd (:obj:`int`, `optional`, defaults to 768):
Dimensionality of the embeddings and hidden states.
n_layer (:obj:`int`, optional, defaults to 12):
n_layer (:obj:`int`, `optional`, defaults to 12):
Number of hidden layers in the Transformer encoder.
n_head (:obj:`int`, optional, defaults to 12):
n_head (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
afn (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
afn (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler.
If string, "gelu", "relu", "swish" and "gelu_new" are supported.
resid_pdrop (:obj:`float`, optional, defaults to 0.1):
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
resid_pdrop (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
embd_pdrop (:obj:`int`, optional, defaults to 0.1):
embd_pdrop (:obj:`int`, `optional`, defaults to 0.1):
The dropout ratio for the embeddings.
attn_pdrop (:obj:`float`, optional, defaults to 0.1):
attn_pdrop (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention.
layer_norm_epsilon (:obj:`float`, optional, defaults to 1e-5):
layer_norm_epsilon (:obj:`float`, `optional`, defaults to 1e-5):
The epsilon to use in the layer normalization layers
initializer_range (:obj:`float`, optional, defaults to 0.02):
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
predict_special_tokens (:obj:`boolean`, optional, defaults to :obj:`True`):
Whether special tokens should be predicted when the model is has a language modeling head.
summary_type (:obj:`string`, optional, defaults to "cls_index"):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.OpenAIGPTDoubleHeadsModel`.
Is one of the following options:
- 'last' => take the last token hidden state (like XLNet)
- 'first' => take the first token hidden state (like Bert)
- 'mean' => take the mean of all tokens hidden states
- 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2)
- 'attn' => Not implemented now, use multi-head attention
summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.OpenAIGPTDoubleHeadsModel`.
Add a projection after the vector extraction
summary_activation (:obj:`string` or :obj:`None`, optional):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.OpenAIGPTDoubleHeadsModel`.
'tanh' => add a tanh activation to the output, Other => no activation.
summary_proj_to_labels (:obj:`boolean`, optional, defaults to :obj:`True`):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.OpenAIGPTDoubleHeadsModel`.
If True, the projection outputs to config.num_labels classes (otherwise to hidden_size). Default: False.
summary_first_dropout (:obj:`float`, optional, defaults to 0.1):
Argument used when doing sequence summary. Used in for the multiple choice head in
:class:`~transformers.OpenAIGPTDoubleHeadsModel`.
Add a dropout before the projection and activation
Example::
predict_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not special tokens should be predicted when the model has a language modeling head.
summary_type (:obj:`str`, `optional`, defaults to :obj:`"cls_index"`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.
Has to be one of the following options:
- :obj:`"last"`: Take the last token hidden state (like XLNet).
- :obj:`"first"`: Take the first token hidden state (like BERT).
- :obj:`"mean"`: Take the mean of all tokens hidden states.
- :obj:`"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2).
- :obj:`"attn"`: Not implemented now, use multi-head attention.
summary_use_proj (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.
Whether or not to add a projection after the vector extraction.
summary_activation (:obj:`str`, `optional`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.
Pass :obj:`"tanh"` for a tanh activation to the output, any other value will result in no activation.
summary_proj_to_labels (:obj:`bool`, `optional`, defaults to :obj:`True`):
Argument used when doing sequence summary, used in the models
:class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.
Whether the projection outputs should have :obj:`config.num_labels` or :obj:`config.hidden_size` classes.
summary_first_dropout (:obj:`float`, `optional`, defaults to 0.1):
Argument used when doing sequence summary, used in the models
:class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.TFOpenAIGPTDoubleHeadsModel`.
The dropout ratio to be used after the projection and activation.
Examples::
>>> from transformers import OpenAIGPTConfig, OpenAIGPTModel
......
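A minimal illustrative sketch, assuming the standard keyword arguments of :class:`~transformers.OpenAIGPTConfig`, of how the sequence-summary options documented above can be combined for the multiple choice head::

>>> from transformers import OpenAIGPTConfig, OpenAIGPTDoubleHeadsModel

>>> # Pool the hidden state at the classification token position, apply dropout,
>>> # then project it through a linear layer (no activation).
>>> config = OpenAIGPTConfig(
...     summary_type="cls_index",
...     summary_use_proj=True,
...     summary_activation=None,       # any value other than "tanh" means no activation
...     summary_proj_to_labels=True,   # project to config.num_labels outputs
...     summary_first_dropout=0.1,
...     num_labels=1,                  # assumption for illustration: one score per choice
... )

>>> # Initializing a double-heads model (with random weights) from this configuration
>>> model = OpenAIGPTDoubleHeadsModel(config)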
......
@@ -29,96 +29,120 @@ REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
class ReformerConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.ReformerModel`.
It is used to instantiate a Reformer model according to the specified arguments, defining the model
architecture.
This is the configuration class to store the configuration of a :class:`~transformers.ReformerModel`. It is used to
instantiate a Reformer model according to the specified arguments, defining the model architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used
to control the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig`
for more information.
Args:
attention_head_size (:obj:`int`, optional, defaults to 64):
attention_head_size (:obj:`int`, `optional`, defaults to 64):
Dimensionality of the projected key, query and value vectors.
attn_layers (:obj:`list(str)`, optional, defaults to ["local", "lsh", "local", "lsh", "local", "lsh"]):
attn_layers (:obj:`List[str]`, `optional`, defaults to :obj:`["local", "lsh", "local", "lsh", "local", "lsh"]`):
List of attention layer types in ascending order. It can be chosen between a
LSHSelfAttention layer ("lsh") and a LocalSelfAttention layer ("local").
For more information on LSHSelfAttention layer, see `LSH Self Attention <reformer.html#lsh-self-attention>`__ .
For more information on LocalSelfAttention layer, see `Local Self Attention <reformer.html#local-sensitive-hashing-self-attention>`__ .
axial_pos_embds (:obj:`bool`, optional, defaults to :obj:`True`):
If `True`, use axial position embeddings. For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.
axial_norm_std (:obj:`float`, optional, defaults to 1.0):
The standard deviation of the normal_initializer for initializing the weight matrices of the axial positional encodings.
axial_pos_shape (:obj:`list(int)`, optional, defaults to `[64, 64]`):
The position dims of the axial position encodings.
During training the product of the position dims has to equal the sequence length.
For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.
axial_pos_embds_dim (:obj:`list(int)`, optional, defaults to `[64, 192]`):
The embedding dims of the axial position encodings.
The sum of the embedding dims has to equal the hidden size.
For more information on how axial position embeddings work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.
chunk_size_lm_head (:obj:`int`, optional, defaults to 0):
LSHSelfAttention layer (:obj:`"lsh"`) and a LocalSelfAttention layer (:obj:`"local"`).
For more information on LSHSelfAttention layer, see `LSH Self Attention
<reformer.html#lsh-self-attention>`__. For more information on LocalSelfAttention layer, see `Local Self
Attention <reformer.html#local-sensitive-hashing-self-attention>`__.
axial_pos_embds (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to use axial position embeddings. For more information on how axial position embeddings
work, see `Axial Position Encodings <reformer.html#axial-positional-encodings>`__.
axial_norm_std (:obj:`float`, `optional`, defaults to 1.0):
The standard deviation of the normal_initializer for initializing the weight matrices of the axial
positional encodings.
axial_pos_shape (:obj:`List[int]`, `optional`, defaults to :obj:`[64, 64]`):
The position dims of the axial position encodings. During training the product of the position dims has to
be equal to the sequence length.
For more information on how axial position embeddings work, see `Axial Position Encodings
<reformer.html#axial-positional-encodings>`__.
axial_pos_embds_dim (:obj:`List[int]`, `optional`, defaults to :obj:`[64, 192]`):
The embedding dims of the axial position encodings. The sum of the embedding dims has to be equal to the
hidden size.
For more information on how axial position embeddings work, see `Axial Position Encodings
<reformer.html#axial-positional-encodings>`__.
chunk_size_lm_head (:obj:`int`, `optional`, defaults to 0):
The chunk size of the final language model feed forward head layer.
A chunk size of 0 means that the feed forward layer is not chunked.
A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time.
For more information on feed forward chunking, see `How does Feed Forward Chunking work? <../glossary.html#feed-forward-chunking>`__ .
eos_token_id (:obj:`int`, optional, defaults to 2):
The token id for the <EOS> token.
feed_forward_size (:obj:`int`, optional, defaults to 512):
Dimensionality of the "feed_forward" (i.e., feed-forward) layer in the residual attention block.
hash_seed (:obj:`int`, optional, defaults to `None`):
Seed that can be used to make local sensitive hashing in LSHSelfAttention deterministic. This should only be set for testing purposes. For evaluation and training purposes `hash_seed` should be set to `None` to ensure fully random rotations in the local sensitive hashing scheme.
hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "relu"):
The non-linear activation function (function or string) in the feed forward layer in the residual attention block.
If string, "gelu", "relu", "swish", "gelu_new" and "gelu_fast" are supported.
hidden_dropout_prob (:obj:`float`, optional, defaults to 0.05):
For more information on feed forward chunking, see `How does Feed Forward Chunking work?
<../glossary.html#feed-forward-chunking>`__.
eos_token_id (:obj:`int`, `optional`, defaults to 2):
The token id for the end-of-sentence token.
feed_forward_size (:obj:`int`, `optional`, defaults to 512):
Dimensionality of the feed_forward layer in the residual attention block.
hash_seed (:obj:`int`, `optional`):
Seed that can be used to make local sensitive hashing in :obj:`LSHSelfAttention` deterministic. This should
only be set for testing purposes. For evaluation and training purposes :obj:`hash_seed` should be left as
:obj:`None` to ensure fully random rotations in the local sensitive hashing scheme.
hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"relu"`):
The non-linear activation function (function or string) in the feed forward layer in the residual attention
block.
If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported.
hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.05):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
hidden_size (:obj:`int`, optional, defaults to 256):
hidden_size (:obj:`int`, `optional`, defaults to 256):
Dimensionality of the output hidden states of the residual attention blocks.
initializer_range (:obj:`float`, optional, defaults to 0.02):
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
is_decoder (:obj:`bool`, optional, defaults to :obj:`False`):
If `is_decoder` is True, a causal mask is used in addition to `attention_mask`.
When using the Reformer for causal language modeling, `is_decoder` is set to `True`.
layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
is_decoder (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to use a causal mask in addition to the :obj:`attention_mask` passed to
:class:`~transformers.ReformerModel`. When using the Reformer for causal language modeling, this argument
should be set to :obj:`True`.
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
The epsilon used by the layer normalization layers.
local_chunk_length (:obj:`int`, optional, defaults to 64):
Length of chunk which attends to itself in LocalSelfAttention. Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).
local_num_chunks_before (:obj:`int`, optional, defaults to 1):
Number of previous neighbouring chunks to attend to in LocalSelfAttention layer in addition to itself.
local_num_chunks_after (:obj:`int`, optional, defaults to 0):
Number of following neighbouring chunks to attend to in LocalSelfAttention layer in addition to itself.
local_attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
The dropout ratio for the attention probabilities in LocalSelfAttention.
lsh_attn_chunk_length (:obj:`int`, optional, defaults to 64):
Length of chunk which attends to itself in LSHSelfAttention. Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).
lsh_num_chunks_before (:obj:`int`, optional, defaults to 1):
Number of previous neighbouring chunks to attend to in LSHSelfAttention layer in addition to itself.
lsh_num_chunks_after (:obj:`int`, optional, defaults to 0):
Number of following neighbouring chunks to attend to in LSHSelfAttention layer in addition to itself.
lsh_attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
The dropout ratio for the attention probabilities in LSHSelfAttention.
max_position_embeddings (:obj:`int`, optional, defaults to 4096):
local_chunk_length (:obj:`int`, `optional`, defaults to 64):
Length of chunk which attends to itself in :obj:`LocalSelfAttention`. Chunking reduces memory complexity
from sequence length x sequence length (self attention) to
chunk length x chunk length x sequence length / chunk length (chunked self attention).
local_num_chunks_before (:obj:`int`, `optional`, defaults to 1):
Number of previous neighbouring chunks to attend to in :obj:`LocalSelfAttention` layer in addition to itself.
local_num_chunks_after (:obj:`int`, `optional`, defaults to 0):
Number of following neighbouring chunks to attend to in :obj:`LocalSelfAttention` layer in addition to
itself.
local_attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities in :obj:`LocalSelfAttention`.
lsh_attn_chunk_length (:obj:`int`, `optional`, defaults to 64):
Length of chunk which attends to itself in :obj:`LSHSelfAttention`. Chunking reduces memory complexity from
sequence length x sequence length (self attention) to
chunk length x chunk length x sequence length / chunk length (chunked self attention).
lsh_num_chunks_before (:obj:`int`, `optional`, defaults to 1):
Number of previous neighbouring chunks to attend to in :obj:`LSHSelfAttention` layer in addition to itself.
lsh_num_chunks_after (:obj:`int`, `optional`, defaults to 0):
Number of following neighbouring chunks to attend to in :obj:`LSHSelfAttention` layer in addition to itself.
lsh_attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1):
The dropout ratio for the attention probabilities in :obj:`LSHSelfAttention`.
max_position_embeddings (:obj:`int`, `optional`, defaults to 4096):
The maximum sequence length that this model might ever be used with.
Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
num_attention_heads (:obj:`int`, optional, defaults to 12):
num_attention_heads (:obj:`int`, `optional`, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
num_buckets (:obj:`int` or :obj:`list(int)`, optional, defaults to `None`):
Number of buckets, the key query vectors can be "hashed into" using the locality sensitive hashing scheme. Each query key vector is hashed into a hash in `1, ..., num_buckets`.
The number of buckets can also be factorized into a list for improved memory complexity. In this case, each query key vector is hashed into a hash in `1-1, 1-2, ..., num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1]` if `num_buckets` is factorized into two factors.
The number of buckets (or the product of the factors) should approximately equal sequence length / lsh_chunk_length. If `num_buckets` is set to `None`, a good value for `num_buckets` is calculated on the fly.
num_hashes (:obj:`int`, optional, defaults to 1):
Number of hashing rounds (e.g. number of random rotations) in Local Sensitive Hashing scheme.
The higher `num_hashes`, the more accurate the `LSHSelfAttention` becomes, but also the more memory and time intensive the hashing becomes.
pad_token_id (:obj:`int`, optional, defaults to 0):
The token id for the <PAD> token.
vocab_size (:obj:`int`, optional, defaults to 320):
Vocabulary size of the Reformer model. Defines the number of different tokens that
can be represented by the `input_ids` passed to the forward method of :class:`~transformers.ReformerModel`.
num_buckets (:obj:`int` or :obj:`List[int]`, `optional`):
Number of buckets the key query vectors can be "hashed into" using the locality sensitive hashing scheme.
Each query key vector is hashed into a hash in :obj:`1, ..., num_buckets`.
The number of buckets can also be factorized into a list for improved memory complexity. In this case, each
query key vector is hashed into a hash in
:obj:`1-1, 1-2, ..., num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1]` if :obj:`num_buckets` is
factorized into two factors.
The number of buckets (or the product of the factors) should approximately equal
sequence length / lsh_chunk_length. If :obj:`num_buckets` is not set, a good value is calculated on the fly.
num_hashes (:obj:`int`, `optional`, defaults to 1):
Number of hashing rounds (e.g., number of random rotations) in Local Sensitive Hashing scheme.
The higher :obj:`num_hashes`, the more accurate the :obj:`LSHSelfAttention` becomes, but also the more
memory and time intensive the hashing becomes.
pad_token_id (:obj:`int`, `optional`, defaults to 0):
The token id for the padding token.
vocab_size (:obj:`int`, `optional`, defaults to 320):
Vocabulary size of the Reformer model. Defines the number of different tokens that can be represented by the
:obj:`input_ids` passed when calling :class:`~transformers.ReformerModel`.
tie_word_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to tie input and output embeddings.
Example::
Examples::
>>> from transformers import ReformerModel, ReformerConfig
......
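A minimal illustrative sketch, assuming the standard keyword arguments of :class:`~transformers.ReformerConfig`, of how the constraints documented above fit together::

>>> from transformers import ReformerConfig, ReformerModel

>>> # axial_pos_shape: the product of the dims (64 * 64 = 4096) has to equal the training sequence length.
>>> # axial_pos_embds_dim: the sum of the dims (64 + 192 = 256) has to equal hidden_size.
>>> # num_buckets: roughly sequence length / lsh_attn_chunk_length = 4096 / 64 = 64.
>>> config = ReformerConfig(
...     hidden_size=256,
...     max_position_embeddings=4096,
...     axial_pos_embds=True,
...     axial_pos_shape=[64, 64],
...     axial_pos_embds_dim=[64, 192],
...     attn_layers=["local", "lsh", "local", "lsh", "local", "lsh"],
...     lsh_attn_chunk_length=64,
...     num_buckets=64,
... )

>>> # Initializing a model (with random weights) from this configuration
>>> model = ReformerModel(config)

Passing a factorized :obj:`num_buckets` such as :obj:`[8, 8]` instead of a single integer follows the memory-saving factorization described above (the product of the factors still equals 64).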